SQL and NoSQL
Nov 30, 2010
Guyub is an IT company based in Palembang with a focus on F/OSS.
Nov 14, 2010
I talk with a lot of people about NoSQL. I've been following it pretty closely for about 3 years now. One of the things I often tell people who are trying to wrap their heads around the concepts of NoSQL and what all of the data stores mean is to go search for Justin Sheehy and his NoSQL East talk. Then it dawned on me: why don't I just embed the video on my site? That makes it super easy for people to find (travisswicegood.com/tags/riak) and I don't have to go searching for it. So, here's what I consider one of the best talks to date on NoSQL.
Nov 05, 2010
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
While this works, there are two scalability limitations. The first is the network overhead of the round trip for each SQL statement; the second is the synchronous nature of the calls, that is, your code cannot continue until each INSERT has completed successfully.
The first improvement is to use MySQL’s multi-values capability with INSERT. That is, you can insert multiple rows with a single INSERT statement. For example:
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?);
Depending on how you collect the information to be inserted, you may be able to implement this easily. The benefit, as previously described in "The value of multi insert values", is an 88% improvement in performance.
One system variable to be aware of is max_allowed_packet. You may have to adjust this for larger INSERT statements.
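As a rough illustration of how such a statement might be assembled, here is a minimal bash sketch. The function name build_multi_insert and its argument order are made up for this example; it only prints the statement text and does not talk to MySQL.

```shell
#!/usr/bin/env bash
# build_multi_insert ROWS TABLE COLUMN...
# Hypothetical helper (not part of MySQL or any client library).
# Prints a single INSERT statement with ROWS groups of "?" placeholders.
build_multi_insert() {
    local rows=$1 table=$2
    shift 2
    local columns="" group="" values="" col i

    # Build the column list and one matching placeholder group, e.g. "?, ?, ?".
    for col in "$@"; do
        columns="${columns:+${columns}, }${col}"
        group="${group:+${group}, }?"
    done

    # Repeat the placeholder group once per row, separated by commas.
    for (( i = 0; i < rows; i++ )); do
        values="${values:+${values}, }(${group})"
    done

    printf 'INSERT INTO %s (%s) VALUES %s;\n' "${table}" "${columns}" "${values}"
}

build_multi_insert 3 table col1 col2 col3
```

The statement text grows with every row added, which is why max_allowed_packet becomes relevant for large batches.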
Next is the ability to INSERT data based on information already in another table. You can also use another storage engine, for example MEMORY, to batch up data that is then inserted via this approach.
INSERT INTO table (col1, col2, col3) SELECT col1, col2, col3 FROM anothertable;
The third option is to batch load your data from a flat file. Depending on how you source the information you are inserting, that may also be a significant improvement in throughput in bulk inserting data.
LOAD DATA [CONCURRENT] INFILE 'file'
INTO TABLE table (col1, col2, col3);
On a closing note, the choice of storage engine can also have a significant effect on INSERT throughput. MySQL also offers other non-ANSI options including DELAYED, IGNORE and LOW_PRIORITY. These should definitely be avoided.
Nov 01, 2010
Open source has gone a long way toward putting power back in the hands of developers, who can download, install and deploy software without having to go through any kind of convoluted sales or budget approval process. You want MySQL? You can download and install in 15 minutes, and you don't have to talk to anyone to do it.
Software as a service (SaaS) takes this to an even broader audience, enabling employees to get the kind of lightweight, consumer, self-serve capabilities in their job without even having to run their own servers. Platforms like Amazon AWS, Heroku, Makara, RightScale and others put this same kind of SaaS power in the hands of developers…
My view: ease of use trumps a long feature list any day of the week. There are technological as well as sociological and economic reasons why organizations are seeking greater simplicity. Part of this stems from the fact that complex enterprise applications grew beyond the ability of most organizations to successfully adopt.
Head over to GigaOm for the full post.
Oct 26, 2010
MASTER=()
CURRENT=()
FIRST=1
for SERVER in ${SERVERS}; do
    # collect all builds from server and populate CURRENT list
    COMMAND="${LS} -1fd ${WEBROOT}/${SITE}.*"
    BUILDS=`${SSH} ${SSHOPTS} root@${SERVER} "${COMMAND}"`
    for BUILD in ${BUILDS}; do
        CURRENT=( ${CURRENT[@]-} ${BUILD} )
    done
    # if this is our first time around, copy CURRENT to MASTER
    if [ ${FIRST} -eq 1 ]; then
        MASTER=( ${CURRENT[@]} )
        FIRST=0
    fi
    # now we do a compare between MASTER and CURRENT to see what builds
    # are common
    INTERSECT=()
    for ENTRY in ${CURRENT[@]}; do
        in_array "${ENTRY}" "${MASTER[@]}"
        RET=$?
        if [ "${RET}" -eq 0 ]; then
            INTERSECT=( ${INTERSECT[@]-} ${ENTRY} )
        fi
    done
    MASTER=( ${INTERSECT[@]} )
    # clear the CURRENT array
    CURRENT=()
done
Let me take a moment to explain the code above:
In order to check for array intersection, you need an in_array() function
The first argument is the “needle” and the second is the “haystack”
We verify that both parameters were passed
We simply loop through the haystack checking for the needle
If we find it, return success. Otherwise, eventually return false
We need to loop through each server eventually, but we’ll start with the first one
Run an SSH command to get a listing of builds
Populate an array ($CURRENT) with the builds that were found
Since the first server has no previous server to compare with, we just copy it to $MASTER
We then loop to the 2nd server, and put the result of getting builds into $CURRENT
Now that we have the first server’s builds in $MASTER, we perform an intersect with $CURRENT
We realize the need for an $INTERSECT array to hold the intersections found above
$INTERSECT becomes $MASTER since it only contains similar builds from the 1st and 2nd server
Looping to the 3rd server, we get the builds and put them in $CURRENT
Since $MASTER contains only the similar builds thus far, we again compare it with $CURRENT
The intersect can now be used to compare against builds on the 4th server, and so on
Once you finish looping through all servers, your $MASTER should contain only similar builds
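The listing calls in_array() but the function itself does not appear in it. A minimal sketch of what it could look like, assuming only the behaviour described in the steps above (needle first, haystack after it, both required, success on a match); this is a reconstruction for illustration, not the original author's code:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of in_array; the original function is not shown.
# Usage: in_array NEEDLE HAYSTACK_ELEMENT...
# Returns 0 if NEEDLE equals one of the haystack elements, 1 otherwise.
in_array() {
    # Verify that a needle and at least one haystack element were passed.
    if [ "$#" -lt 2 ]; then
        return 1
    fi
    local needle=$1
    shift
    local element
    # Loop through the haystack checking for the needle.
    for element in "$@"; do
        if [ "${element}" = "${needle}" ]; then
            return 0
        fi
    done
    # Needle was not found anywhere in the haystack.
    return 1
}

# Example build names are made up for this sketch.
BUILDS=( site.100 site.101 site.102 )
if in_array "site.101" "${BUILDS[@]}"; then
    echo "found"
fi
```

With this shape, an empty haystack simply reports "not found", so a server with no builds in common empties the running intersection rather than causing an error.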
There are a few guides out there which show you how to do this via
forking, but I thought someone may appreciate the elegance of using 100%
bash to accomplish this. I hope this helps someone else out there!
Oct 25, 2010
select * from a where a.id in (select id from b);
Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT …) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It’s only when there is a SELECT inside it that it works poorly.)
The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it’s not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn’t been used widely.
To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn’t that much of a limitation. But it would be a big improvement if it were fixed.
See if you can guess what limitation number 4 will be!
Entry posted by Baron Schwartz
Oct 04, 2010
mysql> show create table logs \G
Create Table: CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(20) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`origin` varchar(64) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`message` text NOT NULL,
`level` tinyint(11) NOT NULL DEFAULT '0',
`s` char(16) CHARACTER SET ascii COLLATE ascii_bin NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `s` (`s`),
KEY `name` (`name`,`ts`),
KEY `origin` (`origin`,`ts`)
) ENGINE=InnoDB AUTO_INCREMENT=186878729 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
The table held log records from 2010-08-23 through noon on 2010-09-02. Table status:
mysql> show table status like 'logs'\G
*************************** 1. row ***************************
Name: logs
Engine: InnoDB
Version: 10
Row_format: Compressed
Rows: 22433048
Avg_row_length: 206
Data_length: 4625285120
Max_data_length: 0
Index_length: 1437073408
Data_free: 4194304
Auto_increment: 186878920
Create_time: 2010-08-24 18:10:49
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options: row_format=COMPRESSED KEY_BLOCK_SIZE=8
Comment:
(I was a bit puzzled by the Create_time: the table was taken from an LVM snapshot of another server, so it had existed for a very long time before. I am not sure why the Create_time field reads as it does here; I assume the MySQL upgrade set it, but I had neither the time nor the need to look into it.)
I was using Percona-Server-5.1.47-11.2, and so was able to look at the index statistics for that table:
mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name='logs';
+--------------+------------+------------+--------+----------------+------------+------------+
| table_schema | table_name | index_name | fields | row_per_keys   | index_size | leaf_pages |
+--------------+------------+------------+--------+----------------+------------+------------+
| newsminer    | logs       | PRIMARY    |      1 | 1              |     282305 |     246856 |
| newsminer    | logs       | s          |      2 | 17, 1          |      38944 |      33923 |
| newsminer    | logs       | name       |      3 | 2492739, 10, 2 |      22432 |      19551 |
| newsminer    | logs       | origin     |      3 | 1303, 4, 1     |      26336 |      22931 |
+--------------+------------+------------+--------+----------------+------------+------------+
Status after massive purge
My first requirement was to purge all records up to 2010-09-01 00:00:00. I did so in small chunks, using openark kit's oak-chunk-update (the same can be achieved with maatkit's mk-archiver). The process purged 1000 rows at a time, with some sleep in between, and ran for a couple of hours. It may be interesting to note that since ts holds monotonically ascending values, purging old rows also means purging the lower PKs, which means we are trimming the PK tree from the left.
Even while purging took place, I could see the index_size/leaf_pages values dropping, until, finally:
mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name='logs';
+--------------+------------+------------+--------+--------------+------------+------------+
| table_schema | table_name | index_name | fields | row_per_keys | index_size | leaf_pages |
+--------------+------------+------------+--------+--------------+------------+------------+
| newsminer    | logs       | PRIMARY    |      1 | 1            |      40961 |      35262 |
| newsminer    | logs       | s          |      2 | 26, 1        |      34440 |       3798 |
| newsminer    | logs       | name       |      3 | 341011, 4, 1 |       4738 |       2774 |
| newsminer    | logs       | origin     |      3 | 341011, 4, 2 |      10178 |       3281 |
+--------------+------------+------------+--------+--------------+------------+------------+
The number of deleted rows was roughly 85% of the total, leaving about 15% of the rows.
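As a quick sanity check, the PRIMARY leaf_pages figures from the index listings before and after the purge can be compared against that 15%. A small sketch using bash integer arithmetic; the helper name percent_remaining is made up for this example, and the result is rounded down:

```shell
#!/usr/bin/env bash
# percent_remaining BEFORE AFTER
# Hypothetical helper: prints AFTER as a whole-number percentage of BEFORE.
# Bash arithmetic is integer-only, so the result is rounded down.
percent_remaining() {
    local before=$1 after=$2
    echo $(( after * 100 / before ))
}

# PRIMARY leaf_pages before the purge (246856) and after it (35262).
percent_remaining 246856 35262   # prints 14
```

Roughly 14% of the PK leaf pages remain, which is in line with about 15% of the rows remaining.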
Status after OPTIMIZE TABLE
Time to see whether OPTIMIZE really optimizes! Will it reduce the number of leaf pages in the PK? In the secondary keys?
mysql> OPTIMIZE TABLE logs;
…
mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name='logs';
+--------------+------------+------------+--------+--------------+------------+------------+
| table_schema | table_name | index_name | fields | row_per_keys | index_size | leaf_pages |
+--------------+------------+------------+--------+--------------+------------+------------+
| newsminer    | logs       | PRIMARY    |      1 | 1            |      40436 |      35323 |
| newsminer    | logs       | s          |      2 | 16, 1        |       5489 |       4784 |
| newsminer    | logs       | name       |      3 | 335813, 7, 1 |       3178 |       2749 |
| newsminer    | logs       | origin     |      3 | 335813, 5, 2 |       3951 |       3446 |
+--------------+------------+------------+--------+--------------+------------+------------+
4 rows in set (0.00 sec)
The above shows no significant change in any of the indexes: not for index_size, not for leaf_pages, not for statistics (row_per_keys). The OPTIMIZE did not reduce index size. It did not reduce the number of index pages (leaf_pages are the major factor here). Some leaf_pages values have even increased, but by a small enough margin to be considered equal.
Index-wise, the above example does not show an advantage to using OPTIMIZE. I confess, I was surprised. And for the better. This indicates that InnoDB does a good job of merging index pages after a massive purge.
So, no use for OPTIMIZE?
Think again: file system-wise, things look different.
Before purging of data:
bash:~# ls -l logs.* -h
-rw-r----- 1 mysql mysql 8.6K 2010-08-15 17:40 logs.frm
-rw-r----- 1 mysql mysql 2.9G 2010-09-02 14:01 logs.ibd
After purging of data:
bash:~# ls -l logs.* -h
-rw-r----- 1 mysql mysql 8.6K 2010-08-15 17:40 logs.frm
-rw-r----- 1 mysql mysql 2.9G 2010-09-02 14:21 logs.ibd
Recall that InnoDB never releases tablespace back to the file system!
After OPTIMIZE on table:
bash:~# ls -l logs.* -h
-rw-rw---- 1 mysql mysql 8.6K 2010-09-02 14:26 logs.frm
-rw-rw---- 1 mysql mysql 428M 2010-09-02 14:43 logs.ibd
With innodb_file_per_table, an OPTIMIZE creates a new tablespace and the old one gets destroyed. The space goes back to the file system. I don’t know about you; I like to have my file system with as much free space as possible.
Need to verify
I’ve tested Percona Server, since this is where I can find INNODB_INDEX_STATS. But this raises the following questions:
Perhaps the results only apply for Percona Server? (I’m guessing not).
Or only for InnoDB plugin? Does the same hold for “builtin” InnoDB? (dunno)
Only on >= 5.1? (Maybe; 5.0 is becoming rare now anyway)
Only on InnoDB (Well, of course this test is storage engine dependent!)
Conclusion
The use case above is a particular example. Other use cases may include tables where deletions often occur in the middle of the table (remember we were trimming the tree from the left side only). Others yet may need to handle UPDATEs to indexed columns. I have some more operations to do here, with larger tables (e.g. 40GB compressed). If anything changes, I’ll drop a note.
Sep 21, 2010