Pencarian

Rss Posts

 

 

 

Berita pada kategori ‘Basisdata’

SQL and NoSQL

Nov 30, 2010

Alaric Snell-Pym discusses why choose between SQL and NoSQL? Why can’t you use both in your infrastructure?”NoSQL engines abandon SQL for the chance to have more flexible data models and softer semantics for update operations – but they also abandon it because it?s a lot of work to implement. And, creating a new database from scratch, they?re keen on solving the interesting hard problems (such as replicated data storage), rather than following the well-trodden path of writing SQL parsers and query planners, with a few decades of catching up with the competition ahead of them.”

NoSQL and Riak – Travis Swicegood

Nov 14, 2010

I talk with a lot of people about NoSQL. I?ve been following it pretty closely for about 3 years now. One of the things I often tell people who are trying to wrap their heads around the concepts of NoSQL and what all of the data stores mean is to go search for Justin Sheehy and his NoSQL East Talk.

Then it dawned on me, why don?t I just embed the video on my site? Makes it super easy for people to find (travisswicegood.com/tags/riak) and I don?t have to go searching for it.

So, here?s what I consider one of the best talks to date on NoSQL.

Improving MySQL Insert thoughput

Nov 05, 2010

There are three simple practices that can improve general INSERT throughput. Each requires consideration on how the data is collected and what is acceptable data loss in a disaster.
General inserting of rows can be performed as single INSERT’s for example.

INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);

While this works, there are two scalability limitations. First is the network overhead of the back and forth of each SQL statement, the second is the synchronous nature, that is your code can not continue until your INSERT is successfully completed.
The first improvement is to use MySQL’s multi values capability with INSERT. That is you can insert multiple rows with a single INSERT statement. For example:

INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?), VALUES (?, ?, ?), (?, ?, ?);

Depending on how you collect the information to be inserted, you may be able to easily implement this. The benefit, as previously mentioned in The value of multi insert values shows an 88% improvement in performance.
One system variable to be aware of is max_allowed_packet. You may have to adjust this for larger INSERT statements.
Next is the ability to INSERT data based on information already in another table. You can also leverage for example another storage engine like MEMORY to batch up data to be inserted via this approach.

INSERT INTO table (col1, col2, col3) SELECT col1, col2, col3 FROM anothertable

The third option is to batch load your data from a flat file. Depending on how you source the information you are inserting, that may also be a significant improvement in throughput in bulk inserting data.

LOAD DATA [CONCURRENT] INFILE ‘file’
INTO TABLE (col1, col2, col3)

On a closing note, the choice of storage engine can also have a significant effect on INSERT throughput. MySQL also offers other non ANSI options including DELAYED, IGNORE and LOW_PRIORITY. These should definitely be avoided.

Cloud, SaaS and The Consumerization of IT

Nov 01, 2010


I wrote a guest column for GigaOm on how open source software, cloud and software as a service are helping to bring about the consumerization of IT: namely bringing simplicity where complexity reigned. ?I cited some examples including New Relic, Box.net and Apple.

Open source has gone a long way toward putting power back in the hands of developers, who can download, install and deploy software without having to go through any kind of?convoluted sales or budget approval process. ?You want?MySQL? ?You can download and install?in?15 minutes, and you don?t have to talk to anyone to do it.
Software as a service (SaaS) takes this to an even broader audience, enabling employees to get the kind of lightweight, consumer, self-serve capabilities in their job without even having to run their own servers. ?Platforms like?Amazon AWS, Heroku, Makara, RightScale?and others put this same kind of SaaS power in the hands of developers…
My view: ease of use trumps a long feature list any day of the week. There are both techological reasons as well as sociological and economic reasons for why organizations are seeking greater simplicity. ?Part of this stems from the fact that complex enterprise applications grew beyond the ability of most organizations to successfully adopt. ?

Head over to GigaOm for the full post.

SHOW ENGINE INNODB STATUS truncation, innodb_truncated_status_writes

Oct 29, 2010

Another piece of good news for MySQL 5.5 – the output of SHOW ENGINE INNODB STATUS has now been increased from 64kB, to 1MB. For those running with systems that have thousands of running transactions, or large lock outputs, it should take quite a bit more to force truncation now.
We also added a new status variable to track when truncation happens as well – innodb_truncated_status_writes, so you can detect this should you have automated monitoring depending on this output.
Bug#56922 for details.

Bookmark It

Hide Sites

[Bash] Performing array intersection with Bash

Oct 26, 2010

I am currently working on a project to deploy new website builds to a
small number of servers. I needed something simple and reliable that could
be built in a very short period of time. I decided to whip something up in
bash with the intent of refining it in Python later.
As I began to write this code, I realized that it probably would have been
quicker to do it in Python from the start. I decided to stick with bash as
somewhat of an academic exercise. The vast majority of these deployment
scripts were trivial; check the code out of git, create a manifest, package
it up, spew it to the servers, etc, etc. The problem came during the last
step. We decided to use a symlink to point to the active build out of a
number of builds that could be available on the server at any given time.
Since all of our servers should be running the exact same version of the
build, it makes sense that I should only allow a user of my deployment
scripts to link a build which exists on all servers. But how do you
accomplish this in bash?
In most other languages, you have access to numerous array helping
functions that allow you to perform intersects, uniqs, and merges. My goal
was to do the same thing in bash without forking out to any external
binary. So how do you ensure that a particular thing exists on N number of
servers? Here it is:
function in_array() {
local x
ENTRY=$1
shift 1
ARRAY=( “$@” )
[ -z "${ARRAY}" ] && return 1
[ -z "${ENTRY}" ] && return 1
for x in ${ARRAY[@]}; do
[ "${x}" == "${ENTRY}" ] && return 0
done
return 1
}

MASTER=()
CURRENT=()
FIRST=1
for SERVER in ${SERVERS}; do
# collect all builds from server and populate CURRENT list
COMMAND=”${LS} -1fd ${WEBROOT}/${SITE}.*”
BUILDS=`${SSH} ${SSHOPTS} root@${SERVER} “${COMMAND}”`
for BUILD in ${BUILDS}; do
CURRENT=( ${CURRENT[@]-} ${BUILD} )
done

# if this is our first time around, copy CURRENT to MASTER
if [ ${FIRST} -eq 1 ]; then
MASTER=( ${CURRENT[@]} )
FIRST=0
fi

# now we do a compare between MASTER and CURRENT to see what builds
# are common
INTERSECT=()
for ENTRY in ${CURRENT[@]}; do
in_array “${ENTRY}” “${MASTER[@]}”
RET=$?
if [ "${RET}" -eq 0 ]; then
INTERSECT=( ${INTERSECT[@]-} ${ENTRY} )
fi
done
MASTER=( ${INTERSECT[@]} )

# clear the CURRENT array
CURRENT=()
done
Let me take a moment to explain the code above:

In order to check for array intersection, you need an in_array()
function

The first argument as the “needle” and the second is the
“haystack”
We verify that both parameters were passed
We simply loop through the haystack checking for the needle
If we find it, return success. Otherwise, eventually return
false

We need to loop through each server eventually, but we’ll start with
the first one

Run an SSH command to get a listing of builds
Populate an array ($CURRENT) with the builds that were found

Since the first server has no previous server to compare with, so we
just copy it to $MASTER

We then loop to the 2nd server, and put the result of getting builds
into $CURRENT

Now that we have the first server’s builds in $MASTER, we perform an
intersect with $CURRENT
We realize the need for an $INTERSECT array to hold the intersections
found above
$INTERSECT becomes $MASTER since it only contains similar builds from
the 1st and 2nd server

Looping to the 3rd server, we get the builds and put them in $CURRENT

Since $MASTER contains only the similar builds thus far, we again
compare it with $CURRENT
The intersect can now be used to compare against builds on the 4th
server, and so on

Once you finish looping through all servers, your $MASTER should
contain only similar builds

There are a few guides out there which show you how to do this via
forking, but I thought someone may appreciate the elegance of using 100%
bash to accomplish this. I hope this helps someone else out there!

MySQL Limitations Part 3: Subqueries

Oct 25, 2010

This is the third in a series on what?s seriously limiting MySQL in certain circumstances (links: part 1, 2). This post is about subqueries, which in some cases execute outside-in instead of inside-out as users expect.
It’s easy to pick on subqueries in MySQL, so I’ll try to be gentle. The following query will surprise users unpleasantly:

select * from a where a.id in (select id from b);

Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT …) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It’s only when there is a SELECT inside it that it works poorly.)
The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it’s not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn’t been used widely.
To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn’t that much of a limitation. But it would be a big improvement if it were fixed.
See if you can guess what limitation number 4 will be!

Entry posted by Baron Schwartz |
One comment
Add to: | | | |

How often should you use OPTIMIZE TABLE? ? followup

Oct 04, 2010

This post follows up on Baron’s How often should you use OPTIMIZE TABLE?. I had the opportunity of doing some massive purging of data from large tables, and was interested to see the impact of the OPTIMIZE operation on table’s indexes. I worked on some production data I was authorized to provide as example.
The use case
I’ll present a single use case here. The table at hand is a compressed InnoDB table used for logs. I’ve rewritten some column names for privacy:

mysql> show create table logs \G

Create Table: CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(20) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`origin` varchar(64) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`message` text NOT NULL,
`level` tinyint(11) NOT NULL DEFAULT ‘0′,
`s` char(16) CHARACTER SET ascii COLLATE ascii_bin NOT NULL DEFAULT ”,
PRIMARY KEY (`id`),
KEY `s` (`s`),
KEY `name` (`name`,`ts`),
KEY `origin` (`origin`,`ts`)
) ENGINE=InnoDB AUTO_INCREMENT=186878729 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

The table had log records starting 2010-08-23 and up till 2010-09-02 noon. Table status:

mysql> show table status like ‘logs’\G
*************************** 1. row ***************************
Name: logs
Engine: InnoDB
Version: 10
Row_format: Compressed
Rows: 22433048
Avg_row_length: 206
Data_length: 4625285120
Max_data_length: 0
Index_length: 1437073408
Data_free: 4194304
Auto_increment: 186878920
Create_time: 2010-08-24 18:10:49
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options: row_format=COMPRESSED KEY_BLOCK_SIZE=8
Comment:

(A bit puzzled on the Create_time; the table was taken from an LVM snapshot of another server, so it existed for a very long time before. Not sure why the Create_time field is as it is here; I assume the MySQL upgrade marked it so, did not have the time nor need to look into it).
I was using Percona-Server-5.1.47-11.2, and so was able to look at the index statistics for that table:

mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name=’logs’;
+————–+————+————–+——–+—————-+————+————+
| table_schema | table_name | index_name?? | fields | row_per_keys?? | index_size | leaf_pages |
+————–+————+————–+——–+—————-+————+————+
| newsminer??? | logs?????? | PRIMARY????? |????? 1 | 1????????????? |???? 282305 |???? 246856 |
| newsminer??? | logs?????? | s |????? 2 | 17, 1????????? |????? 38944 |????? 33923 |
| newsminer??? | logs?????? | name???????? |????? 3 | 2492739, 10, 2 |????? 22432 |????? 19551 |
| newsminer??? | logs?????? | origin?????? |????? 3 | 1303, 4, 1???? |????? 26336 |????? 22931 |
+————–+————+————–+——–+—————-+————+————+

Status after massive purge
My first requirement was to purge out all record up to 2010-09-01 00:00:00. I did so in small chunks, using openark kit‘s oak-chunk-update (same can be achieved with maatkit‘s mk-archiver). The process purged 1000 rows at a time, with some sleep in between, and ran for about a couple of hours. It may be interesting to note that since ts is in monotonically ascending values, purging of old rows also means purging of lower PKs, which means we’re trimming the PK tree from left.
Even while purging took place, I could see the index_size/leaf_pages values dropping, until, finally:

mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name=’logs’;
+————–+————+————–+——–+————–+————+————+
| table_schema | table_name | index_name?? | fields | row_per_keys | index_size | leaf_pages |
+————–+————+————–+——–+————–+————+————+
| newsminer??? | logs?????? | PRIMARY????? |????? 1 | 1??????????? |????? 40961 |????? 35262 |
| newsminer??? | logs?????? | s |????? 2 | 26, 1??????? |????? 34440 |?????? 3798 |
| newsminer??? | logs?????? | name???????? |????? 3 | 341011, 4, 1 |?????? 4738 |?????? 2774 |
| newsminer??? | logs?????? | origin?????? |????? 3 | 341011, 4, 2 |????? 10178 |?????? 3281 |
+————–+————+————–+——–+————–+————+————+

The number of deleted rows was roughly 85% of total rows, so down to 15% number of rows.
Status after OPTIMIZE TABLE
Time to see whether OPTIMIZE really optimizes! Will it reduce number of leaf pages in PK? In secondary keys?

mysql> OPTIMIZE TABLE logs;

mysql> SELECT * FROM information_schema.INNODB_INDEX_STATS WHERE table_name=’logs’;
+————–+————+————–+——–+————–+————+————+
| table_schema | table_name | index_name?? | fields | row_per_keys | index_size | leaf_pages |
+————–+————+————–+——–+————–+————+————+
| newsminer??? | logs?????? | PRIMARY????? |????? 1 | 1??????????? |????? 40436 |????? 35323 |
| newsminer??? | logs?????? | s |????? 2 | 16, 1??????? |?????? 5489 |?????? 4784 |
| newsminer??? | logs?????? | name???????? |????? 3 | 335813, 7, 1 |?????? 3178 |?????? 2749 |
| newsminer??? | logs?????? | origin?????? |????? 3 | 335813, 5, 2 |?????? 3951 |?????? 3446 |
+————–+————+————–+——–+————–+————+————+
4 rows in set (0.00 sec)

The above shows no significant change in either of the indexes: not for index_size, not for leaf_pages, not for statistics (row_per_keys). The OPTIMIZE did not reduce index size. It did not reduce the number of index pages (leaf_pages are the major factor here). Some leaff_pages values have even increased, but in small enough margin to consider as equal.
Index-wise, the above example does not show an advantage to using OPTIMIZE. I confess, I was surprised. And for the better. This indicates InnoDB makes good merging of index pages after massive purging.
So, no use for OPTIMIZE?
Think again: file system-wise, things look different.
Before purging of data:

bash:~# ls -l logs.* -h
-rw-r—– 1 mysql mysql 8.6K 2010-08-15 17:40 logs.frm
-rw-r—– 1 mysql mysql 2.9G 2010-09-02 14:01 logs.ibd

After purging of data:

bash:~# ls -l logs.* -h
-rw-r—– 1 mysql mysql 8.6K 2010-08-15 17:40 logs.frm
-rw-r—– 1 mysql mysql 2.9G 2010-09-02 14:21 logs.ibd

Recall that InnoDB never releases table space back to file system!
After OPTIMIZE on table:

bash:~# ls -l logs.* -h
-rw-rw—- 1 mysql mysql 8.6K 2010-09-02 14:26 logs.frm
-rw-rw—- 1 mysql mysql 428M 2010-09-02 14:43 logs.ibd

On innodb_file_per_table an OPTIMIZE creates a new table space, and the old one gets destroyed. Space goes back to file system. Don’t know about you; I like to have my file system with as much free space as possible.
Need to verify
I’ve tested Percona Server, since this is where I can find INNODB_INDEX_STATS. But this begs the following questions:

Perhaps the results only apply for Percona Server? (I’m guessing not).
Or only for InnoDB plugin? Does the same hold for “builtin” InnoDB? (dunno)
Only on >= 5.1? (Maybe; 5.0 is becoming rare now anyway)
Only on InnoDB (Well, of course this test is storage engine dependent!)

Conclusion
The use case above is a particular example. Other use cases may include tables where deletions often occur in middle of table (remember we were trimming the tree from left side only). Other yet may need to handle UPDATEs to indexed columns. I have some more operations to do here, with larger tables (e.g. 40GB compressed). If anything changes, I’ll drop a note.

Call for Papers for the O’Reilly MySQL Conference

Sep 21, 2010

The call for papers for the O’Reilly MySQL Conference is now open, and closes October 25th.  Submit your proposal now at http://en.oreilly.com/mysql2011/user/proposal/propose/cfp/126!

Rails on PostgreSQL: Rails 3, bundler, and the pg gem

Sep 17, 2010

I was moving a Rails 2 app up to Rails 3 and, because I have PostgreSQL installed in a non-standard location on my server, I ran into problems when bundler was trying to install the pg gem. After fiddling about for a bit I ended up with this in my config/deploy.rb:

require 'bundler/capistrano'

task :set_config_for_pg_gem, :roles => [:app, :db] do
  run "cd #{current_path} && bundle config build.pg --with-pg-config=/usr/local/pgsql/bin/pg_config --no-rdoc --no-ri"
end

before "bundle:install", :set_config_for_pg_gem

This sets up the appropriate command line flags for the pg gem so that they’re in place when Capistrano runs the bundle:install task. The --no-rdoc --no-ri part isn’t necessary, but I figured it’ll save a second or two. Note that these flags end up in the deploy user’s home directory on the server:

$ cat ~/.bundle/config
---
BUNDLE_BUILD__PG: --with-pg-config=/usr/local/pgsql/bin/pg_config --no-rdoc --no-ri

Running this task every time you deploy is a little wasteful since it sets the configuration unnecessarily – really you just need it before the first time you deploy. So you could optimize things by touching a file in shared/system/ or some such and checking it as part of this task.