Pencarian

Rss Posts

 

 

 

Database Architectures & Performance

Jul 20, 2010

For decades the debate between shared-disk and shared-nothing databases has raged. The shared-disk camp points to the laundry list of functional benefits such as improved data consistency, high-availability, scalability and elimination of partitioning/replication/promotion. The shared-nothing camp shoots back with superior performance and reduced costs. Both sides have a point.First, let?s look at the performance issue. RAM (average access time of 200 nanoseconds) is considerably faster than disk (average access time of 12,000,000 nanoseconds). Let me put this 200:12,000,000 ratio into perspective. A task that takes a single minute in RAM would take 41 days in disk. So why do I bring this up?Shared-Nothing: Since the shared-nothing database has sole ownership of its data?it doesn?t share the data with other nodes?it can operate in the machine?s local RAM, only writing infrequently to disk (flushing the data to disk). This makes shared-nothing databases very fast.Shared-Disk: Cannot rely on the machine?s local RAM, because every write by one node must be instantly available to the other nodes, to ensure that they don?t use stale data and corrupt the database. So instead of relying on local RAM, all write transactions must be written to disk. This is where the 1 minute to 41 days ratio above comes into play and kills performance of shared-disk databases.Let?s look at some of the ways databases can utilize RAM instead of disk to improve performance:Read Cache: Databases typically use the RAM as a fast read cache. Upon reading data from the disk, this data is stored in the read cache so that subsequent use of that data is satisfied from RAM instead of the disk. For example, upon reading a person?s name from disk, that name is stored in the cache for fast access. The database wouldn?t need to read that name from disk again until that person?s name is changed (rare), or that RAM space is reused for a piece of data that is used more frequently. Read cache can significantly improve database performance. BOTH shared-disk and shared-nothing databases can exploit read cache. The shared-disk database just needs a system to either invalidate or update the data in read cache when one of the nodes has made a change. This is pretty standard in shared-disk databases.Background Writing: Writing data to the disk is by far the most time consuming process in a write transaction. During the transaction, that portion of the data is locked, meaning it is unavailable for other functions. So, if you can move the writing of the data outside of the transaction?write the data in the background?you get faster transactions, which means less locking contention, which means faster throughput. SHARED-NOTHING can exploit this performance enhancement, since each server owns the data in its RAM. However, shared-disk databases cannot do this because they need to share that updated data with the other database nodes in the cluster. Since the local node?s cache is not shared, in a shared-disk database, the only option is to use the shared disk to share that data across the nodes.Transactional Cache: The next step in utilizing RAM instead of disk is to use it in a transactional manner. This means that the database can make multiple changes to data in RAM prior to writing the final results to disk. For example, if you have 100 widgets, you can store that inventory count in RAM, and then decrement it with each sale. If you sell 23 widgets, then instead of writing each transaction to disk, you update it in RAM. When you flush this data to disk, it results in a single disk write, writing the inventory number 77, instead of writing each of the 23 transactions individually to disk.SHARED-NOTHING can perform transactions on data while it is in RAM. Once again, shared-disk databases cannot do this because you might have multiple nodes updating the inventory. Since they cannot look into each others local RAM, they must once again write each transaction to disk.As you can see, shared-nothing databases have an inherent performance advantage. The next blog post will address how modern shared-disk databases address these performance challenges.

Making ?Insert Ignore? Fast, by Avoiding Disk Seeks

Jul 06, 2010


In my post from three weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use ?replace into? or ?insert ignore? over normal inserts, because the semantics of these statements do NOT require disk seeks. In my post last week, I explained how the command ?replace into? can be fast with TokuDB’s fractal trees. Today, I explain how “insert ignore” can be fast, using a strategy that is very similar to what we do with “replace into”.

The semantics of “insert ignore” are similar to that of “replace into”:

if the primary (or unique) key does not exist: insert the new row
if the primary (or unique) key does exist: do nothing

B-trees have the same problem with “insert ignore” that they have with “replace into”. They perform a lookup of the primary key, incurring a disk seek. We have already shown how fractal trees do not incur this disk seek for “replace into”, so let’s see how we can avoid disk seeks with “insert ignore”.

The only difference with “replace into” is when the primary (or unique) key exists, instead of overwriting the old row with the new row, we disregard the new row. So, all we need to do is tweak our tombstone messaging scheme (that we use for deletes and “replace into”) so that when “insert ignore” commands do not overwrite old rows with new rows. Similar to deletes and replace into, with this scheme, “insert ignore? can be two orders of magnitude faster than insertions into a B-tree.

Here is what we do. We insert a message into the fractal tree, with a new message “ii”, to signify that we are doing an “insert ignore”. The only difference between this message and the normal “i” message for insertions is what we do on queries and merges. On queries, if the message is an “ii”, then the value in the LOWER node is read, and not the higher node. On merges, if the higher node has a message of “ii”, the value in the LOWER node takes precedence over the value in the higher node.

Let’s look at an example that is similar to what we looked at for “replace into”:

create table foo (a int, b int, primary key (a));

Suppose the fractal tree for this table looks as follows:

-

- -

- – - -

….

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) … (i (1000,1000)) … (i (2^32, 2^32))

The ?i? stands for insertion message. Now suppose we do:

insert ignore into foo values (1000, 1001).

With fractal trees, we insert (ii (1000,1001)) into the top node. The tree then looks as such:

(ii (1000,1001))

- -

- – - -

….

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) … (i (2^32, 2^32))

So upon querying the key ?1000′, a cursor notices that (1000,1001) has a message of “ii”. If it finds another value for the key 1000 in a lower node, it reads that value, otherwise, it reads (1000,1001). Because (1000,1000) is located in a lower node, the cursor returns (1000,1000) to the user. On merges, the message in the lower node, (1000,1000) overwrites the message in the higher node, (1000,1001).

While “insert ignore” can be fast, there are caveats (indexes, triggers, replication), just as there are with “replace into”. In a future posting, I will get into some of them.

Benchmarking MySQL ACID performance with SysBench

Jun 20, 2010

A couple of question I get a lot from MySQL customers is “how will this hardware upgrade improve my transactions per second (TPS)” and “what level of TPS will MySQL perform on this hardware if I’m running ACID settings?” Running sysbench against MySQL with different values for per-thread and global memory buffer sizes, ACID settings, and other settings gives me concrete values to bring to the customer to show the impact that more RAM, faster CPUs, faster disks, or cnf changes have on the server. Here are some examples for a common question: “If I’m using full ACID settings vs non-ACID settings what performance am I going to get from this server?”
Let’s find out by running sysbench with the following settings (most are self explanatory – if not the man page can explain them):

sysbench –test=oltp –db-driver=mysql –oltp-table-size=1000000 –mysql-engine-trx=yes –oltp-test-mode=complex –oltp-read-only=off –oltp-dist-type=special –max-requests=0 –num-threads=8 –max-time=120 –init-rng=on run

MySQL Settings:
In the first test MySQL is set to the following ACID related settings. This will give us results for TPS performance without full ACID compliance – very common settings on a server that is handling blogs, ad serving, general business websites, and other roles where full ACID is not required and performance is valued over the benefits of full ACID. These are important settings when we look at the difference in performance when we change to full ACID in the second test.

innodb_flush_log_at_trx_commit = 0
sync_binlog=0
transaction-isolation=REPEATABLE-READ

System configuration and InnoDB buffer pool size:

XEON E5345 Series 2.33ghz 8-core, 16GB RAM, Local SATA 7.2K disks
innodb_buffer_pool_size = 10G

Full result set from sysbench:
Summary OLTP test statistics:

queries performed:
transactions: ? ? ? ? ? ? ? ? ? ? ? ?172426 (1436.83 per sec.)
read/write requests: ? ? ? ? ? ? ? ? 3276664 (27304.51 per sec.)
other operations: ? ? ? ? ? ? ? ? ? ?344882 (2873.91 per sec.)

Non-ACID results:
We can simplify the results by looking at the following TPS results for this non-ACID test:

transactions: ? ? ? ? ? ? ? ? ? ? ? ?172426 (1436.83 per sec.)

Full ACID results:
Let’s go ahead and run the test again with different ACID settings. This will give us the TPS results for full ACID compliance:

innodb_flush_log_at_trx_commit = 1
sync_binlog=1
transaction-isolation=REPEATABLE-READ

We get the following results for TPS:

transactions: ? ? ? ? ? ? ? ? ? ? 3197 ? (26.58 per sec.)
read/write requests: ? ? ? ? ? ? ? ? 60743 ?(505.04 per sec.)
other operations: ? ? ? ? ? ? ? ? ? ?6394 ? (53.16 per sec.)

Final Results:
So as you can see the difference between full ACID settings and not (on the same server with only those values on the cnf being changed) results in a huge difference in performance on this standard database server. We can now hand this data to the customer and they will know what impact the settings will have on their application’s performance and what to expect when running full ACID vs non-ACID.
More info on using sysbench here:?http://sysbench.sourceforge.net

Can we get faster expression handling in MySQL

Mar 30, 2010

Andrew from Sphinx continues to work on improving SQL (or SphinxQL) support and now he published benchmarks comparing arithmetic expression handling in Sphinx to one in MySQL. The result ? Sphinx scored 3x to 20x faster. Andrew goes to explain results are not 100% comparable (as we can see in the table results are even different) and some performance can be attributed to Sphinx using different typing for expression. Though I#8217;m wondering how much of Performance difference that really explain. I doubt it is 20x. Andrew goes on to explain there are further optimizations possible for Sphinx such as JIT compilation and expression optimization which he claims can increase the lead even further
Andrew is kind enough to provide explanations and benchmark results though this is far from the first notion of this problem I hear about. Many other companies looking to optimize MySQL problem, especially in analytics space when expressions should be computed over hundreds of millions of rows have expressed their concerns with this part of MySQL execution. I#8217;m wondering if we#8217;re going to see any improvements in this space by Oracle, MariaDB or Drizzle. I think Drizzle is in advantage here #8211; with simplifying MySQL functionality a lot they probably can also simplify expression handling to be faster.

Entry posted by peter |
3 comments
Add to: | | | |