More and more of the database vendors are talking about the wonders that SSD can do for transactional (OLTP) databases. So I read Anand's latest SSD article with more than usual interest. If many of the cheaper MLC SSD's write small blocks 20 times slower than a decent harddrive, these SSD's are an absolute nightmare for OLTP databases. 
 
In our last Dunnington review, we showed our latest virtualization test which includes 4 concurrent OLTP ("Sysbench") tests on four separate MySQL 5.1.23 databases in four ESX virtual machines. We were fairly confident that our 6 disks RAID-0 for data and 1 separate disk for logging were capable of keeping up. After all, each disk is a 300 GB Cheetah Seagate at 15000 rpm, probably one of the fastest (mechanical) disks on this planet as it can deliver up to 400 I/O per second (and 125 MB/s sequential data rate).
 
But it is better to be safe than to be sorry. We did extensive monitoring with IOstat (on a "native" SLES 10 SP2) and found the following numbers on the disk that performs the logging transactions: 
  • queue length is about 0.22 (More than 2 indicates that the harddisk can not keep up)
  • typical average I/O latency is 0.23 ms (90%), with about 10% spikes of 7 to 12 ms (we measure the average over the past 2 seconds) 
That reassured us that our transaction log disk was not a bottleneck. On a "normal" SLES 10 SP2 we achieved 1400 tr/s on a quad core (an anonymous CPU for now ;-). But Anand's article really got us curious and we replaced our mighty Cheetah disk with the Intel x25-M SSD (80 GB). All of a sudden we achieved 1900 tr/s! No less than 35% more transactions, just by replacing the disk that holds the log with the fastest SSD of the moment. That is pretty amazing if you consider that there is no indication whatsoever that we were bottlenecked by our log disk.
 
So we had to delve a little deeper. I first thought that as long as the harddisk is not the bottleneck, the number of transactions would be more or less the same with a faster disk. It turned out that I was somewhat wrong. 

In MySQL each user thread can issue a write when the transaction is commited . More importantly is a completely serial, there doesn't seem to be a separate log I/O thread which would allow our user thread to "fire" a disk operation "and forget". As we want to be fully ACID compliant our database is configured with
 innodb_flush_log_at_trx_commit = 1
 
So after each transaction is committed, there is a "pwrite" first, then followed by a flush to the disk. So the actual transactions performance is also influenced by the disk write latency even if the disk is nowhere near it's limits.
 
We still have to investigate this further but this seems to go a bit against the typical sizing advice that is given for OLTP databases: make sure your log disks achieve a certain numbers of I/Os or put otherwise: "make sure you have enough spindles". That doesn't seem to paint the complete picture: as each write to disk action seems to be in the "critical speed path" of your transaction, each individual access latency seems to influence performance.
 
We monitored the same Sysbench benchmark on our Intel X25-M disk: 
  • Queue length is lower: 0.153 (but 0.2 was already very low)
  • typical access latency: an average 0.1 with very few spikes of 0.5 ms.
  • 1900 instead of 1400 tr/s
 So our conclusion so far seems to be that in case of MySQL OLTP, sizing for IO/s seems to be less important than the individual write latency. To put it more blunt: in many cases even tens of of spindles will not be able to beat one SSD as each individual disk spindle has a relatively high latency. We welcome your feedback!
 
 
 


Comments Locked

33 Comments

View All Comments

  • JohanAnandtech - Monday, November 10, 2008 - link

    I am trying to understand every step of the transaction, but I don't think it is that simple.

    I can not imagine that a thread would actually wait for the disk to say "it is finished" ? After all, for the thread, the disk is just a block device it writes too, and there is no further communication?

    The surprise was that the current advice of "get enough spindles" is not accurate. I would have expected that as long as the disks can keep up, the user thread would sent their transaction to the disk and forget about it. Kind of an async operation.

    Secondly, I would have expected our BBU protected RAID card which is a really fast card(*) to take care of disk latency as long as the disks can keep up.

    (*)http://it.anandtech.com/cpuchipsets/intel/showdoc....">http://it.anandtech.com/cpuchipsets/intel/showdoc....
  • rmlarsen - Monday, November 10, 2008 - link

    I think your assumption about asynchronous writes is correct for the data part (on some databases), but not for the transaction log data. To ensure ACID compliance the thread writing the transaction log HAS to wait for the disk controller to say "finished". I believe that is what the "flush" part of "innodb_flush_log_at_trx_commit" refers to.

    As you and others point out, the controller cache can mostly hide the latency, but if the cache runs full your thread will have to wait for one or more rotational periods of the disk (4ms). This appears to happen ~10% of the time for you. As you increase the size of the controller cache and/or the rate at which the controller can flush it to disk (which depends on both transfer speed and latency of the disk(s) itself), the fraction of transactions having to wait drops, as you also see in your experiment. I wonder if allocating more than one spindle to the transaction log file (in your non-SSD setup) would actually help in this regard?
  • JohanAnandtech - Tuesday, November 11, 2008 - link

    "I wonder if allocating more than one spindle to the transaction log file (in your non-SSD setup) would actually help in this regard? "

    Do you have a suggestion on how this could be done best? RAID-10 with a stripe size of 16 KB (= log writes)?
  • rmlarsen - Tuesday, November 11, 2008 - link

    Sorry, but I don't have concrete experience with this setup, so your guess is probably better than mine. Anything involving RAID-1 (and by extension RAID-10) might not be optimal since the controller has to wait for both disks in the pair to commit a block and therefore the average latency is higher. I think you'll just have to experiment...RAID-0, RAID-10, different block sizes etc.

    RAID-0 is probably not acceptable from a reliability standpoint, but at least it should give you an idea whether anything can be gained by going this route. I look forward to hearing what you find.
  • CU - Monday, November 10, 2008 - link

    Is this a new AMD cpu? I thought all the Core i7 stuff had already been released.
  • RagingDragon - Monday, November 10, 2008 - link

    The desktop i7's have been released, but the not the server variants haven't been released yet. And of course, AMD's Shanghai is coming soon too, so we can't be completely certain which processor it is... ;)
  • JarredWalton - Monday, November 10, 2008 - link

    Probably not for the server market... I can't say for sure.
  • davepermen - Monday, November 10, 2008 - link

    I'd like to see how the Intel SLC SSD would perform.
  • JohanAnandtech - Monday, November 10, 2008 - link

    I am not able to answer that question accurately, but my first estimate would be that the difference would be rather small. I don't really see how SLC would improve the already spectacular low write latencies much more.
  • JohanAnandtech - Tuesday, November 11, 2008 - link

    I stand corrected: apparantely: SLC drives have up to 3 times lower write latency.

Log in

Don't have an account? Sign up now