More and more database vendors are talking about the wonders SSDs can do for transactional (OLTP) databases. So I read Anand's latest SSD article with more than the usual interest. If many of the cheaper MLC SSDs write small blocks 20 times slower than a decent hard drive, these SSDs are an absolute nightmare for OLTP databases.
In our last Dunnington review, we showed our latest virtualization test, which includes four concurrent OLTP ("Sysbench") tests on four separate MySQL 5.1.23 databases in four ESX virtual machines. We were fairly confident that our six-disk RAID-0 for data and one separate disk for logging could keep up. After all, each disk is a 300 GB Seagate Cheetah at 15,000 RPM, probably one of the fastest (mechanical) disks on the planet, as it can deliver up to 400 I/Os per second (and a 125 MB/s sequential data rate).
But it is better to be safe than sorry. We did extensive monitoring with iostat (on a "native" SLES 10 SP2) and found the following numbers on the disk that handles the logging transactions:
  • queue length is about 0.22 (more than 2 would indicate that the hard disk cannot keep up)
  • typical average I/O latency is 0.23 ms (90% of the time), with about 10% spikes of 7 to 12 ms (we measure the average over the past 2 seconds)
That reassured us that our transaction log disk was not a bottleneck. On a "normal" SLES 10 SP2 we achieved 1400 tr/s on a quad core (an anonymous CPU for now ;-). But Anand's article got us curious, so we replaced our mighty Cheetah disk with the Intel X25-M SSD (80 GB). All of a sudden we achieved 1900 tr/s! No less than 35% more transactions, just by replacing the disk that holds the log with the fastest SSD of the moment. That is pretty amazing when you consider there was no indication whatsoever that our log disk was a bottleneck.
So we had to delve a little deeper. I first thought that as long as the hard disk is not the bottleneck, the number of transactions would be more or less the same with a faster disk. It turned out I was somewhat wrong.

In MySQL, each user thread issues a write when its transaction is committed. More importantly, this write is completely serial: there does not seem to be a separate log I/O thread that would allow the user thread to "fire" a disk operation "and forget" it. As we want to be fully ACID compliant, our database is configured with
 innodb_flush_log_at_trx_commit = 1
So after each transaction is committed, there is a "pwrite" first, followed by a flush to disk. The actual transaction performance is thus also influenced by the disk's write latency, even when the disk is nowhere near its limits.
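For reference, the relevant my.cnf fragment looks roughly like this (only the flush setting is from our actual configuration; the comments summarize the documented trade-offs of the other values):

```ini
[mysqld]
# 1 = write *and* fsync the log at every commit: fully ACID, but every
#     commit waits on the log disk's write latency (our setting)
# 2 = write at commit, fsync about once per second: up to ~1 s of
#     transactions can be lost on an OS crash
# 0 = write and fsync about once per second: up to ~1 s can be lost
#     even on a mysqld crash
innodb_flush_log_at_trx_commit = 1
```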
We still have to investigate this further, but it seems to go a bit against the typical sizing advice for OLTP databases: make sure your log disks achieve a certain number of I/Os per second, or put otherwise, "make sure you have enough spindles". That does not paint the complete picture: since each write to disk sits in the critical path of a transaction, the latency of each individual access influences performance.
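A toy model (our own sketch, not measured data) of why the latency of each flush, rather than aggregate I/Os per second, caps a serialized commit path:

```python
def max_tps(avg_flush_latency_ms: float) -> float:
    """With one serial log flush per commit, throughput is bounded by
    1 / latency, no matter how many spindles sit behind the volume."""
    return 1000.0 / avg_flush_latency_ms

# A 15k RPM disk whose occasional seeks push the average to ~0.9 ms,
# versus an SSD flushing in ~0.1 ms:
print(round(max_tps(0.9)))   # upper bound around 1111 commits/s
print(round(max_tps(0.1)))   # upper bound around 10000 commits/s
```

Adding spindles raises the volume's aggregate IOPS, but it does not make any single flush return sooner, which is why the bound above does not move.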
We monitored the same Sysbench benchmark on our Intel X25-M disk: 
  • queue length is lower: 0.153 (but 0.22 was already very low)
  • typical access latency: an average of 0.1 ms, with very few spikes of 0.5 ms
  • 1900 instead of 1400 tr/s
So our conclusion so far: in the case of MySQL OLTP, sizing for I/Os per second seems to matter less than the individual write latency. To put it more bluntly: in many cases even tens of spindles will not beat one SSD, because each individual spindle has a relatively high access latency. We welcome your feedback!



Comments

  • ggordonliddy - Tuesday, November 11, 2008 - link

    Yes, thanks.
  • hieuu - Monday, November 10, 2008 - link


    Please explain your exact setup / drive configuration / controller configuration.

    Was the log drive on the same controller as the swap drive? What was your caching scheme?

  • JohanAnandtech - Tuesday, November 11, 2008 - link

    This should answer your question:

    InnoDB buffer pool of 950 MB, 258 MB database (so all indexes are cached), Adaptec 5805 RAID card with 512 MB of DDR-2.

    Basically the SSD was plugged into the backplane of our Supermicro server. So it is running on a relatively low end LSI 1068 controller.
  • torsteinowich - Monday, November 10, 2008 - link

    Are MLC flash devices even worth considering for database applications with their 10000 cycle erase limit? I'm no database expert, but 10000 sounds low for a database application.

    Even SLCs might not live that long on databases with much write activity. For mostly read access applications SSDs are great, though.
  • JohanAnandtech - Monday, November 10, 2008 - link

    Well, the MLC drive was the only one available to me, so you are probably right. I do think SLC wear leveling is capable of making a drive survive longer than a mechanical disk, especially Intel-quality wear leveling.

    But I was quite surprised to hear how many IT people (*) were using these cheap MLC drives for databases and quite happy about it. They might be in for a bad surprise.

    (*) Just hearsay, no real statistical data :-)
  • vol7ron - Thursday, November 13, 2008 - link

    Intel's MLC has a wear-leveling controller that reduces wear on each cell and extends the drive's life. Because the location where the data is stored doesn't matter (unlike a hard disk, where closer is faster), the algorithm makes sure to use the whole drive. Additionally, it minimizes the number of times each cell is erased.

    Notice first that it is not the re-writes that kill MLCs; it is the erase-then-write-again cycle. As long as a cell is not erased, it will not be damaged. I'm not sure how this will affect the storage, but if they're using a good logging system, no DELETEs/UPDATEs will occur, only inserts with adjustments, which means the data is hardly ever changed (aside from the indexes and logs).
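    A back-of-the-envelope endurance estimate (our own sketch; the 10,000-cycle figure comes from the comments above, while the write rate, perfect wear leveling, and the absence of write amplification are all assumptions):

```python
def years_to_wear_out(capacity_gb: float, erase_cycles: int,
                      write_mb_per_s: float) -> float:
    """Ideal wear leveling: every cell is erased equally often;
    write amplification is ignored, so this is an upper bound."""
    total_mb = capacity_gb * 1024 * erase_cycles
    seconds = total_mb / write_mb_per_s
    return seconds / (3600 * 24 * 365)

# 80 GB MLC drive, 10,000 erase cycles, a steady 5 MB/s of log writes:
print(round(years_to_wear_out(80, 10_000, 5), 1))  # about 5.2 years
```

Real-world write amplification would shorten this considerably, but the order of magnitude shows why wear leveling changes the endurance picture.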
  • RagingDragon - Monday, November 10, 2008 - link

    Well, hopefully they back up their databases regularly, no matter what HD or SSD they use.
  • Goto 10 - Monday, November 10, 2008 - link

    I would imagine the RAID controller would hide the write latency as long as you have a full write cache with battery backup running.
    Since the price of "real" RAID controllers has dropped and the cache is now 256/512 MB per card, you could go with two controllers: one for data and one for logging. This should give you the write speed you need, but it wouldn't really help with the low access speed of uncached reads.
    I think you would gain a lot more by swapping in the X25-M for the data part rather than the logging part.

    But I'm eagerly waiting for a full SSD / Nehalem database article; hopefully it will change the battlefield like AMD did with its Opterons when they launched and took on the P4-based Xeons.
  • JohanAnandtech - Monday, November 10, 2008 - link

    "I think you would get alot more if you swapped to x25-M on the data part than on the logging part. "

    I don't think so. Average latency on the data disk is much lower, as the RAID controller accesses those disks far more sequentially. We are seeing very low activity (a few KB/s) and 50 MB/s at the end of the test. I am pretty sure those 50 MB/s are very sequential writes to our six-disk RAID-0. That RAID-0 set (probably capable of 500 MB/s *) would have no problem with a sequential 50 MB/s. I don't see how the SSD could make a difference there. In this case, more spindles do help.

    (*) (gives a rough idea)
  • rmlarsen - Monday, November 10, 2008 - link

    Yes, the combination of low latency and non-volatility of SSDs is really a game changer for database applications.

    It does trouble me that you seem surprised to (re-)discover that low write latency to the WAL is absolutely critical to the performance of a transactional database. In fact, the proof is right in your own numbers:

    quote: "* typical average I/O latency is 0.23 ms (90%), with about 10% spikes of 7 to 12 ms

    That reassured us that our transaction log disk was not a bottleneck"

    No, that shows exactly that your disk latency is the limit: if these numbers are in the right ballpark, the average latency is at least (0.9 * 0.23 + 0.1 * 7) ~= 0.9 ms, which limits the number of transactions per second to ~1100. Your performance is limited by the 10% of transactions that actually incur a disk-related latency.

    Anyhow, thanks for posting these measurements.
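    rmlarsen's arithmetic above checks out; a quick sketch using the latency distribution quoted from the article (the serialized-commit assumption is rmlarsen's):

```python
# 90% of log flushes at 0.23 ms, 10% spiking to ~7 ms (measured
# values from the article).
avg_latency_ms = 0.9 * 0.23 + 0.1 * 7   # weighted average, ~0.91 ms
max_tps = 1000.0 / avg_latency_ms       # one serial flush per commit
print(round(avg_latency_ms, 2), round(max_tps))
```

That ~1100 tr/s ceiling sits strikingly close to the 1400 tr/s measured on the Cheetah, supporting the claim that the occasional latency spikes, not aggregate IOPS, were the real limit.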
