Putting Theory to Practice: Understanding the SSD Performance Degradation Problem

Let’s look at the problem in the real world. You, me and our best friend have decided to start making SSDs. We buy up some NAND-flash and build a controller. The table below summarizes our drive’s characteristics:

  Our Hypothetical SSD
Page Size 4KB
Block Size 5 Pages (20KB)
Drive Size 1 Block (20KB
Read Speed 2 KB/s
Write Speed 1 KB/s

 

Through impressive marketing and your incredibly good looks we sell a drive. Our customer first goes to save a 4KB text file to his brand new SSD. The request comes down to our controller, which finds that all pages are empty, and allocates the first page to this text file.


Our SSD. The yellow boxes are empty pages

The user then goes and saves an 8KB JPEG. The request, once again, comes down to our controller, and fills the next two pages with the image.


The picture is 8KB and thus occupies two pages, which are thankfully empty

The OS reports that 60% of our drive is now full, which it is. Three of the five open pages are occupied with data and the remaining two pages are empty.

Now let’s say that the user goes back and deletes that original text file. This request doesn’t ever reach our controller, as far as our controller is concerned we’ve got three valid and two empty pages.

For our final write, the user wants to save a 12KB JPEG, that requires three 4KB pages to store. The OS knows that the first LBA, the one allocated to the 4KB text file, can be overwritten; so it tells our controller to overwrite that LBA as well as store the last 8KB of the image in our last available LBAs.

Now we have a problem once these requests get to our SSD controller. We’ve got three pages worth of write requests incoming, but only two pages free. Remember that the OS knows we have 12KB free, but on the drive only 8KB is actually free, 4KB is in use by an invalid page. We need to erase that page in order to complete the write request.


Uhoh, problem. We don't have enough empty pages.

Remember back to Flash 101, even though we have to erase just one page we can’t; you can’t erase pages, only blocks. We have to erase all of our data just to get rid of the invalid page, then write it all back again.

To do so we first read the entire block back into memory somewhere; if we’ve got a good controller we’ll just read it into an on-die cache (steps 1 and 2 below), if not hopefully there’s some off-die memory we can use as a scratch pad. With the block read, we can modify it, remove the invalid page and replace it with good data (steps 3 and 4). But we’ve only done that in memory somewhere, now we need to write it to flash. Since we’ve got all of our data in memory, we can erase the entire block in flash and write the new block (step 5).

Now let’s think about what’s just happened. As far as the OS is concerned we needed to write 12KB of data and it got written. Our SSD controller knows what really transpired however. In order to write that 12KB of data we had to first read 12KB then write an entire block, or 20KB.

Our SSD is quite slow, it can only write at 1KB/s and read at 2KB/s. Writing 12KB should have taken 12 seconds but since we had to read 12KB and then write 20KB the whole operation now took 26 seconds.

To the end user it would look like our write speed dropped from 1KB/s to 0.46KB/s, since it took us 26 seconds to write 12KB.

Are things starting to make sense now? This is why the Intel X25-M and other SSDs get slower the more you use them, and it’s also why the write speeds drop the most while the read speeds stay about the same. When writing to an empty page the SSD can write very quickly, but when writing to a page that already has data in it there’s additional overhead that must be dealt with thus reducing the write speeds.

The Blind SSD Free Space to the Rescue
POST A COMMENT

350 Comments

View All Comments

  • FHDelux - Wednesday, March 18, 2009 - link

    That was the best review i have read in a long time. I originally bought an OCZ Core drive when they first came out. It was the worst piece of garbage i had ever used. Newegg wouldn't let me send it back and OCZ support forums told me all sorts of junk to get me to fix it but it was just a poorly designed drive. I eventually ended up getting the egg to take it back for credit and i wrote OCZ off as a company blinded by the marketing department. I currently own an Intel SSD and its wonderfull, everytime i see OCZ statements saying their drive competes with the Intel drive i would laugh and think back to the OCZ techs telling me i need to update my bios, or i need to install vista service pack 1 before it would work right.

    I am thankful that you slapped that OCZ big wig around until they made a good product. All of us out there that wasted our time and money on Pre-vertex generation drives are greatfull to you and the whole industry should be kissing your butt right now.

    One thing these companies need to learn is that marketing isn't the answer, creating solid products is. Hopefully OCZ has learned their lesson, and because of your article i will give them another chance.

    THANK YOU!
    Reply
  • kelstertx - Wednesday, March 18, 2009 - link

    I didn't want to worry about eventual failure of the Flash chips of an SSD, and went with an SDRAM based Ramdrive from Acard. These drives have no latency of any kind, since they use SDRAM, and no lifespan of write cycles. I've been using mine for a couple of weeks now, and I like it a lot. I put Ubuntu on mine, and had 2G left for my small home folder. The standard HDD is my long-term storage for data files, music, etc. As SDRAM gets more affordable over time, I can add DIMMs and bump up the size.

    I know this review was about SSDs strictly, so an SDRAM drive doesn't technically fit, but it would have been interesting to see a 9010 or 9010b in there for comparison. It beat the Intel SSD in almost all the tests. http://techreport.com/articles.x/16255/1">http://techreport.com/articles.x/16255/1

    Reply
  • 7Enigma - Wednesday, March 18, 2009 - link

    I've been eying these guys ever since the announced their first press release. Every time I always was drawn away by the constant need for power (4h max on battery scares the bejeezus out of me if I was to be gone on vacation during a storm), high power usage at all times, and high cost of entry (after factoring in all of the ram modules).

    I really dislike that article as well, since I think the bottlenecks were much less apparent with such a horribly slow cpu. The majority of that review's data is extremely compressed. I mean a P4, and 1 gig of memory; are you F'ing kidding me? This article was written in Jan of this year!? Why didn't they just use my old 486DX?
    Reply
  • tirez321 - Wednesday, March 18, 2009 - link

    What would a drive zeroing tool do to write performance, like if you used acronis privacy expert to zero only the "free space" regularly? Would it help write performance due to the drive not having to erase pages before writing? Reply
  • tirez321 - Wednesday, March 18, 2009 - link

    I can kinda see that it wouldn't now.
    Because there would still be states there regardless.
    But if you could inform the drive that it is deleted somehow, hmm.

    Reply
  • strikeback03 - Wednesday, March 18, 2009 - link

    The subjective experiences with stuttering are more important to me than most of the test numbers. Other tests I have found of the G.Skill Titan and similar have looked pretty good, but left out mention of stuttering in use.

    Too bad, as the 80GB Intel is too small and the ~$300 for a 120GB is about the most I am willing to pay. Maybe sometime this year the OCZ Vertex or similar will get there.
    Reply
  • strikeback03 - Tuesday, March 24, 2009 - link

    When I wrote that, the Newegg price for the 120GB Vertex was near $400. Now they have it for $339 with a $30 MIR. Now that's progress. Reply
  • kamikaz1k - Wednesday, March 18, 2009 - link

    the latency times are switched...incase u wanted to kno.
    also, first post ^^ hallo!
    Reply
  • GourdFreeMan - Wednesday, March 18, 2009 - link

    It seems rather premature to assume the ATA TRIM command will significantly improve the SSD experience on the desktop. If you were to use TRIM to rewrite a nonempty physical block, you do not avoid the 2ms erase penalty when more data is written to that block later on and instead simply add the wear of another erase cycle. TRIM, then, is only useful for performance purposes when an entire 512 KiB physical block is free.

    A well designed operating system would have to keep track of both the physical and logical maps of used space on an SSD, and only issue TRIM when deletion of a logical cluster coincides with the freeing of an entire physical block. Issuing TRIMs at any other time would only hurt performance. This means the OS will have significantly fewer opportunities to issue TRIMs than you assume. Moreover, after significant usage the physical blocks will become fragmented and fewer and fewer TRIMs will be able to be issued.

    TRIM works great as long as you only deal with large files, or batches of small files contiguously created and deleted with significant temporal locality. It would greatly aid SSDs in the "used" state Anand artificially creates in this article, but on a real system where months of web browsing, Windows updates and software installing/uninstalling have occurred the effect would be less striking.

    TRIM could be mated with periodic internal (not filesystem) defragmentation to mitigate these issues, but that would significantly reduce the lifespan of the SSD...

    It seems the real solution to the SSD performance problem would be to decrease the size of the physical block... ideally to 4 KiB, as that is the most common cluster size on modern filesystems. (This assumes, of course, that the erase, read and write latencies could be scaled down linearly.)
    Reply
  • Kary - Thursday, March 19, 2009 - link

    Why use TRIM at all?!?!?

    If you have extras Blocks on the drive (NOT PAGES, FULL BLOCKS) then there is no need for TRIM command.

    1)Currently in use BLOCK is half full
    2)More than half a block needs to be written
    3)extra BLOCK is mapped into the system
    4)original/half full block is mapped out of system.. can be erased during idle time.

    You could even bind multiple continuous blocks this way (I assume that it is possible to erase simultaneously any of the internal groupings pages from Blocks on up...they probably share address lines...ex. erase 0000200 -> just erase block #200 ....erase 00002*0 -> erase block 200 to 290...btw, did addressing in base ten instead of binary just to simplify for some :)
    Reply

Log in

Don't have an account? Sign up now