The Secret Sauce: 0.5x Write Amplification

The downfall of all NAND flash-based SSDs is the dreaded read-modify-write scenario. I’ve explained this a few times before. Basically, your controller goes to write some amount of data, but because of all the reorganization that needs to be done, it ends up writing a lot more. The ratio of how much you actually write to how much you wanted to write is write amplification. Ideally this should be 1: you want to write 1GB and you actually write 1GB. In practice it can be as high as 10 or 20x on a really bad SSD. Intel claims that the X25-M’s dynamic nature keeps write amplification down to a manageable 1.1x. SandForce says its controllers write a little less than half what Intel does.

SandForce states that a full install of Windows 7 + Office 2007 results in 25GB of writes from the host, yet only 11GB of writes actually reach the flash. In other words, 25GB of files are written and available on the SSD, but only 11GB of flash is actually occupied. Clearly it’s not bit-for-bit data storage.
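To put numbers on that claim, here is a quick back-of-the-envelope sketch. The function name is just an illustration; the 25GB and 11GB figures are SandForce's, and the 1.1x figure is Intel's claim quoted above.

```python
def write_amplification(flash_writes_gb: float, host_writes_gb: float) -> float:
    """Ratio of data actually written to flash to data the host asked to write."""
    if host_writes_gb <= 0:
        raise ValueError("host writes must be positive")
    return flash_writes_gb / host_writes_gb

# SandForce's quoted install figures: 25GB from the host, 11GB to flash.
sandforce = write_amplification(11, 25)
print(f"SandForce install example: {sandforce:.2f}x")      # prints 0.44x

# Intel's claimed 1.1x applied to the same 25GB of host writes.
intel_flash_gb = 1.1 * 25
print(f"Intel at 1.1x would write {intel_flash_gb:.1f}GB")  # prints 27.5GB
```

For this install, 11GB against roughly 27.5GB is 0.4 of what a 1.1x drive would write, which lines up with the "a little less than half" claim.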

What SF appears to be doing is some form of real-time compression on data sent to the drive. SandForce told me that it’s not strictly compression but a combination of several techniques that are chosen on the fly depending on the workload.

SandForce referenced data deduplication as a type of data reduction algorithm that could be used. The principle behind data deduplication is simple. Instead of storing every single bit of data that comes through, store only the pieces that are unique and replace any additional duplicates with references to the copy you already have. Now presumably your hard drive isn’t full of copies of the same file, so deduplication isn’t exactly what SandForce is doing - but it gives us a hint.
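The deduplication principle can be sketched in a few lines. This is a toy illustration and not SandForce's implementation, which hasn't been disclosed; the 4KB chunk size, the SHA-256 fingerprints, and the function names are all assumptions made for the example.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each unique chunk only once."""
    store = {}  # fingerprint -> unique chunk contents
    refs = []   # one fingerprint per logical chunk, in write order
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).digest()
        store.setdefault(key, chunk)  # store the chunk only if it is new
        refs.append(key)              # always record a reference
    return store, refs

def dedup_read(store, refs) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store[key] for key in refs)

# Four logical chunks, but only two distinct ones.
data = b"A" * 4096 * 3 + b"B" * 4096
store, refs = dedup_store(data)
print(len(refs), "chunks written,", len(store), "chunks stored")
```

Reading the data back through `dedup_read(store, refs)` returns the original bytes, which is the whole point: fewer physical writes, same logical contents.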

Straight up data compression is another possibility. The idea behind lossless compression is to use fewer bits to represent a larger set of bits. There’s additional processing required to recover the original data, but with a fast enough processor (or dedicated logic) that part can be negligible.
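As a rough illustration of lossless compression, here is a round trip using Python's standard `zlib` module. It is only a stand-in for the idea; nothing here reflects the algorithm SandForce actually uses, and the sample data is made up.

```python
import zlib

# Highly repetitive sample data, standing in for compressible host writes.
original = b"the quick brown fox jumps over the lazy dog\n" * 1000

compressed = zlib.compress(original)    # fewer bits representing more bits
restored = zlib.decompress(compressed)  # the extra processing on the way back

print(len(original), "bytes in,", len(compressed), "bytes stored")
print("round trip intact:", restored == original)
```

The decompression step is the additional processing mentioned above; on a controller it would be handled by a fast processor or dedicated logic.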

Assuming this is how SandForce works, it means that there’s a ton of complexity in the controller and firmware, much more than what even a good SSD controller needs to deal with. Not only does SandForce have to manage bad blocks, block cleaning/recycling, LBA mapping and wear leveling, but it also needs to manage this tricky write optimization algorithm. It’s not a trivial matter: SandForce must ensure that the data remains intact while tossing away nearly half of it. After all, the primary goal of storage is to store data.

The whole write-less philosophy has tremendous implications for SSD performance. The less you write, the less you have to worry about garbage collection/cleaning and the less you have to worry about write amplification. This is how the SF controllers get by without any external DRAM; there’s just no need. There are fairly large buffers on chip though, most likely on the order of a couple of MB (more on this later).

Manufacturers are rarely honest enough to tell you the downsides to their technologies. Representing a collection of bits with fewer bits works well if you have highly compressible data or a ton of duplicates. Data that is already well compressed, however, shouldn’t work so nicely with the DuraWrite engine. That means compressed images, videos or file archives will most likely exhibit higher write amplification than SandForce’s claimed 0.5x. Presumably that’s not the majority of writes your SSD will see on a day to day basis, but it’s going to be some portion of it.
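The downside is easy to demonstrate with a generic compressor. In this sketch `zlib` stands in for whatever DuraWrite does internally (an assumption), random bytes stand in for already-compressed files such as videos or archives, and `stored_fraction` is a made-up helper name.

```python
import os
import zlib

def stored_fraction(data: bytes) -> float:
    """Fraction of the original size that remains after compression."""
    return len(zlib.compress(data)) / len(data)

# Repetitive, log-style data: the kind of write that compresses well.
text_like = b"log entry: file opened OK\n" * 40000

# Random bytes behave like already-compressed media: nothing left to squeeze.
random_like = os.urandom(len(text_like))

print(f"repetitive data stored at {stored_fraction(text_like):.3f} of its size")
print(f"random data stored at     {stored_fraction(random_like):.3f} of its size")
```

The repetitive data shrinks to a tiny fraction of its size, while the random data comes out no smaller than it went in. A drive that depends on data reduction would see its advantage disappear on that second kind of write.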

100 Comments

  • semo - Saturday, January 2, 2010 - link

    Anand,

    After reading your very informative SSD articles, I still found something new from GullLars. I think it would be useful to include the queue length when stating IOPS figures, as it will give us more technical insight into the inner workings of the different SSD models and give hints about performance for future uses.

    When dial up was the most common way of connecting to the internet, most sites were small with static content. As connection and CPU speeds grew, so did the websites. Try going to a big ugly site like cnet with a 7-8 year old pc with even the fastest internet connection. I'm sure that all this supposed untapped performance in SSDs will be quickly utilized in future (probably because of inefficient software in most cases rather than for legit reasons). With virtualization slowly entering the consumer space (XP mode, VM unity and so on) as giant sandboxes and legacy platforms, surely disk queue lengths can only grow...
  • shawkie - Saturday, January 2, 2010 - link

    Anand,

    I agree that it's also helpful to know what the hardware can really do. It seems to me that longer queue depths are becoming important for high performance on all storage devices (even hard disks have NCQ and can be put in RAID arrays). At some point software manufacturers are going to wake up to that fact. This is just like the situation with multi-core CPUs. I'm fortunate because in my work I not only select the hardware platform but also develop the software to run on it.
  • DominionSeraph - Monday, January 4, 2010 - link

    A jumble of numbers that don't apply to the scenario at hand is nothing but misleading.

    Savvio 15K.1 SAS: 416 IOPS
    1TB Caviar Black: 181.

    Ooooh... the 15k SAS is waaaay faster!! Sure, in a file server access pattern at a queue depth of 64. Try benchmarking desktop use and you'll find the 7200RPM SATA is generally faster.
  • BrightCandle - Friday, January 1, 2010 - link

    With which software and parameters did you achieve the results you are talking about? Everything I've thrown at my X25-M has shown results in the same park as Anand's figures so I'm interested to see how you got to those numbers.
  • GullLars - Friday, January 1, 2010 - link

    These numbers have been generated by several testing methods.
    *AS SSD benchmark shows 4KB random read and random write at Queue Depth (QD) 64, and x25-M gets in the area of 120-160MB/s on read and 65-85MB/s on write.
    *Crystal Disk Mark 3.0 (beta) tests 4KB random at both QD1 and QD32. At QD32 4KB random read, Intel x25-M gets 120-160MB/s, and at random write it gets 65-85MB/s here too.
    Here's a screenshot of CDM 2.2 and 3.0 of x25-M 80GB on 750SB with AHCI in fresh state: http://www.diskusjon.no/index.php?act=attach&t...
    *Testing with IOmeter, parameters: 2GB length, 30 sec runtime, 1 worker, 32 outstanding IOs (QD), 100% read, 100% random, 4KB blocks, burst length 1. On a forum I frequent, most users with x25-M get between 30,000 and 40,000 IOPS with these parameters. With the same parameters but 100% write, the norm is around 15K IOPS on a fresh drive, and a bit closer to 10K in used state with the OS running from the drive. x25-E has been benched to 43K random write 4KB IOPS.

    Regarding the practical difference 4KB IOPS makes, the biggest difference can be seen in the PCMark Vantage test Application Launching. Such workloads involve reading a massive number of small files and database listings, plus logging all the file access this creates. Prefetch and superfetch may help storage units with less than a few thousand IOPS, but x25-M in many cases actually gets worse launch times with these activated. Using a RAM disk for known targets of small random writes makes sense, and I've put my browser cache and temp files on a RAM disk even though I have an SSD.
    With x25-M's insane IOPS performance, the random part of most workloads is done within a second, and what you are left waiting for is the loading of larger files and the CPU. Attempting to lower the load time of small random reads during an application launch from, say, 0.5 sec by running a superfetch script or read-caching with a RAM disk makes little sense.
  • Zool - Friday, January 1, 2010 - link

    For an average user, 4KB random performance numbers are the most useless results out there. If a user encounters that many random 4KB reads/writes, then he needs to change the operating system ASAP.
    And if something really needs to randomly read/write 4KB files, then your best bet is to cache it in RAM or make a RAM disk, I think.
  • LTG - Thursday, December 31, 2009 - link

    This statement seems really dubious - Isn't it in fact the opposite?

    The majority of storage space is taken up by things that don't compress well: Music, Videos, Photos, Zip style archives...

    Everything else is smaller.


    Anand Says:
    ==========================
    That means compressed images, videos or file archives will most likely exhibit higher write amplification than SandForce’s claimed 0.5x. Presumably that’s not the majority of writes your SSD will see on a day to day basis, but it’s going to be some portion of it.
  • DominionSeraph - Friday, January 1, 2010 - link

    That stuff just gets written once.
    Day-to-day operations sees a whole lot of transient data.
  • Shining Arcanine - Thursday, December 31, 2009 - link

    As someone else suggested, I imagine that the SATA driver could take all of the data written/read to the drive and transparently implement the algorithms on the much more powerful CPU.

    Is there anything to stop people from reverse engineering the firmware to figure out exactly what the drive is doing in terms of compression and then externalizing it to the SATA driver, so other SSDs can benefit from it as well? I.e., are there any legal issues with this?
  • Anand Lal Shimpi - Friday, January 1, 2010 - link

    Patents :) SandForce holds a few of them with regards to this technology.

    Obviously that's up to the courts to determine if they are enforceable or not, SandForce believes they are. Other companies could license the technology though...

    Take care,
    Anand