RAPID: PCIe-like Performance from a SATA SSD

The software story around Samsung's SSD 840 EVO is quite possibly the strongest we've ever seen from an SSD manufacturer. Samsung's SSD Magician got a major update not too long ago, giving it a downright awesome UI. Magician gives you access to SMART details about your drive and provides decent visualization of things like total host writes. I'd love to see total NAND writes reported somewhere, as reporting host writes alone doesn't take write amplification into account and can give a false sense of security to users deploying drives into very write-intensive environments. There's a prominent drive health indicator which is tied to NAND wear and should draw a lot of attention to itself should things get bad. Samsung's SSD Magician also includes a built-in benchmark, controllable overprovisioning and secure erase functionality.

Samsung sent us a beta of the next version of its Magician software (4.2) which includes support for RAPID mode (Real-time Accelerated Processing of I/O Data). RAPID is a feature exclusive to the EVO (for now) and comes courtesy of Samsung's NVELO acquisition from last year. As NVELO focused on NAND caching software, you shouldn't be too surprised by RAPID's role in improving storage performance. Unlike traditional SSD caches that use NAND to cache mechanical storage, however, RAPID is designed to further improve the performance of an SSD rather than make an HDD more SSD-like. RAPID uses some of your system memory and CPU resources to cache hot data, serving it out of DRAM rather than your SSD.

The architecture is rather simple to understand. Enabling RAPID installs a filter driver on your Windows machine that keeps track of all reads/writes to a single EVO (RAPID only supports caching a single drive today). The filter driver looks at both file types/sizes and LBAs, but it fundamentally caches at the block level (it simply gets hints from the filesystem to determine what to cache). File types that are meaningless to cache are automatically excluded (think very large media files), but things like Outlook PST files are prime targets for caching. Since RAPID works at the block level you can cache frequently used parts of a file, rather than having to worry about a file being too big for the cache.
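To make the block-level idea concrete, here is a minimal Python sketch of a read cache keyed by LBA. It is not Samsung's implementation, which isn't public: the `BlockCache` class, the 4KB block size, the LRU eviction policy and the `fake_ssd` stand-in are all assumptions made purely for illustration.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # assumed cache granularity in bytes (hypothetical)


class BlockCache:
    """Toy LRU read cache keyed by logical block address (LBA)."""

    def __init__(self, capacity_blocks, read_from_ssd):
        self.capacity = capacity_blocks
        self.read_from_ssd = read_from_ssd  # callback: lba -> bytes
        self.blocks = OrderedDict()         # lba -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, lba):
        if lba in self.blocks:
            self.blocks.move_to_end(lba)     # mark as most recently used
            self.hits += 1
            return self.blocks[lba]
        self.misses += 1
        data = self.read_from_ssd(lba)
        self.blocks[lba] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data


def fake_ssd(lba):
    """Stand-in for the real drive: returns a block filled with one byte."""
    return bytes([lba % 256]) * BLOCK_SIZE


cache = BlockCache(capacity_blocks=3, read_from_ssd=fake_ssd)
# A large "file" spans LBAs 0..9, but only blocks 3 and 4 are hot.
for lba in [3, 4, 3, 4, 9, 3]:
    cache.read(lba)
print(cache.hits, cache.misses)  # prints: 3 3
```

Because the cache tracks individual blocks rather than whole files, the hot blocks stay resident no matter how large the file they belong to is.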

The cache resides in main memory and is allocated out of non-paged kernel memory. In fact, that's the easiest way to determine whether or not RAPID is actually working: you'll see non-paged kernel memory jump in size after about a minute of idle time on your machine.

Presently RAPID will use no more than 25% of system memory or 1GB, whichever is smaller. Both reads and writes are cached, but in different ways. The read cache works as you'd expect, while for writes RAPID more accurately does something like buffering/combining. Reads are simple to cache (just look at what addresses are frequently accessed and draw those into the cache), but writes offer a different set of challenges. If you write to DRAM first and write back to the SSD later, you run the risk of losing a ton of data in the event of a crash or power failure. Although RAPID obeys flush commands, there's always the risk that anything pending could be lost in a system crash. Recognizing this potential, Samsung tells me that RAPID instead tries to focus on combining low queue depth writes into much larger bundles of data that can be written more like large transfers across many NAND die. To test this theory I ran our 4KB random write IOmeter test at a queue depth of 1 with RAPID enabled and disabled:

Samsung SSD 840 EVO 250GB - 4K Random Write, QD1, 8GB LBA Space

                 IOPS       MB/s     Average Latency   Max Latency   CPU Utilization
RAPID Disabled   22769.31   93.26    0.0435 ms         0.7512 ms     13.81%
RAPID Enabled    73466.28   300.92   0.0135 ms         31.4259 ms    31.18%

Write coalescing seems to work extremely well here. With RAPID enabled the system sees even better random write performance than it would at a queue depth of 32. Average latency drops although the max observed latency was definitely higher. I've seen max latency peaks as high as 10ms on the EVO, so the increase in max latency is a bit less severe than what the data here indicates (but it's still large).
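As a quick sanity check on the table, the short Python sketch below converts the reported IOPS into throughput and into an approximate per-IO latency. It assumes decimal megabytes and that, at a queue depth of 1, average latency is roughly the reciprocal of IOPS. The function names are mine, not IOmeter's.

```python
BLOCK_BYTES = 4 * 1024  # 4KB transfer size used by the test

results = {
    "RAPID Disabled": 22769.31,  # IOPS from the table
    "RAPID Enabled": 73466.28,
}


def throughput_mb_s(iops, block_bytes=BLOCK_BYTES):
    """Convert IOPS to decimal megabytes per second."""
    return iops * block_bytes / 1_000_000


def qd1_latency_ms(iops):
    """With one IO in flight at a time, latency is roughly 1 / IOPS."""
    return 1000.0 / iops


for name, iops in results.items():
    print(f"{name}: {throughput_mb_s(iops):.2f} MB/s, "
          f"{qd1_latency_ms(iops):.4f} ms per IO")

speedup = results["RAPID Enabled"] / results["RAPID Disabled"]
print(f"Speedup: {speedup:.2f}x")
```

The throughput figures match the table (93.26 and 300.92 MB/s), and the estimated latencies land within a fraction of a microsecond of the measured averages, so the reported numbers are self-consistent. RAPID delivers a bit more than 3.2x the IOPS in this test.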

My test system uses a quad-core Sandy Bridge, so we're looking at an additional 60 - 70% CPU load on a single core when running an unconstrained IO workload. In real world scenarios I'd expect that impact to be much lower, but there's no getting around the fact that you're spending extra cycles on doing this DRAM caching. RAPID will revert into a pass-through mode if the CPU is already tied up doing other things. The technology is really designed to make use of excess CPU and DRAM in modern day PCs.
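The single-core figure follows from the CPU utilization column in the table above. The sketch below shows the arithmetic, assuming the reported utilization is an average across all four cores and that the extra work lands on one core. The helper name is hypothetical.

```python
CORES = 4  # quad-core Sandy Bridge test system

cpu_disabled = 13.81  # % of total CPU, RAPID disabled
cpu_enabled = 31.18   # % of total CPU, RAPID enabled


def single_core_equivalent(total_pct_delta, cores=CORES):
    """Express a whole-system utilization delta as a share of one core."""
    return total_pct_delta * cores


delta = cpu_enabled - cpu_disabled
print(f"Extra system-wide load: {delta:.2f}%")
print(f"Equivalent single-core load: {single_core_equivalent(delta):.1f}%")
```

A 17.37 point increase in total utilization on four cores works out to roughly 69% of one core, which is where the 60 - 70% estimate comes from.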

The potential performance upside is tremendous. While the EVO is ultimately limited by the performance of 6Gbps SATA, any requests serviced out of main memory are limited by the speed of your DRAM. In practice I never saw more than 4 - 5GB/s out of the cache, but that's still an order of magnitude better than what you'd get from the SSD itself. I ran a couple of tests with and without RAPID enabled to further characterize the performance gains:

Samsung SSD 840 EVO 250GB

Test                                            RAPID Disabled   RAPID Enabled   % Increase
PCMark 7 Secondary Storage Score                5414             5977            10.4%
ATSB - Heavy 2011 Workload (Avg Data Rate)      229.6 MB/s       307.7 MB/s      34.0%
ATSB - Heavy 2011 Workload (Avg Service Time)   1101.0 µs        247.0 µs
ATSB - Light 2011 Workload (Avg Data Rate)      338.3 MB/s       597.7 MB/s      75.0%
ATSB - Light 2011 Workload (Avg Service Time)   331.4 µs         145.4 µs

The gains in these tests range from only 10% in PCMark 7 to as much as 75% in our Light 2011 workload. I'm in the process of running a RAPID-enabled drive against our Destroyer benchmark to see how it fares there. In our two storage bench tests here the impact is actually mostly on the write side; average read performance actually regresses slightly in both cases. I'm not entirely sure why that is, other than that both of these tests were designed to be a bit more write intensive than normal in order to really stress the weaknesses of SSDs at the time. To make sure that reads could indeed be cached I ran ATTO at a couple of different test sizes, starting with our standard 2GB test.

ATTO makes for a great test because we can see the impact transfer size has on RAPID's caching algorithms. Here we see pretty much no improvement until transfers get larger than 32KB, indicating an optimization for caching large block sequential reads. Note that even though ATTO's test file is 2GB in size (and RAPID's cache is limited to 1GB) we're still able to see some increase in performance. At best RAPID boosts sequential read performance by 34%, driving the 250GB EVO beyond 700MB/s. Since the test file is larger than the maximum size of the cache we're ultimately limited by the performance of the EVO itself.

Writes show a different optimization point. Here we see big uplift above 4KB transfer sizes but more or less the same performance once we move to large block sequential transfers. Again this makes sense as Samsung would want to coalesce small writes into large blocks it can burst across many NAND die, but caching large sequential transfers is just risking potential data loss in the event of a crash/unexpected power loss. Here the potential uplift is even larger - nearly 60% over the RAPID-disabled configuration.
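Samsung hasn't published how RAPID combines writes, so the following is only a sketch of one plausible form of coalescing: keep the newest data for each block, then merge blocks with adjacent LBAs into larger bundles. The function name, the one-block-per-write model and the `max_bundle_blocks` limit are all assumptions.

```python
def coalesce_writes(pending, max_bundle_blocks=256):
    """Merge small writes into contiguous runs.

    pending: list of (lba, data) tuples, one block each, in arrival order.
    Returns a list of (start_lba, [data, ...]) bundles sorted by LBA.
    A later write to the same LBA replaces an earlier one.
    """
    latest = {}
    for lba, data in pending:
        latest[lba] = data  # keep only the newest data per block

    bundles = []
    for lba in sorted(latest):
        if (bundles
                and bundles[-1][0] + len(bundles[-1][1]) == lba
                and len(bundles[-1][1]) < max_bundle_blocks):
            bundles[-1][1].append(latest[lba])    # extend the current run
        else:
            bundles.append((lba, [latest[lba]]))  # start a new run
    return bundles


pending = [(10, b"a"), (12, b"c"), (11, b"b"), (40, b"x"), (10, b"A")]
for start, blocks in coalesce_writes(pending):
    print(start, len(blocks))
```

Five small writes collapse into two larger ones here: a three-block run starting at LBA 10 and a single block at LBA 40. The longer writes are held before being bundled, the more merging is possible, and the more data is exposed if the system crashes before they reach the drive.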

To see what would happen if the entire workload could fit within a 1GB cache I reduced the size of ATTO's test set to 512MB and re-ran the tests:

Oh man. Here performance just shoots through the roof. Max sequential read performance tops out at 3.8GB/s. Note that once again we don't see RAPID attempting to cache any smaller transfers; only large sequential transfers are of interest. Towards the end of the curve performance appears to regress when the transfer size exceeds 1MB. What's actually happening is RAPID's performance is exceeding the variable ATTO uses to store its instantaneous performance results. What we're seeing here is a 32-bit integer wrapping itself.

Writes see similarly insane increases in performance. Here the best performance is north of 4GB/s. Unfortunately, when the entire workload can fit in the cache, Samsung appears to relax its reluctance to cache large transfers. The focus extends beyond just small file writes and we see nearly 4GB/s when we're transferring 8MB of data at a time. We're likely also seeing the same issue where RAPID's performance is so high that it's overflowing the 32-bit integer ATTO uses to report it.
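The wraparound is easy to reproduce. ATTO's internals aren't documented here, so the sketch below assumes the result is held as an unsigned 32-bit count of bytes per second. The 4.5GB/s input is a hypothetical value, not a measurement.

```python
def wrap_uint32(value):
    """Truncate an integer to 32 unsigned bits, as a C uint32_t would."""
    return value & 0xFFFFFFFF


true_rate = 4_500_000_000          # hypothetical 4.5 GB/s in bytes/s
reported = wrap_uint32(true_rate)  # what a 32-bit counter would hold
print(reported)  # prints: 205032704
```

Under these assumptions anything above roughly 4.29GB/s wraps around to a small number, so a transfer that is actually faster shows up on the chart as a sharp drop.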

While I appreciate the tremendous increase in both read and write performance, part of me wishes that Samsung would be more conservative in buffering writes. Although the cache map is stored on the C: drive and is persistent across boots, any uncommitted (non-flushed) writes sitting in the DRAM cache run the risk of not making it to disk in the event of a crash or power loss. Samsung is quick to point out that Windows issues flush commands regularly, so the risk should be as low as possible, but you're still risking more than you would had you not deployed another DRAM cache. If you've got a stable system connected to a UPS (or a notebook on a battery) this will sound like paranoia, but it's still a concern.

If, however, you want to get PCIe-like SSD speeds without shelling out the money for a PCIe SSD, Samsung's RAPID is the closest you'll get.


137 Comments


  • Riven98 - Thursday, July 25, 2013 - link

    Anand,
    Thanks for the great article. I had just been thinking that there had been a downturn in the number of articles like these, which are the main reasons I visit on an almost daily basis.
  • chrnochime - Friday, July 26, 2013 - link

    Still recommending a technology that's known to not last as long as MLC. Yes, the *extrapolated* result indicates that its lifetime is far longer than advertised, but really, when even the M500 is not that slow in the first place and costs about the same, why risk going with TLC? Not to mention Samsung's 830 has its fair share of horror stories as well...
  • watersb - Friday, July 26, 2013 - link

    Excellent review.

    How does write amplification scale as the disk fills up? Wouldn't a full disk fail more rapidly than a half-full one?
  • BobAjob2000 - Tuesday, January 28, 2014 - link

    Hopefully wear leveling and TRIM/garbage collection algorithms should take care of your concerns. They should take existing unchanged 'cold' data and move it around to make way for regularly changed 'hot' data. This should reduce the impact on both data longevity and write amplification, as it guides new writes to hit the 'freshest' unused or rarely written blocks on the disk and also helps to ensure that data does not go 'stale' after being untouched for years. Different vendors use different algorithms that have evolved and improved over time. I think Samsung (being a RAM manufacturer) can possibly provide better RAM caches for their disks that may provide advantages for garbage collection and wear leveling algorithms by improving the available 'thinking space' for the caching and sorting/organizing of 'hot' data.
    It's all to do with managing the 'temperature' of your data, somewhat like a data 'weather forecast', which can be very useful in the short term or for simple predictable/settled patterns but less practical for long term or unseasonal data storms.
    Would like to see these things tested by 'what if' scenarios though, to demonstrate the differences between different vendors' algorithms.
  • xtreme2k - Friday, July 26, 2013 - link

    Can anyone tell me why I am paying 90% of the price for 33% of the endurance of a drive?
  • MrSpadge - Saturday, July 27, 2013 - link

    Because endurance doesn't matter (very likely also for you), but price does.
  • log - Friday, July 26, 2013 - link

    Can you partition this drive and still take advantage of its features? Thanks
  • Timur Born - Friday, July 26, 2013 - link

    I don't quite understand exactly why the Samsung RAPID software cache brings higher performance in *practice* than Windows' own cache? Using two software caches will lead to the same information being stored in RAM twice or even thrice, which is exactly what the Windows cache tries to avoid since XP days.

    That the usual benchmark programs get fooled is visible, as they think they are working without a software cache. So the higher values there are not surprising. But I am a bit puzzled why the Anand Storage Bench results increase, too?! Why is RAPID software caching better than Windows' own cache in this scenario? Or does the ASB bypass Windows' cache, too (like most benchmarks)?

    By the way: ATTO allows the Windows cache to be turned ON for testing. My "old" Crucial M4 256 sees very high read results once ATTO makes use of Windows' cache. Only the write rates remain significantly smaller.

    Therefore an ATTO test with combinations of either or both software caches (RAPID and Windows) would be interesting.
  • MrSpadge - Saturday, July 27, 2013 - link

    I think it's because Samsung is being much more aggressive with caching than Win dares to be, i.e. it holds files far longer before writing them, so they can be combined more efficiently but are longer at risk of being lost.
  • Timur Born - Sunday, July 28, 2013 - link

    I am not convinced about that yet, especially since you can turn off drive cache flushing via Device-Manager and thus should get an even more aggressive Windows cache behavior than what RAPID offers (which is reported to adhere to Windows' flush commands).

    The Windows cache is designed to keep data in RAM for as long as it's not needed for something else. Even more important, data is *directly* executed from inside the Windows cache instead of being copied back and forth between separate memory regions. This keeps duplication to a minimum (implemented since XP as far as I remember). So at least for reads the Windows cache is very useful, especially in combination with Superfetch, which is *not* disabled for SSDs btw (even Prefetch for the boot phase isn't disabled, but in practice it makes not much of a difference whether you boot with or without Prefetch from an SSD).

    There is something funky going on with Windows' cache and the drive's onboard cache of my Crucial M4 in combination with ATTO (Windows cache enabled). Different block sizes get very different results, with some *larger* block sizes not benefiting from Windows' cache either at read or write, the latter depending on the block size chosen. Turning the drive's own cache flushing on/off via Device-Manager can have an impact on that, too.

    In some cases I get less throughput with Windows cache than without (i.e. 512 kb block size with drive flushing on). This may be an issue of ATTO, though, because I also got some measurements where ATTO claimed a write speed of zero (0)! Turning off either drive cache flushing or the Windows cache or both helps ATTO to get meaningful measurements again.

    So the main question remains: How and why would RAPID affect "real-world" performance on top of the Windows cache and does the Anand Storage Bench deliberately circumvent the Windows cache?

    The reason I was looking at this review was that I am currently looking for a new SSD to build a desktop PC and the 840 EVO looks like the thing to buy. So once I get my hands on one myself I will just try RAPID myself. ;)
