Test Setup

For this year's enterprise SSD reviews, we've overhauled our test suite. The overall structure of our tests is the same, but a lot has changed under the hood. We're using newer versions of our benchmarking tools and the latest long-term support kernel branch. The tests have been reconfigured to drastically reduce CPU overhead, which has minimal impact on SATA drives but lets us properly push the limits of the many enterprise NVMe drives for the first time.

The general philosophy underlying the test configuration was to keep everything at its default or most reasonable everyday settings, and to change as little as possible while still allowing us to measure the full performance of the SSDs. Esoteric kernel and driver options that could marginally improve performance were ignored. The biggest change from last year's configuration, and the biggest departure from normal everyday usage, is in the IO APIs that the fio benchmarking tool uses to interact with the operating system.

In the past, we configured fio to use ordinary synchronous IO APIs: read() and write() style system calls. The way these work is that the application makes a system call to perform a read or write operation, and control transfers to the kernel to handle the IO. The application thread is suspended until that IO is complete. This means we can only have one outstanding IO request per thread, and hitting a drive with a queue depth of 32 requires 32 threads. That's no problem on a 36-core test system, but when it takes a queue depth of 200 or more to saturate a high-end NVMe SSD, we run out of CPU power. Running more threads than cores can get us a bit more throughput than just QD36, but that causes latency to suffer not just from the overhead of each system call, but from threads fighting over the limited number of CPU cores. In practice, this testbed is limited to about 560k IOPS when performing IO this way, and that leaves no CPU time for doing anything useful with the data that's moving around. Spectre, Meltdown and other vulnerability mitigations tend to keep increasing system call and context switch overhead, so this situation isn't getting any better.
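
To make that concrete, here's a minimal sketch (not our actual test code) of what the synchronous approach boils down to: one blocking read at a time per thread, so a queue depth of 32 means 32 copies of this loop running on 32 threads. The device path, block size, and IO count are placeholder assumptions.

    /* Sketch of synchronous IO: the thread blocks inside every system call,
     * so it never has more than one request outstanding (queue depth 1).
     * Device path, block size, and IO count are illustrative placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096
    #define NUM_IOS    1024

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK_SIZE)) return 1; /* O_DIRECT needs aligned buffers */

        for (int i = 0; i < NUM_IOS; i++) {
            /* Control transfers to the kernel here and the thread is suspended
             * until the drive completes the read. */
            if (pread(fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE) != BLOCK_SIZE) {
                perror("pread");
                break;
            }
        }

        free(buf);
        close(fd);
        return 0;
    }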

The alternative is to use asynchronous storage APIs that allow an application thread to submit an IO request to the operating system but then continue executing while the IO is performed. For benchmarking purposes, that continued execution means the application can keep submitting more IO requests before the first one is complete, and a single thread can load down a SSD with a reasonably high queue depth.

Asynchronous IO presents challenges, especially on Linux. On any platform, asynchronous IO is a bit more complicated for the application programmer to deal with, because submitting a request and getting the result become separate steps, and operations may complete out of order. On Linux specifically, the original async IO APIs were fraught with limitations. The most significant is that Linux native AIO is only actually asynchronous when IO is set to bypass the operating system's caches, which is the opposite of what most real-world software should want. (Our benchmarking tools have to bypass the caches to ensure we're measuring the SSD and not the testbed's 192GB of RAM.) Other AIO limitations include support for only one filesystem, and myriad scenarios in which IO silently falls back to being synchronous, unexpectedly halting the application thread that submitted the request. The end result of all those issues is that true asynchronous IO on Linux is quite rare and only usable by some applications with dedicated programmers and competent sysadmins. Benchmarking with Linux AIO makes it possible to stress even the fastest SSD, but such a benchmark can never be representative of how mainstream software does IO.
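
For comparison, here's a hedged sketch of that older Linux-native AIO interface (libaio): requests are submitted with io_submit() and reaped later with io_getevents(), but the file has to be opened with O_DIRECT for the submission to actually be asynchronous. The device path and queue depth are placeholders; link with -laio.

    /* Sketch of Linux native AIO via libaio: submission and completion are separate
     * steps, but only O_DIRECT IO is reliably asynchronous. Placeholders throughout. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define QD         32
    #define BLOCK_SIZE 4096

    int main(void)
    {
        /* Buffered IO (no O_DIRECT) would silently fall back to synchronous behavior. */
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx;
        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(QD, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        struct iocb iocbs[QD];
        struct iocb *ptrs[QD];
        for (int i = 0; i < QD; i++) {
            void *buf;
            if (posix_memalign(&buf, 4096, BLOCK_SIZE)) return 1;
            io_prep_pread(&iocbs[i], fd, buf, BLOCK_SIZE, (long long)i * BLOCK_SIZE);
            ptrs[i] = &iocbs[i];
        }

        /* Submit all QD requests in one call; this thread keeps running afterwards. */
        if (io_submit(ctx, QD, ptrs) != QD) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* Reap the completions, which may arrive in any order. */
        struct io_event events[QD];
        int done = io_getevents(ctx, QD, QD, events, NULL);
        printf("completed %d IOs\n", done);

        io_destroy(ctx);
        close(fd);
        return 0;
    }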

The best way to set storage benchmark records is to get the operating system kernel out of the way entirely using a userspace IO framework like SPDK. This eliminates virtually all system call overhead and makes truly asynchronous IO possible and fast. It also eliminates the filesystem and the operating system's caching infrastructure and makes those the application's responsibility. Sharing a SSD between applications becomes almost impossible, and at the very least requires rewriting both applications to use SPDK and overtly cooperate in how they use the drive. SPDK works well for use cases where a heavily customized application stack and system configuration are possible, but it is no more capable of becoming a mainstream solution than Linux AIO.

A New Hope

What's changed recently is that Linux kernel developer (and fio author) Jens Axboe introduced a new asynchronous IO API that's easy to use and very fast. Axboe has documented the rationale behind the new API and how to use it. In summary, the core principle is that communication between the kernel and userspace software takes place through a pair of ring buffers, so the API is called io_uring. One ring buffer is the IO submission queue: the application writes requests into this buffer, and the kernel reads them to act on. The other is the completion queue, where the kernel writes notifications of completed IOs, which the application watches for. This dual-queue structure is basically the same as how the operating system communicates with NVMe devices.

For io_uring, both queues are mapped into the memory address spaces of both the application and the kernel, so no copying of data is required. The application doesn't need to make any system calls to check for completed IO; it just needs to inspect the contents of the completion ring. Submitting IO requests involves putting the request in the submission queue, then making a system call to notify the kernel that the queue isn't empty. There's also an option to have the kernel keep polling the submission queue on its own, as long as the queue doesn't stay idle for too long. When that mode is used, a large number of IOs can be handled with an average of approximately zero system calls per IO. Even without it, io_uring allows IO to be done with one system call per IO, compared to two per IO with the old Linux AIO API.
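
Here's a minimal sketch of the same kind of read loop through io_uring, using the liburing helper library (which hides the raw ring setup). This isn't our fio configuration, just an illustration: the device path and queue depth are placeholders, and it sticks to the readv opcode that has been available since kernel 5.1. Link with -luring.

    /* Sketch of io_uring via liburing: fill submission queue entries, notify the
     * kernel with one system call, then read completions out of the shared ring.
     * Placeholders throughout. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define QD         32
    #define BLOCK_SIZE 4096

    int main(void)
    {
        struct io_uring ring;
        /* Passing IORING_SETUP_SQPOLL instead of 0 would ask the kernel to poll the
         * submission queue itself, removing even the submit syscall while it stays busy. */
        if (io_uring_queue_init(QD, &ring, 0) < 0) { fprintf(stderr, "queue_init failed\n"); return 1; }

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* One aligned buffer/iovec per in-flight request. */
        struct iovec iov[QD];
        for (int i = 0; i < QD; i++) {
            if (posix_memalign(&iov[i].iov_base, 4096, BLOCK_SIZE)) return 1;
            iov[i].iov_len = BLOCK_SIZE;
        }

        /* Write QD requests into the submission ring, then one syscall to kick the kernel. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_readv(sqe, fd, &iov[i], 1, (off_t)i * BLOCK_SIZE);
        }
        io_uring_submit(&ring);

        /* Harvest completions from the shared completion ring; io_uring_wait_cqe only
         * enters the kernel if nothing has completed yet. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
            if (cqe->res < 0)
                fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }

fio's io_uring engine drives this same interface, just with far more requests in flight per thread.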

Using synchronous IO, our enterprise SSD testbed cannot reach 600k IOPS. With io_uring, we can do more than 400k IOPS on a single CPU core without any extra performance tuning effort. Hitting 1M IOPS on a real SSD takes at most 4 CPU cores, so even the Micron X100 and upcoming Intel Alder Stream 3D XPoint SSDs should pose no challenge to our new benchmarks.

The first stable kernel to include the io_uring API was version 5.1 released in May 2019. The first long term support (LTS) branch with io_uring is 5.4, released in November 2019 and used in this review. The io_uring API is still very new and not used by much real-world software. But unlike the situation with the old Linux AIO APIs or SPDK, this seems likely to change. It can do more than previous asynchronous IO solutions, including being used for both high-performance storage and network IO. New features are arriving with every new kernel release; lots of developers are trying it out, and I've seen feature requests fulfilled in a matter of days. Many high-level languages and frameworks that currently simulate asynchronous IO using thread pools will be able to implement new io_uring backends.

For storage benchmarking on Linux, io_uring currently strikes the best balance between the competing desires to simulate workloads in a realistic manner and to accurately gauge what kind of performance a solid state drive is capable of providing. All of the fio-based tests in our enterprise SSD test suite now use io_uring and never run more than 16 threads, even when testing queue depths up to 512. With the CPU bottlenecks eliminated, we have also disabled HyperThreading.

Enterprise SSD Test System
System Model: Intel Server R2208WFTZS
CPU: 2x Intel Xeon Gold 6154 (18C, 3.0GHz)
Motherboard: Intel S2600WFT (firmware 2.01.0009)
CPU Microcode: 0x2000065
Chipset: Intel C624
Memory: 192GB total, Micron DDR4-2666 16GB modules
Software: Linux kernel 5.4.0, fio version 3.16
Thanks to StarTech for providing a RK2236BKF 22U rack cabinet.

Comments

  • Billy Tallis - Friday, February 14, 2020

    Me, too. It's a pity that we'll probably never see the Micron X100 out in the open, but I'm hopeful about Intel Alder Stream.

    I do find it interesting how Optane doesn't even come close to offering the highest throughput (sequential reads or writes or random reads), but its performance varies so little with workload that it excels in all the corner cases where flash fails.
  • curufinwewins - Friday, February 14, 2020

    Absolutely. It's so completely counter to the reliance on massive parallelization and over-provisioning/cache to hide the inherent weaknesses of flash that I just can't help but be excited about what is actually possible with it.
  • extide - Friday, February 14, 2020

    And honestly most of those corner cases are far more important/common in real world workloads. Mixed read/write, and low QD random reads are hugely important and in those two metrics it annihilates the rest of the drives.
  • PandaBear - Friday, February 14, 2020

    Throughput has a lot to do with how many dies you can run in parallel, and since Optane has a much lower density (therefore more expensive and lower capacity), they don't have as many dies on the same drive, and that's why peak throughput will not be similar to the monsters out there with 128-256 dies on the same drive. They make it back in other specs of course, and therefore demand a premium for that.
  • swarm3d - Monday, February 17, 2020

    Sequential read/write speed is highly overrated. Random reads and writes make up the majority of a typical workload for most people, though sequential reads will benefit things like game load times and possibly video edit rendering (if processing isn't a bottleneck, which it usually is).

    Put another way, if sequential read/write speed was important, tape drives would probably be the dominant storage tech by now.
  • PandaBear - Friday, February 14, 2020

    Some info from the industry is that AWS is internally designing their own SSD, and the 2nd generation is based on the same Zao architecture and 96-layer Kioxia NAND that DapuStor makes. For this reason it is likely that it will be a baseline benchmark for most ESSDs out there (i.e. you have to be better than that or we can make it cheaper). Samsung is always going to be the powerhouse because they can afford to make a massive controller with so much more circuitry than would be affordable for others. SK Hynix's strategy is to make an expensive controller so they can make money back from the NAND. DERA and DapuStor will likely only focus on China and Africa like their Huawei pal. Micron has a bad reputation as an ESSD vendor and they ended up firing their whole Tidal Systems team after Sanjay joined, and Sanjay poached a bunch of WD/SanDisk people to rebuild the whole group from the ground up.
  • eek2121 - Friday, February 14, 2020

    I wish higher capacity SSDs were available for consumers. Yes, we're only a small minority, but I would gladly purchase a high performance 16TB SSD.

    I suspect the M.2 form factor is imperfect for high density solid state storage, however. Between heat issues (my 2 TB 970 EVO has hit 88C in rare cases...with a heatsink. My other 960 EVO without a heatsink has gotten even hotter.) and the lack of physical space for NAND, we will likely have to come up with another solution if capacities are to go up.
  • Billy Tallis - Friday, February 14, 2020

    Going beyond M.2 for the sake of higher capacity consumer storage would only happen if it becomes significantly cheaper to make SSDs with more than 64 NAND dies, which is currently 4TB for TLC. Per-die capacity is going up slowly over time, but fast enough to keep up with consumer storage needs. In order for the consumer market to shift toward drives with way more than 64 NAND dies, we would need to see per-wafer costs drop dramatically, and that's just not going to happen.
  • Hul8 - Saturday, February 15, 2020

    I think the number of consumers both interested in 6GB+ *and* able to afford them are so few, SSD manufacturers figure they can just go buy enterprise stuff.
  • Hul8 - Saturday, February 15, 2020

    *6TB+, obviously... :-D
