Peak Throughput

For client/consumer SSDs we primarily focus on low queue depth performance for its relevance to interactive workloads. Server workloads are often intense enough to keep a pile of drives busy, so the maximum attainable throughput of enterprise SSDs is actually important. But it usually isn't a good idea to focus solely on throughput while ignoring latency, because somewhere down the line there's always an end user waiting for the server to respond.

In order to characterize the maximum throughput an SSD can reach, we need to test at a range of queue depths. Different drives will reach their full speed at different queue depths, and increasing the queue depth beyond that saturation point may be slightly detrimental to throughput, and will drastically and unnecessarily increase latency. Because of that, we are not going to compare drives at a single fixed queue depth. Instead, each drive was tested at a range of queue depths up to the excessively high QD 512. For each drive, the queue depth with the highest performance was identified. Rather than report that value, we're reporting the throughput, latency, and power efficiency for the lowest queue depth that provides at least 95% of the highest obtainable performance. This often yields much more reasonable latency numbers, and is representative of how a reasonable operating system's IO scheduler should behave. (Our tests have to be run with any such scheduler disabled, or we would not get the queue depths we ask for.)
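To make that selection rule concrete, here's a minimal Python sketch of how the reporting queue depth could be picked from a sweep; the throughput numbers are made up for illustration and are not measurements from this review.

```python
# Hypothetical sketch of the reporting rule: pick the lowest queue depth that
# delivers at least 95% of the best throughput seen anywhere in the QD sweep.
def pick_reporting_qd(sweep, threshold=0.95):
    """sweep: dict mapping queue depth -> measured IOPS."""
    best_iops = max(sweep.values())
    for qd in sorted(sweep):                      # lowest queue depth first
        if sweep[qd] >= threshold * best_iops:
            return qd, sweep[qd]

# Illustrative numbers only:
sweep = {1: 12_000, 2: 23_000, 4: 45_000, 8: 88_000, 16: 170_000,
         32: 330_000, 64: 610_000, 128: 740_000, 256: 755_000, 512: 750_000}
print(pick_reporting_qd(sweep))   # (128, 740000): QD128 clears 95% of the 755k peak
```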

Unlike last year's enterprise SSD reviews, we're now using the new io_uring asynchronous IO APIs on Linux instead of the simpler synchronous APIs that limit software to one outstanding IO per thread. This means we can hit high queue depths without loading down the system with more threads than we have physical CPU cores, which leads to much better latency metrics. The impact on SATA drives is minimal, since they are limited to QD32. Our new test suite uses up to 16 threads to issue IO.
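We aren't reproducing the actual fio configuration here, but conceptually the target queue depth gets split across those worker threads; a rough sketch of one possible split (an assumption for illustration, not the suite's code):

```python
# Hypothetical helper: spread a target total queue depth across at most 16
# worker threads, e.g. QD512 becomes 16 threads each keeping 32 IOs in flight.
def split_queue_depth(total_qd, max_threads=16):
    threads = min(total_qd, max_threads)
    base, extra = divmod(total_qd, threads)
    # the first `extra` threads carry one additional outstanding IO
    return [base + 1] * extra + [base] * (threads - extra)

print(split_queue_depth(512))   # [32, 32, ..., 32] -> 16 threads x iodepth 32
print(split_queue_depth(6))     # [1, 1, 1, 1, 1, 1] -> 6 threads x iodepth 1
```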

Peak Random Read Performance

4kB Random Read

With the CPU bottleneck removed, our new test suite gives a healthy boost to the peak random read scores of most of these drives. The two SSDs with a PCIe x8 interface stand out: both can hit over 1M IOPS at a sufficiently high queue depth, though the scores shown here are for somewhat lower queue depths where latency is more reasonable. Even so, it still takes very high queue depths to get within a few percent of 1M IOPS: QD192 for the Samsung PM1725a and QD384 for the Memblaze PBlaze5 C916.

The U.2 drives are all limited to PCIe 3.0 x4 speeds, and the best random read performance among them comes from the DapuStor Haishen3 H3000 at 751k IOPS, closely followed by the other DapuStor drive and all four of the DERA SSDs. The SK hynix PE6011 is the slowest NVMe model here, with its 8TB version coming up just short of 600k IOPS. The Intel Optane SSD's relative standing is actually harmed significantly by this year's test suite upgrade: even under last year's suite, the drive itself was as much of a bottleneck as the CPU, so reducing the CPU overhead has allowed many of the flash-based SSDs to pull ahead of the Optane SSD for random read throughput.

4kB Random Read (Power Efficiency)
[Chart: power efficiency in kIOPS/W; average power in W]

Now that we're letting the drives run at high queue depths, the big 16-channel controllers aren't automatically at a disadvantage for power efficiency. Those drives are still drawing much more power (13-14W for the DERA and Memblaze, almost 20W for the Samsung PM1725a), but they can deliver a lot of performance as a result. The drives with 8-channel controllers are mostly operating around 7W, though the 7.68TB SK hynix PE6011 pushes that up to 10W.

Putting that all in terms of performance per Watt, the DapuStor Haishen3 drives score another clear win on efficiency. Second and third place are taken by the Samsung 983 DCT and Memblaze PBlaze5 C916, two drives at opposite ends of the power consumption spectrum. After that the scores are fairly tightly clustered, with smaller capacity models generally delivering better performance per Watt: even the 2TB class drives get pretty close to saturating the PCIe 3.0 x4 link, and they don't need as much power as their 8TB siblings.
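For reference, the efficiency metric in these charts is simply throughput divided by average power; a trivial sketch with made-up numbers:

```python
# Power efficiency in kIOPS/W = (IOPS / 1000) / average power in watts.
# The figures below are placeholders, not measurements from this review.
def kiops_per_watt(iops, watts):
    return (iops / 1_000) / watts

print(kiops_per_watt(750_000, 7.0))    # ~107 kIOPS/W for a hypothetical 7 W drive
print(kiops_per_watt(1_000_000, 20.0)) # 50 kIOPS/W for a hypothetical 20 W drive
```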

 

For latency scores, we're no longer going to look at just the mean and tail latencies at whatever queue depth gives peak throughput. Instead, we've run a separate test that submits IO requests at fixed rates, rather than at fixed queue depths. This is a more realistic way of looking at latency under load, because in the real world user requests don't stop arriving just because your backlog hits 32 or 256 IOs. This test starts at a mere 5k IOPS and steps up at 5k increments up to 100k IOPS, and then at 10k increments the rest of the way up to the throughput limit of these drives. That's a lot of data points per drive, so each IO rate is only tested for 64GB of random reads and that leads to the tail latency scores being a bit noisy.
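The schedule of tested IO rates described above can be written down compactly; here's a small Python sketch of that step pattern (the structure matches the description, but it's not the suite's actual code):

```python
# Fixed IO rates tested: 5k IOPS steps up to 100k, then 10k IOPS steps the
# rest of the way up to (roughly) the drive's throughput limit.
def rate_schedule(max_iops):
    rates = list(range(5_000, 100_001, 5_000))            # 5k .. 100k
    rates += list(range(110_000, max_iops + 1, 10_000))   # 110k .. max
    return rates

sched = rate_schedule(750_000)           # e.g. a drive that tops out near 750k IOPS
print(len(sched), sched[:3], sched[-1])  # 85 steps: [5000, 10000, 15000] ... 750000
```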

[Latency charts: mean, median, 99th, 99.9th, and 99.99th percentile]

For most drives, the mean and median latency curves show pretty much what we expect: moderate latency increases through most of the performance range, and a steep spike as the drive approaches saturation. When looking at the 99th and higher percentiles, things get more interesting. Quite a few drives end up with high tail latency long before reaching their throughput limit, especially the ones with the highest capacities. This leaves the DapuStor Haishen3 SSDs (1.6 and 2 TB) with the best QoS scores from roughly 550k IOPS (where the Optane SSD drops out) up to their limit around 750k IOPS. The Memblaze PBlaze5 and Samsung PM1725a may both be able to get up to 1M IOPS, but by about 600k IOPS their 99th percentile read latency is already closing in on 10ms. The Intel, SK hynix and DERA 8TB class drives also show 99th percentile latency spiking by the time they reach 400k IOPS, even though all three can handle throughput up to at least ~600k IOPS.

When going beyond the 99th percentile, most of the differences between drives get lost in the noise, but a few drives are still clearly identifiable losers: the SK hynix PE6011 7.68TB and Intel P4510 8TB, with 10-20ms tail latencies that show up even at relatively low throughput.
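For readers less familiar with these metrics, the percentile scores are simply read off the sorted distribution of per-IO completion latencies; a minimal sketch with synthetic data (illustrative only, not the review's actual post-processing):

```python
import random

# Synthetic per-IO completion latencies in microseconds, roughly log-normal.
random.seed(0)
latencies_us = [random.lognormvariate(4.5, 0.6) for _ in range(100_000)]

def percentile(samples, pct):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * pct / 100))]

mean = sum(latencies_us) / len(latencies_us)
print(f"mean={mean:.1f}us  p99={percentile(latencies_us, 99):.1f}us  "
      f"p99.99={percentile(latencies_us, 99.99):.1f}us")
```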

Peak Sequential Read Performance

Rather than simply increase the queue depth of a single benchmark thread, our sequential read and write tests first scale up the number of threads performing IO, up to 16 threads each working on different areas of the drive. This more accurately simulates serving up different files to multiple users, but it reduces the effectiveness of any prefetching the drive is doing.
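A rough sketch of that layout, assuming the drive is simply divided into equal slices (an illustration, not the suite's exact parameters):

```python
# Hypothetical layout for the multi-threaded sequential tests: each of up to
# 16 threads streams through its own contiguous slice of the drive, which is
# what reduces the benefit of any drive-side prefetching.
def thread_regions(drive_bytes, threads=16):
    slice_bytes = drive_bytes // threads
    return [(i * slice_bytes, (i + 1) * slice_bytes) for i in range(threads)]

# e.g. a 7.68 TB drive split into 16 regions of 480 GB each
regions = thread_regions(7_680_000_000_000)
print(len(regions), regions[0], regions[1])
```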

128kB Sequential Read

The two PCIe x8 drives stand out on the sequential read test; the Samsung PM1725a at 6GB/s is quite a bit faster than the Memblaze's 4.3GB/s. The U.2 drives all perform fairly similarly, at or just below 3GB/s. Many of them are rated closer to 3.2-3.5GB/s, but our test uses multiple threads reading sequentially at moderate queue depths rather than one thread at a high queue depth, so the SSDs don't have as much spatial locality to benefit from.

128kB Sequential Read (Power Efficiency)
[Chart: power efficiency in MB/s per W; average power in W]

With a fairly level playing field in terms of sequential read performance, it's no surprise that big disparities show up again in the power efficiency scores. The DERA SSDs at just under 12W have the worst efficiency among the NVMe drives. The Samsung PM1725a isn't much better, because even though it delivers 6GB/s, it needs over 22W to do so. The DapuStor Haishen3 SSDs are once again the most efficient, with slightly above-average performance and the lowest total power draw among the NVMe SSDs.

Steady-State Random Write Performance

Enterprise SSD write performance is conventionally reported as steady-state performance rather than peak performance. Sustained writing to a flash-based SSD usually causes performance to drop as the drive's spare area fills up and the SSD needs to spend some time on background work to clean up stale data and free up space for new writes. Conventional wisdom holds that writing several times the drive's capacity should be enough to get a drive to steady-state, because nobody actually ships SSDs with greater than 100% overprovisioning ratios. In practice things are sometimes a bit more complicated, especially for SATA drives where the host interface can be such a severe bottleneck. Real-world write performance ultimately depends not just on the current workload, but also on the recent history of how a drive has been used, and no single performance test can capture all the relevant effects.

4kB Random Write

Steady-state random write throughput is determined mostly by how much spare area a drive has: the product of its capacity and overprovisioning ratio. That's how the 1.6TB DapuStor Haishen3 H3100 (2TB raw) is able to beat the 8TB and 7.68TB models that have very slim OP ratios. It's also how the Micron 5100 MAX SATA drive is able to beat several NVMe drives. The 6.4TB drives combine high OP and high raw capacity, and take the top three spots among the flash-based SSDs. The Samsung PM1725a is the slowest of those three despite carrying the highest write endurance rating, likely because the older Samsung 48L flash it uses has worse program and erase times than the IMFT 64L flash used by the DERA and Memblaze drives. And of course, the Optane SSD performs far beyond what any of these drives can sustain, because it doesn't have to perform slow block erase operations or shuffle data around in the background to free up space.
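As a rough illustration of the spare-area math (the raw capacities here are assumptions based on typical NAND packaging, not vendor-confirmed figures):

```python
# Spare area = raw NAND capacity minus usable capacity; the overprovisioning
# ratio is usually quoted relative to the usable capacity. Raw capacities
# below are assumed (2048 GiB and 8192 GiB of NAND), purely for illustration.
def overprovisioning(raw_gb, usable_gb):
    spare_gb = raw_gb - usable_gb
    return spare_gb, 100 * spare_gb / usable_gb

print(overprovisioning(2199, 1600))   # ~599 GB spare, ~37% OP for a 1.6 TB drive
print(overprovisioning(8796, 8000))   # ~796 GB spare, ~10% OP for an 8 TB drive
```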

4kB Random Write (Power Efficiency)
[Chart: power efficiency in kIOPS/W; average power in W]

The steady-state random write test pushes each drive to its power limits. That brings the most power-hungry high-capacity 16-channel drives up to almost 20W, which is about as much as the U.2 form factor can reasonably handle. The Optane SSD and the handful of drives with high OP turn in the best efficiency scores. Among the drives with low OP and write endurance ratings around 1 DWPD, the Intel P4510 seems to score best, and the 16-channel DERA D5437 is slightly more efficient than the 8-channel SK hynix PE6011.

 

To analyze random write latency vs throughput, we run the same kind of test as for random reads: writing at a series of fixed rates rather than at fixed queue depths. These results show two probable artifacts of our test procedure that we haven't fully investigated. First, latency at the slowest IO rates is excessively high, which may be a result of how fio's completion latency measurement interacts with its rate-limiting mechanism. There's also a dip in latency right before 100k IOPS, which is where this test switches from using 8 threads to 16 threads. Threads that are relatively busy and don't spend much time sleeping seem to have noticeably better response times. It might be possible to eliminate both of these effects by playing around with scheduler and power management settings, but for this review we wanted to stick to the defaults as much as reasonably possible.

[Latency charts: mean, median, 99th, 99.9th, and 99.99th percentile]

For most of their performance range, these drives stick close to the 20-30µs mean latency we measured at QD1 (which corresponds to around 30k IOPS). The Memblaze PBlaze5 C916 is the only flash-based SSD that maintains great QoS past 100k IOPS. The other drives that make it that far (the Samsung PM1725a and the larger DERA SSDs) start to show 99th percentile latencies over 100µs. The DapuStor Haishen3 H3100 1.6TB showed great throughput when testing at fixed queue depths, but during this fixed-rate test it dropped out early with an excessive IO backlog, and the H3000 has the worst 99th percentile write scores of all the NVMe drives.

Steady-State Sequential Write Performance

As with our sequential read test, we test sequential writes with multiple threads each performing sequential writes to different areas of the drive. This is more challenging for the drive to handle, but better represents server workloads with multiple active processes and users.

128kB Sequential Write

As with random writes, the biggest drives with the most overprovisioning tend to also do best on the sequential write test. However, the Intel and Hynix 8TB drives with more modest OP ratios also perform quite well, a feat that the 8TB DERA D5437 fails to match. The DapuStor Haishen3 drives perform a bit better than other small drives: the 2TB H3000 is faster than its competitors from Samsung, Hynix and DERA, and extra OP helps the 1.6TB H3100 perform almost 50% better. However, even the H3100's performance is well below spec; most of these drives are pretty severely affected by this test's multithreaded nature.

128kB Sequential Write (Power Efficiency)
[Chart: power efficiency in MB/s per W; average power in W]

For the most part, the fast drives are also the ones with good power efficiency scores on this test. The 8TB Intel and 6.4TB Memblaze have the two best scores. The SATA drives are also quite competitive on efficiency, since they use half the power of even the low-power NVMe drives in this bunch. The low-power 2TB class drives from Hynix, Samsung and DapuStor all have similar efficiency scores, while the DERA D5437 drives, slow in spite of their 16-channel controller, turn in the worst efficiency scores.

Comments

  • James5mith - Monday, February 17, 2020 - link

    "... but I would gladly purchase a high performance 16TB SSD."

    Then do so. They aren't ridiculously priced anymore. It's $2000-$4000 per drive depending on manufacturer and interface type.

    What is stopping you?

    The Micron 9300 Pro 15.36TB is ~$3000 on average. That's a U.2 interface drive. Too slow?
  • eek2121 - Monday, February 17, 2020 - link

    The lack of an M.2 offering? I have yet to find a single 16 TB M.2 SSD available for retail purchase. I have no problem plunking down a few thousand (provided the performance is comparable to Samsung's offerings).
  • CrystalCowboy - Tuesday, February 18, 2020 - link

    Most enterprise drives come either in U.2 or in PCIe. And you can buy PCIe-U.2 adapters.
  • CrystalCowboy - Tuesday, February 18, 2020 - link

    For that matter, M.2 - U.2 adapters are available and cheap.
  • NV_Me - Friday, February 14, 2020 - link

    Thanks for all of the insights, Billy! BTW, I like the addition of the drop-down selection on top.

    For the PE6011, what is the TBW on either the 1.92TB or 7.68TB drive? I was curious to know if this was a true "1 DWPD" drive.
  • Billy Tallis - Friday, February 14, 2020 - link

    The full spec sheet for the PE6011 just says 1.0 DWPD. It doesn't list TBW.
  • NV_Me - Friday, February 14, 2020 - link

    Next time would it be possible to RANK the charts high-low or low-high for improved readability?
  • Hul8 - Saturday, February 15, 2020 - link

    If you retain the order, it's easier to compare performance of particular drives by glancing from one chart to the next. That's important with a 9-drive roundup.

    Normally when they're doing a single product review, that product is highlighted in one color, and its predecessors or alternatives in another. In that case those items can always be easily spotted in a ranked graph.
  • JohnLee-SZ - Friday, February 14, 2020 - link

    Thanks very much Billy, it's a great review! We at DapuStor are continuing to develop our whole product portfolio and hope we can deliver some great products to fulfill industry needs.
  • CrystalCowboy - Tuesday, February 18, 2020 - link

    PCIe 3.0? Are we supposed to take this seriously?
