Mixed Random Performance

Real-world storage workloads usually aren't pure reads or writes but a mix of both. Testing and graphing the full range of possible mixed I/O workloads is impractical: varying the proportion of reads to writes, sequential versus random access, and block size leads to far too many configurations. Instead, we focus on the few scenarios that vendors most commonly cite when they provide a mixed I/O performance specification at all. We tested a range of 4kB random read/write mixes at queue depth 32 (the maximum supported by SATA SSDs) and at QD128 to better stress the NVMe SSDs. This gives us a good picture of the maximum throughput these drives can sustain for mixed random I/O, but in many cases the queue depth is far higher than necessary, so we can't draw meaningful conclusions about latency from this test. The test uses 8 threads at QD32 and 16 threads at QD128, which spreads the work over many CPU cores and, for NVMe drives, also spreads the I/O across several of the drive's hardware queues.
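
For readers who want to approximate this sweep themselves, here is a rough sketch using fio; the device path, 10% mix-step granularity, and five-minute run time are illustrative assumptions rather than our exact test settings.

```python
# Rough sketch of the mixed random sweep using fio; the device path, 10% mix
# step, and five-minute run time are illustrative assumptions, not our exact
# settings. WARNING: this issues raw writes to the target device and destroys
# any data on it.
import subprocess

def run_mix(dev: str, read_pct: int, total_qd: int, threads: int, seconds: int = 300):
    """Run one 4kB random read/write mix; per-thread iodepth is total_qd / threads."""
    cmd = [
        "fio", "--name=mixed", f"--filename={dev}",
        "--ioengine=libaio", "--direct=1",
        "--rw=randrw", f"--rwmixread={read_pct}", "--bs=4k",
        f"--iodepth={total_qd // threads}", f"--numjobs={threads}",
        "--time_based", f"--runtime={seconds}", "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

# QD32 spread over 8 threads and QD128 over 16 threads, as described above.
for read_pct in range(100, -1, -10):   # 100% reads down to 100% writes
    run_mix("/dev/nvme0n1", read_pct, total_qd=32, threads=8)
    run_mix("/dev/nvme0n1", read_pct, total_qd=128, threads=16)
```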

The full range of read/write mixes is graphed below, but we'll primarily focus on the 70% read, 30% write case that is commonly quoted for mixed I/O performance specs.

[Chart: 4kB Mixed Random Read/Write throughput at Queue Depth 32 and Queue Depth 128]

A queue depth of 32 is only enough to saturate the slowest of these NVMe drives on a 70/30 mixed random workload; none of the high-end drives are being stressed enough. At QD128 we see a much wider spread of scores. The DERA and Memblaze 6.4TB drives have pulled past the Optane SSD for overall throughput, but the Samsung PM1725a can't come close to keeping up with them; its throughput is more on par with that of the DERA D5437 drives, which have relatively little overprovisioning. The high overprovisioning ratio of the DapuStor Haishen3 H3100 allows it to perform much better than any of the other drives with 8-channel controllers, and better than the Intel P4510, which has a 12-channel controller.

[Charts: 4kB Mixed Random Read/Write power efficiency (MB/s per W) and average power (W) at QD32 and QD128]

The DapuStor Haishen3 H3100 is the main standout on the power efficiency charts: at QD32 it's the only flash-based NVMe SSD that's more efficient than both of the SATA SSDs, and at QD128 it approaches the Optane SSD's efficiency score. Also at QD128, the two fastest 6.4TB drives post fairly good efficiency scores, but they still trail the Optane SSD by a wide margin: 15-18W versus 10W for similar performance.

[Charts: throughput across the full range of read/write mixes at QD32 and QD128]

Most of these drives have hit their power limit by the time the mix reaches about 30% writes. Past that point, their performance steadily declines as the workload (and thus the power budget) shifts toward slower, more power-hungry write operations. This is especially true at the higher queue depth. At QD32 things look quite different for the DERA D5457 and Memblaze PBlaze5 C916: QD32 isn't enough to get close to their full read throughput, so they actually deliver higher throughput for writes than for reads. That's not quite true of the Samsung PM1725a because its steady-state random write speed is so much slower, but it does see a slight increase in throughput toward the end of the QD32 test run as the mix approaches pure writes.

Aerospike Certification Tool

Aerospike is a high-performance NoSQL database designed for use with solid state storage. The developers of Aerospike provide the Aerospike Certification Tool (ACT), a benchmark that emulates the typical storage workload generated by the Aerospike database. This workload consists of a mix of large-block 128kB reads and writes and small 1.5kB reads. When ACT was initially released back in the early days of SATA SSDs, the baseline workload was defined as 2000 reads per second and 1000 writes per second. A drive is considered to pass the test if it meets the following latency criteria:

  • fewer than 5% of transactions exceed 1ms
  • fewer than 1% of transactions exceed 8ms
  • fewer than 0.1% of transactions exceed 64ms

Drives can be scored based on the highest throughput they can sustain while satisfying the latency QoS requirements. Scores are normalized relative to the baseline 1x workload, so a score of 50 indicates 100,000 reads per second and 50,000 writes per second. Since this test uses fixed I/O rates, the queue depths experienced by each drive depend on its latency and can fluctuate during the test run if the drive slows down temporarily for a garbage collection cycle. The test gives up early if it detects the queue depths growing excessively, or if the large-block I/O threads can't keep up with the random reads.
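
As a quick illustration of the scoring arithmetic and the pass criteria above (a sketch, not code from the ACT tool itself):

```python
# Illustration of the ACT scoring arithmetic and pass criteria; this is a
# sketch, not code from the ACT tool itself.

BASELINE_READS_PER_SEC = 2000   # the original "1x" workload
BASELINE_WRITES_PER_SEC = 1000

def target_rates(act_score: float) -> tuple[float, float]:
    """Map an ACT score multiplier to target (reads/sec, writes/sec)."""
    return (act_score * BASELINE_READS_PER_SEC,
            act_score * BASELINE_WRITES_PER_SEC)

def passes_act(pct_over_1ms: float, pct_over_8ms: float, pct_over_64ms: float) -> bool:
    """Apply the three latency QoS thresholds listed above."""
    return pct_over_1ms < 5.0 and pct_over_8ms < 1.0 and pct_over_64ms < 0.1

print(target_rates(50))           # (100000, 50000): a score of 50
print(passes_act(4.2, 0.3, 0.0))  # True: within all three thresholds
```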

We used the default settings for queue and thread counts and did not manually constrain the benchmark to a single NUMA node, so this test produced a total of 64 threads scheduled across all 72 virtual (36 physical) cores.

The usual runtime for ACT is 24 hours, which makes determining a drive's throughput limit a long process. For fast NVMe SSDs, this is far longer than necessary for the drives to reach steady state. To find the maximum rate at which a drive can pass the test, we start at an unsustainably high rate (at least 150x) and incrementally reduce it until the test can run for a full hour, then decrease the rate further if necessary to get the drive under the latency limits.
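
The search amounts to a simple downward sweep. The sketch below is illustrative only: run_act_trial() is a hypothetical stand-in that simulates a fictional drive instead of launching a real one-hour ACT run, and the 5x step size is an assumption.

```python
# Illustrative rate-search loop; run_act_trial() is a hypothetical stand-in
# that simulates a fictional drive rather than launching a real one-hour ACT
# run, and the 5x step size is an assumption.

def run_act_trial(rate_multiplier: float) -> dict:
    """Pretend to run ACT at the given rate for an hour; this fictional drive
    completes runs below 120x and meets the latency limits below 75x."""
    return {"completed": rate_multiplier < 120,
            "passed_latency": rate_multiplier < 75}

def find_max_passing_rate(start: float = 150.0, step: float = 5.0) -> float:
    """Start unsustainably high and walk the rate down until a full run
    completes and stays within the latency limits."""
    rate = start
    while rate > 0:
        result = run_act_trial(rate)
        if result["completed"] and result["passed_latency"]:
            return rate
        rate -= step
    return 0.0

print(find_max_passing_rate())  # 70.0 for the simulated drive above
```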

[Chart: Aerospike Certification Tool Score]

The strict QoS requirements of this test keep a number of these drives from scoring as well as we would expect based on their throughput in our other tests. The biggest disappointment is the Samsung PM1725a, which is barely any faster than Samsung's newer 983 DCT. The PM1725a has no problem with outliers above the 8ms or 64ms thresholds, but it cannot get 95% of reads to complete in under 1ms until the workload slows way down. This suggests it is not as good as newer SSDs at suspending writes in favor of handling a read request. The DapuStor Haishen3 SSDs also underperform relative to comparable drives, which is a surprise given that they offered pretty good QoS on some of the pure read and write tests.

The Memblaze PBlaze5 C916 is the fastest flash SSD in this bunch, but only scores 60% of what the Optane SSD gets. The DERA SSDs that also use 16-channel controllers are the next fastest, though the 8TB D5437 is substantially slower than the 4TB model.

[Charts: Aerospike ACT power efficiency and average power (W)]

Since the ACT test runs drives at the throughput where they offer good QoS rather than at their maximum throughput, the power draw from these drives isn't particularly high: the NVMe SSDs range from roughly 4-13 W. The top performers are also generally the most efficient drives on this test. Even though it is slower than expected, the DapuStor Haishen3 H3100 is the second most efficient flash SSD in this round-up, using just over half the power that the slightly faster Intel P4510 requires.

Comments

  • Billy Tallis - Friday, February 14, 2020 - link

    Me, too. It's a pity that we'll probably never see the Micron X100 out in the open, but I'm hopeful about Intel Alder Stream.

    I do find it interesting how Optane doesn't even come close to offering the highest throughput (sequential reads or writes or random reads), but its performance varies so little with workload that it excels in all the corner cases where flash fails.
  • curufinwewins - Friday, February 14, 2020 - link

    Absolutely. It's so completely counter to the reliance on massive parallelization and overprovisioning/cache to hide the inherent weaknesses of flash that I just can't help but be excited about what is actually possible with it.
  • extide - Friday, February 14, 2020 - link

    And honestly most of those corner cases are far more important/common in real world workloads. Mixed read/write and low-QD random reads are hugely important, and in those two metrics it annihilates the rest of the drives.
  • PandaBear - Friday, February 14, 2020 - link

    Throughput has a lot to do with how many dies you can run in parallel, and since Optane has a much lower density (therefore more expensive and lower capacity), they don't have as many dies on the same drive, and that's why peak throughput will not be similar to the monsters out there with 128-256 dies on the same drive. They make it back in other specs of course, and therefore demand a premium for that.
  • swarm3d - Monday, February 17, 2020 - link

    Sequential read/write speed is highly overrated. Random reads and writes make up the majority of a typical workload for most people, though sequential reads will benefit things like game load times and possibly video edit rendering (if processing isn't a bottleneck, which it usually is).

    Put another way, if sequential read/write speed was important, tape drives would probably be the dominant storage tech by now.
  • PandaBear - Friday, February 14, 2020 - link

    Some info from the industry is that AWS is internally designing their own SSD and the 2nd generation is based on the same Zao architecture and 96-layer Kioxia NAND that DapuStor makes. For this reason it is likely that it will be a baseline benchmark for most ESSDs out there (i.e. you have to be better than that or we can make it cheaper). Samsung is always going to be the powerhouse because they can afford to make a massive controller with so much more circuitry that would be too expensive for others. SK Hynix's strategy is to make an expensive controller so they can make money back from the NAND. DERA and DapuStor will likely only focus on China and Africa like their Huawei pal. Micron has a bad reputation as an ESSD vendor and they ended up firing their whole Tidal Systems team after Sanjay joined, and Sanjay poached a bunch of WD/SanDisk people to rebuild the whole group from the ground up.
  • eek2121 - Friday, February 14, 2020 - link

    I wish higher-capacity SSDs were available for consumers. Yes, we're only a small minority, but I would gladly purchase a high-performance 16TB SSD.

    I suspect the M.2 form factor is imperfect for high density solid state storage, however. Between heat issues (my 2 TB 970 EVO has hit 88C in rare cases...with a heatsink. My other 960 EVO without a heatsink has gotten even hotter.) and the lack of physical space for NAND, we will likely have to come up with another solution if capacities are to go up.
  • Billy Tallis - Friday, February 14, 2020 - link

    Going beyond M.2 for the sake of higher capacity consumer storage would only happen if it becomes significantly cheaper to make SSDs with more than 64 NAND dies, which is currently 4TB for TLC. Per-die capacity is going up slowly over time, but fast enough to keep up with consumer storage needs. In order for the consumer market to shift toward drives with way more than 64 NAND dies, we would need to see per-wafer costs drop dramatically, and that's just not going to happen.
  • Hul8 - Saturday, February 15, 2020 - link

    I think the number of consumers both interested in 6GB+ *and* able to afford them is so small, SSD manufacturers figure they can just go buy enterprise stuff.
  • Hul8 - Saturday, February 15, 2020 - link

    *6TB+, obviously... :-D
