Random Read Performance

Although sequential performance is important, a true staple of any multi-user server is an IO load that appears highly random. For our small block random read test we first fill all drives sequentially, then perform one full drive write using a random 4K pass at a queue depth of 128. We then perform a 3 minute random read run at each queue depth, plotting bandwidth and latency along the way.

Small block random read operations have inherent limits when it comes to parallelism. In the case of all of the drives here, QD1 performance ends up around 20 - 40MB/s. The P3700 manages 36.5MB/s (~8900 IOPS) compared to 27.2MB/s (~6600 IOPS) for the SATA S3700. Even at a queue depth of 8 there's only a small advantage to the P3700 from a bandwidth perspective (~77000 IOPS vs. ~58400 IOPS). Performance does scale incredibly well with increasing queue depths though. By QD16 we see the P3700 pull away, and even at a queue depth as low as 32 the P3700 delivers roughly 3.5x the performance of the S3700. There's a 70% advantage at QD32 compared to Intel's SSD 910, and that advantage grows to 135% at QD128.
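The IOPS figures quoted above follow directly from the bandwidth numbers. Here is a minimal sketch of the conversion, assuming decimal megabytes (1MB = 1,000,000 bytes) and 4,096-byte transfers. The function names are illustrative only and not part of any benchmark tool:

```python
BLOCK_SIZE_BYTES = 4096  # 4KB transfers, as used in the random read test


def mbps_to_iops(mb_per_s, block_size=BLOCK_SIZE_BYTES):
    """Convert bandwidth in decimal MB/s to IO operations per second."""
    return mb_per_s * 1_000_000 / block_size


def iops_to_mbps(iops, block_size=BLOCK_SIZE_BYTES):
    """Convert IO operations per second to bandwidth in decimal MB/s."""
    return iops * block_size / 1_000_000


# QD1 bandwidth figures quoted in the text
print(round(mbps_to_iops(36.5)))  # P3700: roughly 8900 IOPS
print(round(mbps_to_iops(27.2)))  # S3700: roughly 6600 IOPS

# QD8 IOPS figures quoted in the text, expressed as bandwidth
print(round(iops_to_mbps(77000), 1))  # P3700, MB/s
print(round(iops_to_mbps(58400), 1))  # S3700, MB/s
```

Run against the QD8 numbers, the same arithmetic puts both drives in the low hundreds of MB/s, which is why the P3700's lead still looks modest at that point in the sweep.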

Micron's P420m is incredibly competitive, substantially outperforming the P3700 at the highest queue depth.

Random read latency is incredibly important for applications where response time matters. Even more important for these applications is keeping latency below a certain threshold. What we're looking for here is a flat curve across all queue depths:

And that's almost exactly what the P3700 delivers. While the average latency for Intel's SSD DC S3700 (SATA) skyrockets after QD32, the P3700 remains mostly flat throughout the sweep. It's only at QD128 that we see a bit of an uptick. Even the 910 shows bigger jumps at higher queue depths.

If we remove the SATA drive and look exclusively at PCIe solutions, we get a better idea of the P3700's low latencies:

In this next chart we'll look at some specific numbers. Here we've got average latency (expressed in µs) for 4KB reads at a queue depth of 32. This is the same data as in the charts above, just formatted differently:

Average Latency - 4KB Random Read QD32

The P3700's latency advantage over its SATA counterpart is huge. Compared to other PCIe solutions, the P3700 is still leading but definitely not by as large a margin. Micron's P420m comes fairly close.

Next up is average latency, but now at our highest tested queue depth: 128.

Average Latency - 4KB Random Read QD128

Micron's P420m definitely takes over here. Micron seems to have optimized the P420m for operation at higher queue depths while Intel focused the P3700 a bit lower. The SATA based S3700 is just laughable here: its average completion latency is over 1.6ms.
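Average latency and throughput at a fixed queue depth are tied together by Little's Law (outstanding IOs = throughput x latency). As a rough sanity check, and assuming the queue stays full for the whole run, the S3700's QD128 latency can be turned into an upper bound on its IOPS. The helper names below are made up for illustration:

```python
def implied_iops(queue_depth, avg_latency_s):
    """Little's Law: throughput = outstanding IOs / average latency."""
    return queue_depth / avg_latency_s


def implied_latency_us(queue_depth, iops):
    """Little's Law rearranged: average latency in microseconds."""
    return queue_depth / iops * 1_000_000


# S3700 at QD128 with average latency "over 1.6ms" (figure from the text).
# Since the real latency is above 1.6ms, this is an upper bound on IOPS.
print(round(implied_iops(128, 1.6e-3)))  # about 80,000

# The same relationship in the other direction
print(round(implied_latency_us(128, 80_000)))  # about 1600 microseconds
```

The same relationship explains why a drive whose throughput stops scaling shows rising average latency: once IOPS plateau, every doubling of queue depth roughly doubles the time each IO spends waiting.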

Looking at maximum latency is interesting from a workload perspective, as well as from a drive architecture perspective. Latency sensitive workloads tend to have a max latency they can't exceed, but at the same time a high max latency combined with a low average latency implies that the drive sees these max latencies infrequently. From an architectural perspective, consistent max latencies across the entire QD sweep give us insight into how the drive works at a lower level. It's during these max latency events that the drive's controller can schedule cleanup and defragmentation routines. I recorded max latency at each queue depth and presented an average of all max latencies across the QD sweep (from QD1 to QD128). In general, max latencies remained consistent across the sweep.
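The reduction described above (one max latency per queue depth, averaged across the sweep) can be sketched as follows. The queue depth points and the latency values are hypothetical placeholders, not measured data from this review:

```python
from statistics import mean

# HYPOTHETICAL per-queue-depth max latencies in milliseconds.
# Both the sweep points and the values are placeholders for illustration.
max_latency_ms_by_qd = {
    1: 3.1, 2: 2.8, 4: 3.4, 8: 3.0,
    16: 2.9, 32: 3.3, 64: 3.2, 128: 3.5,
}


def summarize_max_latency(max_by_qd):
    """Average the per-queue-depth max latencies and report their spread."""
    values = list(max_by_qd.values())
    return {
        "mean_of_max_ms": mean(values),
        "lowest_max_ms": min(values),
        "highest_max_ms": max(values),
    }


summary = summarize_max_latency(max_latency_ms_by_qd)
print(summary)
```

Reporting the lowest and highest values alongside the mean is a small addition here: a narrow spread is what "consistent across the sweep" would look like in the numbers.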

Max Latency - 4KB Random Read

The 910's max latencies never really get out of hand. Part of the advantage is that each of the 910's four controllers only ever sees a queue depth of 32, so no individual controller is ever stressed all that much. The S3700 is next up with remarkably consistent performance here. The range of values the S3700 had was 2ms - 10ms, not correlating in any recognizable way to queue depth. Note the huge gap between max and average latency for the S3700 - it's an order of magnitude. These high latency events are fairly rare.

The P3700 sees two types of long latency events: one that takes around 3ms and another that takes around 15ms. The result is a higher max latency than the other two Intel drives, but given that the P3700's average latency is lower than both, these events are still fairly rare.

Micron's P420m runs the longest background task routine of anything here, averaging nearly 53ms. Whatever Micron is doing here, it seems consistent across all queue depths.

Random Write Performance

Now we get to the truly difficult workload: a steady state 4KB random write test. We first fill the drive to capacity, then perform a 4KB (QD128) random write workload until we fill the drive once more. We then run a 3 minute 4KB random write test across all queue depths, recording bandwidth and latency values. This gives us a good indication of steady state performance, which should be where the drives end up over days/weeks/months of continued use in a server.
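For readers who want to approximate this procedure, here is a sketch that builds (but does not run) a sequence of fio command lines. The review does not name the tool used, so fio itself, the device path, the 128KB block size and queue depth for the sequential fill, and the power-of-two sweep points are all assumptions:

```python
DEVICE = "/dev/nvme0n1"  # hypothetical target; these jobs would overwrite it
QUEUE_DEPTHS = [1, 2, 4, 8, 16, 32, 64, 128]  # assumed sweep points


def fio_cmd(name, rw, bs, iodepth, extra):
    """Build one fio invocation as an argument list (nothing is executed)."""
    return [
        "fio", f"--name={name}", f"--filename={DEVICE}",
        "--ioengine=libaio", "--direct=1",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}", *extra,
    ]


def build_random_write_plan():
    """Fill sequentially, precondition with random writes, then sweep QDs."""
    plan = [
        # Step 1: fill the drive to capacity (block size is an assumption)
        fio_cmd("seq-fill", "write", "128k", 32, ["--size=100%"]),
        # Step 2: 4KB random writes at QD128 until the drive is filled again
        fio_cmd("rand-precondition", "randwrite", "4k", 128, ["--size=100%"]),
    ]
    # Step 3: a 3 minute (180 second) 4KB random write run per queue depth
    for qd in QUEUE_DEPTHS:
        plan.append(
            fio_cmd(f"randwrite-qd{qd}", "randwrite", "4k", qd,
                    ["--time_based", "--runtime=180"])
        )
    return plan


for cmd in build_random_write_plan():
    print(" ".join(cmd))
```

The preconditioning steps matter more than the sweep itself: without them the drive would be measured in a fresh state rather than the steady state the test is meant to capture.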

Despite the more strenuous workload, the P3700 absolutely shines here. We see peak performance attained at a queue depth of 8 and it's sustained throughout the rest of the range.

Average latency is also class leading - it's particularly impressive when you compare the P3700 to its SATA counterpart.

Average Latency - 4KB Random Write QD32

Average Latency - 4KB Random Write QD128

The absolute average latency numbers are particularly impressive. The P3700 at a queue depth of 128 can sustain 4KB random writes with IOs completing at 0.86ms.

Max Latency - 4KB Random Write

85 Comments

  • andrewaggb - Tuesday, June 3, 2014

    that's really the question isn't it. I'm skeptical until somebody proves otherwise. Seems like you'd need a bios update at a minimum.
  • BeethovensCat - Tuesday, June 3, 2014

    Yes, this would be key. Would be annoying to buy a card and not be able to boot Windows from it. Would it only be possible with Z97 based chipsets or also Z87? Have a relatively new Z87 card. As much as I don't want to change to Apple, one must admit they are better at getting some of these things right. Come on Intel (Asus) - make it possible to boot from one of these on a Z87 motherboard and I will buy one right away!!
  • Taurothar - Tuesday, June 3, 2014

    Honestly, it's up to the motherboard's capabilities. A bios update should be possible but it depends on many things like how the PCIe lanes are distributed etc, I wouldn't count on getting the full performance out of a chipset designed before PCIe SSDs. PCIe RAID cards have the controller to boot from built in, but these stand alone SSDs mean the chipset or other onboard controller has to be able to recognize it, that might not be as simple as a bios update.
  • morganf - Tuesday, June 3, 2014

    I was disappointed that the 4K QD1 read was no better than 40 MB/sec that can be achieved by SATA / AHCI SSDs like the Samsung 840 Pro.

    FusionIO has been getting twice that (i.e., around 80 MB/sec) for years. I was expecting NVMe to achieve something similar.

    But maybe the 40 MB/sec is an OS driver limitation? Perhaps FusionIO is able to get around that because they have their own driver.
  • boogerlad - Tuesday, June 3, 2014

    Why does the p3500 have such low 4k random write IOPS? Is it merely the worst case/steady-state performance? Is it much lower quality NAND? Is it lack of over provisioning and not a problem if the drive is not filled to the brim? I've been waiting for a product like this for a very long time. To be honest, I was surprised Intel was the one to deliver. It looks like they checked out of making innovative products looking at their CPU lineup.
  • boogerlad - Tuesday, June 3, 2014

    Then again, I guess as long as 4k qd1 write speeds are the same as the p3700, it doesn't really matter. Many enthusiasts will buy the p3500 and put it under a consumer workload anyways that rarely has qd > 1.
  • Dangledon - Wednesday, June 4, 2014

    Low random write performance is probably an indirect reflection of TLC. The endurance numbers make this pretty clear. TLC has relatively long P/E dwell times. These times become apparent when garbage collection is triggered by sustained random write workloads. I don't know whether these devices support overprovisioning. Having it might help deal with spiky workloads as long as the "runway" is long enough. Though, frankly, the P3500 was not designed for a high random write workload.
  • Dangledon - Wednesday, June 4, 2014

    My bad. They're using MLC, not TLC. The reserve/spare capacity is 7% on the P3500, which in part accounts for the relatively low endurance. Intel is probably also doing NAND part binning, using the poorest quality parts in the P3500.
  • rob_allshouse - Tuesday, June 3, 2014

    One comment (and I do work for Intel, to be open about it)... but the P3700 does this all in x4 while the p420m does it in x8, so half the PCIe lanes consumed. I didn't see this in the article, and feel like it's very relevant. It also explains the disparity in sequential read performance.
  • mfenn - Tuesday, June 3, 2014

    I find it interesting that this article is presented as an enterprise SSD review and even goes so far as to decry the performance of previous implementations, but does not mention Fusion-io or Virident. We've had 500K IOPS and latencies in the tens of microseconds for years now without Intel or NVMe; those are not the stories here.

    NVMe is not some wonderful advance from a performance point of view, and should not be presented as such. What it is is a path towards the commoditization of relatively high performance PCIe SSDs. That's an incredibly important achievement and should have been the focus of the discussion.

    As it stands, this article follows the Intel marketing tune a little too closely and does not respect the deep market insights that I've come to expect from AnandTech.
