Original Link: http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme



In 2008 Intel introduced its first SSD, the X25-M, and with it Intel ushered in a new era of primary storage based on non-volatile memory. Intel may have been there at the beginning, but it missed out on most of the evolution that followed. It wasn't until late 2012, four years later, that Intel showed up with another major controller innovation. The Intel SSD DC S3700 added a focus on IO consistency, which had a remarkable impact on both enterprise and consumer workloads. Once again Intel found itself at the forefront of innovation in the SSD space, only to let others catch up in the coming years. Now, roughly two years later, Intel is back again with another significant evolution of its solid state storage architecture.

Nearly all prior Intel drives, as well as those of its most capable competitors, have played within the confines of the SATA interface. Designed for and limited by the hard drives that came before them, SSDs used SATA to sneak in and take over the high performance market, but they did so out of necessity, not preference. The SATA interface and the hard drive form factors that went along with it were the sheep's clothing to the SSD's wolf. It became clear early on that a higher bandwidth interface was necessary to really give SSDs room to grow.

We saw a quick transition from 3Gbps to 6Gbps SATA for SSDs, but rather than move to 12Gbps SATA only to saturate it a year later, most SSD makers set their sights on PCIe. With PCIe 3.0 x16 already capable of delivering 128Gbps of bandwidth, it was clear this was the appropriate IO interface for SSDs. Many SSD vendors saw the writing on the wall early on, but their PCIe based SSD solutions typically placed a handful of SATA SSD controllers behind a PCIe RAID controller. Only a select few developed their own native PCIe controllers. Micron was among the first to really push a native solution with its P320h and P420m drives.

Bandwidth limitations were only one reason to want to ditch SATA. The other bit of legacy that needed shedding was AHCI, the interface protocol for communication between host machines and their SATA HBAs (Host Bus Adaptors). AHCI was designed for a world where low latency NAND based SSDs didn't exist. It ends up being a fine protocol for communicating with high latency mechanical disks, but one that consumes an inordinate amount of CPU cycles for high performance SSDs.

In Intel's example, the Linux AHCI stack alone requires around 27,000 CPU cycles per IO. The result is that you need roughly 10 Sandy Bridge CPU cores to drive 1 million IOPS. The solution is a new lightweight, low latency interface, one designed around SSDs rather than hard drives. That interface is NVMe (Non-Volatile Memory Express, also known as the NVM Host Controller Interface Specification, or NVMHCI). In the same example, total NVMe overhead drops to around 10,000 cycles per IO, or roughly 3.5 Sandy Bridge cores needed to drive 1 million IOPS.
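The arithmetic behind those core counts is simple enough to sketch. The cycle counts are from Intel's example; the ~2.7GHz Sandy Bridge clock is my assumption, chosen because it makes the article's core counts line up:

```python
# Cores needed to sustain a target IOPS when each IO costs a fixed
# number of CPU cycles in the protocol stack. Cycle counts come from
# Intel's example; the 2.7GHz clock is an assumed figure.

def cores_needed(cycles_per_io, target_iops, clock_hz=2.7e9):
    """Total overhead cycles per second divided by one core's clock rate."""
    return cycles_per_io * target_iops / clock_hz

ahci = cores_needed(27_000, 1_000_000)
nvme = cores_needed(10_000, 1_000_000)
print(f"AHCI: {ahci:.1f} cores, NVMe: {nvme:.1f} cores")
# AHCI: 10.0 cores, NVMe: 3.7 cores
```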

NVMe drives do require updated OS/driver support. Windows 8.1 and Server 2012 R2 both include NVMe support out of the box; older OSes require a miniport driver to enable it. Booting from NVMe drives shouldn't be an issue either.

NVMe is a standard that seems to have real industry support behind it. Samsung has already launched its own NVMe drives, SandForce announced NVMe support with its SF3700, and today Intel is announcing a family of NVMe SSDs.

The Intel SSD DC P3700, P3600 and P3500 are all PCIe SSDs that feature a custom Intel NVMe controller. The controller is an evolution of the design used in the S3700/S3500, with improved internal bandwidth via an expanded 18-channel design, reduced internal latencies and NVMe support built in. The controller connects to as much as 2TB of Intel's own 20nm MLC NAND. The three drives target different endurance and performance needs.

The pricing is insanely competitive for brand-new technology. The highest endurance P3700 drive is priced at around $3/GB, which is similar to what enthusiasts were paying for their SSDs not too long ago. The P3600 trades some performance and endurance for $1.95/GB, and the P3500 drops pricing down to $1.495/GB. The P3700 ships with Intel's highest endurance NAND and highest over provisioning percentage (25% spare area vs. 12% on the P3600 and 7% on the P3500). DRAM capacities range from 512MB to 2.5GB of DDR3L on-board. All drives will be available as half-height, half-length PCIe 3.0 x4 add-in cards or 2.5" SFF-8639 drives.

Intel sent us a 1.6TB DC P3700 for review. Based on Intel's 400GB drive pricing from the table above, the drive we're reviewing should retail for $4828.
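As a sanity check, the quoted retail figure works out to almost exactly the ~$3/GB price point mentioned above, assuming linear per-GB scaling:

```python
# Price per GB implied by the 1.6TB P3700's expected retail price.
capacity_gb = 1600
retail_usd = 4828

price_per_gb = retail_usd / capacity_gb
print(f"${price_per_gb:.4f}/GB")  # $3.0175/GB
```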

IO Consistency

A cornerstone of Intel's DC S3700 architecture was its IO consistency. As the P3700 leverages the same basic controller architecture as the S3700, I'd expect a similar IO consistency story. I ran a slightly modified version of our IO consistency test, but the results should still give us some insight into the P3700's behavior from a consistency standpoint:

IO consistency seems pretty solid, although the IOs are not as tightly grouped as we've seen elsewhere. The P3700 still appears reasonably consistent, and it does attempt to increase performance over time.



Sequential Read Performance

Sequential operations still make up a substantial portion of enterprise storage workloads. Some of the most IO-heavy workloads we run in AnandTech's own infrastructure are our traffic and ad stats processing routines. These tasks run both daily and weekly, and both create tremendous IO load on our database servers. Profiling the workload reveals an access pattern that's largely sequential in nature.

We'll start with a look at sequential read performance. Here we fill the drive multiple times with sequential data and then read it back for a period of 3 minutes. I'm reporting performance in MB/s as well as latency over a range of queue depths. First up, bandwidth figures:

The biggest takeaway from this graph is just how much parallelism Intel manages to extract from each transfer, even at a queue depth of 1. The P3700 delivers more than 1GB/s of bandwidth at QD1. That's more than double any of the other competitors here, and equal to the combined performance of 3.7 SATA Intel SSD DC S3700s. Note that if you force the P3700 into its higher-power 25W operating mode, Intel claims peak performance hits 2.8GB/s compared to the 2.2GB/s we show here.

With NVMe you get a direct path to the PCIe controller, and in any well designed system the storage will communicate directly with a PCIe controller on the CPU's die. With a much lower overhead interface and protocol stack, the result should be substantially lower latency. The graph below looks at average latency across the queue depth sweep:

The P3700 also holds a nice latency advantage here. You'll be able to see just how low the absolute latencies are in a moment, but for now we can look at the behavior of the drives vs. queue depth. The P3700's latencies stay mostly flat up to a queue depth of 16; it's only after QD32 that we see latencies increase further. The comparison to the SATA based S3700 is hilarious: the P3700's IO latency at QD32 is lower than the S3700's at QD8.

The next graph removes the sole SATA option and looks at PCIe comparisons alone, including the native PCIe (non-NVMe) Micron P420m:

Micron definitely holds the latency advantage over Intel's design at higher queue depths. Remember that the P420m also happens to be a native PCIe SSD controller; it just uses a proprietary host controller interface.

Sequential Write Performance

Similar to our discussion around sequential read performance, sequential write performance is still a very valuable metric in the enterprise space. Large log processing can stress a drive's sequential write performance, and once again it's something we see in our own server environment.

Here we fill the drive multiple times with sequential data and then write sequentially for a period of 3 minutes. I'm reporting performance in MB/s as well as latency over a range of queue depths.

Once again we see tremendous performance at very low queue depths. At a queue depth of 1 the P3700 already performs better than any of the other drives here, and delivers 1.3GB/s of sequential write performance. That's just insane performance at such a low queue depth. By QD4, the P3700 reaches peak performance at roughly 1.9GB/s regardless of what power mode you operate it in.

The chart below shows average latency across the QD sweep:

The P3700 continues to do extremely well in the latency tests, although Intel's original PCIe SSD didn't do so badly here either - its bandwidth was simply nowhere near as good. Another way to look at it is that Intel now delivers better latency than the original 910, at substantially higher bandwidths. Micron's P420m manages to land somewhere between a good SATA drive and the P3700.

The next chart just removes the SATA drive so we get a better look at the PCIe comparison:

 



Random Read Performance

Although sequential performance is important, a true staple of any multi-user server is an IO load that appears highly random. For our small block random read test we first fill all drives sequentially, then perform one full drive write using a random 4K pass at a queue depth of 128. We then perform a 3 minute random read run at each queue depth, plotting bandwidth and latency along the way.

Small block random read operations have inherent limits when it comes to parallelism. For all of the drives here, QD1 performance ends up around 20 - 40MB/s. The P3700 manages 36.5MB/s (~8900 IOPS) compared to 27.2MB/s (~6600 IOPS) for the SATA S3700. Even at a queue depth of 8 there's only a modest advantage to the P3700 from a bandwidth perspective (~77,000 IOPS vs. ~58,400 IOPS). Performance does scale incredibly well with increasing queue depths, though. By QD16 we see the P3700 pull away, and by QD32 the P3700 delivers roughly 3.5x the performance of the S3700. There's a 70% advantage at QD32 compared to Intel's SSD 910, and that advantage grows to 135% at QD128.
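For 4KB transfers, the MB/s and IOPS figures above are interchangeable. A small helper shows the conversion, assuming 4KiB per IO and MB/s counted as 10^6 bytes per second:

```python
# Convert 4KB random-read bandwidth (MB/s) to IOPS. The bandwidth
# inputs are the QD1 figures quoted above.
BLOCK_BYTES = 4 * 1024  # 4KiB per IO

def mbps_to_iops(mb_per_s, block=BLOCK_BYTES):
    return mb_per_s * 1_000_000 / block

p3700_qd1 = mbps_to_iops(36.5)
s3700_qd1 = mbps_to_iops(27.2)
print(round(p3700_qd1), round(s3700_qd1))  # 8911 6641 (~8900 and ~6600)
```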

Micron's P420m is incredibly competitive, substantially outperforming the P3700 at the highest queue depth.

Random read latency is incredibly important for applications where response time matters. Even more important for these applications is keeping latency below a certain threshold; what we're looking for here is a flat curve across all queue depths:

And that's almost exactly what the P3700 delivers. While the average latency for Intel's SSD DC S3700 (SATA) skyrockets after QD32, the P3700 remains mostly flat throughout the sweep. It's only at QD128 that we see a bit of an uptick. Even the 910 shows bigger jumps at higher queue depths.

If we remove the SATA drive and look exclusively at PCIe solutions, we get a better idea of the P3700's low latencies:

In this next chart we'll look at some specific numbers. Here we've got average latency (expressed in µs) for 4KB reads at a queue depth of 32. This is the same data as in the charts above, just formatted differently:

Average Latency - 4KB Random Read QD32

The P3700's latency advantage over its SATA counterpart is huge. Compared to other PCIe solutions, the P3700 is still leading but definitely not by as large of a margin. Micron's P420m comes fairly close.

Next up is average latency, but now at our highest tested queue depth: 128.

Average Latency - 4KB Random Read QD128

Micron's P420m definitely takes over here. Micron seems to have optimized the P420m for operation at higher queue depths while Intel focused the P3700 a bit lower. The SATA based S3700 is just laughable here, average completion latency is over 1.6ms.

Looking at maximum latency is interesting from a workload perspective, as well as from a drive architecture perspective. Latency sensitive workloads tend to have a max latency they can't exceed, but at the same time a high max latency paired with a low average latency implies that the drive sees these max latencies infrequently. From an architectural perspective, consistent max latencies across the entire QD sweep give us insight into how the drive works at a lower level. It's during these max latency events that the drive's controller can schedule cleanup and defragmentation routines. I recorded max latency at each queue depth and present an average of all max latencies across the QD sweep (QD1 - QD128). In general, max latencies remained consistent across the sweep.

Max Latency - 4KB Random Read

The 910's max latencies never really get out of hand. Part of the advantage is that each of the 910's four controllers only ever sees a queue depth of 32, so no individual controller is ever stressed all that much. The S3700 is next up with remarkably consistent performance here. The range of values for the S3700 was 2ms - 10ms, not correlating in any recognizable way to queue depth. Note the huge gap between max and average latency for the S3700 - it's an order of magnitude. These high latency events are fairly rare.

The P3700 sees two types of long latency events: one around 3ms and another around 15ms. The result is a higher max latency than the other two Intel drives, but given the P3700's lower average latency, these events are still fairly rare.

Micron's P420m runs the longest background task routine of anything here, averaging nearly 53ms. Whatever Micron is doing here, it seems consistent across all queue depths.

Random Write Performance

Now we get to the truly difficult workload: a steady state 4KB random write test. We first fill the drive to capacity, then perform a 4KB (QD128) random write workload until we fill the drive once more. We then run a 3 minute 4KB random write test across all queue depths, recording bandwidth and latency values. This gives us a good indication of steady state performance, which should be where the drives end up over days/weeks/months of continued use in a server.

Despite the more strenuous workload, the P3700 absolutely shines here. We see peak performance attained at a queue depth of 8 and it's sustained throughout the rest of the range.

Average latency is also class leading - it's particularly impressive when you compare the P3700 to its SATA counterpart.

Average Latency - 4KB Random Write QD32

Average Latency - 4KB Random Write QD128

The absolute average latency numbers are particularly impressive. The P3700 at a queue depth of 128 can sustain 4KB random writes with IOs completing in 0.86ms on average.

Max Latency - 4KB Random Write



Mixed Read/Write Performance

Although our four-corner testing is useful, many real world enterprise workloads are composed of a mixture of reads and writes. OLTP environments in particular tend to see a 70/30 split between reads and writes. The test below is conducted the same way as our 4KB random write test (one sequential drive write, one 4KB QD128 random drive write, then a 3 minute test), but the workload itself is 70% reads and 30% writes.
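One rough way to estimate a blended 70/30 rate from the pure read and write rates is a weighted harmonic mean, which assumes each IO takes its pure-workload service time on a shared device. This is a simplified model, not the article's methodology; the inputs below are the P3700's headline random read/write IOPS figures:

```python
# Weighted-harmonic-mean estimate of mixed 70/30 throughput from the
# pure random read and random write rates. A rough model only.

def mixed_iops(read_iops, write_iops, read_frac=0.7):
    return 1 / (read_frac / read_iops + (1 - read_frac) / write_iops)

est = mixed_iops(450_000, 150_000)
print(f"{est:,.0f} IOPS")  # 281,250 IOPS under this model
```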

The results here look a lot like the 4KB random read results, but with a slightly different slope. The P3700 and Micron's P420m compete for top billing, but the P3700's superior random write performance and solid midrange queue depth random read performance ultimately give it the edge here.



CPU Utilization

With the move to NVMe, not only do we get lower latency IOs but we should also see lower CPU utilization thanks to the lower overhead protocol. To quantify the effects I used Task Manager to monitor CPU utilization across all four cores of a Core i7 4770K system (with HT disabled). Note that these values don't just reflect the impact of the storage device, but also the CPU time required to generate the 4KB random read (QD128) workload. I created four QD32 threads so all cores are taxed and we're not limited by a single CPU core.

Total System CPU Utilization (4 x 3.5GHz Haswell Cores)

To really put these values in perspective though we need to take into account performance as well. The chart below divides total IOPS during this test by total CPU usage to give us IOPS/% CPU usage:

Platform Efficiency: IOPS per % CPU Utilization

Here all of the PCIe solutions do pretty well. The SATA based S3700 is put to shame, but even the Intel SSD 910 does well here.

For the next charts I'm removing Iometer from the CPU usage calculation and instead looking at the CPU usage from the rest of the software stack:

Storage Subsystem CPU Utilization (4 x 3.5GHz Haswell Cores)

 

Platform Efficiency: IOPS per % Storage CPU Utilization

Here the 910 looks very good; it's obviously a much older (and slower) drive, but it's remarkably CPU efficient. Micron's P420m doesn't look quite as good, and the SATA S3700 is certainly far less efficient when it comes to IOPS per unit of CPU time.
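The two efficiency metrics above can be sketched as follows; the utilization and IOPS numbers are purely illustrative placeholders, not the measured results:

```python
# IOPS per % CPU, computed against total CPU utilization and against
# the storage stack alone (total minus the load generator's share).
# All inputs are hypothetical examples, not measured data.

def efficiency(iops, cpu_pct):
    return iops / cpu_pct

total_cpu = 80.0     # % across all cores (illustrative)
iometer_cpu = 30.0   # % burned generating the workload (illustrative)
storage_cpu = total_cpu - iometer_cpu

iops = 400_000
print(f"platform: {efficiency(iops, total_cpu):,.0f} IOPS/%")   # 5,000
print(f"storage:  {efficiency(iops, storage_cpu):,.0f} IOPS/%") # 8,000
```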



Power Consumption

To quantify power consumption we're looking at total system power, which takes into account the CPU load associated with workload generation as well as the power drawn by the drive itself. The P3700 was left in its default lower power operating mode, not the higher 25W mode.

Overall the P3700 doesn't raise any power consumption flags. The overall system power consumption ends up being around 16W higher than with a single S3700, but if you normalize for performance the two end up looking quite similar. There seems to be some power savings compared to the old Intel SSD 910, and roughly comparable power consumption to the P420m.

Total System Power Consumption - Sequential Writes

Total System Power Consumption - Random Write (QD128)



Final Words

The vast majority of PCIe SSDs have been disappointing up to this point. We either saw poorly implemented designs that offered SATA RAID on a PCIe card or high priced, proprietary PCIe designs. The arrival of NVMe gives SSDs the breathing room they need to continue to grow. We finally get a low latency, low overhead interface and we get to shed SATA once and for all.

Intel's SSD DC P3700 gives us our first look at an NVMe drive, and the results are impressive. A single P3700 can deliver up to 450K random read IOPS, 150K random write IOPS and nearly 2GB/s of sequential writes. Sequential reads are even more impressive at between 2 - 3GB/s. All of this performance comes with very low latency operation thanks to an updated controller and the new NVMe stack. CPU efficiency is quite good thanks to NVMe as well. You get all of this at $3/GB, or less ($1.495/GB for the P3500) if you're willing to give up some performance and endurance. As an enterprise drive, the P3700 is an excellent option. I can't imagine what a few of these would do in a server. At some of the price points Intel is talking about for the lower models, the P3xxx series won't be too far out of reach for performance enthusiasts either.

Intel's P3700 launch deck had a slide that put the P3700's performance in perspective compared to the number of SATA SSDs it could replace. I found the comparison interesting so I ran similar data, assuming perfect RAID scaling from adding together multiple DC S3700s. The comparison isn't perfect (capacity differences for one), but here's what I came up with:

A single P3700 ends up replacing 4 - 6 high performance SATA drives. If you don't need high sustained 4KB random write performance, you can get similar numbers out of the cheaper P3600 and P3500 as well. This is a very big deal.
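The replacement math can be reproduced as below, assuming perfect RAID scaling as the article does. The P3700 numbers are this review's headline figures; the per-S3700 rates are approximate spec-sheet values, so treat the ratios as rough:

```python
# How many DC S3700s (perfect scaling assumed) it takes to match one
# P3700 on each metric. S3700 IOPS figures are approximate spec values.
import math

p3700 = {"random_read_iops": 450_000, "random_write_iops": 150_000}
s3700 = {"random_read_iops": 75_000, "random_write_iops": 36_000}

for metric, value in p3700.items():
    n = math.ceil(value / s3700[metric])
    print(f"{metric}: {n} x S3700")
# random_read_iops: 6 x S3700
# random_write_iops: 5 x S3700
```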

Once again we see Intel at the forefront of a new wave of SSDs. What I really want to see now, however, is continued execution. We don't see infrequent blips of CPU architecture releases from Intel; we get a regular two-year tick-tock cadence. It's time for Intel's NSG to be given the resources necessary to do the same. I long for the day when we don't just see these SSD releases limited to the enterprise and corporate client segments, but spread across all markets - from mobile to consumer PC client and of course up to the enterprise as well.
