Fine-Tuning Performance

The Intel Optane SSD DC P4800X 750GB is expected to perform the same as the 375GB model - and the same as the consumer Optane SSD 900p, for that matter. Rather than simply repeat the tests those drives have already been subjected to, this review digs deeper into the performance characteristics of the Optane SSD and explores what it takes to extract the drive's full performance. There are quite a few esoteric system settings that can have an impact, since a microsecond gained or lost matters more to an Optane SSD than to a flash-based drive.

Multiple Queues

One of the most important features of NVMe that allows for higher performance than the AHCI protocol used for SATA is support for multiple queues of I/O commands. The NVMe protocol allows for up to 64k queues, each with up to 64k pending commands, but current hardware cannot actually reach those limits.

Intel's first-generation NVMe controller (used in the P3x00 drives and the SSD 750) supports a total of 32 queues in hardware: the admin queue, and 31 I/O queues. For the P4x00 generation NAND SSDs, the new controller supports 128 queues. However, the Optane SSD controller still only supports 32 queues, because that's more than enough to reach the full performance of the drive even if each queue only contains one command at a time. The Microsemi controller used by the Micron 9100 MAX supports 128 queues, the same as Intel's latest flash SSDs.

Achieving the highest and most consistent performance requires each CPU core that is performing I/O to have its own NVMe queue assigned to it. When multiple cores share a queue, the synchronization overhead can increase latency and reduce performance consistency. The Linux NVMe driver currently spreads queue assignments evenly across all CPU cores. On our server with 36 physical cores, this means the Intel SSDs with only 31 I/O queues require several cores on each socket to share queues. Rather than patch the kernel to allow manual queue assignment, the testbed was simply configured to enable only 16 out of 18 cores on each CPU. With 32 active cores and 31 I/O queues, the OS assigns one queue exclusively to each core on CPU #2, the socket the SSDs are attached to, while two cores on CPU #1 share a queue. None of the storage benchmarks in this review would benefit significantly from having two extra cores available, and 16 cores is more than enough to saturate any of these SSDs.
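
On Linux, the resulting queue-to-core mapping can be inspected without any special tools: the blk-mq layer exposes one sysfs directory per hardware queue, each with a cpu_list attribute naming the cores mapped to it. The sketch below simply walks that directory; the device name nvme0n1 is an assumption and should be adjusted to match the drive being examined.

```c
/* List the CPU assignment of each blk-mq hardware queue for an NVMe drive.
 * Assumes the drive appears as nvme0n1; adjust the path for other devices. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *base = "/sys/block/nvme0n1/mq";
    DIR *dir = opendir(base);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Queue directories are named "0", "1", ...; skip "." and "..". */
        if (!isdigit((unsigned char)entry->d_name[0]))
            continue;

        char path[512], cpus[256] = "";
        snprintf(path, sizeof(path), "%s/%s/cpu_list", base, entry->d_name);

        FILE *f = fopen(path, "r");
        if (f) {
            if (fgets(cpus, sizeof(cpus), f))
                cpus[strcspn(cpus, "\n")] = '\0';
            fclose(f);
        }
        printf("hardware queue %s -> CPUs %s\n", entry->d_name, cpus);
    }
    closedir(dir);
    return 0;
}
```

Any queue whose cpu_list names more than one core is being shared, which is exactly the situation the 16-core configuration avoids for the socket the SSDs are attached to.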

Sector Size

NVMe SSDs are capable of supporting different sector sizes. Everything defaults to the 512-byte sector size for the sake of backwards compatibility, but enterprise NVMe SSDs and many client NVMe SSDs also support 4kB sectors. Many enterprise SSDs also support sector formats that include between 8 and 128 extra bytes of metadata for end-to-end data protection. Just as with 4K-sector hard drives, using 4kB sectors on NVMe SSDs can slightly reduce overhead.

Switching between sector sizes is accomplished using the NVMe FORMAT command, which is also used for secure erase operations. On most flash-based SSDs, an NVMe FORMAT command takes only a few seconds. On Optane devices, the drive actually performs a low-level format that touches essentially all of the 3D XPoint memory and takes about as long as filling the drive sequentially. With the 750GB Optane SSD DC P4800X, an NVMe FORMAT command takes several minutes and requires overriding the default command timeout settings. (Coincidentally, a kernel patch to fix this issue showed up on the linux-nvme mailing list while I was testing the drive.)
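
For the curious, the FORMAT command can be issued directly through the Linux kernel's NVMe admin passthrough ioctl, which also makes it easy to supply a longer timeout. The sketch below is illustrative only: it assumes the 4kB sector format is LBA format index 1 for this namespace (verify against the Identify Namespace data before running anything like it), and it will of course destroy all data on the drive.

```c
/* Issue an NVMe Format NVM command (opcode 0x80) through the Linux admin
 * passthrough ioctl, selecting a different LBA format (sector size).
 * Assumptions: the target is /dev/nvme0, namespace 1, and the 4kB sector
 * format is LBA format index 1 -- check Identify Namespace before use.
 * WARNING: Format NVM destroys all data on the namespace. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/nvme0");
        return 1;
    }

    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = 0x80;                 /* Format NVM */
    cmd.nsid = 1;                      /* namespace to format */
    cmd.cdw10 = 1;                     /* LBA format index 1 (assumed 4kB), SES=0 */
    cmd.timeout_ms = 30 * 60 * 1000;   /* allow up to 30 minutes */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("NVME_IOCTL_ADMIN_CMD");
        close(fd);
        return 1;
    }
    printf("format completed\n");
    close(fd);
    return 0;
}
```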

Interrupts vs Polling

With the performance offered by Optane SSDs, the CPU can become a bottleneck even when running synthetic storage benchmarks. The latency of 3D XPoint memory is low enough that things like CPU context switches, interrupt handler latency and inter-core synchronization can significantly affect results. We've already covered how the test hardware was configured for high performance, but there's further room to fine tune the software configuration.

There are two main ways for the operating system to find out when the SSD has completed a command. The normal method and the best general-purpose choice is for the OS to wait for the drive to signal an interrupt. Upon receipt of an interrupt, the OS will check the relevant NVMe completion queue and pass along the result to the application. The alternative is polling: while waiting for an I/O operation to complete, the CPU constantly checks the status of the completion queue. This wastes a ton of CPU time and is usually only worthwhile when the CPU has nothing better to do and needs the result as soon as possible. However, polling does shave one or two microseconds off the SSD's latency. When CPU power management is enabled, polling can also keep the processor awake when it is awaiting completion of an I/O command, potentially leading to more significant latency and throughput advantages relative to interrupt-driven I/O.

[Diagram: interrupt-driven vs. polled I/O completion (source: Western Digital)]

Recent Linux kernel versions support a hybrid polling mode, where the OS estimates how long to sleep or run other tasks before it starts polling for completed I/O. This provides a reasonable balance between storage latency and CPU overhead, at least where storage performance is still a high priority. For this review, all drives were set to use the hybrid polling mode. However, polling is not used by default for ordinary I/O; the application has to explicitly request it, as described in the APIs section below.
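
On the kernels current at the time of this review, the polling behavior is controlled per block device through two sysfs attributes: io_poll enables polled completions, and io_poll_delay selects the flavor (-1 for classic busy polling, 0 for the hybrid adaptive mode, and a positive value for a fixed sleep in microseconds before polling begins). A minimal sketch of this configuration, assuming the drive appears as nvme0n1:

```c
/* Enable hybrid (adaptive) polling for an NVMe block device via sysfs.
 * Assumes the drive is nvme0n1 and a kernel that exposes io_poll and
 * io_poll_delay (roughly 4.10 and later); run as root. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(value, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Turn on polled completions for this device's queue. */
    if (write_sysfs("/sys/block/nvme0n1/queue/io_poll", "1"))
        return 1;

    /* 0 selects the hybrid mode: sleep for an estimated interval, then poll.
     * -1 would select classic busy polling; a positive number is a fixed
     * sleep time in microseconds before polling begins. */
    if (write_sysfs("/sys/block/nvme0n1/queue/io_poll_delay", "0"))
        return 1;
    return 0;
}
```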

APIs

At the application layer, there are several ways to accomplish I/O. Most methods (like the simple and ancient read() and write() system calls) are synchronous, making a single request at a time and blocking: the thread waits until the I/O operation is done before continuing. When these APIs are used, the only way an application can produce a queue depth greater than 1 is to have multiple threads performing I/O simultaneously.

Most operating systems also have APIs for asynchronous I/O, where the application sends requests to the OS but the application chooses when to check if those requests have completed. This allows a single thread to generate queue depths greater than one. Both asynchronous I/O and multithreaded synchronous I/O allow for high queue depths to be generated by a single application, but they are also more complex to use than simple single-threaded synchronous I/O.
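
To illustrate how one thread can keep several commands in flight, the sketch below uses Linux native AIO through the libaio library to submit a batch of reads before reaping any completions. The device path, queue depth, and block size are arbitrary placeholders.

```c
/* Submit a batch of asynchronous reads from one thread using Linux libaio,
 * producing a queue depth greater than one without extra threads.
 * The device path, queue depth, and block size are example values only.
 * Build with: gcc -o aio_batch aio_batch.c -laio  (run as root) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 8
#define BLOCK_SIZE  4096

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));    /* io_setup() requires a zeroed context */
    int ret = io_queue_init(QUEUE_DEPTH, &ctx);
    if (ret != 0) { fprintf(stderr, "io_queue_init: %d\n", ret); return 1; }

    struct iocb iocbs[QUEUE_DEPTH], *iocb_ptrs[QUEUE_DEPTH];
    void *buffers[QUEUE_DEPTH];

    for (int i = 0; i < QUEUE_DEPTH; i++) {
        /* O_DIRECT requires aligned buffers. */
        if (posix_memalign(&buffers[i], BLOCK_SIZE, BLOCK_SIZE)) return 1;
        io_prep_pread(&iocbs[i], fd, buffers[i], BLOCK_SIZE,
                      (long long)i * BLOCK_SIZE);
        iocb_ptrs[i] = &iocbs[i];
    }

    /* All eight reads are in flight at once: queue depth 8 from one thread. */
    ret = io_submit(ctx, QUEUE_DEPTH, iocb_ptrs);
    if (ret != QUEUE_DEPTH) { fprintf(stderr, "io_submit: %d\n", ret); return 1; }

    struct io_event events[QUEUE_DEPTH];
    int done = io_getevents(ctx, QUEUE_DEPTH, QUEUE_DEPTH, events, NULL);
    printf("%d reads completed\n", done);

    io_queue_release(ctx);
    close(fd);
    return 0;
}
```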

Linux has also recently gained a new set of system calls for performing synchronous I/O and optionally flagging the read or write operation as high priority (RWF_HIPRI). This flag signals the OS to poll for completion of the I/O operation, reducing latency at the expense of burning CPU time instead of idling. These preadv2() and pwritev2() system calls are close to being a drop-in replacement for the simple read() and write() system calls, but in most programming languages applications go through a standard library that abstracts the I/O interface, so switching an application to the new system calls and flagging some or all I/Os as high priority is not trivial. Currently, preadv2() and pwritev2() are the only Linux storage API that can trigger the kernel to poll instead of waiting for an interrupt.
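
For comparison with the asynchronous example above, here is what opting in to polling looks like at the application level: a single read issued with preadv2() and the RWF_HIPRI flag. This is only a sketch; it assumes a kernel new enough for preadv2() (4.6+) and a glibc that exposes the wrapper (2.26+), the device path is an example, and polling still has to be enabled on the device's queue as shown earlier.

```c
/* Perform one 4kB read flagged as high priority with preadv2(), which lets
 * the kernel poll for the completion instead of sleeping on an interrupt.
 * Assumptions: kernel 4.6+ and glibc 2.26+ for preadv2()/RWF_HIPRI, the
 * device path is only an example, and io_poll is enabled on the device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;   /* O_DIRECT alignment */

    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    /* RWF_HIPRI asks the kernel to poll for this I/O's completion. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
    if (n < 0)
        perror("preadv2");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```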

For application developers seeking to squeeze every last microsecond of latency out of their I/O, Intel created the Storage Performance Development Kit (SPDK), an offshoot of their Data Plane Development Kit (DPDK) for networking. Both projects allow applications to access storage or network devices directly, without going through the OS kernel's drivers. SPDK is an open-source library that is not tied to Intel hardware; it can be used on Linux or FreeBSD to access any vendor's NVMe SSD through its polled-mode driver. Using SPDK requires more invasive application changes than any of the APIs mentioned above, but it is also the fastest and most direct way for an application to interact with NVMe SSDs. Due to time constraints, this review does not include benchmarks with SPDK.

I/O Scheduler

Most operating systems include some form of I/O scheduler to determine which operations to send to the disk first when multiple processes need to access the disk. Linux includes several I/O schedulers with various strengths and weaknesses on different workloads. For real-world use, the proper choice of I/O scheduler can make a significant difference in overall performance. However, for benchmarking, I/O schedulers can interfere with attempts to test at specific queue depths and with specific I/O ordering. For this review, all SSDs were set to use the Linux no-op I/O scheduler, which does no reordering or throttling and consequently has the least CPU overhead.
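
Selecting the scheduler is another one-line sysfs change. The sketch below reads back the current setting and then switches to the no-reordering choice; note that on blk-mq devices (which NVMe drives are) it is named "none", while the legacy block layer calls it "noop". The device name nvme0n1 is again an assumption.

```c
/* Select the no-op I/O scheduler for an NVMe block device via sysfs.
 * Assumes the drive is nvme0n1; on blk-mq devices the no-reordering choice
 * is named "none", while the legacy block layer calls it "noop". Run as root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/nvme0n1/queue/scheduler";
    char line[256] = "";

    /* The attribute lists the available schedulers with the active one in
     * brackets, e.g. "[mq-deadline] kyber none". */
    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(line, sizeof(line), f))
            printf("before: %s", line);
        fclose(f);
    }

    /* Writing a scheduler name selects it. */
    f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fputs("none", f);
    fclose(f);
    return 0;
}
```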

Comments

  • "Bullwinkle J Moose" - Thursday, November 09, 2017 - link

    Humor me.....

    How fast can you copy and paste a 100GB file from and to the same Optane SSD

    I don't believe your mixed mode results adequately demonstrate the internal throughput

    At least not until you demonstrate a direct comparison
  • Billy Tallis - Thursday, November 09, 2017 - link

    Your concept of "internal throughput" has no basis in reality. File copies (on a filesystem that does not do copy-on-write) require the file data to be read from the SSD into system DRAM, then written back to the SSD. There are no "copy" commands in the NVMe command set.
  • "Bullwinkle J Moose" - Thursday, November 09, 2017 - link

    "There are no "copy" commands in the NVMe command set."
    ---------------------------------------------------------------------------------
    That might be fixed with a few more onboard processors in the future but does not answer my question

    How fast can you copy/paste 100GB on THAT specific drive?
  • "Bullwinkle J Moose" - Thursday, November 09, 2017 - link

    Better yet, I'd like you to GUESS how fast it can copy and paste based on your mixed mode analysis and then go measure it
  • Lord of the Bored - Friday, November 10, 2017 - link

    How will a new processor change that there is no way to tell the drive to do what you want? We don't trust storage devices to "do what I mean", because the cost of a mistake is too high. No device anyone should be using will say "it looks like they're writing back the data they just read in, I'mma ignore the input and duplicate it from the cache to save time." Especially since they can't know if the data is changed in advance.

    Barring a new interface standard, it will take exactly as long to copy a file to another location on the same drive as it will to read the file and then write the file, because that is the only provision within the NVMe command set.
  • "Bullwinkle J Moose" - Friday, November 10, 2017 - link

    "How will a new processor change that there is no way to tell the drive to do what you want?"
    --------------------------------------------------------------------------------------------------------------------------
    I have no idea Mr Bored, the "problem" as outlined by Billy Tallis could be ignored COMPLETELY and a fix is not even needed if he would simply do the copy/paste test that I originally asked

    He won't of course, and continue his downward spiral into depression and resentment lashing out at anyone who dares say a bad word about the Spyware/Adware/Malware/Extortionware Platform that is Windows 10

    Check out the interview between Eli the Computer Guy and Baracules Nerdgasm from a year ago

    Poor depressed Eli is still trying to figure out how he can still have a future in a Microsoft World and Baracules is the happiest Guy on Earth

    Check out the Barnacules Videos on Windows Spyware and you can understand his happiness
    He does not let Microsoft dictate the framework of his business, life and future
    Youtube Search: Windows is Spyware (you will find him)

    Poor Billy is on that same sad downward slope that only leads to Suicide or Mass Murder

    Just pull yer head out of Nadella's Ass long enough to tell us how fast this drive can copy/paste 100GB

    It's not hard at all
    Even I could do it (just give me the drive)

    No need to re-invent the drive or come up with what ifs

    Easy-Peasy
    Now SMILE and repeat 3 times, Microsoft is the problem / not the solution
    Deep breath annnnnnd Relax

    There is a future if you make one!
  • Lord of the Bored - Friday, November 10, 2017 - link

    Your test is dumb, and is attempting to measure something that can't actually be measured that way.
  • "Bullwinkle J Moose" - Thursday, November 09, 2017 - link

    What would happen if Intel Colludes with AMD to implement this technology into onboard graphics instead of AMD's plan to use Flash in their graphics cards ?

    Seems to me like Internal throughput would be very important to the design
  • Samus - Thursday, November 09, 2017 - link

    That is also file system dependent. For example, in Mac OS High Sierra, you can copy and paste (duplicate) any size file instantly on any drive formatted with APFS.

    But your question of a block by block transfer of a file internally for a 100GB file would likely take 50 seconds if not factoring in file system efficiency.
  • cygnus1 - Thursday, November 09, 2017 - link

    That's not a copy of the file though. It's just a duplicate file entry referencing the same blocks. That and things like snapshots are possible thanks to the copy on write nature of that file system. But, if any of those blocks were to become corrupted, both 'copies' of the file are corrupt.
