The Intel Optane Memory (SSD) Preview: 32GB of Kaby Lake Caching

Author: Billy Tallis

by Billy Tallis on April 24, 2017 12:00 PM EST


Random Read

Random read speed is the most difficult performance metric for flash-based SSDs to improve on. There is very limited opportunity for a drive to do useful prefetching or caching, and parallelism from multiple dies and channels can only help at higher queue depths. The NVMe protocol reduces overhead slightly, but even high-end enterprise PCIe SSDs can struggle to offer random read throughput that would saturate a SATA link.

Real-world random reads are often blocking operations for an application, such as when traversing the filesystem to look up which logical blocks store the contents of a file. Opening even a non-fragmented file can require the OS to perform a chain of several random reads, and since each is dependent on the result of the last, they cannot be queued.
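
To make the cost of a dependent chain concrete, here is a minimal Python sketch that compares a chain of reads that must run one after another against the same number of independent reads issued at a given queue depth. The function names and the 100µs per-read latency are hypothetical illustrations, not measurements from this review, and the model ignores any overhead beyond the per-read latency.

```python
import math


def dependent_chain_time_us(num_reads: int, read_latency_us: float) -> float:
    """Each read needs the result of the previous one, so latencies add up."""
    return num_reads * read_latency_us


def independent_reads_time_us(num_reads: int, read_latency_us: float,
                              queue_depth: int) -> float:
    """Idealized model: up to queue_depth reads are in flight at once,
    and each batch of reads completes after one read latency."""
    batches = math.ceil(num_reads / queue_depth)
    return batches * read_latency_us


if __name__ == "__main__":
    latency_us = 100.0  # hypothetical per-read latency, not a measured value
    print(dependent_chain_time_us(4, latency_us))       # chain of 4 reads
    print(independent_reads_time_us(4, latency_us, 4))  # same 4 reads at QD4
```

In this simplified model a higher queue depth does nothing for the dependent chain, which is why QD1 latency matters so much for this kind of access pattern.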

These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.

Queue Depth 1

Our first test of random read performance looks at the dependence on transfer size. Most SSDs focus on 4kB random access as that is the most common page size for virtual memory systems and it is a common filesystem block size. For our test, each transfer size was tested for four minutes and the statistics exclude the first minute. The drives were preconditioned to steady state by filling them with 4kB random writes twice over.
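
The timing rule described above (a four-minute run with the first minute excluded) can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming a list of I/O completion timestamps measured in seconds from the start of the run; the function names are hypothetical and this is not the tooling used for the review.

```python
def steady_state_iops(completion_times_s, run_s=240.0, warmup_s=60.0):
    """Count only I/Os that completed after the warm-up window and
    divide by the length of the measured window."""
    counted = sum(1 for t in completion_times_s if warmup_s <= t < run_s)
    return counted / (run_s - warmup_s)


def throughput_mb_s(iops, transfer_bytes):
    """Convert an I/O rate to throughput in MB/s (1 MB = 10^6 bytes)."""
    return iops * transfer_bytes / 1e6


if __name__ == "__main__":
    # Hypothetical trace: one completion every 0.5 s over a 240 s run.
    trace = [i * 0.5 for i in range(480)]
    iops = steady_state_iops(trace)
    print(iops, throughput_mb_s(iops, 4096))
```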



The Optane Memory module manages to provide slightly higher performance than even the P4800X for small random reads, though it levels out at about half the performance for larger transfers. The Samsung 960 EVO starts out about ten times slower than the Optane Memory but narrows the gap in the second half of the test. The Crucial MX300 is behind the Optane Memory by more than a factor of ten through most of the test.

Queue Depth >1

Next, we consider 4kB random read performance at queue depths greater than one. A single-threaded process is not capable of saturating the Optane SSD DC P4800X with random reads so this test is conducted with up to four threads. The queue depths of each thread are adjusted so that the queue depth seen by the SSD varies from 1 to 16. The timing is the same as for the other tests: four minutes for each tested queue depth, with the first minute excluded from the statistics.
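
One way to picture how a total queue depth can be spread over up to four threads is the sketch below. The even-split policy and the function name are assumptions made for illustration; the text above only says that per-thread queue depths are adjusted so the drive sees a total of 1 to 16.

```python
def split_queue_depth(total_qd, max_threads=4):
    """Return per-thread queue depths that sum to total_qd.

    Assumed policy (hypothetical): use as many threads as possible,
    up to max_threads, and spread the outstanding I/Os as evenly
    as possible among them.
    """
    if total_qd < 1 or max_threads < 1:
        raise ValueError("total_qd and max_threads must be at least 1")
    threads = min(total_qd, max_threads)
    base, extra = divmod(total_qd, threads)
    # The first `extra` threads carry one additional outstanding I/O.
    return [base + 1 if i < extra else base for i in range(threads)]


if __name__ == "__main__":
    for qd in (1, 2, 4, 8, 16):
        print(qd, split_queue_depth(qd))
```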

The SATA, flash NVMe and two Optane products are each clearly occupying different regimes of performance, though there is some overlap between the two Optane devices. Except at QD1, the Optane Memory offers lower throughput and higher latency than the P4800X. By QD16 the Samsung 960 EVO is able to exceed the throughput of the Optane Memory at QD1, but only with an order of magnitude more latency.



Comparing random read throughput of the Optane SSDs against the flash SSDs at low queue depths requires plotting on a log scale. The Optane Memory's lead over the Samsung 960 EVO is much larger than the 960 EVO's lead over the Crucial MX300. Even at QD16 the Optane Memory holds on to a 2x advantage over the 960 EVO and a 6x advantage over the MX300. Over the course of the test from QD1 to QD16, the Optane Memory's random read throughput roughly triples.


Latency charts: mean, median, 99th percentile, 99.999th percentile

For mean and median random read latency, the two Optane drives are relatively close at low queue depths and far faster than either flash SSD. The 99th and 99.999th percentile latencies of the Samsung 960 EVO are only about twice as high as the Optane Memory while the Crucial MX300 falls further behind with outliers in excess of 20ms.
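
For readers unfamiliar with these four statistics, the following Python sketch computes them from a list of per-I/O latencies. It uses the nearest-rank percentile definition, which is an assumption on our part since the review does not state which definition was used, and the sample data is made up purely to show how a single outlier affects each number.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def summarize_latency(samples_us):
    """Return the four latency statistics discussed in the text."""
    return {
        "mean": statistics.fmean(samples_us),
        "median": statistics.median(samples_us),
        "p99": percentile_nearest_rank(samples_us, 99.0),
        "p99.999": percentile_nearest_rank(samples_us, 99.999),
    }


if __name__ == "__main__":
    # Hypothetical data: 999 fast reads and one 20 ms outlier.
    samples = [10.0] * 999 + [20000.0]
    print(summarize_latency(samples))
```

With the nearest-rank definition, a run with fewer than 100,000 samples reports its single worst I/O as the 99.999th percentile, so this statistic only becomes distinct from the maximum in long, high-throughput runs.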

Random Write

Flash memory write operations are far slower than read operations. This is not always reflected in the performance specifications of SSDs because writes can be deferred and combined, allowing the SSD to signal completion before the data has actually moved from the drive's cache to the flash memory. Consumer workloads consist of far more reads than writes, but there are enough sources of random writes that they also matter to everyday interactive use. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
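
The deferring and combining of writes described above can be illustrated with a toy write-back buffer in Python. This is a deliberately simplified model with hypothetical names; real SSD firmware is far more involved (power-loss protection, mapping tables, garbage collection), and nothing here reflects how any particular drive in this review is implemented.

```python
class WriteBackBuffer:
    """Toy model of a drive-side write cache (hypothetical)."""

    def __init__(self):
        self.pending = {}       # LBA -> newest data not yet on the medium
        self.medium = {}        # LBA -> data actually written to the medium
        self.medium_writes = 0  # number of slow medium writes performed

    def write(self, lba, data):
        """Accept a write and signal completion immediately.
        A second write to the same LBA replaces the pending one,
        so the two are combined into a single medium write."""
        self.pending[lba] = data
        return "complete"

    def flush(self):
        """Move all pending data to the medium."""
        for lba, data in sorted(self.pending.items()):
            self.medium[lba] = data
            self.medium_writes += 1
        self.pending.clear()


if __name__ == "__main__":
    buf = WriteBackBuffer()
    buf.write(7, b"old")
    buf.write(7, b"new")   # combined with the previous write to LBA 7
    buf.write(8, b"other")
    buf.flush()
    print(buf.medium_writes, buf.medium)
```

Three host writes are acknowledged, but only two writes reach the medium. A steady-state test is designed to overwhelm this kind of buffering so that the sustained speed of the medium shows through.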

Queue Depth 1

As with random reads, we first examine QD1 random write performance of different transfer sizes. 4kB is usually the most important size, but some applications will make smaller writes when the drive has a 512B sector size. Larger transfer sizes make the workload somewhat less random, reducing the amount of bookkeeping the SSD controller needs to do and generally allowing for increased performance.



As with random reads, the Optane Memory holds a slight advantage over the P4800X for the smallest transfer sizes, but the enterprise Optane drive completely blows away the consumer Optane Memory for larger transfers. The consumer flash SSDs perform quite similarly in this steady-state test and are consistently about an order of magnitude slower than the Optane Memory.

Queue Depth >1

The test of 4kB random write throughput at different queue depths is structured identically to its counterpart random read test above. Queue depths from 1 to 16 are tested, with up to four threads used to generate this workload. Each tested queue depth is run for four minutes and the first minute is ignored when computing the statistics.

With the Optane SSD DC P4800X included on this graph, the two flash SSDs have barely perceptible random write throughput, and the Optane Memory's throughput and latency both fall roughly in the middle of the gap between the P4800X and the flash SSDs. The random write latency of the Optane Memory is more than twice that of the P4800X at QD1 and is close to the latency of the Samsung 960 EVO, while the Crucial MX300 starts at about twice that latency.



When testing across the range of queue depths and at steady state, the 525GB Crucial MX300 always delivers higher throughput than the Samsung 960 EVO, but with substantial inconsistency at higher queue depths. The Optane Memory almost doubles in throughput from QD1 to QD2, and is completely flat thereafter while the P4800X continues to improve until QD8.


Latency charts: mean, median, 99th percentile, 99.999th percentile

The Optane Memory and Samsung 960 EVO start out with the same median latency at QD1 and QD2 of about 20µs. The Optane Memory's latency increases linearly with queue depth after that due to its throughput being saturated, but the 960 EVO's latency stays lower until near the end of the test. The Samsung 960 EVO has relatively poor 99th percentile latency to begin with and is joined by the Crucial MX300 once it has saturated its throughput, while the Optane Memory's latency degrades gradually in the face of overwhelming queue depths. The 99.999th percentile latency of the flash-based consumer SSDs is about 300-400 times that of the Optane Memory.
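
The linear growth in latency once throughput is saturated follows from Little's law: the average number of I/Os in flight equals throughput multiplied by average latency. The short Python sketch below applies that relation. The 100,000 IOPS ceiling is a hypothetical round number chosen for illustration, not a measured result, and the relation strictly applies to mean rather than median latency.

```python
def mean_latency_us(queue_depth, iops):
    """Little's law rearranged: mean latency = I/Os in flight / throughput."""
    return queue_depth * 1e6 / iops


if __name__ == "__main__":
    saturated_iops = 100_000  # hypothetical throughput ceiling
    for qd in (2, 4, 8, 16):
        # With throughput pinned at the ceiling, latency scales with QD.
        print(qd, mean_latency_us(qd, saturated_iops))
```

Once a drive cannot complete I/Os any faster, every additional outstanding request simply waits in line, so doubling the queue depth doubles the mean latency.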


110 Comments


YazX_ - Monday, April 24, 2017 - link
"Since our Optane Memory sample died after only about a day of testing"

LOL
Chaitanya - Monday, April 24, 2017 - link
And it is supposed to have endurance rating 21x larger than a conventional NAND SSD.
Sarah Terra - Monday, April 24, 2017 - link
Funny yes, but teething issues aside the random write Performance is several orders of magnitude faster than all existing storage mediums, this is the number one metric I find that plays into system responsiveness, boot times, and overall performance and the most ignored metric by all Meg's to date. They all go for sequential numbers, which don't mean jack except when doing large file copies.
ddriver - Monday, April 24, 2017 - link
So let's summarize:

1000 times faster than NAND - in reality only about 10x faster in hypetane's few strongest points, 2-6x better in most others, maximum throughput lower than consumer NVME SSDs, intel lied about speed about 200 times LOL. Also from Tom's review, it became apparent that until the cache of comparable enterprise SSDs fills up, they are just as fast as hypetane, which only further solidifies my claim that xpoint is NO BETTER THAN SLC, because that's what those drives use for cache.

1000 times the endurance of flash - in reality like 2-3x better than MLC. Probably on par with SLC at the same production node. Intel lied about 300-500 times.

10 times denser than flash - in reality it looks like density is actually way lower than. 400 gigs in what.. like 14 chips was it? Samsung has planar flash (no 3d) that has more capacity in a single chip.

So now they step forward to offer this "flash killer" as a puny 32 gb "accelerator" which makes barely any to none improvement whatsoever and cannot even make it through one day of testing.

That's quite exciting. I am actually surprised they brought the lowest capacity 960 evo rather than the p600.

Consumer grade software already sees no improvement whatsoever from going sata to nvme. It won't be any different for hypetane. Latency at low queue depth access is good, but that's mostly the controller here, in this aspect NAND SSDs have a tremendous headroom for improvement. Which is what we are most likely going to see in the next generation from enterprise products, obviously it makes zero sense for consumers, regardless of how "excited" them fanboys are to load their gaming machines with terabytes of hypetane.

Last but not least - being exclusive to intel's latest chips is another huge MEH. Hypetane's value is already low enough at the current price and limited capacity, the last thing that will help adoption is having to buy a low value intel platform for it, when ryzen is available and offers double the value of intel offerings.
Drumsticks - Monday, April 24, 2017 - link
Your bias is showing.

1000x -> Harp on it all you want, but that number was for the architecture not the first generation end product. It represents where we can go, not where we are. I'll also note that Toms gave it their editor approved award - "As tested today with mainstream settings, Optane Memory performed as advertised. We observed increased performance with both a hard disk drive and an entry-level NVMe SSD. The value proposition for a hard drive paired with Optane Memory is undeniable. The combination is very powerful, and for many users, a better solution than a larger SSD."

1000 times the endurance of flash -> You can concede that 3D XPoint density isn't as good as they originally envisioned, but it's still impressive, gen1, and has nowhere to go but up. It's not really worse than other competing drives per drive capacity - this cache supports like 3 DWPD basically. The MX300 750GB only supports like .3 DWPD. 10x better is still good.

10 times denser than flash -> DRAM, not Flash. And it's going to be much denser than DRAM.

Barely any to no improvement -> LOL, did you look at the graphs? Those lines at the bottom and on the left were 500GB and 250GB Sata and NVMe drives getting killed by Optane in a 32GB configuration. 3D XPoint was designed for low queue depth and random performance - i.e. things that actually matter, where it kills its competition. Even sequential throughput, which is far from its design intention, generally outperforms consumer drives.

So, Optane costs, in an enterprise SSD, 2-3x more than other enterprise drives, for record breaking low queue depth throughput that far surpasses its extra cost, while providing 10-80x less latency. In a consumer drive, Optane regularly approaches an order of magnitude faster than consumer drives in only a 32GB configuration.

If Optane is only as fast as SLC, I'd love to understand why the P4800X broke records as pretty much the fastest drive in the world, barring unrealistically high queue depths.

This 32GB cache might be a stopgap, and less compelling of a product in general because of its capacity, but that you could deny the potential that 3D XPoint holds is absolutely laughable. The random performance and low queue depth performance is undeniably better than NAND, and that's where consumer performance matters.
ddriver - Monday, April 24, 2017 - link
"I'd love to understand why the P4800X broke records"

Because nobody bothered to make a SLC drive for many many years. The last time there were purely SLC drives on the market it was years ago, with controllers completely outdated compared to contemporary standards.

SLC is so good that today they only use it for cache in MLC and TLC drives. Kinda like what intel is trying to push hypetane as. Which is why you can see SSDs hitting hypetane IOPs with inferior controllers, until they run out of SLC cache space and performance plummets due to direct MLC/TLC access.

I bet my right testicle that with a comparable controller, SLC can do as well and even better than hypetane. SLC PE latencies are in the low hundreds of NANOseconds, which is substantially lower than what we see from hypetane. Endurance at 40 nm is rated at 100k PE cycles, which is 3 times more than what hypetane has to offer. It will probably drop as process node shrinks but still.

"10x better is still good"

Yet the difference between 10x and 1000x is 100x. Like imagine your employer tells you he's gonna pay you 100k a year, and ends up paying you a 1000 bucks instead. Surely not something anyone would object to LOL.

I am not having problems with "10x better". I am having problems with the fact it is 100x less than what they claimed. Did they fail to meet their expectations, or did they simply lie?

I am not denying hypetane's "potential". I merely make note that it is nothing radically better than nand flash that has not been compromised for the sake of profit. xpoint is no better than SLC nand. With the right controller, good old, even ancient and almost forgotten SLC is just as good as intel and micron's overhyped love child. Which is kinda like reinventing the wheel a few thousand years later, just to sell it at a few times what it's actually worth.

My bias is showing? Nope, your "intel inside" underpants are ;)
Reflex - Monday, April 24, 2017 - link
SLC has severe limits on density and cost. It's not used because of that. Even at the same capacity as these initial Optane drives it would likely cost considerably more, and as Optane's density increases there is no ability to mitigate that cost with SLC, it would grow linearly with the amount of flash. The primary mitigations already exist: MLC and TLC. Of course those reduce the performance profile far below Optane and decrease its ability to handle wear. Technically SLC could go with a stacked die approach, as MLC/TLC are doing, however nothing really stops Optane from doing the same making that at best a neutral comparison.
ddriver - Monday, April 24, 2017 - link
SLC is half the density of MLC. Samsung has 2 TB of MLC worth in 4 flash chips. Gotta love 3D stacking. Now employ epic math skills and multiply 4 by 0.5, and you get a full TB of SLC goodness, perfectly doable via 3D stacked nand.

And even if you put 3D stacking aside, which if I am not mistaken the sm961 uses planar MLC, 2 chips on each side for a full 1 TB. Cut that in half, you'd get 512 GB of planar SLC in 4 modules.

Now, I don't claim to be that good in math, but if you can have 512 GB of SLC nand in 4 chips, and it takes 14 for a 400 GB of xpoint, that would make planar SLC OVER 4 times denser than xpoint.

Thus if at planar dies SLC is over 4 times better, stacked xpoint could not possibly be better than stacked SLC.

Severe limits my ass. The only factor at play here is that SSDs are already faster than needed in 99% of the applications. Thus the industry would rather churn MLC and TLC to maximize the profit per grain of sand being used. The moment hypetane begins to take market share, which is not likely, they can immediately launch SLC enterprise products.

Also, it should be noted that there is still ZERO information about what the xpoint medium actually is. For all we know, it may well be SLC, now wouldn't that be a blast. Intel has made a bunch of claims about it, none of which seemed plausible, and most of which have already turned out to be a lie.
ddriver - Monday, April 24, 2017 - link
*multiply 2 by 0.5
Reflex - Monday, April 24, 2017 - link
You can 3D stack Optane as well. That's a wash. You seem very obsessed with being right, and not with understanding the technology.
