A Brand New Architecture

To understand how the S3700 is different, we need to revisit how SSDs work. I've done this several times over the years so I'll keep it as succinct as possible here. SSDs are made up of a bunch of NAND packages, each with 1 - 8 NAND die per package, with each die made of multiple planes, blocks and finally pages.

NAND is solid-state, non-volatile memory (data is retained even when power is removed, courtesy of some awesome physics). There are no moving parts, and accesses are very memory-like, which delivers great sequential and random IO performance. The downside is that NAND has some very strict rules dictating how it is written to and erased.

The first thing to know about NAND flash is that you can only write to the same NAND cell a finite number of times. The total amount of charge stored in a NAND cell is counted in dozens of electrons. The tunneling process that places the electrons on the floating gate (thus storing data) weakens the silicon oxide insulation layer that keeps the charge there. Over time, that layer degrades to the point where the cell can no longer store data, and it has to be marked as bad/unusable.

The second principle of dealing with NAND is that you can only write to NAND at the page level. In modern drives that's a granularity of 8KB.

The final piece of the puzzle, and the one that makes all of this a pain to deal with, is that you can only erase NAND at the block level, which for Intel's 25nm NAND is 256 pages (2048KB).
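To make the write/erase asymmetry concrete, here is a minimal Python sketch of a single NAND block using the page and block sizes quoted above. The `Block` class and its method names are hypothetical, invented purely for illustration; real NAND is driven through a controller, not an API like this.

```python
PAGE_SIZE_KB = 8         # write granularity quoted in the text
PAGES_PER_BLOCK = 256    # erase granularity for Intel's 25nm NAND, per the text

BLOCK_SIZE_KB = PAGE_SIZE_KB * PAGES_PER_BLOCK


class Block:
    """Toy NAND block: pages are programmed one at a time, erased all together."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None means the page is erased

    def program(self, page_index, data):
        # A page that already holds data cannot be overwritten in place.
        if self.pages[page_index] is not None:
            raise ValueError("page already programmed; erase the whole block first")
        self.pages[page_index] = data

    def erase(self):
        # There is no per-page erase: every page in the block is cleared.
        self.pages = [None] * PAGES_PER_BLOCK


block = Block()
block.program(0, b"old")
try:
    block.program(0, b"new")   # attempt an in-place overwrite
    overwrite_ok = True
except ValueError:
    overwrite_ok = False
block.erase()                  # clears all 256 pages, not just page 0
block.program(0, b"new")
```

Updating a single 8KB page in place would mean erasing its entire block, which is why controllers write the new copy elsewhere and mark the old page invalid instead.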

Modern SSDs present themselves just like hard drives do, as a linear array of logical block addresses. The OS sends an address and command to the SSD, and the controller translates that address to a physical location in NAND.

When writing to an SSD, the SSD controller must balance its desire for performance (striping writes across as many parallel NAND die as possible) with the goal of preserving NAND lifespan by writing to all cells evenly (wear leveling).
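One simple way to picture that balance is a policy that writes to every die in parallel but picks the least-erased free block on each one. The sketch below assumes exactly that policy; the function name, the block labels and the erase counts are made up for illustration, and real controllers use far more elaborate heuristics.

```python
def pick_blocks(free_blocks_by_die, erase_counts):
    """For each die, choose its least-erased free block.

    Using every die keeps writes parallel; choosing the least-worn block
    on each die spreads erases evenly across the NAND.
    """
    choice = {}
    for die, blocks in free_blocks_by_die.items():
        if blocks:
            choice[die] = min(blocks, key=lambda b: erase_counts[b])
    return choice


# Hypothetical state: two dies, two free blocks each, with erase counts.
free_blocks_by_die = {0: ["d0b0", "d0b1"], 1: ["d1b0", "d1b1"]}
erase_counts = {"d0b0": 120, "d0b1": 35, "d1b0": 8, "d1b1": 60}

targets = pick_blocks(free_blocks_by_die, erase_counts)
```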

As writes come in, new pages are allocated from a pool of free blocks. As the process of erasing a NAND cell reduces endurance, a good SSD controller will prefer allocating an empty page for new data over erasing an old block. Eventually the controller will run out of clean/empty pages to write to and will have to recycle an old block filled (sometimes only partially) with invalid data to keep operating. This process can reduce overall performance and increase wear on the NAND.
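The allocation and recycling behavior described above can be sketched as a toy flash translation layer. Everything here is an assumption made for illustration: the `ToyFTL` name, the tiny geometry, and the greedy "fewest valid pages" victim choice are not Intel's design, and surviving pages are held in memory during the erase where a real drive would rely on spare area.

```python
class ToyFTL:
    """Toy FTL: append-only writes, with garbage collection as a last resort."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.free_blocks = list(range(num_blocks))  # pool of erased blocks
        self.mapping = {}        # LBA -> (block, page)
        self.valid = {}          # (block, page) -> LBA, live data only
        self.erase_counts = [0] * num_blocks
        self.used_blocks = []    # fully written blocks
        self.active = self.free_blocks.pop(0)
        self.next_page = 0

    def write(self, lba):
        if self.next_page == self.pages_per_block:
            self._open_new_block()
        self._program(lba)

    def _program(self, lba):
        old = self.mapping.get(lba)
        if old is not None:
            del self.valid[old]  # the old copy becomes invalid; nothing is erased
        loc = (self.active, self.next_page)
        self.mapping[lba] = loc
        self.valid[loc] = lba
        self.next_page += 1

    def _open_new_block(self):
        self.used_blocks.append(self.active)
        survivors = []
        if not self.free_blocks:  # out of clean blocks: recycle one
            survivors = self._reclaim()
        self.active = self.free_blocks.pop(0)
        self.next_page = 0
        for lba in survivors:     # relocating live data costs extra writes
            self._program(lba)

    def _valid_count(self, block):
        return sum(1 for (blk, _) in self.valid if blk == block)

    def _reclaim(self):
        # Greedy victim choice: the full block with the fewest valid pages.
        victim = min(self.used_blocks, key=self._valid_count)
        survivors = [lba for (blk, _), lba in self.valid.items() if blk == victim]
        if len(survivors) == self.pages_per_block:
            raise RuntimeError("drive full: no invalid pages to reclaim")
        for lba in survivors:
            del self.valid[self.mapping.pop(lba)]
        self.used_blocks.remove(victim)
        self.erase_counts[victim] += 1   # each erase wears the block
        self.free_blocks.append(victim)
        return survivors


ftl = ToyFTL()
for i in range(40):
    ftl.write(i % 6)   # repeatedly overwrite six logical addresses
```

Erases only start once the free pool is empty, and any still-valid pages in the victim block have to be rewritten, which is where the extra wear and lost performance come from.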

When writing sequential data to an SSD it's easy to optimize for performance. Transfers can be broken up and striped across all available NAND die, and reading the data back is just as well optimized. It's random IO that causes a problem for performance. Writes to random LBA locations are combined and sent out as burst traffic to look sequential, but the mapping of those LBAs to physical NAND locations can leave the drive in a very fragmented state. With enough random data fragmented on a drive, all write performance will suffer, as the controller will no longer be able to quickly allocate large contiguous blocks of free pages across all NAND die.
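Striping a sequential transfer can be pictured as dealing its pages out to the dies round-robin. This is a simplified sketch with a hypothetical `stripe` helper; actual controllers also account for channels, planes and die busy time.

```python
def stripe(num_pages, num_dies):
    """Assign consecutive pages of one transfer to dies round-robin."""
    layout = [[] for _ in range(num_dies)]
    for page in range(num_pages):
        layout[page % num_dies].append(page)
    return layout


# An 8-page sequential write spread over 4 dies: each die gets 2 pages,
# so all four dies can be programmed in parallel.
layout = stripe(num_pages=8, num_dies=4)
```

This only works well when every die has free pages ready to accept its share, which is exactly what heavy fragmentation takes away.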


SSD in a fragmented state: white blocks represent free pages, Xes represent invalid data, colored blocks are valid data

Modern SSD controllers will attempt to defragment themselves either while the drive is in use, or during periods of idle time (hence the phrase idle garbage collection). Adequate defragmentation is necessary to maintain a drive's performance even after it has been used for a while. The best controllers do a great job of defragmenting themselves as they work, while the worst allow internal fragmentation to get out of hand.

With that recap out of the way, let's talk about how Intel's first and second generation SSD controllers worked.

The Indirection Table

There never was a true Intel X25-M G3; the third generation controller went missing after briefly appearing on Intel roadmaps. Instead we got mild revisions of the X25-M G2's controller with new features enabled through firmware. This old controller was used in the Intel SSD 320 and more recently in the Intel SSD 710.

One notable characteristic of this old controller was that it never required a large external DRAM (16 - 64MB for the early drives). Intel was proud of the fact that it stored no user data in DRAM, which I always assumed kept the size requirements down. It turns out there was another reason.

All controllers have to map logical block addresses to physical locations in NAND. This map is stored on the NAND itself (and wear leveled so it actually moves locations), but it's cached in DRAM for fast access. Intel calls this map its indirection table.

In the old drives, the indirection table was a binary tree. A binary tree is a data structure made up of nodes and branches where each node can have at most two children.


An example of an LBA-tracking binary tree. Intel's implementation is obviously far more complex, and this tree can get huge.

The old indirection table grew in size as the drive was written to. Each node kept track of a handful of fields, including the logical block address and the physical NAND location that the block mapped to. The mapping wasn't 1:1, so many nodes stored a starting LBA plus an offset, allowing a single node to refer to a range of physical locations.

As write requests came in, sequential data was stored as LBA + offset per node in the binary tree. Non-sequential data created a new node, growing the tree and increasing lookup time. The tree was kept balanced to keep searches cheap (comp sci majors will remember that there's a direct relationship between the height of a binary tree and how long it takes to perform inserts/lookups on the tree), so the creation of new nodes could sometimes be very time intensive.
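A rough sketch of this kind of range-tracking tree follows. It is a deliberately simplified stand-in, not Intel's implementation: the class and field names are invented, the tree is a plain unbalanced binary search tree (the rebalancing step is omitted), and overwrites of an already-mapped LBA are not handled.

```python
class Node:
    def __init__(self, start_lba, phys_start):
        self.start_lba = start_lba
        self.phys_start = phys_start
        self.length = 1      # number of consecutive LBAs this node covers
        self.left = None
        self.right = None


class IndirectionTree:
    """Unbalanced BST keyed on starting LBA; assumes each LBA is written once."""

    def __init__(self):
        self.root = None
        self.node_count = 0

    def record_write(self, lba, phys):
        if self.root is None:
            self.root = Node(lba, phys)
            self.node_count = 1
            return
        node = self.root
        while True:
            # Sequential in both LBA and physical space: extend the range.
            if (lba == node.start_lba + node.length
                    and phys == node.phys_start + node.length):
                node.length += 1
                return
            side = "left" if lba < node.start_lba else "right"
            child = getattr(node, side)
            if child is None:
                # Non-sequential write: the tree grows by one node.
                setattr(node, side, Node(lba, phys))
                self.node_count += 1
                return
            node = child

    def lookup(self, lba):
        node = self.root
        while node is not None:
            if node.start_lba <= lba < node.start_lba + node.length:
                return node.phys_start + (lba - node.start_lba)
            node = node.left if lba < node.start_lba else node.right
        return None          # LBA has never been written


tree = IndirectionTree()
for i in range(4):
    tree.record_write(100 + i, i)   # four sequential writes share one node
tree.record_write(7, 4)             # random writes each add a node
tree.record_write(50, 5)
```

Four sequential writes cost a single node, while the two random writes cost one node each, which is why random workloads made the old table balloon.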

Given the very small DRAM that Intel wanted on its drives (to help keep costs as low as possible) and the increasing lookup times from managing an ever expanding tree, Intel would regularly defragment/compress the tree. With enough data in the tree you could actually begin compressing various nodes in the tree down into a single node. For example we might have two separate nodes in the tree that refer to sequential physical locations, which can be combined into a single node with location + offset. The tree defrag/compression process would contribute to high latency with random IO.
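The compression step can be illustrated on a flat list of mapping entries, as if the tree had been walked in LBA order. This is an assumed, simplified model: the `(start_lba, phys_start, length)` tuple layout and the `compress` helper are hypothetical, and the real process also had to rebuild the tree afterwards.

```python
def compress(extents):
    """Merge extents that are contiguous in both LBA and physical space.

    Each extent is a (start_lba, phys_start, length) tuple.
    """
    merged = []
    for start, phys, length in sorted(extents):
        if merged:
            p_start, p_phys, p_len = merged[-1]
            if start == p_start + p_len and phys == p_phys + p_len:
                # The two entries describe one unbroken run: keep a single entry.
                merged[-1] = (p_start, p_phys, p_len + length)
                continue
        merged.append((start, phys, length))
    return merged


extents = [(0, 40, 2), (2, 42, 3), (10, 7, 1), (5, 90, 1)]
compacted = compress(extents)
```

Here the first two entries collapse into one because LBAs 0 to 4 sit at physical locations 40 to 44. The entry starting at LBA 5 stays separate: it is adjacent logically, but not physically.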

There was another problem, however. The physical NAND had to be defragmented on a regular basis to keep pages contiguous and avoid a random sprinkling of pages on each block. Without that, sequential IO performance can suffer: a large write either has to be split up amongst multiple randomly distributed blocks, or a bunch of blocks have to be erased and rewritten to make room for the new data. Once NAND was defragmented, the logical to physical mapping tree had to be updated to reflect the new mapping, and the two cleanup processes could sometimes conflict. The tree could have just finished compressing itself when the NAND would defrag itself, forcing a recompression/reorganization of the tree. The fact that both the mapping tree and physical NAND had to be defragmented, and that doing one could create more work for the other, contributed to some potentially high latencies in the old design.

The old Intel controller had to defragment both the indirection table and the physical NAND space, and the two processes could conflict, which would create some unexpectedly high latency IO from time to time. On average, Intel was able to keep this under control, but when given the opportunity to start from scratch, one major goal was to eliminate this cause of latency.


43 Comments


  • blackbrrd - Monday, November 05, 2012

    Sounds like a huge improvement for databases. The write endurance looks phenomenal!
  • FunBunny2 - Monday, November 05, 2012

    What he said!!
  • Guspaz - Monday, November 05, 2012

    I'm saddened by the increasingly enterprise-oriented focus of Intel. Their SSDs have quite a good reputation in consumer circles as providing reliable performance and operation, and their latest product line (the 330 series) definitely has consumer-level pricing. They're currently sitting at $0.78/GB on the 240GB model, which is pretty competitive with the rest of the market.

    The nice thing was that Intel is the safe bet; if you don't want to sort through all the other stuff on the market, you can feel pretty safe buying an Intel. Yes, they've had issues, but generally less than other SSD manufacturers. But with pricing like the S3700 is featuring, the days of Intel being competitive in the consumer space may be over...

    I'd rather see Intel take a two-tiered approach. By all means, keep putting out the enterprise drives for the high margins, but also keep a toe in the consumer market; they'll get a good deal of sales there based on their reputation alone.
  • karasaj - Monday, November 05, 2012

    Just because this is an enterprise SSD doesn't mean that Intel is 100% abandoning the consumer market y'know. They can focus on enterprise but still release consumer SSDs.
  • martyrant - Monday, November 05, 2012

    $235 at launch for a 100GB performance SSD will not seem too bad to the enthusiast "consumer" circle. That will, of course, drop over time, and bring it within the means of even more budget minded enthusiasts. It was not long ago people were shelling out $200-250 for 80GB Intel X-25M / G2s. I still have two in RAID 0 that I just replaced this last weekend with 4x128GB Samsung 830s in RAID 0 (for $70/piece, that's not a bad 512GB [unformatted] setup). My girlfriend's PC is inheriting the G2's. While $235 for 100GB is still on the high end, I'm sure there will be people who will pay that in the consumer market when they launch if they really do solve some of the IO issues (I have noticed quite a few with Windows 8, not so much in Windows 7 remarkably...but Win8 has serious DPC issues to begin with).
  • Omoronovo - Monday, November 05, 2012

    Windows 8 has no DPC issues. There are no updated applications that can measure DPC correctly with the deferred timing in the Windows 8 kernel, making it appear to have a constant/high DPC.

    Additionally, DPC latency has nothing to do with disk accesses. Disk accesses are not a function of interrupts in the kernel, unlike audio and video.
  • Kjella - Monday, November 05, 2012

    With the market going to even smaller process sizes and TLC the drives can't take enthusiast use anyway, my SSD life meter tells me my drive is going to die after 3.5 years - and that's after I wore out one in 1.5 years being nasty with it. Right now my C: drive is 83GB... 100GB is maybe cutting it a little short, I'd like at least 150GB, but otherwise yeah this is a drive I could want.
  • ExarKun333 - Monday, November 05, 2012

    Much of the enterprise offerings end up trickling down to consumer products. Just be patient. :)
  • Beenthere - Monday, November 05, 2012

    No offense intended but it's totally inaccurate to state that "Intel is the safe bet". They have had issues with their consumer grade SSDs like most other SSD suppliers who rush products to market without proper validation. I would not trust an Intel SSD any more than most of the other drives with few exceptions. Until an SSD company proves their product is fully compatible, reliable, doesn't change size or lose data, or disappear from the system, I'm not buying the hype.

    I'm from Missouri - the SHOW ME state.
  • martyrant - Monday, November 05, 2012

    So are you speaking from personal experience with Intel SSDs since you are from the "SHOW ME" state?

    I have 4 Intel SSDs (two G2s, two 320s) and have had zero issues with them. I bought four OCZ Vertex 4s a little over a month ago and returned all four of them because of compatibility issues and consistently appearing/disappearing in single and RAID configurations in multiple computer setups. I'd also owned a 64GB OCZ V2 that I've since given away (RMA'd it 3 times it kept dying, didn't care to bother with it after that). I have had zero issues with the Intel SSDs and am hoping to find the same reliability with the 830s I just upgraded to.

    Also, if you actually looked / did some research you would find that Intel has had a lot less issues (even though they have had some of the same Sandforce issues as other mfgs) than other companies....sometimes claiming you sit around waiting for someone to "SHOW" you the proof it sounds like you are a couch potato who still cares who wins the election because you actually think one is different than the other...and msnbc/cnn/fox/history/discovery/comedy central told you so (just saying, going out and gathering your own empirical information is worth it sometimes).
