The New Indirection Table

While the binary tree structure was great for sequential IO performance and for keeping DRAM sizes low, it wasn't good for lowering random IO latency. The S3700 controller completely does away with the old indirection table.

The new controller ditches the binary tree entirely and moves to a completely flat structure with 1:1 mapping: a giant array in which each location maps to a specific portion of NAND. The array isn't dynamically created and, since it's a 1:1 mapping, searches, inserts and updates are all very fast.
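To make the idea concrete, here is a minimal sketch of a flat, preallocated lookup table. It is only an illustration of the general technique, not Intel's firmware: the class name, the notion of logical and physical page numbers, and the sentinel value are all assumptions made for the example.

```python
from array import array

# Hypothetical sentinel meaning "this logical page has no physical location yet".
UNMAPPED = 0xFFFFFFFF


class FlatIndirectionTable:
    """Illustrative flat table: one slot per logical page, allocated up front."""

    def __init__(self, logical_pages):
        # The whole array is created once, sized for the full drive;
        # nothing is allocated or rebalanced later.
        self.table = array("L", [UNMAPPED]) * logical_pages

    def lookup(self, logical_page):
        # A lookup is a single index operation, with no tree traversal.
        physical_page = self.table[logical_page]
        return None if physical_page == UNMAPPED else physical_page

    def update(self, logical_page, physical_page):
        # Inserts and updates overwrite one slot in place.
        self.table[logical_page] = physical_page


table = FlatIndirectionTable(logical_pages=8)
table.update(3, 42)          # logical page 3 now lives at physical page 42
print(table.lookup(3))       # 42
print(table.lookup(5))       # None, never written
```

Every operation touches exactly one array slot, which is why the cost stays constant no matter how the drive has been written to.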

The other benefit of being 1:1 mapped with physical NAND is that there's no need to defragment the table, which immediately cuts down the amount of work the controller has to do. Drives based on this new controller only have to keep the NAND defragmented.

The downside to all of this is the DRAM area required by the new flat indirection table. The old binary tree was very space efficient, while the new array is just huge, and the amount of DRAM it needs scales with the capacity of the drive. In its largest implementation (800GB), Intel needs a full 1GB of DRAM to store the indirection table. By my calculations, the table itself should require roughly 100MB of DRAM per 100GB of storage space on the drive. Intel appears to be using DDR3-1333 for the DRAM on board S3700 drives.
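A quick back-of-the-envelope check reproduces that estimate. The sketch below assumes one 4-byte entry per 4KB of storage; both numbers are assumptions chosen because they land near the 100MB per 100GB figure, not details Intel has confirmed.

```python
def table_bytes(capacity_bytes, page_bytes=4096, entry_bytes=4):
    """Size of a flat table with one entry per mapped page.

    page_bytes and entry_bytes are assumed values, not published specs.
    """
    entries = capacity_bytes // page_bytes
    return entries * entry_bytes


for capacity_gb in (100, 800):
    size = table_bytes(capacity_gb * 10**9)
    print(f"{capacity_gb}GB drive: about {size / 10**6:.0f}MB of table")
```

Under these assumptions the 800GB drive needs a table of about 781MB, which would fit inside the 1GB of DRAM with some room to spare.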

There's a bit of space left over after you account for the new indirection table. That area is reserved for a cache of the controller's firmware so it doesn't have to read from slow flash to access it.

Once again, there's no user data stored in the external DRAM. The indirection table itself is physically stored in NAND (just cached in DRAM), and there are two large capacitors on-board to push any updates to non-volatile storage in the event of power loss.

It sounds like a simple change, but building this new architecture took quite a bit of work. The results, if they are anywhere close to what Intel is promising, are pretty awesome.

Final Words

The Intel SSD DC S3700 appears to be a very promising new architecture from Intel. If it ends up performing as Intel promised, the S3700 controller could be the beginning of a new era in SSD performance - one focused on consistency of performance, not just absolute performance. As soon as we run samples through our test suite you can expect a full review, putting Intel's claims to the test. Stay tuned.


43 Comments


  • Kevin G - Monday, November 5, 2012

    There is mention of a large capacitor to allow for writing the cache to NAND in the event of a power failure.

    There are a couple of things Intel can do in this event to eliminate the possibility of cache corruption.

    First is write-through of any immediate change to the indirection tables. The problem of coherence between the cache and NAND would still exist but wouldn't require writing the entire cache to NAND. Making the DRAM cache write-through would impact the write/erase cycles of the drive, but I'm uncertain of the magnitude in comparison to heavy write IO.

    The second option is that if the DRAM is used to create an optimized version of the directory tables for read only purposes, the old table in the NAND would still be valid (unless there needs to be change due to a write). Thus power loss would only lose the optimized table in DRAM but the unoptimized would still be functional in the NAND.

    The third option involves optimized tables being written to disk while the unoptimized version is still in use in NAND. The last operation of writing the optimized indirection table to disk would be switching the status of what table is in active use. Thus only the optimized table is put into use after it has successfully been written to NAND. Sudden power failure in this process wouldn't impact the drive.

    A fourth idea that comes to mind would be to make a reservation where the next optimized table would exist in NAND. Thus in the event of a sudden power failure, the SSD will use the unoptimized indirection tables but be able to see if anything has been written to the reserved space - it would know if it suffered a power loss and could take any recovery actions as necessary. This would eat space, as the active table, a table being written and space for a future table to be written would all be 'in use'.
  • cdillon - Monday, November 5, 2012

    Personally, I don't care if an SSD stores my user data (acknowledged writes, specifically) and/or internal metadata in a DRAM cache as long as it is battery and/or capacitor backed so that cache can be flushed to NAND after a power failure.

    I think what I originally intended to say in my first comment was if Intel is not caching user data in DRAM, then what ARE they caching in DRAM that requires the super-capacitors to give them time to write it to NAND? If it isn't user data, then it must be the indirection tables or some other critical internal metadata. This internal metadata is at least as important as the user data itself, so why even make the distinction? The distinction stinks to me as either a marketing ploy or catering to some outdated PHB "requirement" that they need to meet in order to actually sell these drives to some enterprises. I'm not saying it's bad, just odd and probably non-optimal.
  • Kevin G - Monday, November 5, 2012

    It is likely buffering the indirection table writes to reduce the number of NAND writes. Essentially it helps with the drive's overall endurance. How much so would be dependent on just how frequently the indirection table is written to.

    The other distinction is that they could be hitting an access time limitation by reading the indirection tables from NAND and then reading the data. By caching this in DRAM, the controller can lower access latencies to the NAND itself.
  • nexox - Monday, November 5, 2012

    Not storing user data in DRAM still helps - it forces the drive controller to actually operate efficiently instead of just fixing problems with more write cache. The indirection table doesn't change all that fast, so there won't be that much of it to flush out to NAND on power loss, but it's easy to build up a lot of user data in write cache, which requires that much more capacitance to get durably written.

    And FYI, many SSDs will acknowledge a write when the data hits NAND durably, but will not guarantee that the corresponding indirection table entry is durably stored, so on power failure some blocks may appear to revert to their old state, from before the synced write took place.
  • Death666Angel - Tuesday, November 6, 2012

    "Not storing user data in DRAM still helps - it forces the drive controller to actually operate efficiently instead of just fixing problems with more write cache."
    And why should I care how the problem is fixed?
    Efficient programming or throwing more hardware at the problem is the same thing for 99% of the usage cases. If maybe power consumption is a problem, then one solution might work better than another, but for the most part, a fix is a fix, at least in my book.
  • Kevin G - Tuesday, November 6, 2012

    How the problem is fixed would matter to enterprise environments where reliability reigns supreme. How an issue is fixed in this area matters in the context of it happening again, just under different circumstances.

    In this example, throwing more DRAM as a write cache for SSDs would be appropriate for consumers to address the issue but not necessarily the enterprise market. Keeping data in flash maintains data integrity, which matters in scenarios of sudden power failure. The thing is that enterprise markets have a different usage scenario where the large write buffer that resolved the issue for consumers could still be an issue at the enterprise level (i.e. the SSD would need an even larger DRAM buffer).
  • Bullwinkle J Moose - Monday, November 5, 2012

    Did I miss something?

    With 1:1 mapping, this sounds like the world's first truly O.S. agnostic controller

    Does it require an O.S. with Trim or a partition offset for XP use, or did Intel just make the world's first universal SSD?

    The 320 may have handled partition offsets internally but still required Trim for best performance

    Please correct me if I'm wrong
  • jwilliams4200 - Tuesday, November 6, 2012

    You're wrong. You have misunderstood how the indirection table works.
  • iwod - Monday, November 5, 2012

    The only new and truly innovative part of this controller is actually the software side of things: 1:1 mapping and basically a super fast table in ECC RAM for updating and deleting.

    Couldn't 70 - 90% of this performance gain be implemented with other controllers if they had large enough ECC DRAM?

    Please correct me if I'm wrong

    And what is the variation in random I/O on other enterprise-class SSDs like Fusion IO?
  • MrSpadge - Tuesday, November 6, 2012

    To me it sounds like this change requires an entirely different controller design, or at least a checking & rethinking of major parts. Intel surely didn't tell us everything that changed, just the most important result of the changes.
