A Quick Flash Refresher

DRAM is very fast. Writes happen in nanoseconds, as do CPU clock cycles, so the two get along very well. The problem with DRAM is that it's volatile storage; if the charge stored in each DRAM cell isn't refreshed, it's lost. Pull the plug and whatever you stored in DRAM will eventually disappear (and unlike most other changes, "eventually" here means fractions of a second).

Magnetic storage, on the other hand, is not very fast. It's faster than writing trillions of numbers down on paper, but compared to DRAM it plain sucks. For starters, magnetic disk storage is mechanical - things have to physically move to read and write. Now it's impressive how fast these things can move and how accurate and relatively reliable they are given their complexity, but to a CPU, they are slow.

The fastest consumer hard drives take 7 milliseconds to read data off of a platter. The fastest consumer CPUs can do something with that data in one hundred thousandth that time.

The only reason we put up with mechanical storage (HDDs) is that the drives are cheap, store tons of data and are non-volatile: the data is still there even when you turn 'em off.

NAND flash gives us the best of both worlds. It is effectively non-volatile (flash cells can lose their charge, but only after about a decade) and relatively fast (data accesses take microseconds, not milliseconds). Through electron tunneling a charge is inserted into an N-channel MOSFET. Once the charge is in there, it's there for good - no refreshing necessary.


N-Channel MOSFET. One per bit in a NAND flash chip.

One MOSFET is good for one bit. Group billions of these MOSFETs together, in silicon, and you've got a multi-gigabyte NAND flash chip.

The MOSFETs are organized into lines, and the lines into groups called pages. These days a page is usually 4KB in size. NAND flash can't be written to one bit at a time; it's written at the page level, so 4KB at a time. Once you write the data though, it's there for good. Erasing is a bit more complicated.

Coaxing the charge out of the MOSFETs requires a bit more effort. The way NAND flash works, you can't discharge a single MOSFET; you have to erase in larger groups called blocks. NAND blocks are commonly 128 pages, which means that if you want to re-write a page in flash you first have to erase it and all 127 adjacent pages. And allow me to repeat myself: if you want to overwrite 4KB of data in a full block, you need to erase and re-write 512KB of data.
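The arithmetic behind that last point can be sketched in a few lines of Python. This is a toy model, not how any real controller is implemented: the block is just a list, and overwrite_page is a hypothetical helper name. The 4KB page and 128-page block sizes are the ones quoted above.

```python
PAGE_SIZE_KB = 4        # page size quoted in the article
PAGES_PER_BLOCK = 128   # block size quoted in the article


def overwrite_page(block, page_index, new_data):
    """Overwrite one page of a block in place; return the KB physically rewritten.

    `block` is a list of page contents, with None meaning an erased page.
    """
    # A programmed page can't be overwritten, so stage a copy of the whole block.
    staged = list(block)
    staged[page_index] = new_data

    # Erase only works on the whole block.
    for i in range(len(block)):
        block[i] = None

    # Re-program every page that holds data, including the 127 untouched ones.
    pages_written = 0
    for i, data in enumerate(staged):
        if data is not None:
            block[i] = data
            pages_written += 1
    return pages_written * PAGE_SIZE_KB


block = [f"old-{i}" for i in range(PAGES_PER_BLOCK)]  # a completely full block
rewritten_kb = overwrite_page(block, 5, "new")
print(rewritten_kb)  # 512
```

A 4KB change to a full block costs 512KB of flash writes in this naive scheme, which is exactly the penalty a good controller works to avoid.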

To make matters worse, every time you write to a flash page you reduce its lifespan. The JEDEC spec for MLC (multi-level cell) flash is 10,000 writes before the flash can start to fail.
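To get a feel for what 10,000 writes means in practice, here is a back-of-the-envelope estimate in Python. Everything except the 10,000-cycle figure is an assumption made up for illustration: an 80GB drive, 20GB of host writes per day, perfectly even wear across all cells, and a write_amplification factor (flash writes divided by host writes) of either 1 for the ideal case or 128 for the naive case where every 4KB overwrite rewrites a full 512KB block.

```python
def estimated_lifetime_years(capacity_gb, write_cycles, host_writes_gb_per_day,
                             write_amplification):
    """Rough lifetime estimate, assuming wear is spread perfectly evenly."""
    # Total data the flash can absorb before cells reach their rated write count.
    total_flash_writes_gb = capacity_gb * write_cycles
    # Data actually written to flash per day, after amplification.
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_flash_writes_gb / flash_writes_gb_per_day / 365


# Hypothetical workload: 80GB drive, 20GB written per day.
ideal = estimated_lifetime_years(80, 10_000, 20, 1.0)
naive = estimated_lifetime_years(80, 10_000, 20, 128.0)
print(round(ideal, 1))  # 109.6
print(round(naive, 1))  # 0.9
```

Under these made-up numbers the same flash lasts over a century or under a year depending only on how the writes are managed.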

Dealing with all of these issues requires that controllers get very crafty with how they manage writes. A good controller must split writes up among as many flash channels as possible, while avoiding writing to the same pages over and over again. It must also deal with the fact that some data is going to get frequently updated while other data will remain stagnant for days, weeks, months or even years. It has to detect all of this and organize the drive in real time without knowing anything about how it is you're using your computer.
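One piece of that job, avoiding writing to the same place over and over, can be illustrated with a minimal sketch. This is not any vendor's actual algorithm; WearLevelingAllocator and its methods are hypothetical names, and the only policy shown is "always hand out the free block with the fewest erases."

```python
import heapq


class WearLevelingAllocator:
    """Toy allocator that always hands out the least-erased free block."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id); the least-worn block pops first.
        self._free = [(0, block_id) for block_id in range(num_blocks)]
        heapq.heapify(self._free)
        self.erase_counts = [0] * num_blocks

    def allocate(self):
        """Take the least-worn block out of the free pool."""
        _, block_id = heapq.heappop(self._free)
        return block_id

    def release(self, block_id):
        """Erase a block (costing one cycle) and return it to the free pool."""
        self.erase_counts[block_id] += 1
        heapq.heappush(self._free, (self.erase_counts[block_id], block_id))


alloc = WearLevelingAllocator(num_blocks=4)
for _ in range(1000):
    block_id = alloc.allocate()   # write lands on the least-worn block
    alloc.release(block_id)       # block is later invalidated and erased
print(alloc.erase_counts)  # [250, 250, 250, 250]
```

Even though the same logical data is rewritten 1,000 times here, no single block sees more than 250 erases. A real controller also has to handle the stagnant data mentioned above, which this sketch ignores.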

It's a tough job.

But not impossible.

295 Comments

  • Anand Lal Shimpi - Monday, August 31, 2009 - link

    The tables the drive needs to operate are also stored in a small amount of flash on the drive. The start of the circular logic happens in firmware which points to the initial flash locations, which then tells the controller how to map LBAs to flash pages.

    Take care,
    Anand
  • Bakkone - Monday, August 31, 2009 - link

    Any gossip about the new SATA?
  • Zaitsev - Monday, August 31, 2009 - link

    Thanks for the great article, Anand! It's been quite entertaining thus far.
  • cosmotic - Monday, August 31, 2009 - link

    The page about sizes (GB, GiB, spare areas, etc) is very confusing. It sounds very much like you are confusing the 'missing' space when converting from GB to GiB with the space the drive is using for its spare area.

    Is it the case that the drive has 80GiB internally, uses 5.5GiB for spare, and reports its size as 80GB to the OS, leaving the OS to say 74.5GiB as usable?
  • Anand Lal Shimpi - Monday, August 31, 2009 - link

    I tried to keep it simple by not introducing the Gibibyte but I see that I failed there :)

    You are correct, the drive has 80GiB internally, uses 5.5GiB for spare and reports that it has 156,301,488 sectors (or 74.5GiB) of user addressable space.

    Take care,
    Anand
  • sprockkets - Tuesday, September 1, 2009 - link

    Weird. So what you are saying is, the drive has 80GiB capacity, but then reports it has 80GB to the OS, is advertised as having an 80GB capacity, and the OS then says the capacity is 74.5GiB?
  • sprockkets - Tuesday, September 1, 2009 - link

    As a quick followup, this whole SI vs binary thing needs to be clarified using the proper terms, as people like Microsoft and others have been saying GB when it really is GiB (or was the GiB term invented later?)

    For those who want a quick way to convert:

    http://converter.50webs.com
  • ilkhan - Monday, August 31, 2009 - link

    so they are artificially bringing the capacity down, because the drive has the full advertised capacity and is getting the "normal" real capacity. :argh:
  • Vozer - Monday, August 31, 2009 - link

    I tried looking for the answer, but haven't found it anywhere so here it is: There are 10 flash memory blocks on both Intel 160GB and 80GB X25-M G2, right? (and 20 blocks with the G1).

    So, is the 80GB version actually a 160GB with some bad blocks or do they actually produce two different kinds of flash memory block to use on their drives?
  • Anand Lal Shimpi - Monday, August 31, 2009 - link

    While I haven't cracked open the 80GB G2 I have here, I don't believe the drives are binned for capacity. The 80GB model should have 10 x 8GB NAND flash devices on it, while the 160GB model should have 10 x 16GB NAND flash devices.

    Take care,
    Anand
