CloudFounders: No More RAID

CloudFounders, a startup founded by former Terremark and Sun executives, also leverages a flash cache, but the building blocks are very different. Just like Nutanix, it uses a VSA (Virtual Storage Appliance) that tries to make the most of a local flash cache. The cool thing is that the backend, the second tier of storage, is not a traditional RAID-based volume. The backend is either an object storage initiator that links to Amazon S3 or a storage device based on erasure encoding called the Distributed Storage System (DSS). Let's start with the DSS backend.

DSS is an object-oriented storage system that uses “Bitspread”, an advanced and flexible erasure encoding system developed by the people of Amplidata. Amplidata is a startup with a mix of Belgian and US-based infrastructure experts, and some of its directors work for both CloudFounders and Amplidata. But there is also a solid technical reason why CloudFounders chose to go with the Amplidata storage system: Bitspread is meant to be the “big storage alternative” to RAID.

As you probably know or have experienced hands-on, the current RAID implementations, RAID 5 and RAID 6, have reached their limits now that we have terabyte disks. A RAID 5 array of a few terabyte disks can take days to rebuild, and during that rebuild the array's performance and reliability are heavily degraded. RAID 6 is more reliable (although hardly 100%) but is not exactly a good performer for writes, which is another reason why VDI does not work well on a low-end or midrange SAN.
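To put the "days to rebuild" claim in perspective, here is a back-of-the-envelope sketch. The disk size and the sustained rebuild rates are illustrative assumptions, not measurements; a rebuild on an array that is still serving production I/O often runs far below the raw speed of the disk.

```python
def rebuild_hours(disk_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one replacement disk at a sustained rate.

    Uses decimal units (1 TB = 1,000,000 MB), as disk vendors do.
    """
    if disk_tb <= 0 or rebuild_mb_per_s <= 0:
        raise ValueError("disk size and rebuild rate must be positive")
    disk_mb = disk_tb * 1_000_000
    return disk_mb / rebuild_mb_per_s / 3600


# Assumed rates: an idle array, a moderately busy one, a heavily loaded one.
for rate in (100, 30, 10):
    hours = rebuild_hours(3, rate)
    print(f"3 TB disk at {rate:>3} MB/s: {hours:6.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, a 3 TB disk rebuilt at a throttled 10 MB/s needs roughly 83 hours, which is about three and a half days of degraded operation.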

“Bitspread” erasure encoding, also called Forward Error Correction Code (FEC), encodes data in “check-blocks”. The beauty is that you can configure the durability policy. In other words you can choose over how many disks these check-blocks should be spread and how many check-blocks you can lose before it becomes a problem. For example you could ask it to spread the datablock over 18 drives and tell the codec to make sure you can recover the original datablock from 12 check-blocks. So it's only if you lose more than 6 drives at once that you lose your data. As the codec requires only 12 of the check blocks to rebuild the original data object, a failure of two drives does not mean the rebuild should happen urgently. The rebuild can be done in the background at a very slow pace while the reliability stays high. You can also have the check-blocks spread over several storage modules, ensuring that you even survive a failure of a complete disk enclosure.
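The arithmetic behind such a durability policy can be sketched in a few lines. This is not Amplidata's code: the function names are hypothetical, and the loss probability assumes that drives fail independently with the same probability within one repair window, which is a simplification.

```python
from math import comb


def policy_summary(spread: int, needed: int) -> dict:
    """Basic properties of a policy that spreads `spread` check-blocks
    and needs any `needed` of them to recover the original data."""
    if not 0 < needed <= spread:
        raise ValueError("need 0 < needed <= spread")
    return {
        "tolerated_losses": spread - needed,   # drives you can lose safely
        "storage_overhead": spread / needed,   # raw bytes per usable byte
    }


def loss_probability(spread: int, needed: int, p_fail: float) -> float:
    """Probability that more than (spread - needed) drives fail,
    assuming independent failures with probability p_fail each."""
    tolerated = spread - needed
    return sum(
        comb(spread, k) * p_fail**k * (1 - p_fail) ** (spread - k)
        for k in range(tolerated + 1, spread + 1)
    )


summary = policy_summary(18, 12)
print(summary["tolerated_losses"], summary["storage_overhead"])
print(loss_probability(18, 12, 0.05))
```

For the 18/12 example above, the policy tolerates 6 simultaneous drive losses at a cost of 1.5 bytes of raw capacity per byte stored.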

Bitspread: original data (yellow) is split up, encoded with high redundancy (green) and then spread over many disks and enclosures.

For those who are not convinced that the small startup Amplidata is onto something: Intel and Dr. Sam Siewert of Trellis Logic explain in this paper why it can even be mathematically proven that the Reed-Solomon based erasure codes of RAID 6 are a dead end road for large storage systems. The paper concludes:

"Amplidata's Bitspread is an efficient, scalable and practical alternative to the stop-gap of combined RAID levels like 6+1."

And that is exactly the reason why CloudFounders chose to build their storage system on the Amplidata backend.

The DSS based on “Bitspread” works with objects and is thus not a block device. A volume driver must be installed that converts the DSS into a block device. This way the hypervisor can connect to an iSCSI target that is running on top of the volume driver, as an iSCSI target requires a block device and does not recognize the format of the DSS.
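To illustrate what "converting an object store into a block device" means, here is a toy sketch. It is not the CloudFounders volume driver, whose internals the article does not describe; the class name, the key format and the read-modify-write approach are all assumptions made for illustration, and a plain dict stands in for the object store.

```python
class ObjectBackedVolume:
    """Toy block-device view over a key/value object store."""

    def __init__(self, store: dict, block_size: int = 4096,
                 blocks_per_object: int = 256):
        self.store = store
        self.block_size = block_size
        self.blocks_per_object = blocks_per_object
        self.object_size = block_size * blocks_per_object

    def _locate(self, lba: int):
        """Map a logical block address to (object key, byte offset)."""
        obj_index, block_in_obj = divmod(lba, self.blocks_per_object)
        return f"vol-{obj_index:08d}", block_in_obj * self.block_size

    def write_block(self, lba: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        key, offset = self._locate(lba)
        # Read-modify-write of the whole object: simple, but not efficient.
        obj = bytearray(self.store.get(key, bytes(self.object_size)))
        obj[offset:offset + self.block_size] = data
        self.store[key] = bytes(obj)

    def read_block(self, lba: int) -> bytes:
        key, offset = self._locate(lba)
        obj = self.store.get(key)
        if obj is None:
            return bytes(self.block_size)  # never-written blocks read as zeros
        return obj[offset:offset + self.block_size]


store = {}
volume = ObjectBackedVolume(store, block_size=4, blocks_per_object=2)
volume.write_block(3, b"abcd")
print(volume.read_block(3), sorted(store))
```

An iSCSI target sitting on top of such a layer only ever sees numbered blocks; the mapping to objects stays hidden below it.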

Bitspread is a lot more CPU intensive and needs more storage room than traditional RAID algorithms. To reduce the CPU impact, Amplidata leverages the SSE 4.2 capabilities of the latest Xeons. As Bitspread copes so well with disk failures, it is natural to use relatively slow SATA disks, which negates the capacity disadvantage compared to RAID 6. Decent media transfer can still be achieved as the DSS typically spreads the check-blocks over many disks.
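The capacity disadvantage is easy to quantify. The sketch below compares the raw capacity needed for the same usable capacity under RAID 6 and under a spread/needed policy; the 8-disk RAID 6 group and the 18/12 policy are example values, and the function names are hypothetical.

```python
def usable_fraction_raid6(disks: int) -> float:
    """Fraction of raw capacity left for data in one RAID 6 group
    (two disks' worth of parity per group)."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) / disks


def usable_fraction_spread(spread: int, needed: int) -> float:
    """Fraction of raw capacity left for data when any `needed`
    out of `spread` check-blocks can rebuild the object."""
    if not 0 < needed <= spread:
        raise ValueError("need 0 < needed <= spread")
    return needed / spread


def raw_tb_needed(usable_tb: float, usable_fraction: float) -> float:
    """Raw capacity required to offer `usable_tb` of usable space."""
    return usable_tb / usable_fraction


for label, fraction in (
    ("RAID 6, 8 disks", usable_fraction_raid6(8)),
    ("spread 18 / need 12", usable_fraction_spread(18, 12)),
):
    print(f"{label:>20}: {raw_tb_needed(100, fraction):.1f} TB raw for 100 TB usable")
```

With these example values the erasure-coded layout needs more raw capacity than the RAID 6 group, but it survives six drive losses instead of two.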


60 Comments


  • Brutalizer - Sunday, August 11, 2013 - link

    bitpushr,
    "That's because ZFS has had a minimal impact on the professional storage market."

    That is ignorant. If you had followed the professional storage market, you would have known that ZFS is the most widely deployed storage system in the Enterprise. ZFS systems manage 3-5x more data than NetApp, and manage more data than NetApp and EMC Isilon combined. ZFS is the future and eating others' cake:
    http://blog.nexenta.com/blog/bid/257212/Evan-s-pre...
  • blak0137 - Monday, August 5, 2013 - link

    The Amplidata Bitspread data protection scheme sounds a lot like the OneFS filesystem on Isilon.

    A note on the NetApp section: the NVRAM does not store the hottest blocks; rather, it is only used for correlating writes to allow destaging entire RAID-group-wide stripes onto disk at once. This utilization of NVRAM in NetApp, along with the write characteristics of the WAFL filesystem, allows RAID-DP (NetApp's slightly customized version of RAID-6) to have write performance similar to RAID-10 with a much smaller usable space penalty, up to approximately 85-90% space utilization. Read cache is always held in RAM on the controller and the FlashCache (formerly PAM) cards supplement that RAM-based cache. A thing to remember about the size of the FlashCache cards is that the space still benefits from the data efficiency features of Data ONTAP, such as deduplication and compression, so applications such as VDI get a massive boost in performance.
  • enealDC - Monday, August 5, 2013 - link

    I think you also need to discuss the effect of OSS or very low cost solutions that can be built on white box hardware. Those cause far greater disruptions than anything I can think of!
    SCST and COMSTAR to name a few.
  • Ammohunt - Monday, August 5, 2013 - link

    One thing I didn't see mentioned is that in the good old days you spread the I/O out across many spindles, which was a huge advantage of SCSI, which was geared towards such a configuration. As drive sizes have increased, the number of spindles has gone down, adding more latency. The fact is that expensive SSD type storage systems are not needed in most medium sized businesses. Their data needs can in most cases be served spectacularly by a well architected tiered storage model.
  • mryom - Monday, August 5, 2013 - link

    There's something missing - take a look at Pernix Data - That's disruptive, and vSphere 5.5 is also gonna be a game changer. Software Defined Storage is the way forward - We just need space for more disks in blade servers
  • davegraham - Tuesday, August 6, 2013 - link

    SDS is an EMC-marchitecture discussion (a la ViPR). I'd suggest that you avoid conflating what a marketing talking head discusses with what technology can actually do. :)
  • Kevin G - Monday, August 5, 2013 - link

    My understanding is that enterprise storage isn't necessarily about the hardware but rather the software interface and support that comes with it. NetApp for example will dial home and order replacements for failed hard drives for you. Various interfaces I've used allow for the logical creation of multiple arrays across multiple controllers, each using a different RAID type. I have no sane reason why someone would want to do that but the option is there and supported for the crazies.

    As far as performance goes, NVMe and SATA Express are clearly the future. I'm surprised that we haven't seen any servers with hot swap mini-PCIe slots. With two lanes going to each slot, a single socket Sandy Bridge-e chip could support twenty of those small form factor cards in the front of a 1U server. At 500 GB a piece, that is 10 TB of preformatted storage, not far off of the 16 TB preformatted possible today using hard drives. Cost of course will be higher than disk but speeds are ludicrous.

    Going with standard PCIe form factors for storage only makes sense if there are tons of channels connected to the controller and they are PCIe native. So far the majority of offers stick a hardware RAID chip with several SATA SSD controllers onto a PCIe card and call it a day.

    Also for the enterprise market, it would be nice to have a PCIe SSD with an out of band management port that communicates via Ethernet and can fully function if the switch on the other end supports power over ethernet. The entire host could be fried but data could still potentially be recovered. Also works great for hardware configuration like on some Areca cards.
  • youshotwhointhewhatnow - Monday, August 5, 2013 - link

    The first link on "Cloudfounders: No More RAID" appears to be broken (http://www.amplidata.com/pdf/The-RAID Catastrophe.pdf).

    I read through the second link on that page (the Intel paper). I wouldn't consider that paper as unbiased considering Intel is clearly trying to use it to sell more Xeon chips. Regardless, I don't think your statement "mathematically proven that the Reed-Solomon based erasure codes of RAID 6 are a dead end road for large storage systems" is justified. Sure RAID6 will eventually give way to RAID7 (or RAIDZ2 in ZFS terms), but this still uses Reed-Solomon codes. The Intel paper just shows that RAID6+1 has much worse efficiency with slightly worse durability compared to Bitspread. The same could be said for RAID7 (instead of Bitspread), which really should have been part of the comparison.

    Another strange statement in the Intel paper is "Traditional erasure coding schemes implemented by competitive storage solutions have limited device-level BER protection (e.g., 4 four bit errors per device)". Umm, with non-degraded RAID6 you could have as many UREs as you like provided less than three occur on the same stripe (or less than two for a degraded array). Again RAID7 allows even more UREs in the same stripe.

    This is not to say that the Bitspread technique isn't interesting, but you seem to be a little too quick to drink the kool-aid.
  • name99 - Tuesday, August 6, 2013 - link

    I imagine the reason people are quick to drink the koolaid is that convolutional FEC codes have proved how well they work through much wireless experience. Loss of some Amplidata data is no different from puncturing, and puncturing just works --- we experience it every time we use WiFi or cell data.

    I also wouldn't read too much into Intel's support here. Obviously running a Viterbi algorithm to cope with a punctured convolutional code is more work than traditional parity-type recovery --- a LOT more work. And obviously, the first round of software you write to prove to yourself that this all works, you're going to write for a standard CPU. Intel is the obvious choice, and they're going to make a big deal about how they were the obvious choice.

    BUT the obvious next step is to go to Qualcomm or Broadcom and ask them to sell you a Viterbi cell, which you put on a SOC along with an ARM front-end, and hey presto --- you have a $20 chip you can stick in your box that's doing all the hard work of that $1500 Xeon.

    The point is, convolutional FEC is operating on a totally different dimension from block parity --- it is just so much more sophisticated, flexible, and powerful. The obvious thing that is being trumpeted here is destruction of one or more blocks in the storage device, but that's not the end of the story. FEC can also handle point bit errors. Recall that a traditional drive (HD or SSD) has its own FEC protecting each block, but if enough point errors occur in the block, that FEC is overwhelmed and the device reports a read error. NOW there is an alternative --- the device can report the raw bad data up to a higher level which can combine it with data from other devices to run the second layer of FEC --- something like a form of Chase combining.

    Convolutional codes are a good start for this, of course, but the state of the art in WiFi and telco is LDPCs, and so the actual logical next step is to create the next device based not on a dedicated convolutional SOC but on a dedicated LDPC SOC. Depending on how big a company grows, and how much clout they eventually have with SSD or HD vendors, there's scope for a whole lot more here --- things like using weaker FEC at the device level precisely because you have a higher level of FEC distributed over multiple devices --- and this may allow you a 10% or more boost in capacity.
  • meorah - Monday, August 5, 2013 - link

    you forgot another implication of scale-out software design. namely, the ability to bypass flash completely and store your most performance intensive workloads that use your most expensive software licensing directly in-memory. 16 gigs to run the host, the other 368 gigs as a nice RAM drive.
