CloudFounders: No More RAID

CloudFounders, a startup of former Terremark and Sun execs, also leverages a flash cache, but the building blocks are very different. Just like Nutanix, there is a VSA (Virtual Storage Appliance) that makes the most of a local flash cache. The cool thing is that the backend, the second tier of storage, is not a traditional RAID-based volume. The backend is either an object storage initiator that links to Amazon S3, or a storage device based on erasure coding called the Distributed Storage System (DSS). Let's start with the DSS backend.

DSS is an object-based storage system that uses “Bitspread”, an advanced and flexible erasure coding system developed by the people at Amplidata. Amplidata is a startup with a mix of Belgian and US-based infrastructure experts, and some of its directors work for both CloudFounders and Amplidata. But there is a solid technical reason why CloudFounders chose the Amplidata storage system: Bitspread is meant to be the “big storage alternative” to RAID.

As you probably know or have experienced hands-on, the current RAID implementations (RAID 5 and RAID 6) have reached their limits now that multi-terabyte disks are the norm. Rebuilding a RAID 5 array of such disks can take days, and while the rebuild runs, the array's performance and reliability are heavily degraded. RAID 6 is more reliable (although hardly 100%) but is not exactly a good performer for writes, which is another reason why VDI does not work well on a low-end or midrange SAN.
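To put those rebuild times in perspective, here is a quick back-of-the-envelope calculation; the 4 TB capacity and throughput figures below are our own illustrative assumptions, not numbers from any vendor:

```python
# Rough estimate of how long a single-disk rebuild takes (illustrative numbers only).
TB = 10**12  # bytes

def rebuild_hours(disk_bytes: float, rebuild_mb_per_s: float) -> float:
    """Time to rewrite one failed disk at a sustained rebuild rate."""
    return disk_bytes / (rebuild_mb_per_s * 10**6) / 3600

print(rebuild_hours(4 * TB, 100))  # ~11 hours: best case, array otherwise idle
print(rebuild_hours(4 * TB, 20))   # ~56 hours: rebuild competing with production I/O
```

And during that entire window a degraded RAID 5 set has no redundancy left at all, which is where the reliability concern comes from.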

“Bitspread” erasure coding, a form of Forward Error Correction (FEC), encodes data in “check-blocks”. The beauty is that you can configure the durability policy: you choose over how many disks the check-blocks should be spread and how many check-blocks you can lose before it becomes a problem. For example, you could ask it to spread a data block over 18 drives and tell the codec to make sure the original data block can be recovered from any 12 check-blocks. So it is only if you lose more than 6 drives at once that you lose your data. As the codec requires only 12 of the check-blocks to rebuild the original data object, a failure of two drives does not mean the rebuild has to happen urgently; it can be done in the background at a very slow pace while reliability stays high. You can also have the check-blocks spread over several storage modules, ensuring that you can even survive the failure of a complete disk enclosure.

Bitspread: original data (yellow) is split up, encoded with high redundancy (green) and then spread over many disks and enclosures.
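A minimal sketch of that durability-policy arithmetic, using the 18-drive/12-check-block example above. The class and property names are ours, purely for illustration; Bitspread's actual codec and API are proprietary:

```python
# Durability-policy arithmetic for an erasure-coded object store (illustrative only;
# the names are ours, not Amplidata's API).
from dataclasses import dataclass

@dataclass
class Policy:
    spread_width: int      # check-blocks written, i.e. disks the object is spread over
    required_blocks: int   # check-blocks needed to reconstruct the original object

    @property
    def max_drive_failures(self) -> int:
        # Data is only lost once MORE than this many check-blocks become unreachable.
        return self.spread_width - self.required_blocks

example = Policy(spread_width=18, required_blocks=12)
print(example.max_drive_failures)  # 6: losing two drives still leaves a margin of four,
                                   # so the rebuild can trickle along in the background
```

The wider the spread relative to the number of blocks needed for reconstruction, the more simultaneous failures the policy can absorb.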

For those who are not convinced that the small startup Amplidata is onto something: Intel and Dr. Sam Siewert of Trellis Logic explain in this paper why it can even be mathematically proven that the Reed-Solomon-based erasure codes of RAID 6 are a dead end for large storage systems. The paper concludes:

"Amplida's Bitspread is an efficient, scalable and practical alternative to the stop-gap of combined RAID levels like 6+1."

And that is exactly the reason why CloudFounders chose to build their storage system on the Amplidata backend.

The Bitspread-based DSS works with objects and is thus not a block device. A volume driver must therefore be installed that presents the DSS as a block device. This way the hypervisor can connect to an iSCSI target running on top of the volume driver, as an iSCSI target requires a block device and cannot work with the DSS's object format directly.
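Conceptually, the volume driver maps block offsets onto object keys so that the iSCSI target sees an ordinary block device. Below is a heavily simplified sketch of that translation; the in-memory dictionary stands in for the DSS backend, the 4 MiB object size is an arbitrary choice of ours, and a real driver would of course handle caching, consistency and the erasure coding itself:

```python
# Toy object-to-block translation layer: byte-addressed reads and writes
# on top of a key/value object store (illustrative sketch only).
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB per backend object, an arbitrary choice

class VolumeDriver:
    def __init__(self):
        self.store = {}  # object key -> bytes; stands in for the DSS backend

    def _object(self, key: int) -> bytearray:
        # Missing objects read back as zeroes, like an unwritten block device.
        return bytearray(self.store.get(key, b"\x00" * OBJECT_SIZE))

    def write(self, offset: int, data: bytes) -> None:
        while data:
            key, pos = divmod(offset, OBJECT_SIZE)
            chunk, data = data[:OBJECT_SIZE - pos], data[OBJECT_SIZE - pos:]
            obj = self._object(key)
            obj[pos:pos + len(chunk)] = chunk
            self.store[key] = bytes(obj)   # a real driver would erasure-encode here
            offset += len(chunk)

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length:
            key, pos = divmod(offset, OBJECT_SIZE)
            chunk = self._object(key)[pos:pos + length]
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)

vol = VolumeDriver()
vol.write(10, b"hello")
print(vol.read(10, 5))  # b'hello'
```

An iSCSI target can then be layered on top of such a block interface, which is what lets the hypervisor consume the object store as if it were a LUN.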

Bitspread is a lot more CPU intensive and needs more storage room than traditional RAID algorithms. To reduce the CPU impact, Amplidata leverages the SSE 4.2 capabilities of the latest Xeons. And since Bitspread copes so well with disk failures, it is natural to use relatively slow but cheap, high-capacity SATA disks, which offsets the capacity disadvantage compared to RAID 6. Decent media transfer rates can still be achieved because the DSS typically spreads the check-blocks over many disks.
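To make the capacity point concrete (our own comparison, not Amplidata's figures): a 16+2 RAID 6 set consumes about 1.125 bytes of raw disk per byte of data, while the 18/12 Bitspread policy above consumes 1.5 bytes, i.e. roughly a third more raw capacity in exchange for a far higher tolerance of failures:

```python
# Raw capacity consumed per byte of user data (our comparison, not vendor figures).
def overhead(total_blocks: int, data_blocks: int) -> float:
    return total_blocks / data_blocks

print(overhead(18, 16))  # RAID 6, 16 data + 2 parity: 1.125x, tolerates 2 disk failures
print(overhead(18, 12))  # Bitspread 18/12 policy:     1.5x,   tolerates 6 disk failures
```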

Comments

  • jhh - Wednesday, August 7, 2013 - link

    And Advanced/SDDC/Chipkill ECC, not the old-fashioned single-bit correct/multiple bit detect. The RAM on the disk controller might be small enough for this not to matter, but not on the system RAM.
  • tuxRoller - Monday, August 5, 2013 - link

    Amplidata's DSS seems like a better, more forward-looking alternative.
  • Sertis - Monday, August 5, 2013 - link

    The Amplistore design seems a bit better than ZFS. ZFS has a hash to detect bit rot within blocks, while this stores FEC coding that can potentially recover the data within that block without recalculating it from parity on the other drives in the stripe and the I/O that involves. It also seems to be a bit smarter about how it distributes data, by allowing you to cross between storage devices to provide recovery at the node level, while ZFS is really just limited to the current pool. It has various out-of-band data rebalancing features which aren't really present in ZFS. For example, add a second vdev to a zpool when it's 90% full and there really isn't a process to automatically rebalance the data across the two vdevs as you add more data: the original data stays on the first vdev, and new data basically sits in the second vdev. It seems very interesting, but I certainly can't afford it, so I'll stick with raidz2 for my puny little server until something open source comes out with a similar feature set.
  • Seemone - Tuesday, August 6, 2013 - link

    Are you aware that with ZFS you can specify the number of replicas each data block should have on a per-filesystem basis? ZFS is indeed not very flexible about pool layout and does not rebalance things (as of now), but there's nothing in the on-disk data structure that prevents this. That means it can be implemented and would be applicable to old pools in a non-disruptive way. ZFS, also, is open source; its license is simply not compatible with the GPLv2, hence the separate ZFS-on-Linux distribution.
  • Brutalizer - Tuesday, August 6, 2013 - link

    If you want to rebalance ZFS, you just copy the data back and forth and rebalancing is done. Assume you have data on some disks in a ZFS raid and then add new, empty disks, so all the data sits on the old disks. To spread the data evenly across all disks, you need to rebalance it. Two ways:
    1) Move all data to another server, then move it back to your ZFS raid. Now all the data is rebalanced. This requires another server, which is a pain. Instead, do it like this:
    2) Create a new ZFS filesystem on your raid. This filesystem is spread out over all disks. Move the data to the new ZFS filesystem. Done.
  • Sertis - Thursday, August 8, 2013 - link

    I'm definitely looking forward to these improvements, if they eventually arrive. I'm aware of the multiple copy solution, but if you read the Intel and Amplistore whitepapers, you will see they have very good arguments that their model works better than creating additional copies by spreading out FEC blocks across nodes. I have used ZFS for years, and while you can work around the issues, it's very clear that it's no longer evolving at the same rate since Oracle took over Sun. Products like this keep things interesting.
  • Brutalizer - Tuesday, August 6, 2013 - link

    Theory is one thing, real life another. There are many bold claims and wonderful theoretical constructs from companies, but do they hold up to scrutiny? Researchers injected artificially constructed errors into different filesystems (NTFS, Ext3, etc.), and only ZFS detected all of them, so researchers have verified that ZFS does seem to combat data corruption. Is there any research on Amplistore's ability to combat data corruption, or are there only bold claims? Until I see research from an independent third party, I will continue with the free open source ZFS. CERN is now switching to ZFS for tier-1 and tier-2 long-term storage, because vast amounts of data _will_ have data corruption, CERN says. Here are research papers on data corruption on NTFS, hardware RAID, ZFS, NetApp, CERN, etc.:
    http://en.wikipedia.org/wiki/ZFS#Data_integrity
    For instance, Tegile, Coraid, GreenByte, etc. are all storage vendors that offer petabyte enterprise servers using ZFS.
  • JohanAnandtech - Tuesday, August 6, 2013 - link

    Thanks, very helpful feedback. I will check the paper out.
  • mikato - Thursday, August 8, 2013 - link

    And Isilon OneFS? Care to review one? :)
  • bitpushr - Friday, August 9, 2013 - link

    That's because ZFS has had a minimal impact on the professional storage market.
