CloudFounders: No More RAID

CloudFounders, a startup of former Terremark and Sun execs, also leverages a flash cache, but the building blocks are very different. Just like Nutanix, there is a VSA (Virtual Storage Appliance) that tries to make the best of a local flash cache. The cool thing is that the backend, the second tier of storage, is not a traditional RAID-based volume. The backend is either an object storage initiator that links to Amazon S3 or a storage device based upon erasure encoding called the Distributed Storage System (DSS). Let's start with the DSS backend.

DSS is an object-oriented storage system that uses "Bitspread", an advanced and flexible erasure encoding system developed by the people at Amplidata. Amplidata is a startup with a mix of Belgian and US-based infrastructure experts, and some of its directors work for both CloudFounders and Amplidata. But there is a solid technical reason why CloudFounders chose to go with the Amplidata storage system: Bitspread is meant to be the "big storage alternative" to RAID.

As you probably know or have experienced hands-on, the current RAID implementations (RAID 5 and RAID 6) have reached their limits now that we have multi-terabyte disks. Rebuilding an array of such disks in RAID 5 can take days, and while the rebuild runs, the array's performance and reliability are heavily degraded. RAID 6 is more reliable (although hardly 100%) but is not exactly a good performer for writes, which is another reason why VDI does not work well on a low-end or midrange SAN.
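To put a rough number on that claim, here is a back-of-the-envelope sketch; the capacity and rebuild rates are assumptions for illustration, not measurements of any particular array:

```python
# Back-of-the-envelope RAID rebuild time estimate.
# All figures are assumed for illustration; real rebuild rates depend on
# controller load, drive health and concurrent host I/O.

disk_capacity_tb = 4        # capacity of the failed disk (assumed)
rebuild_rate_mb_s = 30      # sustained rebuild rate under production load (assumed)

capacity_mb = disk_capacity_tb * 1_000_000
rebuild_seconds = capacity_mb / rebuild_rate_mb_s
print(f"Estimated rebuild time: {rebuild_seconds / 3600:.0f} hours "
      f"(~{rebuild_seconds / 86400:.1f} days)")
# At 30 MB/s this is roughly 37 hours; throttled to 10 MB/s by production
# traffic it stretches to about 4.6 days, with degraded performance throughout.
```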

“Bitspread” erasure encoding, also called Forward Error Correction Code (FEC), encodes data into “check-blocks”. The beauty is that the durability policy is configurable: in other words, you can choose over how many disks these check-blocks should be spread and how many check-blocks you can lose before it becomes a problem. For example, you could ask it to spread a data block over 18 drives and tell the codec to make sure the original data block can be recovered from any 12 check-blocks. So it is only if you lose more than 6 drives at once that you lose your data. As the codec requires only 12 of the check-blocks to rebuild the original data object, a failure of two drives does not mean the rebuild has to happen urgently. The rebuild can be done in the background at a very slow pace while reliability stays high. You can also have the check-blocks spread over several storage modules, ensuring that you survive even the failure of a complete disk enclosure.
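A minimal sketch of how such an (n, k) policy translates into overhead and fault tolerance; it mirrors the 18/12 example above, and the simple binomial estimate at the end is only an illustration of the idea, not Amplidata's actual durability model:

```python
from math import comb

# (n, k) erasure coding policy: each data block is encoded into n check-blocks,
# any k of which are sufficient to reconstruct the original block.
n, k = 18, 12

storage_overhead = n / k        # raw capacity consumed per byte of user data
tolerated_failures = n - k      # simultaneous drive losses before data is lost

print(f"Storage overhead: {storage_overhead:.2f}x")            # 1.50x
print(f"Tolerates {tolerated_failures} simultaneous drive failures")

# Rough probability that a block becomes unrecoverable, assuming each of the
# n drives fails independently with probability p during a given time window.
p = 0.01
p_loss = sum(comb(n, f) * p**f * (1 - p)**(n - f)
             for f in range(tolerated_failures + 1, n + 1))
print(f"P(block lost) with p_drive = {p}: {p_loss:.1e}")
```

The point of the exercise is the asymmetry: losing one or two drives barely moves the loss probability, which is why the rebuild can run lazily in the background instead of becoming an emergency.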

Bitspread: original data (yellow) is split up, encoded with high redundancy (green) and then spread over many disks and enclosures.

For those who are not convinced that the small startup Amplidata is onto something: Intel and Dr. Sam Siewert of Trellis Logic explain in this paper why it can even be mathematically proven that the Reed-Solomon based erasure codes of RAID 6 are a dead end for large storage systems. The paper concludes:

"Amplida's Bitspread is an efficient, scalable and practical alternative to the stop-gap of combined RAID levels like 6+1."

And that is exactly the reason why CloudFounders chose to build their storage system on the Amplidata backend.

The DSS based on “Bitspread” works with objects and is thus not a block device. A volume driver must be installed that presents the DSS as a block device. This way the hypervisor can connect to an iSCSI target running on top of the volume driver, as an iSCSI target requires a block device and cannot address the DSS's object format directly.
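To make that layering concrete, here is a minimal sketch of the core job of such a volume driver: carving the virtual block device into fixed-size chunks and mapping each chunk onto an object in the backend. The class and the object-store interface are hypothetical illustrations, not CloudFounders' actual driver:

```python
# Hypothetical sketch of a block-to-object translation layer.
# 'object_store' is assumed to expose get(key) -> bytes-or-None and
# put(key, bytes), like an S3- or DSS-style backend; it is not a real API.

CHUNK_SIZE = 4 * 1024 * 1024  # one 4 MiB object per chunk (illustrative choice)

class BlockVolume:
    def __init__(self, object_store, volume_id):
        self.store = object_store
        self.volume_id = volume_id

    def _key(self, chunk_index):
        return f"{self.volume_id}/chunk-{chunk_index:08d}"

    def read(self, offset, length):
        """Read 'length' bytes at byte 'offset' by fetching the covering objects."""
        data = bytearray()
        while length > 0:
            chunk, chunk_off = divmod(offset, CHUNK_SIZE)
            obj = self.store.get(self._key(chunk)) or bytes(CHUNK_SIZE)
            part = obj[chunk_off:chunk_off + length]
            data += part
            offset += len(part)
            length -= len(part)
        return bytes(data)

    def write(self, offset, payload):
        """Read-modify-write each object that the byte range touches."""
        while payload:
            chunk, chunk_off = divmod(offset, CHUNK_SIZE)
            obj = bytearray(self.store.get(self._key(chunk)) or bytes(CHUNK_SIZE))
            part = payload[:CHUNK_SIZE - chunk_off]
            obj[chunk_off:chunk_off + len(part)] = part
            self.store.put(self._key(chunk), bytes(obj))
            offset += len(part)
            payload = payload[len(part):]
```

An iSCSI target can then export this block interface to the hypervisor, which is what allows object storage such as the DSS (or S3) to be consumed as an ordinary datastore.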

Bitspread is a lot more CPU intensive and needs more raw storage capacity than traditional RAID algorithms. To reduce the CPU impact, Amplidata leverages the SSE 4.2 capabilities of the latest Xeons. And because Bitspread copes so well with disk failures, it is natural to use relatively slow but cheap, high-capacity SATA disks, which offsets the capacity disadvantage compared to RAID 6. Decent media transfer rates can still be achieved because the DSS typically spreads the check-blocks over many disks, so reads and writes are parallelized across many spindles.
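To gauge the capacity trade-off mentioned above, a small worked comparison; the RAID 6 group size and the drive figures are assumptions for illustration, not vendor numbers:

```python
# Illustrative raw-capacity comparison; all numbers are assumptions.

usable_tb = 100                                 # usable capacity to provision

# 18/12 Bitspread-style policy: 1.5x raw capacity, survives 6 drive failures.
bitspread_raw = usable_tb * 18 / 12

# RAID 6 with 12-disk groups (10 data + 2 parity): 1.2x raw capacity,
# survives 2 failures per group.
raid6_raw = usable_tb * 12 / 10

print(f"Bitspread 18/12 raw capacity: {bitspread_raw:.0f} TB (6 failures tolerated)")
print(f"RAID 6 (10+2) raw capacity:   {raid6_raw:.0f} TB (2 failures per group)")

# With cheap 4 TB SATA drives (assumed), the higher overhead amounts to only a
# handful of extra spindles, which also adds to the parallelism noted above.
extra_drives = (bitspread_raw - raid6_raw) / 4
print(f"Extra drives for the higher overhead: ~{extra_drives:.0f}")
```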

Comments

  • WeaselITB - Tuesday, August 6, 2013 - link

    Fascinating perspective piece. I look forward to the CloudFounders review -- that stuff seems pretty interesting.

    Thanks,
    -Weasel
  • shodanshok - Tuesday, August 6, 2013 - link

    Very interesting article. It basically matches my personal opinion on the SAN market: it is an overpriced one, with much less performance per $$$ than DAS.

    Anyway, with the advent of thin pools / thin volumes in RHEL 6.4 and dm-cache in RHEL 7.0, a cheap commodity Linux distribution (CentOS costs 0, by the way) basically matches the feature set exposed by most low/mid-end SANs. This means that a cheap server with 12-24 2.5'' bays can be converted to SAN-like duty, with very good results.

    From this point of view, the recent S3500 / Crucial M500 disks are very interesting: the first provides enterprise-certified, high-performance, yet (relatively) low-cost storage, while the second, though not explicitly targeted at the enterprise market, is available at an outstanding capacity/cost ratio (the 1TB version is about 650 euros). Moreover, it also has a capacitor array to prevent data loss in the case of power failure.

    Bottom line: for high-performance, low-cost storage, use a Linux server with loads of SATA SSDs. The only drawback is that you _have_ to know the VGS/LVS CLI interface, because good GUIs tend to be commercial products and, anyway, for data recovery the CLI remains your best friend.

    A note on RAID levels: while most sysadmins continue to use RAID 5/6, I think it is really wrong in most cases. The R/M/W penalty is simply too much on mechanical disks. I've done some tests here: http://www.ilsistemista.net/index.php/linux-a-unix...

    Maybe on SSDs the results are better for RAID 5, but the low performance in a degraded state (and the very slow/dangerous reconstruction process) remains.
  • Kyrra1234 - Wednesday, August 7, 2013 - link

    The enterprise storage market is about the value-add you get from buying from the big name companies (EMC, Netapp, HP, etc...). All of those will come with support contracts for replacement gear and to help you fix any problems you may run into with the storage system. I'd say the key reasons to buy from some of these big players:

    * Let someone else worry about maintaining the systems (this is helpful for large datacenter operations where the customer has petabytes of data).
    * The data reporting tools you get from these companies will out-shine any home grown solution.
    * When something goes wrong, these systems will have extensive logs about what happened, and those companies will fly out engineers to rescue your data.
    * Hardware/Firmware testing and verification. The testing that is behind these solutions is pretty staggering.

    For smaller operations, rolling out an enterprise SAN is probably overkill. But if your data and uptime are important to you, enterprise storage will be less of a headache when compared to JBOD setups.
  • Adul - Wednesday, August 7, 2013 - link

    We looked at the Fusion-IO ioDrive and decided not to go that route, as the workloads presented by the virtualized desktops we offer would have killed those units in a heartbeat. We opted instead for a product by GreenBytes for our VDI offering.
  • Adul - Wednesday, August 7, 2013 - link

    See if you can get one of these devices for review :)

    http://getgreenbytes.com/solutions/vio/

    we have hundreds of VDI instances running on this.
  • Brutalizer - Sunday, August 11, 2013 - link

    These Greenbyte servers are running ZFS and Solaris (illumos)
    http://www.virtualizationpractice.com/greenbytes-a...
  • Brutalizer - Sunday, August 11, 2013 - link

    GreenByte:
    http://www.theregister.co.uk/2012/10/12/greenbytes...

    Also, Tegile is using ZFS and Solaris:
    http://www.theregister.co.uk/2012/06/01/tegile_zeb...

    Who said ZFS is not the future?
  • woogitboogity - Sunday, August 11, 2013 - link

    If there is one thing I absolutely adore about real capitalism, it is these moments where the establishment goes down in flames. Just the thought of their jaws dropping and stammering "but that's not fair!" when they themselves were making a mockery of fair prices with absurd profit margins... priceless. Working with computers gives you so very many of these wonderful moments of truth...

    On the software end it is almost as much fun as watching plutocrats and dictators alike try to "contain" or "limit" TCP/IP's ability to spread information.
  • wumpus - Wednesday, August 14, 2013 - link

    There also seems to be a disconnect between what Reed-Solomon can do and what they are concerned about (while RAID 6 uses Reed-Solomon, it is a specific application and not a general limitation).

    It is almost impossible to scale rotating disks (presumably magnetic, but don't ignore optical forever) to the point where Reed-Solomon becomes an issue. The basic algorithm scales (easily) to 256 disks (or whatever you are striping across), of which typically you want about 16 (or fewer) parity disks. Any panic over "some byte of data was mangled while a drive died" just means you need to use more parity disks. Somehow using up all 256 is silly (for rotating media), as few applications access data in groups of 256 sectors at a time (currently 1MB, possibly more by the time somebody might consider it).

    All this goes out the window if you are using flash (and can otherwise deal with the large page clear requirement issue), but I doubt that many are up to such large sizes yet. If extreme multilevel optical disks ever take over, things might get more interesting on this front (I will still expect Reed Solomon to do well, but eventually things might reach the tipping point).
  • equals42 - Saturday, August 17, 2013 - link

    The author misunderstands how NetApp uses NVRAM. NVRAM is not a cache for the hottest data. Writes always go to DRAM. The writes are committed to NVRAM (which is mirrored to another controller) before being acknowledged to the host, but the write IO and its commitment to disk or SSD via WAFL sequential CP writes all happen from DRAM. While any data remains in DRAM, it can be considered cached, but the contents of NVRAM do not constitute a cache, nor are they used to serve host reads.

    NVRAM is only there to make sure that no writes are ever lost due to a controller failure. This is important to recognize, since most mid-range systems (and all the low-end ones I've investigated) do NOT protect against write losses in the event of failure. Data loss like this can lead to corruption in block-based scenarios and database corruption in nearly any scenario.
