Mass Hot Storage: Knox

For OpenRack v1 to work, a new server design was needed that implemented bus bar connectors at the back. Using an updated Freedom-based chassis minus the PSU would leave a fair bit of empty space. Simply filling the entire space with 3.5" HDDs would be wasteful, as most of Facebook's workloads aren't that storage hungry. The solution proved to be very similar to the power shelf, namely grouping the additional node storage outside the server chassis on a purpose-built shelf: Knox was born.


OCP Knox with one disk sled out (Image Courtesy The Register)

Put simply, Knox is a regular JBOD disk enclosure built for OpenRack, which needs to be attached to the host bus adapters of surrounding Winterfell compute nodes. It differs from standard 19" enclosures in two main ways: it can fit 30 3.5" hard disks, and it makes maintenance quite easy. To replace a disk, one simply slides out the disk sled, pops open the disk bay, swaps the disk, closes the bay and slides the tray back into the rack. Done.

Object Storage

Seagate has contributed the specification of a "Storage device with Ethernet interface", also known in its productized form as Seagate Kinetic. These hard disks are meant to cut out the middle man and provide an object storage stack directly on the disk. In OCP speak, this means the Knox node would not need to be connected to a compute node but could be connected directly to the network. Seagate, together with Rausch Netzwerktechnik, has released the 'BigFoot Storage Object Open', a new chassis designed for these hard disks, with 12x 10GbE connectivity in a 2 OU form factor.

The concept of the BigFoot system is not unknown to Facebook either, as they have released a system with a similar goal, called Honey Badger. Honey Badger is a modified Knox enclosure that pairs with a compute card -- Panther+ -- to provide (cold) object storage services for pictures and such. Panther+ is fitted with an Intel Avoton SoC (C2350 for low end up to C2750 for high end configurations), up to four enabled DDR3 SODIMM slots, and mSATA/M.2 SATA3 onboard storage interfaces. This plugs onto the Honey Badger mainboard, which in turn contains the SAS controller, SAS expander, AST1250 BMC, two miniSAS connectors and a receptacle for a 10GbE OCP mezzanine networking card. Facebook has validated two configurations for the Honey Badger SAS chipset: one based on the LSI SAS3008 controller and LSI SAS3x24R expander, the other consisting of the PMC PM8074 controller joined by the PMC PM8043 expander.

Doing this eliminates the need for a 'head node', usually a Winterfell system (Leopard will not be used by Facebook to serve up Knox storage); its role is taken over by the more efficient Avoton design on the Panther card. Another good example of modularity and lock-in free hardware design, another dollar saved.

Cold Storage

A slightly modified version of Knox is used for cold storage, with specific attention paid to running the fans slowly and only spinning up a disk when required.

Facebook meanwhile has built another cold storage solution, this time using an OpenRack filled with 24 magazines of 36 cartridge-like containers, each of which holds 12 Blu-ray discs. Apply some maths and you get a maximum capacity of 10,368 discs, and knowing you can fit up to 128GB on a single BD-XL disc, you have a very dense data store of up to roughly 1.33PB (or 1.26PB when counted in binary units). Compared to hard disks, optical media touts greater reliability: Blu-ray discs have a life expectancy of 50 years, and some discs may even live on for a century.
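
The arithmetic behind those figures can be checked with a short sketch. The constant and function names below are purely illustrative and do not come from any OCP specification; the counts and the 128GB per disc are the ones quoted above. The total is printed in both decimal and binary units, since the 1.26PB figure matches the binary conversion (dividing by 1,024 twice), while decimal units give about 1.33PB.

```python
# Capacity arithmetic for the Blu-ray cold storage rack described above.
# All names are illustrative; the numbers are the ones quoted in the text.
MAGAZINES_PER_RACK = 24
CARTRIDGES_PER_MAGAZINE = 36
DISCS_PER_CARTRIDGE = 12
GB_PER_DISC = 128  # maximum BD-XL capacity quoted in the article


def rack_discs():
    """Maximum number of discs in one fully populated rack."""
    return MAGAZINES_PER_RACK * CARTRIDGES_PER_MAGAZINE * DISCS_PER_CARTRIDGE


def rack_capacity_gb():
    """Maximum raw capacity of one rack, in GB."""
    return rack_discs() * GB_PER_DISC


discs = rack_discs()
total_gb = rack_capacity_gb()

print(discs)                 # 10368
print(total_gb / 1000 ** 2)  # 1.327104 (PB, decimal units)
print(total_gb / 1024 ** 2)  # 1.265625 (PB, binary units)
```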

The rack resembles a jukebox: whenever data is requested from a certain disc, a robot arm takes the cartridge to the top, where another system slides the right discs into one of the Blu-ray readers. This system serves a simple purpose: getting as much data as possible stored in a single rack, with access latency not being hugely important.


27 Comments


  • Black Obsidian - Tuesday, April 28, 2015

    I've always hoped for more in-depth coverage of the OpenCompute initiative, and this article is absolutely fantastic. It's great to see a company like Facebook innovating and contributing to the standard just as much as (if not more than) the traditional hardware OEMs.
  • ats - Tuesday, April 28, 2015

    You missed the best part of the MS OCS v2 in your description: support for up to 8 M.2 x4 PCIe 3.0 drives!
  • nmm - Tuesday, April 28, 2015

    I have always wondered why they bother with a bunch of little PSUs within each system or rack to convert AC power to DC. Wouldn't it make more sense to just provide DC power to the entire room/facility, then use less expensive hardware with no inverter to convert it to the needed voltages near each device? This type of configuration would get along better with battery backups as well, allowing systems to run much longer on battery by avoiding the double conversion between the battery and server.
  • extide - Tuesday, April 28, 2015

    The problem with doing a datacenter-wide power distribution is that at only 12v, to power hundreds of servers you would need to provide thousands of amps, and it is essentially impossible to do that efficiently. Basically the way FB is doing it is the way to go -- you keep the 12v current to reasonable levels and only have to pass that high current a reasonable distance. Remember 6KW at 12v is already 500A !! And that's just for HALF of a rack.
  • tspacie - Tuesday, April 28, 2015

    Telcos have done this at -48VDC for a while. I wonder did data center power consumption get too high to support this, or maybe just the big data centers don't have the same continuous up time requirements?
    Anyway, love the article.
  • Notmyusualid - Wednesday, April 29, 2015

    Indeed.

    In the submarine cable industry (your internet backbone), ALL our equipment is -48v DC. Even down to routers / switches (which are fitted with DC power modules, rather than your normal 100 - 250v AC units one expects to see).

    Only the management servers run from AC power (not my decision), and the converters that charge the DC plant.

    But 'extide' has a valid point - the lower voltage and higher currents require huge cabling. Once an electrical contractor dropped a piece of metal conduit from high over the copper 'bus bars' in the DC plant. Need I describe the fireworks that resulted?
  • toyotabedzrock - Wednesday, April 29, 2015

    48 v allows 4 times the power at a given amperage.
    12vdc doesn't like to travel far and at the needed amperage would require too much expensive copper.

    I think a pair of square wave pulsed DC at higher voltage could allow them to just use a transformer and some capacitors for the power supply shelf. The pulses would have to be directly opposing each other.
  • Jaybus - Tuesday, April 28, 2015

    That depends. The low voltage DC requires a high current, and so correspondingly high line loss. Line loss is proportional to the square of the current, so the 5V "rail" will have more than 4x the line loss of the 12V "rail", and the 3.3V rail will be high current and so high line loss. It is probably NOT more efficient than a modern PS. But what it does do is move the heat generating conversion process outside of the chassis, and more importantly, frees up considerable space inside the chassis.
  • Menno vl - Wednesday, April 29, 2015

    There are already a lot of things going on in this direction. See http://www.emergealliance.org/
    and especially their 380V DC white paper.
    Going DC all the way, but at a higher voltage to keep the demand for cables reasonable. Switching 48VDC to 12VDC or whatever you need requires very similar technology to switching 380VDC to 12VDC. Of course the safety hazards are different and it is similar when compared to mixing AC and DC which is a LOT of trouble.
  • Casper42 - Monday, May 4, 2015

    Indeed, HP already makes 277VAC and 380VDC Power Supplies for both the Blades and Rackmounts.

    277VAC is apparently what you get when you split 480VAC 3-phase into individual phases.
