The Next Generation Open Compute Hardware: Tried and Tested

by Johan De Gelas & Wannes De Smet on April 28, 2015 12:00 PM EST


The Latest and Greatest: Leopard

Leopard, the latest update to the Windmill motherboard, is equipped with the Intel C610 chipset to support up to two E5-2600 v3 Haswell Xeons.

Single-processor mode is fully supported, and in that mode the CPU can access all onboard RAM. Increased thermal margins (mainly from raising the chassis height to 2 OU in Winterfell), bigger CPU heatsinks, and better airflow guidance allow the system to accept CPUs with a maximum TDP of 145W, which means you can install every Xeon except the E5-2687W v3 (160W TDP). Only eight DIMM slots are connected per CPU, but DDR4 allows for a maximum capacity of 128GB per DIMM, resulting in a theoretical maximum of 2TB of RAM (2 CPUs × 8 DIMMs × 128GB), which Facebook reckons is plenty for years to come. New in this generation is the ability to plug in NVDIMM modules (persistent flash storage in a DIMM form factor), which Facebook is testing to see whether they can replace PCIe-based add-in cards.
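The 2TB figure follows from simple multiplication. A minimal sketch of that arithmetic, assuming both sockets are populated and eight DIMMs are fitted per CPU as described above; the constant and function names are illustrative only:

```python
# Back-of-the-envelope check of the 2TB figure quoted in the text.
# All names here are illustrative; the values come from the article.
CPU_SOCKETS = 2      # Leopard supports up to two Xeons
DIMMS_PER_CPU = 8    # eight DIMMs connected per CPU
MAX_DIMM_GB = 128    # largest DDR4 DIMM capacity cited


def max_ram_gb(sockets=CPU_SOCKETS, dimms_per_cpu=DIMMS_PER_CPU, dimm_gb=MAX_DIMM_GB):
    """Return the theoretical maximum RAM capacity in GB."""
    return sockets * dimms_per_cpu * dimm_gb


total_gb = max_ram_gb()
print(f"{total_gb} GB = {total_gb / 1024:.0f} TB")  # prints "2048 GB = 2 TB"
```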

Besides the generational CPU update, other major changes include the removal of the onboard external PCIe connector, support for a mezzanine card with dual QSFP receptacles, a TPM header, the addition of an mSATA/M.2 slot for SATA/NVMe-based storage, and 8 more PCIe lanes routed to the riser card slot for a total of 24. The SAS connector has been removed, as Leopard will not be used as a head node for Knox.

Leopards, with the optional debug board (power/reset buttons and serial-to-USB) plugged in

A big addition to the board is a baseboard management controller (BMC). A simple headless Aspeed AST1250 controller provides traditional out-of-band IPMI access to query sensor and FRU data, control system power, and provide Serial over LAN. But Facebook taught it some new tricks: to aid bare-metal debugging, it keeps 256 POST codes in a buffer, offers 128kB of serial console output, and lets you remotely dump MSR data, which also happens automatically once the IERR/MCERR signal becomes active.
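To illustrate the idea of a fixed-depth POST code buffer, here is a rough sketch. It is not the BMC firmware's actual implementation: the class name, the one-byte range check, and the choice to discard the oldest code once the buffer is full are all assumptions made for illustration. Only the depth of 256 comes from the text.

```python
from collections import deque

POST_CODE_SLOTS = 256  # buffer depth quoted in the article


class PostCodeBuffer:
    """Hypothetical model of a BMC-side POST code buffer (not real firmware)."""

    def __init__(self, slots=POST_CODE_SLOTS):
        # Assumption: once full, the oldest code is discarded (ring buffer).
        self._codes = deque(maxlen=slots)

    def record(self, code):
        # Assumption: each POST code is a single byte value.
        if not 0 <= code <= 0xFF:
            raise ValueError("POST code must fit in one byte")
        self._codes.append(code)

    def dump(self):
        """Return the buffered codes, oldest first."""
        return list(self._codes)


# Usage: after 300 codes, only the most recent 256 remain.
buf = PostCodeBuffer()
for i in range(300):
    buf.record(i % 256)
codes = buf.dump()
print(len(codes), hex(codes[0]), hex(codes[-1]))  # prints "256 0x2c 0x2b"
```

A bounded deque keeps memory use constant, which is the property that matters on a small management controller.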

A rather unique feature of the BMC is that it allows you to remotely update the CPLD, VR, BMC and UEFI firmware (basically all the firmware present on the motherboard), a feature fully validated by all suppliers of those components. Another addition is average power reporting: the BMC keeps a buffer of 600 power measurements and lets you query the buffer for a specific interval via IPMI. To improve the accuracy of the power sensor data, factory-determined (non-)linear compensations are applied to the measured power usage. Lastly, another unique feature that stems from better rack-level integration is the ability to throttle CPU power usage when demand in the power zone exceeds capacity, for instance when a PSU dies. When the load rises to the PSU's capacity, the PSU executes a quick, temporary drop to 1 Volt. This triggers an 'Under Voltage' condition in the servers, which in turn activates the Fast Proc Hot signal on the CPUs, causing them to clock down for a certain amount of time and thus decreasing the PSU load, allowing the PSU to remain active instead of shutting down.
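The average power reporting described above can be modeled with a bounded buffer. The sketch below is an illustration only: it does not use any real IPMI command, the sampling period is not stated in the text (so the interval is expressed as a number of samples), and the class name and the `compensate` callback standing in for the factory-determined correction are hypothetical. Only the buffer depth of 600 comes from the text.

```python
from collections import deque

POWER_SAMPLES = 600  # buffer depth quoted in the article


class PowerHistory:
    """Hypothetical model of the BMC's power measurement buffer."""

    def __init__(self, slots=POWER_SAMPLES, compensate=lambda watts: watts):
        self._samples = deque(maxlen=slots)  # oldest samples drop off when full
        # Stand-in for the factory-determined compensation (identity by default).
        self._compensate = compensate

    def add(self, raw_watts):
        """Store one compensated power reading."""
        self._samples.append(self._compensate(raw_watts))

    def average(self, last_n):
        """Average the most recent last_n samples (fewer if the buffer is shorter)."""
        if last_n <= 0:
            raise ValueError("last_n must be positive")
        window = list(self._samples)[-last_n:]
        if not window:
            raise ValueError("no samples recorded yet")
        return sum(window) / len(window)


# Usage: feed 700 readings (0..699 W); the buffer retains 100..699.
history = PowerHistory()
for watts in range(700):
    history.add(watts)
print(history.average(600))  # prints 399.5
print(history.average(10))   # prints 694.5
```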


26 Comments


Kevin G - Tuesday, April 28, 2015 - link
Excellent article.

The efficiency gains are apparent even using a suboptimal PSU for benchmarking. (Though there are repeated concurrency values in the benchmarking tables. Is this intentional?)

I'm looking forward to seeing more compute node hardware based around Xeon-D, ARM and potentially even POWER8 if we're lucky. Options are never a bad thing.

Kind of odd to see the Knox mass storage units, I would have thought that OCP storage would have gone the BackBlaze route with vertically mounted disks for easier hot swap, density and cooling. All they'd need to develop would have been a proprietary backplane to handle the Kinetic disks from Seagate. Basic switching logic could also be put on the backplane so the only external networking would be high speed uplinks (40 Gbit QSFP+?).

Speaking of the Kinetic disks, how is redundancy handled with a network facing drive? Does it get replicated by the host generating the data to multiple network disks for a virtual RAID1 redundancy? Is there an aggregator that handles data replication, scrubbing, drive restoration and distribution, sort of like a poor man's SAN controller? Also do the Kinetic drives have two Ethernet interfaces to emulate multi-pathing in the event of a switch failure (quick Googling didn't give me an answer either way)?

The cold storage racks using Blu-ray discs in cartridges don't surprise me for archiving. The issue I'm puzzled with is the process by which data gets moved to them. I've been under the impression that there was never enough write throughput to make migration meaningful. For a hypothetical example, by the time 20 TB of data has been written to the discs, over 20 TB has been generated that'd be added to the write queue. Essentially big data was too big to archive to disc or tape. Parallelism here would solve the throughput problem but that gets expensive and takes more space in the data center that could be used for hot storage and compute.

Do the Knox storage and Wedge networking hardware use the same PDU connectivity as the compute units?

Are the 600 mm wide racks compatible with US Telecom rack width equipment (23" wide)? A few large OEMs offer equipment in that form factor and it'd be nice for a smaller company to mix and match hardware with OCP to suit their needs.

nils_ - Wednesday, April 29, 2015 - link
You can use something like Ceph or HDFS for data redundancy which is kind of like RAID over network.

davegraham - Tuesday, April 28, 2015 - link
Also, Juniper Networks has an ONIE-compliant OCP switch called the OCX1100, which makes it the only Tier 1 switch manufacturer (e.g. Cisco, Arista, Brocade) to provide such a device.

floobit - Tuesday, April 28, 2015 - link
This is very nice work. One of the best articles I've seen here all year. I think this points at the future state of server computing, but I really wonder if the more traditional datacenter model (VMware on beefy blades with a proprietary FC-connected SAN) can be integrated with this massively-distributed webapp model. Load balancing and failover are presumably done in the app layer, removing the need for hypervisors. As pretty as Oracle's recent marketing materials are, I'm pretty sure they don't have an HR app that can be load-balanced on the app layer alongside an expense app and an ERP app. Maybe in another 10 years. Then again, I have started to see business suites where they host the whole thing for you, and this could be a model for their underlying infrastructure.

ggathagan - Tuesday, April 28, 2015 - link
In the original article on these servers, it was stated that the PSUs were run on 277v, as opposed to 208v.
277v involves three phase power wiring, which is common in commercial buildings, but usually restricted to HVAC-related equipment and lighting.
That article stated that Facebook saved "about 3-4% of energy use, a result of lower power losses in the transmission lines."
If the OpenRack carries that design over, companies will have to add the cost of bringing 277v power to the rack in order to realize that gain in efficiency.

sor - Wednesday, April 29, 2015 - link
208 is 3 phase as well, generally 3x120v phases, with 208 tapping between phases or 120 available to neutral. It's very common for DC equipment. 277 to the rack IS less common, but you seemed to get hung up on the 3 phase part.

Casper42 - Monday, May 4, 2015 - link
3 phase restricted to HVAC?
That's ridiculous, I see 3 Phase in DataCenters all the time.
And Server vendors are now selling 277vAC PSUs for exactly this reason that FB mentions. Instead of converting the 480v main to 220 or 208, you just take a 277 feed right off the 3 phase and use it.

clehene - Tuesday, April 28, 2015 - link
You mention a reported $2 Billion in savings, but the article you refer to states $1.2 Billion.

FlushedBubblyJock - Tuesday, April 28, 2015 - link
One is the truth and the other is "NON Generally Accepted Accounting Procedures" aka its lying equivalent.

wannes - Wednesday, April 29, 2015 - link
Link corrected. Thanks!
