Original Link: http://www.anandtech.com/show/6887/server-update-april-2013
Server Update April 2013: Positioning the HP Moonshot 1500 by Johan De Gelas on April 11, 2013 8:11 AM EST
HP is not shy of grand statements when describing its newest baby: "historic", "enables unprecedented scale", "revolutionary new architecture". HP claims "maximum density" and "unparalleled power efficiency". That, of course, simply begs for closer inspection.
The HP Moonshot 1500 System chassis is a proprietary 4.3U chassis that is pretty heavy: 180 lbs (81.6 kg). The chassis hosts:
- 45 hot-pluggable Atom S1260 based server nodes
- A backplane with 3 different "fabrics": network, storage and cluster
- Two Ethernet switch modules
- Two uplink modules with SFPs
- A management module (with a sort of iLO "light")
- Two to four 1200W PSUs (94% efficient)
- 5 dual rotor, hot plug fans (N+1 redundancy)
Each server node has two 1 Gbit connections to each of the two Ethernet switch modules, for four Ethernet links in total. The cluster fabric allows a fast 2D Torus interconnect for linking up server nodes. The storage fabric is implemented but seems to be unused for now.
The two switch modules are located in the middle of the chassis and run along the length of the backplane. They can be teamed up, but will probably end up in a redundant 1+1 configuration. The server nodes connect to the backplane using PCI Express slots, and also get their power from the PCI Express pins, similar to SeaMicro's servers. All fans are located at the back of the chassis.
The back is very similar to a blade chassis, with shared power, fans, management and uplink modules for all 45 server nodes.
The Moonshot Server Cartridge
As always, HP's server chassis is a well-designed, excellently documented chassis with tons of options. However, the server node, or cartridge as HP likes to call it, is the part that will run the software services and the key factor when trying to estimate the performance per watt ratio that this server is capable of.
The server cartridges look like very small blade servers. Inside we find the Atom S1260 at 2GHz, one SO-DIMM slot with 8GB of ECC-protected DDR3-1333, a Broadcom 5720 dual port 1 Gb Ethernet controller and a Marvell 9125 SATA controller.
HP also incorporates a local 500GB SATA drive and later on it will be possible to buy a 200GB SSD or 1 TB SATA disk.
The Atom S1260 "Centerton"
Ryan already broke the news on the Atom S1260, but it is good to recap. The Atom launched in 2008 and the CPU architecture was called Bonnell. Bonnell was a dual issue, in-order design with a rather long pipeline (16 stages).
Since then, the core architecture has had one minor update, codenamed "Saltwell". Saltwell came with an improved branch predictor and a post-fetch instruction buffer that makes sure the same instructions are not fetched twice; those are about the only IPC-improving features that Saltwell added. Saltwell also got Turbo Boost and finer-grained DVFS (Dynamic Voltage and Frequency Scaling). All other improvements were almost strictly power related: the L2-cache got a separate voltage rail, and the deep sleep state C6 was added.
In some models, the L2-cache doubled to 1 MB. The end result was that we found Saltwell cores to be about 8% faster than Bonnell based cores, clock for clock. Using a similar amount of power, the 32 nm Saltwell core at 1.86GHz (N2800) was about 20% faster than the older 45 nm 1.66GHz Atom N450 with the Bonnell architecture.
The Atom S1260 uses the same "Saltwell" core as all current Atoms. Intel added ECC support, support for twice as much memory (8 vs 4GB), the VT-x virtualization technology, and 4 extra PCIe lanes (8 in total). Those features were necessary to make it a worthy server chip.
HP chose the fastest member of the S1200 family, the S1260. This Atom operates at 2GHz and comes with an 8.5 W TDP. It is interesting to note that despite the focus on power efficiency, HP did not favor the S1240 at 1.6GHz with a 6.1 W TDP.
Our First Impressions
So how attractive is HP's Moonshot 1500 system chassis? HP claims that for the right workloads, these systems will be 77% less costly, consume 89% less energy, take 80% less space and be 97% less complex. HP bases this on a comparison of one of its own proprietary 47U (and thus non-industry standard) Moonshot racks with 5 racks of traditional 1U, 2 socket servers. That is a server form factor which was very popular at the beginning of this century. For some reason, some of the PR people at HP have missed the launch of HP's blade servers in 2002, have never heard about the Twin(²) designs of Supermicro, and also conveniently forgot that we now fill up our 2U and 3U boxes with memory, install a hypervisor on top, and run a few tens of virtual machines on them.
We have serious doubts that the current implementation will offer a significantly better performance per watt ratio than the current low power server options, even when running the "right" workloads such as serving web front ends, content delivery (mostly photos and images) and memcached servers. Our doubts are based upon several data points.
First, there is the performance per watt of the current Atom S1260. At 2GHz, it is clocked 7.5% higher than the Atom N2800 at 1.86GHz that we tested. That clockspeed advantage is about the only advantage it has over the N2800, so we expect it to perform up to 7.5% better but not more. The advantage of using 1333 MHz instead of 1066 MHz DDR3 is probably very small and partly negated by the fact that ECC takes a bit of performance away. That gives us an idea of how an individual Moonshot server cartridge will perform.
Take a look at the compression benchmark below. Compression is a low IPC workload that's sensitive to memory parallelism and latency. The instruction mix is a bit different, but this kind of workload is still somewhat similar to many server workloads.
You can easily calculate where the Atom S1260 would land: add about 8% to the scores of the Atom N2800. Extrapolating these numbers, we may expect the Atom S1260 to score 2570 at the most. That is about 22% better than the ARM Cortex A9 based quad core of Calxeda at 1.4GHz. Single threaded performance would be less than 10% better.
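The extrapolation above can be sketched as a few lines of arithmetic. Note that the absolute scores below are approximations read from our chart, not exact benchmark results:

```python
# Rough extrapolation of the Atom S1260 compression score from our measured
# Atom N2800 result. Absolute scores are approximate, read from our chart.
n2800_score = 2380            # measured Atom N2800 (1.86GHz), approximate
clock_uplift = 1.08           # ~8% from the 2GHz clock, the S1260's only real edge
calxeda_score = 2107          # quad Cortex-A9 ECX-1000 at 1.4GHz, approximate

s1260_estimate = n2800_score * clock_uplift
print(round(s1260_estimate))                              # ≈ 2570 "at the most"
print(round(100 * (s1260_estimate / calxeda_score - 1)))  # ≈ 22% advantage
```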
If you consider the TDP of this chip, the age of the Atom core is really showing. The heavily integrated ECX-1000 SoC - 4 cores, management, networking and IO controllers - needs about 5W at the most (at 1.4GHz, 3.8W at 1.1GHz). The Atom S1260 needs 8.5W and that does not include its network and management chips. So we estimate that the current Atom S1260 probably needs twice as much power to offer the 20% performance advantage illustrated above.
There is more...
Estimating S1260 Server Performance
As the Atom S1260 is very similar to the Atom N2800, we went one step further. We tested the 32 nm Atom in a real workload, the same workload that we used in our review of the Calxeda based Boston Viridis server.
It is simple: even at 2GHz, the Atom S1260 is no match for Calxeda's EnergyCore at 1.4GHz. The EnergyCore is the better server chip thanks to out-of-order execution, a four times larger L2-cache (4 MB), and the fact that it offers 4 real cores. Even if we assume that the 2GHz Atom S1260 performs 8% better thanks to its higher clockspeed, it cannot close the gap.
So let us summarize. The current A9 based Calxeda EC 1.4GHz is about 40% faster and consumes half the power of the Atom S1260. Therefore it is not unreasonable to assume that the performance per Watt ratio of the Calxeda SoC will be up to 3 times better.
There are more indications that our assumptions are not far off. We quote the white paper at HP's site (page 5):
"A completely populated system will be in the ~850W ballpark. That system powers 180 x 2.0GHz threads, with 2GB of RAM for each thread, at under 5W per thread."
That means that each cartridge needs about 19 W. Let us assume that 4 W is taken by the disk. That is generous as in most webserving workloads, the disk will be hardly active. That means that one server node needs about 15 W. Compare this with a measured 8.3W per Calxeda server node and you'll understand that there is little doubt in our minds that the S1260 is nowhere near the performance/watt of the ARM alternative.
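HP's own white paper figure lets us sanity check the per-node power with simple arithmetic. Remember that the 4 W disk figure is our own (generous) assumption:

```python
# Deriving the per-cartridge power budget from HP's white paper figure.
chassis_power = 850.0       # W, "completely populated system" per the white paper
cartridges = 45

per_cartridge = chassis_power / cartridges      # ≈ 18.9 W per cartridge
disk_estimate = 4.0                             # W, our assumption for the disk
server_node = per_cartridge - disk_estimate     # ≈ 14.9 W per server node

calxeda_node = 8.3                              # W, measured per Boston Viridis node
print(f"{per_cartridge:.1f} W/cartridge, {server_node:.1f} W/node, "
      f"{server_node / calxeda_node:.1f}x the Calxeda node")
```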
HP's Moonshot 1500: Our Evaluation So Far
We have not tested it yet, but we have no doubts that HP's Moonshot 1500 is a great chassis. The server nodes get power and access to 3 different fabrics from a very advanced backplane, and share power, cooling and management. HP brings some of the best ideas of the blade and the microserver world together. Network and server management gets a lot simpler this way.
But the server cartridge is a whole different story. The inclusion of a local hard disk is the ideal recipe for quickly increasing management costs. Replacing a bad hard disk - still the most common failure inside the server rack - involves pulling the extremely heavy server out as far as you can, removing the access panel, removing a server cartridge and only then replacing the disk. That is a long and costly procedure compared to simply pushing the release button of the hot-pluggable hard disks found in the front of most servers, including the Boston Viridis.
Secondly, the performance per watt is fantastic... if you compare it with old 1U servers. Consuming 850 watts for 180 slow threads, or almost 5W per thread, is nothing to write home about. A modern low power blade, a design like Supermicro's Twin², or even HP's own SL series can offer 32 Xeon E5 threads for about 200 W. That is 6.25 watts per "real Xeon" thread, which is, even in very low IPC workloads, at least 3 times faster! Want even more proof that 5W for a wimpy thread is nothing special? SeaMicro claims 3.2 kW for a complete system with hard disks, memory, and 512 Opteron Piledriver cores. That is 6.25W per real heavy duty core!
Of course, those calculations are based upon paper specs. But we tested Calxeda's technology first hand, and the Boston Viridis with Calxeda went as low as 8.33W for 4 threads, or a little bit more than 2W per thread. Granted, that is without storage but even if you add a 4W 2.5 inch harddisk, we get 3W per thread!
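Putting the watts-per-thread numbers from the last two paragraphs side by side makes the point. These use the paper specs quoted above for Moonshot, the Xeon blade, and SeaMicro, and our own measurement for the Viridis node:

```python
# Watts per thread for the systems discussed, from the figures quoted above.
systems = {
    "Moonshot / Atom S1260": 850 / 180,       # ~4.7 W per slow Atom thread
    "Xeon E5 blade (est.)":  200 / 32,        # 6.25 W per much faster thread
    "SeaMicro / Opteron":    3200 / 512,      # 6.25 W per Piledriver core
    "Boston Viridis node":   8.33 / 4,        # ~2.1 W per A9 thread, no disk
    "Viridis + 4W disk":     (8.33 + 4) / 4,  # ~3.1 W per thread with storage
}
for name, watts in systems.items():
    print(f"{name}: {watts:.2f} W/thread")
```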
Still not convinced? Well, Intel's own benchmarks are pretty spicy to say the least. This slide can be found in the S1200 presentation:
Again, a relatively simple Atom S1260 server node needs about 20 W. But notice how little power the Xeon E3 needs. Even with 2 SSDs, 16 (!) GB of RAM and 10 Gigabit Ethernet, you are looking at 60W for 8 fast threads. Make the two systems similar (a 10 Gb PHY easily consumes 5-8W more than a 1 Gb one) and you get about 6 W per Xeon thread and 5 W per Atom thread.
Let us cut to the chase: the current Atom has a pretty bad performance/watt ratio. And the way the HP server cartridges are built today does not make it better. Compare the mini-blade approach with the EnergyCard approach of Calxeda or the "credit card servers" of SeaMicro and you'll understand that there are better and more innovative ways to design microservers.
To sum it all up, the HP Moonshot is a good platform. But both SeaMicro and Calxeda already offer better designed server nodes. And there are much better CPUs on the market for microservers: AMD's lowest power Piledrivers, Calxeda A9 based EnergyCore and Intel's own Xeon E3-1265L offer a massively better performance/watt ratio. No matter what the software stack is, no matter what the most important metric is for you, one of those three will be able to beat the S1260. The Xeon offers the best single threaded integer performance, the Opteron can offer the best floating point performance (for HPC apps that can be recompiled) and if performance does not matter, the ARM based Calxeda EnergyCard sips much less power than the Atom.
Sure, the Atom S1200 can run (64 bit) Windows Server and ESXi, something that is not possible now with the ARM based Calxeda EnergyCore. But Windows Server is seldom seen in hyperscale servers, and so is ESXi. ESXi is a lot more popular in the hosting market, but let us not forget that VT-x alone is not enough to run ESXi smoothly. We have become accustomed to very low virtualization overhead thanks to the fact that CPU architects have reduced the VMexit and VMentry latencies and introduced technologies like Extended Page Tables combined with very large TLBs. Those vast improvements for virtualization performance have not been implemented in the Atom. Running ESXi on top of the S1260 with 8GB of RAM might not be a very pleasant experience.
In short, we are not exactly thrilled with HP's CPU choices. Luckily, Calxeda is one of the prime partners of HP's Moonshot Project. And this platform will be outdated soon anyway.
Intel's Next Generation Low Power Server CPUs
As I was writing this, Intel revealed details of new low-power SoCs for the data center, all coming in 2013.
The Intel Atom™ Processor S12x9 product family for storage: the Atom S12x9 will get up to 40 lanes of integrated PCIe 2.0 and hardware RAID storage acceleration. With Asynchronous DRAM Self-Refresh (ADR), the Intel Atom S12x9 family can protect critical DRAM data in the event of a power interruption. This is probably the one Atom S1200 family that will be hard to beat in its intended market.
Intel's Avoton is most likely the first Atom that makes sense for the server market. Built on Intel's 22nm process technology, using cores based upon the brand new "Silvermont" Atom microarchitecture, and integrating an Ethernet controller, this Atom holds a lot of promise. Intel announced that Avoton is now sampling to customers and that the first systems are expected in the second half of 2013. With Avoton, the Moonshot's performance per watt ratio will improve significantly.
But even with a new architecture and better integration, the Atom will face stiff competition from ARM Cortex A15 and A57 based server cores, and even from the newest Intel Xeon E3-1200 v3. Intel announced that the low power versions of the Haswell based Xeon will have a TDP as low as 13 watts. That chip will blur the line between "micro server CPUs" and "general purpose CPUs" even further. There is no telling which CPU will be the performance/watt king, even in server workloads with relatively low computational demand.