Conclusion: the SoCs

The proponents of high core and thread counts are quick to discount brawny cores for scale-out applications. On paper, wide-issue cores are indeed a bad match for such low-ILP applications. In reality, however, the high clock speeds and multi-threading of the Xeon E3s proved very powerful across a wide range of server applications.

The X-Gene 1 is the most potent ARM server SoC we have ever seen. It is unfortunate, however, that AppliedMicro's presentations have created inflated expectations.

AppliedMicro insisted that the X-Gene 1 is a competitor for the powerful Haswell and Ivy Bridge cores. So how do we explain the large difference between our benchmarks and theirs? They used the "wrk" benchmark, which is very similar to Apache Bench: it hits the same page over and over again, and unless you do some serious OS and network tweaking, your server will quickly run out of ports/connections and other OS resources. The most likely explanation is therefore that the Xeon measurements were taken at lower CPU load, bottlenecked by network and OS resources rather than by the processor.
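To illustrate the point (the host, thread count, and connection count below are hypothetical), a single-URL wrk run and the kind of OS tuning needed to keep it from exhausting connection resources look roughly like this:

```shell
# Hypothetical single-URL load test, similar in spirit to the setup
# described above: every request hits the same static page, so the
# bottleneck shifts from the CPU to connection churn.
wrk -t8 -c400 -d60s http://webserver.example/index.html

# Typical Linux tuning needed before such a test saturates the CPU;
# without it, throughput is capped by sockets stuck in TIME_WAIT and
# by the default ephemeral port range, not by the processor.
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w fs.file-max=1048576
```

Without that tuning, the faster CPU simply waits on free sockets, which would explain Xeon results measured well below full load.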

This is in sharp contrast with our Drupal test, where we test with several different user patterns and thus a variety of requests. Each request is much heavier, so available connections/ports are not the bottleneck. Also, all CPUs were in the 90-98% load range when we compared the maximum throughput numbers.

The 40nm X-Gene can compete with the 22nm Atom C2000 performance-wise, and that is an accomplishment in its own right. But the 40nm process technology and the current "untuned" state of ARMv8 software do not allow it to compete in performance/watt. The biggest advantage of the first 64-bit ARM SoCs is that an ARM processor can now use eight DIMM slots and address far more RAM. Better software support (compilers, etc.) and the 28nm X-Gene 2 SoC will be necessary for AppliedMicro to compete with the Intel Xeon in performance/watt.

The Atom C2750's raw performance fails to impress in most of our server applications. Then again, we were pleasantly surprised that its power consumption stays below the official TDP. Still, in most server applications, a low voltage Xeon E3 outperforms it by a large margin.

And then there's the real star, the Xeon E3-1230L v3. It does not live up to the promise of staying below 25W, but its performance surpassed our expectations. The result, even taking into account the extra power for the chipset, is an amazing performance/watt ratio. The bottom line is that the introduction of the Xeon-D, which is basically an improved Xeon E3 with an integrated PCH, will make it very hard for any competitor to beat the higher-end (20-40W TDP) Intel SoCs in performance/watt across a wide range of "scale-out" applications.

Conclusion: the Servers

As our tests with motherboards have shown, building an excellent micro or scale-out server requires much more thought than simply placing a low power SoC in a rack server. It is very easy to completely negate the power savings of such an SoC if the rest of the server (motherboard, fans, etc.) is not built for efficiency.

The Supermicro MicroCloud server is about low acquisition costs and simplicity. In our experience, it is less efficient with low power Xeons, as the cooling tends to consume proportionally more power (between 7 and 12W per node with eight nodes installed). The cooling system and power supplies are built to work with high performance Xeon E3 processors.

HP limits the most power efficient SoCs (such as the Atom C2730) to cartridges that are very energy efficient but also come with hardware limitations (16GB max. RAM, etc.). HP made the right choice, as it is the only way to turn the advantages of low power SoCs into real-world energy efficiency, but it means the low power SoC cartridges may not be ideal for many situations. You will have to monitor your application carefully and think hard about what you need and what you don't in order to create an efficient datacenter.

Of the products tested so far, the HP Moonshot tends to impress the most. Its cleverly designed cartridges use very little power, and the chassis allows you to choose the right server nodes to host your application. A few application tests were missing in this review, namely the web caching (memcached) and web front-end tests, but based on our experiences we are willing to believe that the m300/m350 cartridges are perfect for those use cases.

Still, we would like to see a Xeon E3 low voltage cartridge for a "full web infrastructure" (front- and back-end) solution. That will probably be addressed once HP introduces a Xeon-D based cartridge. Once that is a reality, you can truly "right-size" the Moonshot nodes to your needs. But even now, the HP Moonshot chassis offers great flexibility and efficiency. That flexibility does tend to cost more than other potential solutions – we have yet to find out the exact pricing details – but never before was it so easy to adapt tier one OEM server hardware so well to your software.

The War of the SoCs: Performance/Watt
  • Wilco1 - Tuesday, March 10, 2015 - link

    GCC 4.9 doesn't contain all the work in GCC 5.0 (close to final release, but you can build trunk). As you hinted in the article, it is early days for AArch64 support, so there is a huge difference between the 4.9 and 5.0 compilers; 5.0 is what you'd use for benchmarking.
  • JohanAnandtech - Tuesday, March 10, 2015 - link

    You must realize that the ARM ecosystem is not as mature as x86. The X-Gene runs on a specially patched kernel that has some decent support for ACPI, PCIe, etc. If you do not use this kernel, you'll run into all kinds of hardware trouble. And afaik, GCC needs a certain version of the kernel.
  • Wilco1 - Tuesday, March 10, 2015 - link

    No, you can use any newer GCC and glibc with an older kernel - that's the whole point of compatibility.

    Btw, your results look wrong - the X-Gene 1 scores much lower than the Cortex-A15 on the single-threaded LZMA tests (compare with the results on http://www.7-cpu.com/). I'm wondering whether this is just due to using the wrong compiler/options, or to running well below 2.4GHz somehow.
  • JohanAnandtech - Tuesday, March 10, 2015 - link

    Hmm. The A57 scores 1500 at 1.9GHz on compression. The X-Gene scores 1580 with GCC 4.8 and 1670 with GCC 4.9. Our scores are on the low side, but it is not like they are impossibly low.

    Ubuntu 14.04, the 3.13 kernel, and GCC 4.8.2 was and is the standard environment people get on the m400. You can tweak a lot, but that is not what most professionals will do. Otherwise we would also have to start testing with ICC on Intel. I am not convinced that the overall picture would change that much with lots of tweaking.
  • Wilco1 - Tuesday, March 10, 2015 - link

    Yes, and I'd expect the 7420 to do a lot better than the 5433. But the real surprise to me is that the X-Gene 1 doesn't even beat the A15 in the Tegra K1 despite being wider, newer, and running at a higher frequency - that's why the results look too low.

    I wouldn't call upgrading to the latest compiler tweaking - for AArch64 that is kind of essential, given it is early days and the rate of development is extremely high. If you tested 32-bit mode, then I'd agree GCC 4.8 or 4.9 is fine.
  • CajunArson - Tuesday, March 10, 2015 - link

    This is all part of the problem: requiring people to use cutting-edge software with custom recompilation just to beat a freakin' Atom, much less a real CPU?

    You do realize that we could play the same game with all the Intel parts. Believe me, the people who constantly whine that Haswell isn't any faster than Sandy Bridge have never properly recompiled computationally intensive code to take advantage of AVX2 and FMA.

    The fact that all those Intel servers were running software compiled for a generic x86-64 target, without any special tweaking or exotic hacking, is just another major advantage for Intel, not some "cheat".
  • Klimax - Tuesday, March 10, 2015 - link

    And if we are going for a cutting-edge compiler, then why not ICC with Intel's nice libraries... (pretty sure even the ancient Atom would suddenly not look that bad)
  • Wilco1 - Tuesday, March 10, 2015 - link

    To make a fair comparison, you'd either need to use the exact same compiler and options, or go all out and allow people to write hand-optimized assembler for the kernels.
  • 68k - Saturday, March 14, 2015 - link

    You can't seriously claim that recompiling an existing program with a different (well known and mature) compiler is equal to hand-optimizing things in assembler. Hint: one of the options is ridiculously expensive, the other is trivial.
  • aryonoco - Monday, March 9, 2015 - link

    Thank you Johan. Very very informative article. This is one of the least reported areas of IT in general, and one that I think is poised for significant uptake in the next 5 years or so.

    Very much appreciate your efforts into putting this together.
