ARM-based servers hold the promise of extremely low power consumption and excellent performance-per-watt ratios. It's theoretically possible to place an incredible number of servers into a single rack; there are already implementations with as many as 1000 ARM servers in one rack (48 server nodes in a 2U chassis). What's more, all of those nodes consume less than 5kW combined (or around 5W per quad-core ARM node). But whenever a new technology is hyped, it's important to remain objective. The media loves to rave about new trends and people like reading about "some new thing"; however, at the end of the day the system administrator has to keep his IT services working and convince his boss to invest in new technologies.
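To make that density and power claim concrete, here is a back-of-the-envelope sketch; the 42U rack size is our assumption for illustration, while the other figures come from the paragraph above:

```python
# Rough density/power arithmetic for the 48-nodes-per-2U example above.
# Assumption (ours): a standard 42U rack fully populated with 2U chassis.
nodes_per_chassis = 48
chassis_per_rack = 42 // 2                # 21 chassis fit in a 42U rack
nodes_per_rack = nodes_per_chassis * chassis_per_rack
print(nodes_per_rack)                     # 1008 nodes, i.e. roughly 1000

rack_power_w = 5000                       # "less than 5kW combined"
print(rack_power_w / nodes_per_rack)      # ~4.96, around 5W per node
```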

At first sight, the relatively low performance per core of ARM CPUs seems like a bad match for servers. The dominant CPU in the server market is without doubt Intel's Xeon. The success of the Xeon family is largely rooted in its excellent single-threaded (or per core) performance at moderate power levels (70-95W). Combine this exceptional single-threaded performance with a decent core count and you get good performance in almost any kind of application. Economies of scale and the resulting price levels are also very important, but the server market has been more than willing to pay a little extra if the response times are lower and the energy bills moderate.

A data point proving that single-threaded performance is still important is the evolution of Oracle's (or Sun's, if you prefer) T-series. The Sun T3 had 16 cores with 128 threads; the T4, however, had only 8 cores with 8 threads each, and CEO Larry Ellison touted more than once that single-threaded performance was massively improved, up to five times faster. Do we really need another server with a flock of slow but energy-efficient cores? Has history not taught us that a few "bulls" are better than "a flock of chickens"?

History has also shown that the amount of memory per server is very important. Many HPC and virtualization applications are limited by the amount of RAM. The current Cortex-A9 generation of ARM CPUs has a 32-bit address bus and does not support more than 4GB.

And yet, the interest in ARM-based servers is growing, and there is more to it than just hype. Yes, ARM-based CPUs still lack the number crunching power and the massive number of DIMM slots that Xeon's memory controller can handle, but ARM CPUs score extremely well when it comes to cost and power consumption.

ARM-based CPUs have also made giant strides forward when it comes to performance. To give you a few data points: a dual-core ARM Cortex-A9 at 1.2GHz (the Samsung Exynos introduced in 2011) compresses more than 10 times faster than the typical ARM11-based cores of 2008. SunSpider performance increased by a factor of 20 according to Anand's measurements on the iPhone (though part of that is almost certainly thanks to browser and software optimizations). The latest ARM Cortex-A15 is again quite a bit more powerful, offering about 50% higher performance. The A57 will add 64-bit support and is estimated to deliver 20 to 30% higher performance. In short, single-threaded performance is increasing quickly, and the same is true for the amount of RAM that can be addressed. The ARM Cortex-A9 is limited to 4GB, but the Cortex-A15 should be able to address 16GB, while the A57 will be able to address a lot more.
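To put those generational data points side by side, here is a minimal sketch that simply compounds the quoted factors; treating them as multiplicative single-threaded gains is our simplification, and real workloads will not scale this cleanly:

```python
# Compound the ballpark single-threaded gains quoted above.
# The factors are the article's rough figures, not measurements.
gains = [
    ("ARM11 -> Cortex-A9 (compression)", 10.0),   # "more than 10 times faster"
    ("Cortex-A9 -> Cortex-A15",           1.5),   # "about 50% higher performance"
    ("Cortex-A15 -> Cortex-A57",          1.25),  # midpoint of "20 to 30% higher"
]

cumulative = 1.0
for step, factor in gains:
    cumulative *= factor
    print(f"{step}: x{factor:g} (cumulative x{cumulative:.1f})")
# Cumulative result: roughly 18-19x the 2008-era ARM11 cores
```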

It is likely just a matter of time before ARM products can start to chip away at segments of the server market. How much time? The best way to find out is to look at the most mature ARM server shipping today: the Calxeda based Boston Viridis. Just what can this server handle today, where does it have the potential to succeed, and what are its shortcomings? Let's find out.

Comments

  • tech4real - Thursday, March 14, 2013 - link

    Calxeda quotes 6W for the whole SoC. We don't know how much is used by all the uncore stuff. It's possible the A9 core only burns around 800mW. Still quite a gap to 1.25W.
  • Wilco1 - Thursday, March 14, 2013 - link

    Assuming the 800mW figure is accurate and the uncore power stays the same, a node would go from 6W to 7.8W, i.e. 30% more power for 100% more performance. Or they could voltage scale down to 1.5GHz and get 65% more performance for 5% more power. While a 28nm A15 uses more power in both scenarios, it is also much faster, so perf/Watt is significantly better.
  • tech4real - Thursday, March 14, 2013 - link

    1. I guess we have to wait and see if it's really 2X the performance from A9 to A15 in real tests. I personally wouldn't bet on that just yet.
    2. Most likely the uncore power will increase too. I don't think the larger memory bandwidth will come for free.
  • Wilco1 - Thursday, March 14, 2013 - link

    1. We already know A15 is 50-60% faster than A9 per clock (and often more, particularly floating point), so that gives ~2x gain from 1.4GHz to 1.8GHz.
    2. The uncore power will be scaling down with process while the higher bandwidth demand from A15 will increase DRAM power. Without detailed figures it's reasonable to assume these balance each other out.
  • tech4real - Thursday, March 14, 2013 - link

    Then let's wait and see when Anand benchmarks a future A15 system.
    Also, since the real microserver battle is between future A15 systems and 22nm Atom systems, I am eager to see how it plays out.
  • Th-z - Wednesday, March 13, 2013 - link

    Very interesting article, thanks! This really piques another curiosity: how does the latest IBM Power-based server fare these days?
  • Flunk - Wednesday, March 13, 2013 - link

    It really doesn't sound like the price/performance is there. Also, the lack of Windows support makes it useless for those of us who run ASP.NET websites (like the company I work for).

    It's still nice to see companies trying something different from the standard strategy. Maybe this will be better in a few generations and take the web server market by storm. If we see a Windows Server ARM port, I could see considering it as an option.
  • skyroski - Wednesday, March 13, 2013 - link

    I agree that your testing suite's method is reasonable and that you were testing with hosting providers in mind, fair enough.

    However, on the topic of whether a standard Xeon or ARM-based servers would be better for serving a single site - which is the scenario that matters to FB/Twitter/Google/Baidu etc., the companies that (as the media has led me to believe this past year) ARM partners are trying to sell this piece of kit to - this test unfortunately cannot tell us.

    A quick search on Google on the performance impact of VMs yielded a thread in the VMware community forum in which a vExpert/moderator mentioned an expectation of 90% performance, and frankly, no matter how small you think the performance impact of a VM may be, it is still using up CPU cycles to emulate hardware; that point will remain true no matter how efficient the hypervisor gets.

    Secondly, there is the overhead of running 24 physical copies of the OS + Apache + DB on a box that would otherwise be running a single copy of the OS + Apache + DB, which is total overkill (on that topic).

    It would be great if you could also test the Xeon's req/sec when running a single instance so we can see it from a different perspective. As of now, as I said, your test is skewed towards hosting providers who might invest in Calxeda to provide VPS alternatives. But to them (and their client base), the benefit of a VPS is its portability, which 24 physical ARM nodes aren't going to provide, so I don't see them considering it as an alternative solution anyway.
  • skyroski - Wednesday, March 13, 2013 - link

    I also want to ask if your Xeon test server's network adapter is capable of and was using Intel VT-c?
  • JohanAnandtech - Thursday, March 14, 2013 - link

    It was using VMDq/Netqueue (via VMXnet) but not SR-IOV/VT-c
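As an aside for readers following the power numbers in the tech4real/Wilco1 exchange above, here is a minimal sketch of that arithmetic; the per-core wattages and the constant-uncore assumption come from those comments, not from Calxeda:

```python
# Reproduce the 6W -> 7.8W node estimate discussed in the thread above.
# Assumptions (from the comments, not vendor data): a quad-core node,
# 6W total SoC power, ~0.8W per Cortex-A9 core, ~1.25W per Cortex-A15 core,
# and uncore power that stays constant across the core swap.
cores = 4
soc_power_a9_w = 6.0
a9_core_w, a15_core_w = 0.8, 1.25

uncore_w = soc_power_a9_w - cores * a9_core_w     # 6.0 - 3.2 = 2.8W
soc_power_a15_w = uncore_w + cores * a15_core_w   # 2.8 + 5.0 = 7.8W
print(soc_power_a15_w)                            # 7.8
print(soc_power_a15_w / soc_power_a9_w - 1)       # ~0.30, i.e. 30% more power
```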
