The Results that Matter

Before you jump ahead to the charts below, we suggest taking some time to properly interpret the results. First of all, we simulate between 5 and 15 "busy" users per second on the web server. A single user click on the website can result in anywhere from a few requests to tens of requests. For example, accessing the forum on the website results in two simple GET requests, while posting a reply results in an avalanche of 56 POSTs and GETs. That is why we report performance in "responses per second": from a CPU load point of view, responses are reasonably similar if you look at a statistically large enough number of them, while user actions differ so wildly that in some cases two user actions per second can require more processing power and network bandwidth than 20 user actions per second.
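To make that last point concrete, here is a minimal sketch of the reasoning behind the metric. The per-action request counts (2 for a forum view, 56 for posting a reply) come from our test; the traffic mixes are made up for illustration.

```python
# A minimal sketch of why "responses per second" is a fairer unit than
# "user actions per second". Request counts per action are from the text
# (2 GETs for a forum view, 56 requests for a reply); the mixes are made up.
REQUESTS_PER_ACTION = {
    "view_forum": 2,   # two simple GET requests
    "post_reply": 56,  # an avalanche of POSTs and GETs
}

def responses_per_second(actions_per_second):
    """Translate a mix of user actions/s into HTTP responses/s."""
    return sum(rate * REQUESTS_PER_ACTION[action]
               for action, rate in actions_per_second.items())

# Two heavy actions per second outweigh twenty light ones:
print(responses_per_second({"post_reply": 2}))   # 112 responses/s
print(responses_per_second({"view_forum": 20}))  # 40 responses/s
```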

Webhosting throughput—average over 24 web servers

At low concurrencies, the Intel machine leverages Turbo Boost and its exceptionally high per-core performance. At higher web loads, the total throughput of the 96 ARM Cortex-A9 cores (24 quad-core SoCs) is up to 50% higher than that of the low-power 32-thread/16-core (2x octa-core) Xeons. Even the mighty 2660 cannot beat the herd of ARM SoCs.

While we have lots and lots of experience with x86 servers, we had almost none with ARM-based servers, so we met with Calxeda's engineers and got some valuable optimization tips. It turns out that the internal switch fabric can be tuned in various ways. For example, the link speed of each node is set to 2.5Gbit/s by default, which is rather high considering that we are mostly CPU limited and use less than 0.5Gbit/s per node. Setting the link speed of each node to 1Gbit/s should lower power use while still giving more than enough bandwidth. We also updated to a slightly newer kernel (155) from the Calxeda kernel PPA (Personal Package Archive), which allowed us to make use of Dynamic Voltage and Frequency Scaling (DVFS, P-states) via the cpufreq tool. First let's see if all these power saving tweaks have reduced the total throughput.
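For those who want to replicate the DVFS setup, here is a minimal sketch using the stock Linux cpufreq sysfs interface. This is the generic kernel interface (we used the cpufreq tooling rather than this exact script), and the "ondemand" governor is an assumption for illustration, not a Calxeda-specific setting.

```python
# A sketch of enabling DVFS via the standard Linux cpufreq sysfs interface.
# Generic kernel interface, run as root; not a Calxeda-specific tool.
import glob

def set_governor(governor="ondemand"):
    """Let the kernel scale frequency/voltage with load on every core."""
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(governor)

def current_freqs_khz():
    """Return the current frequency (kHz) of each core."""
    freqs = {}
    for path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu = path.split("/")[5]  # e.g. "cpu0"
        with open(path) as f:
            freqs[cpu] = int(f.read())
    return freqs

if __name__ == "__main__":
    set_governor("ondemand")  # drop voltage/frequency when a node idles
    print(current_freqs_khz())
```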

Webhosting throughput optimized—average over 24 web servers

The changes did not give any boost in throughput (in many cases the scores are even slightly lower), but they might lower power use and/or response times. Let's look at that next.

Response Times

Webhosting response time optimized—average over 24 web servers

Again, the Intel machine performs better at lower concurrencies, but our ARM server delivers lower response times at high load. Our optimizations have had no effect on response times.

Comments

  • tech4real - Thursday, March 14, 2013 - link

    Calxeda quotes 6W for the whole SoC. We don't know how much is used by all the uncore stuff. It's possible the A9 core only burns around 800mW. Still quite a gap to 1.25W.
  • Wilco1 - Thursday, March 14, 2013 - link

    Assuming the 800mW figure is accurate and the uncore power stays the same, a node would go from 6W to 7.8W, i.e. 30% more power for 100% more performance (see the sketch after this thread). Or they could voltage scale down to 1.5GHz and get 65% more performance for 5% more power. While a 28nm A15 uses more power in both scenarios, it is also much faster, so perf/Watt is significantly better.
  • tech4real - Thursday, March 14, 2013 - link

    1. I guess we have to wait and see if it's really 2x perf from A9 to A15 in real tests. I personally wouldn't bet on that just yet.
    2. Most likely the uncore power will increase too. I don't think the larger memory bandwidth will come for free.
  • Wilco1 - Thursday, March 14, 2013 - link

    1. We already know the A15 is 50-60% faster than the A9 per clock (and often more, particularly in floating point), so that gives ~2x gain going from 1.4GHz to 1.8GHz (1.55 x 1.8/1.4 ≈ 2.0).
    2. The uncore power will scale down with process while the higher bandwidth demand of the A15 will increase DRAM power. Without detailed figures it's reasonable to assume these balance each other out.
  • tech4real - Thursday, March 14, 2013 - link

    Then let's wait and see Anand benchmark the future A15 systems.
    Also, since the real microserver battle is between future A15 systems and 22nm Atom systems, I am eager to see how it plays out.
  • Th-z - Wednesday, March 13, 2013 - link

    Very interesting article, thanks! This really piques another curiosity: how do the latest IBM POWER based servers fare these days?
  • Flunk - Wednesday, March 13, 2013 - link

    It really doesn't sound like the price/performance is there. Also, the lack of Windows support makes it useless for those of us who run ASP.NET websites (like the company I work for).

    It's still nice to see companies trying something different from the standard strategy. Maybe this will be better in a few generations and take the web server market by storm. If we see a Windows Server ARM port, I could see considering it as an option.
  • skyroski - Wednesday, March 13, 2013 - link

    I agree your testing suite's method is sound: you were testing with hosting providers in mind, fair enough.

    However, on the topic of whether a standard Xeon or ARM-based servers would be better for serving a single site (which is the consideration for FB/Twitter/Google/Baidu etc., the companies that, as the media has led me to believe this past year, ARM partners are trying to sell this piece of kit to), this test unfortunately cannot tell us.

    A quick search on Google on the performance impact of VMs yielded a thread in the VMware community forum by a vExpert/Moderator that mentioned an expectation of 90% performance. Frankly, no matter how small you think the performance impact of a VM may be, it is still using up CPU cycles to emulate hardware, and that point will remain true no matter how efficient the hypervisor gets.

    Secondly, there is the overhead of running 24 physical copies of the OS + Apache + DB on a box that would otherwise be running a single copy of the OS + Apache + DB, which is total overkill (on that topic).

    It would be great if you could also test the Xeon's req/sec running a single instance so we can see it from a different perspective. As of now, as I said, your test is skewed towards hosting providers who might invest in Calxeda to provide VPS alternatives. But to them (and their client base), the benefit of a VPS is its portability, which 24 physical ARM nodes aren't going to provide, so I don't see them considering it as an alternative solution anyway.
  • skyroski - Wednesday, March 13, 2013 - link

    I also want to ask whether your Xeon test server's network adapter is capable of, and was using, Intel VT-c.
  • JohanAnandtech - Thursday, March 14, 2013 - link

    It was using VMDq/NetQueue (via VMXNET) but not SR-IOV/VT-c.
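
As a quick check of the power arithmetic in Wilco1's comment above, here is a minimal sketch. All inputs are assumptions taken from the thread (4 cores per node, 0.8W per A9 core, 1.25W per A15 core, 6W per node, uncore power unchanged), not measurements.

```python
# Back-of-the-envelope check of the node power figures from the thread.
# Assumptions (not measurements): 4 cores per node, 0.8 W per A9 core,
# 1.25 W per A15 core, 6 W per node total, uncore power unchanged.
CORES = 4
NODE_W = 6.0
A9_CORE_W = 0.8
A15_CORE_W = 1.25

uncore_w = NODE_W - CORES * A9_CORE_W       # 6.0 - 3.2 = 2.8 W
a15_node_w = uncore_w + CORES * A15_CORE_W  # 2.8 + 5.0 = 7.8 W
extra_power = a15_node_w / NODE_W - 1       # 0.30, i.e. 30% more power

print(f"A15 node: {a15_node_w:.1f} W, "
      f"+{extra_power:.0%} power for ~2x performance")
```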
