The Results that Matter

Before you jump ahead to the charts below, we suggest taking some time to properly interpret the results. First of all, we simulate between 5 and 15 "busy" users on the web server per second. When a user clicks somewhere on the website, this can result in a few requests or tens of requests. For example, accessing the forum on the website results in two simple "GET" requests, while posting a reply results in an avalanche of 56 POSTs and GETs. That is why we report performance in "responses per second". Responses are roughly similar from a CPU load point of view if you look at a statistically large enough number of them. User actions, by contrast, are so wildly different that in some cases performing two user actions per second can require more processing power and network bandwidth than 20 user actions per second.
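To make that last point concrete, here is a minimal sketch of the arithmetic. It uses only the two request counts quoted above (2 requests for viewing the forum, 56 for posting a reply) and assumes that every request produces exactly one response. The names `REQUESTS_PER_ACTION` and `responses_per_second` are ours, chosen for illustration, and are not part of the benchmark itself.

```python
from typing import Mapping

# Requests generated by one user action, taken from the examples in the text.
REQUESTS_PER_ACTION = {
    "view_forum": 2,   # two simple GET requests
    "post_reply": 56,  # an avalanche of 56 POSTs and GETs
}


def responses_per_second(actions_per_second: Mapping[str, float]) -> float:
    """Convert a mix of user actions per second into responses per second.

    Assumption: each request results in exactly one response.
    """
    return sum(
        REQUESTS_PER_ACTION[action] * rate
        for action, rate in actions_per_second.items()
    )


# 20 light actions per second versus 2 heavy actions per second.
light = responses_per_second({"view_forum": 20})
heavy = responses_per_second({"post_reply": 2})
print(light, heavy)  # prints "40 112"
```

Two reply posts per second generate almost three times as many responses as twenty forum views per second, which is why counting user actions would tell us very little about the load on the server.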

Webhosting throughput, average over 24 web servers

At low concurrencies, the Intel machine leverages Turbo Boost and its exceptionally high per-core performance. At the higher web loads, the total throughput of the 96 ARM Cortex-A9 cores (24x quad-core SoCs) is up to 50% higher than that of the low power 32-thread/16-core (2x octa-core) Xeons. Even the mighty 2660 cannot beat the herd of ARM SoCs.

While we have lots and lots of experience with x86 servers, we had almost none with ARM-based servers, so we met up with the Calxeda engineering team and got some valuable optimization tips. It turns out that the internal switch fabric can be tuned in various ways. For example, the link speed of each node is set to 2.5Gbit/s by default, which is rather high considering that we are mostly CPU limited and use less than 0.5Gbit/s per node. Setting the link speed of each node to 1Gbit/s should lower power and still give more than enough bandwidth. We also updated to a slightly newer kernel (155) from the Calxeda kernel PPA (Personal Package Archive). This allowed us to make use of Dynamic Voltage and Frequency Scaling (DVFS, P-states) using the CPUfreq tool. First, let's see if all these power saving tweaks have reduced the total throughput.
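For readers who want to check whether DVFS is active on their own nodes, the sketch below reads the CPUfreq state of each core. It is not the procedure we used on the Calxeda nodes: it assumes a Linux kernel with a cpufreq driver loaded, which exposes per-core attributes under `/sys/devices/system/cpu/cpuN/cpufreq/`. The function name `read_cpufreq` and the `cpu_root` parameter are hypothetical and only there for illustration.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU information (assumed to be mounted).
CPU_ROOT = Path("/sys/devices/system/cpu")

# cpufreq attributes worth inspecting; frequencies are reported in kHz.
WANTED = (
    "scaling_governor",
    "scaling_cur_freq",
    "scaling_min_freq",
    "scaling_max_freq",
)


def read_cpufreq(cpu_root: Path = CPU_ROOT) -> dict:
    """Return {cpu_name: {attribute: value}} for every core exposing cpufreq.

    Cores without a cpufreq directory are skipped, so an empty result
    means that no frequency scaling driver is active (or the path is wrong).
    """
    report = {}
    for cpufreq_dir in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        values = {}
        for name in WANTED:
            attribute = cpufreq_dir / name
            if attribute.is_file():
                values[name] = attribute.read_text().strip()
        report[cpufreq_dir.parent.name] = values
    return report


if __name__ == "__main__":
    for cpu, values in read_cpufreq().items():
        print(cpu, values)
```

The sketch only reads state. Changing the governor requires root privileges and a write to `scaling_governor`, and which governor suits a web hosting load best is something to measure rather than assume.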

Webhosting throughput optimized, average over 24 web servers

The changes did not give any boost in throughput (in many cases the scores might even be slightly lower), but they might lower power use and/or response times. Let's look at that next.

Response Times

Webhosting response time optimized, average over 24 web servers

Again, the Intel machine performs better at lower concurrencies, but our ARM server delivers lower response times at high load. Our optimizations have had no effect on response times.


99 Comments


  • Madpacket - Wednesday, March 13, 2013

    And all of a sudden AMD's acquisition of SeaMicro is starting to make sense. Thanks Johan, great article!
  • JohanAnandtech - Wednesday, March 13, 2013

    I really really hope they downscale the current SeaMicro's soon. Because with a starting price at $139000, they are not catering to the typical SME :-).
  • joshv - Wednesday, March 13, 2013

    It seems this has a very narrow application in VM hosting, but I am not sure it's applicable when you have the choice of just scaling up memory or process usage of the single instance Xeon server. For example, I could load 24 instances of my production middle tier on the ARM server - or I could run one instance on a Xeon server and give it all the memory and make sure it spawns enough threads to keep all the internal cores busy. Perhaps my middle tier software has issues with handling all that RAM, so maybe I run 4 instances of it as a process, not a biggy.

    I am going to bet that the Xeon server will win as it won't have the VM overhead.
  • Kurge - Wednesday, March 13, 2013

    I would be interested in a bare metal comparison. Since you're serving up the same app why would you split it between 24 VMs on the Xeon server? It's a bit contrived.

    Just load up Server 2012 and IIS or Linux + Apache straight up on the Xeon and see how it performs.
  • MrSpadge - Wednesday, March 13, 2013

    Very interesting!

    I'd prefer a fat machine with virtualized servers to get automatic load balancing, but it's not like one couldn't shuffle tasks around in the ARM farm. And there's room for improvement: be it the next Atom or the memory controller in the current ECX-1000 CPUs. And take a look at how badly they scale from 2 to 4 threads - surely, there's lots of room left!
  • rubyl - Wednesday, March 13, 2013

    What is the average CPU utilization for the Viridis nodes and for the Xeon system under the 5 different concurrency loads (for the 24 webserver workload)?
  • gercho - Wednesday, March 13, 2013

    When you said " The next generation ARM servers are already on the way and will probably hit the market in the third quarter of this year. The "Midway" SoC is based on a 28nm (TSMC) Cortex-A15 chip. A 28nm A15 offers 50% higher single-threaded integer performance at slightly higher power levels and can address up to 16GB of RAM." As far as I know the A15 cores have 50% more performance but consume 3X more power, that's not "slightly".........
  • nofumble62 - Wednesday, March 13, 2013

    50% more performance at 3X more power... reminds me of the NetBurst architecture.
  • thenewguy617 - Wednesday, March 13, 2013

    Can you please point me to the sources for your numbers?
    Thanks
  • Wilco1 - Thursday, March 14, 2013

    Where on earth do you get that 3x from? So far no 28nm Cortex-A15 chips have been released. The A15 in the Exynos Octa uses about 1.25W per core at 1.8GHz according to Samsung. That's slightly more power than a Calxeda A9 uses per core, but the A15 gives twice the performance per core.
