Original Link: http://www.anandtech.com/show/3817/low-power-server-cpus-the-energy-saving-choice
Low Power Server CPUs: the energy saving choice?
by Johan De Gelas on July 15, 2010 4:54 AM EST
If you are just trying to get your application live on the Internet, you might be under the impression that “green IT” and “Power Usage Efficiency” are something the infrastructure guys should worry about. As a hardware enthusiast working for an internet company, you probably worry a lot more about keeping everything “in the air” than about the power budget of your servers. Still, it will have escaped no one’s attention that power is a big issue.
But instead of reiterating the clichés again, let us take a simple example. Let us say that you need a humble setup of 4 servers to run your services on the internet, each server consuming 250W when running at a high load (CPU at 60% or so). If you are based in the US, that means that you need about 10-11 amps: 1000W divided by 110V plus a safety margin. In Europe, you’ll need about 5 amps (Voltage = 230V). In Europe you’ll pay up to $100 per month per amp that you need. The US prices typically vary between $15 and $30 per amp per month. So depending on where you live, it is not uncommon to pay something like $300 to $500 per month just to feed electricity to your servers. In the worst case, you are paying up to $6000 per year ($100 x 5 x 12 months) to keep a very basic setup up and running. That is $24000 in four years. If you buy new servers every 4 years, you probably end up spending more on power than on the hardware!
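The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 15% safety margin is an assumption, since the text only says "plus a safety margin".

```python
def amps_needed(total_watts, volts, safety_margin=0.15):
    """Current to provision: P / V, plus a safety margin (15% is an assumed value)."""
    return total_watts / volts * (1.0 + safety_margin)


def yearly_cost(amps, price_per_amp_month):
    """Yearly bill when the datacenter charges per provisioned amp per month."""
    return amps * price_per_amp_month * 12


total_watts = 4 * 250                      # four servers at 250W each
us_amps = amps_needed(total_watts, 110)    # lands in the 10-11 amp range
eu_amps = amps_needed(total_watts, 230)    # about 5 amps

worst_case_year = yearly_cost(5, 100)      # $100 x 5 amps x 12 months
worst_case_four_years = 4 * worst_case_year

print(f"US: {us_amps:.1f} A, Europe: {eu_amps:.1f} A")
print(f"Worst case: ${worst_case_year} per year, ${worst_case_four_years} in four years")
```

Plugging in your own wattage, voltage and price per amp gives a quick first estimate of what the power contract will cost over the lifetime of the servers.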
Keeping an eye on power when choosing the hardware and software components is thus much more than naively following the hype of “green IT”. It is simply the smart thing to do. We take another shot at understanding how choosing your server components wisely can give you a cost advantage. In this article, we focus on low power Xeons in a consolidated Hyper-V/Windows 2008 virtualization scenario. Do Low Power Xeons save energy and costs? We designed a new and improved methodology to find out.
The new methodology
At AnandTech, giving you real world measurements has always been the goal of this site. Contrary to the vast majority of IT sites out there, we don’t believe in letting some consultant or analyst spell it out for you. We give you our measurements, as close to the real world as possible. We give you our opinion based on those measurements, but ultimately it is up to you to decide how to interpret the numbers. You tell us in our comment box if we make a mistake in our thoughts somewhere, and we will investigate it and get back to you. It is a slow process, but we firmly believe in it. And that is what happened with our articles about “dynamic power management” and “testing low power CPUs”.
The former article was written to understand how the current power management techniques work. We needed a very easy, well understood benchmark to keep the complexity down. And it allowed us to learn a lot about the current Dynamic Voltage and Frequency Scaling (DVFS) techniques that AMD and Intel use. But as we admitted, our Fritz Chess benchmark was and is not a good choice if you want to apply these new insights to your own datacenter.
“Testing low power CPUs” went much less in depth, but used a real world benchmark: our vApus Mark I, which simulates a heavy consolidated virtualization load. The numbers were very interesting, but the article had one big shortcoming: it only measured at 90-100% workload or idle. The reason for this is that the vApus benchmark score was based upon throughput. And to measure the throughput of a certain system, you have to stress it close to the maximum. So we could not measure performance accurately unless we went for the top performance. And that is fine for an HPC workload, but not for a commercial virtualization/database/web workload.
Therefore we went for a different approach based upon our readers' feedback. We launched “one tile” of the vApus benchmark on each of the tested servers. Such a tile consists of an OLAP database (4 vCPUs), an OLTP database (4 vCPUs) and two web VMs (2 vCPUs each). So in total we have 12 virtual CPUs. These 12 virtual CPUs are much less than what a typical high-end dual CPU server can offer. From the point of view of the Windows 2008, Linux or VMware ESX scheduler, the best Xeon 5600 (“Westmere”) and Opteron 6100 (“Magny-Cours”) can offer 24 logical or physical cores. To the hypervisor, those logical or physical cores are Hardware Execution Contexts (HECs). The hypervisor schedules VMs onto these HECs. Typically each of the 12 virtual CPUs needs somewhere between 50 and 90% of one core. Since we have twice as many cores or HECs as required, we expect the typical load on the complete system to hover between 25 and 45%. And although it is not perfect, this is much closer to the real world. Most virtualized servers never run idle for a long time: with so many VMs, there is always something to do. System administrators also want to avoid CPU loads over 60-70%, as this might make the response time go up exponentially.
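The 25 to 45% estimate follows directly from the vCPU and HEC counts. A minimal sketch of that calculation, assuming each of the two web VMs has 2 vCPUs (which is what makes the total 12) and using an illustrative function name:

```python
def expected_system_load(vcpus, hecs, per_vcpu_low, per_vcpu_high):
    """Rough system-wide load range when each vCPU keeps part of one core busy.

    per_vcpu_low/high are fractions of one core (0.5 means 50%).
    """
    low = vcpus * per_vcpu_low / hecs
    high = vcpus * per_vcpu_high / hecs
    return low, high


# One tile: OLAP (4 vCPUs) + OLTP (4 vCPUs) + two web VMs (assumed 2 vCPUs each)
tile_vcpus = 4 + 4 + 2 * 2

# A dual six-core Xeon with Hyper-Threading exposes 24 HECs
low, high = expected_system_load(tile_vcpus, 24, 0.50, 0.90)
print(f"Expected load: {low:.0%} to {high:.0%}")
```

This ignores hypervisor overhead and the fact that two logical cores share one physical core, so treat it as a back-of-the-envelope estimate only.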
There is more. Instead of measuring throughput, we focus on response time. At the end of the day, the number of pages that your server can maximally serve is nice to know, but not important. The response time that your system offers at a certain load is much more important. Users will appreciate low response times. Nobody is going to be happy about the fact that your server can serve up to 10,000 requests per second if each page takes 10 seconds to load.
We have reviewed the Intel Xeon X5670 before: it is the best performing Intel six-core in the 95W TDP power envelope. For comparison, we add the Intel Xeon L5640. The 32 nm “Westmere” L5640 reduces TDP to 60W, although it still has 6 cores. This chip runs at 2.26 GHz, but at lighter load it should boost itself to 2.8 GHz.
Asus RS700-E6/RS4 1U Server
Asus Z8PS-D12-1U Motherboard
Six-core Xeon L5640 2.26 GHz or Six-core Xeon X5670 2.93 GHz
6x Samsung M393B5170DZ1 - CH9 1333MHz CL9 ECC (24GB)
2x Western Digital WD1000FYPS 1TB (VM images and OS installation)
2x Intel X25-E SLC SSD 32GB (Data Oracle OLTP & Log Oracle OLTP)
Most Important BIOS Settings: (BIOS version 0701 (20/01/2010))
C1E Support: Enabled
Hardware Prefetcher: Enable
Adjacent Cache Line Prefetch: Enabled
Intel VT: Enabled
Active Intel SpeedStep Tech: Enabled
Intel TurboMode: Enable
Intel C-State Tech: Enabled
C3 State: ACPI C3
We used the Racktivity PM0816-ZB datacenter PDU to measure power.
Using a PDU for accurate power measurements might seem pretty insane, but this is not your average PDU. Measurement circuits of most PDUs assume that the incoming AC is a perfect sine wave, but it never is. The Racktivity PDU, however, measures true RMS current and voltage at a very high sample rate: up to 20,000 measurements per second for the complete PDU. We read out the current and voltage each second, which already gives us more than 4000 data points over our 70 minute long virtualization power test. As the PDU has 8 ports, this allows us to test several servers at once, which will be very handy for future reviews.
Where is AMD’s Opteron?
We did not manage to get a decent server based on AMD’s latest Opterons in the lab. The current “Magny-Cours” servers in our lab are reference motherboards running in a desktop tower. So to avoid any unfair comparison with our Xeon rack servers, we are delaying our measurements on the AMD platform until we find a way to get a real server in the lab.
We start by measuring idle power under the two most “used” Power Plans of Windows 2008 R2 Enterprise (Hyper-V enabled): Balanced and High Performance. We described both Power Plans and the resulting effect on the server here. This is the power consumption of the complete system, measured at the electrical outlet.
The Xeon family has made large steps forward in the power management department: fine grained clock gating and core power gating reduce power significantly. This, however, also results in a very small difference between the low power Xeon and the “Performance” Xeon. When running idle, the power management hardware (PCU) shuts down 5 cores and clock gates all components of the remaining core that are not necessary. The result of all these hardware tricks is that it hardly matters if you run those CPUs at 1.6 GHz or 2.26/2.93 GHz. The power plan “balanced” allows the CPU to scale back to 1.6 GHz, while the power plan “high performance” never clocks lower than the advertised clockspeed (2.26/2.93 GHz). The amazing thing is that even at the higher clockspeed and voltage, the CPU only needs 2W more at the power outlet. So the real difference at the CPU level is even lower.
Let us put some load on those servers. One tile of vApus Mark I demands 12 virtual CPUs, and as we described before, it will demand about 25-45% of the dual CPU configuration.
If we calculate the average power, everything seems to be “as expected”. However, the problem with this calculation is that some of the tests took longer than others. For example, the test on the L5640 took about 66 minutes, while the Xeon X5670 needed only 59 minutes.
And that was a real surprise to us: as we were not loading the CPU to 100%, we did not expect that one test would take so much longer than the other. But you can clearly see that the fastest Xeon went more quickly to an idle state.
So we measured the power over a period of 70 minutes (longer than the slowest test run), and calculated the real energy (power x time) consumed.
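To make the difference between average power and energy concrete, here is a minimal sketch of how one-second power samples can be turned into energy over a fixed 70 minute window. The two traces below are invented, illustrative values (a faster server that draws more but finishes after 59 minutes, a slower one that finishes after 66 minutes); they are not our measured data, and the function name is hypothetical.

```python
def energy_wh(power_samples_w, interval_s=1.0):
    """Energy = sum of (power x time) over fixed-interval samples, in watt-hours."""
    joules = sum(p * interval_s for p in power_samples_w)
    return joules / 3600.0


WINDOW_S = 70 * 60  # compare both servers over the same 70 minute window

# Illustrative traces only: load phase followed by idle until the window ends
fast = [260.0] * (59 * 60) + [160.0] * (WINDOW_S - 59 * 60)
slow = [240.0] * (66 * 60) + [158.0] * (WINDOW_S - 66 * 60)

fast_wh = energy_wh(fast)
slow_wh = energy_wh(slow)
print(f"fast: {fast_wh:.1f} Wh, slow: {slow_wh:.1f} Wh")
print(f"slow server saves {100 * (1 - slow_wh / fast_wh):.1f}% energy")
```

In this made-up example the slower server draws about 8% less power while under load, yet saves less than 4% energy over the full window, because it spends less time at idle.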
You can see that with the exception of the Xeon X5670 running with the high performance plan, there is almost no difference between the L5640 measurements and the “balanced” X5670. Now let us look at performance. We calculated the geometric mean of the number of URL/s and the number of database transactions that the server was able to process.
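For readers who want to combine their own web and database numbers in the same way, a geometric mean can be computed as sketched below. The helper name and the sample figures are illustrative assumptions, not values from our tests.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative numbers only: web throughput (URL/s) and database transactions/s
reference = {"urls_per_s": 400.0, "db_tx_per_s": 900.0}
candidate = {"urls_per_s": 360.0, "db_tx_per_s": 810.0}

# Normalizing against a reference first keeps the differing units from mattering
ratios = [candidate[k] / reference[k] for k in reference]
score = geometric_mean(ratios)
print(f"Candidate delivers {score:.0%} of the reference throughput")
```

Unlike an arithmetic mean, the geometric mean does not let the metric with the largest absolute numbers dominate the combined score.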
This graph explains why the Xeon X5670 does so well: as it is able to handle more transactions and web requests per second, it can empty the queues more quickly. When the web request and database queues are empty, the CPU can throttle back and save power. Since the idle power of the Xeon X5670 is pretty low (almost as good as the idle power of the low power version), this is a real tangible advantage.
At the end of the day, users will not complain about throughput; what they experience as disturbing is high response times. Response times are also what ends up in the SLAs. Let us see what we measured. We calculated the geometric mean of the response times of the database queries and web requests.
The results are stunning! Despite the fact that we do not max out the CPUs at all, the X5670 leaves the low power version far behind when it comes to response times. This is partly due to the fact that vApus Mark I is constructed as a CPU test. In your own datacenter you might not see the same results if you are (partly) I/O constrained, of course. But if web and database applications are well cached, the higher performing CPU can deliver tangibly lower response times.
Trading off performance and power
Let us make this clearer by looking at the percentages. The Xeon X5670 is made the reference, the 100% yardstick.
The Xeon X5670 using the “high performance” power plan is able to boost its clockspeed regularly, and this results in an 11% higher throughput than the same CPU in the “balanced” power plan. Or you could say that disabling Turbo Boost (by using the power plan “balanced”) results in a 10% throughput disadvantage. Because of the way the queues are handled, this 10% disadvantage in throughput results in a 31% higher response time. The really interesting thing is the comparison between the L5640 and the Xeon X5670 in “balanced mode”.
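The 11% and 10% figures describe the same gap seen from the two sides. A small sketch of that conversion, with a hypothetical helper name:

```python
def advantage_to_disadvantage(pct_higher):
    """If A is pct_higher % faster than B, return how many % slower B is than A."""
    return 100.0 * (1.0 - 1.0 / (1.0 + pct_higher / 100.0))


# 11% higher throughput with "high performance" vs. "balanced"
loss = advantage_to_disadvantage(11)
print(f"'balanced' delivers {loss:.1f}% less throughput")
```

The asymmetry grows with the size of the gap, so it pays to state which configuration is the 100% reference whenever percentages are compared.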
For an energy saving of only 3%, you have to content yourself with 34% higher response times. Not a good deal if you ask us. Core power gating and aggressive clock gating play a much bigger role than DVFS, and the result is that higher clocked CPUs consume hardly any more energy while offering better performance.
How useful are low power server CPUs?
We were quite surprised that the lower power CPU did not return any significant energy savings compared to the X5670. Intuition tells you that the best server CPUs like the X5670 would only pay off in a high performance environment (for example an HPC server). But human intuition is a bad guide when dealing with physics. Cold hard measurements are a much better way to make up your mind. And all our measurements point in the same direction: the fastest Xeon offers a superior performance/watt ratio in a typical virtualization consolidation scenario.
You could argue that the X5670 is quite a bit more expensive; a server equipped with dual X5670s will indeed cost you about $900 more. We admit: our comparison is not completely accurate price wise… as always we work with the CPUs that we got in the lab. But a typical server with these CPUs, 64 GB and some accessories will set you back $9000 easily. The 2.8 GHz Xeon X5560 is hardly 4% slower than the X5670, and will probably show the same favorable performance/watt ratio. And if you place an X5560 2.8 GHz instead of an L5640 2.26 GHz, you only add $200 to an $8k-$9k server. That is peanuts, lost in the noise of the TCO calculation. So the question is real: are the "Low power Xeons" (L-series) useless and should you immediately go for the X-series?
Defeated as it may be, the L5640 can still play one trump card: lower maximum currents. Over the period of our 70 minutes of testing, we decided to take a look at maximum power. To avoid that any extreme peaks would muddle up the picture, we used the 95th percentile.
Let us focus on the “balanced” power plan. The L5640 makes sure power never goes beyond 231 W, while the peak of the X5670 is almost 20% higher. As a result, a rack of low power Xeons will be able to keep the maximum current consumed lower. You could consider the low power Xeon L5640 a “power capped” version of the Xeon X5670. In many datacenters you pay a hefty fine if you briefly need more than the amp limit you have agreed upon. So the low power Xeon might save you money by guaranteeing that you never need more than a certain amperage.
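A 95th percentile over a power trace can be computed as sketched below. We use the simple nearest-rank definition here as an assumption; other percentile definitions interpolate between samples and give slightly different numbers. The trace is invented for illustration.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative trace: 97 seconds at 200W with three brief 400W spikes
trace_w = [200.0] * 97 + [400.0] * 3

p95 = percentile_nearest_rank(trace_w, 95)
print(f"max: {max(trace_w):.0f} W, 95th percentile: {p95:.0f} W")
```

The short spikes dominate the maximum but leave the 95th percentile untouched, which is exactly why the percentile gives a cleaner picture of sustained peak power.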
Translated to the datacenter
Before we draw any conclusions based upon what we learned in this testing session, we should not forget the lessons learned from previous experiments. And those tests tell us that you should avoid the low end server CPUs: those are the most leaky ones, consuming more power at idle and low load.
Our conclusion on our new energy measurement methodology is:
1. Low power Xeons save power but do not save energy in a typical Hyper-V consolidation scenario. Power is “capped”, but the total energy consumed for a certain task is (more or less) the same.
2. X-series Xeons offer a much better performance per watt ratio, but at the expense of brief power peaks. They do not necessarily need more energy in the long run than the lower power versions, and offer much better response times if your application is CPU bound.
So if you pay for the actual energy consumption, have some amp headroom left before being penalized, and performance matters to your users, the faster Xeons are the right choice. The benefit is that you can offer lower response times to your users, even when the CPU is not running at peak load! The Xeon X5670 is more flexible.
In case you pay a fixed price for a fixed amount of amps and you get heavy fines in your mailbox if you briefly breach your amp limit, buying lower power Xeons is probably the way to go.
Then again, we are not very enthusiastic about power capping at the server level. Many companies have already embraced the idea of a Dynamic Scheduled Virtualized Cluster. VMware's vSphere product is the most mature here: it is pretty easy to build a virtual cluster with dynamic power management (DPM) and scheduling (DRS). We still have to investigate this, but if this virtual cluster works well with solutions such as HP Insight Power Manager or Intel’s own power node manager, the faster Xeon will get interesting once again. The basic idea is that you should power cap your entire cluster (or rack), not one server. You should not care that one server needs a little more power than usual, but the whole cluster should not exceed the amp limits described in your contract with the datacenter. That way you can reconcile low response times with low power bills.
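The difference between capping each server and capping the whole cluster can be illustrated with a short sketch. It does not use any VMware, HP or Intel API; the function name, the amp readings and the limits are all hypothetical.

```python
def check_caps(per_server_amps, per_server_cap, cluster_cap):
    """Return (indexes of servers over their own cap, whether the cluster total is over its cap)."""
    over_server = [i for i, amps in enumerate(per_server_amps) if amps > per_server_cap]
    over_cluster = sum(per_server_amps) > cluster_cap
    return over_server, over_cluster


# Hypothetical snapshot: four servers sharing a 5 amp contract (1.25 amps each on average)
readings = [1.4, 1.0, 1.1, 1.2]

over_server, over_cluster = check_caps(readings, per_server_cap=1.25, cluster_cap=5.0)
print(f"servers over their own cap: {over_server}")
print(f"cluster over contract limit: {over_cluster}")
```

In this snapshot one server briefly exceeds its individual share, but the cluster as a whole stays within the contracted limit, so there is no reason to throttle the busy server.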
A big thanks to Dries Velle for assisting us in the Sizing Servers Lab.