AMD Power Management

Variable clock rates and CPU power management started with the Intel 386SL, but that would take us a bit too far back in history. Let's start with the introduction of the K6-2+ and the mobile Pentium III. From that moment on, both Intel and AMD have been using Dynamic Voltage and Frequency Scaling (DVFS) per CPU. DVFS has been marketed as "PowerNow!", "SpeedStep" and many other names. In a multi-core CPU this means that all the cores clock at the speed of the highest loaded core. A given clock speed requires a corresponding core voltage, so all cores also use the same voltage.

With the introduction of the family 10h Opterons (aka "Barcelona") in 2007, AMD reduced dynamic power through three different technologies:

  1. Dynamic Frequency Scaling per Core. Each core runs at its own clock.
  2. Separate power planes for the core and "uncore" part of the CPU.
  3. Clock gating at the CPU block level.

The effect of (1) on performance/watt is not a complete success story: dynamic power scales only linearly with frequency, and some OS schedulers will always try to "load balance" across the cores to avoid having one core get hot (which increases static power). As a result the power savings due to (1) are relatively small, and the lag in transitioning from one P-state to another reduces performance, as our benchmarks will confirm. AMD Opterons typically support 4-5 P-states. The Opteron "Shanghai" 2389 in this test supports 2.9, 2.3, 1.7 and 0.8GHz; the six-core Opteron 2435 supports 2.6, 2.1, 1.7, 1.4 and 0.8GHz. [2]
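You can see which P-states a core exposes directly from the operating system. Below is a minimal sketch, assuming a Linux host with the cpufreq sysfs interface; the scaling_available_frequencies file is only present with certain drivers (such as powernow-k8 or acpi-cpufreq), so paths and availability vary per kernel and platform.

```python
# Minimal sketch: list the P-states (frequency steps) the OS sees per core,
# assuming a Linux kernel that exposes the cpufreq sysfs interface.
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
    cpu = cpu_dir.split("/")[-2]                       # e.g. "cpu0"
    freqs = read(os.path.join(cpu_dir, "scaling_available_frequencies")).split()
    current = read(os.path.join(cpu_dir, "scaling_cur_freq"))
    print(f"{cpu}: P-states (kHz): {freqs}, current: {current} kHz")
```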

Separate power planes provide several benefits. The first is that the cores can go to a sleep state (C-state) while the memory controller keeps working for another external device (e.g. via DMA). Another advantage is that AMD is able to run the Northbridge and L3 cache out of sync with the cores. This lowers power significantly, while performance only decreases slightly; overall, performance/watt clearly increases.

Clock gating reduces power by 20 to 40% according to some publications [3]. This is probably the most important technology for the server market: as server code rarely executes floating-point instructions, disabling the clock to the FPU with a clock gate saves quite a bit of power. As a matter of fact, the highest power numbers are measured with floating-point intensive benchmarks like LINPACK; typical server benchmarks based on databases or web servers do not even come close. LINPACK needs 20-25% more power than our integer-based benchmarks, despite the fact that in both situations the CPU reports 100% utilization.


AMD added "Smart Fetch" to the newer "Shanghai" Opteron, which is essentially clock gating at the core level (making it new technology number four). The main goal is to let idling cores go to a "clock disabled" sleep state (AMD's C1 state) instead of a low frequency state (P-state). The problem is that snoops from the active core(s) would wake the sleeping core again and again, and those snoops would get a very slow "just woke up" answer. To avoid this, the idle core dumps the contents of its L1 and L2 caches into the L3 cache before it goes to the clock gated C1 state. This could not be done on Barcelona, as the 2MB L3 cache would fill up quickly if three cores dumped their L1 and L2 data into it. Even when three cores are clock gated, however, it is unlikely they will take the full 1.7MB away (512KB x 3 + 64KB x 3), as cache lines shared between cores are always kept inside the otherwise exclusive L3 cache of all quad-core Opterons. Clock gating at the core level reduces dynamic power to practically zero, which allows the new Opteron to save up to 5W per core.

The result is quite impressive: a quad-core Shanghai Opteron uses about 10W at idle, while a quad-core "Barcelona" Opteron uses around 25W. This is also confirmed by the measurements on desktop CPUs performed by LostCircuits. AMD still has some catching up to do, however: the six-core Opteron "Lisbon" (set to launch around March 2010) will go from C1 to a hardware-controlled C1E state.

Intel Power Management

Intel moved to pretty aggressive clock gating at the CPU block level in its "Woodcrest" server CPU in 2006. Intel also introduced cache sizing: the data kept in the L2 cache is reduced to a minimum and unused cache blocks are turned off. While Intel was an innovator when it came to block-level clock gating and cache power reductions, AMD was first with independent power planes and independent core frequencies. It shows that even in the power management race, AMD and Intel are leapfrogging each other. Intel caught up and leapfrogged AMD again when it introduced the Xeon "Nehalem" 5500 series, where core and uncore got independent power planes.

However, Intel went one step further. It not only enabled clock gating for each core, but also power gating. Clock gating only reduces dynamic power, while power gating reduces both dynamic and static (mostly leakage) power. Thanks to the built-in Power Control Unit (PCU), a dedicated on-die hardware circuit, Intel promises us that cores can go to the deep C6 sleep state while the other cores continue to work "undisturbed".


Below you can see how the operating system sees this. We asked the Windows Server 2008 kernel to tell us what ACPI state the cores use when the CPU is running completely idle. Notice that the clock speed of each logical core is reduced to 1.2GHz, another sign that the CPU is not processing anything significant.


So while the operating system asks the CPU to go to the ACPI C2 state, the PCU overrides the operating system and should force the idle cores into the deeper C6 state relatively quickly, achieving lower power consumption. In C6, the core is not only completely clock gated, it is power gated too, which means that the leakage of the idle core is reduced to almost zero. The older Xeon 5400 series was only capable of placing two cores into C6 at the same time (i.e. if only one core was idle, it couldn't enter C6). And the deeper the sleep, the slower the core wakes up. With the Nehalem architecture, Intel severely lowered the time needed to go to the C6 state and back.


The real magic of the "Nehalem" based architecture is that the integrated power gates make this transition extremely quick. Instead of 200µs [4] on the older Penryn processor (the Xeon 54xx is based on this architecture), the transition time is reduced to only 60µs. This should allow the Xeon 5500, 3500, and 3400 series to transition to C6 quickly and with a small performance impact. We will check these claims.
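If you want to see which C-states and exit latencies the OS itself reports, the Linux cpuidle framework exposes them via sysfs. A minimal sketch, assuming a kernel with cpuidle support; the state names and latency values depend on the CPU and the idle driver in use.

```python
# Minimal sketch: inspect the C-states the Linux cpuidle framework exposes for
# core 0, including the advertised exit latency and how often each state was entered.
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    name = read(os.path.join(state_dir, "name"))
    latency = read(os.path.join(state_dir, "latency"))  # advertised exit latency in microseconds
    usage = read(os.path.join(state_dir, "usage"))      # number of times this state was entered
    print(f"{name}: exit latency {latency} µs, entered {usage} times")
```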

The latest Intel Xeons have lots of P-states: one for each 133MHz speed bin from 1.2GHz up to the maximum advertised clock speed. In other words, every 133MHz multiple between the lowest frequency P-state and the highest frequency P-state is a valid P-state.
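As a quick illustration, take a hypothetical 2.93GHz Xeon: with a 133MHz base clock that corresponds to multiplier ratios 9 (1.2GHz) through 22 (2.93GHz), or fourteen P-states in total. The multiplier range below is an assumption for the example, not a value taken from a datasheet.

```python
# Enumerate the P-state bins of a hypothetical 2.93GHz Xeon with a 133.33MHz base clock.
BASE_CLOCK_MHZ = 400.0 / 3.0          # ~133.33 MHz

LOW_RATIO, HIGH_RATIO = 9, 22         # 9 x 133MHz = 1.2GHz, 22 x 133MHz = 2.93GHz (assumed range)

p_states_mhz = [round(ratio * BASE_CLOCK_MHZ) for ratio in range(HIGH_RATIO, LOW_RATIO - 1, -1)]
print(len(p_states_mhz), "P-states:", p_states_mhz)
# -> 14 P-states: [2933, 2800, 2667, 2533, 2400, 2267, 2133, 2000, 1867, 1733, 1600, 1467, 1333, 1200]
```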

Below you will find an overview of AMD's and Intel's techniques to reduce power while processing.


Intel supports lots of P-states but makes much less use of them than AMD. Despite the fact that the infrastructure is there (each core has its own PLL), Intel doesn't generally run cores at different clock speeds. Sanjay Sharma of Intel explains:

 

In the steady state, all active cores run at the same frequency, which equals the highest requested frequency of any of the active cores. When there is a frequency change request from one core that results in a change in the resolved frequency, all cores will change to that new resolved frequency. However, not all cores will necessarily change frequency at the same time, since the instruction stream on each core needs to reach an end of macro instruction boundary before it can change frequency. If a core is running a very long instruction when the frequency change request arrives, that core will change frequency later than the other cores that reached the interruptible point sooner. As a result, for very short time periods, it is possible that cores could be running at different frequencies.
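In other words, the steady-state rule boils down to "the resolved frequency is the highest frequency any active core requests". Below is a minimal sketch of that resolution rule; the function and example values are purely illustrative, not Intel's actual PCU logic.

```python
def resolve_frequency(requested_mhz):
    """Steady-state rule: all active cores run at the highest requested frequency."""
    active = [f for f in requested_mhz if f is not None]  # None = core is idle/asleep
    return max(active) if active else None

# Core 0 requests 2933MHz, core 1 requests 1600MHz, cores 2 and 3 are idle:
print(resolve_frequency([2933, 1600, None, None]))  # -> 2933
```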

 

The most likely reason why Intel does not let cores run at different clock speeds for prolonged periods is that all cores share one voltage plane: the voltage has to stay at the level needed for the highest clock speed, so running some cores slower would save little power. AMD has some catching up to do as well, as the lowest C-state of an idle Opteron core is only C1. This situation will improve when the improved Magny-Cours and Lisbon Opterons arrive, as those CPUs will support a C1E state like their notebook siblings.

Comments

  • JohanAnandtech - Tuesday, January 19, 2010 - link

    Well, Oracle has a few downsides when it comes to this kind of testing. It is not very popular in the small and medium business space AFAIK (our main target), and we still haven't worked out why it performs much worse on Linux than on Windows. So choosing Oracle is a sure way to make the project time explode... IMHO.
  • ChristopherRice - Thursday, January 21, 2010 - link

    Works worse on Linux than Windows? You likely have a setup issue with the kernel parameters or within Oracle itself. I actually don't know of any enterprise location that uses Oracle on Windows anymore. "Generally all RHEL4/RHEL5/Sun".
  • TeXWiller - Monday, January 18, 2010 - link

    The 34xx series supports four quad-rank modules, giving a maximum supported amount of 32GB per CPU (and board) today. The 24GB limit is that of the three-channel controller with unbuffered memory modules.
  • pablo906 - Monday, January 18, 2010 - link

    I love Johan's articles. I think this has some implications for how virtualization solutions may be the most cost effective. When you're running at 75% capacity on every server, I think the AMD solution could possibly become more attractive. I think I'm going to have to do some independent testing in my datacenter with this.

    I'd like to mention that focusing on VMware is a disservice to virtualization technology as a whole. It would be like not having benchmarked the K6-3+ just because P2s and Celerons were the mainstream and SS7 boards weren't quite up to par. There are situations, primarily virtualizing Linux, where Citrix XenServer is a better solution. Also, many people who are buying Server '08 licenses are getting Hyper-V licenses bundled in for "free."

    I've known several IT directors in very large health care organizations who are deploying a mixed Hyper-V/XenServer environment because of the "integration" between the two. Many of the people I've talked to at events around the country are using this model for at least part of their virtualization deployments. I believe it would be important to publish to the industry what kind of performance you can expect from such deployments.

    You can do some really interesting HomeBrew SAN deployments with OpenFiler or OpeniSCSI that can compete with the performance of EMC, Clarion, NetApp, LeftHand, etc. NFS deployments I've found can bring you better performance and manageability. I would love to see some articles about the strengths and weaknesses of the storage subsystem used and how it affects each type of deployment. I would absolutely be willing to devote some datacenter time and experience with helping put something like this together.

    I think this article lends itself really well to tying in with the virtualization talks, and I would love to see more comments on what you think this means to someone with a small, medium, or large datacenter.
  • maveric7911 - Tuesday, January 19, 2010 - link

    I'd personally prefer to see KVM over XenServer. Even Red Hat is ditching Xen for KVM. In the environments I work in, Xen is actually being decommissioned in favor of VMware.
  • JohanAnandtech - Tuesday, January 19, 2010 - link

    I can see the theoretical reasons why some people are excited about KVM, but I still don't see the practical ones. Who is using this in production? Getting Xen, VMware or Hyper-V to do their job is pretty easy; KVM does not even seem to be close to beta. It is hard to get working, and it is nowhere near Xen when it comes to reliability. Admittedly, those are our first impressions, but we are no virtualization rookies.

    Why do you prefer KVM?
  • VJ - Wednesday, January 20, 2010 - link

    "It is hard to get working, and it nowhere near to Xen when it comes to reliabilty. "

    I found Xen (a separate kernel boot at the time) more difficult to work with than KVM (a kernel module), so I'm thinking that the particular (host) platform you're using (Windows?) may be geared towards one or the other.

    If you had to set it up yourself then that may explain reliability issues you've had?

    On Fedora Linux, it shouldn't be more difficult than Xen.
  • Toadster - Monday, January 18, 2010 - link

    One of the new technologies released with the Xeon 5500 (Nehalem) is Intel Intelligent Power Node Manager, which controls P/T-states within the server CPU. This is a good article on existing P/C-states, but will you guys be doing a review of the newer control technologies as well?

    http://communities.intel.com/community/openportit/...
  • JohanAnandtech - Tuesday, January 19, 2010 - link

    I don't think it is "newer". Going to C6 for idle cores is less than a year old, remember :-).

    It seems to be a sort of manager which monitors the electrical input (PDU based?) and then lowers the P-states to keep the power at a certain level. Did I miss something? (I only glanced at it quickly.)

    Personally, I think HP is more onto something by capping the power inside their server management software. But I still have to evaluate both. We will look into that.
  • n0nsense - Monday, January 18, 2010 - link

    Maybe I missed something in the article, but from what I see at home, the Core 2 Quad (and Core 2 Duo) can manage frequencies per core.
    I'm not sure it is possible under Windows, but in Linux it just works this way. You can actually see each core at its own frequency.
    Moreover, you can select for each core which frequency it should run at.
