AMD Power Management

Variable clock rates and CPU power management started with the Intel 386SL, but that would take us a bit too far back in history. Let's start with the introduction of the K6-2+ and the mobile Pentium III. From that moment on, both Intel and AMD have been using Dynamic Voltage and Frequency Scaling (DVFS) per CPU, marketed as "PowerNow!", "SpeedStep" and many other names. In a multi-core CPU this means that all the cores clock at the speed of the highest loaded core. A clock speed requires a corresponding core voltage, so all cores also run at the same voltage.

With the introduction of the Family 10h CPUs (aka "Barcelona") in 2007, AMD reduced dynamic power with three different technologies:

  1. Dynamic Frequency Scaling per Core. Each core runs at its own clock.
  2. Separate power planes for the core and "uncore" part of the CPU.
  3. Clock gating at the CPU block level.

The effect of (1) on performance/watt is not a complete success story: dynamic power scales only linearly with frequency, and some OS schedulers will always try to "load balance" across the cores to avoid having one core get hot (which increases static power). As a result the power savings due to (1) are relatively small, and the lag in transitioning from one P-state to another reduces performance, as our benchmarks will confirm. AMD Opterons typically support 4-5 P-states. The Opteron "Shanghai" 2389 in this test supports 2.9, 2.3, 1.7 and 0.8GHz; the six-core Opteron 2435 supports 2.6, 2.1, 1.7, 1.4 and 0.8GHz. [2]
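These P-states are simply what the operating system sees through its cpufreq interface. As a quick sanity check of what a given machine exposes, here is a minimal sketch for Linux (assuming a cpufreq driver such as powernow-k8 or acpi-cpufreq is loaded; the sysfs attribute names are the standard cpufreq ones, though scaling_available_frequencies is not provided by every driver):

    # List the P-state frequencies each core exposes via the Linux cpufreq sysfs
    # interface. Values in sysfs are reported in kHz.
    import glob
    import os

    for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
        freq_dir = os.path.join(cpu_dir, "cpufreq")
        avail = os.path.join(freq_dir, "scaling_available_frequencies")
        cur = os.path.join(freq_dir, "scaling_cur_freq")
        if not os.path.exists(avail):
            continue  # no cpufreq support, or a driver without this attribute
        with open(avail) as f:
            pstates = [int(x) / 1e6 for x in f.read().split()]   # kHz -> GHz
        with open(cur) as f:
            current = int(f.read()) / 1e6
        print(f"{os.path.basename(cpu_dir)}: P-states {pstates} GHz, now {current:.1f} GHz")

On the Opteron 2389 above this should simply list 2.9, 2.3, 1.7 and 0.8GHz for each of the four cores.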

Separate power planes provide several benefits. The first is that the cores can go to sleep (a C-state) while the memory controller keeps working for another external device (e.g. via DMA). Another advantage is that AMD can run the northbridge and L3 cache asynchronously, at a lower clock than the cores. This lowers power significantly while performance only decreases slightly, so overall performance/watt clearly increases.

Clock gating reduces power by 20 to 40% according to some publications [3]. This is probably the most important technology for the server market: as server code rarely executes floating point instructions, disabling the clock to the FPU with a clock gate saves quite a bit of power. As a matter of fact, the highest power numbers are measured with floating point intensive benchmarks like LINPACK; typical server benchmarks based on databases or web servers do not even come close. LINPACK needs 20-25% more power than our integer based benchmarks, despite the fact that the CPU reports 100% utilization in both cases.


AMD added "Smart Fetch" to the newer "Shanghai" Opteron, which is essentially clock gating at the core level (making it new technology number four). The main goal is to let idling cores go to a "clock disabled" sleep state (AMD's C1 state) instead of merely a low frequency state (P-state). The problem is that cache snoops from the active core(s) might wake the sleeping core up too quickly, and those snoops would get a very slow "just woke up" answer. To avoid this, the idle core dumps the contents of its L1 and L2 caches into the L3 cache before it goes to the clock gated C1 state. This could not be done on Barcelona, as the 2MB L3 cache would fill up quickly if three cores dumped their L1 and L2 data into it. However, it is important to remark that even when three cores are clock gated, it is unlikely they will take the full 1.7MB away (512 KB * 3 + 64 KB * 3), as cachelines shared between the cores are always kept inside the otherwise exclusive L3 cache of all quad-core Opterons. Clock gating at the core level reduces dynamic power to zero, which allows the new Opteron to save up to 5W per core.

That is quite impressive: a quad-core "Shanghai" Opteron uses about 10W at idle, while a quad-core "Barcelona" Opteron uses around 25W. This is also confirmed by the measurements on desktop CPUs performed by LostCircuits. Still, AMD has some catching up to do: the six-core Opteron "Lisbon" (set to launch around March 2010) will be able to go from C1 to the hardware controlled C1E state.

Intel Power Management

Intel moved to pretty aggressive clock gating at the CPU block level in its "Woodcrest" server CPU in 2006. Intel also introduced cache sizing: the data that needs to stay in the L2 cache is consolidated into a minimum number of cache blocks, and the unused blocks are turned off. While Intel was an innovator when it came to block level clock gating and cache power reductions, AMD was first with independent power planes and independent core frequencies. It shows that even in the power management race, AMD and Intel are leapfrogging each other. Intel caught up and leapfrogged AMD again when it introduced the Xeon "Nehalem" 5500 series, where core and uncore got independent power planes.

However, Intel went one step further. It not only enabled clock gating for each core, but also power gating. Clock gating only reduces dynamic power, while power gating reduces both dynamic and static (mostly leakage) power. Thanks to the built-in Power Control Unit (PCU), a dedicated on-die controller, Intel promises us that cores can go to the deepest C6 sleep state while other cores continue to work "undisturbed".
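To see why that distinction matters, consider a crude back-of-the-envelope model of core power (the constants below are purely illustrative, not measurements of any real CPU): clock gating stops the switching activity but leaves the core powered and leaking, while power gating also cuts the supply and with it the leakage.

    # Toy model: dynamic power ~ activity * C * V^2 * f, static power ~ V * I_leak.
    def core_power(activity, capacitance, voltage, freq_hz, i_leak):
        dynamic = activity * capacitance * voltage ** 2 * freq_hz
        static = voltage * i_leak
        return dynamic + static

    C, V, F, I_LEAK = 1e-9, 1.2, 2.93e9, 3.0    # hypothetical per-core values

    busy        = core_power(1.0, C, V, F, I_LEAK)
    clock_gated = core_power(0.0, C, V, F, I_LEAK)      # C1: no toggling, still leaking
    power_gated = core_power(0.0, C, 0.0, 0, I_LEAK)    # C6: supply cut, leakage ~0

    print(f"busy {busy:.1f} W, clock gated {clock_gated:.1f} W, power gated {power_gated:.1f} W")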


Below you can see how the operating system sees this. We asked the Windows 2008 kernel to tell us what ACPI state the cores use when the CPU is running completely idle. Notice that the clock speed of each logical core is reduced to 1.2GHz, another sign that the CPU is not processing anything significant.


So while the operating system asks the CPU to go to the ACPI C2 state, the PCU overrides the operating system's orders and should force the idle cores to go relatively quickly to C6, achieving lower power consumption. In C6, the core is not only completely clock gated, it is power gated too, which means the leakage of the idle core is reduced to almost zero. The older Xeon 5400 series was only capable of placing two cores into C6 at the same time (i.e. if only one core was idle, it couldn't enter C6). And the deeper the sleep, the slower the core wakes up; Intel significantly lowered the time needed to enter and leave C6 in the Nehalem architecture.
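On a Linux system you can watch how much time each core actually spends in these deeper sleep states through the cpuidle sysfs interface. A minimal sketch (which states show up, C1, C3, C6 and so on, depends on the CPU and on the cpuidle driver):

    # Print per-core C-state residency from the Linux cpuidle sysfs interface.
    import glob
    import os

    for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
        states = []
        for state_dir in sorted(glob.glob(os.path.join(cpu_dir, "cpuidle", "state*"))):
            with open(os.path.join(state_dir, "name")) as f:
                name = f.read().strip()
            with open(os.path.join(state_dir, "time")) as f:
                usec = int(f.read())            # total residency in microseconds
            states.append(f"{name}={usec / 1e6:.1f}s")
        if states:
            print(os.path.basename(cpu_dir), " ".join(states))

On an idle system the bulk of the residency should accumulate in the deepest state the driver exposes.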


The real magic of the "Nehalem" based architecture is that the integrated power switch makes this transition extremely fast. Instead of 200µs [4] on the older Penryn processor (the Xeon 54xx is based on this architecture), the transition time is reduced to only 60µs. This should allow the Xeon 5500, 3500, and 3400 series to transition quickly to C6 with only a small performance impact. We will check these claims.

The latest Intel Xeons have lots of P-states: one for each 133MHz speed bin from 1.2GHz up to the maximum advertised clock speed. In other words, every 133MHz multiplier step between the lowest frequency P-state and the highest frequency P-state is a valid P-state.
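For example, for a hypothetical 2.93GHz part that means every multiple of the roughly 133MHz base clock between multiplier 9 (1.2GHz) and multiplier 22 (2.93GHz) is a selectable P-state (the multipliers here are illustrative; the exact range differs per model):

    # Enumerate the P-states of a hypothetical 2.93GHz Nehalem-class Xeon:
    # every multiple of the ~133MHz base clock between ratio 9 and ratio 22.
    BCLK_MHZ = 133.33
    LOW_RATIO, HIGH_RATIO = 9, 22     # illustrative; real ratios vary per model

    for ratio in range(LOW_RATIO, HIGH_RATIO + 1):
        print(f"ratio {ratio:2d} -> {ratio * BCLK_MHZ / 1000:.2f} GHz")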

Below you will find an overview of AMD's and Intel's techniques to reduce power while processing.


Intel supports lots of P-states but makes much less use of them than AMD. Despite the fact that the infrastructure is there (each core has its own PLL), Intel doesn't generally run cores at different clock speeds. Sanjay Sharma of Intel explains:

 

In the steady state, all active cores run at the same frequency, which equals the highest requested frequency of any of the active cores. When there is a frequency change request from one core that results in a change in the resolved frequency, all cores will change to that new resolved frequency. However, not all cores will necessarily change frequency at the same time, since the instruction stream on each core needs to reach an end of macro instruction boundary before it can change frequency. If a core is running a very long instruction when the frequency change request arrives, that core will change frequency later than the other cores that reached the interruptible point sooner. As a result, for very short time periods, it is possible that cores could be running at different frequencies.
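In other words, the resolved frequency is simply the highest frequency requested by any core that is not asleep. A toy model of that rule (purely illustrative; the function and the numbers are made up for this example):

    # Toy model of the frequency resolution Sharma describes: all active cores
    # end up at the highest frequency requested by any core that is not asleep.
    def resolved_frequency(requests_ghz, sleeping):
        active = [f for f, asleep in zip(requests_ghz, sleeping) if not asleep]
        return max(active) if active else None

    # Core 0 asks for 2.66GHz but sits in a C-state; the others ask for 1.6GHz:
    print(resolved_frequency([2.66, 1.6, 1.6, 1.6], [True, False, False, False]))   # 1.6
    # As soon as core 0 gets work, every active core jumps to 2.66GHz:
    print(resolved_frequency([2.66, 1.6, 1.6, 1.6], [False, False, False, False]))  # 2.66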

 

The most likely reason why Intel does not let cores run at different clock speeds for prolonged periods is that all cores share one voltage, which has to stay at the level required for the highest clock speed; lowering only the clock of one core therefore saves relatively little power. AMD in turn has some catching up to do, as the deepest C-state of an idle core is only C1. This will improve when the improved Magny-Cours and Lisbon Opterons arrive, as those CPUs will support a C1E state like their notebook siblings.

Comments

  • JohanAnandtech - Monday, January 18, 2010 - link

    In which utility do you set/manage the frequency of a separate core?
  • n0nsense - Monday, January 18, 2010 - link

    GNOME panel applets: the CPU frequency monitor, which I guess uses cpufreq. Each instance monitors one core, so I have 4 of them visible all the time. If you have enabled CPU frequency scaling (kernel) then you can select the governor (performance, ondemand, conservative, etc.) or a static frequency. I can do it for each core, and it displays what I have set.
    Of course the processor should support frequency scaling (PowerNow! and SpeedStep).
    Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use the ondemand governor by default when a processor with frequency scaling is available. No user intervention required.
  • jordanclock - Monday, January 18, 2010 - link

    I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
  • VJ - Tuesday, January 19, 2010 - link

    These people seem to be convinced of per-core Speedstep:

    https://bugs.launchpad.net/ubuntu/+source/linux-so...

    Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
  • n0nsense - Monday, January 18, 2010 - link

    I heard that in the past, but I still tend to believe my eyes :)
    While writing this reply, I saw every possible combination. My Q9300 has 2 states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
  • VJ - Tuesday, January 19, 2010 - link

    If there are only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.

    The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.

    So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).

    It would be very strange if both cores were operating at different frequencies, each somewhere between its lowest and highest frequency.

    From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.



  • n0nsense - Tuesday, January 19, 2010 - link

    Nice theory :)
    But in this case, I see that each core is doing something. htop shows each core at around 15% usage. So the only options left are:
    1. Each core's frequency can be controlled independently on C2D and C2Q (maybe i3/i5/i7 too).
    2. The OS is completely unaware of what's going on :) (which is less likely).
  • mino - Thursday, January 21, 2010 - link

    "The OS is completely unaware of what's going on" is the right answer.
    :)

    BTW, the only x86 CPUs able to change frequency per core are >=K10 for AMD and >=Nehalem for Intel.
  • VJ - Tuesday, January 19, 2010 - link

    Not to defeat your argument/observations, rather for completeness' sake:

    It's also possible that the differences are an artifact of reading the attributes. If the attributes are read in succession, the apparent differences may simply reflect the different moments at which they were read, while at any given instant (notwithstanding the brief differences in frequency described in this article) all cores are operating at the same frequency.

    There's a lot of time at the bottom.
  • JanR - Tuesday, January 19, 2010 - link

    Hi,

    I completely agree to this:

    "It's also possible that the differences are due to the reading of the attributes."

    The point is that desktop usage together with ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario to decide on "per core" vs "per CPU". We did a lot of testing the following way:

    Put load on all cores using "taskset" (this avoids C-states). Switch to the "userspace" governor and then set the frequencies of individual cores differently. You have one control per core, but the actual hardware decides what really happens; you can check this in /proc/cpuinfo or by using a tool such as "mhz" from lmbench as the load generator (this one calculates the actual frequency based on CPI and time, and it also allows measurement of turbo frequencies).

    Trying around, the results are:

    AMD K8: One clock domain, maximum of the requested frequencies is taken

    Intel Core2 Duo: Same as K8

    AMD K10: Individual clock domains, you can clock each core individually

    Intel Core 2 Quad: TWO clock domains! These CPUs are two dual core dies glued together, so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies, but you can clock the two dies independently.

    Intel Nehalem: One clock domain, maximum of the requests of all cores that are not in a C-state! If you set one core to, e.g., 2.66 GHz and all others to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used; they all switch to 2.66 if you put load on that core.

    So much for our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace governor, manual settings). If you then enable ondemand, the system switches quickly between different states and any reading is just a snapshot, maybe taken in the middle of a transition.

    Greetings,

    Jan
