How Much Power?

All this hardcore testing just made us more curious. Would we be able to determine how much power the PCU of Nehalem actually saves? Let's add a little more machine code to our hardware C-state scripts. The MSR 3FCh contains the info we need. We test once again with two active chess threads.
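As an illustration of what reading such a residency counter can look like, here is a minimal sketch. It assumes a Linux host with the `msr` kernel module loaded and root access, which is not necessarily the setup used for the measurements in this article. The address 3FCh comes from the text above; treating it as the per-core C3 residency counter, 3FDh as the C6 counter, and 10h as the time stamp counter are assumptions taken from Intel's MSR naming for this CPU generation. The function names are hypothetical.

```python
import os
import struct

# Assumed MSR addresses; only 0x3FC is mentioned in the article itself.
MSR_TSC = 0x10                 # time stamp counter (assumed "Clockticks" source)
MSR_CORE_C3_RESIDENCY = 0x3FC  # ticks the core spent in hardware C3 (assumption)
MSR_CORE_C6_RESIDENCY = 0x3FD  # ticks the core spent in hardware C6 (assumption)

MASK_64 = (1 << 64) - 1


def read_msr(cpu, address):
    """Read one 64-bit MSR on a logical CPU through /dev/cpu/<n>/msr."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver interprets the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def counter_delta(before, after):
    """Difference between two samples of a 64-bit counter, tolerating wraparound."""
    return (after - before) & MASK_64


def residency_percent(state_before, state_after, tsc_before, tsc_after):
    """Share of the elapsed clock ticks that the core spent in one C-state."""
    elapsed = counter_delta(tsc_before, tsc_after)
    if elapsed == 0:
        return 0.0
    return 100.0 * counter_delta(state_before, state_after) / elapsed
```

To use it, one would sample `read_msr(cpu, MSR_TSC)` and the two residency counters before and after the benchmark run and feed the four values per state into `residency_percent`.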

PCU Sleep State Comparison
          Clockticks    Ticks spent in C3   Ticks spent in C6   Percentage C3   Percentage C6
Core 1    2961889630    3497984             71450624            0.12%           2.41%
Core 2    2989850634    4128768             768581632           0.14%           25.71%
Core 3    3022277437    186195968           1032536064          6.16%           34.16%
Core 4    3033988899    171286528           387645440           5.65%           12.78%
Average                                                         3.02%           18.76%

At first you may think that these measurements contradict our previous measurements even though they were measured in the same circumstances (two active threads + one measurement thread). But if you calculate how much time the cores spend on average in C6, you get 19%, in the same ballpark as our previous measurement (21%). Notice that the PCU forces the Xeon cores to move quickly from C3 to a deeper C6 sleep: only 3% (!) is spent in C3.
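The percentages and averages can be recomputed from the raw tick counts. The sketch below uses only the numbers from the PCU table; the function names are ours, and the average is a simple unweighted mean over the four cores, which is what the table's bottom row appears to use.

```python
# Tick counts copied from the PCU Sleep State Comparison table.
# Each tuple: (total clock ticks, ticks in C3, ticks in C6).
CORES = {
    "Core 1": (2961889630, 3497984, 71450624),
    "Core 2": (2989850634, 4128768, 768581632),
    "Core 3": (3022277437, 186195968, 1032536064),
    "Core 4": (3033988899, 171286528, 387645440),
}


def residency(cores):
    """Return {core: (pct_c3, pct_c6)} as percentages of that core's total ticks."""
    return {
        name: (100.0 * c3 / total, 100.0 * c6 / total)
        for name, (total, c3, c6) in cores.items()
    }


def average(per_core):
    """Unweighted mean of the per-core C3 and C6 percentages."""
    n = len(per_core)
    avg_c3 = sum(c3 for c3, _ in per_core.values()) / n
    avg_c6 = sum(c6 for _, c6 in per_core.values()) / n
    return avg_c3, avg_c6


per_core = residency(CORES)
avg_c3, avg_c6 = average(per_core)

for name, (c3, c6) in per_core.items():
    print(f"{name}: C3 {c3:.2f}%  C6 {c6:.2f}%")
print(f"Average: C3 {avg_c3:.2f}%  C6 {avg_c6:.2f}%")
```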

So this means that the ACPI C2 state consists of 13.85% C3 (3.02 / (3.02 + 18.76)) and 86.15% C6 (18.76 / (3.02 + 18.76)). Let's take the ACPI readings again.

ACPI C-State Comparison
CPU             % idle   C1    C2    C3
Opteron 2435    86       100   0     0
Xeon L3426      81       7     93    0
Opteron 2389    72.44    100   0     0

So now we can calculate how much time the CPU actually spent in the real hardware C-states.

% time spent in C1 = 7% of 81% idle = +/- 5.7%

The "software" ACPI C2 states are mapped by the Xeon CPU to two "hardware CPU" states:

  1. % time spent in C3 = 13.85% out of 93% ACPI C2, at 81% idle = +/- 10.3%
  2. % time spent in C6 = 86.15% out of 93% ACPI C2, at 81% idle = +/- 65%

So our two threads of Chess caused the L3426 cores to spend:

  • 19% in C0
  • 5.7% in C1
  • 10.3% in C3
  • 65% (!) in C6

…on average.
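The mapping from ACPI readings to hardware C-states can be written down as a short calculation. This is a sketch with a hypothetical function name; the inputs are the Xeon L3426 row of the ACPI table and the average C3 and C6 percentages from the PCU table. Because it starts from the rounded averages, its C3 result differs from the list above by roughly a tenth of a percentage point.

```python
def hardware_breakdown(idle_pct, acpi_c1_pct, acpi_c2_pct, hw_c3_pct, hw_c6_pct):
    """Split total time into C0/C1/C3/C6 percentages.

    idle_pct      -- % of total time the cores were idle
    acpi_c1_pct   -- % of idle time reported as ACPI C1
    acpi_c2_pct   -- % of idle time reported as ACPI C2
    hw_c3_pct     -- average % of ticks in hardware C3
    hw_c6_pct     -- average % of ticks in hardware C6
    """
    idle = idle_pct / 100.0
    # Fraction of ACPI C2 time that the hardware actually spent in C3.
    c3_share = hw_c3_pct / (hw_c3_pct + hw_c6_pct)
    # ACPI C2 time expressed as a percentage of total time.
    c2_time = idle * acpi_c2_pct
    return {
        "C0": 100.0 - idle_pct,
        "C1": idle * acpi_c1_pct,
        "C3": c2_time * c3_share,
        "C6": c2_time * (1.0 - c3_share),
    }


states = hardware_breakdown(81, 7, 93, 3.02, 18.76)
for name, pct in states.items():
    print(f"{name}: {pct:.1f}%")
```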

What effect would this have on the power consumption of the chip? Intel gives us a good idea of what each C-state consumes with the Xeon X3400 series. In the thermal specifications and design guidelines [6] we find this table.


Intel does not give us C1 power, but let's assume it is 25W on the L3426; our industry sources tell us this should be close enough. If the complex circuitry of the PCU were not available, the CPU would be limited to the C1 state to save power: other C-states would only be available if all cores were idle or the system was idle. We assume that C0 consumes 45W, which is not far from the truth either, as the CPUs with low TDP tend to be quicker.

Total power w/o PCU
= 45W * 19% (C0) + 25W * 81% (C1)
= 28.8W
Total power with PCU
= 45W * 19% (C0) + 25W * 5.7% (C1) + 17W * 10.3% (C3) + 4W * 65% (C6)
= 14.3W
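The same weighted sum is easy to reproduce, which also makes it simple to try other assumptions. In the sketch below the per-state wattages are the ones used in the calculation above (45W and 25W being this article's own assumptions for C0 and C1); the function name is hypothetical. Evaluated with these inputs, the PCU case lands a little above 14W.

```python
# Per-state power in watts, as used in the calculation above.
# C0 and C1 are the article's assumptions, not Intel specifications.
POWER_W = {"C0": 45.0, "C1": 25.0, "C3": 17.0, "C6": 4.0}


def average_power(residency_pct, power_w=POWER_W):
    """Weighted average power for a {state: % of time} residency profile."""
    return sum(power_w[state] * pct / 100.0 for state, pct in residency_pct.items())


# Without a PCU, all idle time is assumed to be spent in C1.
without_pcu = average_power({"C0": 19.0, "C1": 81.0})

# With the PCU, idle time is split over C1, C3 and C6.
with_pcu = average_power({"C0": 19.0, "C1": 5.7, "C3": 10.3, "C6": 65.0})

print(f"Without PCU: {without_pcu:.1f}W")
print(f"With PCU:    {with_pcu:.1f}W")
print(f"Per core with PCU: {with_pcu / 4:.1f}W")
```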

The absolute numbers are not that important, but our simplified calculation shows that, because the PCU forces the CPU to go very quickly to C6, the "Lynnfield" Xeon morphs from a rather mediocre low power CPU into a "real" low power CPU. 14W for four complex out-of-order cores is very impressive: less than 4W per core! Intel's claims are justified: the PCU enables the "Nehalem" based cores to run in a deep sleep C6 state, even if other cores are hard at work. To end with an interesting note: even with four threads active on the Xeon L3426, we found that the cores spent 11% of the time in C6.


35 Comments


  • n0nsense - Monday, January 18, 2010 - link

    Here is what system sees ...
    only one is 2.5, other three are 2.0 :)

    nons ~ # cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 2497.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 0
    cpu cores : 4
    apicid : 0
    initial apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.38
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 1
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 1
    cpu cores : 4
    apicid : 1
    initial apicid : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 7012.69
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 2
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 2
    cpu cores : 4
    apicid : 2
    initial apicid : 2
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.08
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 3
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 3
    cpu cores : 4
    apicid : 3
    initial apicid : 3
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.09
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:
  • VJ - Tuesday, January 19, 2010 - link

    These are mobile CPUs, however:

    With Linux on a Latitude (Intel T7200 or T7500), CPU Frequency Scaling Monitor allows one to scale the frequency of one core to its max while leaving the other core at its minimum.

    With an AMD TL62, this is not possible. The induced scaling of one core causes the frequency of the other core to follow.

    With an AMD ZM84 this is possible. Just like with the Latitude, one can have one core at its max with the other core at its minimum.

    Maybe what's shown is not what's taking place.

    Additionally;

    http://www.intel.com/technology/itj/2006/volume10i...

    "For example, in a Dual-Processor system, when the OS decides to reduce the frequency of a single core, the other core can still run at full speed. In the Intel Core Duo system, however, lowering the frequency to one core slows down the other core as well."


  • VJ - Tuesday, January 19, 2010 - link

    Additionally, AMD's ZM84 allows each core to operate at a different frequency. The lowest frequency is 575MHz while the highest is 2300MHz.

    I can set one core to 1150MHz with the other set at 2300MHz. This is different from the Intel (Mobile) CPUs I've come across, where a difference in frequency between cores is only possible when one core is (seemingly) operating at its lowest frequency (in a dual core system).

    What is also interesting in the aforementioned cpuinfo output is that only one core is running at its max frequency while all (3) other cores are (seemingly) at their minimum frequency. Considering my previous conjecture on C2 and C0 states, it would be surprising if one can show cpuinfo output where 2 cores are running at max frequency while the other 2 cores are running at any frequency other than max frequency. That shouldn't be possible at all.

  • valnar - Thursday, May 6, 2010 - link

    Does anyone know if this kind of power management for Lynnfield processors is available in Windows 2003?
  • hshen1 - Sunday, June 23, 2013 - link

    This is really a good article for power management researchers like me!!
