Saving Power at Low Load

Measuring idle power is important in some applications, as operating system schedulers may choose to "race to idle", i.e. perform the task as quickly as possible so the CPU can return to an idle state. This strategy is only worthwhile if the idle state consumes very little power. However, many server applications run at a relatively low load that is almost never zero; one example is a web server with visitors from all around the globe. It is therefore just as interesting to see how the processors deal with this kind of situation. We ran Fritzmark with two threads to see how the operating system and hardware cope. First we look at the delivered performance.

Fritzmark integer processing: 2 thread performance

In performance mode, the Xeon L3426 can push its clock speed up to 2.66GHz, but not always: its performance equals that of a similar Xeon running at 2.5GHz. This is in contrast with the Xeon X3470, which can almost always keep its clock speed at 3.33GHz and therefore delivers the performance of a Xeon that always runs at that speed. The reason for this difference is that the PCU of the L3426 has less headroom: the chip cannot dissipate more than 45W, while the X3470 is allowed to dissipate up to 95W. Still, the performance boost is quite impressive: Turbo Boost offers 34% better performance on the L3426 compared to the "normal" 1.86GHz clock.
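The 34% figure follows from the two clock speeds quoted above. A minimal sketch of that arithmetic, assuming integer performance scales linearly with clock speed (the function name `turbo_gain` is ours, purely for illustration):

```python
def turbo_gain(effective_ghz: float, base_ghz: float) -> float:
    """Relative performance gain, assuming performance scales linearly with clock."""
    return effective_ghz / base_ghz - 1.0


# The L3426 performs like a Xeon at 2.5GHz; its non-turbo clock is 1.86GHz.
gain = turbo_gain(2.5, 1.86)
print(f"Turbo Boost gain: {gain:.0%}")
```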

Now let's compare the performance levels with the power consumption.

integer processing: 2 threads

The six-core Opteron is clearly a better choice than its faster clocked quad-core sibling. In power saving mode it is capable of reducing the power by 8W more while offering the same level of performance. This is a small surprise: do not forget that the "Istanbul" Opteron has twice as many idle cores leaking power as the "Shanghai" CPU.

The Nehalem based core offers very high performance per thread, about 40% higher than what the Opteron's architecture can achieve, but this comes at a price, as we see power shoot up very quickly. Part of the reason is of course that the Nehalem is more efficient at idle. We assume - based on early component level power measurements - that the idle power of the Xeons is about 9W (power plan Balanced) and that of the Opterons about 14W (power plan Balanced). Note that the exact numbers are not really important. Since the RAM is hardly touched, we assume that its power rises by only 1W per DIMM on average. Based on these assumptions we can estimate CPU + VRM power, measured at the outlet.

System Power Estimates

| System | Power Calculation: CPU + VRM Power | Notes |
|---|---|---|
| Xeon X3470 performance | 119W - 4W (4 x 1W per DIMM) - 60W idle + 13W CPU = 68W | Idle power of system was 73W = 13W CPU, 60W for the rest of the system |
| Xeon L3426 performance | 99W - 4W - 60W + 11W = 46W | |
| Xeon L3426 | 90W - 4W - 60W + 9W = 35W | |
| Opteron 2435 performance | 102W - 4W - 70W idle + 18W = 42W | Total idle power was 88W, 18W CPU |
| Opteron 2435 balanced | 100W - 4W - 70W idle + 14W = 40W | |
| Opteron 2389 performance | 114W - 4W - 70W idle + 22W = 62W | |

First of all, you might be surprised that the Turbo Boosted L3426 needs 46W. Don't forget this is measured at the power outlet, so 46W at 90% efficiency means that the CPU + VRMs got 41W delivered. Yes, these numbers are not entirely accurate, but that is not the point. Our component level power measurements still need some work, but we have reason to assume that the numbers above are close enough to draw some conclusions.
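For readers who want to redo the estimate with their own measurements, here is a minimal sketch that reproduces the arithmetic of the three Xeon rows as written in the table, followed by the 90% efficiency conversion mentioned above. The class, field and function names are hypothetical; the input numbers are taken from the table and the 1W per DIMM and 90% efficiency values are the assumptions stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    load_w: float        # system power under load, measured at the outlet
    rest_idle_w: float   # idle power of the system without the CPU
    cpu_idle_w: float    # assumed CPU idle power for the power plan used
    dimms: int = 4
    dimm_delta_w: float = 1.0  # assumed extra power per DIMM under this load


def cpu_vrm_at_outlet(m: Measurement) -> float:
    """CPU + VRM estimate at the outlet, following the table's row arithmetic."""
    return m.load_w - m.dimms * m.dimm_delta_w - m.rest_idle_w + m.cpu_idle_w


def delivered(outlet_w: float, efficiency: float = 0.90) -> float:
    """Power reaching CPU + VRMs for a given outlet figure and efficiency."""
    return outlet_w * efficiency


rows = {
    "Xeon X3470 performance": Measurement(119, 60, 13),
    "Xeon L3426 performance": Measurement(99, 60, 11),
    "Xeon L3426": Measurement(90, 60, 9),
}
estimates = {name: cpu_vrm_at_outlet(m) for name, m in rows.items()}

for name, watts in estimates.items():
    print(f"{name}: {watts:.0f}W at the outlet, "
          f"about {delivered(watts):.0f}W delivered")
```

Keeping the DIMM count, per-DIMM increase and efficiency as parameters makes it easy to see how sensitive the conclusions are to those assumptions.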

  1. AMD's platform consumes a bit too much at idle, but...
  2. The six-core Opteron CPUs are much more efficient than the quad-core in these circumstances
  3. Intel's 95W Xeons offer stellar performance but the high IPC requires quite a bit of power
  4. The low power versions offer an excellent performance / Watt ratio

So if we take the platform out of the picture, the low power Xeon with Turbo Boost consumes about the same as the "normal" six-core Opteron, but performance is 16% better. Is this a success or a failure? Did Intel's Power Controller Unit save a considerable amount of power? Or in other words, would the power of the Xeons be much higher if they didn't have a PCU? Let's dive deeper.


35 Comments


  • JohanAnandtech - Monday, January 18, 2010 - link

    In which utility do you set/manage the frequency of a separate core?
  • n0nsense - Monday, January 18, 2010 - link

    Gnome panel applets. The CPU frequency monitor, which I guess uses cpufreq. Each instance monitors one core, so I have 4 of them visible all the time. If you have enabled CPU frequency scaling (kernel) then you can select the governor (performance, ondemand, conservative etc.) or a static frequency. I can do it for each core, and it displays what I have set.
    Of course the processor should support frequency scaling (PowerNow! and SpeedStep).
    Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use the ondemand governor by default when a processor with frequency scaling is available. No user intervention required.
  • jordanclock - Monday, January 18, 2010 - link

    I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
  • VJ - Tuesday, January 19, 2010 - link

    These people seem to be convinced of per-core Speedstep:

    https://bugs.launchpad.net/ubuntu/+source/linux-so...

    Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
  • n0nsense - Monday, January 18, 2010 - link

    I have heard that in the past, but I still tend to believe my eyes :)
    While writing this reply, I saw every possible combination. My Q9300 has 2 states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
  • VJ - Tuesday, January 19, 2010 - link

    If there's only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.

    The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.

    So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).

    It would be very strange if both cores are operating greater than their lowest and less than their highest frequencies at different frequencies.

    From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.



  • n0nsense - Tuesday, January 19, 2010 - link

    Nice theory :)
    But in this case, I see that each core is doing something. htop shows each core at somewhere around 15% usage. So the only options left are:
    1. Each core's frequency can be controlled independently on C2D and C2Q (maybe i3, i5 and i7 too)
    2. The OS is completely unaware of what's going on :) (which is less likely)
  • mino - Thursday, January 21, 2010 - link

    "The OS is completely unaware of whats going on" is the right answer.
    :)

    BTW, the only x86 CPUs able to change frequency per core are >=K10 for AMD and >=Nehalem for Intel.
  • VJ - Tuesday, January 19, 2010 - link

    Not to defeat your argument/observations, rather for completeness' sake:

    It's also possible that the differences are due to the reading of the attributes. If the attributes are read in succession, then it's possible that the differences are due to the time of reading the attributes, while at any given instant, notwithstanding the allowable subtle differences in frequency described in this article, all cores are operating at the same frequency.

    There's a lot of time at the bottom.
  • JanR - Tuesday, January 19, 2010 - link

    Hi,

    I completely agree with this:

    "It's also possible that the differences are due to the reading of the attributes."

    The point is that desktop usage together with the ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario for deciding between "per core" and "per CPU". We did a lot of testing in the following way:

    Put load on all cores using "taskset" (this avoids C-states). Switch to the "userspace" governor and then set the frequencies of individual cores differently. You have one control per core, but the actual hardware decides what really happens. You can check this in /proc/cpuinfo or by using a tool such as "mhz" from lmbench as load generator (this one calculates the actual frequency based on CPI and time, and it also allows measurement of turbo frequencies).

    After trying things out, the results are:

    AMD K8: One clock domain, maximum of the requested frequencies is taken

    Intel Core2 Duo: Same as K8

    AMD K10: Individual clock domains, you can clock each core individually

    Intel Core 2 Quad: TWO clock domains! These CPUs are two dual core dies glued together, so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies, but you can clock the two dies independently.

    Intel Nehalem: One clock domain, maximum of the requests of all cores that are not in a C-state! If you set one core to, e.g., 2.66 GHz and all others to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used; they all switch to 2.66 if you put load on that core.

    So much for our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace, manual settings). If you then enable ondemand, the system switches quickly between different states, and looking at it gives just a snapshot, maybe taken in the middle of a transition.

    Greetings,

    Jan
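The clock-domain behaviour reported in the comment above can be illustrated with a small model: every core in a domain runs at the highest frequency requested by a core in that domain, and on Nehalem cores sitting in a C-state do not get a vote. This is only a sketch of those reported findings, not a description of how the hardware or the Linux cpufreq driver is implemented. The function name, the core numbering and the assignment of core indices to dies are assumptions made for illustration.

```python
from typing import FrozenSet, List


def effective_mhz(requested: List[int],
                  domains: List[List[int]],
                  idle: FrozenSet[int] = frozenset()) -> List[int]:
    """Return the clock each core ends up with.

    requested: frequency (MHz) requested per core, indexed by core number
    domains:   groups of core numbers that share one clock
    idle:      cores in a C-state whose requests are ignored
    """
    result = [0] * len(requested)
    for domain in domains:
        voters = [c for c in domain if c not in idle]
        # Assumption: if the whole domain is idle, fall back to all requests.
        voters = voters or domain
        clock = max(requested[c] for c in voters)
        for c in domain:
            result[c] = clock
    return result


request = [2000, 2500, 2000, 2000]

# Core 2 Quad: two dies, each with its own multiplier (die layout assumed).
c2q = effective_mhz(request, [[0, 1], [2, 3]])

# K10: every core is its own clock domain.
k10 = effective_mhz(request, [[0], [1], [2], [3]])

# Nehalem: one domain; core 0 asks for 2660 but is in a C-state.
nehalem_idle = effective_mhz([2660, 1600, 1600, 1600], [[0, 1, 2, 3]],
                             idle=frozenset({0}))
nehalem_busy = effective_mhz([2660, 1600, 1600, 1600], [[0, 1, 2, 3]])

print(c2q, k10, nehalem_idle, nehalem_busy)
```

A K8 or Core 2 Duo corresponds to a single domain containing all cores, with no cores excluded from the vote.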
