Dynamic Power Management: A Quantitative Approach

Author: Johan De Gelas

by Johan De Gelas on January 18, 2010 2:00 AM EST

Posted in
IT Computing


Not So Fast!

Power management, especially dynamic voltage and frequency scaling (DVFS), does come with a performance cost. Since its introduction, both Intel and AMD have claimed that this performance cost is "negligible", but we all know better now. On the dual-core Athlon X2 and Phenom I, for example, it was impossible to use DVFS and still get decent HD-video decoding. There are three important performance problems with dynamic power management:

Transitioning from one P-state to another takes a while, especially if you scale up.
Active cores will probe idle or lower P-state cores quite frequently.
The OS power manager has to predict whether or not a process will need more processing power soon. As a result, the OS transitions a lot more slowly than the hardware.

Suppose that the OS decides that the CPU can clock down to a lower P-state. Just a few ms later, a running process requires a lot more performance. The result is that the voltage must be increased, and this takes a while. During that time, the CPU is wasting more power than it should: processing is suspended for a short time, and the clock speed cannot increase until the higher voltage is reached and is stable enough. If this scenario is repeated a lot, the small power savings of going to a lower P-state will be overshadowed by the power losses of scaling quickly back up to a higher clock and voltage. It is important to understand that each voltage increase results in a short period where power is wasted without any processing happening. The same problem applies to entering a C-state: enter it too quickly and performance is lowered, as it takes some time to wake that core up again.
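
To make this trade-off concrete, here is a minimal sketch of the break-even arithmetic. It assumes a deliberately simple model that does not come from any vendor documentation: the lower P-state saves a constant amount of power, and each trip back up to the higher P-state wastes a fixed amount of energy. The function names and every wattage and energy figure are hypothetical and chosen only for illustration.

```python
def break_even_residency_s(p_high_w, p_low_w, transition_energy_j):
    """Minimum time to stay in the lower P-state before the round trip saves energy.

    p_high_w / p_low_w: power draw in the higher and lower P-state (watts).
    transition_energy_j: energy wasted while ramping back up (joules).
    """
    saved_per_second = p_high_w - p_low_w
    if saved_per_second <= 0:
        raise ValueError("the lower P-state must draw less power")
    return transition_energy_j / saved_per_second


def net_energy_saved_j(p_high_w, p_low_w, transition_energy_j, residency_s):
    """Energy saved by the lower P-state minus the cost of scaling back up."""
    return (p_high_w - p_low_w) * residency_s - transition_energy_j


# Hypothetical numbers: 60 W vs 45 W, 0.03 J wasted per upward transition.
threshold_s = break_even_residency_s(60.0, 45.0, 0.03)
short_stay_j = net_energy_saved_j(60.0, 45.0, 0.03, residency_s=0.001)
long_stay_j = net_energy_saved_j(60.0, 45.0, 0.03, residency_s=0.010)
```

With these made-up numbers the break-even residency is about 2 ms: a 1 ms stay in the lower P-state costs energy overall, while a 10 ms stay saves some.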

The second problem, probing, is a bit more subtle: if you lower the P-state of one core, another core that sends a snoop towards this "slow" core will get a much slower answer. As a result, the performance of the active core will be lower. According to some researchers [5], this performance decrease is about 5% at 800MHz on a "Barcelona" Opteron. If P-states could go as low as 400MHz, the performance impact would be 30% and more! That is the reason why lower P-states are not used: a core with P-states lower than 800MHz would wreak havoc on the performance/watt ratio of the CPU. That is also why "Smart Fetch" dumps the L1 and L2 caches into the L3 cache. This not only avoids waking the idle core up too soon, but also avoids the performance hit associated with snooping a "napping" core. Intel's CPUs do not have this problem: the inclusive nature of the L3 cache means that if data cannot be found in the L3 cache, you will not find that data in any core's L1 or L2 caches.

The bottom line is that power management is quite complex: there is no silver bullet. Go to low/idle states too quickly and you end up burning more power while delivering less performance. At the same time, if the OS keeps the clock speed too high, the CPU might never achieve decent power savings. The OS must take into account the most likely behavior of the application and the capabilities of the hardware.


35 Comments


JohanAnandtech - Monday, January 18, 2010 - link
In which utility do you set/manage the frequency of a separate core?
n0nsense - Monday, January 18, 2010 - link
GNOME panel applets: the CPU Frequency Monitor, which I guess uses cpufreq. Each instance monitors one core, so I have four of them visible all the time. If you have enabled CPU frequency scaling in the kernel, then you can select a governor (performance, ondemand, conservative, etc.) or a static frequency. I can do it for each core, and it displays what I have set.
Of course, the processor has to support frequency scaling (PowerNow! or SpeedStep).
Most mainstream distributions (Ubuntu, Sabayon, Fedora) use the ondemand governor by default when a processor with frequency scaling is available. No user intervention is required.
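
For readers who want to see what such an applet reports, here is a minimal sketch that reads the per-core cpufreq attributes from sysfs on Linux. It assumes the kernel exposes `scaling_governor` and `scaling_cur_freq` under `/sys/devices/system/cpu/cpuN/cpufreq/`; the function name `read_cpufreq` and its `sysfs_root` parameter are made up for this example. Note that it shows what the OS reports per core, which is not necessarily what the hardware is doing.

```python
import re
from pathlib import Path


def read_cpufreq(sysfs_root="/sys/devices/system/cpu"):
    """Return {core_index: (governor, reported_khz)} for every core exposing cpufreq."""
    readings = {}
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        match = re.fullmatch(r"cpu(\d+)", cpu_dir.name)
        cpufreq = cpu_dir / "cpufreq"
        if match is None or not cpufreq.is_dir():
            continue  # not a core directory, or frequency scaling is unavailable
        governor = (cpufreq / "scaling_governor").read_text().strip()
        reported_khz = int((cpufreq / "scaling_cur_freq").read_text())
        readings[int(match.group(1))] = (governor, reported_khz)
    return readings
```

Calling `read_cpufreq()` on a machine with frequency scaling enabled returns one entry per core. The files are read one after another, so the values are not a single atomic snapshot.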
jordanclock - Monday, January 18, 2010 - link
I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
VJ - Tuesday, January 19, 2010 - link
These people seem to be convinced of per-core Speedstep:

https://bugs.launchpad.net/ubuntu/+source/linux-so...

Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
n0nsense - Monday, January 18, 2010 - link
I have heard that in the past, but I still tend to believe my eyes :)
While writing this reply, I saw every possible combination. My Q9300 has two states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
VJ - Tuesday, January 19, 2010 - link
If there are only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.

The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.

So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).

It would be very strange if both cores were operating at different frequencies that are both above their lowest and below their highest frequency.

From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.
n0nsense - Tuesday, January 19, 2010 - link
Nice theory :)
But in this case, I see that each core is doing something. htop shows each core at around 15% usage. So the only options left are:
1. Each core's frequency can be controlled independently on C2D and C2Q (maybe i3, i5 and i7 too)
2. The OS is completely unaware of what's going on :) (which is less likely)
mino - Thursday, January 21, 2010 - link
"The OS is completely unaware of what's going on" is the right answer.
:)

BTW, the only x86 CPUs able to change frequency per core are K10 and later for AMD and Nehalem and later for Intel.
VJ - Tuesday, January 19, 2010 - link
Not to defeat your argument/observations, rather for completeness' sake:

It's also possible that the differences are due to the reading of the attributes. If the attributes are read in succession, then it's possible that the differences are due to the time of reading the attributes, while at any given instant, notwithstanding the allowable subtle differences in frequency described in this article, all cores are operating at the same frequency.

There's a lot of time at the bottom.
JanR - Tuesday, January 19, 2010 - link
Hi,

I completely agree with this:

"It's also possible that the differences are due to the reading of the attributes."

The point is that desktop usage together with the ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario for deciding between "per core" and "per CPU". We did a lot of testing in the following way:

Put load on all cores using "taskset" (this avoids C-states). Switch to the "userspace" governor and then set the frequencies of individual cores differently. You have one control per core, but the actual hardware decides what really happens. You can check this in /proc/cpuinfo or by using a tool such as "mhz" from lmbench as the load generator (it calculates the actual frequency based on CPI and time, and it also allows measurement of turbo frequencies).
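
The procedure above could be scripted roughly as follows. This is only a sketch under several assumptions: a Linux system whose cpufreq driver offers the userspace governor, root privileges for the sysfs writes, and `os.sched_setaffinity` standing in for "taskset". The helper names `request_frequency` and `busy_loop_on` are hypothetical, and the busy loop is a crude substitute for a real load generator such as "mhz".

```python
import os
import time
from pathlib import Path


def request_frequency(cpu, khz, sysfs_root="/sys/devices/system/cpu"):
    """Select the userspace governor for one core and request a fixed frequency.

    Writing to the real sysfs tree needs root. The hardware may still override
    the request, which is exactly what the test is meant to reveal.
    """
    cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    (cpufreq / "scaling_governor").write_text("userspace\n")
    (cpufreq / "scaling_setspeed").write_text(f"{khz}\n")


def busy_loop_on(cpu, seconds):
    """Pin the calling process to one core and spin so the core stays out of C-states."""
    os.sched_setaffinity(0, {cpu})  # Linux-only, comparable to launching under taskset
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass
```

In practice one would start `busy_loop_on` in a separate process per core, call `request_frequency` with a different value for each core, and then measure the actual frequencies while the load is still running.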

After some experimenting, the results are:

AMD K8: One clock domain, maximum of the requested frequencies is taken

Intel Core2 Duo: Same as K8

AMD K10: Individual clock domains, you can clock each core individually

Intel Core 2 Quad: TWO clock domains! These CPUs are two dual-core dies glued together, so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies, but you can clock the two dies independently.

Intel Nehalem: One clock domain, maximum of the requests of all cores that are not in a C-state! If you set one core to, e.g., 2.66 GHz and all others to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used; they all switch to 2.66 if you put load on that core.
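
The findings above amount to a simple rule: within one clock domain, every core runs at the maximum of the requested frequencies. The sketch below models that rule. It is a toy model of the reported behavior, not a description of how any CPU implements it; the function name, the core numbering per die, and the fallback used when every core in a domain is idle are all assumptions.

```python
def effective_frequencies(requested_mhz, domains, idle=frozenset(), idle_cores_vote=True):
    """Map each core to the frequency its clock domain ends up running at.

    requested_mhz: {core: requested MHz}
    domains: list of lists of cores that share one clock
    idle: cores currently in a C-state
    idle_cores_vote: False ignores the requests of idle cores (Nehalem in the findings)
    """
    result = {}
    for domain in domains:
        voters = [c for c in domain if idle_cores_vote or c not in idle]
        if voters:
            freq = max(requested_mhz[c] for c in voters)
        else:
            # Assumption: with every core idle, report the lowest request.
            freq = min(requested_mhz[c] for c in domain)
        for core in domain:
            result[core] = freq
    return result


# Core 2 Quad: two dies, assumed here to hold cores (0, 1) and (2, 3).
c2q = effective_frequencies({0: 2500, 1: 2000, 2: 2000, 3: 2000}, [[0, 1], [2, 3]])

# Nehalem: one domain, the idle core 0 does not get a vote.
nehalem_idle = effective_frequencies(
    {0: 2660, 1: 1600, 2: 1600, 3: 1600}, [[0, 1, 2, 3]],
    idle={0}, idle_cores_vote=False,
)
```

In this model the Core 2 Quad example ends with cores 0 and 1 at 2500 MHz and cores 2 and 3 at 2000 MHz, and the Nehalem example stays at 1600 MHz on all cores until core 0 leaves its C-state. K8 and Core 2 Duo correspond to a single domain containing every core, and K10 to one domain per core.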

So much for our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace governor, manual settings). If you then enable ondemand, the system switches quickly between different states, and looking at it gives you just a snapshot, maybe taken in the middle of a transition.

Greetings,

Jan
