How Does Power Management Work?

The BIOS settings, the power manager of the operating system, the hardware circuits on the CPU, monitoring hardware, sensor banks... when I first started reading about power management, the picture quickly became chaotic. Let's make some sense out of it.

It all starts with ACPI, the Advanced Configuration and Power Interface. In 1996, the three most influential companies in the PC world (Intel, HP, and Microsoft) together with Toshiba and Phoenix standardized power management by presenting the ACPI Specification. ACPI defines which registers a piece of hardware should have available, and what information the BIOS/firmware should offer: these are the red pieces in the graph below.


The most important information can be found in the ACPI tables, which describe the capabilities of the different devices of the platform. Once the kernel has read and interpreted them, the role of the BIOS is over. This is in sharp contrast with the older, BIOS-driven power management systems such as APM (Advanced Power Management) that we used before ACPI, where for example CPU power management was completely controlled by the BIOS.

The basic idea behind ACPI-based power management is that unused or lightly used devices should be put into lower power states. You can even place the entire system in a low-power state (sleeping state) when possible. The ACPI system states are probably the best known ACPI states:

  • S0: Working
  • S1: Processor idle and in a low-power state but still powered; RAM powered
  • S2: Processor in a deep sleep; RAM powered; most devices in lower power states
  • S3: CPU in a deep sleep; RAM still powered; devices in their lowest power states; also known as "Standby"
  • S4: RAM no longer powered; the disk holds an image of the RAM contents; also known as "Hibernate"
  • S5: Soft power off

We translated the ACPI system states to their most popular implementations; the standards are actually a bit vague... or flexible if you like. You can find more details in the latest ACPI specification (revision 4.0, June 16, 2009). Windows Server 2008 R2, the operating system used in this article, uses the older ACPI 3.0 standard. ACPI 3.0 made it possible for different CPUs to enter different power states.
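The system states listed above can be captured in a small lookup table. This is only a sketch of the article's own summary, not of the ACPI specification itself; the class name, field names, and the helper `resumes_from_ram` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    name: str
    ram_powered: bool  # does RAM keep its contents in this state?
    summary: str


# One entry per ACPI system state, following the article's summary.
SYSTEM_STATES = {
    s.name: s
    for s in (
        SystemState("S0", True, "working"),
        SystemState("S1", True, "processor idle, low power"),
        SystemState("S2", True, "processor in deep sleep"),
        SystemState("S3", True, "standby"),
        SystemState("S4", False, "hibernate (RAM image on disk)"),
        SystemState("S5", False, "soft power off"),
    )
}


def resumes_from_ram(name: str) -> bool:
    """True for the sleeping states that keep RAM powered (S1 to S3)."""
    state = SYSTEM_STATES[name]
    return state.ram_powered and name != "S0"
```

With this table, `resumes_from_ram("S3")` is true and `resumes_from_ram("S4")` is false, which is the practical difference between "Standby" and "Hibernate".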

The boss of ACPI-based power management is the power management component of the kernel. The kernel power manager handles the devices' power policy, calculates and commands the required processor power state transitions, and so on. Of course, a kernel component does not have to know every specific detail of each different device. Focusing on the CPUs, the power manager will for example send the right P-state to a specific processor driver: in the case of Windows Server 2008 R2, this is either intelppm.sys or amdppm.sys. The processor driver will direct the hardware to enter the P-state requested by the kernel. This mostly happens by writing to model-specific registers, the famous MSRs. So it's clear that the CPU driver contains architecture-specific code.
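The division of labor described above (a generic kernel power manager that picks the P-state, and a vendor-specific driver that turns it into a register write) can be sketched as a toy model. Everything here is hypothetical: the class names, the register names `TOY_PERF_CTL_A` and `TOY_PSTATE_CTL_B`, and the encodings are invented for illustration and do not reflect how intelppm.sys, amdppm.sys, or any real MSR works.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ProcessorDriver(ABC):
    """Architecture-specific layer; records the register writes it would make."""

    def __init__(self) -> None:
        self.register_writes: List[Tuple[str, int]] = []

    @abstractmethod
    def encode(self, pstate: int) -> Tuple[str, int]:
        """Translate a P-state number into a (register, value) pair."""

    def enter_pstate(self, pstate: int) -> None:
        # A real driver would write an MSR here; the toy model only logs it.
        self.register_writes.append(self.encode(pstate))


class ToyVendorADriver(ProcessorDriver):
    def encode(self, pstate: int) -> Tuple[str, int]:
        return ("TOY_PERF_CTL_A", pstate << 8)  # made-up encoding


class ToyVendorBDriver(ProcessorDriver):
    def encode(self, pstate: int) -> Tuple[str, int]:
        return ("TOY_PSTATE_CTL_B", pstate)  # made-up encoding


class KernelPowerManager:
    """Generic layer: decides on a P-state, knows nothing about registers."""

    def __init__(self, drivers: Dict[int, ProcessorDriver]) -> None:
        self.drivers = drivers  # CPU id -> driver for that CPU

    def request_pstate(self, cpu: int, pstate: int) -> None:
        self.drivers[cpu].enter_pstate(pstate)
```

The point of the split is that `KernelPowerManager` stays identical whichever driver is plugged in; only `encode` differs per vendor.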

Processor states

There are two kinds of processor states: P-states and C-states. P-states are described as performance states; each P-state corresponds with a certain clock speed and voltage. P-states could also be called processing states: contrary to C-states, a core in a P-state is actively processing instructions.
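Because each P-state pairs a clock speed with a voltage, a rough feel for the savings comes from the common approximation that dynamic (switching) power scales with V² · f. The sketch below applies that approximation to a made-up P-state table. The frequencies and voltages are illustrative assumptions, not measurements of any CPU in this article, and leakage power is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    freq_mhz: int
    volts: float


# Hypothetical table: P0 is the fastest state, higher numbers are slower.
PSTATES = [
    PState("P0", 2660, 1.20),
    PState("P1", 2000, 1.05),
    PState("P2", 1600, 0.95),
]


def relative_dynamic_power(state: PState, reference: PState) -> float:
    """Dynamic power of `state` relative to `reference`, using P ~ V^2 * f."""
    voltage_ratio = state.volts / reference.volts
    frequency_ratio = state.freq_mhz / reference.freq_mhz
    return voltage_ratio ** 2 * frequency_ratio


if __name__ == "__main__":
    for p in PSTATES:
        print(p.name, round(relative_dynamic_power(p, PSTATES[0]), 2))
```

Since voltage enters squared, lowering voltage together with frequency saves far more than lowering frequency alone, which is why P-states change both.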

With the exception of C0, C-states are sleep/idle states: there is no processing whatsoever. We will not go into the details as Hardware Secrets has written a very comprehensive article on C-states. We will give you a quick overview of the ACPI standard C-states, and then immediately look at the actual implementation of those C-states in modern CPUs. The ACPI standard only defines four CPU power states from C0 to C3:

  1. C0 is the state where the P-state transitions happen: the CPU is processing.
  2. C1 halts the CPU. There is no processing, but the CPU's own hardware management determines whether there will be any significant power savings. All ACPI compliant CPUs must have a C1 state.
  3. C2 is optional, also known as "stop clock". While most CPUs stop "a few" clock signals in C1, most clocks are stopped in C2.
  4. C3 is also known as "sleep": all clocks in the CPU are stopped completely.

The actual result of each ACPI C-state is not defined. It depends on the power management hardware that is available on the platform and the CPU. For example, all Intel Xeons of the past years support an enhanced C1 state (C1E), which is entered automatically if the CPU stays in C1 for a while. Modern CPUs will not only stop the clock in C3, but also move to deeper sleep states (C4, C5, and C6) and drop the voltage of the CPU. The C1E, C4, C5, and C6 states are only known to the hardware; the operating system sees them as ACPI C2 or C3. We will discuss this in more detail further on in this article. Before we go into more detail on how the CPUs actually handle these C- and P-states, let's see what we assembled in our labs for testing purposes.


35 Comments

  • JohanAnandtech - Monday, January 18, 2010

    In which utility do you set/manage the frequency of a separate core?
  • n0nsense - Monday, January 18, 2010

    Gnome panel applets: the CPU frequency monitor, which I guess uses cpufreq. Each instance monitors one core, so I have 4 of them visible all the time. If you have enabled CPU frequency scaling (kernel), then you can select the governor (performance, ondemand, conservative, etc.) or a static frequency. I can do it for each core, and it displays what I have set.
    Of course the processor should support frequency scaling (PowerNow! and SpeedStep).
    Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use the ondemand governor by default when a processor with frequency scaling is available. No user intervention required.
  • jordanclock - Monday, January 18, 2010

    I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
  • VJ - Tuesday, January 19, 2010

    These people seem to be convinced of per-core SpeedStep:

    https://bugs.launchpad.net/ubuntu/+source/linux-so...

    Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
  • n0nsense - Monday, January 18, 2010

    I heard that in the past, but I still tend to believe my eyes :)
    While writing this reply, I saw every possible combination. My Q9300 has 2 states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
  • VJ - Tuesday, January 19, 2010

    If there are only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.

    The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.

    So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).

    It would be very strange if both cores are operating greater than their lowest and less than their highest frequencies at different frequencies.

    From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.



  • n0nsense - Tuesday, January 19, 2010

    Nice theory :)
    But in this case, I see that each core is doing something. htop shows each core at somewhere around 15% usage. So the only options left are:
    1. Each core's frequency can be controlled independently on C2D and C2Q (maybe i3, i5, and i7 too)
    2. The OS is completely unaware of what's going on :) (which is less likely)
  • mino - Thursday, January 21, 2010

    "The OS is completely unaware of what's going on" is the right answer.
    :)

    BTW, the only x86 CPUs able to change frequency per core are >= K10 for AMD and >= Nehalem for Intel.
  • VJ - Tuesday, January 19, 2010

    Not to defeat your argument/observations, rather for completeness' sake:

    It's also possible that the differences are due to the reading of the attributes. If the attributes are read in succession, then it's possible that the differences are due to the time of reading the attributes, while at any given instant, notwithstanding the allowable subtle differences in frequency described in this article, all cores are operating at the same frequency.

    There's a lot of time at the bottom.
  • JanR - Tuesday, January 19, 2010

    Hi,

    I completely agree with this:

    "It's also possible that the differences are due to the reading of the attributes."

    The point is that desktop usage together with the ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario for deciding between "per core" and "per CPU". We did a lot of testing the following way:

    Put load on all cores using "taskset" (this avoids C-states). Switch to the "userspace" governor and then set the frequencies of individual cores differently. You have one control per core, but the actual hardware decides what really happens. You can check this in /proc/cpuinfo or by using a tool such as "mhz" from lmbench as the load generator (it calculates the actual frequency based on CPI and time, and it also allows measurement of turbo frequencies).

    Trying around, the results are:

    AMD K8: One clock domain, maximum of the requested frequencies is taken

    Intel Core2 Duo: Same as K8

    AMD K10: Individual clock domains, you can clock each core individually

    Intel Core 2 Quad: TWO clock domains! These CPUs are two dual-core dies glued together, so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies, but you can clock the two dies independently.

    Intel Nehalem: One clock domain, maximum of requests of all cores that are not in C-state! If you set one core to, e.g., 2.66 GHz and all other to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used, they all switch to 2.66 if you put load on that core.

    So much for our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace governor, manual settings). If you then enable ondemand, the system switches fast between different states, and looking at it gives just a snapshot, maybe taken in the middle of a transition.

    Greetings,

    Jan
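The clock-domain findings reported in the last comment can be summarized as one rule: every core in a domain runs at the maximum frequency requested within that domain, and on Nehalem idle cores do not get a vote. Below is a minimal sketch of that rule as the comment describes it, not a model of real hardware. The function name, the core numbering (which cores share a die on a Core 2 Quad), and the fallback used when every core in a domain is idle are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def effective_frequencies(
    requested: Dict[int, int],
    domains: List[List[int]],
    idle: FrozenSet[int] = frozenset(),
    ignore_idle: bool = False,
) -> Dict[int, int]:
    """Map each core to the frequency (MHz) its clock domain ends up with."""
    result: Dict[int, int] = {}
    for domain in domains:
        voters = [c for c in domain if not (ignore_idle and c in idle)]
        if not voters:
            # Assumption: if every core in the domain is idle, all requests count.
            voters = list(domain)
        freq = max(requested[c] for c in voters)
        for core in domain:
            result[core] = freq
    return result


# K8 / Core 2 Duo: one domain, the highest request wins.
print(effective_frequencies({0: 1000, 1: 2000}, [[0, 1]]))

# Core 2 Quad: two dies, assumed here to hold cores (0, 1) and (2, 3).
print(effective_frequencies({0: 2000, 1: 2500, 2: 2000, 3: 2000}, [[0, 1], [2, 3]]))

# Nehalem: one domain, but the request of an idle core (core 0) is ignored.
nehalem = {0: 2660, 1: 1600, 2: 1600, 3: 1600}
print(effective_frequencies(nehalem, [[0, 1, 2, 3]], idle=frozenset({0}), ignore_idle=True))
```

A K10-style CPU is the degenerate case with one core per domain, for example `[[0], [1], [2], [3]]`, where every request is honored as is.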
