Power Management

Power management of previous big.LITTLE SoCs from Samsung was disappointing as it showed little signs of optimizations for efficiency and a general of attention to detail. The Exynos 7420 improves on this in several areas, some which are tied to the 14nm improvement and others which are tied to software improvements.

Modern ARM CPU’s power management works in a few different ways. Firstly, DVFS (Dynamic Voltage and Frequency Scaling) mechanisms try to optimize power efficiency by running the lowest possible frequency state without impacting performance. Because lower frequency states require lower operating voltage they intrinsically use less energy for a given fixed workload. The switching between these P-states (Performance states) is arbitrated by a so-called CPU frequency governor which works within the Linux kernel’s CPUFreq framework.

Google has since Android 4.1 Jellybean standardized the use of the “interactive” CPU governor as a part of Android and the vast amount of devices out there adopt this as the default governor, although vendors may have modifications done to it. The interactive governor is a relatively simple concept: Given a certain sampling time (20ms), it checks the load of the CPU. If the load exceeds the target load on the current frequency, then change to a frequency that would accommodate the current load within the target load threshold. The target load threshold is a parameter which describes how much % of CPU capacity we want the CPU to be at when scaling up to a certain P-state. If the load spikes too fast and much is superior to the target load, then there’s a secondary threshold called the high-speed load threshold which forcefully scales the CPUs to a fixed higher frequency, which in the case of the Exynos 7420 is respectively 900 and 1200MHz for the A53 and A57 cores. If the load has been stable and the newly computed target frequency is consistently aiming lower for 4 sample periods, meaning 80ms, it then scales back frequency to a lower state.

Samsung tries to optimize the Interactive governor to improve big.LITTLE scaling by introducing some new operating modes which alter the configurables of the interactive scaling logic on-the-fly. For example if only a single big CPU passes a load threshold of 95% it enters “single-load” mode which reduces the scaling thresholds for easier increases in frequency and also sets up a quality-of-service minimum frequency request to the small cores. I’m not too sure why they forcefully raise the frequency on the small cores when load is high on the big cores but Samsung must have profiled frequency scaling and decided that this is a beneficial change. Another mode triggered on top of the single-load mode is when the cumulative load across all 4 CPUs exceeds a certain threshold. This multi-load mode again changes scaling parameters by making them more lax and easier to scale up.

These changes had already been implemented in the Exynos 5433 as well but were never effectively used as the parameters remained at their default values and thus representing no improvement in the scaling mechanism. The Galaxy Alpha's 5430 did have the settings correctly set up, but then again Meizu's MX4Pro didn't, meaning we're either seeing an unlikely deliberate design decision, or what I find more likely and reasonable explanation, an oversight on the part of the software teams.

It looks like these modifications are mostly aimed at improving performance and reaction time of the DVFS scaling, and it looks the due to these changes the Exynos 7420 behaves much better in that regard. Samsung’s handling of frequency scaling is generally very good as the governor does well in its task. There are also a large number of QoS (Quality-of-Service) mechanisms by a variety of drivers which are able to instantly request the CPU to transition to a minimum frequency. One example is the screen touch booster: this is an independent scaling mechanism that is able to control the CPU frequency of both clusters as well as to tell the scheduler to force migrations onto the big cores for better reaction time and UI fluidity as soon as the display driver receives an interrupt request from the touch controller. Another scenario would be IP blocks in the media pipeline – blocks such as the 2D composer or the hardware video accelerator are predictable in terms of the required memory bandwidth and CPU capacity, so their drivers will dynamically put performance floors on the device’s DVFS mechanism to guarantee throughput. Samsung goes as far as to also use a QoS system for I/O bandwidth for the NAND, modem and WiFi as well as IPC (Inter-process calls) communications.

Of course beyond DVFS scaling as a power management mechanism all modern devices also offer clock- and power-gating. For the CPU this is again something which is controlled by the kernel within a mechanism called the CPUIdle framework. In the past before hardware had such power-saving mechanisms idling a system usually meant that it was running infinite loops of NOPs (no operation) until it got interrupted to do some actual work. Today instead of running inefficient idle loops, the scheduler calls the CPUIdle governor telling it to do “nothing”. The CPUIdle governor accumulates statistics on how long each idle period is and based on this data is able to choose from a variety of deeper or shallower hardware idle states. On ARM CPUs since the A15/A7 this is mostly consolidated into 3 so-called C-states: a clock-gating state called WFI (Wait-for-interrupt), an individual core power-gating state and a cluster power-gating state.

WFI is an instruction-level and architectural power-management state with extremely low latency that stops the clock to a given CPU. By stopping the clock one avoids dynamic leakage by the CPU, so this is a crucial part of doing “nothing” in mobile CPUs. Individual core power-gating states are able to turn off power to the CPU this way. This is a deeper state as the CPU needs to save its state upon entry and restore it upon waking up. On the 7420 we’re talking about exit latencies of 100µs. Because of the overhead of restoring the CPU state, it’s also not worth to enter these modes for reduced periods of time (called residency time). For the A57 cores this residency threshold is 2000µs and for the A53 cores 750µs. When all cores within a cluster are idle the whole cluster is allowed to be powered down. This of course has larger overhead with larger exit latencies (300µs) and greater minimal residency times (5ms). The cluster power-down is largely used on the big cluster as the small cluster is only allowed to power itself down when the screen is off. A very low-hanging fruit which has finally been picked by Samsung is to have optimized configuration values for each cluster. Previous Samsung SoCs would oddly just use a single driver with the same settings for both clusters, which didn’t make much sense and likely impacted CPU idle efficiency.

The 14nm process seems to have introduced a change in dynamic between the two CPU clusters as the efficiency of each cluster has scaled differently. This has significant impact in the way the GTS scheduler settings are set up as the new chipset’s power efficiency curves are tighter to each-other when compared to the Exynos 5433. To demonstrate this, I took the SPECint2000 scores of each cluster to determine what the IPC difference between the two architectures is and then used this as a ratio to normalize the A57 perf/W curve to the A53’s clocks. On the first set of charts the vertical axis is just an arbitrary normalized value of MHz/mW for the A53 cores, and the A57 curve uses a multiplier ratio of 2.09 to scale the efficiency value and thus represent the IPC increase of the larger architecture.

I’ll get back to actual perf/W charts in just a bit, but first I want to explain why the perf/MHz/W curves are an important metric we can deduce a lot from. Currently the Linux kernel and GTS mechanism sees load on a frequency invariant scale; what this means that if a process takes up 50% of the CPU while it’s running at 500MHz and its maximal scaling frequency is 1GHz, the scheduler will account the task as a 25% load on that CPU. This mechanism is meant to normalize current load to the maximum possible capacity of a CPU, and not just the current one.

The trigger points that determine thread migrations in GTS are called the up- and down-thresholds, which are thresholds on the load scales of the CPUs. For the Exynos 5433 Samsung used 50% and 25% as the up- and down-thresholds. When a thread would exceed 50% of the A53’s capacity it would be migrated over to the big cores, and once on the big core if the task would fall below 25%’s of the CPU’s capacity it would then migrate down. On the 7420 these values are set up slightly differently as Samsung configured the default values at 46.7% and 20.8%. At first I was confused to see such specific values and didn’t fully understand why they were set up as such until I calculated the actual performance/W curves of both CPU cluster.

One will have noticed the arrows I put on the graphs – these represent the theoretical point where a thread should migrate up to the big core, or down to the little cores. For the very attentive readers they will notice that the up threshold arrows aren’t at the mentioned 50 and 46% frequency points of the little cores. This is because the CPU frequency governor should actually be able to scale up frequency faster than the task triggering a scheduler migration by hitting the normalized up-threshold. For example 50% up-threshold of the 5433 would mean a 100% load at 800MHz of the A53 cores, but that will realistically never happen as the CPU will have scaled up to a higher frequency by then. The 5433 governor will try to maintain 10% of idle capacity when scaling to a frequency while the 7420 seeks 25%, meaning the latter has more lax settings which make it scale higher in frequency even though the load doesn’t require it. The result is that the avarage effective performance/capacity point where the little CPUs will try to migrate to the big cores is slightly below 900MHz for the 5433 and just above 1100MHz for the 7420.

For the down-threshold of the big cores the logic is a tad simpler because the scaling-down mechanism of the frequency governor is slower than the scheduler’s migration mechanism. This means that the arrow depicted in the graphs is a minimal value of when a thread will migrate down, and a down-migration might happen anytime at the higher frequencies.

When plotting the efficiency points on an axis depicting the absolute performance of the cores we get a much clearer picture of what big.LITTLE is supposed to achieve. And this is where we see a large difference between the 5433 and 7420: The way the Note 4 is currently set up makes it migrate up threads sooner than compared to the Galaxy S6 and the efficiency degradation when doing so is much greater. An optimal implementation would be a device where the up- and down-migration points would be as close as possible to each other in the efficiency axis while having a slight jump in the performance axis acting as a hysteresis to avoid migrations when a load falls in between the two performance curves.

It seems to me that Samsung paid much better attention to efficiency optimizations on the Exynos 7420’s software as it fixes many of the weird configuration issues of the Note 4 Exynos. The Exynos 7420 joins the Exynos 5430 (And MediaTek’s MT6595 which I’ll hopefully address sometime soon) as one of the rare SoCs which are able to reign in ARM’s big CPU core designs in a small form-factor mobile device and effectively use big.LITTLE without major downsides. While Samsung’s software stack could definitely improve with features such as full energy awareness inside the scheduler, it's no longer as misconfigured and as bad as I decribed it in the Exynos 5433 review.

In terms of maximum power consumption, I think 1.9GHz would have been a slightly more reasonable cap for the A57 cores as the device can on some occasions such as updating many apps or visiting a very heavy ad-ridden site can load up the big CPUs to their full capacity and make the device run a bit hot, but it’s a rare occasion and the vast majority of processing time is spent on the lower frequencies. It will be interesting to see what ARM's A72 processor core will be able to achieve in terms of performance and power efficiency. For 2015 though it seems Samsung's A57 SoC still remains king due to its process node advantage.

CPU Power Consumption GPU & LPDDR4 Performance & Power
POST A COMMENT

114 Comments

View All Comments

  • jjj - Monday, June 29, 2015 - link

    The power doesn't look that great, for the A57 seems to allow 300-350Mhz higher clocks, granted it's not a clean shrink. It looks good here because on 20nm they pushed the clocks way high. Reply
  • name99 - Monday, June 29, 2015 - link

    Insofar as rumors can be believed, the bulk of A9's are scheduled to be produced by Samsung, presumably on this process. It seems strange to have Apple design/layout everything twice for the same CPU, so if these same rumors (30% going to TSMC) are correct, presumably that means the A9X will be on TSMC.

    As for characterizing Apple CPUs, while there are limits to what one can learn (eg in the voltage/power tradeoffs), there is a LOT which can be done but which, to my disappointment, has still not been done. In particular if someone wanted, I think there's scope for learning an awful lot from carefully crafted micro benchmarks. Agner Fog has give a large number of examples of how to do this in the x86 space, while Henry Wong at stuffedcow.net has done the same for a few less obvious parts of the x86 architecture and for GPUs.

    It strikes me as bizarre how little we know about Apple CPUs even after two years.
    The basic numbers (logical registers, window, ROB size) seem to about match Intel these days, and the architecture seems to be 6-wide with two functional clusters. There appears to be a loop buffer (but how large?) But that's about it.
    How well does the branch prediction work and where does it fail?
    What prefetchers are provided? (at I1, D1, L2. L3)
    Do the caches do anything smart (like dead block prediction) for either performance or power?
    Does the memory manager do anything smart (like virtual write queue in the L3)?
    etc etc etc

    Obviously Apple doesn't tell us these. (Nowadays the ONLY company that does is IBM, and only in pay-walled articles in their JRD.) But people write the micro benchmarks to figure this out for Intel and AMD, and I wish the same sort of enthusiasm and community existed in the ARM world.
    Reply
  • SunnyNW - Wednesday, July 01, 2015 - link

    Believe word on the street is the A9 will be Sammy 14nm and the A9X TSM 16nm+ Reply
  • SunnyNW - Wednesday, July 01, 2015 - link

    Please ignore this comment, should have read the rest of the comments before posting since Name99 already alluded to this below. Sorry Reply
  • CiccioB - Monday, June 29, 2015 - link

    Is the heterogeneous processing that allows all 8 cores working together active?
    Seen the numbers of the various bench it seems this feature is not used.
    What I would like to know exactly is that is the bench number of this SoC can be directly compared to SoC with only 4 cores like the incoming Qualcomm Snapdragon 820 based on custom architecture which has "only" 4 cores and not a big.LITTLE configuration.
    Reply
  • Andrei Frumusanu - Monday, June 29, 2015 - link

    HMP is active. Why do you think it seems to be not used? Reply
  • CiccioB - Monday, June 29, 2015 - link

    Because with 8 cores active (or what they should be with HMP) results is not even near 4x the score of a single core.
    So I wonder if those 8 core are really active. And whether they are of any real use if, to keep consumption adequate, frequencies of higher cores get limited.
    Reply
  • Andrei Frumusanu - Monday, June 29, 2015 - link

    All the cores are always active and they do not get limited other than in thermal stress situations. I didn't publish any benchmarks comparing single vs multi-core performance so your assumption must be based on something else. Having X-times the cores doesn't mean you'll have X-times the performance, it completely depends on the application.

    It's still a perfectly valid comparison to look at traditional quad-cores vs bL octa-cores. In the end you're looking at total power and total performance and for use-cases such as PCMark the number of cores used shouldn't be of interest to the user.
    Reply
  • Refuge - Monday, June 29, 2015 - link

    I would hazard a guess that thermal throttling has something to do with part of it. Reply
  • ruturaj1989@gmail.com - Monday, June 29, 2015 - link

    It does have 4 cores but I guess they are in big.LITTLE configuration too. We will see shortly. HMP is active but I am not sure if every bench app uses all the cores. Reply

Log in

Don't have an account? Sign up now