Energy Aware Scheduler

Although big.LITTLE has been around for a few years at this point, it’s still worth going over the basics of big.LITTLE in mobile SoCs. Fundamentally, the smartphone SoC space has benefited greatly from playing catch-up to the PC space. At first, SoCs were on lagging process nodes and CPUs were simple and almost entirely in-order in nature. For the first few years, doubling CPU performance every year was possible by adding additional cores, increasing clock speeds, widening the pipeline, and jumping down a process node.

Once we reached the limit for optimizing in-order architectures, the only way to improve performance in a meaningful way was to focus on avoiding stalls in the CPU pipeline. In an in-order CPU architecture, any missing information for executing an instruction means that the CPU must wait for the information to arrive from DRAM or some other storage. Even if a CPU executes incredibly quickly otherwise, it is stuck waiting on dependencies that can significantly degrade performance.

The solution is to execute operations out-of-order. After all, if you have to build a PC, you don’t sit around waiting a few weeks for a graphics card to arrive before building the rest of the PC. Similarly, modern CPUs execute instructions out-of-order to improve performance and avoid stalls. However, implementing this logic in a CPU is far from a trivial matter, as a CPU has to be designed to know which instructions can be executed out-of-order and which must be executed in order. Even instructions with dependencies that have yet to be resolved can be executed speculatively, which can save a great deal of time if the results of this speculation are used. As a result of this speculative execution and the logic needed to implement out-of-order execution (OoOE), the number of transistors and interconnects grows dramatically, which means power consumption grows dramatically as well.

It is in the context of this fundamental problem that big.LITTLE came to be. While there are multiple ways to solve the power problem that comes with OoOE, ARM currently sees big.LITTLE as the best solution. Fundamentally, big.LITTLE seeks to use in-order, low-power processors for the vast majority of computing in mobile, but switches tasks to big, out-of-order processors when a task is too much for the little cores to handle. In theory, this seems to be the ideal solution, as it makes it possible to retain the power-saving advantages of in-order cores and the performance advantages of big OoOE cores.

Meanwhile it should be noted that there are other ways of using two heterogeneous CPU clusters, such as cluster migration, which was employed in the first Samsung Exynos bL SoCs and is still employed in NVIDIA's Tegra X1. But for now we will only focus on big.LITTLE HMP operation, which allows all cores to be active and exposed to the operating system. To translate this simple idea into reality is a difficult task. Currently, the de-facto solution in the mobile space for big.LITTLE is the ARM and Linaro developed Global Task Scheduling, which relies on a per-entity load tracking (PELT) mechanism with two load thresholds that decide if a process should be migrated to a corresponding cluster.
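The two-threshold decision described above can be sketched in a few lines. This is only an illustration of the idea, not the GTS source: the function name `next_cluster` and the threshold values 0.7 and 0.25 are hypothetical, and load is assumed to be normalized to the range 0 to 1.

```python
def next_cluster(current, load, up_threshold=0.7, down_threshold=0.25):
    """Pick the cluster for a task from its tracked load (0.0 to 1.0).

    The threshold values are illustrative assumptions, not vendor numbers.
    """
    if current == "little" and load > up_threshold:
        return "big"      # load crossed the up threshold: migrate up
    if current == "big" and load < down_threshold:
        return "little"   # load fell below the down threshold: migrate down
    return current        # between the thresholds: stay put (hysteresis)


print(next_cluster("little", 0.8))  # big
print(next_cluster("big", 0.5))     # big
```

Keeping the two thresholds apart means a task whose load hovers around a single value does not bounce back and forth between clusters.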

There’s a significant amount of terminology flying around regarding how this works, and we've covered the mechanic in our Huawei Honor 6 review and more in depth in our recent ARM A57/A53 investigation of the Galaxy Note 4 with the Exynos 5433. To recap, per-entity load tracking is the core mechanism that makes thread placement work in GTS.

This system is designed to track load per task by weighting recent load the greatest, and slowly reducing the impact of previous load by a decay factor, which is a geometric series by default. Unfortunately, this load metric does have some disadvantages. Primarily, if a task idles for a long period of time and then suddenly demands a significant load, as in a race-to-sleep scenario, per-entity load tracking can take a significant amount of time to reach the up-migration threshold needed to move a thread from the little cores to the big cores. Similarly, it can take a significant amount of time for per-entity load tracking to recognize that a task has gone idle and migrate it down to the little cores. This system is also completely unaware of the real-world energy characteristics of the CPU cores, as load is the only real consideration that comes into the scheduler.
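To make the ramp-up delay concrete, here is a simplified sketch of a geometrically decayed load average. It assumes the mainline Linux PELT convention of roughly 1 ms accounting periods with a decay factor chosen so that load from 32 periods ago counts half as much; shipping GTS kernels may tune these values differently. The normalization to a 0 to 1 range, the helper names, and the 0.7 threshold are our own assumptions for illustration.

```python
# Decay factor y chosen so that y**32 == 0.5 (mainline PELT convention).
DECAY = 0.5 ** (1 / 32)


def update(load, running):
    """Advance the decayed average by one period.

    `running` is the fraction of the period the task was runnable (0.0 to 1.0).
    """
    return load * DECAY + running * (1.0 - DECAY)


def periods_to_reach(threshold, start=0.0, running=1.0, max_periods=10_000):
    """Count periods until the tracked load first reaches `threshold`.

    Returns None if the threshold is not reached within `max_periods`.
    """
    load = start
    for n in range(max_periods + 1):
        if load >= threshold:
            return n
        load = update(load, running)
    return None


print(periods_to_reach(0.7))  # 56
```

Under these assumptions, a task that wakes from a long idle and then runs flat out needs 56 periods, on the order of 56 ms, before its tracked load crosses a 0.7 up threshold.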

For the Snapdragon 810, Qualcomm has fundamentally done away with per-entity load tracking, and uses a window-based system instead. While we weren’t told the size of each window or how many non-idle windows were retained, the load tracking system uses the average of load across all of the recent windows while also looking at the recent maximum value to determine if there is a task that suddenly requires a significant amount of CPU power. This means that there’s a much shorter waiting period for core scheduling when a thread goes from mostly or completely idle to a high load, or vice versa.

This scenario is common throughout smartphones and can be as simple as reading a web page and then opening a new link. The average metric over all of the windows is used to determine whether a thread needs to continue to run on a big core, or whether it can be safely moved down to a little core. In addition, this window-based system accounts for cases where cores are throttled from their maximum frequencies, which means that a task may stay on a little core, even if its load is high for a little core, if it would perform worse on a throttled big core.
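A window-based tracker of the kind described above might look like the following sketch. Since we weren't told the window size or count, the five-window history is a placeholder, and the class name `WindowTracker` and the choice to take the larger of the average and the most recent window as the up-migration signal are our assumptions rather than Qualcomm's implementation.

```python
from collections import deque


class WindowTracker:
    """Track per-task load as busy fractions over fixed-size time windows."""

    def __init__(self, num_windows=5):  # window count is a hypothetical value
        self.windows = deque(maxlen=num_windows)

    def record(self, busy_fraction):
        """Store the busy fraction (0.0 to 1.0) of the window that just ended."""
        self.windows.append(busy_fraction)

    def average(self):
        """Mean over the retained windows; used here as the stay-or-move-down signal."""
        return sum(self.windows) / len(self.windows) if self.windows else 0.0

    def demand(self):
        """Larger of the average and the latest window, so a sudden burst shows up at once."""
        if not self.windows:
            return 0.0
        return max(self.average(), self.windows[-1])


tracker = WindowTracker()
for busy in [0.0, 0.0, 0.0, 0.0, 1.0]:  # long idle, then one fully busy window
    tracker.record(busy)
print(tracker.average(), tracker.demand())  # 0.2 1.0
```

After a single busy window the demand signal already reads 1.0 even though the average is still 0.2, which is the behavior that shortens the wait before a thread is moved up.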

While there are some areas where we can compare and contrast current GTS solutions and Qualcomm’s solution for the Snapdragon 810, there are areas where no comparisons can be made at all. Although ARM is working on an energy model and an energy aware scheduler, we haven’t seen this working in any shipping SoC.

For the Snapdragon 810, there is an energy model for all CPUs that controls for changing power consumption with temperature and can provide a metric of performance per watt at all frequency states. However, unlike ARM’s energy cost model there’s no tracking for the power cost of a task that increases frequency on the cluster (synchronous architectures require that all CPUs run at the same frequency), nor are wake ups tracked and accounted for in energy modeling.
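As a rough illustration of how a per-frequency energy model can feed placement decisions, the sketch below stores a relative capacity and a modeled power figure for each frequency state and picks the lowest-power state that still satisfies a task's demand. Every number in the table is invented for the example, the names `FreqState`, `STATES`, and `place` are hypothetical, and the temperature dependence mentioned above is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqState:
    cluster: str     # "little" or "big"
    mhz: int         # operating frequency
    capacity: float  # relative performance, arbitrary units
    watts: float     # modeled power at this state

    @property
    def perf_per_watt(self):
        return self.capacity / self.watts


# Illustrative values only; these are not measurements of any real SoC.
STATES = [
    FreqState("little", 400, 100, 0.05),
    FreqState("little", 1000, 250, 0.15),
    FreqState("little", 1500, 375, 0.30),
    FreqState("big", 600, 300, 0.25),
    FreqState("big", 1200, 600, 0.70),
    FreqState("big", 2000, 1000, 1.80),
]


def place(demand, states=STATES):
    """Return the lowest-power state whose capacity covers `demand`."""
    fitting = [s for s in states if s.capacity >= demand]
    if not fitting:
        # Nothing is fast enough: fall back to the highest-capacity state.
        return max(states, key=lambda s: s.capacity)
    return min(fitting, key=lambda s: s.watts)


choice = place(350)
print(choice.cluster, choice.mhz)  # little 1500
```

With this made-up table, a demand of 350 stays on the little cluster at its top frequency, while a demand of 500 has to move to a big core at 1200 MHz.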

To be fair, there are a lot of aspects that are shared with the latest GTS mechanism, such as packing small tasks onto already awake CPUs in order to avoid the cost of waking a CPU from a power collapse state. However, on the Snapdragon 810 there are evaluations throughout the execution of a task to be able to move the task to a big core if its load increases from when it first started, or to move a thread from one big core to another depending upon the perf/W for each big core. In addition, if a single core is running a small load or task, the scheduler can move the thread to another core and allow the original core to go to sleep and save power. The scheduler is also said to be aware of the power cost of migration.
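The small-task packing idea can be sketched as a simple placement rule: prefer a CPU that is already awake and has room for the task, and only wake a sleeping CPU when nothing else fits. The function name `pick_cpu`, the dictionary layout, and the choice to pack onto the busiest awake CPU that still fits are our assumptions; neither the GTS nor the Qualcomm scheduler is claimed to work exactly this way.

```python
def pick_cpu(cpus, task_load, capacity=1.0):
    """Choose a CPU id for a new task.

    `cpus` is a list of dicts with keys "id", "awake" and "load" (0.0 to 1.0).
    """
    # 1. Pack onto the busiest awake CPU that still has room for the task.
    fitting = [c for c in cpus if c["awake"] and c["load"] + task_load <= capacity]
    if fitting:
        return max(fitting, key=lambda c: c["load"])["id"]
    # 2. Otherwise pay the wake-up cost and use a sleeping CPU.
    asleep = [c for c in cpus if not c["awake"]]
    if asleep:
        return asleep[0]["id"]
    # 3. Everything is awake and full: fall back to the least loaded CPU.
    return min(cpus, key=lambda c: c["load"])["id"]


cpus = [
    {"id": 0, "awake": True, "load": 0.6},
    {"id": 1, "awake": True, "load": 0.2},
    {"id": 2, "awake": False, "load": 0.0},
]
print(pick_cpu(cpus, 0.1))  # 0
print(pick_cpu(cpus, 0.9))  # 2
```

A small task lands on CPU 0, which is already awake, while a task too large for either awake CPU is the only case that brings CPU 2 out of its sleep state.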

Finally, the scheduler in Snapdragon 810 is used to help guide the CPU frequency governor policy by notifying the frequency governor appropriately to avoid cases where a task is migrated to another CPU and causes inefficient behavior. For example, if a task is at 100% load and is migrated in the middle of a sampling window to another core, the original core isn’t kept at an unnecessarily high frequency, and the core that the task was migrated to will go to the right frequency for the aggregate load of the task. This appears to be somewhat of a mitigation for the window-based system, as ARM’s scheduler uses events to handle these issues without having to resort to these patchwork fixes.

In terms of how the power arbitration is actually implemented compared to traditional power management mechanisms in existing SoCs, Qualcomm replaces the old Intel-developed CPUIdle "Menu" and "Ladder" governors. These worked based on the achieved and target residency time of the individual idle states of a CPU core. Qualcomm's solution is a completely new approach (called the Low-Power-Mode CPUIdle driver, or LPM) as it ignores the time characteristic in its entirety and looks only at energy modeling. For this, the SoC's drivers need to have precise arbitration data to be able to properly model the SoC's real power consumption without actually measuring it. Thankfully Qualcomm does this, and it's the most complete model of a commercially available SoC's power characteristics to date.

We not only see the energy models for the various CPU and cluster idle states, but also the idle states of the CCI, something which is lacking in GTS's software stack.

Ultimately, while it’s clear that Qualcomm’s solution to the big.LITTLE problem has its inefficiencies, their solution appears to be far superior to anything else with big.LITTLE on the market. And as previously mentioned in our Note 4 Exynos review, ARM’s energy aware scheduler is still far from implementation on a shipping SoC. This issue is only compounded by ARM’s need to make a solution that works for all big.LITTLE SoCs, and OEM adoption is often slow in these scenarios. While the Snapdragon 810 could be behind other SoCs in process technology, advantages in areas such as the thread scheduler could narrow the gap.


119 Comments

  • PC Perv - Saturday, February 14, 2015 - link

    Not an accurate description of the state of affairs. It is because Apple has the power over the carriers that other OEMs lack. I wish Congress can intervene in the situation and rein in on the carriers. That will not only benefit the U.S. consumers but also potentially influence the world market.

    Absolutely not "because Apple spend more money and Android OEMs do not want to spend money"
  • name99 - Friday, February 13, 2015 - link

    "While there are multiple solutions to solving the power problem that comes with OoOE, ARM currently sees big.LITTLE as the best solution. "

    I can't help but think (based on all the evidence we've seen so far) that big.LITTLE is the VLIW of low energy CPUs. Just like VLIW would be totally awesome if we could only solve those pesky compiler issues (which are just out of reach, but maybe next year...), so big.LITTLE would be awesome if we could only solve those pesky scheduler issues (which will, likewise, maybe be solved next year...)

    It's nice that QC claim they have a better scheduler; it would be even nicer if they were confident enough about it to provide actual power/energy NUMBERS...
  • TT Masterzz - Sunday, February 15, 2015 - link

    Amazing article. Although to be frank I hardly understood the antenna part. It would be amazing if the authors at Anand Tech make an article explaining the RF system/modems/naming scheme and baseband processors in depth. Also an article explaining some terms like CPU pipeline length/branch mispredict would be amazing.
  • Laststop311 - Friday, February 20, 2015 - link

    All this is telling me is that it can barely beat last generation Exynos. The Exynos 7 most likely stomps this in performance, which is why Samsung had to go with it for all countries. People would be too mad if only S Korea got the super fast Exynos 7 and everyone else got the slower 810. Before, Snapdragon had the slight performance edge, but it looks like Exynos may finally be the better chip.

    That is, until Qualcomm busts out their custom-made 64-bit Krait that just wasn't ready in time, so they had to use standard ARM cores to get 64-bit to market faster. Custom Krait 64, whatever they call it, Krait 500 or something, will most likely beat Exynos again.
  • Zingam - Wednesday, February 25, 2015 - link

    Will these be DX12, OpenGL Next compatible, or will we have to wait for another 5 years for sufficient market penetration.
  • Keermalec - Saturday, March 21, 2015 - link

    So The 1-year old Nvidia K1 trounces the yet to arrive snapdragon 810...
    And yes, LTE is not integrated into the K1 in order for OEMS to have a choice between wifi or wifi+LTE tablet versions. Nvidia CAN integrate LTE in the SoC as they have done with the Tegra 4i. It was just sound business practice not to do so with the more powerful chip.
  • radeonex - Saturday, April 11, 2015 - link

    I want to point out that for linear amplifier circuits, most of the transistors operate in the saturation region (they do not act as switches but rather as voltage-controlled current sources). The high electron mobility helps with transconductance and other characteristics, especially in the context of combating short-channel effects (which helps smaller devices). It also helps to reduce the minimum voltage drop required to keep the saturated transistors in the correct region of operation.
  • Ning3n - Monday, July 27, 2015 - link

    To give a "joe sixpack" review of the 810. I recently replaced my HTC M7 with an M9....

    As far as I've seen/noticed, the 810 (combined with the 430 GPU), is *ROUGHLY* 15-20% faster than the 600 series I've upgraded from.

    Gaming performance (for a cellular device) is great! But, it took over an hour to encrypt just under 5Gb of mp3s, and 1.5Gb of pictures.

    Hardly a "phenomenal" improvement.
  • b.akhil96 - Tuesday, June 21, 2016 - link

    How do you categorize the loads ? max(avg,recent) policy when loads are categorized as peak or non peak . what would be an ideal policy to be applied on Moderate loads. (similar to max(avg,recent) )
