Energy Aware Scheduler

Although big.LITTLE has been around for a few years at this point, it’s still worth going over the basics of big.LITTLE in mobile SoCs. Fundamentally, the smartphone SoC space has benefited greatly from playing catch-up to the PC space. At first, SoCs were on lagging process nodes and CPUs were simple and almost entirely in-order in nature. For the first few years, doubling CPU performance every year was possible by adding additional cores, increasing clock speeds, widening the pipeline, and jumping down a process node.

Once we reached the limit for optimizing in-order architectures, the only way to improve performance in a meaningful way was to focus on avoiding stalls in the CPU pipeline. In an in-order CPU architecture, any missing information for executing an instruction means that the CPU must wait for the information to arrive from DRAM or some other storage. Even if a CPU executes incredibly quickly otherwise, it is stuck waiting on dependencies that can significantly degrade performance.

The solution is to execute operations out-of-order. After all, if you have to build a PC, you don’t sit around waiting a few weeks for a graphics card to arrive before building the rest of the PC. Similarly, modern CPUs execute instructions out-of-order to improve performance and avoid stalls. However, implementing this logic in a CPU is far from a trivial matter, as a CPU has to be designed to know which instructions can be executed out-of-order and which must be executed in order. Even instructions with dependencies that have yet to be resolved can be executed speculatively, which can save a great deal of time if the results of this speculation are used. As a result of this speculative execution and the logic needed to implement out-of-order execution (OoOE), the number of transistors and interconnects grows dramatically, which means power consumption grows dramatically as well.

It is in the context of this fundamental problem that big.LITTLE came to be. While there are multiple solutions to the power problem that comes with OoOE, ARM currently sees big.LITTLE as the best one. Fundamentally, big.LITTLE seeks to use in-order, low-power processors for the vast majority of computing in mobile, but switches tasks to big, out-of-order processors when a task is too much for the little cores to handle. In theory, this seems to be the ideal solution as it makes it possible to retain the power-saving advantages of in-order cores and the performance advantages of big OoOE cores.

Meanwhile, it should be noted that there are other ways of using two heterogeneous CPU clusters, such as cluster migration, which was employed in the first Samsung Exynos bL SoCs and is still employed in NVIDIA's Tegra X1. But for now we will only focus on big.LITTLE HMP operation, which allows all cores to be active and exposed to the operating system. Translating this simple idea into reality is a difficult task. Currently, the de facto solution in the mobile space for big.LITTLE is the ARM- and Linaro-developed Global Task Scheduling (GTS), which relies on a per-entity load tracking (PELT) mechanism with two load thresholds that decide if a process should be migrated to a corresponding cluster.

There’s a significant amount of terminology flying around regarding how this works, and we've covered the mechanism in our Huawei Honor 6 review and in more depth in our recent ARM A57/A53 investigation of the Galaxy Note 4 with the Exynos 5433. To recap, per-entity load tracking is the core mechanism that makes thread placement work in GTS.

This system is designed to track load per task by weighting recent load the greatest and slowly reducing the impact of previous load by a decay factor, which forms a geometric series by default. Unfortunately, this load metric does have some disadvantages. Primarily, if a task idles for a long period of time and then suddenly demands a significant load, as in a race-to-sleep scenario, per-entity load tracking can take a significant amount of time to reach the up-migration threshold needed to move the thread from the little cores to the big cores. Similarly, it can take a significant amount of time for a system with per-entity load tracking to recognize that a task has gone idle and migrate it down to the little cores. This system is also completely unaware of the real-world energy characteristics of the CPU cores, as load is the only real consideration that comes into the scheduler.
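To make the ramp-up delay concrete, here is a minimal Python sketch of a geometrically decayed load metric. It is not kernel code: the 32-period half-life mirrors the value commonly cited for mainline Linux PELT but is treated as an assumption here, and the 0.8 up-migration threshold and the function names are purely illustrative.

```python
# Illustrative sketch of geometrically decayed load tracking (not kernel code).
# Assumption: a half-life of 32 periods, i.e. DECAY ** 32 == 0.5.
HALF_LIFE_PERIODS = 32
DECAY = 0.5 ** (1.0 / HALF_LIFE_PERIODS)


def tracked_load(samples):
    """Return the decayed load after each period.

    `samples` holds the fraction of each period (0.0 to 1.0) during which
    the task was runnable, oldest first.
    """
    load = 0.0
    history = []
    for runnable in samples:
        # Every older contribution shrinks by DECAY each period; the newest
        # period enters with weight (1 - DECAY), so a steady load converges
        # to its own value.
        load = load * DECAY + runnable * (1.0 - DECAY)
        history.append(load)
    return history


def periods_until(threshold, samples):
    """Number of periods before the tracked load first reaches `threshold`."""
    for i, load in enumerate(tracked_load(samples)):
        if load >= threshold:
            return i + 1
    return None


# A task that sat idle for 100 periods and then runs flat out.
IDLE_PERIODS = 100
burst = [0.0] * IDLE_PERIODS + [1.0] * 100

# Hypothetical up-migration threshold of 0.8.
ramp_periods = periods_until(0.8, burst) - IDLE_PERIODS
print(f"periods of full load before crossing the threshold: {ramp_periods}")
```

Under these assumptions the task has to run at full load for dozens of periods before the metric crosses the threshold, which is the lag described above.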

For the Snapdragon 810, Qualcomm has fundamentally done away with per-entity load tracking and uses a window-based system instead. While we weren’t told the size of each window or how many non-idle windows are retained, the load tracking system uses the average load across all of the recent windows while also looking at the recent maximum value to determine if there is a task that suddenly requires a significant amount of CPU power. This means that there’s a much shorter waiting period for core scheduling when a thread goes from mostly or completely idle to a high load, or vice versa.

This scenario is common throughout smartphones and can be as simple as reading a web page and then opening a new link. The average metric over all of the windows is used to determine whether a thread needs to continue to run on a big core, or whether it can be safely moved down to a little core. In addition, this window-based system accounts for cases where cores are throttled from their maximum frequencies, which means that processes may stay on little cores even if the load for a task is high for a little core if it would perform worse on a throttled big core.
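The window-based approach can be sketched roughly as follows. Since the window size and count were not disclosed, the five-window history, both thresholds, and all class and method names are assumptions made for illustration. How the real scheduler combines the two signals is also not described, so the sketch simply exposes them as separate checks.

```python
from collections import deque


class WindowLoadTracker:
    """Toy window-based load tracker; every parameter here is hypothetical."""

    def __init__(self, num_windows=5):
        # Only the most recent `num_windows` samples are retained.
        self.windows = deque(maxlen=num_windows)

    def record(self, busy_fraction):
        """Store the busy fraction (0.0 to 1.0) of the window that just ended."""
        self.windows.append(busy_fraction)

    def average(self):
        """Mean load over the retained windows."""
        if not self.windows:
            return 0.0
        return sum(self.windows) / len(self.windows)

    def recent_max(self):
        """Largest load seen in any retained window."""
        return max(self.windows, default=0.0)

    def wants_big_core(self, up_threshold):
        # The recent maximum reacts after a single busy window, so a sudden
        # burst is visible immediately.
        return self.recent_max() >= up_threshold

    def can_move_to_little(self, down_threshold):
        # The average decides whether sustained demand still justifies a big core.
        return self.average() < down_threshold


tracker = WindowLoadTracker(num_windows=5)
for _ in range(5):
    tracker.record(0.05)   # mostly idle, e.g. reading a page
tracker.record(0.95)       # the user opens a new link
print(tracker.wants_big_core(up_threshold=0.8))
```

In this toy model one busy window is enough to flag the task for a big core, while the average falls below the down threshold within a few idle windows.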

While there are some areas where we can compare and contrast current GTS solutions and Qualcomm’s solution for the Snapdragon 810, there are areas where no comparisons can be made at all. Although ARM is working on an energy model and an energy aware scheduler, we haven’t seen this working in any shipping SoC.

For the Snapdragon 810, there is an energy model for all CPUs that controls for changing power consumption with temperature and can provide a metric of performance per watt at all frequency states. However, unlike ARM’s energy cost model there’s no tracking for the power cost of a task that increases frequency on the cluster (synchronous architectures require that all CPUs run at the same frequency), nor are wake ups tracked and accounted for in energy modeling.
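As a rough illustration of what a temperature-aware performance-per-watt table could look like, consider the sketch below. Every number in it is invented, and the split into dynamic and leakage power, the 2% per degree leakage growth, and the function names are assumptions rather than details of Qualcomm's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqState:
    mhz: int
    perf: float            # relative throughput, arbitrary units
    dynamic_mw: float      # switching power, treated as temperature independent
    leakage_mw_25c: float  # static power at 25 degrees C


# Hypothetical: leakage grows 2% per degree C above 25 C, compounded.
LEAKAGE_GROWTH_PER_C = 0.02


def power_mw(state, temp_c):
    """Total power of a frequency state at the given die temperature."""
    leakage = state.leakage_mw_25c * (1.0 + LEAKAGE_GROWTH_PER_C) ** (temp_c - 25.0)
    return state.dynamic_mw + leakage


def perf_per_watt(state, temp_c):
    """Relative performance delivered per watt at the given temperature."""
    return state.perf / (power_mw(state, temp_c) / 1000.0)


def most_efficient_state(states, required_perf, temp_c):
    """Best perf/W among states fast enough for the demand, or None."""
    candidates = [s for s in states if s.perf >= required_perf]
    if not candidates:
        return None
    return max(candidates, key=lambda s: perf_per_watt(s, temp_c))


# Made-up table for a single big core.
BIG_CORE_STATES = [
    FreqState(mhz=800, perf=1.0, dynamic_mw=300.0, leakage_mw_25c=50.0),
    FreqState(mhz=1500, perf=1.8, dynamic_mw=900.0, leakage_mw_25c=80.0),
    FreqState(mhz=2000, perf=2.3, dynamic_mw=1800.0, leakage_mw_25c=120.0),
]

for state in BIG_CORE_STATES:
    print(state.mhz, round(perf_per_watt(state, 25.0), 2), round(perf_per_watt(state, 85.0), 2))
```

In this toy table, efficiency drops at every frequency state as the die heats up, which is the kind of shift a temperature-aware model lets the scheduler see.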

To be fair, there are a lot of aspects that are shared with the latest GTS mechanism, such as packing small tasks onto already awake CPUs in order to avoid the cost of waking a CPU from a power collapse state. However, on the Snapdragon 810 there are evaluations throughout the execution of a task, so that the task can be moved to a big core if its load increases from when it first started, or moved from one big core to another depending upon the perf/W of each big core. In addition, if a single core is running a small load or task, the scheduler can move the thread to another core and allow the original core to go to sleep and save power. The scheduler is also said to be aware of the power cost of migration.

Finally, the scheduler in the Snapdragon 810 is used to help guide the CPU frequency governor policy by notifying the frequency governor appropriately to avoid cases where a task is migrated to another CPU and causes inefficient behavior. For example, if a task is at 100% load and is migrated in the middle of a sampling window to another core, the original core isn’t kept at an unnecessarily high frequency, and the core that the task was migrated to will go to the right frequency for the aggregate load of the task. This appears to be somewhat of a mitigation for the window-based system, as ARM’s scheduler uses events to handle these issues without having to resort to these patchwork fixes.

In terms of how the power arbitration is actually implemented compared to traditional power management mechanisms in existing SoCs, Qualcomm replaces the old Intel-developed CPUIdle "Menu" and "Ladder" governors. These worked based on the achieved and target residency time of the individual idle states of a CPU core. Qualcomm's solution is a completely new approach (called the Low-Power-Mode CPUIdle driver, or LPM) as it ignores the time characteristic in its entirety and looks only at energy modeling. For this, the SoC's drivers need to have precise arbitration data to be able to properly model the SoC's real power consumption without actually measuring it. Thankfully Qualcomm does this, and it's the most complete model of a commercially available SoC's power characteristics to date.
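The difference between the two selection styles can be sketched as below. The state table is invented, and the article does not say how LPM turns its energy data into a decision. This sketch assumes each state carries a fixed entry/exit energy overhead plus a steady power draw, and it still needs an expected sleep length to compare them. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    power_mw: float           # steady power while resident in the state
    overhead_uj: float        # energy spent entering and leaving the state
    target_residency_us: int  # minimum stay for the state to be worthwhile
    exit_latency_us: int      # time needed to wake up


def pick_by_residency(states, predicted_idle_us, latency_limit_us):
    """Residency-style choice: deepest state whose tables allow it.

    `states` must be ordered from shallowest to deepest.
    """
    chosen = states[0]
    for s in states:
        if (s.target_residency_us <= predicted_idle_us
                and s.exit_latency_us <= latency_limit_us):
            chosen = s
    return chosen


def energy_uj(state, idle_us):
    """Modelled energy for one idle period (mW * us = nJ, hence / 1000)."""
    return state.overhead_uj + state.power_mw * idle_us / 1000.0


def pick_by_energy(states, predicted_idle_us, latency_limit_us):
    """Energy-style choice: cheapest modelled energy among allowed states."""
    allowed = [s for s in states if s.exit_latency_us <= latency_limit_us]
    if not allowed:
        allowed = [states[0]]
    return min(allowed, key=lambda s: energy_uj(s, predicted_idle_us))


# Made-up idle states for a single core, shallowest first.
STATES = [
    IdleState("wfi", 50.0, 0.0, 1, 1),
    IdleState("retention", 20.0, 10.0, 500, 100),
    IdleState("power-collapse", 2.0, 100.0, 5000, 1500),
]

print(pick_by_residency(STATES, 6000, 2000).name)
print(pick_by_energy(STATES, 6000, 2000).name)
```

With consistent tables the two approaches agree. The energy-based version simply derives the break-even point from the model instead of relying on a hand-tuned residency value.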

We not only see the energy models for the various CPU and cluster idle states, but also the idle states of the CCI, something which is lacking in GTS's software stack.

Ultimately, while it’s clear that Qualcomm’s solution to the big.LITTLE problem has its inefficiencies, their solution appears to be far superior to anything else with big.LITTLE on the market. And as previously mentioned in our Note 4 Exynos review, ARM’s energy aware scheduler is still far from implementation on a shipping SoC. This issue is only compounded by ARM’s need to make a solution that works for all big.LITTLE SoCs, and OEM adoption is often slow in these scenarios. While the Snapdragon 810 could be behind other SoCs in process technology, advantages in areas such as the thread scheduler could narrow the gap.


119 Comments


  • JoshHo - Thursday, February 12, 2015 - link

    Comparing the Snapdragon 810 to the Exynos 5433 wouldn't be of much value as the S810 won't be competing with the Exynos 5433 in flagship 2015 devices. We hope to make a valid comparison to an Exynos SoC in the near future.
  • warreo - Thursday, February 12, 2015 - link

    I disagree. This article is already primarily a comparison of S810 and S805, which like the Exynos will obviously not be competing for flagship 2015 devices. Does that make the comparison invalid? No, it's just a matter of context. People know that Exynos 5433 is an older SoC, but it's still interesting to see how S810 compares to it, just like it is interesting to see how it compares to S805.

    In reading this article the most interesting takeaways that I got are that on the CPU side, S810 is in a dead heat with Exynos (or barely outperforms it), and on the GPU side, there was a more substantial outperformance (call it 20-25%) vs. Exynos. The sad thing is that I had to draw that conclusion myself, because it was barely addressed in the article.

    As an aside, can someone please learn me on how this performance is considered good when Exynos 7420 is right around the corner? Am I missing something?
  • Andrei Frumusanu - Thursday, February 12, 2015 - link

    The vast majority of users will want to compare performance to the S805, seeing as the 5433 is only found in one variant of the Note 4 and probably won't be seeing any other implementation.

    As for your last point, we just can't comment on performance of unreleased products.
  • warreo - Thursday, February 12, 2015 - link

    Unfortunately, I still disagree. While most people will never use the 5433 because it is limited to the Note 4, it is still a relevant comparison because the S6 will use the 7420, which is the next iteration of the 5433. Lest we forget, there are (likely) millions of people who will buy or consider buying the S6, making the comparison against 5433 an early preview of 7420 vs. S810 which I'm willing to bet is HIGHLY interesting to readers of this site.

    No direct comments on the unreleased 7420 would be necessary, just a more in-depth discussion on how the S810 fares against 5433 would be helpful and let readers extrapolate to 7420 themselves. The reality is, the data and benchmarks are all there, I'm just a bit mystified why it is apparently not worth the effort to add the % difference into the table and discuss those results more in the text of the article.

    I'll make this my last comment on the matter as you've at least shown you've thought about the matter and had a reason as to why you didn't really discuss Exynos. I remain cynical as to whether this is a good reason (or even the true reason), but I do at least appreciate the responses. I hope you'll take my comments in the spirit in which they were given: constructive criticism to improve the quality of the article.
  • lopri - Thursday, February 12, 2015 - link

    Millions of people considering the S6 will want to know how S810 (or rather Exynos 7420) performs compared to S600/S800/S801, because those are the platforms they are currently using. Millions of people also do not have access to the Exynos 5433 Note 4, and will not be upgrading from or to it. It would be akin to comparing some obscure Xeon CPU to widely popular Core i5 CPU.

    I fully expect there will be a comparison between S810 and whatever else it competes against in due time.
  • Jumangi - Friday, February 13, 2015 - link

    Millions? Let's be real here. 99%+ of the people who go out to buy the next Galaxy phone or any smartphone for that matter won't have the slightest clue of the SoC in the thing.
  • tdslam720 - Thursday, February 12, 2015 - link

    Way to miss out on all the hype. Take some hints from Pro Wrestling or UFC. Samsung vs Qualcomm is the hype right now. Exynos vs 810 . You claim people want to see 810 vs 805, no one cares about that. Give us 810 vs Exynos and get tons more ad money while maintaining your credibility. Right now it just looks like Qualcomm is influencing you to play nice.
  • melgross - Thursday, February 12, 2015 - link

    You say no one cares about that, but that's just you saying that. Samsung doesn't sell a whole lot of Notes, particularly to the number of devices Qualcomm sells into.
  • tdslam720 - Thursday, February 12, 2015 - link

    No but they'll sell millions of S6s which is basically the same chip
  • blzd - Thursday, February 12, 2015 - link

    We should care about a CPU in a phone that we will never use? Because the next iteration perhaps we will be able to use? Um no.

    S800 vs S810 is what I want to know personally.
