Over the last 5 years the mobile space has seen a dramatic change in terms of performance of smartphone and tablet SoCs. The industry has seen a move from single-core to dual-core to quad-core processors to today’s heterogeneous 6-10 core designs. This was a natural evolution similar to what the PC space has seen in the last decade, but only in a much more accelerated pace. While ILP (Instruction-level parallelism) has certainly also gone up with each new processor architecture, with designs such as ARM’s Cortex A15 or Apple’s Cyclone processor cores brining significant single-threaded performance boosts, it’s the increase of CPU cores that has brought the most simple way of increasing overall computing power.

This increasing of CPU cores brought up many discussions about just how much sense such designs make in real-world usages. I can still remember when the first quad-cores were introduced that users were arguing the benefit of 4 cores in mobile workloads and that these increases were just done for the sake of marketing. I can draw parallels between those discussions from a few years ago and today’s arguments about 6 to 10-core SoCs based on big.LITTLE.

While there have been some attempts to analyse the core-count debate, I was never really satisfied with the methodology and results of these pieces. The existing tools for monitoring CPUs just don’t cut it when it comes to accurately analysing the fine-grained events that dictate the management of multi-core and heterogeneous CPUs. To try to finally have a proper analysis of the situation, for this article, I’ve tried to approach this issue from the ground up in an orderly and correct manner, and not relying on any third-party tools.

Methodology Explained

I should start with a disclaimer that because the tools required for such an analysis rely heavily on the Linux kernel, that this analysis is constrained to the behaviour of Android devices and doesn't necessarily represent the behaviour of devices on other operating systems, in particular Apple's iOS. As such, any comparisons between such SoCs should be limited to purely to theoretical scenarios where a given CPU configuration would be running Android.

The Basics: Frequency

Traditionally when wanting to log what the CPU is doing, most users would think of looking at the frequency which it is currently running at. Usually this gives a rough idea to see if there is some load on the CPU and when it kicks into high gear. The issue with this is the way one captures the frequency: the readout sample will always be a single discrete value at a given point in time. To be able to accurately get a good representation of the frequency one would need to have a sample rate of at least twice as fast as the CPU’s DVFS mechanism. Mobile SoCs now can switch frequency at intervals of down to 10-20ms, and even have unpredictable finer-grained switches which can be caused by QoS (Quality of Service) requests.

Sampling at anything under half the DVFS switching speeds can lead to inaccurate data. For example this can happen in periodic short high bursts. Take a given sample rate of 1s: Imagine that we read frequency out at 0.1s and 1.1s in time. Frequency at both these readouts would be either at a high or low frequency. What happens in-between though is not captured, and due to the switching speed being so high, we can miss out on 90%+ of the true frequency behaviour of the CPU.

Instead of going the route of logging the discrete frequency at a very high rate, we can do something far more accurate: Log the cumulative residency time for each frequency on each readout. Since Android devices run on the Linux kernel, we have easy access to this statistic provided by the CPUFreq framework. The time-in-state statistics are always accurate because they are incremented by the kernel driver asynchronously at each frequency change. So by calculating the deltas between each readout, we end up with an accurate frequency distribution within the period between our readouts.

What we end up is a stacked time distribution graph such as this:

The Y-axis of the graph is a stacked percentage of each CPU’s frequency state. The X-axis represents the distribution in time, always depending on the scenario’s length. For readability’s sake in this article, I chose an effective ~200ms sample period (Due to overhead on scripting and time-keeping mechanisms, this is just a rough target) which should give enough resolution for a good graphical representation of the CPU’s frequency behaviour.

With this, we now have the first part of our tools to accurately analyse the SoC’s behaviour: frequency.

The Details: Power States

While frequency is one of the first metrics that comes to mind when trying to monitor a CPU’s behaviour, there’s a whole other hidden layer that rarely gets exposure: CPU idle states. For readers looking for a more in-depth explanation of how CPUIdle works, I’ve touched upon it and power management of modern SoCs in general work in our deep dive of the Exynos 7420. These explanations are valid for basically all of today's SoCs based on ARM CPU IP, so it applies to SoCs from MediaTek and ARM-based Qualcomm chipsets as well.

To keep things short, a simplified explanation is that beyond frequency, modern CPUs are able to save power by entering idle states that either turn off the clock or the power to the individual CPU cores. At this point we’re talking about switching times of ~500µs to +5ms. It is rare to find SoC vendors expose APIs for live readout of the power states of the CPUs, so this is a statistic one couldn’t even realistically log via discrete readouts. Luckily CPU idle states are still arbitrated by the kernel, which again, similarly to the CPUFreq framework, provides us aggregate time-in-state statistics for each power state on each CPU.

This is an important distinction to make in today’s ARM CPU cores as (except for Qualcomm’s Krait architecture) all CPUs within a cluster run on the same synchronous frequency plane. So while one CPU can be reported to be running at a high frequency, this doesn’t really tell us what it’s doing and could as well be fully power-gated while sitting idle.

Using the same method as for frequency logging, we end up with an idle power-state stacked time-distribution graph for all cores within a cluster. I’ve labelled the states as “Clock-gated”, “Power-gated” and “Active” which in technical terms they represent the WFI (Wait-For-Interrupt) C1, power-collapse C2 idle states, as well as the difference in time to the wall-clock which represents the “active” time in which the CPU isn’t in any power-saving state.

The Intricacies: Scheduler Run-Queue Depths

One metric I don’t think that was ever discussed in the context of mobile is the depth of the CPU’s run-queue. In the Linux kernel scheduler the run-queue is a list of processes (The actual implementation involves a red-black tree) currently residing on that CPU. This is at the core of the preemptive scheduling nature of the CFS (Completely Fair Scheduler) process scheduler in the Linux kernel. When multiple processes run on the same CPU the scheduler is in charge to fairly distribute processing time between each thread based on time-slices and process priority.

The kernel and Android are able to sort of expose information on the run-queue through one of the kernel’s sysfs nodes. On Android this can be enabled through the “Show CPU Usage” option in the developer options. This gives you three numerical parameters as well as a list of the read-out active processes. The numerical value is the so-called “load average” of the scheduler. It represents the load of the whole system – and it can be used to read how many threads in a system are used. The three values represent averages for different time-windows: 1 minute, 5 minutes and 15 minutes. The actual value is a percentage – so for example 2.85 represents 285%. How this is meant to be interpreted is that if we were to consolidate all processes in as little CPUs as possible we theoretically have two CPUs whose load is 100% (summing up to 200%) as well as a third up to 85% load.

Now this is very odd, how can the phone be fully using almost 3 cores while I was doing nothing more than idling on the screen with the CPU statistics on? Sadly the kernel scheduler suffers from the same sampling rate issue as explained in our frequency logging methodology. Truth is that the load average statistic is only a snapshot of the scheduler’s run-queues which is updated only in 5-second intervals and the represented value is a calculated load based on the time between snapshots. Unfortunately this statistic is extremely misleading and in no way represents the actual situation of the run-queues. On Qualcomm devices this statistic is even more misleading as it can show load-averages of up to 12 in idle situations. Ultimately, this means it’s basically impossible to get accurate RQ-depth statistics on stock devices.

Luckily, I stumbled upon the same issue a few years ago and was aware of a patch that I previously used in the past and which was authored by Nvidia which introduces detailed rq-depth statistics. This tracks the run-queues accurately and atomically each time a process enters or leaves a run-queue, enabling it to expose a sliding-window average of the run-queue depth of each CPU over the period of 134ms.

Now we have a live pollable average for the scheduler’s run-queues and we can fully log the exact amount of threads run on the system.

Again, the X-axis throughout the graphs represent the time in milliseconds. This time the Y-axis represents the rq-depth of each CPU. I also included the sum of the rq-depths of all CPUs in a cluster as well the sum of both clusters for the system total in a separate graph.

The values can be interpreted similarly to the load-average metrics, only this time we have a separate value for each CPU. A run-queue depth of 1 means the CPU is loaded 100% of the time, 0.2 means the CPU is loaded by only 20%. Now the interesting metric comes for values above 1: For anything above a rq-depth of 1 it means that the CPU is preempting between multiple processes which cumulatively exceed the processing power of that CPU. For example in the above graph we have some per-CPU peaks of ~2. It means the CPU has at least two threads on that CPU and they each share 50% of the compute-time of that CPU, i.e. they’re running at half speed.

The Data And The Goals

On the following pages we’ll have a look at about 20 different real-world often encountered use-cases where we monitor CPU frequency, power states and scheduler run-queues. What we are looking for specifically is the run-queue depth spikes for each scenario to see just how many threads are spawned during the various scenarios.

The tests are run on Samsung's Galaxy S6 with the Exynos 7420 (4x Cortex A57 @ 2.1GHz + 4x Cortex A53 @ 1.5GHz) which should serve well as a representation of similar flagship devices sold in 2015 and beyond.

Depending on the use-cases, we'll see just how many of the cores on today's many-core big.LITTLE systems are used. Together with having power management data on both clusters, we'll also see just how much sense heterogeneous processing makes and just how much benefit one can gain from it.

Browser: S-Browser - AnandTech Article
Comments Locked

157 Comments

View All Comments

  • rstuart - Tuesday, September 1, 2015 - link

    Wow, excellent article. Colour me impressed that the developers use 4 cores effectively more times than not. It was not what I was expecting. Nor did I realise how much of the video processing task was offloaded to the GPU. In fact it's so good I suspect there will be more than a few electrical engineers poring over this in order to understand how well their software brethren make use of the hardware they provide.
  • Filiprino - Tuesday, September 1, 2015 - link

    Are you sure the Galaxy S6 employs the CFS scheduler? Should not it be the GTS scheduler?
  • Andrei Frumusanu - Tuesday, September 1, 2015 - link

    GTS is just an extension on top of CFS.
  • Filiprino - Wednesday, September 2, 2015 - link

    Well, yes. But it's not the same saying CFS or GTS. I think it should be noted that the phone is using GTS whose run queues work like in CFS.
  • Andrei Frumusanu - Saturday, September 5, 2015 - link

    GTS doesn't touch runqueues. GTS's modification to the CFS scheduler are relatively minor, it's still very much CFS at the core.
  • AySz88 - Tuesday, September 1, 2015 - link

    A technical note regarding "...scaling up higher in frequency has a quadratically detrimental effect on power efficiency as we need higher operating voltages..." - note that power consumption *already* goes up quadratically as voltage squared, BEFORE including the frequency change (i.e. P = k*f*v*v). So if you're also scaling up voltage while increasing frequency, you get a horrific blowing-up-in-your-face CUBIC relationship between power and frequency.
  • ThreeDee912 - Tuesday, September 1, 2015 - link

    Being in the Apple camp, I do know Apple also highly encourages developers to use multithreading as much as possible with their Grand Central Dispatch API, and has implemented things like App Nap and timer coalescing to help with the "race-to-idle" in OS X. I'm guessing Apple is likely taking this into account when designing their ARM CPUs as well. The thing is, unlike OS X, iOS and their A-series CPUs are mostly a black box of unknowns, other than whatever APIs they let developers use.
  • jjj - Wednesday, September 2, 2015 - link

    For web browsing i do wish you would look at heavier sites, worst case scenario since that's when the device stumbles and look at desktop versions.
    Would be nice to have a total run-queue depth graph normalized for core perf and clocks ( so converted in total perf expressed in w/e unit you like) to see what total perf would be needed (and mix of cores) with an ideal scheduler - pretty hard to do it in a reasonable way but it would be an interesting metric. After all the current total is a bit misleading by combining small and big , it shows facts but ppl can draw the wrong conclusions, like 4 cores is enough or 8 is not. Many commenters seem to jump to certain conclusions because of it too.
    Would be nice to see the tests for each cluster with the other cluster shut down, ideally with perf and power measured too. Would help address some objections in the comments.
    In the Hangouts launch test conclusion you say that more than 4 cores wouldn't be needed but that doesn't seem accurate outside the high end since if all the cores were small. assuming the small cores would be 2-3 times lower perf, then above 1.5 run-queue depth on the big cores might require more than 4 small cores if we had no big ones. Same goes for some other tests
    A SoC with 3 types of cores , 2 of them big ,even bigger than A72 , and a bunch of medium and small does seem to make sense, with a proper scheduler and thermal management ofc. For midrange 2+4 should do and it wouldn't increase the cost too much vs 8 small ones, depending a bit on cache size - lets say on 16ff A53 bellow 0.5mm2 , A72 1.15mm2 and cache maybe close to 1.7 mm2 per 1MB. so a very rough approximation would be 2-3mm2 penalty depending if the dual A72 has 1 or 2MB L2. a lot more if the dual A72 forces them to add a second memory chan but even then it's worth the cost, 1-2$ more for the OEM would be worth it given the gain in single threaded perf and the marketing benefits
    When looking at perf and battery in the real world multitasking is always present in some way. in benchmarks, never is. So why not try that too, something you encounter in daily usage. a couple of extra threads from other things should matter enough - maybe on Samsung devices you could test in split screen mode too, since it's something ppl do use and actually like.
    For games it would be interesting to plot GPU clocks and power or temps as well as maybe FPS. Was expecting games to use the small cores more to allow for more TDP to go to the GPU and the games you tested do seem to do just that. Maybe you could look at a bunch of games from that perspective. Then again, it would be nice if AT would just start testing real games instead of synthetic nonsense that has minimal value and relevance
    A look at image processing done on CPU+GPU would be interesting.
    The way Android scales on many cores is encouraging for glasses where every cubic mm matters and batteries got to be tiny. Do hope the rumored Mercury core for wearables at 50-150mW is real and shows up soon.
    Oh and i do support looking at how AT's battery of benchmarks is behaving but a better solution would be to transition away from synthetic, no idea why it takes so long in mobile when we had the PC precedent and nobody needs an explanation as to why synthetic benchmarks are far from ideal.
    Anyway, great to see some effort in actually understanding mobile, as opposed to dubious synthetic benchmarks and empty assumptions that have been dominating this scene.AT should hire a few more people to help you out and increase the frequency of such articles since there are lots of things to explore and nobody is doing it.
  • tuxRoller - Wednesday, September 2, 2015 - link

    Linux had largely been guided towards massively multiprocess workloads. If they didn't do this well then they wouldn't do anything well.
    The scheduler should be getting a lot better soon. It APPEARS that, after a long long long long time, things are moving forward on the combined scheduler (cfs), cpuidle, and cpufreq (dvfs) front. That's necessary in order to proper scheduling of tasks, especially across an aSMP soc.
    One thing to keep in mind is that these oems often carry out of tree patches that they believe help their hardware. Often these patches are of, ahem, suspect quality, and pretty much always perform their task with some noticeable drawbacks. The upstream solution is (almost?) always "better".
    Iow, things should only get better.
  • toyotabedzrock - Wednesday, September 2, 2015 - link

    Is Chrome rendering pages it expects you to visit on the little cores?
    You should test with Chrome DEV as well.

Log in

Don't have an account? Sign up now