Overall Analysis & Conclusion

Hopefully we've managed to cover a few of the more common use-cases that are routinely encountered in daily usage on Android, and to give a good idea of how applications behave. We've seen fairly expected numbers for some use-cases, but we also stumbled on some very large surprises that weren't nearly as obvious.

There were two cases that especially stood out: browser usage, and application installation and updates. It could be argued that app updates are merely a corner-case that doesn't affect the user experience much; after all, installing and updating apps represents only an insignificant fraction of what a user does on a device. Browser usage, and web-page rendering in general, however, is one of the most common and frequently encountered scenarios on a smartphone, and it's here that we encountered the largest surprises.

When I started this piece, the goal I set out to reach was to either confirm or debunk how useful homogeneous 8-core designs would be in the real world. The fact that Chrome, and to a lesser extent Samsung's stock browser, were able to consistently load up to 6-8 concurrent processes while loading a page suddenly gives a lot of credence to these 8-core designs, which we would otherwise not have expected to fully use their CPU configurations. In terms of pure computational load, web-page rendering remains one of the heaviest tasks on a smartphone, so it's very encouraging to see that today's web rendering engines are able to make good use of parallelization to spread the load between the available CPU cores.
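
For readers who want to poke at this themselves, the process and thread population of a browser can be inspected through Linux's procfs, for instance from an adb shell on a rooted device. The Python sketch below is a hypothetical illustration rather than the tooling used for this article; note that a raw thread count says nothing about how many of those threads are runnable at the same time, which is exactly why this analysis relied on run-queue depths instead.

```python
import os

def thread_count(pid: int) -> int:
    """Count a process's threads by listing its /proc/<pid>/task entries."""
    return len(os.listdir(f"/proc/{pid}/task"))

def find_pids(fragment: str):
    """Yield (pid, cmdline) for processes whose command line contains fragment."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace").strip()
        except OSError:
            continue  # process exited while we were iterating
        if fragment in cmdline:
            yield int(entry), cmdline

# Chrome spawns separate renderer processes, each with many threads.
for pid, cmd in find_pids("chrome"):
    print(f"{pid:>6}  {thread_count(pid):>3} threads  {cmd[:60]}")
```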

It's hard to summarize the vast amount of data of the last 16 pages in an orderly and correct manner. After all, we are talking about widely varying use-cases and time-scales for each scenario. While averaging the metrics over the course of a scenario might seem like a good idea at first, one has to keep in mind that this wouldn't properly represent cases where load peaks for shorter durations. It's these small computational bursts that are most often the cause of "lags" and frame-drops. So to better represent the bottlenecks that determine user-visible application speed and performance, we instead use the 90th percentile of the CPU run-queue depths:

90th Percentile Run-Queue Depth Averages

| Scenario                      | Little Cluster | Big Cluster | Little + Big Clusters |
|-------------------------------|----------------|-------------|-----------------------|
| S-Browser - AnandTech Article | 2.27           | 2.19        | 3.87                  |
| S-Browser - AnandTech FP      | 3.12           | 1.25        | 4.15                  |
| Chrome - AnandTech FP         | 5.69           | 1.84        | 7.10                  |
| Chrome - BBC Frontpage        | 5.00           | 2.00        | 6.22                  |
| Hangouts Launch               | 2.77           | 2.11        | 4.01                  |
| Hangouts Writing A Message    | 2.80           | 0.05        | 2.57                  |
| Reddit Sync Launch            | 1.84           | 1.11        | 2.38                  |
| Reddit Sync Scrolling         | 0.95           | 1.03        | 1.46                  |
| Play Store Open & Scroll      | 2.87           | 0.78        | 3.45                  |
| Play Store App Updates        | 3.73           | 5.42        | 8.51                  |
| Camera: Launch                | 1.45           | 2.73        | 2.98                  |
| Camera: Still Snapshot        | 4.12           | 0.87        | 4.59                  |
| Camera: Video Recording       | 5.17           | 2.04        | 5.42                  |
| Real Racing 3 Launch          | 2.16           | 1.33        | 3.26                  |
| Real Racing 3 Playing         | 2.09           | 0.89        | 2.96                  |
| Modern Combat 5 Playing       | 2.09           | 0.73        | 2.68                  |

I was wary of creating this table as it can be easily misinterpreted: because run-queue depth averages are not directly representative of the number of concurrent threads in a given scenario, we lose information when aggregating them for a given cluster or for the whole system. This happens, for example, on the big cluster in the AnandTech article load scenario, where the 90th percentile of the aggregate rq-depth reaches 2.19 while in reality this figure is composed of 4 medium-to-high-load threads. Readers should thus keep in mind the actual detailed graphs of the preceding pages when reading the table.
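
For those who want to reproduce this kind of figure from their own scheduler traces, the metric itself is straightforward: a nearest-rank 90th percentile over the sampled run-queue depths. The following Python sketch demonstrates the computation on made-up sample data; it is an illustration of the metric, not the scripts used to generate the table above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical per-interval run-queue depths reconstructed from a trace.
rq_depth_samples = [1, 1, 2, 2, 2, 3, 3, 4, 6, 7]
print(percentile(rq_depth_samples, 90))  # -> 6
```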

While not directly the goal of the article, the collected data also serves as a perfect case-study for heterogeneous big.LITTLE SoCs. We've long seen discussions concerning what the "ideal" big.LITTLE configuration would be. There are several angles to this: the most optimal little and big cluster core counts, and whether we're aiming for performance or power efficiency in each case. In terms of low- to medium-performance threads, we've seen several cases where 4 little cores weren't enough. Web-page rendering in Chrome in particular seems to be the killer use-case where actually having two clusters of highly efficient cores makes sense.

On the high-performance "big" cluster side, the discussion is more about whether 2- or 4-core designs make more sense. I think the decision here is not about performance but rather about power efficiency. A 2-core big-cluster design would provide more than enough performance for most use-cases, but as we've seen throughout our testing, during interactive use it's more common than not to have 2+ threads placed on the big cluster. So while a 2-core design could handle bursts where ~3-4 threads are placed onto the big cluster, the CPUs would need to scale up higher in frequency to provide the same performance as a wider 4-core design. And scaling up in frequency has a quadratically detrimental effect on power efficiency, as it requires higher operating voltages. At the end of the day I think the 4-big-core designs are not only the better performing ones but also the more efficient ones.
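
To put rough numbers on the voltage argument: to a first order, CMOS dynamic power scales as P ≈ C·V²·f, so doubling the core count while halving frequency (and lowering voltage accordingly) can deliver the same ideal aggregate throughput at lower power. The Python sketch below uses purely hypothetical operating points, not measured values from our test devices:

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """First-order CMOS dynamic power: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

C_EFF = 1e-9  # effective switched capacitance per core, hypothetical value

# Equal ideal throughput: 2 cores at 2.0 GHz vs. 4 cores at 1.0 GHz.
# The voltages are illustrative DVFS operating points, not measurements.
two_wide  = 2 * dynamic_power(C_EFF, 1.20, 2.0e9)
four_wide = 4 * dynamic_power(C_EFF, 0.90, 1.0e9)

print(f"2 big cores @ 2.0 GHz, 1.20 V: {two_wide:.2f} W")
print(f"4 big cores @ 1.0 GHz, 0.90 V: {four_wide:.2f} W")
```

With these illustrative numbers the 4-core configuration draws noticeably less dynamic power for the same nominal throughput, which is the efficiency argument for wider big clusters in a nutshell.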

This puts one particular vendor in quite an interesting position: MediaTek. Even if one isn't able to fully saturate a cluster, one can still derive power efficiency advantages from the fact that two small clusters are able to operate at separate frequencies, and thus at separate efficiency points. I've encountered enough scenarios that would in theory fit the Helio X20's tri-cluster design that I'm starting to think such a design would actually be a very smart choice for current Android devices.

What about more traditional SoC configurations? As mentioned earlier, symmetric 8-core designs such as MediaTek's Helio X10 would, contrary to one's expectations, seemingly be able to take advantage of their higher core counts. So while it would be preferable to have higher-performance cores such as Cortex A57s or A72s, one has to keep in mind that the target market of these architectures is limited to higher-end SoCs. The 8 little-core designs are mostly targeted at the entry- and mid-level segments, where adding a second Cortex A53 cluster can be a very cheap way of still providing benefits in everyday usage, particularly in web-browsing.

What is clear, corner-cases aside, is that the vast majority of applications do seem well matched to quad-core SoCs. This is why traditional 4-core and 4+4 big.LITTLE designs still appear to make the most sense in terms of providing a balanced configuration and making the most use of the hardware at hand. For big.LITTLE, even if there were no use-cases where all cores are concurrently used, it wouldn't be a big deal, as what we are aiming for in heterogeneous systems is power efficiency gains.

This is also the point in the discussion where the debate over the potentially detrimental effect of having more cores comes into play: the fact that an SoC has more cores does not automatically mean it uses more power. As demonstrated in the data, modern power management is advanced enough to make extensive use of fine-grained power-gated idle states, thus eliminating any overhead there might be from simply having more physical cores on the silicon. If there are cases (and as we've seen, there are!) which make use of more cores, then this should be seen purely as an added bonus and icing on the cake.
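
These idle states are exposed through the standard Linux cpuidle sysfs interface, so their cumulative residency can be checked directly on a device with shell access. Below is a minimal Python sketch assuming the standard sysfs layout; the available state names and their depth vary by kernel and SoC:

```python
import glob
import os

# Print per-core idle-state residency as exposed by the Linux cpuidle
# framework. Paths follow the standard sysfs layout; root access may be
# required on Android, and state names differ between kernels/SoCs.
for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpuidle/state*")):
    cpu = state_dir.split("/")[5]               # e.g. "cpu0"
    with open(os.path.join(state_dir, "name")) as f:
        name = f.read().strip()                 # e.g. "WFI" or a power-gated state
    with open(os.path.join(state_dir, "time")) as f:
        usec = int(f.read())                    # cumulative residency in microseconds
    print(f"{cpu:>6} {name:>16} {usec / 1e6:12.2f} s")
```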

What about narrow CPU core-count design philosophies? Would such designs make sense on Android? This is probably another question our readers will ask themselves when looking at the data. Apple, and recently Nvidia with their Denver architecture, both chose to keep going the route of large 2-core designs that are strong in single-threaded performance but fall behind in terms of multi-threaded performance.

For Apple it can be argued that we're dealing with a very different operating system, and it is likely that iOS applications are less threaded than their Android counterparts. But there are cases where this doesn't necessarily hold true: browser rendering engines, as demonstrated, can be multi-threaded if adapted to do so, and native high-end games which already make use of multiple threads are also unlikely to differ in their threading logic between the platforms.

While such narrow CPU-core designs would have higher performance at a given frequency, this is not a direct indicator of the actual performance/W efficiency that a single thread would achieve on these chipsets. We still haven't had a chance to make a proper apples-to-apples comparison of these architectures, so we're limited to theorycrafting with the data we currently have available to us:

What we see in the use-case analysis is that the number of use-cases where an application is visibly limited by single-threaded performance seems to be very small. In fact, in a large number of the analyzed scenarios, the Cortex A57 cores of our test device rarely needed to ramp up to their full frequency beyond short bursts (thermal throttling was not a factor in any of the tests). On the other hand, scenarios where we'd find 3-4 high-load threads weren't particularly hard to find, and actually appear to be a pretty common occurrence. For mobile, the choice seems obvious due to the power-curve implications: outside of loads so small that it isn't worthwhile to spend the energy to bring a secondary core out of its idle state, one could generalize that if one is able to spread the load over multiple CPUs, it will always be preferable and more efficient to do so.
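
Frequency residency is one of the easier claims to check for yourself: the kernel's cpufreq-stats framework keeps a cumulative time-at-frequency table per core, provided it was built with CONFIG_CPU_FREQ_STAT. A small Python sketch follows; treating cpu4 as the first big core matches our Exynos 5433 test device's topology, but is otherwise an assumption:

```python
# Cumulative time spent at each frequency, from the cpufreq-stats
# framework (requires CONFIG_CPU_FREQ_STAT). Each line of the file is
# "<freq_in_kHz> <time_in_USER_HZ_ticks>"; the relative shares are what
# matter for spotting whether a core ever ramps to full frequency.
PATH = "/sys/devices/system/cpu/cpu4/cpufreq/stats/time_in_state"  # cpu4 = first big core (assumption)

with open(PATH) as f:
    rows = [line.split() for line in f if line.strip()]

total = sum(int(ticks) for _, ticks in rows) or 1
for freq_khz, ticks in rows:
    share = 100.0 * int(ticks) / total
    print(f"{int(freq_khz) / 1000:7.0f} MHz  {share:5.1f} %")
```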

In the end, what we should take away from this analysis is that Android devices can make much better use of multi-threading than initially expected. There's very solid evidence that not only are 4+4 big.LITTLE designs validated, but we also find practical benefits of using 8-core "little" designs over similar single-cluster 4-core SoCs. For the foreseeable future it seems that vendors who rely on ARM's CPU designs will be well served by continued use of 4+4 big.LITTLE designs. Only MediaTek falls out of the norm here with its upcoming X20 SoC, and I'm definitely looking forward to seeing how it behaves in the real world. We'll also see some vendors revert to quad-core designs in their custom architectures; while we've yet to get a better picture of how these will behave in terms of performance and power, I think that 4 cores will be a quite reasonable target and sweet spot for vendors to aim for.

Comments

  • jjj - Wednesday, September 2, 2015 - link

    Fortune seems way heavy for example but even Amazon's home page (desktop version) seems not too friendly.
  • djscrew - Tuesday, September 1, 2015 - link

    Love the article, but after reading it, I feel like the articles you write comparing phone CPU performance & battery life are far more applicable. You lose access to so much of the information in this article that at the end of the day testing the actual phone & OS usage of the CPU makes more sense.
  • Daniel Egger - Tuesday, September 1, 2015 - link

    What I'm sincerely missing in this article is the differentiation between multi-processing and multi-threading, the difference being that multi-processing partitions the workload across multiple processes whereas multi-threading spawns threads which are then run in the OS directly or again mapped to processes in different ways, depending on the OS; in Linux they're actually mapped onto processes. Threads do share context with their creator, so shared information requires locking, which wastes performance and increases waiting times; the solution to this in the threading-happy world is to throw more threads at a problem in the hope that locking contention doesn't go through the roof and there's always enough work to do to keep the cores busy.

    So the optimum way to utilise resources to a maximum is actually not to use MT but MP for the heavy lifting and make sure that the heavy work is split evenly across the available number of to-be-utilised cores.

    For me it would actually be interesting to know whether some apps are actually clever enough to do MP for the real work or are just stupidly creating threads (and also how many).

    Since someone mentioned iOS: Actually if you're using queues this is not a traditional threading model but more akin to a MP model where different queues handled by workers (IMNSHO confusingly called thread) are used to dispatch work to in a usually lock free manner. Those workers (although they can be managed manually) are managed by the system and adjust automatically to the available resources to always deliver the best possible performance.
  • extide - Tuesday, September 1, 2015 - link

    Don't forget, most of it is in Java, so it's probably one Java process with several threads, not multiple Java processes. The native apps could go either way.
  • Daniel Egger - Tuesday, September 1, 2015 - link

    One interesting question here is: What does Google do? Chrome on regular desktop OS uses one process per view to properly isolate views from one another; does anybody know whether Chrome on Android does the same? I couldn't figure it out from the available documentation...
  • praeses - Tuesday, September 1, 2015 - link

    Next time can the colour legend below the graphs have their little squares enlarged to the height of the text? For those who are colour-challenged, it would make it a lot easier to match even when the image is blown-up. There doesn't seem to be a reason to have them so small.
  • endrebjorsvik - Thursday, September 3, 2015 - link

    I would rather make the colors more intuitive. For instance by using a colormap like the jet colormap from Octave/Matlab. Low clock frequencies should be mapped to cool colors (blue to green), while high clock frequencies should be mapped to warm colors (yellow to red). By doing that you just have to look at the legend only once. After that, the colors speak for themselves.

    The plots are really hard to read now when you have green at both low and high frequency (700 and 1400), and four shades of blue evenly distributed over the frequency range (500, 900, 1100, 1500). When I want to read such a plot, I don't care whether the frequency is 600 or 700, so those two colors don't have to be very different. But 500 and 1500 should be vastly different. The plots in this article are made the opposite way: all the small steps have big color differences so that every small step can be distinguished from the next. But at some point the map ran out of major colors and started repeating the spectrum again, with only slightly different colors.
  • qlum - Tuesday, September 1, 2015 - link

    It would be interesting to see how desktop systems hold up in these tests, especially with AMD's 2-cores-per-module design.
  • name99 - Tuesday, September 1, 2015 - link

    Andrei,
    After so much work on your part it seems uncouth to complain! But this is the internet, so here goes...

    If you ever have the energy to revise this topic, allow me to suggest two changes to substantially improve the value of the results:

    With respect to how results are displayed:
    - Might I suggest you change the stacking order of the Power State Distribution graphs so that we see Power Gated (ie the most power saving state) at the bottom, with Clock Gated (slightly less power saving) in the middle, and Active on top.
    - The frequency distribution graphs make it really difficult to distinguish certain color pairs, and to see the big picture. Might I suggest that a plot using just grey scale (eg black at lowest frequency to white at highest frequency) would actually be easier to parse and to show the general structural pattern?

    As a larger point, while this data is interesting in many ways, it doesn't (IMHO) answer the real question of interest. Knowing that there are frequently four runnable threads is NOT the same thing as knowing that four cores are useful, because it is quite possible that those threads are low priority, and that sliding them so as to run consecutively rather than simultaneously would have no effect on perceived user performance.

    The only way, I think, that one can REALLY answer this particular question ("are four cores valuable, and if so how") is an elimination study. (Alternatives like trying to figure out the average run duration of short-term threads are really tough, especially given the granularity at which data is reported.)

    So the question is: does Android provide facilities for knocking out certain cores so that the scheduler just ignores them? If so, I can suggest a few very interesting experiments one might run to see the effects of certain knockout patterns. In each case, ideally, one would want to learn
    - "throughput" style performance (how fast the system scores on various benchmarks)
    - effect on battery usage
    - "snappiness" (which is difficult to measure objectively, but maybe is obvious enough for subjective results to be noticed).

    So, for example, what if we knock out all the .LITTLE cores? How much faster does the system seem to run, and with what effect on battery? Likewise if we knock out all the big cores? What if we have just two big cores (vaguely equivalent to an iPhone 6)? What if we have two big and two LITTLE cores?

    I don't have any axe to grind here; I've no idea what these experiments will show. But it would certainly be interesting to know, for example, whether a system consisting of only 4 big cores feels noticeably snappier than a big.LITTLE system while battery life is 95% as long. That might be a tradeoff many people are willing to make. Or maybe it goes the other way: a system with only one big core and 2 little cores feels just as fast as an octocore system, but the battery lasts 50% longer.
  • justinoes - Tuesday, September 1, 2015 - link

    This was a seriously fascinating read. It points to a few things...

    First, Android has some serious ability to take advantage of multiple cores, or ILP has improved dramatically. I remember when the Moto X (1st Gen) came out with a dual-core CPU, engineers at Moto said that even opening many websites didn't use more than two cores on most phones. [http://www.cnet.com/news/top-motorola-engineer-def...] Does this mean that Android has stepped up its game dramatically, or was that information not true to begin with?

    Second, it seems like there are two related components to the question that I have about multi-core performance. First, do extra cores get used? (You show that they do. Question answered.) Second, do extra cores matter from a performance perspective (if clock speed is compromised or otherwise)? (This is probably harder to answer because cores and clock are confounded: better CPU -> more cores, faster clock; and it's complicated by the heterogeneous nature of these CPUs' core setups.)

    I suppose the second question could be (mostly) answered by taking a homogeneous-core CPU, disabling cores sequentially, and looking at the changes in user-experienced performance and power consumption. I'm sure some people will buy something with the maximum number of cores, but I'm just curious about whether it'll make a difference in real-world situations.
