Overall Analysis & Conclusion

Hopefully we've managed to cover a few of the more common use-cases that are routinely encountered in daily usage on Android and get a good idea of how applications behave. We've seen some quite expected numbers for some use-cases but also stumbled on very large surprises that weren't quite as obvious. 

There were two cases that especially stood out: browser usage, and application installation and updates. It could be argued that app updates are merely a corner-case that doesn't affect a user's experience much. After all, installing and updating apps represents only an insignificant fraction of what a user does on a device. Browser usage and web-page rendering in general, however, are among the most common scenarios on a smartphone, and that's where we encountered the largest surprises.

When I started out on this piece, the goal I set out to reach was to either confirm or debunk how useful homogeneous 8-core designs would be in the real world. The fact that Chrome, and to a lesser extent Samsung's stock browser, were able to consistently load up to 6-8 concurrent processes while loading a page suddenly gives a lot of credence to these 8-core designs, which we would otherwise not have thought able to fully use their CPU configurations. In terms of pure computational load, web-page rendering remains one of the heaviest tasks on a smartphone, so it's very encouraging to see that today's web rendering engines are able to make good use of parallelization to spread the load between the available CPU cores.

It's hard to summarize the vast amount of data of the last 16 pages in an orderly and correct manner. After all, we are talking about extremely varying use-cases and time-scales for each scenario. While averaging the metrics over the course of a scenario might seem a good idea at first, one has to keep in mind that this wouldn't properly represent cases where load peaks for short durations. It's these small computational bursts that are most often the cause of "lags" and frame-drops. So to better represent the bottlenecks that determine user-visible application speed and performance, we instead use the 90th percentile of the CPU run-queue depths:

90th Percentile Run-Queue Depth Averages

Scenario                        Little Cluster   Big Cluster   Little + Big Clusters
S-Browser - AnandTech Article   2.27             2.19          3.87
S-Browser - AnandTech FP        3.12             1.25          4.15
Chrome - AnandTech FP           5.69             1.84          7.10
Chrome - BBC Frontpage          5.00             2.00          6.22
Hangouts Launch                 2.77             2.11          4.01
Hangouts Writing A Message      2.80             0.05          2.57
Reddit Sync Launch              1.84             1.11          2.38
Reddit Sync Scrolling           0.95             1.03          1.46
Play Store Open & Scroll        2.87             0.78          3.45
Play Store App Updates          3.73             5.42          8.51
Camera: Launch                  1.45             2.73          2.98
Camera: Still Snapshot          4.12             0.87          4.59
Camera: Video Recording         5.17             2.04          5.42
Real Racing 3 Launch            2.16             1.33          3.26
Real Racing 3 Playing           2.09             0.89          2.96
Modern Combat 5 Playing         2.09             0.73          2.68

I was wary of creating this table as it can easily be misinterpreted: because run-queue depth averages are not directly representative of the number of concurrent threads in a given scenario, we lose information when aggregating them for a given cluster or the whole system. This happens, for example, on the big cluster in the AT article load scenario, where the 90th percentile of the aggregate rq-depth reaches 2.19 while in reality this figure is composed of 4 medium-high threads. Readers should thus keep in mind the detailed graphs of the preceding pages when reading the table.
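To illustrate why the 90th percentile was preferred over a plain average when summarizing bursty load, here is a minimal Python sketch. The trace values are invented for illustration and are not taken from the measurements, and the nearest-rank percentile definition is an assumption, since the exact method behind the table isn't stated.

```python
import math

def mean(samples):
    """Arithmetic mean of a list of run-queue depth samples."""
    return sum(samples) / len(samples)

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical trace (not measured data): the system is nearly idle for
# 80% of the samples and sees a burst of six runnable threads for 20%.
trace = [0.2] * 80 + [6.0] * 20

print(f"mean rq-depth: {mean(trace):.2f}")
print(f"90th percentile rq-depth: {percentile(trace, 90):.2f}")
```

In this made-up trace the mean stays well below two runnable threads, while the 90th percentile reports the full burst depth, which is the figure that matters for perceived lag.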

While not directly the goal of the article, the collected data also serves as a perfect case-study for heterogeneous big.LITTLE SoCs. We've long seen discussions concerning what the "ideal" big.LITTLE configuration would be. There are several angles to this: the optimal little and big cluster core counts, and whether we're aiming for performance or power efficiency in each case. In terms of low- to medium-performance threads, we've had several cases where 4 little cores weren't enough. Web page rendering in Chrome in particular seems to be the killer use-case where actually having two clusters of highly efficient cores makes sense.

On the high-performance "big" cluster side, the discussion topic is more about whether 2 or 4 core designs make more sense. I think the decision here is not about performance but rather about power efficiency. A 2-core big-cluster design would provide more than enough performance for most use-cases, but as we've seen throughout our testing during interactive use it's more common than not to have 2+ threads placed on the big cluster. So while a 2-core design could handle bursts where ~3-4 threads are placed onto the big cluster, the CPUs would need to scale up higher in frequency to provide the same performance compared to a wider 4-core design. And scaling up higher in frequency has a quadratically detrimental effect on power efficiency as we need higher operating voltages. At the end of the day I think the 4 big core designs are not only the better performing ones but also the more efficient ones. 
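The frequency argument above can be sketched with the usual first-order model of dynamic power, P = C * V^2 * f. The voltage/frequency points below are hypothetical placeholders rather than measured values for any SoC, and the sketch ignores static leakage and assumes the four threads are equal and perfectly parallel.

```python
# Hypothetical operating points (MHz -> volts). These are illustrative
# assumptions only, not data from the test device.
OPP = {800: 0.80, 1600: 1.05}

def dynamic_power(freq_mhz, volts, capacitance=1.0):
    """Relative dynamic power of one core: P = C * V^2 * f."""
    return capacitance * volts ** 2 * freq_mhz

def cluster_power(cores, freq_mhz):
    """Relative power of a cluster with all cores busy at one frequency."""
    return cores * dynamic_power(freq_mhz, OPP[freq_mhz])

# Both configurations deliver the same aggregate throughput
# (cores * frequency) for four equal threads.
wide = cluster_power(4, 800)     # 4 cores, one thread each
narrow = cluster_power(2, 1600)  # 2 cores, two threads each, double clock

print(f"4 cores @  800 MHz: {wide:.0f}")
print(f"2 cores @ 1600 MHz: {narrow:.0f}")
print(f"narrow / wide: {narrow / wide:.2f}")
```

Under these assumptions the ratio reduces to the square of the voltage ratio, so with the made-up points above the 2-core configuration draws roughly 1.7 times the power for the same work.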

This puts one particular vendor in quite an interesting position: MediaTek. Even if one weren't able to fully saturate a cluster, one can still derive power efficiency advantages because two small clusters would be able to operate at separate frequencies and thus efficiency points. I've encountered enough scenarios that would in theory fit the Helio X20's tri-cluster design that I'm starting to think that such a design would actually be a very smart choice for current Android devices.

What about more traditional SoC configurations? As mentioned earlier, symmetric 8-core designs such as MediaTek's Helio X10 would, contrary to one's expectations, seemingly be able to take advantage of their higher core counts. So while it would be preferable to have higher performance cores such as Cortex A57s or A72s, one has to keep in mind that the target market of these architectures is limited to higher-end SoCs. The 8 little-core designs are mostly targeted at the entry- and mid-level, where adding a second Cortex A53 cluster can be a very cheap way of still providing benefits in everyday usage, particularly in web-browsing.

What is clear, though, albeit with some corner-cases, is that the vast majority of applications do seem to be optimal for quad-core SoCs. This is why traditional 4-core and 4.4 big.LITTLE designs still appear to make the most sense in terms of providing a balanced configuration and making the most use of the hardware at hand. For big.LITTLE, even if there were no use-cases where all cores are concurrently used, it's not a big deal, as what we are aiming for in heterogeneous systems is power efficiency gains.

This is also the point of the discussion where the debate of the potential detrimental effect of having more cores comes into play: The fact that a SoC has more cores does not automatically mean it uses more power. As demonstrated in the data, modern power management is advanced enough to make extensive use of fine-grained power-gated idle states, thus eliminating any overhead there might be of simply having more physical cores on the silicon. If there are cases (And as we've seen, there are!) which make use of more cores then this should be seen purely as an added bonus and icing on the cake. 

What about narrow CPU-core number design philosophies? Would such designs make sense on Android? This is probably another question that our readers will ask themselves when looking at the data. Apple and, recently, Nvidia with their Denver architecture both chose to keep going the route of employing large 2-core designs that are strong in their single-threaded performance but fall behind in terms of multi-threaded performance.

For Apple it can be argued that we're dealing with a very different operating system and that iOS applications are likely less threaded than their Android counterparts. But there are cases where this doesn't necessarily hold true: for example, browser rendering engines, as demonstrated, can be multi-threaded if adapted to do so. Native high-end games which already make use of multiple threads are also unlikely to differ in their threading logic between the platforms.

While such narrow CPU-core designs would have higher performance at a given frequency, that is not a direct indicator of the actual performance/W efficiency that a single thread would have on these chipsets. We still haven't had a chance to make a proper apples-to-apples comparison for these architectures, so we're limited to theorycrafting with the data we currently have available to us:

What we see in the use-case analysis is that the number of use-cases where an application is visibly limited by single-threaded performance seems to be very small. In fact, in a large number of the analyzed scenarios our test device's Cortex A57 cores would rarely need to ramp up to their full frequency beyond short bursts (thermal throttling was not a factor in any of the tests). On the other hand, scenarios where we'd find 3-4 high-load threads don't seem particularly hard to find, and actually appear to be a pretty common occurrence. For mobile, the choice seems obvious due to the power curve implications. Unless the load is so small that it isn't worthwhile to spend the energy to bring a secondary core out of its idle state, one could generalize that if one is able to spread the load over multiple CPUs, it will always be preferable and more efficient to do so.
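The caveat about very small loads can be made concrete with a rough break-even sketch. It assumes energy per unit of work scales with V^2 (from P = C * V^2 * f, with run time inversely proportional to f), that the work splits evenly across two cores, and that waking the second core costs a fixed amount of energy. All numbers, names and units below are hypothetical and chosen only to show the shape of the trade-off.

```python
# Hypothetical values in arbitrary units; none of these are measurements.
V_HIGH = 1.05     # voltage needed to finish the work in time on one core
V_LOW = 0.80      # voltage when the same work is split over two slower cores
WAKE_COST = 50.0  # fixed energy to bring the second core out of idle

def energy_single(work, v_high):
    """Energy for one core to do all the work at the higher voltage."""
    return v_high ** 2 * work

def energy_spread(work, v_low, wake_cost):
    """Energy for two cores to share the work at the lower voltage,
    plus the one-off cost of waking the second core."""
    return v_low ** 2 * work + wake_cost

def break_even_work(v_high, v_low, wake_cost):
    """Amount of work above which spreading the load costs less energy."""
    return wake_cost / (v_high ** 2 - v_low ** 2)

for work in (10.0, 100.0, 1000.0):
    single = energy_single(work, V_HIGH)
    spread = energy_spread(work, V_LOW, WAKE_COST)
    better = "spread" if spread < single else "single core"
    print(f"work={work:6.0f}  single={single:7.1f}  spread={spread:7.1f}  -> {better}")

print(f"break-even work: {break_even_work(V_HIGH, V_LOW, WAKE_COST):.1f}")
```

Below the break-even point the wake-up cost dominates and keeping the load on one core is cheaper; above it, spreading the load wins by a margin that grows with the amount of work.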

In the end, what we should take away from this analysis is that Android devices can make much better use of multi-threading than initially expected. There's very solid evidence that not only are 4.4 big.LITTLE designs validated, but that there are also practical benefits to using 8-core "little" designs over similar single-cluster 4-core SoCs. For the foreseeable future it seems that vendors who rely on ARM's CPU designs will be well served by continued use of 4.4 b.L designs. Only MediaTek seems to fall outside the norm here with its upcoming X20 SoC, and I'm definitely looking forward to seeing how it behaves in the real world. We'll also see some vendors revert to quad-core designs in their custom architectures; while we've yet to get a better picture of how these will behave in terms of performance and power, I think that 4 cores will be quite a reasonable target and sweet spot for vendors to aim for.

156 Comments

  • R0H1T - Tuesday, September 01, 2015

    Seems like Android has Windows' number as far as "multi-threading" is concerned, kudos to Google for this & seems like the tired old argument of developers getting a free pass (for poor MT implementation on desktops) needs to change asap!
  • Impulses - Tuesday, September 01, 2015

    Ehh, I think you're ignoring some key differences in clock speed and single threaded performance, not to mention how easily Intel can ramp clock speed up and back down, and then there's Hyper Threading which allows you to span more threads per core.

    Laptops might be the outlier, but I dunno what benefit a desktop (which have commonly run quads for years) would see from a lower powered core cluster. Development just works very differently by nature of the environment.

    Also things that benefit a ton from parallelization on the desktop often end up using the GPU instead... And/or specialized instructions that aren't available at all on mobile. It's not even apples and oranges IMO, it's apples and watermelons.
  • R0H1T - Tuesday, September 01, 2015

    You're missing the point, which is that Google & Android have shown (even with the vast number of SoC's it runs on) that MT & load management, when implemented properly, on the supported hardware & complementing software, makes great use of x number of cores even in a highly constrained environment like a smartphone.

    On desktops we ought to have had affordable octa cores available for the masses by now, but since Intel has no real competition & they price their products through the roof, we're seeing how Windows & the x86 platform have stagnated. Granted that more people are moving to small, portable computing devices but there's no reason why the OS & the platform as a whole has to slow down, also the clock speed, IPC argument is getting old now. If there's anything DX12, Mantle, Vulkan et al have shown us, it's that if there's good hardware & the willingness to push it to its limits developers, with the right tools at hand, will make use of it. Not to mention giving them a free pass for badly coded programs, remember the "ST performance is king" argument, is the wrong way to go as it not only wastes the (great) potential of desktops but it also slows down the progress of PC as a platform.

    Now I know MT isn't a cakewalk especially on modern systems but if anything it should be more widespread because desktops & notebooks give a lot of thermal headroom, as compared to tablets & smartphones, besides the 30+ years of history behind this particular industry should make the task easier. Also not all compute tasks can be offloaded to GPU, that's why it's even more imperative that the users push developers to make use of more cores & not get the free ride that GPGPU has been giving them over the last few years, as it is the GPU industry is also slowing down massively & then we'll eventually be back to square one & zero growth.
  • metafor - Tuesday, September 01, 2015

    Yes and no. Google and Android are able to show that things like app updates, web page loads and general system upkeep are able to take advantage of multiple threads. But that's been true for a while. In a smartphone, those happen to be the performance dominating tasks. On a desktop, those tasks are noise.

    Desktop workloads that actually stress the CPU (and users care about performing well) are very different. That's not to say they're not threadable, but they may not be as threadable as Chrome, which basically eats RAM and processes.

    That being said, heterogeneous MT could make a lot of sense for laptop processors as well. Having threadable workloads run on smaller Atoms instead of big Sky Lakes would probably improve efficiency. But it may not be as dramatic depending on the perf/W of Sky Lake at lower frequencies.
  • niva - Tuesday, September 01, 2015

    OK can we talk about this for a bit. I for one found the webpage CPU usage extremely disturbing. I'm running an old phone, Galaxy Nexus, and browsing has become by far the task my phone struggles with the most. Why is that? What is it about modern websites that causes them to be so CPU heavy? Is that acceptable? It does seem that much of the internet is filled with websites running shady scripts in the background and automatically playing video or sound which is annoying at the very least, but detrimental to performance always. Whatever happened to website optimization for minimizing data usage and actually making websites accessible?

    Secondly, what is the actual throughput of CPUs in desktops compared to the latest state of the line arm APUs? Just because desktop workloads might be different, does that mean that a mobile APU cannot handle it or is that simply due to the usage mode of the device in question? What I'm seeing out of mobile/phone chips is that they are extremely capable, to the point I'm starting to wonder if I'll ever need another desktop rig to replace my old Phenom X2 machine.
  • metafor - Tuesday, September 01, 2015

    I would guess that websites are just more complicated nowadays. Think about a dynamic website like Twitter, which has to have live menus and notifications/updates. That's basically a program more than just a web page. We've slowly migrated what used to be stand-alone programs to load-on-demand web programs. And added many many inefficient layers of script interpreters in between.
  • emn13 - Thursday, September 03, 2015

    Somewhat ironically, the more modern a web-page the *less* friendly it is likely to be to multithreading. After all, modern features tend to include heavy javascript usage (which is almost purely single-threaded), and a CPU usage that is bottlenecked by a path through javascript (typically layout not actually the JS itself, but that layout affects JS and hence needs fine-grained interaction).
  • Jaybus - Tuesday, September 01, 2015

    It is the more extensive use of client-side processing, in a nutshell, JavaScript and JSON. On older websites, the dynamic stuff was processed server-side and the client simply did page reloads. The modern sites require less bandwidth, but at the expense of increasing CPU usage.

    Also, modern sites are higher res and more image intensive, or in other words more GPU heavy as well. Some of the Nexus struggle can be attributed to GPU load.
  • mkozakewich - Wednesday, September 02, 2015

    Most of it has to do with using multiple JavaScript libraries. It's not strange to need to download over 50 different files on a website today. Anandtech.com took 123 requests over four seconds to load. Mostly fonts, ads, and Twitter stuff, but it adds up.
  • name99 - Tuesday, September 01, 2015

    You are totally misinterpreting these results.
    The mere existence of a large number of runnable threads does not mean that the cores are being usefully used. Knowing that there are frequently four runnable threads is NOT the same thing as knowing that four cores are useful, because it is quite possible that those threads are low priority, and that sliding them so as to run consecutively rather than simultaneously would have no effect on perceived user performance.

    There is plenty of evidence to suggest that this interpretation is correct.
    Within the AnandTech data, the fact that these threads are usually on the LITTLE cores, and running those at low frequency, suggests they are not high priority threads.

    This paper from MS research confirms the hypothesis:
    http://research.microsoft.com:8082/en-us/um/people...

    Now there is a whole lot of tribalism going on in this thread. I'm not interested in that; I'm interested in the facts. What the MS paper states (confirmed, IMHO, by these AnandTech results) is that there is a reasonable (around 20%) throughput improvement in going from one to two threads, along with a small (around 10%) energy drop, and that going from two to three or four cores buys you only very slight further energy and performance boosts.
    In one sense this means there's no harm in having octacores around --- they don't seem to be burning energy, and in principle they could deliver extra snappiness (though the lousiness of the scheduling in these AnandTech results suggests that's more a hope than a reality). But there's a world of difference between the claim "doesn't hurt energy, may occasionally be slightly useful" and the claim "pretty much always useful because apps are so deeply threaded these days".
