In the grand scheme of things, it hasn’t been all that long since we first covered Arm’s announcement of the new Cortex A76 CPU microarchitecture. The new CPU IP was publicly unveiled on the first of June, and Arm made big promises with regard to the performance and efficiency improvements of the new core. It’s been a little over 5 months since then, and as we originally predicted, we’ve seen vendors announce as well as ship SoCs with the new CPU.

Last week we published our review of the Huawei Mate 20 and Mate 20 Pro – both of which contain HiSilicon’s new Kirin 980 chipset. Unfortunately for the many readers who are based in the US, the review won’t be as interesting, as the devices won’t be available to them. For this reason I’m writing up a standalone piece focusing more on the results of the new Cortex A76 inside the Kirin 980, and discussing in more detail how I think things will play out in the upcoming generation of competing SoCs.

Verifying Arm’s performance projections

Naturally one of the first things people will be interested in is seeing how the Cortex A76 actually manages to perform in practice. Arm had advertised the Cortex A76 to reach clocks of up to 3GHz, and correspondingly had all of its performance projections presented at this frequency. As I wrote back in May, the 3GHz frequency was always an overly optimistic target that vendors would not be able to achieve; I said something around 2.5GHz would be a much more realistic figure. The Kirin 980 ended up being released with a final clock speed of 2.6GHz, which was more in line with what I expected.

The Cortex A76 at 3GHz was projected to perform 1.9x and 2.5x better, in integer and floating point respectively, than a Cortex A73 at 2.45GHz – which is the configuration of Qualcomm’s Snapdragon 835. Translated to a clock speed of 2.6GHz, the projected improvements become ~1.65x and ~2.15x.
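The frequency adjustment above can be reproduced with a short sketch. It assumes performance scales linearly with clock speed, which is a simplification (memory latency does not scale with the core clock), and the function name is purely illustrative.

```python
def scale_projection(speedup_at_ref, ref_ghz, target_ghz):
    """Rescale a projected speedup from ref_ghz to target_ghz.

    Assumption: performance is proportional to clock frequency.
    """
    return speedup_at_ref * target_ghz / ref_ghz


# Arm's projections were quoted for a 3GHz Cortex A76; the Kirin 980 runs at 2.6GHz.
int_proj = scale_projection(1.9, ref_ghz=3.0, target_ghz=2.6)
fp_proj = scale_projection(2.5, ref_ghz=3.0, target_ghz=2.6)

print(f"integer: ~{int_proj:.2f}x, floating point: ~{fp_proj:.2f}x")
```

Under this assumption the results land at roughly 1.65x and 2.17x, in line with the rounded figures quoted above.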

 GeekBench 4 Single Core

In practice, the Kirin 980 manages to reach an improvement of 1.77x in the integer score, as well as slightly exceeding the target improvement for the floating point score, achieving an increase of 2.21x. The reason the Kirin 980 exceeds the targets here may be linked to the fact that the chip is configured with a 4MB L3 while Arm’s simulations ran with a 2MB L3.

Moving over to SPEC2006, we have a set of more complex and robust workloads that better represent the wider range of applications that you would come to expect.

SPEC2006 Estimate

Here Arm’s performance projections were a bit more coloured, as we had been presented IPC comparisons as well as absolute score comparisons. In the absolute improvements at 3GHz, we saw claims of 2.1x “without thermal constraints” at 3.3GHz and figures of 1.9x “within 5W TDPs”. The latter figure was extremely confusing, as Arm’s marketing was contradictory as to what exactly it meant, which for a long time had me questioning whether the CPU would somehow hit thermal limits in the single-threaded SPEC workloads, which would have been a pretty terrible result.

The IPC comparisons are a lot more straightforward: Versus a Cortex A73, we would respectively see increases of 1.58x and 1.79x in the integer and FP suites.

In practice, the Kirin 980 and the Cortex A76 more than deliver: we’re seeing 1.89x and 2.04x increases in the integer and FP scores. In terms of IPC, the increases over the Cortex A73 based Kirin 970 and Snapdragon 835 are even more significant: here we’re seeing jumps of 1.78x and 1.92x respectively. In fact, because the Kirin 980 performed better than expected, it managed to reach the scores I had projected (based on Arm’s figures) for a 3GHz Cortex A76, while running at only 2.6GHz.
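As a rough check, the IPC jumps can be derived from the absolute score ratios by dividing out the clock speed difference. The sketch below assumes the 2.45GHz Snapdragon 835 is the baseline for the absolute figures; the helper name is hypothetical.

```python
def ipc_ratio(perf_ratio, new_ghz, old_ghz):
    """Per-clock (IPC) improvement implied by an absolute performance ratio.

    performance = IPC * frequency, so IPC ratio = perf ratio / frequency ratio.
    """
    return perf_ratio / (new_ghz / old_ghz)


A76_GHZ = 2.6   # Kirin 980 big-core clock
A73_GHZ = 2.45  # Snapdragon 835 clock (assumed baseline for the ratios)

int_ipc = ipc_ratio(1.89, A76_GHZ, A73_GHZ)
fp_ipc = ipc_ratio(2.04, A76_GHZ, A73_GHZ)

print(f"SPECint IPC: ~{int_ipc:.2f}x, SPECfp IPC: ~{fp_ipc:.2f}x")
```

With those inputs the per-clock ratios come out at roughly 1.78x and 1.92x, matching the measured IPC jumps and sitting well above Arm’s projected 1.58x and 1.79x.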

Memory subsystem performance matters enormously

There is one aspect of CPU performance that seems to be continuously misunderstood and misrepresented: memory subsystem performance. A CPU can be incredibly wide and have any amount of execution resources, but no matter how big the microarchitecture is, it matters little if the memory subsystem (caches, memory controllers) is not able to keep the machine properly fed with data. The mobile space over the last few years has pretty much seen the same workload progression that we’ve seen in desktops over the past decades, just at a vastly accelerated pace. Applications have become bigger and more complex in terms of their program sizes, and the data they’re processing has also seen significant growth.

The problem with this evolution is that the tools that we usually use to benchmark performance can become outdated if they can’t accurately reproduce the microarchitectural workload characteristics of modern every-day applications. Recently with the launch of the Kirin 980, I’ve seen some people get the wrong idea and come to the wrong conclusion in terms of the actual performance of the chipset, basing their opinion on results such as GeekBench 4 scores.

To explain this, I wanted to showcase the evolution of recent generation SoCs, all relative to a fixed starting figure. I picked the Snapdragon 835 for this as it represented a well-balanced and popular SoC.

SPEC2006 vs GeekBench4 Integer Performance Scaling

SPEC2006 vs GeekBench4 Floating Point Performance Scaling

In SPECint2006, the scores don’t seem to diverge all that much from what GeekBench4 is able to project, and this is valid for most SoCs. In this set, the only significant divergence comes from Apple’s A11 and A12 chips. Here the A11 and A12 were able to show significantly larger increases in SPEC workload performance than in GB4.

Switching over to SPECfp2006, beyond the obvious fact that the benchmarks here use more floating point datatypes in their programs, we see a much larger percentage of workloads that are characterised by putting a lot more demand on the memory subsystem. Here, we see a lot more discrepancy between the different SoCs. On one side, the Apple A12 was again able to showcase much bigger generational improvements in SPECfp than in GB4’s FP workloads, pointing to the massive memory subsystem performance improvements Apple was able to introduce this generation. On the other hand, the Exynos 9810 sticks out in the opposite way: its performance in SPEC was much lower than what we see in GeekBench4, which represents the Achilles heel of the chipset, as the CPU’s memory and cache subsystem lags far behind the competition.

The point I’m trying to make here is that the vast majority of real-world applications behave a lot more like SPEC than GeekBench4, and Apple’s new A12 and Samsung’s Exynos 9810 mark the two extremes of the divergence shown above. In more representative benchmarks such as browser JS framework performance tests (Speedometer 2.0), or on the Android side, PCMark 2.0, we see even greater instruction and data pressure than in SPEC – amplifying the differences exposed by SPECfp.

There are also benchmarks that go the opposite way in their workload characterisation: Dhrystone and Coremark have very small memory footprints. Here most of the benchmark will fit entirely into the lower cache hierarchies of a CPU, not putting any kind of pressure on the bigger caches or even DRAM. These are still useful benchmarks in their own right, but shouldn’t be taken as a representation of overall performance in modern applications. AnTuTu’s CPU test falls among these, as its footprint is also tiny and it does not test anything beyond the execution engines and the first level cache hierarchy.

HiSilicon’s Kirin 980 along with Arm’s Cortex A76 seems to strike a great balance in this regard: the performance between SPEC and GeekBench4 doesn’t diverge all too much. We’ll get back to this in a bit when looking at the efficiency results of the new Kirin chipset.

Top-tier energy/power efficiency, absolute performance still quite behind Apple

When it comes to power and energy efficiency, Arm made two claims: At the same power usage, the Cortex A76 would perform 40% better than the Cortex A75, and at the same performance point, the Cortex A76 would use only 50% of the energy of a Cortex A75. Of course these two figures are to be taken with quite a handful of salt as the comparison was made across process nodes.

Looking at the SPEC efficiency results, they seem to more than validate Arm’s claims. As I had mentioned before, I had made performance and power projections based on Arm’s figures back in May, and the actual results beat these figures. Because the Cortex A76 beat the IPC projections, it was able to achieve the target performance points at a more efficient frequency point than my 3GHz estimate back then.

The results for the chip are just excellent: The Kirin 980 beats the Snapdragon 845 in performance by 45-48%, all whilst using 25-30% less energy to complete the workloads. If we were to clock down the Kirin 980 or actually measure the energy efficiency of the lower clocked 1.9GHz A76 pairs in order to match the performance point of the S845, I can very easily see the Kirin 980 using less than half the energy.

The one metric that doesn’t quite pan out for Arm is the claim that at the same power, the Cortex A76 would perform 40% better. Here Arm chose an arbitrary 750mW point for the comparison – which may or may not make the claim accurate; we don’t know where this intersection point lies, and finding it would require more exact measurements of the frequency-power curve of both chipsets. The fact of the matter is that the Cortex A76 is a more power hungry CPU, and single core active platform power consumption has gone up by 14-21%.

It’s here where we can make the interesting comparison to Apple’s latest: The energy efficiency of the Kirin 980 is ever so slightly ahead of the Apple A12, meaning the perf/W of both SoCs is nearly identical. The big difference here is that Apple is able to achieve a 61-74% performance advantage, at a linear cost of 60-70% increased power consumption.
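To illustrate why these two ranges imply near-identical perf/W, the sketch below divides the relative performance by the relative power for every combination of the quoted range endpoints. How the endpoints pair up (for instance, which belongs to the integer suite and which to FP) is not stated here, so trying all combinations is an assumption that only brackets the result.

```python
from itertools import product

# A12 relative to Kirin 980, taken from the quoted ranges.
perf_advantage = (1.61, 1.74)   # 61-74% higher performance
power_increase = (1.60, 1.70)   # 60-70% higher power draw

# perf/W ratio = relative performance / relative power.
# Assumption: every endpoint pairing is considered, since the actual pairing is unknown.
ratios = [perf / power for perf, power in product(perf_advantage, power_increase)]
lowest, highest = min(ratios), max(ratios)

print(f"A12 perf/W relative to Kirin 980: {lowest:.2f}x to {highest:.2f}x")
```

Whichever pairing applies, the ratio stays within roughly 0.95x to 1.09x, which is why the extra performance reads as a near-linear trade against power rather than an efficiency lead for either chip.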

What it means for the next Snapdragon and the Exynos 9820

The excellent showing of the Kirin 980 is a good omen for the upcoming Snapdragon flagship. I’m expecting Qualcomm to be a little more aggressive when it comes to the core clocks, aiming just a tad above the 2.6GHz of the Kirin 980. What this will actually mean for the resulting power efficiency remains to be seen.

Performance on paper should also fare well, but in practice Qualcomm does have an aspect that can complicate things: the SoC’s system cache. Here Qualcomm is evidently trying to mimic Apple in having a further system-wide cache hierarchy before going to DRAM; for the Snapdragon 845 this was a double-edged sword, as memory latency saw a degradation over the Snapdragon 835. This degradation seemingly kept the Cortex A75 in the S845 from achieving its full potential. Hopefully the new generation SoC sees less of an impact in this regard, and we can expect good performance figures.

Samsung last week officially announced the Exynos 9820, and here the outlook is a bit more pessimistic. The Exynos 9810 did not fare well in benchmarks, and this was not only because of the scheduler issues, but also simply because the microarchitecture didn’t seem balanced. The Kirin 980 is able to beat the Exynos 9810’s top performance, all while consuming less than half the energy. At the more reasonable 2.3GHz frequency point of the chip, the performance gap widens to 23-30%, while the Exynos still shows a 42-47% energy efficiency disadvantage relative to the Kirin 980.

Samsung proclaims that the Exynos 9820 showcases 20% better performance, or 40% better efficiency. The keyword here is “or” – meaning each improvement is quoted with the other axis held constant. Taking the 2.7GHz figures as a base comparison, a 20% performance improvement could well compete with the Cortex A76, but the horrid energy efficiency of the chip would still remain. Similarly, taking the more efficient 2.3GHz result as the baseline performance, a 40% improvement in efficiency would match the Kirin 980 in efficiency, but the chip would still have to endure the performance deficit.

Samsung’s marketing figures just aren’t good enough, and mathematically I just don’t see any way the Exynos 9820 would be able to compete if the results do pan out like this. The only glimmer of hope here is that, much like Apple’s marketing department understated the performance improvements of the A12, S.LSI is understating the improvements of the Exynos 9820. Here the only scenario I could see as working out is that the claimed performance jump merely represents GeekBench4 scores, and actual improvements in SPEC and more realistic workloads see a much more significant jump, closing this ratio gap between the two benchmarks that we discussed just earlier. Let’s hope for this latter scenario.

The Cortex A76 is a very solid CPU – Deimos & Hercules will follow up

Arm had already teased the successor to Enyo (Cortex A76) with the reveal of Deimos and Hercules. Here Arm promised 15-20% performance increases in the next generation. Arm’s strength here lies in actually delivering an overall excellent package of performance within great power envelopes. Also, while this part of the PPA metric isn’t something consumers should inherently care about, Arm is able to keep the CPUs extremely small.

We’ve just recently seen Arm’s new server core in the wild – Ares should be the infrastructure counterpart to Enyo/A76 and part of the recently announced Neoverse family of CPU cores. It’s not hard to imagine 32 or 64 cores of this calibre on a single chip. Overall, we’re looking forward to more exciting products in the next several months – both in the mobile and infrastructure spaces.


  • Quantumz0d - Tuesday, November 20, 2018 - link

    Subpar SoCs and all these benchmarks. I've been running Android since Froyo 2.2 and until now 7.1.2 all customROMs and custom kernel. I used the first iPhone and yes that time it was something great later after the Snapdragon S4 Pro it was great for Android and after the 820, I honestly can't see any difference in comparing the SoCs where the A series can only run the jailed iOS where user can't do anything apart from being ruled by Orwellian Apple ecosystem. And people are like ashamed to use despite not seeing any masssssive UX loss.

    My phone has 820 and OP3T has 821 which uses highest Clock speed on high perf cores which caused instability and guess who brought to this to attention ? Sultanxda and it was implemented by Flar2 and all devs. Guess what ? It lost 200-300MHz clock speed. And judging by this severe fetishistic obsession to benchmarks on smartphones would have caused it to choke and die I suppose ? But guess what ? User experience barely being affected, thanks to custom drivers they build.

    And how about LineageOS, Running an OSS tuned for customation and user choice vs the iTunes dependent iDevices and Apple stamp approved.

    Its really a shame that how GB is being used all over as if its some sort of HWBOT points with Cinebench or WPrime. On smartphones we need UX and user choice. Look at Huawei EMUI garbage their phones are blocked by VLC due to aggressive background killing or blocking Installing Nova launcher and Bootloader unlock blocked.

    No surprise how this Apple taking over the world with A series is emphasized everywhere on AT Forums, Various blogs when their A12X is fabled for killing Laptops and Xbox as quoted by Apple without a proper file manager or mouse KB support.

    It would be great if Android reviews focused more on choice vs these bragging rights. As if the Android SoCs are causing some sort of debilitating disability losing to A series. Pathetic throttling isn't mentioned anywhere on iPhone reviews. Caught red handed deep pockets, enable a switch to user who is ultra dumb to know what it does.
  • tuxRoller - Tuesday, November 20, 2018 - link

    So, has a76 caught up to the a10 (I've forgotten the name of the cores) or not? You seem to be saying both of those ("[...] no one can catch up to even a 2 year [...]" and "[...] only to find that the [sic] are 2 years behind [...]").
    If it makes you feel better this is the closest Android has been to Apple since... 2013 (whatever year they went v8a). They are probably quite close to the limits of what standard architectures and our (mass) manufacturing capabilities can achieve. Instead, as has been noted elsewhere, you'll see further work with accelerators & heterogeneous systems architecture.
  • artk2219 - Tuesday, November 20, 2018 - link

    Its kind of a moot point if you can't buy those faster Apples chips to use in anything else, its basically Apple saying "see, our stuffs so fast! No you cant play with it or buy it to use for yourself, and no we aren't going to let you install your stuff on our devices"
  • xype - Wednesday, November 21, 2018 - link

    "Its kind of a moot point "

    How? I have an iPhone, but I´d still _love_ for the competition to catch up. Talking Android SoCs up as "decent" when they´re anything but just tells the manufacturers that they don´t really need to try harder. So Android gets shit SoCs and Apple can price their stuff above $1k; I don´t see any group of customers here as "winners".
  • Speedfriend - Wednesday, November 21, 2018 - link

    I use both Android and iPhone on a daily basis. There is not a single thing I do or app I use where there is any meaningful difference between the platforms in performance apart from my trading app where iOS kills off its background pricing feeds when supposedly multitasking making it next to useless...
  • artk2219 - Tuesday, November 27, 2018 - link

    Its a moot point in that you're locked into a platform, it means it should only be compared to the other chips within that platform. It doesn't matter how well a more standard ARM core performs next to Apples because you cant use that Apple core in anything that isnt an Apple device and vice versa. So no matter how much you oohh and ahh over which is faster, there will never be a point where its really an issue because any cross platform software will target the lowest common denominator, which will be slower than the fastest chip of either platform. So all this hemming and hawing about "wow no one has caught up with apple" is pointless, because theres never a point where it will be an issue.
  • gijames1225 - Tuesday, November 20, 2018 - link

    ARM has different economic concerns than Apple. They are building compact, reusable cores that can be slapped down in inexpensive (comparatively) SoCs by a multitude of clients. There's no reason to think they couldn't make a CPU that performed similar to an A12, but their business model doesn't lead that direction. They aim for a balance of cost/size, efficiency, and performance when Apple, designing only for premium products, doesn't care if they have a big chip size and comparatively high cost to fab. The A76 looks very good from ARM's perspective seeing as it'll be used by a variety of vendors delivering SoCs using these cores in products that can cost 2/3 of an iPhone's asking price.
  • xype - Wednesday, November 21, 2018 - link

    If only companies like Samsung or Qualcomm were big enough to afford to create their own, decent ARM implementations. Alas.
  • ZolaIII - Tuesday, November 20, 2018 - link

    It catches up with current generation of Apples ARM core's when it comes to the performance/W metric and still being significantly higher clocked. This also means that sustainable performance is the same. The Huawei SoC uses EAV+ scheduler & QC timed window approach actually gives a healthy performance difference in fast switching workloads (15~20%). All together A76 is a better design for mobile.
