Conclusion & Thoughts

The Cortex A76 presents itself a solid generational improvement for Arm. We’ve been waiting on a larger CPU microarchitecture for several years now, and while the A76 isn’t quite a performance monster to compete with Apple’s cores, it shows how important it is to have a balanced microarchitecture. This year all eyes were on Samsung and the M3 core, and unfortunately the performance increase came at a great cost of power and efficiency which ended up making the end-product rather uncompetitive. The A76 drives performance up but on every step of the way it still deeply focused on power efficiency which means we’ll get to see the best of both worlds in end products.

In general Arm promises a 35% performance improvement which is a significant generational uplift. Together with the fact that the A76 is targeted to be employed in 7nm designs is also a boost to the projected product.

I’m having some reservations in terms of the performance targets and if vendors will indeed release the SoC with quad-core clock rates of up to 3GHz – based on what I’ve heard from vendors that seems like a rather very optimistic target. Even then, a reduced clock frequency still brings significant benefits, and it’s especially on the efficiency side where Arm should be lauded for continuing to place great focus on.

Whether my projections are correct or not is something we’ll have to see in actual products, but fact is that we *will* see significant efficiency benefits in the next generation of SoCs which should bring both an notable performance improvement as well as battery life improvement to the user. Arm’s focus here on the user experience seems to be exemplary and I hope vendors will be able to implement the core based on Arm’s guidance and reach the targeted metrics.

The Cortex A76 is said to have already come back in working silicon at two partners and we’ll very likely see it shipping in commercial products by the end of the year. I won’t be beating around the bush here as Huawei and HiSilicon’s product cycle schedule makes it obvious that they’re likely one of the launch partners for the product. Qualcomm has also doubled down on using Arm cores in the mobile space so we should also be seeing the next generation Snapdragon SoCs employ the A76. Among the big players, it’s Samsung LSI which is going to have a tough time – the A76 doesn’t seem to greatly outperform the M3, so at least in theory, the M4’s focus will need to be solely on power efficiency. Then again Arm is very open about their design goals; half the area and half the power at similar performance is something that’s going to be hard to compete against.

The Cortex A76 is said to be the baseline microarchitecture on which Arm will iterate over the next 2 generations at least. Arm has been able to execute their yearly beat roadmap on time for 5 generations now and with yearly 20-25% CAGR it’s going to be a very interesting next couple of years as the mobile space is very quickly approaching the performance of desktop CPUs.

Cortex A76 - Performance & Power Projections
Comments Locked

123 Comments

View All Comments

  • techconc - Tuesday, June 5, 2018 - link

    " A11 is no match in speed and performance versus SD845 and Exynos powered Android phones today"

    Huh? Benchmarks do not support your claim.
  • name99 - Friday, June 1, 2018 - link

    This assumption makes two mistakes.

    The first is to assume that ONE metric (in this case 4-wide front end) is the PRIMARY determinant of performance. Even Apple's (A11) IPC (over a wide range of code) is about maybe 2.7. This means on average less than 3 of those 6 execution units are being used per cycle. IF other parts of the core uncache could be PERFECTED on a 4 wide design so that EVERY cycle 4 instructions executed, it would clearly surpass the A11 in IPC.
    The problem, of course, is just how hard it is to prevent cycles where NOTHING executes and cycles where only a few (one or two) instructions execute. Reducing these are where most of the magic is --- and you won't see details of that it in an article like this; rather it's in that painstaking rooting out hundreds of small inefficiencies that the article talked about.

    To give just one example -- no-one is talking about the clustered page tables. This is a very cool idea which relies on the fact that most of the time the OS page allocator allocates a number of pages contiguously in virtual AND physical space, and with the same permissions. If that is so, the same page entry can correspond to multiple contiguous pages (in the academic literature, usually up to 8). This gives you a substantial increase in TLB reach at a very minor increase in TLB bits.
    (I can find no info as to whether Intel does this. I SUSPECT Apple used to do this in their earlier [and probably even A11] cores. There are very recent even better ideas in the academic literature that might, perhaps, have made it to the A12 core.)

    Second mistake you make is to ignore frequency. A9 ran at 1.85 GHz, A10 at 2.35 GHz. The A76 will likely run at 3GHz.
  • tipoo - Tuesday, September 4, 2018 - link

    And yet, here we are with single core results at around 60% that of the A11. Taking their own numbers at face value, a 56% increase over the A73 in GB4 results in 2800.

    Yes, I used a very simplistic one dimensional comparison, and there's a whole lot more to it. However, core complexity does go up almost exponentially with width, and so it does point to what ballpark they were aiming at. A76 was never going to beat the A11 per core because it was never aimed at it.
  • colinisation - Thursday, May 31, 2018 - link

    Hi Andrei,

    Is this core the one referred to as Ares on roadmaps?

    Been waiting years for this one if it is.
  • Andrei Frumusanu - Thursday, May 31, 2018 - link

    Yes in practical terms - no in actual terms. You'll likely hear more about this in the future.
  • tuxRoller - Friday, June 1, 2018 - link

    Ok, now you've incepted the idea that ARM is going to announce a dedicated server-class chip (maybe even a tease of SVE.....)
  • name99 - Friday, June 1, 2018 - link

    I agree with your point (ARM will release a server chip, essentially based on this core).
    Remember that GB4 is scaled to 4000 represents an i7-6600U (Skylake, 3.4GHz, 4MiB L3, 15W).
    So A76 is essentially at that level (slightly worse FP, but many server tasks will not care).
    To the extent that that Skylake at 3.4GHz is an acceptable Server class core, ARM could dump some large number of A76 on a die and be in the same sort of space as dearly-departed Centriq and ThunderX2.

    They likely would have to beef up their NoC one way or another, and tweak the caching and memory systems, the usual server additions.
    But I assume they didn't put all that effort into "lowest possible latency for hypervisor activity" on the theory that hypervisor performance on smartphones is THE next big thing...
  • joe_85 - Thursday, May 31, 2018 - link

    I am sure people will disagree with me because people love to argue. Andrei, I like your writing and overall thoroughness but a few critiques here. The charts you make are extremely unpleasant to look at and do not lend themselves to a quick assessment of the data.

    First of all the color coded stripes in the legend for the A76 projections is not even decipherable, and the actual bars are on the chart are difficult to see. Secondly, why are you color coding them at all? Just put processor names to the left of the bars and the benchmark name above the bars.

    Additionally are the bars in any particular order? If they are I certainly can't tell, they should be relative to the performance OR the efficiency.

    Other constructive criticism would be that adding some additional subheadings within your articles would make it feel like a more solid piece.

    Keep up the good work.
  • jospoortvliet - Wednesday, June 6, 2018 - link

    Loving to argue or not, the performance vs efficiency graphs are rather unique and very clever, I have not seen any design that so clearly shows how different SOC's compare at both. Yes it takes a few mins before you can read them I am sorry that the world is so complicated. But they work very well if you just use that gray matter a bit.
  • syxbit - Thursday, May 31, 2018 - link

    As an Android user, I continue to be disappointed with QCOMM, Arm (and Nvidia for dropping out) at how far ahead Apple in single threaded perf.

Log in

Don't have an account? Sign up now