Conclusion & Thoughts

The Cortex A76 presents itself as a solid generational improvement for Arm. We’ve been waiting on a larger CPU microarchitecture for several years now, and while the A76 isn’t quite a performance monster to compete with Apple’s cores, it shows how important it is to have a balanced microarchitecture. This year all eyes were on Samsung and the M3 core, and unfortunately its performance increase came at a great cost in power and efficiency, which ended up making the end product rather uncompetitive. The A76 drives performance up, but at every step of the way it remains deeply focused on power efficiency, which means we’ll get to see the best of both worlds in end products.

In general Arm promises a 35% performance improvement, which is a significant generational uplift. The fact that the A76 is targeted at 7nm designs gives the projected products a further boost.

I have some reservations about the performance targets and whether vendors will indeed release SoCs with quad-core clock rates of up to 3GHz; based on what I’ve heard from vendors, that seems like a very optimistic target. Even at a reduced clock frequency the core still brings significant benefits, and it’s especially on the efficiency side that Arm should be lauded for continuing to place such great focus.

Whether my projections are correct or not is something we’ll have to see in actual products, but the fact is that we *will* see significant efficiency benefits in the next generation of SoCs, which should bring both a notable performance improvement and a battery life improvement to the user. Arm’s focus here on the user experience seems to be exemplary, and I hope vendors will be able to implement the core based on Arm’s guidance and reach the targeted metrics.

The Cortex A76 is said to have already come back in working silicon at two partners and we’ll very likely see it shipping in commercial products by the end of the year. I won’t be beating around the bush here as Huawei and HiSilicon’s product cycle schedule makes it obvious that they’re likely one of the launch partners for the product. Qualcomm has also doubled down on using Arm cores in the mobile space so we should also be seeing the next generation Snapdragon SoCs employ the A76. Among the big players, it’s Samsung LSI which is going to have a tough time – the A76 doesn’t seem to greatly outperform the M3, so at least in theory, the M4’s focus will need to be solely on power efficiency. Then again Arm is very open about their design goals; half the area and half the power at similar performance is something that’s going to be hard to compete against.

The Cortex A76 is said to be the baseline microarchitecture on which Arm will iterate over at least the next 2 generations. Arm has been able to execute its yearly beat roadmap on time for 5 generations now, and with a 20-25% CAGR it’s going to be a very interesting next couple of years as the mobile space is very quickly approaching the performance of desktop CPUs.
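
To put the compounding in perspective, here is a minimal sketch of what a 20-25% yearly rate adds up to over a few generations. It assumes the CAGR figure refers to performance and that each generation lands exactly one year apart; the function name `compound_uplift` is purely illustrative and not something from Arm.

```python
def compound_uplift(annual_rate: float, generations: int) -> float:
    """Cumulative multiplier after `generations` yearly steps at `annual_rate`.

    Assumption: the quoted CAGR applies uniformly to every generation.
    """
    return (1.0 + annual_rate) ** generations


if __name__ == "__main__":
    # Low and high end of the quoted 20-25% range.
    for rate in (0.20, 0.25):
        for gens in (1, 2, 3):
            uplift = compound_uplift(rate, gens)
            print(f"{rate:.0%} per year over {gens} generation(s): {uplift:.2f}x")
```

Under those assumptions, two further generations on top of the A76 would land at roughly 1.44x to 1.56x the A76's own performance.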


123 Comments


  • jospoortvliet - Wednesday, June 6, 2018

    Twice as fast at half power should not be hard. Of course the process has changed since those chips were baked, it isn't all in architecture.
  • tipoo - Tuesday, September 4, 2018

    Yeah, on 7nm they should easily be able to make portable mode do what docked mode did, and add a new higher performance docked mode. Easy transition.
  • name99 - Friday, December 18, 2020

    "The branch prediction unit is what Arm calls a first in the industry in adopting a hybrid indirect predictor. "

    This is somewhat misleading. The fetch unit is very interesting (and Andrei did not spend enough time praising it) but to say that it is first in the industry seems unreasonable.
    The idea of decoupling the stream of fetch addresses from actual I-cache access dates from a thesis in 2001. Implementations I know about include Zen and Exynos M1 (2016) and IBM z14 (2017). Apple probably got in there even earlier.

    So there may be some very specific detail in how ARM is implementing this that is a first, but the overall idea has been around for 17 years.

    (The reason why it's taken so long to be implemented is that, first, it needs lots of transistors to store all the predictor state and, second, it requires some rethinking of how your branch predictors are indexed and updated. Think about it. What you want is machinery that, EVERY CYCLE, when given a PC will spit out two addresses -- where the current run of straightline fetching must end, i.e. the next TAKEN branch target, and where the PC must be directed to when it hits the end of this basic block. And it has to do this "in isolation", without looking at the instructions that are going to be loaded from the I$, because the whole point is that this is happening decoupled from, and in advance of, access to the I$. It's not trivial to think of a set of data structures that can do that. I'm still not at all convinced that my understanding of exactly how this works is correct, even though I've been trying to understand it for some time now.)
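
The machinery described in the comment above can be sketched as a toy model: a table indexed by the start PC of a fetch block that returns two addresses (where the straight-line run ends and where fetch resumes), feeding a queue that the I-cache drains later. This is only an illustration of the general decoupled-fetch idea, not Arm's or any vendor's design. Every name here (`DecoupledFetchPredictor`, `BlockPrediction`, `run_ahead`), the 4-byte instruction size, and the 16-byte fallback window are assumptions, and "run end" is read as the address of the taken branch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional, Tuple

INSTR_BYTES = 4         # assumption: fixed-width 4-byte instructions
DEFAULT_RUN_BYTES = 16  # assumption: fallback fetch window on a table miss


@dataclass(frozen=True)
class BlockPrediction:
    run_end: int  # address of the taken branch that ends the straight-line run
    next_pc: int  # predicted address where fetching resumes


class DecoupledFetchPredictor:
    """Toy predictor that runs ahead of the I-cache using only the PC."""

    def __init__(self, queue_depth: int = 8) -> None:
        self.table: Dict[int, BlockPrediction] = {}
        self.fetch_queue: Deque[Tuple[int, int]] = deque()
        self.queue_depth = queue_depth

    def train(self, block_start: int, branch_pc: int, target: int) -> None:
        # Record a resolved taken branch, keyed by the start of its block.
        self.table[block_start] = BlockPrediction(branch_pc, target)

    def predict(self, pc: int) -> BlockPrediction:
        # No instruction bytes are inspected here, only the PC.
        hit = self.table.get(pc)
        if hit is not None:
            return hit
        # Miss: assume sequential fetch for one default-sized window.
        end = pc + DEFAULT_RUN_BYTES - INSTR_BYTES
        return BlockPrediction(end, end + INSTR_BYTES)

    def run_ahead(self, pc: int, cycles: int) -> int:
        # One prediction per cycle; each yields a (start, end) fetch range.
        for _ in range(cycles):
            if len(self.fetch_queue) >= self.queue_depth:
                break
            pred = self.predict(pc)
            self.fetch_queue.append((pc, pred.run_end))
            pc = pred.next_pc
        return pc

    def next_fetch_range(self) -> Optional[Tuple[int, int]]:
        # The I-cache side pops ranges whenever it is ready for them.
        return self.fetch_queue.popleft() if self.fetch_queue else None


# Two blocks that branch to each other, forming a small loop.
bp = DecoupledFetchPredictor()
bp.train(0x1000, 0x1008, 0x2000)
bp.train(0x2000, 0x2004, 0x1000)
final_pc = bp.run_ahead(0x1000, cycles=3)
ranges = [bp.next_fetch_range() for _ in range(3)]
```

The point of the queue is that address generation and I-cache access proceed at their own pace. A real design also has to handle mispredictions, table capacity and conditional branches, none of which this sketch attempts.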
