The Cortex-A77 µarch: Added ALUs & Better Load/Stores

Having covered the front-end and middle-core, we move onto the back-end of the Cortex-A77 and investigate what kind of changes Arm has made to the execution units and data pipelines.

On the integer execution side of the core we’ve seen the addition of a second branch port, which goes along with the doubling of the branch-predictor bandwidth of the front-end.

We also see the addition of an extra integer ALU. This new unit sits half-way between a simple single-cycle ALU and the existing complex ALU pipeline: it naturally still handles single-cycle ALU operations, but is also able to support the more complex 2-cycle operations (some shift-combination instructions, logical instructions, move instructions, test/compare instructions). Arm says that the addition of this new pipeline brought a surprising amount of performance uplift: as the core gets wider, the back-end can become a bottleneck, and this was a case of the execution units needing to grow along with the rest of the core.
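As a rough intuition for why an extra ALU port matters in a wider core, here is a toy issue-bandwidth model. It is purely illustrative: the pipe counts are simplified and this is in no way a model of Arm's actual scheduler, which also contends with dependencies, queue capacities, and port restrictions.

```python
# Toy model: independent single-cycle ALU micro-ops drain at a rate
# limited only by the number of ALU issue ports. Illustrative only.

def cycles_to_drain(num_uops: int, alu_ports: int) -> int:
    """Cycles to issue `num_uops` independent 1-cycle ALU ops."""
    return -(-num_uops // alu_ports)  # ceiling division

# Three ALU pipes vs. a core with a fourth mid-complexity ALU added:
before = cycles_to_drain(1200, 3)  # 400 cycles
after = cycles_to_drain(1200, 4)   # 300 cycles
```

Under this (very optimistic) assumption of fully independent µOps, the fourth pipe cuts the drain time by a quarter; real gains are smaller, which is why Arm calling the uplift "surprising" is notable.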

A larger change in the execution core was the unification of the issue queues. Arm explains that this was done in order to maintain efficiency of the core with the added execution ports.

Finally, the existing execution pipelines haven’t seen many changes. One latency improvement is the pipelining of the integer multiply unit in the complex ALU, which allows it to achieve 2-3 cycle multiplications as opposed to 4.
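To illustrate what a pipelined, lower-latency multiplier buys, here is a small cycle-count sketch. The 4-cycle and 3-cycle figures follow the article; the model itself is a simplification (it ignores issue restrictions and assumes full pipelining).

```python
# Sketch: effect of multiply latency on a dependent chain vs. an
# independent stream through a fully pipelined multiplier.

def chain_cycles(n: int, latency: int) -> int:
    """n dependent multiplies: each must wait for the previous result."""
    return n * latency

def stream_cycles(n: int, latency: int) -> int:
    """n independent multiplies, one issued per cycle, plus the
    latency of the final one to complete."""
    return (n - 1) + latency

old_chain = chain_cycles(100, 4)  # 400 cycles at 4-cycle latency
new_chain = chain_cycles(100, 3)  # 300 cycles at 3-cycle latency
```

Dependency chains see the latency reduction directly, while independent streams were already throughput-bound, which is why a latency cut like this mostly helps latency-sensitive code.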

Oddly enough, Arm didn’t make much mention of the floating-point / ASIMD pipelines for the Cortex-A77. Here it seems the A76’s “state-of-the-art” design was good enough for Arm to focus its efforts elsewhere on the core this generation.

On the load/store side we still find two units; however, Arm has added two additional dedicated store ports to the units, which in effect doubles the issue bandwidth. This means the L/S units are 4-wide, with 2 address-generation µOps and 2 store-data µOps per cycle.

The issue queues themselves again have been unified and Arm has increased the capacity by 25% in order to expose more memory-level parallelism.

Data prefetching is incredibly important in hiding the memory latency of a system: shaving off cycles by avoiding having to wait for data can be a big performance boost. I covered the Cortex-A76’s new prefetchers and contrasted them against other CPUs in the industry in our review of the Galaxy S10. What stood out for Arm is that the A76’s new prefetchers were outstandingly performant and were able to deal with some very complex patterns. In fact, the A76 did far better than any other tested microarchitecture, which is quite a feat.

For the A77, Arm has improved the prefetchers and added new additional prefetching engines to push this even further. Arm is quite tight-lipped about the details here, but we’re promised increased pattern coverage and better prefetching accuracy. One such change is claimed to be an “increased maximum distance”, meaning the prefetchers will recognize repeated access patterns over larger virtual memory distances.
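As a concrete (and heavily simplified) illustration of what a “maximum distance” means for a prefetcher, here is a toy stride detector. The class, its fields, and the `max_distance` knob are invented for illustration; they don't reflect Arm's implementation, which covers far more complex patterns than constant strides.

```python
# Toy stride prefetcher: learns a constant stride from successive
# addresses and prefetches ahead. `max_distance` caps how far apart
# two accesses may be and still be treated as one pattern.

class StridePrefetcher:
    def __init__(self, degree: int = 2, max_distance: int = 4096):
        self.last_addr = None
        self.stride = None
        self.degree = degree            # how many steps to prefetch ahead
        self.max_distance = max_distance

    def access(self, addr: int) -> list[int]:
        """Record one demand access; return addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if 0 < abs(delta) <= self.max_distance:
                if delta == self.stride:
                    # Confirmed pattern: issue prefetches ahead of it.
                    prefetches = [addr + self.stride * i
                                  for i in range(1, self.degree + 1)]
                self.stride = delta
            else:
                self.stride = None      # too far apart: pattern broken
        self.last_addr = addr
        return prefetches
```

In this sketch, a larger `max_distance` lets the detector keep tracking a pattern whose accesses are spread further apart in the virtual address space, which is the kind of coverage increase Arm is describing.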

One new functional addition in the A77 is so-called “system-aware prefetching”. Here Arm is trying to solve the issue of having to use a single IP in lots of different systems; some systems might have better or worse memory characteristics, such as latency, than others. To deal with this variance between memory subsystems, the new prefetchers will change their behaviour and aggressiveness based on how the current system is behaving.

One thought of mine is that this could bring some interesting performance improvements under certain DVFS conditions, where the prefetchers would alter their behaviour based on the current memory frequency.

Another aspect of this new system-awareness is better knowledge of the cache pressure on the DSU’s L3 cache. If other CPU cores are highly active, the core’s prefetchers will see this and scale down their aggressiveness in order to avoid needlessly thrashing the shared cache, increasing overall system performance.
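A feedback loop of this sort could be sketched as follows. The inputs (prefetch accuracy, L3 occupancy by other cores), the thresholds, and the policy are entirely invented for illustration; Arm has not disclosed how its throttling actually works.

```python
# Sketch of feedback-directed prefetch throttling in the spirit of
# "system-aware prefetching": scale the prefetch degree down when
# prefetch accuracy is low or the shared L3 is under pressure.
# All thresholds below are hypothetical.

def choose_prefetch_degree(accuracy: float,
                           l3_occupancy_by_others: float,
                           max_degree: int = 4) -> int:
    """Pick how far ahead to prefetch, given recent feedback."""
    degree = max_degree
    if accuracy < 0.5:                  # most prefetches were useless
        degree //= 2
    if l3_occupancy_by_others > 0.75:   # neighbouring cores need the cache
        degree //= 2
    return max(degree, 1)
```

The design intent is the one described above: a prefetcher that backs off when it is either guessing badly or competing with busy sibling cores, rather than thrashing the shared cache at a fixed aggressiveness.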

108 Comments

  • Valis - Thursday, May 30, 2019 - link

    Yeah, it's because he is a white male, probably hetero also. :P
  • raptormissle - Monday, May 27, 2019 - link

    So it looks like the SD 865 is finally going to break the 4000 Geekbench single-core score and probably score 12000+ in multi-core. ARM has finally leveled the playing field with Apple, as I really don't expect major gains from the A13; they've likely blown their wad.
  • GC2:CS - Monday, May 27, 2019 - link

    There were big gains from going from 16/14 nm in 2016 to 7 nm in 2018.
    While performance gains were impressive in that time, no shrink like that is coming any time soon.

    76 was a big performance jump, but that was after many regular releases where ARM had overestimated or even regressed on their performance metrics in reality (Or so I think... Am I right?).
    And Apple does not make promises, but they are expected (based on like 5 (how many?!?) generations of big performance jumps) to deliver something crazy, which they could fail to do.

    Honestly I would happily take the same performance with lower power. And cut/cap the peak power as well... not planning to buy a fan-equipped phone.
  • Wilco1 - Monday, May 27, 2019 - link

    There is a 7nm+ shrink coming this year and then another huge one with 5nm next year. So big shrinks are continuing, at least at TSMC.

    And performance has increased hugely with each generation since Cortex-A57. The smallest gain was with Cortex-A73, but that still improved sustained performance and efficiency considerably.

    Many phones support battery saving modes which limit the frequency of the big cores (or switch them off). So you can already get what you want if battery life is your goal. I find these modes very useful but you clearly notice the performance loss while browsing.
  • Santoval - Monday, May 27, 2019 - link

    "or even regressed on their performance metrics in reality (Or so I think... Am I right?)."
    Indeed, the A73 core was usually slower than the A72 core, or at best just as fast, even though it was supposed to be its successor.
  • blu42 - Tuesday, May 28, 2019 - link

    The CA73 was not exactly 'usually slower' than the CA72 -- the former was usually slower at ASIMD workloads, but it did better on spaghetti code (even with the narrower front-end), of which there's more in this world. We now have CA57s, CA72s and CA73s all in affordable SBCs (finally!), so people can check for themselves.

    BTW, the CA73 was a successor, but it was not meant to be a performance improvement, rather an efficiency improvement, and there it delivered, IMO
  • ZolaIII - Tuesday, May 28, 2019 - link

    The A73 is a two-instruction-wide design vs. the A72's three-instruction-wide OoO design. While performance overall was the same, the gain in performance per W was 30~33% and per mm² 27~28%.
  • name99 - Tuesday, May 28, 2019 - link

    The A10 was based on 16nm, just like the A9. But came with a substantial improvement...

    You guys are way too ignorant of the role of micro architecture in performance. Apple’s speed boosts so far (since A7) are pretty much exactly split 50% micro-architecture and 50% frequency (so process).

    The A13 based on 7nm is still capable of large improvements; firstly from micro-arch improvements, second from having a chance to optimize what’s already there (like, as I said, the A10 as second pass through 16nm).
  • galdutro - Monday, May 27, 2019 - link

    Given their track record, I wouldn't be surprised if the A13 had the same performance as the A12X. YES, this is crazy! But it is what they actually have been able to achieve in the past couple of generations.
