Cortex A76 µarch - Frontend

Starting off with a rough overview of the Cortex A76 microarchitectural diagram, we see the larger functional blocks. The A76 doesn't look too different from other Arm processors in this regard, and the differences come only in the details that Arm is willing to divulge. To oversimplify: this is a superscalar out-of-order core with a 4-wide decode front-end and 8 execution ports in the backend, with a total implementation pipeline depth of 13 stages and the execution latencies of an 11-stage core.

In the front-end, Arm has created a new predict/fetch unit that it calls a “predict-directed fetch”, meaning the branch prediction unit feeds into the instruction fetch unit. This is a divergence from past Arm µarches and it allows for both higher performance and lower power consumption.

The branch prediction unit is what Arm calls an industry first in adopting a hybrid indirect predictor. The predictor is decoupled from the fetch unit, and its supporting large structures operate separately from the rest of the machine – likely meaning they can more easily be clock-gated during operation to save power. The branch predictor is supported by 3-level branch target caches: a 16-entry nanoBTB, a 64-entry microBTB, and a 6000-entry main BTB. Back in the A73 and A75 generations, Arm claimed its branch predictors were able to predict nearly all taken branches, so this new unit in the A76 seems to be one level above that in capability.
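The three-level BTB hierarchy described above can be sketched as a simple lookup model. This is a toy illustration only: the entry counts (16/64/6000) come from the article, but the smallest-first lookup order, the FIFO eviction, and all class/function names are assumptions for the sake of the sketch, not Arm's actual design.

```python
# Toy model of a three-level branch target buffer (BTB) hierarchy.
# Entry counts are from the article; lookup policy and eviction are assumed.

class BTBLevel:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # number of branch entries this level holds
        self.table = {}            # branch PC -> predicted target

    def lookup(self, pc):
        return self.table.get(pc)

    def insert(self, pc, target):
        if len(self.table) >= self.capacity:
            # Naive FIFO eviction: drop the oldest entry (dicts keep order).
            self.table.pop(next(iter(self.table)))
        self.table[pc] = target

class HierarchicalBTB:
    def __init__(self):
        self.levels = [
            BTBLevel("nanoBTB", 16),
            BTBLevel("microBTB", 64),
            BTBLevel("mainBTB", 6000),
        ]

    def predict(self, pc):
        # Probe from the smallest (fastest) structure outward.
        for level in self.levels:
            target = level.lookup(pc)
            if target is not None:
                return target, level.name
        return None, "miss"

btb = HierarchicalBTB()
btb.levels[2].insert(0x1000, 0x2000)   # train only the main BTB
print(btb.predict(0x1000))             # found in mainBTB
print(btb.predict(0x9999))             # not trained -> miss
```

The point of the tiered arrangement is that the common case hits in a tiny, fast structure, while the large main BTB backs it up for cold branches.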

The branch unit operates at double the bandwidth of the fetch unit – it operates on 32B/cycle, meaning up to 8 32b instructions per cycle. This feeds a fetch queue in front of the instruction fetch unit consisting of 12 "blocks". The fetch unit operates at 16B/cycle, meaning 4 32b instructions. The branch unit operating at double the throughput makes it possible for it to get ahead of the fetch unit. What this means is that in the case of a mispredict, it can hide branch bubbles in the pipeline and avoid stalling the fetch unit and the rest of the core. The core is said to be able to cope with up to 8 misses on the I-side.
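The bandwidth figures above work out as follows – a quick sanity check of the article's numbers, with the "run-ahead" rate being a simple derived quantity, not an Arm-stated figure:

```python
# Checking the article's front-end bandwidth figures.
INSTR_BYTES = 4                                # one 32-bit Arm instruction

branch_unit_bw = 32                            # branch unit: bytes/cycle
fetch_unit_bw = 16                             # fetch unit: bytes/cycle

branch_instrs = branch_unit_bw // INSTR_BYTES  # instructions/cycle from branch unit
fetch_instrs = fetch_unit_bw // INSTR_BYTES    # instructions/cycle into fetch

# The branch unit can run ahead of fetch at the difference of the two rates,
# building up slack in the 12-block fetch queue to hide branch bubbles.
runahead = branch_instrs - fetch_instrs
print(branch_instrs, fetch_instrs, runahead)   # 8 4 4
```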

I mentioned at the beginning that the A76 is a 13-stage implementation with the latency of an 11-stage core. What happens is that in latency-critical paths, the stages can be overlapped. One such overlap happens between the second cycle of the branch predict path and the first cycle of the fetch path. So effectively, while there are 4 (2+2) pipeline stages across branch and fetch, the core has latencies down to 3 cycles.
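The stage-overlap arithmetic is simple enough to write out. The 13-stage and 11-stage figures are from the article; the assumption that exactly two overlaps account for the whole difference is an inference from those numbers:

```python
# Stage-overlap arithmetic: 13 implemented stages behave like an 11-stage
# design when latency-critical stages are overlapped.
implemented_stages = 13
overlapped_stages = 2                # inferred: 13 - 11
effective_stages = implemented_stages - overlapped_stages
print(effective_stages)              # 11

# The branch/fetch path specifically: 2 predict + 2 fetch stages, with the
# second predict cycle overlapping the first fetch cycle.
predict_stages, fetch_stages, overlap = 2, 2, 1
print(predict_stages + fetch_stages - overlap)   # 3-cycle best-case latency
```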

On the decode and rename stages we see a throughput of 4 instructions per cycle. The A73 and A75 were respectively 2- and 3-wide in their decode stages, so the A76 is 33% wider than the last generation in this aspect. It was curious to see the A73 go down in decode width from the 3-wide A72, but this was done to optimise for power efficiency and "leanness" of the pipeline, with the goal of improving the utilisation of the front-end units. With the A76 going 4-wide, this is also Arm's widest microarchitecture to date – although it's still extremely lean when put in juxtaposition with competing µarches from Samsung or Apple.

The fetch unit feeds a decode queue of up to 16 32b instructions. The pipeline stages here consist of 2 cycles of instruction align and decode. It looks like Arm decided to go back to a 2-cycle decode here, as opposed to the 1-cycle unit found on the A73 and A75. As a reminder, the Sophia cores still required a secondary cycle on the decode stage when handling instructions utilising the ASIMD/FP pipelines, so Arm may have found other optimisation methods with the A76 µarch that warranted this design decision.

The decode stage takes in 4 instructions per cycle and outputs macro-ops at an average ratio of 1.06 macro-ops per instruction. Entering the register rename stage, we see heavy power optimisation, as the rename units are separated and clock-gated for integer/ASIMD/flag operations. Rename and dispatch are a 1-cycle stage, a reduction from the 2-cycle rename/dispatch of the A73 and A75. Macro-ops are expanded into micro-ops at a ratio of 1.2 µops per instruction, and we see up to 8 µops dispatched per cycle, an increase from the advertised 6 µops/cycle on the A75 and 4 µops/cycle on the A73.
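Putting the expansion ratios together shows how the average µop supply compares with the dispatch limit. All ratios are from the article; treating them as steady-state averages (rather than per-instruction values) is the simplifying assumption here:

```python
# Instruction -> macro-op -> micro-op expansion, using the article's ratios.
decode_width = 4        # instructions per cycle into decode
mop_per_instr = 1.06    # macro-ops per instruction (average)
uop_per_instr = 1.2     # micro-ops per instruction (average)
dispatch_width = 8      # max µops dispatched per cycle

mops_per_cycle = decode_width * mop_per_instr
uops_per_cycle = decode_width * uop_per_instr

print(round(mops_per_cycle, 2))   # 4.24 macro-ops/cycle on average
print(round(uops_per_cycle, 2))   # 4.8 µops/cycle on average
# The average µop supply (~4.8/cycle) sits well under the 8/cycle dispatch
# limit, leaving headroom for bursts of multi-µop instructions.
```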

The out-of-order window commit size of the A76 is 128 entries, and the buffer is separated into two structures responsible for instruction management and register reclaim, called a hybrid commit system. Arm made it clear that it wasn't focusing on increasing this aspect of the design, as it found it to be a poor return on investment when it comes to performance. The performance scaling is said to be 1/7th – meaning a 7% increase in reorder buffer size only results in a 1% increase in performance. This stands in stark contrast to, for example, Samsung's M3 cores with their very large 224-entry ROB.
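The 1/7th scaling claim can be extrapolated to show why Arm considered larger ROBs a poor investment. This is a toy extrapolation that assumes the ratio holds constant across successive increases, which it would not in practice (returns diminish further as the ROB grows):

```python
# Toy extrapolation of the article's claim: +7% reorder buffer size
# buys only +1% performance.
rob_entries = 128
performance = 1.0

for _ in range(3):              # three successive 7% ROB increases
    rob_entries *= 1.07
    performance *= 1.01

print(round(rob_entries))       # ~157 entries...
print(round(performance, 3))    # ...for only ~3% more performance
```

By this logic, even growing toward an M3-sized 224-entry ROB would buy only a single-digit percentage of performance, at a real cost in area and power.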

As a last note on the front-end, Arm says it tried to optimise the front-end for the lowest possible latency for hypervisor activity and system calls, but didn't go into more detail.

Comments

  • name99 - Friday, June 1, 2018 - link

    FFS. the issue is NOT "Older batteries might not be able to supply enough power for a big core", it is that the battery cannot supply enough CURRENT.

    If you can't be bothered to understand the underlying engineering issue and why the difference between current and power matters, then your opinions on this issue are worthless.
  • serendip - Friday, June 1, 2018 - link

    Whoa, chill there buddy, I'm not an electrical engineer.
  • name99 - Friday, June 1, 2018 - link

    "Does anyone actually use the full performance of the A11 or A12 in daily tasks? "
    Absolutely. I've updated iPhones every two years, and every update brings a substantial boost in "fluidity" and just general not having to wait. I can definitely feel the difference between my iPhone 7 and my friend's iPhone X; and I expect I will likewise feel the difference when I get my iPhone 2018 edition (whatever they are naming them this year...)

    Now if you want to be a tool, you can argue "that's because Apple's software sux. Bloat, useless animations, last good version of iOS was version 4, blah blah". Whatever.
    MOST people find more functionality distributed throughout the dozens of little changes of each new version of the OS, and MOST people find the "texture" of the OS (colors, animations, etc) more pleasant than having some sort of text only Apple II UI, though doubtless that could run at a 10,000 fps.

    So point is, yeah, you DO notice the difference on phones. Likewise on iPads. I use my iPad to read technical PDFs, and again, each two year update provides a REALLY obvious jump in how quickly complicated PDF pages render. With my very first iPad 1 there was a noticeable wait almost every page (only hidden, usually, because of page caching). By the A10X iPad Pro it's rare to encounter a PDF page that ever makes you wait, cached or not.

    I've also talked in the past about Wolfram Player, a subset of Mathematica for iPad. This allows you to interact with Mathematica "animations" (actually they're 3D interactive objects you construct that change what is displayed depending on how you move sliders or otherwise tweak parameters). These are calculating what's to be displayed (which might be something like numerically solving a partial differential equation, then displaying the result as a 3D object) in realtime as you move a slider.
    Now this is (for now) pretty specialized stuff. But Wolfram's goal, as they fix the various bugs in the app and implement the bits of Mathematica that don't yet work well (or at all), is for these things to be the equivalent of video today. We used to put up with explanations (in books, or newspapers) that were just words. Then we got BW diagrams. Then we got color diagrams. Then we got video. Now we have web sites like NYT and Vox putting up dynamic explainers where you can move sliders --- BUT they are limited to the (slow) performance of browsers, and are a pain to construct (both the UI, and the underlying mathematical simulation). Something like Mathematica's animations are vastly more powerful, and vastly easier to create. One day these will be as ubiquitous as video is today, just one more datatype that gets passed around. But for them to work well requires a CPU that can numerically solve PDEs in real time on your mobile device...
  • techconc - Tuesday, June 5, 2018 - link

    The benefits of having a fast single core are seen on most common operations, including UI and scrolling, etc. Moreover, Apple has demonstrated that a powerful core can in fact be more efficient in race to sleep conditions whereby it completes the work more quickly then sleeps. The overall effect is a more responsive system that is just as efficient overall.
  • tipoo - Tuesday, September 4, 2018 - link

    Sure, every time I render a webpage.
  • ZolaIII - Friday, June 1, 2018 - link

    Well, let's put it this way: the A73, which is two instructions wide, had no problems on 14nm FinFET. The A76 is 4 instructions wide and, for the sake of argument, let's say 2x the size. Switching from 14nm to 7nm (60% reduction in power) covers it; the A76 is approximately 65% faster than the A73 MHz for MHz, so it's able to deliver approximately 1.8x the performance at the same TDP. The second part is the manufacturing process in relation to core size. FinFET transistors leak like hell once the 2.1~2.2 GHz limit is reached, regardless of OEM or vendor/foundry. So if you employ 50% wider cores (6 instructions wide) that won't cross the 2.1~2.2 GHz limit, it's not the same as pushing the 4-wide one to 3 GHz: power consumption will be doubled compared to the same core operating at 2.1~2.2 GHz, and in the end you lose on both theoretical throughput (performance) and power consumption, even though it costs you 33% less. In reality it's much harder to feed a wider core optimally (especially on a mobile OS). ARM did great work optimising instruction latency and cache latency/throughput, which will both increase real instruction output per clock and help the predictor without a significant increase in needed resources (cost - size), and the A76 is the first of its kind (of any CPU) regarding the implemented solution for this. However, the thing ARM didn't deliver is a better primary workhorse that could make a difference in the base user experience. The A55 isn't exactly a powerhouse regarding performance, and now there is more power headroom when scaled down to 7nm – enough for, say, an A73 on slightly lower clocks to replace the A55 (the A73 has 1.6x the integer performance of the A55 MHz/MHz, so an A73 @ 1.7GHz = an A55 @ 2.7 GHz, while switching from 14 to 7nm would make the TDP of the A73 the same as the A55's). But the A73 doesn't work in a DynamIQ cluster.
    So there is a need for a new two-instruction-wide OoO core with the merged architectural advancements (front end, predictor, cache, ASIMD...), as the in-order ones hit a brick wall a long time ago.
  • vladx - Friday, June 1, 2018 - link

    > So switching from 14 nm to 7nm (60% reduction on power)

    That might've been true if both 14nm and 7nm fab processes were actually the real deal. But alas, they are not.
  • ZolaIII - Saturday, June 2, 2018 - link

    Based on TSMC's projections: a 60% power reduction.
  • beginner99 - Monday, June 4, 2018 - link

    The thing is that CPU power use might already be a small part of phone power use. The display is usually the main consumer, and when the display isn't running, most likely the big core will also not be running. Saving 40% power sounds great on paper, but in real designs it will already be smaller, and the total impact on phone battery life will be much, much smaller – single-digit percentage probably, depending on usage. The more the core is idle, the less its efficiency matters.
  • Dazedconfused - Thursday, May 31, 2018 - link

    I get this, but when comparing an iPhone X and, say, an Android flagship next to each other in pretty much every day-to-day task, they appear evenly matched. There are some good comparisons on YouTube. There are definitely strengths to each platform, but it's not clear cut at all.
