Performance & Power Projections

Now that we’ve gained more insight into the A76’s microarchitecture, it’s worth remembering that there’s always a disconnect between theoretical performance based on the underlying µarch and how it ends up performing in practice. We’re first going to look at ISO-process and ISO-frequency comparisons, meaning the generational performance improvements between the cores with otherwise identical factors such as memory subsystems.

In terms of general IPC, Arm promises a ~25% increase in integer workloads and a ~35% increase in ASIMD/floating point workloads. Together with up to 90% higher memory bandwidth compared to the A75, the A76 is then meant to provide around a 28% increase in GeekBench 4 and 35% more JavaScript performance (Octane, JetStream). In AI inferencing workloads, the doubled 128-bit ASIMD capabilities of the A76 serve to quadruple general matrix multiply performance in half-precision formats.
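As a rough sanity check on that 4x GEMM claim, a simple lane-counting model works: doubling the ASIMD datapath combined with packing twice as many FP16 lanes per vector as FP32 compounds to a fourfold throughput increase. The pipe counts and vector widths below are illustrative assumptions for this model, not Arm-published pipeline specifications.

```python
# Hypothetical throughput model: quadrupled FP16 GEMM rate from doubling
# the ASIMD datapath while FP16 also packs 2x the lanes of FP32.
# All figures are illustrative assumptions, not published pipeline specs.

def fma_lanes_per_cycle(pipes: int, vector_bits: int, element_bits: int) -> int:
    """Multiply-accumulate lanes retired per cycle for one core."""
    return pipes * (vector_bits // element_bits)

a75_fp32 = fma_lanes_per_cycle(pipes=2, vector_bits=64,  element_bits=32)  # 4 lanes
a76_fp16 = fma_lanes_per_cycle(pipes=2, vector_bits=128, element_bits=16)  # 16 lanes

print(a76_fp16 / a75_fp32)  # 4.0 under these assumptions
```

The point of the sketch is only that a 4x figure falls out naturally from two independent 2x factors; the real uplift additionally depends on memory bandwidth feeding the GEMM kernel.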

These performance figures are respectable but not quite earth-shattering considering the tone of the µarch improvements. However, it’s worth noting that we expect the A76 to first be deployed in flagship SoCs on TSMC’s 7nm process, which allows for increased clocks.

Here Arm’s projection is that we’ll be seeing the A76 clocked at up to 3GHz on 7nm, which in turn will result in larger improvements. Quoted figures are 1.9x in the integer and 2.5x in the floating point subscores of GeekBench 4, while we should expect total score increases of 35%.

GeekBench 4 Single Core

What this means in terms of absolute numbers is projected in the above graph. Baselining on the performance of the Snapdragon 835 and Snapdragon 845, a future SoC with an A76, 512KB L2s, and a 2MB L3 would fall in around the GeekBench 4 performance of the Exynos 9810, depending on whether the target 3GHz is reached.
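A back-of-the-envelope version of this kind of projection simply scales a baseline score by the quoted IPC uplift and the frequency ratio. The baseline score and clocks below are placeholder values for illustration, not the measured figures from the chart.

```python
# Sketch of the projection: scale a baseline GB4 integer score by Arm's
# quoted ~25% integer IPC uplift and the target/base frequency ratio.
# Baseline score and clock figures are placeholders, not measured data.

def project_score(base_score: float, base_freq_ghz: float,
                  target_freq_ghz: float, ipc_uplift: float) -> float:
    return base_score * (1 + ipc_uplift) * (target_freq_ghz / base_freq_ghz)

a75_base = 2400.0  # hypothetical A75-class GB4 integer score at 2.8GHz
for target in (2.5, 3.0):
    print(target, round(project_score(a75_base, 2.8, target, 0.25)))
```

This linear model is optimistic at the high end: as discussed further down, memory latency and power limits mean real scores rarely scale perfectly with clock.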

In the past Arm has been overly optimistic when releasing frequency targets – for example, the A73 was first projected at up to 2.8GHz and the Cortex A75 at up to 3GHz. In the end they ended up no higher than 2.45GHz and 2.8GHz respectively.

I’ve talked to a vendor about this, and it seems Arm doesn’t take all corners into consideration when doing timing signoff. In particular, vendors have to account for process variations, which result in differently binned units, some of which might not reach the target frequencies. As mobile chips generally aren’t performance binned but rather power binned, vendors need to lower the target clock to get sufficient volume for commercialisation, which results in slightly reduced clocks compared to what Arm usually talks about.

For the first A76 implementations in mobile devices, I’m adamant that we won’t be seeing 3GHz SKUs but rather frequencies around 2.5GHz. Arm is still confident that we’ll see 3GHz SoCs, but I’m going to err on the conservative side and discuss 2.5GHz and 3GHz projections alongside each other, with the latter more a projection of future higher-TDP platforms.

Arm also had a slide demonstrating absolute peak performance at frequencies of 3.3GHz. The important thing to note here is that this scenario exceeded 5W, and performance would have to be reduced to get under that TDP target. It wasn’t clear whether this was SoC power or solely CPU power – I’ll follow up with a clarification after I reach out to Arm.

Obviously the most important metrics here alongside the performance improvements are the power and efficiency targets. In target products, comparing a Cortex A75 on a 10nm process against a Cortex A76 on a 7nm process under the same 750mW/core power budget, the Cortex A76 delivers 40% more performance.

In terms of energy efficiency, a 7nm A76 matching the performance target of 20 SPECint2006 of a 10nm A75 (i.e. its maximum performance at 2.8GHz) is said to use half the amount of energy.

What is important in all these metrics is that we weren’t presented with an ISO-process comparison, nor a comparison at the A76’s maximum performance at 3GHz, so we’re left with quite a bit of guesswork in projecting the final energy efficiency difference in products. TSMC promises a 40% drop in power versus 10FF. We haven’t seen an A75 implemented on a TSMC process to date, so the best baseline we have is Qualcomm’s Snapdragon 845 on Samsung 10LPP, which should slightly outperform 10FF.

Going through my projected data: on the performance side, I baselined the SPECspeed scores on the average of the Snapdragon 835 and Kirin 970 measured results, applied Arm’s projected IPC claims, and scaled the scores for frequency. For the 3GHz A76 projection this gets us to the near-2x performance improvement in SPECint2006 versus the A73 generation of cores.
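The near-2x figure falls out of compounding two generational IPC uplifts on top of the clock increase. The sketch below uses a hypothetical A73-class baseline and an assumed ~20% A73-to-A75 integer IPC gain alongside Arm's quoted ~25% A75-to-A76 gain; neither the baseline score nor the A73-to-A75 figure is from the article's measured data.

```python
# Compound IPC uplifts across generations, then scale for frequency.
# Baseline score, baseline clock, and the 20% A73->A75 gain are assumptions.

def project_compound(base_score: float, base_freq: float,
                     target_freq: float, *ipc_uplifts: float) -> float:
    for uplift in ipc_uplifts:
        base_score *= 1 + uplift
    return base_score * target_freq / base_freq

# Hypothetical A73 baseline of 20 SPECint2006 at 2.45GHz:
proj = project_compound(20.0, 2.45, 3.0, 0.20, 0.25)
print(round(proj / 20.0, 2))  # ~1.84x, i.e. near-2x over the A73 generation
```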

In terms of power efficiency, there’s more guesswork, as the only real figure we have is the process-scaled efficiency figure stated earlier. Arm quoted a performance target of 20 SPECint2006, which I suspect is a 2.8GHz A75 run with GCC-compiled benchmark binaries, which have an advantage over my LLVM figures. If Arm wanted to compare against the Snapdragon 845, this roughly matches a 2.4GHz A76. Accounting for the process power improvement, this leaves roughly a ~15% microarchitectural advantage for the A76. However, as the A76 is targeted to perform 35% higher, and as we’ve seen in the past that performance increases through clock don’t scale linearly with power, the power and efficiency advantages would very quickly degrade at peak performance.
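The non-linear degradation follows from the usual first-order model: dynamic power scales roughly with f·V², and voltage must rise with frequency near the top of the DVFS curve, while performance only scales linearly with f. A minimal sketch, with made-up voltage points purely for illustration:

```python
# Energy for a fixed workload at two DVFS points. Dynamic power ~ C*f*V^2;
# runtime ~ work/f, so energy ~ C*V^2*work - voltage dominates efficiency.
# The voltage/frequency pairs below are hypothetical, not measured values.

def energy_per_run(freq_ghz: float, volt: float, work_gcycles: float = 10.0,
                   cap: float = 1.0) -> float:
    power = cap * freq_ghz * volt ** 2   # dynamic power, arbitrary units
    runtime = work_gcycles / freq_ghz    # time to finish a fixed workload
    return power * runtime               # energy = C * V^2 * cycles

for freq, volt in [(2.5, 0.80), (3.0, 0.95)]:  # hypothetical DVFS points
    print(freq, round(energy_per_run(freq, volt), 2))
```

Under these assumptions, the last ~20% of clock costs roughly 40% more energy per workload, which is why peak-frequency efficiency comparisons look so much worse than mid-curve ones.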

Taking all factors into account as best I could, we should see 7nm A76-based SoCs slightly beat the energy efficiency of current Arm SoCs in terms of absolute energy usage at peak performance – a metric which is important as it is directly proportional to a device’s battery life. At a more conservative 2.5GHz clock this energy efficiency advantage would be greater, at around 30% less energy than current-generation A73 and A75 SoCs.

So on one hand the A76 should be extremely energy efficient, but it could also very well be a thermally constrained design, as at its peak performance we’d be seeing quite a bit higher TDP figures. Arm states that the A76 is meant to run at full frequency in quad-core mode; however, that claim is limited to larger form factors. For mobile devices, based on what I’m hearing, vendors will need to tone it down to lower clocks in order to fit smartphone designs.

Again, the projections here contain a lot of variables and I’m erring towards the more conservative side in terms of performance and efficiency – however it’s clear that the jump will be significant in whichever direction vendors decide to push the A76 (performance or efficiency).


  • iwod - Friday, June 1, 2018 - link

Even if Apple moved the A11 from 10nm to 7nm and ran it at 3GHz, there would still be a huge gap in performance. Let alone they will have the A12 and 7nm shipping in a few months' time. Compare this to the A76, which I don't think will come in 2018.

    So there is still roughly a 3 years gap between ARM and Apple in IPC or Single thread performance.
  • Lolimaster - Friday, June 1, 2018 - link

    And why do you care about IPC, when 99.99% of all smartphone users:

    -Use the phone as a glorified clock
    -A tool for showing off (even with the cancer "dynamic" profile on Samsung AMOLED powered devices, they don't know the "basic" calibrated profile exists)
    -Twitter, facebook, instagram, whatapp

    Where is your need for performance? Unless you buy a phone to run antutu/geekbench every time you take the phone out of your pocket.

    The biggest improvement in phone performance was the jump from slow/high latency EMMC to nvme-like nand (apple), UFS for samsung and the others.
  • serendip - Friday, June 1, 2018 - link

    Spot on. I've got a SD650 and a SD625 phone, one with A72 big cores and the other with only A53 cores, and for web browsing and chatting they're almost indistinguishable. The 625 device also has much better battery life.
  • darwiniandude - Friday, June 1, 2018 - link

    Of course a faster device can accomplish a task faster and drop back to idle power efficiency to aid battery life. It depends on many factors, but running at (hypothetical) 20 units of performance per second over 5 seconds (total 100) then dropping back to idle might be preferable to 10 units of performance per second over 10 seconds.
    Also, remember Apple’s devices do much on device, the Kinect-like FaceID for one, and unlike Google Photos where images are scanned for content in the cloud (this picture contains a bridge, and a dog) iOS devices scan their libraries on device when on charge.
  • name99 - Friday, June 1, 2018 - link

    That's like saying Intel shouldn't bother with performance any more because 99.99% of PCs run Facebook in the web browser, email, and Word.

    (a) Apple sells delight, and part of delight in your phone is NEVER waiting. If you want to save money, buy a cheaper phone and wait, but part of Apple's value proposition is that, for the money you spend, you reduce the friction of constant short waits. (Compare, eg, how much faster the phone felt when 1st gen TouchID was replaced with the faster 2nd TouchID. Same thing now with FaceID; it works and works well. But it will feel even smoother when the current half second delay is dropped to a tenth of a second [or whatever].)

    (b) Apple chips also go into iPads. And people use iPads (and sometimes iPhones) for more than you claim --- for various artistic tasks (manipulating video and photos, drawing with very fancy [ie high CPU] "brushes" and effects, creating music, etc). One of the reasons these jobs are done on iPads (and sometimes Surfaces) and not Android is because they need a decent CPU.

    (c) Ambition. BECAUSE Apple has a decent CPU, they can put that CPU into their desktops. And, soon enough, also into their data centers...
  • serendip - Friday, June 1, 2018 - link

    I'm curious about all this because I'm an iPad user. No iPhones though. Even an old iPad Mini is smoother than top Android tablets today.

    Does the CPU spike up to maximum speed quickly when loading apps or PDFs, then very quickly throttle down to minimum? I don't know how Apple make their UI so smooth while also having good battery life.
  • varase - Saturday, June 2, 2018 - link

    Smooth is the iPhone X.

    When you touch the screen, touch tracking boosts to 120hz, even though they can only run the OLED screen at 60hz.

    As for PDFs, MacOS (and as a consequence iOS) uses non-computational postscript as their graphics framework ... and PDF is essentially journaled postscript (like a PICT was journaled QuickDraw).

    As for throttling down: yeah, when you've completed your computationally expensive task you throttle down to save power.
  • YaleZhang - Friday, June 1, 2018 - link

    Reducing latency of floating point instructions from 3 cycles to 2 seems quite an accomplishment. For Intel, it's been >= 3 cycles (http://www.agner.org/optimize/instruction_tables.p...

    Skylake: 4 cycles / 4.3 GHz = 0.93 ns
    A76: 2 cycles / 3 GHz = 0.66 ns

    Skylake latency increased to 4 probably to achieve a higher clock, but if A76 can do it in 3, then Skylake should also be able to do it (3 cycles / 4.3 GHz) = 0.70 ns.
    How did ARM do this?
  • tipoo - Tuesday, September 4, 2018 - link

    Lower max clocks, shorter pipeline maybe?
  • Quantumz0d - Friday, June 1, 2018 - link

    Hilarious commenters. Apple's SoC ? Again ? I guess people need to think about how bad their Power envelope is. Their A11 gets beaten by 835 in consistency, dropping to 60% of clocks lol. And the battery killing SoC yes the battery capacity is less on iPhones. But Apple's R&D and the chips costs are very high vs the ARM. Not to forget how 845s GPU performance slaps and drowns that Custom *cough cough *Imagination* IP derived GPU core.

    They rely on the Single Thread performance because of power and optimization; it goes for one OS and one HW ecosystem ruled and locked by Apple only, whereas ARM-derived designs or Qcomm are robust for supporting a wider hardware pool and can even run Windows OS.
