Cortex A76 µarch - Backend

Switching to the back-end of the core, we take a look at the execution units.

The integer core contains six issue queues and execution ports (four depicted in the slide, plus the two load/store pipelines). There are three integer execution pipelines – two ALUs capable of simple arithmetic operations, and a complex pipeline that also handles multiplication, division, and CRC ops. The three integer pipelines are each served by 16-entry-deep issue queues, and the same size of issue queue serves the single branch execution port.

The two load/store units are the remaining ports of the integer core and are served by two 12-entry-deep issue queues. The issue-queue stage is 3 cycles deep, and while I mentioned that rename/dispatch is 1 stage deep, the dispatch stage actually overlaps with the first cycle of the issue-queue stage.

The ASIMD/floating-point core contains two pipelines, which are served by two 16-entry-deep issue queues.

When it comes to the backend of a CPU core, the two most important metrics are instruction throughput and latency. Where the A76 in particular improves a lot is instruction latency, as it is able to shave cycles off very important instructions.

To give a better overview of the improvements, I have put together a table of the most common instruction types. The execution throughputs and latencies presented here are for AArch64 instructions and, unless otherwise noted, represent operations on 64-bit data for integer and 64-bit (double precision) data for FP.

Backend Execution Throughput and Latency
(Exec = execution throughput in instructions per cycle; Lat = result latency in cycles)

| Instruction type | Cortex-A75 Exec | Cortex-A75 Lat | Cortex-A76 Exec | Cortex-A76 Lat | Exynos-M3 Exec | Exynos-M3 Lat |
|---|---|---|---|---|---|---|
| Integer Arithmetic (Add, Sub) | 2 | 1 | 3 | 1 | 4 | 1 |
| Integer Multiply 32b | 1 | 3 | 1 | 2 | 2 | 3 |
| Integer Multiply 64b | 1 | 3 | 1 | 2 | 1 (2x 0.5) | 4 |
| Integer Multiply Accumulate | 1 | 3 | 1 | 2 | 1 | 3 |
| Integer Division 32b | 0.25 | 12 | 0.2 | <12 | 1/12 - 1 | <12 |
| Integer Division 64b | 0.25 | 12 | 0.2 | <12 | 1/21 - 1 | <21 |
| Move (Mov) | 2 | 1 | 3 | 1 | 3 | 1 |
| Shift ops (Lsl) | 2 | 1 | 3 | 1 | 3 | 1 |
| Load instructions | 2 | 4 | 2 | 4 | 2 | 4 |
| Store instructions | 2 | 1 | 2 | 1 | 1 | 1 |
| FP Arithmetic | 2 | 3 | 2 | 2 | 3 | 2 |
| FP Multiply | 2 | 3 | 2 | 3 | 3 | 4 |
| FP Multiply Accumulate | 2 | 5 | 2 | 4 | 3 | 4 |
| FP Division (S-form) | 0.2-0.33 | 6-10 | 0.66 | 7 | >0.16 (2x 1+/12) | 12 |
| FP Load | 2 | 5 | 2 | 5 | 2 | 5 |
| FP Store | 2 | 1-N | 2 | 2 | 2 | 1 |
| ASIMD Arithmetic | 2 | 3 | 2 | 2 | 3 | 1 |
| ASIMD Multiply | 1 | 4 | 1 | 4 | 1 | 3 |
| ASIMD Multiply Accumulate | 1 | 4 | 1 | 4 | 1 | 3 |
| ASIMD FP Arithmetic | 2 | 3 | 2 | 2 | 3 | 2 |
| ASIMD FP Multiply | 2 | 3 | 2 | 3 | 1 | 3 |
| ASIMD FP Chained MAC (VMLA) | 2 | 6 | 2 | 5 | 3 | 5 |
| ASIMD FP Fused MAC (VFMA) | 2 | 5 | 2 | 4 | 3 | 4 |

On the integer side, the A76 improves the multiply and multiply-accumulate latencies from 3 cycles down to 2 cycles, with throughput remaining the same as on the A75. Because the A76 has three integer pipelines to the A75's two, simple arithmetic operations see a 50% increase in throughput.
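Figures like those in the table are the kind of numbers simple dependency-chain microbenchmarks reproduce. As a rough sketch (not Arm's methodology; function names are mine and the timing harness is omitted), a serialized chain runs at the instruction's latency, while independent chains run at its throughput:

```c
#include <stdint.h>

/* Dependent chain: every multiply needs the previous result, so the loop
 * advances at one MUL per latency period (3 cycles on A75, 2 on A76). */
uint64_t mul_latency_chain(uint64_t x, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++)
        x = x * 0x9E3779B97F4A7C15ULL;      /* serialized 64-bit multiply */
    return x;
}

/* Independent chains: both multiplies can be in flight at once, so the
 * loop is bound by issue throughput (1 MUL per cycle on both cores). */
uint64_t mul_throughput(uint64_t a, uint64_t b, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        a = a * 0x9E3779B97F4A7C15ULL;
        b = b * 0xC2B2AE3D27D4EB4FULL;
    }
    return a ^ b;
}
```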

The much larger and more important improvements can be found in the "VX" (vector execution) pipelines, which are in charge of FP and ASIMD operations. Arm calls the new pipeline a "state-of-the-art" design, and this is finally the result that has been hyped up for several years now.

Floating point arithmetic operations have been reduced in latency from 3 cycles down to 2, and multiply-accumulate has also shaved off a cycle, from 5 down to 4.

What Arm means by "dual 128-bit ASIMD" with doubled execution bandwidth is that on the A75 and prior cores only one of the vector pipelines was 128 bits wide, while the other was still 64-bit. On the A76 both vector pipelines are 128-bit, so 128-bit vector operations see a doubling of execution throughput.
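A minimal NEON sketch of the kind of code that benefits (the function is illustrative; two accumulators stand in for the two pipes, though more chains would be needed to fully hide the 4-cycle FMA latency):

```c
#include <arm_neon.h>

/* Two independent 128-bit FMA chains. On the A76 both vector pipes are
 * 128 bits wide, so both chains can issue each cycle; on the A75 one of
 * the pipes is 64-bit, halving throughput on 128-bit vectors.
 * Assumes n is a multiple of 8. */
void fma_two_chains(float *out, const float *a, const float *b, int n) {
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 8) {
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),     vld1q_f32(b + i));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
    }
    vst1q_f32(out, vaddq_f32(acc0, acc1));
}
```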

Moving on to more details of the data-handling side, we again see the two load/store pipelines, something first implemented in the A73 and A75. Although depicted as one issue queue in the slide, the LD/S pipelines each have their own queue at 16 entries deep.

The data cache is fixed at 64KB and is 4-way associative. Load latency remains at 4 cycles. The DTLB lookup runs in a pipeline separate from the tag and data lookup. Arm's goal here is maximum MLP (memory-level parallelism), so as to be able to feed the core.
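That 4-cycle load-use latency is what a classic pointer-chasing loop measures; a minimal sketch (buffer setup and timing omitted, naming is mine):

```c
#include <stddef.h>

/* Serialized loads: each load's address depends on the previous load's
 * result, so nothing overlaps and each iteration pays the full
 * load-to-use latency (4 cycles from L1 on the A76). ring[] should hold
 * a random cyclic permutation small enough to fit in the 64KB L1D so
 * that every access is a hit. */
size_t pointer_chase(const size_t *ring, size_t steps) {
    size_t idx = 0;
    while (steps--)
        idx = ring[idx];
    return idx;
}
```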

In a perfect machine everything would already be located in the caches, so it's important to have very robust prefetching capabilities. The A76 introduces new fourth-generation prefetchers to get nearer to this goal of perfect cache-hit operation. In all, the A76 has four different prefetch engines running in parallel, looking at various data patterns and loading data into the caches.
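Arm doesn't disclose exactly which patterns the four engines cover, but the basic distinction prefetchers exploit can be sketched as follows (illustrative only):

```c
#include <stddef.h>

/* Sequential and constant-stride streams are the classic patterns a
 * hardware prefetcher detects and runs ahead of; a data-dependent chase
 * (as in the pointer_chase sketch above) defeats prefetching because
 * the next address is unknown until the current load completes. */
long sum_sequential(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                   /* unit stride: prefetch-friendly */
    return s;
}

long sum_strided(const long *a, size_t n, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];                   /* constant stride: detectable */
    return s;
}
```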

In terms of the A76 cache hierarchy, Arm is said to have made no compromises, getting the best of both worlds in bandwidth and latency. The 64KB L1 instruction cache reads up to 32B/cycle, and the same bandwidth applies to the L1 data cache in both directions. The L1 is a writeback cache. The L2 cache is configurable in 256KB or 512KB sizes and is D-side inclusive, with the same 2x 32B/cycle read and write interfaces up to the exclusive L3 cache in the 2nd-generation DSU.

Overall the microarchitectural improvements on the core are said to improve memory bandwidth to DRAM by up to 90% in microbenchmarks.
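For context, a DRAM bandwidth figure like that typically comes from a STREAM-style kernel run over buffers far larger than the L3, so all traffic goes to memory; a minimal sketch (my naming, not Arm's actual test):

```c
#include <stddef.h>

/* Streaming copy: a simple, prefetch-friendly kernel commonly used to
 * measure sustained DRAM bandwidth. Timing and allocation are omitted. */
void stream_copy(double *dst, const double *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```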

All in all, the microarchitecture of the A76 can be summed up in a few design focus points: maximise memory performance throughout the core by looking at every single cycle. During the design phase the engineers evaluated feature changes at a sensitivity of as little as 0.25% in performance or power; if a change met that bar, it was deemed worthwhile to include in the core. Small percentages then add up to significant figures in the end product.

The focus on bandwidth and latency is said to have been extreme, and Arm was very adamant in reiterating that, to take full advantage of the microarchitecture, vendors need to implement an equally capable memory subsystem on the SoC. A figure that was put out there was 0.25% of performance per nanosecond of latency to main memory. As we've seen with the Snapdragon 845, one of the reasons the SoC didn't quite reach Arm's projected performance metrics was its degraded memory latency, which might have been introduced by the L4 system cache in the SoC. In the future vendors will need to focus more on latency-sensitive memory subsystems, as otherwise they'll be leaving free performance and power on the table, with differences that could amount to basically a generational difference in CPU IP.
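That sensitivity figure is easy to put into perspective with some back-of-the-envelope arithmetic on Arm's quoted number (the latency penalty below is a hypothetical example, not a measured figure):

```c
#include <stdio.h>

int main(void) {
    const double pct_per_ns = 0.25;  /* Arm's figure: % performance per ns of DRAM latency */
    const double extra_ns   = 20.0;  /* hypothetical latency added by an SoC's fabric/system cache */
    /* 20 ns of added memory latency would cost ~5% performance -
     * a large slice of a typical generational IPC uplift. */
    printf("projected performance loss: %.1f%%\n", pct_per_ns * extra_ns);
    return 0;
}
```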

Comments

  • iwod - Friday, June 1, 2018 - link

    Even if Apple moved the A11 from 10nm to 7nm and ran it at 3GHz, there would still be a huge gap in performance. Let alone that they will have the A12 on 7nm shipping in a few months' time. Compare this to the A76, which I don't think will arrive in 2018.

    So there is still roughly a three-year gap between ARM and Apple in IPC and single-threaded performance.
  • Lolimaster - Friday, June 1, 2018 - link

    And why do you care about IPC, when 99.99% of all smartphone users:

    - Use the phone as a glorified clock
    - A tool for showing off (even with the cancerous "dynamic" profile on Samsung AMOLED-powered devices, they don't know the "basic" calibrated profile exists)
    - Twitter, Facebook, Instagram, WhatsApp

    Where is your need for performance, unless you buy a phone to run AnTuTu/Geekbench every time you pick it out of your pocket?

    The biggest improvement in phone performance was the jump from slow, high-latency eMMC to NVMe-like NAND (Apple) and UFS for Samsung and the others.
  • serendip - Friday, June 1, 2018 - link

    Spot on. I've got an SD650 and an SD625 phone, one with A72 big cores and the other with only A53 cores, and for web browsing and chatting they're almost indistinguishable. The 625 device also has much better battery life.
  • darwiniandude - Friday, June 1, 2018 - link

    Of course a faster device can accomplish a task faster and drop back to idle power efficiency to aid battery life. It depends on many factors, but running at a (hypothetical) 20 units of performance per second for 5 seconds (100 total) and then dropping back to idle might be preferable to 10 units of performance per second for 10 seconds.
    Also, remember that Apple's devices do much on-device: the Kinect-like FaceID for one, and unlike Google Photos, where images are scanned for content in the cloud (this picture contains a bridge, and a dog), iOS devices scan their libraries on-device while charging.
  • name99 - Friday, June 1, 2018 - link

    That's like saying Intel shouldn't bother with performance any more because 99.99% of PCs run Facebook in the web browser, email, and Word.

    (a) Apple sells delight, and part of delight in your phone is NEVER waiting. If you want to save money, buy a cheaper phone and wait, but part of Apple's value proposition is that, for the money you spend, you reduce the friction of constant short waits. (Compare, e.g., how much faster the phone felt when the 1st-gen TouchID was replaced with the faster 2nd-gen TouchID. Same thing now with FaceID; it works and works well. But it will feel even smoother when the current half-second delay is dropped to a tenth of a second [or whatever].)

    (b) Apple chips also go into iPads. And people use iPads (and sometimes iPhones) for more than you claim --- for various artistic tasks (manipulating video and photos, drawing with very fancy [ie high CPU] "brushes" and effects, creating music, etc). One of the reasons these jobs are done on iPads (and sometimes Surfaces) and not Android is because they need a decent CPU.

    (c) Ambition. BECAUSE Apple has a decent CPU, they can put that CPU into their desktops. And, soon enough, also into their data centers...
  • serendip - Friday, June 1, 2018 - link

    I'm curious about all this because I'm an iPad user. No iPhones though. Even an old iPad Mini is smoother than top Android tablets today.

    Does the CPU spike up to maximum speed quickly when loading apps or PDFs, then very quickly throttle down to minimum? I don't know how Apple makes its UI so smooth while also having good battery life.
  • varase - Saturday, June 2, 2018 - link

    Smooth is the iPhone X.

    When you touch the screen, touch tracking boosts to 120Hz, even though the OLED screen itself can only run at 60Hz.

    As for PDFs, macOS (and as a consequence iOS) uses non-computational PostScript as its graphics framework ... and PDF is essentially journaled PostScript (like PICT was journaled QuickDraw).

    As for throttling down: yeah, when you've completed your computationally expensive task you throttle down to save power.
  • YaleZhang - Friday, June 1, 2018 - link

    Reducing the latency of floating-point instructions from 3 cycles to 2 seems like quite an accomplishment. For Intel it's been >= 3 cycles (http://www.agner.org/optimize/instruction_tables.p...

    Skylake: 4 cycles / 4.3 GHz = 0.93 ns
    A76: 2 cycles / 3 GHz = 0.66 ns

    Skylake's latency probably increased to 4 to achieve a higher clock, but if the A76 can do it in 2 at 3 GHz, then Skylake should also be able to manage 3 (3 cycles / 4.3 GHz = 0.70 ns).
    How did ARM do this?
  • tipoo - Tuesday, September 4, 2018 - link

    Lower max clocks, shorter pipeline maybe?
  • Quantumz0d - Friday, June 1, 2018 - link

    Hilarious commenters. Apple's SoC? Again? I guess people need to think about how bad its power envelope is. The A11 gets beaten by the 835 in sustained performance, dropping to 60% of its clocks, lol. And yes, the battery-killing SoC is paired with less battery capacity on iPhones. Apple's R&D and chip costs are also very high versus stock Arm. Not to forget how the 845's GPU performance slaps and drowns that custom *cough* Imagination-derived GPU core.

    They rely on single-threaded performance because the power and optimization target one OS and one HW ecosystem, ruled and locked by Apple only, whereas Arm-derived designs or Qualcomm's are robust enough to support a wider hardware pool and can even run Windows.
