Another year, another TechDay from Arm. Over the last several years Arm’s event has come as clockwork in the May timeframe and has every time unveiled the newest flagship CPU and GPU IPs. This year is no exception as the event is back on the American side of the Atlantic in Austin Texas where Arm has one of its major design centres.

Two years ago during the unveiling of the Cortex A73 I had talked a bit more about Arm’s CPU design teams and how they’re spread across locations and product lines. The main design centres for Cortex-A series of CPUs are found in Austin, Texas; Cambridge, the United Kingdom, and Sophia-Antipolis in the south of France near Nice. For the last two years the Cortex A73 and Cortex A75 were designs that mainly came out of the Sophia team while the Cortex A53 and more recently the A55 were designs coming out of Cambridge. This means that we haven’t seen any recent designs coming out of Austin and the last of the “Austin family” of CPUs were the A57 and A72.

The project being worked on in Austin had been hyped up for several years – I remember even as early as the A73 release back in 2016 the company had pulled forward some elements from an advanced future microarchitecture on the back-end pipelines, especially on the FP/SIMD side. The Cortex A75 was further remarked as pulling more elements from this new mysterious project.

Today we can finally unveil what the Austin team has been working on – and it’s a big one. The new Cortex A76 is a brand new microarchitecture which has been built from scratch and lays the foundation for at least two more generations for what I’ll call “the second generation of Austin family” of CPUs.

The Cortex A76 is important for Arm for a design perspective as it represents a new start from a clean sheet. It’s rare for IP claim to be able to do this as it represents a great resource and time investment and if it weren’t for the Sophia design team taking over the steering wheel for the last two generations of products it wouldn’t have been reasonable to execute. The execution of the CPU design teams should be emphasised in particular as Arm claims this is the 5th generation “annual beat” product where the company delivers a new microarchitecture every new year. Think of it as an analogue to Intel’s past Tick-Tock strategy, but rather Tock-Tock-Tock for Arm with steady CAGR (compound annual growth rate) of 20-25% every generation coming from µarch improvements.

So what is the Cortex A76? In Arm’s words, it’s a “laptop-class” performance processor with mobile efficiency. The vision of the A76 as a laptop-class processor had been emphasised throughout the TechDay presentation so it seems Arm is really taking advantage of the large performance boost of the IP to cater to new market segments such as the emerging “Always connected PCs” which Qualcomm is spearheading with their SoC platforms.

The Cortex A76 microarchitecture has been designed with high performance while maintaining power efficiency in mind. Starting from a clean sheet allowed the designers to remove bottlenecks throughout the design and to break previous microarchitectural limitations. The focus here was again maximum performance while remaining within energy efficiency that is fit for smartphones.

In broad metrics, what we’re promised in actual products using the A76 is the follows: a 35% performance increase alongside 40% improved power efficiency. We’ll also see a 4x improvements in machine learning workloads thanks to new optimisations in the ASIMD pipelines and how dot products are handled. These figures are baselined on A75 configurations running at 2.8GHz on 10nm processes while the A76 is projected by Arm to come in at 3GHz on 7nm TSMC based products.

The new CPU is naturally still compatible with DynamIQ’s common cluster topology and Arm envisions designs to be paired with Cortex A55s as the little more power efficient CPUs. The configuration scalability of the DynamIQ IP again was reiterated and we were presented with example configurations such as 1+7 or 2+6 with either Cortex A75 or A76 CPU IP. This presentation slide was one of the rare ones where Arm referred to the area size of the A76, pointing out that the A75 still had better PPA and thus might still be a valid design choice for companies, depending on their needs. One comparison that was made during the event is that in terms of area, three A76’s with larger caches would fit inside the size of a Skylake core – all while within 10% of the IPC of the Intel CPU, but obviously there’s also process node scaling considerations to take into account.

A standout claim is that Arm aims to outperform the competition at half the area and half the power. Arm was slightly beating around the bush here in what it considers the competition, but generally the answer was that it was considering everybody the competition. Taking into account Intel, AMD or Samsung it’s actually not that hard to imagine Arm beating them in PPA as historically the company always had the smallest CPU designs and that directly translates into more efficient microarchitectures.

Before we get into more detailed breakdowns of the performance and power improvements and what I’m expecting to happen into products, let’s see the microarchitectural improvements on the core and how Arm managed to extract this much performance while maintaining power efficiency.

Cortex A76 µarch - Frontend
POST A COMMENT

122 Comments

View All Comments

  • iwod - Friday, June 1, 2018 - link

    Even if Apple moved A11 from 10nm to 7nm, and runs at 3Ghz it will still be a huge gap in performance. Let alone they will have A12 and 7nm shipping in a few months time. Compare this to A76, which I don't think will come in 2018.

    So there is still roughly a 3 years gap between ARM and Apple in IPC or Single thread performance.
    Reply
  • Lolimaster - Friday, June 1, 2018 - link

    And why do you care about IPC, when 99.99% of all smartphone users:

    -Use the phone as a gloried clock
    -A tool for showing off (even with the cancer "dynamic" profile on Samsung AMOLED powered devices, they don't know the "basic" calibrated profile exists)
    -Twitter, facebook, instagram, whatapp

    Where is your need for performance? Unless you buy a phone to run antutu/geekbench all the time you pick the phone out of your pockets.

    The biggest improvement in phone performance was the jump from slow/high latency EMMC to nvme-like nand (apple), UFS for samsung and the others.
    Reply
  • serendip - Friday, June 1, 2018 - link

    Spot on. I've got a SD650 and a SD625 phone, one with A72 big cores and the other with only A53 cores, and for web browsing and chatting they're almost indistinguishable. The 625 device also has much better battery life. Reply
  • darwiniandude - Friday, June 1, 2018 - link

    Of course a faster device can accomplish a task faster and drop back to idle power effciency to aid battery life. Depends on many factors, but running at (hypothetical) 20 units of performance per second over 5 seconds (total 100) then dropping back to idle might be preferable to 10 units of performance per second over 10 seconds.
    Also, remember Apple’s devices do much on device, the Kinect-like FaceID for one, and unlike Google Photos where images are scanned for content in the cloud (this picture contains a bridge, and a dog) iOS devices scan their libraries on device when on charge.
    Reply
  • name99 - Friday, June 1, 2018 - link

    That's like saying Intel shouldn't bother with performance any more because 99.99% of PCs run Facebook in the web browser, email, and Word.

    (a) Apple sells delight, and part of delight in your phone is NEVER waiting. If you want to save money, buy a cheaper phone and wait, but part of Apple's value proposition is that, for the money you spend, you reduce the friction of constant short waits. (Compare, eg, how much faster the phone felt when 1st gen TouchID was replaced with the faster 2nd TouchID. Same thing now with FaceID; it works and works well. But it will feel even smoother when the current half second delay is dropped to a tenth of a second [or whatever].)

    (b) Apple chips also go into iPads. And people use iPads (and sometimes iPhones) for more than you claim --- for various artistic tasks (manipulating video and photos, drawing with very fancy [ie high CPU] "brushes" and effects, creating music, etc). One of the reasons these jobs are done on iPads (and sometimes Surfaces) and not Android is because they need a decent CPU.

    (c) Ambition. BECAUSE Apple has a decent CPU, they can put that CPU into their desktops. And, soon enough, also into their data centers...
    Reply
  • serendip - Friday, June 1, 2018 - link

    I'm curious about all this because I'm an iPad user. No iPhones though. Even an old iPad Mini is smoother than top Android tablets today.

    Does the CPU spike up to maximum speed quickly when loading apps or PDFs, then very quickly throttle down to minimum? I don't know how Apple make their UI so smooth while also having good battery life.
    Reply
  • varase - Saturday, June 2, 2018 - link

    Smooth is the iPhone X.

    When you touch the screen, touch tracking boosts to 120hz, even though they can only run the OLED screen at 60hz.

    As for PDFs, MacOS (and as a consequence iOS) uses non-computational postscript as their graphics framework ... and PDF is essentially journaled postscript (like a PICT was journaled QuickDraw).

    As for throttling down: yeah, when you've completed your computationally expensive task you throttle down to save power.
    Reply
  • YaleZhang - Friday, June 1, 2018 - link

    Reducing latency of floating point instructions from 3 cycles to 2 seems quite an accomplishment. For Intel, it's been >= 3 cycles (http://www.agner.org/optimize/instruction_tables.p...

    Skylake: 4 cycles / 4.3 GHz = 0.93 ns
    A76: 2 cycles / 3 GHz = 0.66 ns

    Skylake latency increased to 4 probably to achieve a higher clock, but if A76 can do it in 3, then Skylake should also be able to do it (3 cycles / 4.3 GHz) = 0.70 ns.
    How did ARM do this?
    Reply
  • tipoo - Tuesday, September 4, 2018 - link

    Lower max clocks, shorter pipeline maybe? Reply
  • Quantumz0d - Friday, June 1, 2018 - link

    Hilarious commenters. Apple's SoC ? Again ? I guess people need to think about how bad their Power envelope is. Their A11 gets beaten by 835 in consistency, dropping to 60% of clocks lol. And the battery killing SoC yes the battery capacity is less on iPhones. But Apple's R&D and the chips costs are very high vs the ARM. Not to forget how 845s GPU performance slaps and drowns that Custom *cough cough *Imagination* IP derived GPU core.

    They rely on the Single Thread performance because of power and optimization it goes for one OS and one HW ecosystem ruled and locked by Apple only where as ARM derived designs or Qcomm are robust for supporting wider hardware pool and can even run Windows OS.
    Reply

Log in

Don't have an account? Sign up now