Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence

Author: Andrei Frumusanu

by Andrei Frumusanu on May 26, 2020 9:00 AM EST


Implementation Choices & Customers

Naturally, the Cortex-X1 is expected to be quite a bit bigger than a Cortex-A78, but not dramatically so. Arm does warn, though, that for mobile designs it’s extremely unlikely that we’ll see implementations with more than two X1 cores. The company here is essentially embracing the industry trend of going for a three-tier core hierarchy, and with the introduction of the A78 and X1, it is allowing customers to build such systems with much more flexibility and differentiation than the frequency and process library differentiation we’ve been seeing on today’s “mid” and performance cores.

There are still going to be customers who may be cost-averse or simply not take part in the “Cortex-X Program”, and who might avoid the X1 and just go with A78 cores. The comparison Arm is making here is against an equivalent A77 setup, and the A78 cores would indeed bring a good amount of area savings, all while improving performance.

Cortex-X1 implementers would very likely go for a hybrid cluster implementation with X1, A78 and A55 cores in a DSU. Arm here depicts Qualcomm’s favorite 1+3+4 configuration, and it's a logical setup that we’d expect to see in a future Snapdragon chip.

Today’s announcement of the Arm cores also came with an unusual quote from Samsung LSI:

“Samsung and Arm have a strong technology partnership and we are very excited to see the new direction Arm is taking with Cortex-X Custom program, enabling innovation in the Android ecosystem for next-gen user experiences.”

- Joonseok Kim, vice president of SoC design team at Samsung Electronics

It’s extremely rare to hear Samsung talk about a new Arm IP like this during a launch, and I think it’s pretty safe to say that this is very much an indirect confirmation that they’re a licensee of the X1 cores. In which case, we’ll be seeing the core in the next generation of flagship Exynos chipsets. Looking back at what happened with Samsung’s custom CPU design team last year, as well as the lackluster performance of their custom cores, the very existence of the X1 probably further sealed the fate of their custom core efforts. The only remaining questions for me are whether they’ll go for a 1+3+4 or a 2+2+4 setup, and whether Samsung’s 5nm will be more competitive than their lagging 7nm node.

Meanwhile HiSilicon, being in the middle of political turmoil, probably won't get to produce an X1 chip; plus, the vendor has a tendency not to always use the latest CPU IPs anyhow. MediaTek would be the last candidate licensee for the X1 – but here I’m also relatively uncertain whether the company’s cost-oriented mantra actually fits well with the X1’s philosophy of going all out on area, with the likelihood that it’s also more expensive to license.

First Impressions - Arm Finally Going For Pure Performance

Today’s reveal of the Cortex-A78 and Cortex-X1 brought both the expected and the unexpected. I've had relatively modest expectations of the A78, as for years we had been told it would be the smallest upgrade amongst the new Austin family of Arm CPU microarchitectures. The A76 and A77 were after all both big leaps in performance and IPC. What I didn’t expect was for Arm to really focus on maximizing the PPA of the design, with efficiency being a first-class citizen in terms of design priorities. In that sense, the A78’s performance improvements might be a little tame compared to previous generations, but seemingly it’s still going to be an excellent core that is going to continue Arm's recent strides in outstandingly efficient computing.

Meanwhile the Cortex-X1 is a big change for Arm. And that change has less to do with the technology of the cores, and more with the business decisions that it now opens up for the company, although both are intertwined. For years many people were wondering why the company didn't design a core that could more closely compete with what Apple had built. In my view, one of the reasons for that was that Arm has always been constrained by the need to create a “one core fits all” design that could fit all of their customers’ needs – and not just the few flagship SoC designs.

The Cortex-X program here effectively unshackles Arm from these business limitations, and it allows the company to provide the best of both worlds. As a result, the A78 continues the company’s bread & butter design philosophy of power-performance-area leadership, whilst the X1 and its successors can now aim for the stars in terms of performance, without such strict area usage or power consumption limitations.

In this regard, the X1 seems really, really impressive. The 30% IPC improvement over the A77 is astounding and not something I had expected from the company this generation. The company has been incessantly beating the drum of its annual projected 20-25% improvements in performance – a pace which is currently well beyond what the competition has been able to achieve. These most recent projected performance figures are getting crazy close to the best of what we’re seeing from the x86 players out there right now. That’s exciting for Arm, and should be worrying for the competition.
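
To put those percentages side by side, here is a minimal back-of-the-envelope sketch in Python. It assumes the usual first-order model in which performance scales as IPC times clock frequency, and that a yearly gain simply compounds. The function names and the three-generation horizon are illustrative choices, not figures from Arm.

```python
def relative_performance(ipc_gain, freq_ratio=1.0):
    """First-order speedup: (1 + IPC gain) scaled by the clock ratio.

    Assumes performance = IPC * frequency, which ignores memory
    and thermal effects.
    """
    return (1.0 + ipc_gain) * freq_ratio


def compounded(annual_gain, years):
    """Cumulative speedup if the same fractional gain repeats every year."""
    return (1.0 + annual_gain) ** years


# 30% IPC uplift at equal clocks (the X1 vs. A77 figure quoted above)
x1_vs_a77 = relative_performance(0.30)

# Hypothetical three-generation horizon at the projected 20-25% per year
low, high = compounded(0.20, 3), compounded(0.25, 3)

print(f"X1 vs A77 at iso-frequency: {x1_vs_a77:.2f}x")
print(f"Three generations at 20-25% per year: {low:.2f}x to {high:.2f}x")
```

At equal clocks the 30% IPC gain is a 1.3x speedup on its own, while three generations at the projected annual pace would compound to roughly 1.73x to 1.95x.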


192 Comments


Wilco1 - Tuesday, May 26, 2020 - link
Disappointed in what way? Flagship phones have been more than fast enough in the last few years. There is a balance between power consumption and performance - and I think the improved efficiency of Cortex-A78 will be more useful in typical use-cases. It won't win benchmarks, but if you believe iPhone performance is measurably better in real-life use (rather than benchmarks), why not just buy one?
syxbit - Tuesday, May 26, 2020 - link
Put it in context. You pay $1500 for a Galaxy S20 Ultra that's slower than a $400 iPhone.
If you do a lot of web browsing on JavaScript-heavy pages, nothing beats single-threaded perf. You can't improve it by just throwing slower cores at it.
Discourse did a good writeup that's still valid today.
https://meta.discourse.org/t/the-state-of-javascri...
Wilco1 - Tuesday, May 26, 2020 - link
You could also get the $699 OnePlus 8 and beat the S20 Ultra on both performance and cost. Where is the difference?

Javascript and browsers depend heavily on software optimization, and that's the real issue.
armchair_architect - Tuesday, May 26, 2020 - link
syxbit is right. Javascript and browsers are not just software. They stress CPU in different ways than the usual Spec/Geekbench and X1 will not be just a benchmark core.
If you look at the DVFS curve of A77 vs A78, X1 will probably be even lower power than A78 in the region of perf in which they overlap.
For the simple reason that to achieve the same performance as A77/A78, X1 will need much lower frequency and voltage. This will greatly offset the intrinsic growth in iso-frequency power that X1 will for sure have.
My point would be: going wider helps you be more efficient iso-perf vs narrower cores.
The power efficiency hit only comes when you go over the peak perf offered by the narrower core.
So you could argue that something like X1 is taking the A78 DVFS curve and pushing it down (lower power), and on top of that it extends it to new performance points not even reachable on A78.
Obviously you pay in area for this :)
But Apple has clearly shown over the years that this is the winning formula
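
The iso-performance argument in this comment can be illustrated with a toy model. The sketch below uses the textbook dynamic-power approximation P = C * V^2 * f together with performance = IPC * f. The linear voltage curve, the 1.3x IPC and the 1.5x switched capacitance for the wider core are made-up illustrative numbers, not measured A78 or X1 data, and leakage is ignored entirely.

```python
def dynamic_power(c_eff, volts, freq_ghz):
    """Textbook CMOS dynamic power estimate, P = C_eff * V^2 * f (arbitrary units)."""
    return c_eff * volts ** 2 * freq_ghz


def voltage_for(freq_ghz, v_min=0.6, v_slope=0.2):
    """Hypothetical linear voltage/frequency curve: V = v_min + v_slope * f."""
    return v_min + v_slope * freq_ghz


def iso_perf_power(ipc, c_eff, target_perf):
    """Dynamic power a core needs to reach target_perf, with perf = ipc * f."""
    freq = target_perf / ipc  # a higher-IPC core can clock lower for the same perf
    return dynamic_power(c_eff, voltage_for(freq), freq)


# Assumed cores: "narrow" is the baseline, "wide" has 1.3x IPC but
# 1.5x switched capacitance (illustrative values only).
for target in (0.65, 1.3, 2.6):
    narrow = iso_perf_power(ipc=1.0, c_eff=1.0, target_perf=target)
    wide = iso_perf_power(ipc=1.3, c_eff=1.5, target_perf=target)
    print(f"perf {target}: narrow {narrow:.3f}, wide {wide:.3f}")
```

With these assumed parameters the wider core needs less power at a performance target near the narrow core's top end (2.6 in the example) but slightly more at a low target (0.65). The outcome therefore depends heavily on the real voltage curve and on leakage, which this sketch leaves out.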
ZolaIII - Wednesday, May 27, 2020 - link
You are completely wrong. It's much more about caching than wider cores. X1 is not 50% faster than A78 but it is 50% bigger. The best approach would be a wider ISA with the same execution units multiplied in number, like RISC-V, which has already laid out the foundations for a 256-bit ISA (still a scratch) and is finalising a 128-bit one. But there's a catch in tool and compiler support.
soresu - Wednesday, May 27, 2020 - link
X1 does have wider NEON SIMD, twice as wide in fact - so for content that favors SIMD (like dav1d AV1 decoding) you will get a serious jump in performance.

Unfortunately the benchmarks do not really give us much of an idea of real world improvement for something like this, so we'll have to wait for products to get a better idea.
dotjaz - Thursday, May 28, 2020 - link
ARM specifically said A78 was designed to INCREASE EFFICIENCY vs A77, a lot of the decisions concur with that.
X1 was designed to MAXIMIZE PERFORMANCE, sacrificing efficiency and area in the process. When you factor in the leakage caused by the larger die, X1 would almost certainly be less efficient than A78 when you drop it below 2GHz.
Wilco1 - Thursday, May 28, 2020 - link
"Javascript and browsers are not just software."

They are just software. Fun fact: your Android browser is built with -Oz. Yes, all optimizations are turned off in order to reduce binary size. That's an insanely stupid software decision which means Android phones appear to be behind iOS when in fact they are not.
name99 - Saturday, May 30, 2020 - link
It's not an "insanely stupid software decision"...
Fun fact: Apple ALSO builds pretty much all their software at either -Oz or -Os! Both Apple and Google (and probably MS) are well aware that the "overall system experience" matters more than picking up a few percentage points in particular benchmarks, and that large app footprints hurt that overall system experience. Apple's recommendation for MOST developer code (and followed internally) has been to optimize for size for, yikes, at least 20 years, and hasn't changed in all that time.

Look at the (ongoing) work in LLVM to reduce code size ( "outliner" is one of the relevant keywords); the people involved in that span a range of companies. I've seen a lot of work by Apple people, a lot by Google people, some even by Facebook people.
Wilco1 - Saturday, May 30, 2020 - link
There is a world of difference between optimizing performance without regard for codesize and optimizing for smallest possible codesize without any regard for performance. -Ofast is the former, -Oz is the latter. Most software, including Linux distros, uses -O2 as the best tradeoff between these extremes. Non-essential applications use -Os (or even -Oz if performance is irrelevant). However, a browser is extremely performance sensitive. Saving a few bytes with -Oz loses 10-20% performance, and that means you lose the equivalent of a full CPU generation. I call that insanely stupid; there are no other words to describe it.
