Arm Announces Mobile Armv9 CPU Microarchitectures: Cortex-X2, Cortex-A710 & Cortex-A510

Author: Andrei Frumusanu

by Andrei Frumusanu on May 25, 2021 9:00 AM EST


New DSU-110 L3 & Cluster: Massively More Bandwidth

Alongside the new CPU microarchitectures, Arm today is also announcing a new L3 design in the form of the new DSU-110. The “DynamIQ Shared Unit” has been the company’s go-to cluster and “core complex” block ever since it was introduced in 2017 with the Cortex-A75 and Cortex-A55. While we’ve seen small iterative improvements since then, today’s DSU-110 marks a major change in how the DSU operates and in how it promises to scale up in cache size and bandwidth.

The new DSU-110 is a ground-up redesign with an emphasis on more bandwidth and more power efficiency. It continues to be the core building block for all of Arm’s mobile and lower tier market segments.

A key metric is, of course, the maximum L3 cache configuration, which goes up to 16MB this generation. This is the high end of the spectrum, and generally we shouldn’t expect such a configuration in a mobile SoC soon, but Arm has shown several slides depicting larger form-factor implementations that use such a design to house up to 8 Cortex-X2 cores. This is undoubtedly extremely interesting for a higher-performance laptop use-case.

The bandwidth increase of the new design is also significant, and applies from single-threaded to multi-threaded scenarios. The new DSU-110 promises aggregate bandwidth increases of up to 5x compared to the current-generation design. More interesting is the fact that it also significantly boosts single-core bandwidth, and Arm notes that the new DSU can support more bandwidth than the new core microarchitectures are capable of using for the time being.

Arm never really disclosed the internal topology of the previous generation DSU, but remarks that with the DSU-110 the company has shifted over to a bi-directional dual-ring transport topology, each ring with four ring-stops, and now supporting up to 8 cache slices. The dual-ring structure is used to reduce the latencies and hops between ring-stops and to shorten the paths between the cache slices and cores. Arm notes that they’ve tried to retain the same low access latencies as on the current generation DSU (cache size increases aside), so we should be seeing very similar average latencies between the two generations.
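To illustrate why a bi-directional ring keeps hop counts low, here is a minimal sketch that counts hops between stops on a single four-stop ring. Arm has not disclosed how cores and cache slices map onto the ring-stops, so the stop numbering, the function names and the uni-directional ring used for comparison are all assumptions made purely for illustration; none of this describes the actual DSU-110 or its predecessor.

```python
# Hypothetical model: one ring with four stops, numbered 0..3.
# The real DSU-110 stop layout and agent placement are not public.

STOPS = 4


def bidirectional_hops(src: int, dst: int, stops: int = STOPS) -> int:
    """Fewest hops between two stops when traffic may travel either way."""
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)


def unidirectional_hops(src: int, dst: int, stops: int = STOPS) -> int:
    """Hops between two stops when traffic may only travel clockwise."""
    return (dst - src) % stops


# All ordered pairs of distinct stops.
pairs = [(s, d) for s in range(STOPS) for d in range(STOPS) if s != d]

bi_max = max(bidirectional_hops(s, d) for s, d in pairs)
bi_avg = sum(bidirectional_hops(s, d) for s, d in pairs) / len(pairs)
uni_max = max(unidirectional_hops(s, d) for s, d in pairs)
uni_avg = sum(unidirectional_hops(s, d) for s, d in pairs) / len(pairs)

print(f"bi-directional:  max {bi_max} hops, average {bi_avg:.2f}")
print(f"uni-directional: max {uni_max} hops, average {uni_avg:.2f}")
```

In this toy model the worst case on a bi-directional four-stop ring is two hops, versus three when traffic can only flow one way, which is the kind of path shortening the ring design is after.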

More parallel accesses for bandwidth, as well as more outstanding transactions, also seem to have been very important for improving performance. This is very exciting for upcoming SoC designs, but it also raises the question of how much of the previously presented CPU IPC improvements is actually contributed by the new DSU-110.

Architecturally, one important change to the capabilities of the DSU-110 is support for MTE tags, an upcoming security and debugging feature promising to greatly help with memory safety issues.

The new DSU can scale up to 4x AMBA CHI ports, meaning we’ll have up to 1024-bit total bi-directional bandwidth to the system memory. With a theoretical DSU clock of around 2GHz, this would enable bandwidth of up to 256GB/s for reads or writes, or double that when combined, more than enough to saturate even eventual high-end laptop configurations.
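The 256GB/s figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the 1024-bit total is split evenly into 256 bits per port per direction, which the article does not state explicitly, and uses the theoretical 2GHz clock mentioned above; the function name is made up for this example.

```python
# Rough peak-bandwidth check for the DSU-110 memory interface.
# Assumption: 4 CHI ports x 256 bits each per direction = 1024 bits total.


def peak_bandwidth_gb_s(ports: int, bits_per_port: int, clock_ghz: float) -> float:
    """Peak one-direction bandwidth in GB/s: bytes per cycle times GHz."""
    bytes_per_cycle = ports * bits_per_port / 8
    return bytes_per_cycle * clock_ghz


one_way = peak_bandwidth_gb_s(ports=4, bits_per_port=256, clock_ghz=2.0)
combined = 2 * one_way  # reads and writes at the same time

print(f"reads or writes: {one_way:.0f} GB/s")
print(f"combined:        {combined:.0f} GB/s")
```

These are theoretical peaks at an assumed clock; real SoCs will be limited by the memory controllers and DRAM behind the ports.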

In terms of power efficiency, the new DSU offers more options for low-power operation when idle, implementing partial L3 power-down that can reduce leakage power by up to 75% compared to the current DSU.

In general idle situations where the full L3 remains powered on, the new design promises up to a 25% reduction in leakage power, all whilst offering 2x the bandwidth capabilities.

It’s important to note that we’re talking about leakage power here; active dynamic power is expected to generally scale linearly with the bandwidth increase of the new design, meaning 5x the bandwidth would also cost 5x the power. This is an important factor for system power, and for the expected power behaviour of next-gen SoCs in general when they’re put under heavy memory workloads.

Arm describes the DSU-110 as the backbone of the Armv9 cluster, and that seems to be an apt description. The new bandwidth capabilities are sure to help with both the single-threaded and the multi-threaded performance of upcoming SoCs. While it’s possible somebody might build a high-end laptop SoC with the new 16MB L3 capability, that is generally not as exciting as the now finally expected move to an 8MB L3 on mobile SoCs, which will hopefully also enable higher power efficiency and more battery life for devices.

181 Comments

name99 - Tuesday, May 25, 2021 - link
Intrinsity was about circuit design.
PA Semi was about microarchitecture.

There was a *lot* of good stuff in PA Semi! I have looked quickly at quite a few of the Intrinsity patents, but I don't know enough about that level of the stack to have any opinion as to how impressive they were. (This is not a criticism -- even if all that was picked up from Intrinsity was a number of competent engineers capable of implementing the micro-architecture ideas of the PA Semi folks, that's an essential part of shipping a chip!)
I'd honestly love someone who is familiar with the circuit level to look at the Intrinsity (low level) and PA Semi patents (like the one for a new register file design) and give us an informed opinion.

But as important as both of these has been Apple's willingness to keep pushing the envelope, to keep pouring money into design, to keep taking risks (every design change is a risk...) and not to accept "good enough". That might seem obvious except that, of course,
- Intel has been cruising on "good enough" for 10 years,
- QC (notoriously) made "good enough" its official response to the A7, and followed that up by cancelling Centriq, and
- ARM, for whatever reason, seems to alternate between designs that look like they're trying to at least approach Apple, and designs that feel like "good enough".
melgross - Tuesday, May 25, 2021 - link
Intrinsity was about efficiency. That was what they were known for.
mode_13h - Wednesday, May 26, 2021 - link
> anyone in the non-iOS space is stuck with this attempt to inject some
> Bulldozer design features into the tired in-order A55 lineage.

Well, they can have just one core per complex, instead of 2.

I'm not really sure why the hate, unless you think you're going to be running a lot of FP/vector threads.
melgross - Thursday, May 27, 2021 - link
That was the problem with Bulldozer. They made the same mistake.
mode_13h - Saturday, May 29, 2021 - link
> That was the problem with Bulldozer. They made the same mistake.

You mean the 2 cores per complex? But ARM is giving customers the option to order up an A510 with just 1 per complex, if you think you need enough FP/vector throughput to warrant it.

I think a lot of the hate being directed at the A510 is mere guilt by association. It's massively different than Bulldozer, but the sharing of that one feature really seems to have tainted it with all the negative feelings people have towards Bulldozer.
lemurbutton - Tuesday, May 25, 2021 - link
x86 is dead.

AMD doing 5% to 15% improvements every year.
Intel doing -5% to 10% every year.

Meanwhile, Apple & ARM are doing 10 - 20%+ every year and including accelerators like machine learning.

M1 runs circles around anything AMD and Intel have. M1X and M2 will allow Apple to claim performance wins across all consumer computing devices. Can't wait for the 32/64 core Mac Pros too. It's going to be ugly for AMD/Intel.
SarahKerrigan - Tuesday, May 25, 2021 - link
I would be hesitant to lump in Apple and ARM, given how far apart the highest-performing shipping licensables and the highest-performing shipping Apple cores are.

ARM is still a long way from matching peak AMD or Intel ST (not merely iso clock, where they do okay, but absolute) in any shipping product, and honestly, neither A710 nor X2 look especially groundbreaking. A510 looks really good, but mixed with a certain amount of "well, about frigging time."
ikjadoon - Tuesday, May 25, 2021 - link
I agree on point 1, sadly. The X1 earns 40 points on SPEC2006 1T Geomean, while the A14 broke 70 points and A13 is 59 points.

The X2 vs A15 battle will be interesting in terms of power, but the X2 will likely be slower than the A13.

On the second, isn’t the A510 four years late, with an almost identical power vs performance curve to the A55? Personally, I thought it was the smallest and saddest announcement today.

The only genuine A510 improvement is at the A55’s worst position / peak power: 10% faster for 20% less power. That’s four years later.

The rest of A510 power vs performance is by ramping up the power budget. That +10% perf for -20% power = 37.5% increase in perf-power over four years = 8% perf-power improvements per year. ;(

If they are sticking with in-order, I hoped the A510 could’ve done something more over four years.
Raqia - Tuesday, May 25, 2021 - link
Apple will rule the roost for the next year, at least until Nuvia's Phoenix cores make their debut some time in the second half of 2022 (that announced timeline likely means the design has taped out...) The cache hierarchy of Apple CPU complexes is simpler and has fewer levels than what ARM's is capable of, which reflects the scope of their respective ambitions. ARM's hierarchy hobbles performance at mobile device scales but has much more headroom for supercomputing or server scale compute.
Wilco1 - Tuesday, May 25, 2021 - link
Your numbers are off. AnandTech's SPECINT2006 results are 63.34 for A14 and 41.3 for SD888: https://images.anandtech.com/doci/16463/SPEC-power...

TSMC 5nm offers ~15% speedup over 7nm, so 3.3-3.5GHz may be feasible (compared to 3.1GHz for SD865+ on 7nm), and that should get Cortex-X2 scores in the high 50's, close to the A14.

As for efficiency, it's unrealistic to expect major gains when starting from an already very efficient design. It's the same with performance, you can't expect a doubling of ST performance every few years like in the past.
