Conclusion & First Impressions

Today’s Arm Client TechDay disclosures were generally quite a lot more extensive than in the last few years, especially given the number of new IP releases we’ve covered. Three new CPU microarchitectures, a new DSU/L3 cluster design, and two new SoC interconnect IPs is quite a bit more than we’re used to, and it goes to underscore just how much effort Arm is putting into updating all of the parts of its client IP.

Starting off with the CPUs, the new Cortex-X2 and Cortex-A710 cores are meant to be iterative designs compared to their predecessors, and that's certainly what they are from a performance and efficiency viewpoint. On a generational basis, Arm is promising a 10-16% improvement in IPC. However these figures are somewhat muddled by the fact we’re also comparing 4MB and 8MB L3 caches. Generally, it’s a reasonable expectation of what we’ll be seeing in 2022 devices, but it’s also hard to disambiguate and attribute the performance of the cores versus that of the new DSU-110 L3 cluster design.

Arm has also made some more lofty performance claims when it comes to actual device implementations in 2022, such as +30% peak-to-peak performance boosts on the parts of the X2 cores. Generally, given our expectations that both the next Snapdragon and the next Exynos flagships will come in a similar Samsung foundry process node with smaller improvements, I’m very doubtful we’ll be seeing such larger generational improvements in practice, unless somehow MediaTek surprises us with a flagship X2 SoC made out at TSMC.

While the X2 and A710 aren’t all that groundbreaking, we have to note that the move towards Armv9 brings a lot of new architectural features that would otherwise eat into the expected yearly performance or efficiency improvements. The move to the new ISA baseline has been a long time coming and I’m curious to see what it will enable in terms of media applications (SVE) or AI (new ML instructions).

This is also the fourth and last iteration of Arm’s Austin core family, so hopefully next year’s new Sophia family will see larger generational leaps. Arm admits that we’re nearing diminishing returns and it’s certainly not at the same break-neck pace it was moving a few years ago, but there’s still a lot which can be done.

Today we also saw the unveiling of a brand-new little core in the form of the Cortex-A510. A new clean-sheet design from the Cambridge team, it’s certainly using an innovative approach given its “merged core” design, sharing the L2 cache hierarchy and the FP/SIMD back-end amongst two otherwise full featured cores. The performance and IPC gains are claimed to be quite large at +35-50%, however it seems that this generation hasn’t improved the efficiency curve all that much. It’s still a much better design and will have effective benefits for power efficiency in real-world workloads due to how workloads interact between the little and larger cores, but leaves us with a feeling that it doesn’t provide a knock-out convincing jump we had expected after 4 years. The silver lining here is that Arm is promising further generational improvements in performance and power with subsequent iterations, so we won’t be left with the current state of affairs the same way we saw the Cortex-A55 stagnate.

One of the more key points I saw Arm put their focus on was the new possibilities in larger form-factor devices beyond mobile. The new DSU-110 now supports up to 8 Cortex-X2 cores, a theoretical setup that would pretty much blow away the current Cortex-A76 based Arm laptop SoCs such as the Snapdragon 8cx family. The new cluster design allows for large L3 caches of up to 16MB, and while I don’t know if we’ll see the new interconnect IPs used by the larger vendors, it surely also makes a big argument for larger performance designs. The catch is that if Qualcomm were to adopt and make such a design, it would seemingly be short-lived given their recent Nuvia acquisition and intent on using custom cores. Otherwise, because of a lack of Mali Windows drivers, this really only leaves space for a theoretical Samsung laptop SoC with AMD RDNA GPU, but such a SoC could nonetheless be very successful.

Overall, this year’s CPU and system IP announcements from Arm are extremely solid new IP offerings, really laying down a new foundation, both architecturally with Armv9, and microarchitecturally thanks to elements such as the new DSU and the new little core CPUs. We’re looking forward to the new 2022 SoCs and products that will be powered by the new Arm IP.

A new CI-700 Coherent Interconnect & NI-700 NoC For SoCs
Comments Locked

181 Comments

View All Comments

  • RSAUser - Wednesday, May 26, 2021 - link

    Basically interesting for cases when you don't want to add an A73, e.g. It's pretty big news in the watch space where it's been the same 4/5yo architecture for a very long time.
  • mode_13h - Thursday, May 27, 2021 - link

    > It's pretty big news in the watch space

    I'm actually surprised people are even using A55s in smartwatches, or that ARM is targeting the A510 at them. I'd figured the most they could get away with would be the A35.

    I guess pairing a couple A55s with some A35s might be a way to get responsiveness *and* battery life. Is that something people do?
  • mode_13h - Wednesday, May 26, 2021 - link

    It'd be interesting to see how efficient the A73 would be, if you dropped its clock to match the A510's performance.
  • AntonErtl - Thursday, May 27, 2021 - link

    Yes. ARM gives some flowery wordings for the lower performance of the A510 compared to A73 (and Andrei reworded in the way ARM wants us to think: "very similar IPC and frequency capabilities whilst consuming a lot less power"; looking at the numbers given by ARM, the A510 has >20% less performance than the A73, at 35% less power. The DVFS stuff I have seen makes me expect that the A73 has the same or lower power at the same performance, if you lower the clock by 20% (or whatever the slowness factor of the A510 is).

    Andrei already showed us in his Exynos 9820 review that the A75 has better Perf and Perf/W for nearly all of the performance range of the A55. So I find it surprising that ARM went for another in-order design for the little core of ARMv9, instead of something like an ARMv9-enabled A75. For me it will certainly be an interesting microarchitecture to study, but I guess it will take some time until it appears in some Odroid or Raspi board.
  • mode_13h - Saturday, May 29, 2021 - link

    > the A75 has better Perf and Perf/W for nearly all of the performance range of the A55.
    > So I find it surprising that ARM went for another in-order design for the little core of ARMv9

    You're forgetting about PPA, though. The A510 is probably a lot smaller (ISO-process) than the A75.

    > I guess it will take some time until it appears in some Odroid or Raspi board.

    Look for A76-enabled SBCs late this year or early next. Rockchip's RK3588 will have 4x A76.

    Raspberry Pi will probably be stuck on A72 or A73 for a couple more generations, since they plan to stay on 28 nm, for a while. Meanwhile, the Allwinner SoC in ODROID's N2 is made on 12 nm.
  • AntonErtl - Sunday, May 30, 2021 - link

    Looking at the Exynos 9820 die shot, te A55 is ~3.4 times smaller than the A75, but it also has ~3.4 times lower top performance and a similar factor at the lowest common perf/W point, and from the looks of the line, in between. I doubt that the A510 is better in perf/area. But maybe it's the difference that ARM is claiming between the workloads Andrei used for evaluating performance (SPEC CPU2006) and what the A55 and A510 are doing in practice; if they mainly wait for peripherals, I can believe that their performance does not matter much.

    Thanks for the info on SBCs to be expected.
  • Wereweeb - Wednesday, May 26, 2021 - link

    I'll ignore all the warfare in the comments, and just say this: imagine a 16-'core' A510 SoC. Sorry.
  • mode_13h - Wednesday, May 26, 2021 - link

    So, if you built a HPC CPU with A510 @ one core per complex, 2x 128-bit SVE2, and max L2 cache, how would area-efficiency (PPA) and power-efficiency (PPW) compare with a V1-based chip on the same node?

    Let's assume the workload has enough concurrency to scale up to all the A510 cores, and that there's enough ILP that the A510's lack of OoO isn't a significant impediment.
  • Shakal - Thursday, May 27, 2021 - link

    Pardon my ignorance but what exactly is an "Alternate path predictor"? They mention that for the X2 core but I've not found any reference to what it is. I've heard of path based predictors but how does the alternate come into play?
  • ballsystemlord - Friday, May 28, 2021 - link

    Spelling and grammar errors (there are lots!):

    I read through everything but the conclusion.

    "From a microarchitectural standpoint this is interesting as it means Arm will have been able to kick out some cruft in the design."
    "has", not "have" and subtract "will":
    "From a microarchitectural standpoint this is interesting as it means Arm has been able to kick out some cruft in the design."

    "Even though it's a in-order core,..."
    "an" not "a":
    "Even though it's an in-order core,..."

    "...and since then we haven't had seen any updates to Arm's little cores, to the point of it being seen as large weakness of last few generations of mobile SoCs."
    You need an "a" and subtract "had":
    "...and since then we haven't seen any updates to Arm's little cores, to the point of it being seen as a large weakness of last few generations of mobile SoCs."

    "The new design if a clean-sheet microarchitecture from Arm's Cambridge team which the engineers had been working on the past 4 years, ..."
    "is" not "if":
    "The new design is a clean-sheet microarchitecture from Arm's Cambridge team which the engineers had been working on the past 4 years, ..."

    "... the performance impact and deficit is said to only a few percent versus having a pipeline dedicated for each core."
    Add a "be":
    "... the performance impact and deficit is said to be only a few percent versus having a pipeline dedicated for each core."

    "The dual-ring structure is used to reduce the latencies and hops between ring-stops and in shorten the paths between the cache slices and cores."
    "to", not "in":
    "The dual-ring structure is used to reduce the latencies and hops between ring-stops and to shorten the paths between the cache slices and cores."

    "Architecturally, one important change to the capabilities of the DSU-110 is support for MTE tags, a upcoming security and debugging feature promising to greatly help with memory safety issues."
    "an" not "a":
    "Architecturally, one important change to the capabilities of the DSU-110 is support for MTE tags, an upcoming security and debugging feature promising to greatly help with memory safety issues."

    "The SLC can server as both a bandwidth amplifier as well as reducing external memory/DRAM transactions, reducing system power reduction."
    "serve", not "server" and "consumption", not "reduction":
    "The SLC can serve as both a bandwidth amplifier as well as reducing external memory/DRAM transactions, reducing system power consumption."

    "Overall, the new system IP announced today is very interesting, but the one question that's one has to ask oneself is exactly who these net interconnects are meant for."
    Excess "'s". Refactoring makes more sense.
    "Overall, the new system IP announced today is very interesting, but we have to ask who exactly these net interconnects are meant for."

Log in

Don't have an account? Sign up now