The CMN-700 Mesh Network - Bigger, More Flexible

It’s been a whole five years since we last wrote about Arm’s Coherent Mesh Network in its current-generation form, the CMN-600. The IP was announced quite some time ago and has been a mainstay of Arm’s infrastructure portfolio ever since, going through several revisions along the way, with r2 introducing important changes such as larger caches and CCIX capability.

Along with the V1 and N2, Arm today is also announcing a new generation CMN product in the form of the new CMN-700, promising much larger improvements to how Arm’s mesh network operates and what it is capable of in terms of scalability, performance, and flexibility.

Starting off with the basic characteristics of the new design, the biggest new feature is that the mesh has grown from a limit of 8 x 8 nodes (64) to 12 x 12 (144), allowing Arm to increase the number of CPUs on a single mesh and silicon die.

Terminology:

  • RN-F: Fully coherent Request Node – Typically a CPU core, a CAL with two cores, or a DSU cluster
  • HN-F: Fully coherent Home Node – A block of SLC cache with Snoop Filter
  • CAL: Component Aggregation Layer – A block that houses two CPU cores connecting to one RN-F port

The actual maximum number of cores in a mesh has grown from 64 to 256, the latter number achievable through 128 RN-F request nodes each with 2 cores through a CAL (Component Aggregation Layer). For attentive readers, it might be weird to see Arm say that the CMN-600 only supports up to 64 cores when we have 80-core designs such as the Altra out there. Arm explained that the 64-core limit applies to native cores connected to RN-Fs directly or through CALs, and that it’s actually possible to host more cores when you integrate them into the mesh through DSUs (DynamIQ Shared Units). Ampere never confirmed their mesh layout, but this seems to be the only explanation of how they’d achieve a core count that high on the CMN-600.

Alongside 128 RN-Fs hosting up to 256 cores, the mesh supports up to 128 HN-F home nodes, the nodes in which the SLC (System Level Cache) resides. Arm here discloses a maximum SLC of up to 512MB per die, meaning 4MB per node, while oddly enough stating that the CMN-600 only supports 128MB. That is technically incorrect, given that the reference manual says it goes up to 256MB, at 4MB per node across 64 nodes.
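As a quick sanity check of the limits quoted above, the arithmetic can be sketched in a few lines of Python. The function and parameter names are purely illustrative and not anything Arm defines; the inputs are just the figures from the text (two cores per CAL, 4MB of SLC per HN-F).

```python
# Back-of-the-envelope check of the mesh limits quoted in the article.
# Names are hypothetical; only the numbers come from the text.

def mesh_nodes(side: int) -> int:
    """Number of nodes in a square side x side mesh."""
    return side * side

def max_cores(rn_f_nodes: int, cores_per_cal: int = 2) -> int:
    """Cores reachable when every RN-F hosts one CAL."""
    return rn_f_nodes * cores_per_cal

def max_slc_mb(hn_f_nodes: int, mb_per_node: int = 4) -> int:
    """Aggregate SLC capacity in MB across all HN-F nodes."""
    return hn_f_nodes * mb_per_node

print(mesh_nodes(8), mesh_nodes(12))  # 64 144
print(max_cores(128))                 # 256 cores on the CMN-700
print(max_slc_mb(128))                # 512 MB on the CMN-700
print(max_slc_mb(64))                 # 256 MB, the CMN-600 reference manual figure
```

The last line is the reason the 128MB figure quoted for the CMN-600 looks off: 64 nodes at 4MB each already gives 256MB.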

In both cases, the SLC figures are a bit extreme and one shouldn’t expect designs with such sizes anytime soon.

Current-generation Graviton2 and Altra Q chips only feature 32MB of SLC in their mesh designs. One reason for this, which we haven’t talked about in the past, is that beyond the actual SLC, the HN-F nodes in the mesh also contain snoop filter caches that have particularly high size requirements. Arm states that the snoop filters generally need to be at least 1.5x the size of the aggregate exclusive caches of the cores, so in the case of the Altra Q with 80 cores and 1MB of L2 per core, that’s at least 120MB of required snoop filter caches on the mesh, alongside the 32MB of SLC. This could very well explain why the SLCs are so small compared to, say, what AMD and Intel employ – the former for example using shadow tags of the L2s for coherency (and the IOD having shadow tags of the CCD L3s). It seems Arm’s design here is less area efficient.
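The snoop filter sizing rule is easy to put into numbers. The sketch below is only an illustration of the 1.5x rule of thumb as quoted from Arm, applied to the Altra Q figures in the text; the helper name and its parameters are assumptions made for this example.

```python
# Minimum snoop filter coverage under the "at least 1.5x the aggregate
# exclusive cache" rule of thumb quoted in the article.
# min_snoop_filter_mb is a hypothetical helper, not an Arm-provided formula.

def min_snoop_filter_mb(cores: int, private_cache_mb_per_core: float,
                        ratio: float = 1.5) -> float:
    """Snoop filter coverage (MB) needed for the cores' exclusive caches."""
    return cores * private_cache_mb_per_core * ratio

altra_q_sf = min_snoop_filter_mb(cores=80, private_cache_mb_per_core=1.0)
altra_q_slc = 32  # MB of SLC, as stated in the article

print(f"{altra_q_sf:.0f}MB of snoop filter coverage")            # 120MB
print(f"{altra_q_sf / altra_q_slc:.2f}x the SLC capacity")       # 3.75x
```

Under these assumptions the snoop filters have to track almost four times as much capacity as the SLC itself holds, which is the area cost the paragraph above is pointing at.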

The maximum number of memory controller ports (CHI SN-F nodes) in the mesh has been greatly increased from 16 to 40, as Arm envisions more expansive mixed memory system architectures being employed in these newer designs.

Finally, CCIX ports have also seen a massive increase from 4 to 32, critical for some of the disaggregated chiplet designs that are also expected to be deployed – more on that in a bit.

In terms of memory capabilities, we noted that Arm expects hybrid architecture designs which would not only employ many more DDR memory controllers, but also integrate HBM memory. SiPearl’s Rhea chip is again such a confirmed design, with 4x HBM2E stacks and 4-6 DDR5 memory controllers. The CMN-700 would be able to deal with such memory arrangements and properly manage the bandwidth and traffic across the heterogeneous memory architectures.

Arm quotes a 3x increase in cross-sectional bandwidth in the mesh. Part of this is achieved through generationally higher mesh frequencies, but most importantly, the new design now also allows for doubled mesh channels between nodes. A mesh channel is still 256b wide with dedicated read and write ports, so a doubled-up design is essentially 2x256b in each direction. Arm discloses mesh frequencies of around 2GHz, so a 12x12 mesh with doubled-up channels, if my math isn’t wrong, would result in a cross-sectional bandwidth of around 3TB/s.
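One way to arrive at a figure in that ballpark is sketched below. It rests on assumptions that are not spelled out by Arm: that a cut through the middle of a 12x12 mesh crosses 12 links, that each link carries two 256b channels per direction, and that both directions are counted. The function name and parameters are made up for this example.

```python
# Rough cross-sectional (bisection) bandwidth estimate for a square mesh.
# Assumptions, not confirmed figures: one link per row crosses the cut,
# and traffic in both directions is counted towards the total.

def cross_section_bandwidth_gbs(mesh_side: int,
                                channel_bits: int = 256,
                                channels_per_link: int = 2,
                                freq_ghz: float = 2.0,
                                directions: int = 2) -> float:
    """Estimated bandwidth in GB/s across a cut through the mesh middle."""
    bytes_per_cycle = (channel_bits // 8) * channels_per_link  # per direction
    per_link_gbs = bytes_per_cycle * freq_ghz                  # GB/s, one direction
    return mesh_side * per_link_gbs * directions

doubled = cross_section_bandwidth_gbs(12)                       # doubled channels
single = cross_section_bandwidth_gbs(12, channels_per_link=1)   # single channels

print(doubled)  # 3072.0 GB/s, i.e. roughly 3TB/s
print(single)   # 1536.0 GB/s
```

Under these assumptions the doubled-channel configuration lands at about 3TB/s, with half of that available when only single channels are used.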

We asked Arm if the new mesh would be capable of more exotic 3D routing in terms of the direct interconnections between the nodes, but alas for this generation it’s still “only” limited to a 2D layout.

As noted in the V1 system features, CBusy is a new CPU-Mesh feature to alleviate mesh traffic congestion under high load by varying the CPU’s traffic requirements. There are also general traffic improvements, such as combining operations to reduce the number of transactions, or more straightforward optimisations such as data-less writes to pages (writing a page to all 0’s can be done with only one transaction, instead of writing zero to each cache line).
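To get a feel for what the data-less write saves, here is a small sketch comparing transaction counts. The 4KB page and 64B cache line sizes are assumptions chosen for illustration (the article does not state them), and the function is hypothetical rather than a model of the actual CHI transactions.

```python
# Transactions needed to zero a page, with and without a data-less write.
# Page and cache line sizes are assumed values for illustration only.

def zeroing_transactions(page_bytes: int = 4096, line_bytes: int = 64,
                         dataless: bool = False) -> int:
    """Count write transactions needed to set a whole page to zero."""
    if dataless:
        return 1                     # one transaction covers the page
    return page_bytes // line_bytes  # one write per cache line

print(zeroing_transactions())               # 64
print(zeroing_transactions(dataless=True))  # 1
```

With these assumed sizes, the mesh sees one transaction instead of 64, and none of them has to carry a payload of zeros.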

MPAM, again as explained in the earlier CPU section, helps manage traffic across independent workloads on a system such as VMs, ensuring QoS for SLA requirements and general resource allocation across the entities in the system.

The current-generation CMN-600 already had support for CCIX 1.1, which has been employed in designs such as the Altra Q. The CMN-700 now also introduces CCIX 2.0, as well as CXL compatibility.

Besides operating with coherent accelerators over PCIe, Arm also sees memory disaggregation being a thing in the future, where we would see large pools of memory addressable by both the CPU clusters as well as the compute or accelerator nodes in a coherent fashion.

CCIX 2.0 is important for future multi-die and multi-socket designs, as it is allowed to drop the PCIe transaction and physical layers in favour of a more closed, generic link layer and PHY. One big disadvantage of the previous-generation implementation, for example in multi-socket systems, was that it incurred tremendous latency penalties in crossing all these different layers and protocols. We’ve seen the effect of this in our core-to-core test in the review of the Ampere Altra, where the chip fared terribly in this regard.

The new CMN-700 and CCIX 2.0 connectivity promises to solve those very high latencies, as well as the behaviour of requesting a remote socket cache line when talking between two cores in a local socket. This is not only important for socket-to-socket communications but also directly applies to chiplet-to-chiplet designs. It’s to be noted that Arm designs here still have to translate between AMBA CHI and CCIX 2.0 for such traffic, and whilst it’s much improved over what we’ve seen in the CCIX 1.1 implementations, it’s likely still not quite as performant as the fully native protocol handling of, for example, Intel and AMD solutions.

In multi-chip systems, the disaggregation of memory through NUMA domains can result in performance hits when working on the same data. One way to alleviate such scenarios is to divide the home node coherency across two chips (this is why the CMN-700 is advertised as supporting up to 512 cores in a “system”). This has disadvantages, as the multi-chip link can create a bottleneck, but it also makes it possible to architect exotic designs such as pooled memory with equal access between two chips.

By now most readers will be familiar with AMD’s chiplet approach, and it’s a general architecture most vendors are heading towards given the slow-down of Moore’s law. Arm’s CMN-700 also allows for designs that are eerily similar to what AMD uses today, where a system can have a central IO Hub along with auxiliary compute dies.

We can have more traditional chip designs simply interconnected to each other, or more exotic designs with possibly heterogeneous chiplet architectures.

In the latter, Arm introduces the notion of a “Super Home Node” which acts as the central coherency point. In essence, this is simply just another mesh, but in theory it could be operated with no cores and simply just house an SLC (or none), and the central snoop filter handling coherency across all cores. In such an architecture, the SLC within that die would act as an L4 while the SLC in the mesh of a chiplet would act as an L3. There’s a bit of a mish-mash in terms of terminology here as we’re adding layers and chiplets, but I hope most will understand the hierarchy.


95 Comments

  • mode_13h - Thursday, April 29, 2021 - link

    Uh...

    > 2013, they had the A7 (tiny), A15 (small), and A57
    > Then ARM made the leap into 64bit processing around 2016.

    A57 is a 64-bit core.

    > Contrast that to the new x86 competition in AMD

    No. Why would we do that? They were competing in totally different markets, at the time. The only partial overlap was embedded Ryzen.

    > There hasn't been any upgrades for the "tiny" portfolio, being stuck to ... Cortex A35 CPU
    > There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU

    The A35 and A55 both launched in 2017.

    > they're a joke, and easily surpassable by the competitors.

    In terms of what? PPA? Perf/W? Perf/$? Might want to be sure you're comparing apples to apples and not comparing competing "small" core with ARM "tiny".

    > There hasn't been any new "large" category for iGPUs from ARM or competitors

    Samsung is using RDNA and MediaTek is licensing a Nvidia GPU for its upcoming SoCs.

    Might want to do a little more research, before writing another longpost. I agree that A55 could use a refresh, but ARMv9 will force that, anyway. I don't even know where A35 is used, but same story, there.

    It's worth noting that ARM has also been active in the microcontroller market, with both 32-bit and 64-bit offerings.
  • Kangal - Friday, April 30, 2021 - link

    Firstly, apologies.
    I know the A57 is 64bit, but there have been many (most?) implementations of it running in 32bit mode. The A57 was really a "rough draft" for ARM, in moving towards both "medium" sized cores and into 64bit computing. Hence, it feels more at home next to its A7 and A15 brethren.

    The contrast is there, and necessary to show the landscape of the time. The tech industry is a fast-paced one. And if your code/calculations are agnostic, so that they can run on any platform, you would consider all options (not that I recommend people go creating agnostic code, compared to specialized or hardware-accelerated code).

    The Cortex A35 launched in 2015. It's long due for an upgrade, or replacement. Where this core likes to be in is in small, low-power, and cheap devices. In particular the microcontroller market as you mentioned. ARM hasn't been as active in this field as you think they have, with many of the products being custom designs from the ODMs.

    I already mentioned the A55 was a slight refresh for the A53, and that itself is also surpassed. Have a look at Apple's "small" cores. They are Out-of-Order processors, they are slightly faster than an A73, they use slightly less power than an A53. It's mind boggling. Others disagree, and say they're actually faster than A75, and more efficient than A55... but at this scale we're splitting hairs. With that much room for difference, it's not inconceivable (heck it's likely) that an outside competitor like RISC-V will surpass the A55 in terms of Perf/W, Perf/PPA, Perf/$, or a combination of the lot. And remember, the Cortex-A53 is the most popular core out there, where it's getting stamped out on so many different Chinese products.

    Samsung isn't using Radeon iGPUs YET, and neither is MediaTek. Besides, we have yet to see them in the wild and find out details of their architecture. These might be licensed from AMD or Nvidia, but they might be "small" iGPUs instead of "large" iGPU designs. I did forget to mention that the Tegra X1, and some Nvidia SBCs, did actually use their "large" iGPU architecture (ie Maxwell etc).

    The gist of my rant is that ARM was a revolutionist early on, basically creating the market. Then they were extremely innovative and competitive, basically dominating the market. Now they are competitive but not as revolutionary nor as competitive/innovative as they used to. With ARMv9 they have a chance to start fresh, and return to status quo, by having a trifecta of products for the computing industry. I was pointing out the gaps in their history and portfolio. They shouldn't just focus on mobile phones, that's boring.
  • mode_13h - Friday, April 30, 2021 - link

    > The Cortex A35 launched in 2015.

    Okay, the date I saw was wrong. It seems to have been announced in November 2015. The A55 seems to have been announced in May 2017.

    > this core likes to be in is in small, low-power, and cheap devices.
    > In particular the microcontroller market as you mentioned.

    They have actual microcontrollers, though. The A35 is still too power-hungry (and expensive?) for most IoT devices.

    > Have a look at Apple's "small" cores.

    You focus on performance and efficiency, but what about area? Apple has a narrower focus and different process, cost, & area targets than ARM.

    The point we can definitely agree on is that ARM's bottom & middle tier cores should've been refreshed more frequently. But, everyone seems to think that ARM is directly competing with Apple, but it's not. Their objectives meaningfully differ, resulting in ARM probably being driven more towards making smaller cores than Apple.

    It's only at the top end of their mobile stacks that you can really say ARM and Apple are in direct competition. However, even on something like the A78, ARM is still put in a position of having to make compromises that Apple isn't.

    > ARM was a revolutionist early on, basically creating the market.
    > Now they are competitive but not as revolutionary nor as competitive/innovative as they used to.

    That's how these things work. A small upstart has a lot of freedom. The bigger a company gets, the more constrained it becomes by its customers, its market, the cost of changing, and the downside risk. I'm still just not totally convinced that entirely explains what we're seeing.

    If they can manage to cleave their server cores entirely from their mobile cores, and then really make big cores that are performance-first (instead of scaled up versions of mostly-performance cores, like the X1 and A78 situation), then we might see them start to compete at Apple's level. Basically, to compete they'd have to start by designing the X1 first, and then make the A78 by putting it on a diet.

    > They shouldn't just focus on mobile phones, that's boring.

    LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
  • grant3 - Saturday, May 1, 2021 - link

    > LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.

    Focusing on the same-ol' same-ol' business is exactly how once-profitable companies fade into irrelevance as technology moves on. Plenty of mediocre CEOs do that.

    A great CEO can find the future revenue opportunities and prove it to the company's owners.
  • mode_13h - Sunday, May 2, 2021 - link

    Yeah, but you can't afford to walk away from your bread and butter. Any new growth areas you pursue can't come at the expense of revenues in your core business. If you even threatened to starve your core business, you'd be out of a job before your new ambitions could ever get off the ground.

    Just look at what happened with Qualcomm, they tried to invest in new areas, but their investors absolutely wouldn't tolerate it. Granted, they're more exposed than ARM would be, either under Soft Bank or Nvidia.
  • Kangal - Sunday, May 2, 2021 - link

    No, grant3 is exactly right.

    What you said is EXACTLY what Blockbuster said before they went bankrupt. In case you didn't know, the board members passed the opportunity to buy Netflix for $50 Million. The CEO then tried to right that wrong by acquiring another competitor, and shifting their revenue stream. The board fired their CEO, saying that their late-fee revenue was the bread and butter of their business model. Blockbuster was too narrow focused and stuck in the past, that not only did they miss the opportunity of becoming a whole new behemoth, but they sunk their own ship at the same time.
  • mode_13h - Sunday, May 2, 2021 - link

    > What you said is EXACTLY what Blockbuster said before they went bankrupt.

    If grant3 is saying that Blockbuster should close half its stores while they're still profitable, to divert money into R&D on getting into the (then) almost non-existent streaming market, no company in the world would do that.

    Now, it's not like ARM is ignoring other markets, of course. They just can't turn their back on the mobile market, in order to do so.

    > Blockbuster was too narrow focused and stuck in the past

    The genius of capitalism is that the failure of Blockbuster to transition into a streaming platform didn't keep streaming from happening. Its investors could even get in on the game by shifting their investments into players in the streaming market. If the CEO was such a believer, he could've quit and gone to work for a streaming company or founded his own.

    Also, let's not forget that there have already been losers in streaming, and it wasn't clear Netflix would've successfully made the transition from movies-by-mail. Who remembers Google Video? Yahoo even bought some company in the space. And just last year, there was Quibi. I'm sure there are others I'm forgetting.

    I think we all want to see ARM succeed outside of mobile. They've been investing a lot, in order to do so. Some in this very thread have been complaining about their lack of focus on their smaller, lower-power cores (currently A35 & A55), which you could see as evidence they've already been making sacrifices to try and compete outside their niche. I don't know if that's accurate, but it's plausible.

    If Nvidia's acquisition goes through (as I expect it will), I hope and expect it will provide ARM with the funds to do even more ambitious things.
  • Spunjji - Friday, April 30, 2021 - link

    That's a sound argument for that expectation - it's definitely long since past time for an update.
  • dotjaz - Tuesday, April 27, 2021 - link

    Why would you need rumours when we know for a FACT that there will be an A55 successor unless b.L design is abandoned for no good reason. I'll give you a hint, b.L can't have mixed architectures that's why big cores stayed at ARMv8.2a for so long.
  • eastcoast_pete - Tuesday, April 27, 2021 - link

    Maybe the shift to ARMv9 will force ARM's hand with giving the LITTLE cores out-of-order designs; however, current bigLITTLE designs already mix big, out-of-order designs with LITTLE in-order cores like the A55. So, bL can and has worked with mixed architectures for quite a while. However, I hope you are correct in that the shift to ARMv9 will force the issue, and we'll finally get out-of-order LITTLE cores also on non-Apple devices
