The CMN-700 Mesh Network - Bigger, More Flexible

It’s been a whole 5 years since we last wrote about Arm’s Coherent Mesh Network, the current-generation CMN-600. The IP was announced quite some time ago and has been a mainstay of Arm’s infrastructure portfolio ever since, seeing several revisions along the way, with r2 introducing important changes such as larger caches and CCIX capability.

Along with the V1 and N2, Arm today is also announcing a new generation of CMN product in the form of the CMN-700, promising much bigger improvements to how Arm’s mesh network operates and to what it is capable of in terms of scalability, performance, and flexibility.

Starting off with the basic characteristics of the new design, the big new feature is that the mesh has grown from a limit of 8 x 8 nodes (64) to 12 x 12 (144), allowing Arm to increase the number of CPUs on a single mesh and silicon die.

Terminology:

  • RN-F: Fully coherent Request Node – Typically a CPU core, a CAL with two cores, or a DSU cluster
  • HN-F: Fully coherent Home Node – A block of SLC cache with Snoop Filter
  • CAL: Component Aggregation Layer – A block that houses two CPU cores connecting to one RN-F port

The actual maximum number of cores in a mesh has grown from 64 to 256, the latter number achievable through 128 RN-F request nodes each with 2 cores through a CAL (Component Aggregation Layer). For attentive readers, it might be weird to see Arm say that the CMN-600 only supports up to 64 cores when we have 80-core designs such as the Altra out there. Arm explained that the 64-core limit applies to native cores connected to RN-Fs or through CALs, and that it’s actually possible to host more cores when you integrate them into the mesh through DSUs (DynamIQ Shared Units). Ampere never confirmed their mesh layout, but this seems to be the only explanation of how they’d achieve a core count that high on the CMN-600.
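To put the headline scaling figures into back-of-the-envelope form, here’s a minimal Python sketch; the grid and RN-F counts come from Arm’s disclosures above, while packing two cores behind every RN-F via a CAL is a best-case assumption rather than a requirement of the mesh.

```python
# Back-of-the-envelope check of the headline scaling figures quoted above.
# Grid and RN-F counts are from Arm's disclosure; two cores per RN-F assumes
# every request node sits behind a CAL (best case, not a requirement).

def crosspoints(x: int, y: int) -> int:
    return x * y

def max_cores(rnf_nodes: int, cores_per_rnf: int = 2) -> int:
    return rnf_nodes * cores_per_rnf

print(crosspoints(8, 8), "->", crosspoints(12, 12))  # 64 -> 144 mesh crosspoints
print(max_cores(128))                                # 128 RN-Fs x 2 cores = 256 cores
```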

Alongside 128 RN-Fs hosting up to 256 cores, the mesh hosts up to 128 HN-F home nodes, the nodes in which the SLC (System Level Cache) resides. Arm here discloses a maximum SLC of up to 512MB per die, meaning 4MB per node, while oddly enough stating that the CMN-600 only supports 128MB – which is technically incorrect, given that the reference manual says it goes up to 256MB at 4MB per node across 64 nodes.
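The SLC ceilings fall straight out of the per-node figure; a quick sketch, with the node counts being the per-generation maxima quoted above rather than shipping configurations:

```python
# Maximum SLC implied by 4MB per HN-F node; node counts are the per-generation
# maxima quoted above, not what shipping designs actually implement.

slc_per_hnf_mb = 4

print(64  * slc_per_hnf_mb)   # CMN-600: 64 HN-Fs  -> 256MB (per the reference manual)
print(128 * slc_per_hnf_mb)   # CMN-700: 128 HN-Fs -> 512MB
```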

In both cases, the SLC figures are a bit extreme and one shouldn’t expect designs with such sizes anytime soon.

Current-generation Graviton2 and Altra Q chips only feature 32MB of SLC in their mesh designs. One reason for this, which we haven’t talked about in the past, is that beyond the actual SLC, the HN-F nodes in the mesh also contain snoop filter caches with particularly high capacity requirements. Arm states that generally the snoop filters need to be at least 1.5x the size of the aggregate exclusive caches of the cores, so in the case of the Altra Q with 80 cores and 1MB of L2 per core, that’s at least 120MB of required snoop filter capacity on the mesh, alongside the 32MB of SLC. This may well explain why the SLCs are so small compared to, say, what AMD and Intel employ – the former, for example, uses shadow tags of the L2s for coherency (and the IOD holds shadow tags of the CCD L3s). It seems Arm’s design here is less area efficient.
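As a rough illustration of that 1.5x rule of thumb, here’s a small sketch with the Altra Q numbers from above plugged in; the helper is purely illustrative, not an Arm sizing formula.

```python
# Back-of-the-envelope snoop filter sizing using the ~1.5x rule of thumb
# quoted above. Purely illustrative; not an Arm sizing tool.

def min_snoop_filter_mb(cores: int, exclusive_cache_per_core_mb: float,
                        factor: float = 1.5) -> float:
    return cores * exclusive_cache_per_core_mb * factor

print(min_snoop_filter_mb(cores=80, exclusive_cache_per_core_mb=1.0))
# -> 120.0 MB of snoop filter, on top of the Altra Q's 32MB of SLC
```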

The maximum number of memory controllers (CHI SN-F nodes) in the mesh has been greatly increased from 16 to 40 ports, as Arm envisions more expansive mixed-memory system architectures to be employed in these newer designs.

Finally, CCIX ports have also seen a massive increase from 4 to 32, critical for some of the disaggregated chiplet designs that are also expected to be deployed – more on that in a bit.

In terms of memory capabilities, we noted that Arm expects hybrid architecture designs which would not only employ many more DDR memory controllers, but also integrate HBM memory. SiPearl’s Rhea chip is again such a confirmed design with 4x HBM2E stacks and 4-6 DDR5 memory controllers. The CMN-700 would be able to deal with such memory arrangements and properly manage the bandwidth and traffic across the heterogeneous memory architectures.

Arm quotes a 3x increase in cross-sectional bandwidth in the mesh. Part of this is achieved through generationally higher mesh frequencies, but most importantly the new design now allows for doubled mesh channels between nodes. A mesh channel is still 256b wide with dedicated read and write ports, so a doubled-up design is essentially 2x256b in each direction. Arm discloses mesh frequencies of around 2GHz, so a 12x12 mesh with doubled-up channels, if my math isn’t wrong, would result in cross-sectional bandwidth of around 3TB/s.
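For those wanting to check the math, here’s a rough estimate under my stated assumptions: a bisection of a 12x12 mesh crosses 12 links, each carrying two 256b channels per direction, at a ~2GHz mesh clock.

```python
# Rough cross-sectional bandwidth estimate for a 12x12 CMN-700-style mesh.
# Assumes a bisection cut crosses 12 links, each with 2x256b per direction,
# at ~2GHz - my assumptions layered on Arm's disclosed figures.

links_crossed = 12
channels      = 2        # doubled-up mesh channels
channel_bits  = 256
freq_hz       = 2e9

bytes_per_cycle_one_dir = links_crossed * channels * channel_bits / 8
bw_both_dirs = 2 * bytes_per_cycle_one_dir * freq_hz   # read + write directions

print(f"{bw_both_dirs / 1e12:.1f} TB/s")   # ~3.1 TB/s, in line with the ~3TB/s figure
```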

We asked Arm if the new mesh would be capable of more exotic 3D routing in terms of the direct interconnections between the nodes, but alas for this generation it’s still “only” limited to a 2D layout.

As noted in the V1 system features, CBusy is a new CPU-mesh feature to alleviate mesh traffic congestion under high load, with the CPUs varying their traffic demands based on how busy the mesh reports itself to be. There are also general traffic improvements, such as combining transactions to reduce their number, or straightforward optimisations such as data-less writes to pages (writing a page to all 0s can be done with only one transaction, instead of writing zeroes to each cache line).
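To illustrate why the data-less page write matters, here’s a trivial transaction-count comparison; the 4KB page and 64B line sizes are conventional values I’m assuming, not CMN-700 parameters.

```python
# Transaction count for zeroing one page: line-by-line writes vs a single
# data-less page write. Page/line sizes are conventional assumptions.

page_bytes = 4096
line_bytes = 64

print(page_bytes // line_bytes, "write transactions ->", 1)   # 64 -> 1
```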

MPAM, again as explained in the earlier CPU section, helps manage traffic across independent workloads on a system, such as VMs, ensuring QoS for SLA requirements and general resource allocation across the entities in the system.

The current-generation CMN-600 already had support for CCIX 1.1, which has been employed in designs such as the Altra Q. The CMN-700 now also introduces CCIX 2.0, as well as CXL compatibility.

Besides operating with coherent accelerators over PCIe, Arm also sees memory disaggregation being a thing in the future, where we would see large pools of memory addressable by both the CPU clusters as well as the compute or accelerator nodes in a coherent fashion.

CCIX 2.0 is important for future multi-die and multi-socket designs as it does away with the PCIe transaction and physical layers in favour of its own leaner link layer and PHY. One big disadvantage of the previous-generation implementation, for example in multi-socket systems, was that crossing all these different layers and protocols incurred tremendous latency penalties. We’ve seen the effect of this in our core-to-core test in the Ampere Altra review, where the chip fared terribly in this regard.

The new CMN-700 and CCIX 2.0 connectivity promise to solve those very high latencies, as well as behaviours such as two cores in the same socket having to fetch a cache line via the remote socket. This is not only important for socket-to-socket communications, but also directly applies to chiplet-to-chiplet designs. It’s to be noted that Arm designs here still have to translate between AMBA CHI and CCIX 2.0 for such traffic, and whilst this is much improved over what we’ve seen in the CCIX 1.1 implementations, it’s likely still not quite as performant as the fully native protocol handling of, for example, Intel’s and AMD’s solutions.

In multi-chip systems, the disaggregation of memory into NUMA domains can result in performance hits when both chips work on the same data. One alleviation of such scenarios is dividing home node coherency across the two chips (this is why the CMN-700 is advertised as supporting up to 512 cores in a “system”). This has disadvantages, as the multi-chip link can become a bottleneck, but it also makes it possible to architect exotic designs such as pooled memory with equal access from both chips.

By now most readers will be familiar with AMD’s chiplet approach, and it’s a general architecture most vendors are heading towards given the slow-down of Moore’s law. Arm’s CMN-700 also allows for designs that are eerily similar to what AMD uses today, where a system can have a central IO Hub along with auxiliary compute dies.

We can have more traditional chip designs simply interconnected to each other, or more exotic designs with possible heterogeneous chiplet architectures.

In the latter, Arm introduces the notion of a “Super Home Node” which acts as the central coherency point. In essence, this is simply another mesh, but in theory it could operate with no cores at all, housing just an SLC (or none) and the central snoop filter handling coherency across all cores. In such an architecture, the SLC within that die would act as an L4, while the SLC in the mesh of a chiplet would act as an L3. There’s a bit of a mish-mash in terms of terminology here as we’re adding layers and chiplets, but I hope most will understand the hierarchy.

Comments

  • name99 - Thursday, April 29, 2021 - link

    What do you mean by "compare?"
    Apple is 8-wide Decode, Map, Rename. But that doesn't include the fact that Apple does a ton of clever work in those three stages (
    - simple branches handled at Decode,
    - a variety of zero-cycle moves and immediate handling in Rename
    - two-level scheduler, with the higher level able to accept an 8-wide feed from Rename, even though the lower-level scheduler is narrower [6 for int, 4 for FP or LS] )

    Apple is *astonishingly* wide at the completion end. 16-wide register freeing and history file release, up to 56(!!!)-wide release of ROB entries.

    The Apple pattern so far (insofar as pattern-detection is worth anything) has been a 1st generation of four cores (A7/8/9/10) with similar design and 6-wide, constantly iterating on details within that framework;
    then a 2nd generation (A11/12/13/14) that makes explicit the big.LITTLE structure (in A10 that was mostly invisible) and based on an 8-wide (with 6 integer units) structure.
If one has to bet, the reasonable thing to bet might be something like starting with A15 we transition to 10-wide (initially with 6, later with 8 integer units), and 2xSVE256. Once again lay the framework, then scale out the pieces over the subsequent three cores.

    One thing that is very clear (and presumably part of Apple's success) is that they have been very willing to keep modifying how they do things; they don't just settle on a design and leave it unchanged except perhaps for some scaling up. For example the way they handle the MOV xn, xm instruction has gone through at least three very different schemes. This may seem trivial (who cares about how a single instruction is implemented?) except that these schemes indicate a substantial reworking of the entire register file and how registers are allocated and then freed.
    This is in comparison to x86 which seems to live in (probably justified) terror that any change they make, no matter how low level, will probably break something because the whole system is so complex and so interconnected that no one person holds the entire thing in their head.

    They also seem to have a good system in place for hiding new functionality behind chicken bits, so that they can effectively debug new features within shipping hardware. For example there are reasons to believe that A14 might have in place most of the pieces required for physical register file amplification (avoid allocation for back-to-back register usage and grab the intermediary off the bypass bus; early release of logically overwritten registers) but these are not visible -- probably behind chicken bits so that they can be tested under all circumstances in shipping HW, and made visible for A15.
    And anyone who has not looked at the details is unaware of just how impressive the underlying Apple µArch platform is. There is substantial room there for on-going growth! As I continue to explore it, not only do I see how well it works today, I also see multiple directions in which it could "easily" (ie feasibly, on schedule and within budget) be improved for years to come. The only other artifact I know of that comes close in terms of quality of implementation and ability for continuing growth is the Mathematica code base -- other artifacts like other CPUs, or various OS implementations, are in a totally different (and far inferior) league.
  • name99 - Thursday, April 29, 2021 - link

    To expand on my point, it's great that ARM are including so many good ideas, but it's also astonishing the extent to which pretty much every good idea already has an Apple precedent.

    For example consider the MPAM discussion: "The mechanism to which this can be achieved can also include microarchitectural features such as dispatch throttling where the core slows down the dispatched instructions, smoothing out high power requirements in workloads having high execution periods, particularly important now with the new wider 2x256b SVE pipelines for example."
    This sounds like (and IS) a good idea -- certainly a lot better than reducing frequency the way Intel does for AVX512.

    But look at this Apple patent from 2011(!)
    https://patents.google.com/patent/US9009451B2
    "A system and method for reducing power consumption through issue throttling of selected problematic instructions. A power throttle unit within a processor maintains instruction issue counts for associated instruction types. The instruction types may be a subset of supported instruction types executed by an execution core within the processor. The instruction types may be chosen based on high power consumption estimates for processing instructions of these types. The power throttle unit may determine a given instruction issue count exceeds a given threshold. In response, the power throttle unit may select given instruction types to limit a respective issue rate. The power throttle unit may choose an issue rate for each one of the selected given instruction types and limit an associated issue rate to a chosen issue rate. The selection of given instruction types and associated issue rate limits is programmable."

    I just keep bumping into this stuff! Arm release new cores with what seem like good ideas (and of course ARM tell us a lot more about what's new than Apple does). I do some exploring -- and find Apple patented that idea five or more years earlier!
  • mode_13h - Thursday, April 29, 2021 - link

    > consider the MPAM discussion

    The slide calls that MPMM. The article confuses it with Memory Partitioning And Monitoring.

    Anyway, the N2's PDP sounds a lot more advanced.
  • GeoffreyA - Friday, April 30, 2021 - link

    Zen 3 needn't blush when standing next to Apple. 4-wide decode might be small but that does pick up to 6, coming out of the micro-op dispatch. Then, going down, you've got 10-wide issue on the INT side, and 6-wide on FP. Admittedly, narrower register files and 8-wide retire from the (smaller) ROB, along with smaller caches. As for move elimination, even Skylake has that. Yes, everything tends to be narrower. But I think it goes to show there's nothing particularly out of this world on the Apple side.
  • name99 - Friday, April 30, 2021 - link

    I did not say that move elimination was the interesting part.
    I said that what was interesting is that over Apple's short CPU career they have already implemented it in three significantly different ways.

    That strikes me as interesting and important -- there is no resting on the laurels, no acceptance that "we have the feature, OK to slow down". You honestly believe that Intel operates according to that same mentality?
  • GeoffreyA - Friday, April 30, 2021 - link

    I'll admit, Apple isn't resting; and they aren't scared to break orthodoxy in advancing their designs. If the others do not wake up, they'll be left in the dust. As for Intel, complacency has put them in the well-deserved pickle they're in today. AMD deserves credit, though, for doing much these past few years; and, like Apple, aren't resting on their laurels either (arguably like they did in the K8 era).
  • mode_13h - Friday, April 30, 2021 - link

    How big are Apple's cores, though? Area is tricky, because they tend to be on a newer process node.

    But, my point is that maybe AMD and Intel aren't making their cores even larger and more complex, because they're targeting the server market and found that a more area-efficient way to scale performance is by adding more cores, rather than making their existing cores even more complex.
  • name99 - Friday, April 30, 2021 - link

    A14 is 88mm^2.

    Eyeballing it, each large core is about 2.5% of that area, so 2.2mm^2
    Throw in the L2 at about 3.5%, so 3mm^2 (shared between two cores).
    Throw in the SLC (not exactly an L3, but pretend it is if you insist) at 8.8% and about 8mm^2.

    I guess if you were targeting a server type design, we could probably treat it as something like
    2.5+(3/4)+ (8.8/8) [making rough guesses about what sort of L2 and L3 would be optimal for a server type design] so ~4.4mm^2.
    Could fit ~100 in a 440mm^2 (though you'd also want some memory controllers and IO!)
    Definitely a lot larger than something like an N1 or N2 -- but of course, Apple isn't designing for the data center -- if they were, they'd probably adopt something halfway between Fire Storm and Ice Storm.

    The problem is not that Intel and AMD are chasing the server market, it is that the way they are chasing it is incoherent. IF your primary goal is the server market, then WTF are you designing for super-high frequencies? The data center cores never run at those frequencies -- but being able to boast about them means your transistor density is half to a third that of Apple (or more relevant ARM/Altra/Graviton) on the same process...
    Pick a goal and optimize for it! But Intel's goal seems to be to optimize for marketing that they can hit 5.x GHz (for half a second...) Not clear that designing the entire company around that goal (of zero interest to the data center, and little interest to most users) is such a great long-term strategy.
  • mode_13h - Saturday, May 1, 2021 - link

    > IF your primary goal is the server market, then WTF are you designing for super-high frequencies?

    AMD actually has lagged Intel in frequency, and I think that's one of the reasons. Remember, AMD is the only one using the same exact silicon on both the mainstream desktop and in all their server products.

    Intel, on the other hand, has completely separate silicon for their server dies, and we don't know all of the subtle ways they could differ from their desktop or laptop cores. We just know they tend to reuse the same basic core micro-architecture up and down their product lines (except for the really cheap/low-power stuff).

    > The data center cores never run at those frequencies

    A few Cascade Lake Xeons could turbo up to 4.5 GHz, which benefits certain workloads. The fastest turbo clock of an Ice Lake Xeon is 4.4 GHz.

    The fastest EPYC can boost up to 4.1 GHz.

    In Intel & AMD's defense, a simpler core can clock higher, but runs more efficiently at lower clocks and enables higher densities. So, it seems like a pretty good strategy to me.

    > Pick a goal and optimize for it!

    Like ARM, Intel and AMD are trying to balance power (in a server/laptop application, at least) and area. Apple is the only one who really has the luxury not to care much about area and just optimize for a single target. When Apple reuses its laptop core micro-architecture in both desktops AND servers, then we can compare them to the other guys. Until then, I think it's a case of Apples and pears.
  • GeoffreyA - Saturday, May 1, 2021 - link

    "Until then, I think it's a case of Apples and pears."

    I think that's it. Well, soon process improvements will be a thing of the past, owing to quantum effects, and then we'll see who does what. The free ride is almost over.
