New DSU-120: More L3 Cache, Doubling Down on Efficiency

For the launch of its Armv9.2 architecture, Arm has decided to opt for a new core complex design for its TCS23 CPU cores by building upon the foundations of its current DSU-110 block. Initially introduced in 2017 along with the Cortex A75 and A55 cores, DSU-110 represented a significant redesign and generational shift to integrate larger pools of shared L3 cache, bandwidth, and scalability. Along with the efficiency tweaking Arm has done to its new Cortex-X4, Cortex-A720, and A520 cores, the new DynamIQ Shared Unit-120 (DSU-120) also plays a significant role in these advancements.

Building a more refined DSU instead of another ground-up design, Arm has made plenty of inroads to improving overall scalability, efficiency, and performance with its DSU-120. Some of the most notable improvements include support for up to 14 CPU cores in a single cluster, which allows SoC vendors to pick and choose their core cluster configurations to suit the device going to market. Arm has also improved its Power and Performance Area (PPA) by implementing new power-saving modes, including RAM and Slicing power-downs, which work in stages depending on the type of workload and the intensity to reduce the overall power footprint of the cluster.

Perhaps the most significant change to DSU-120 from DSU-110 is that Arm has effectively doubled the total amount of shared L3 cache a cluster can implement. DSU-110 initially supported up to 16 MB, whereas DS-120 can now accommodate up to 32 MB of shared L3 cache across the entire complex, with other options also available, including 24 MB. While this isn't a direct implementation into the IP, the decision on the number of L3 cache implemented is entirely down to SoC vendors to decide the right levels of L3 cache based on performance and efficiency balancing depending on the device. The key focus is that DSU-120 and the new TCS23 cluster have the ability to support this if vendors wish to implement more L3 cache.

As with the current/previous DSU-110 interconnect, the new DSU-120 also uses a dual bi-directional ring-based topology, which allows data transmission in both directions within the cluster and reduces overall latency. The overall design of the DynamIQ Shared Unit is to optimize things for latency and increase bandwidth, which is precisely what Arm has done by slicing its logic L3 and snoop filters. As such, it is configurable based on specific customer bandwidth requirements. As previously mentioned, DSU-120 allows up to 14 Cortex-X/A cores to be implemented into a cluster, with plenty of benefits of opting for the latest Armv9.2 generation over the previous iterations.

Focusing on the new power improvements to the TCS23 and DSU-120 complex, Arm has identified specific areas where it can save on power to maximize efficiency. One of these is through RAM and reducing any unnecessary power leakage associated with that. To combat this, Arm has opted for a mechanism that allows RAM to be placed into a low-powered state when not being actively used, but still with enough power to ensure the integrity of its contents. The Logic is split into slices with the L3 cache and a snoop filter designed to improve cache coherence within a multi-core complex. 

Opting for a sliced approach with snoop filters enables a couple of things. Firstly as we've mentioned, it improves and enhances cache coherence. This means that the cores are fed consistently and up-to-date instructions, and the snoop filter itself is designed to filter out requests that are deemed unnecessary, which does give some efficiency benefits. Secondly, slicing allows Arm's IP to increase scalability, which with an increase in cores, means an increase in slices with dedicated cache slices, allowing for better distribution of data and lower data contention rates. Armv9.2 IP with the DSU-120 allows for between 1 and 8 slices to be used, designed to enable SoC vendors the flexibility to work within their bandwidth requirements.

Arm claims that RAM power-down enabled across half of the L3 RAMs on the complex is suitable for large L3 caches when all of the capacity isn't being used. By allowing RAM power-down, all of the unused RAM is put into a low power state, but with enough to keep the contents and withhold their integrity within the memory substructure. Even with RAM and Slice power-downs active, the cores can still be active and process relevant instructions and data. One slice will effectively remain active, which is ideal for smaller and light workloads on a single core, but when it comes to powering down features on the DSU-120 interconnect, accessing the cores will enact a wake-up of the DSU-120.

Looking at how this efficiency translates into data, Arm has provided a handy slide with estimates from its own testing. As we can see, with various levels of RAM and Slice Logic power-downs, we get varied potential power savings, which can then be budgeted back into the cores themselves for higher performance levels. Different workloads and tasks require different levels of core power, coherence, intensity, and L3 allocation, so different power-downs lead to varying levels of leakage and power efficiency savings. Arm's figures estimate between 30 and 72% at the other states of power-down, with 100% savings in leakage with all the slices enabled.

Cortex A520: LITTLE Core with Big Improvements Closing Remarks: TCS23 Promises Improved Performance and Power Efficiency
Comments Locked

52 Comments

View All Comments

  • Silver5urfer - Monday, May 29, 2023 - link

    It is not related to the UI, it is related to the worst practices in ARM, Apple.

    Disposable goods, non compute focused, rather a simplistic tool for the Technological dependency rather than using it like a computer and most importantly, owning your own data in the case of an ARM powered smartphone - Filesystem, Applications control Etc. None of these are present in iOS. And they are now incorporated into the Android heavily from the UI, Design philosophy, Technology.

    Axing 32Bit OS / Applications and forcing everyone to be on the Playstore mandated policies gives an edge to Android on axing the power user features, i.e targeting latest OS SDK means you are restricted heavily to an OS and its jail. Also they are hiding applications now on Playstore. That means old apps are now hard to find, and good apps do not work on latest OS (Timey app for eg), and lot of examples. Plus now modern Android blocks you even on Sideload notifying the SDK target version in normal terms such as this app won't work properly because Android 14 and up do not allow Android 6 below apps.

    Windows enjoys the superior user retention and proper computing because of it's legacy support, A Windows 3.1 .exe will work on Windows 10. But on Apple it's all outdated and even hardware, any x86 processor from Core2Quad which lacks SSE4 and AVX2 still runs modern games which utilize these features but can be made to work because of the power of x86 and Windows. That's how a superior computer is born but not guardrails and heavy restriction and placing consumer in the dark in the name of technology BS.
  • Eliadbu - Monday, May 29, 2023 - link

    Legacy support is overrated for vast majority of the user base, even on windows. its also a thing that can be achieved with emulation for the niche use cases. Most of the argument you gave had little to non to do with 32 bit support. This legacy support costs in silicon space, complexity and software upkeep - all of those resources can be used for actually useful things that will benefit most users.
  • TheinsanegamerN - Wednesday, May 31, 2023 - link

    LMAO legacy support is the only reason windows still exists.
  • iAPX - Monday, May 29, 2023 - link

    Intel is thinking about being 64bit-only too, with the X86S project.

    This is an interesting way, as 16bit and 32bit compatibility could be offered through software emulation in a VM (their proposal), naturally with impact on performances.
  • Silver5urfer - Monday, May 29, 2023 - link

    I hope that project doesn't fly but looking at modern Intel with their ARM clone of P+E to worst now P+E+LPE cores they may break the whole 32Bit Application world.

    Only HPC market can stop it but looking at how Windows 10 is now being retired by 2030 max (LTSC 1809 maximum lifecycle) add maybe ESU channel like Win7 to 2033 at best, after that I think Windows will also copy Apple hard they are already doing it hardcore as Windows 11 is the Win10S branched out because those internal designers are cultists of the Applesque systems lock down and uber simplification of power user nature, this makes entire generations of young population being dumbed down by the basic structure of the OS + Technology rather than innovative and explorative thinking process of the older era (XP, 7 etc)

    Windows 10 is the last Microsoft OS that has real support of all the older Windows applications, 11 discarded a lot of Shell32 / Win32 systems and ruined the NTKernel in the process and the CPU schedulers. They sabotaged the entire explorer.exe too, and with the modern fad AI introduction into the OS the telemetry will explode into exponential factor and with the complete dumbing down of the OS and the process, Atomization of the human thinking will lead to regressive computation. Really unfortunate.

    Emulation means there will be a performance penalty.
  • stephenbrooks - Monday, May 29, 2023 - link

    I'm interested by the ARM laptop direction (the 10 X4 plus 4 mid-core design). That could run a full OS like Windows or Linux.
  • eastcoast_pete - Monday, May 29, 2023 - link

    At least Gavin addressed the mini-elephant in the room for the small cores (thanks!): still no out-of-order design. Instead, an ALU is removed "for greater efficiency". By now, I am suspecting that ARM and Apple have some kind of understanding that ARM little cores won't, under any circumstance, be allowed to come anywhere close to challenging Apple's efficiency cores in Perf/W. Apple's efficiency cores have about twice the IPC of the little ARM cores and all at about the same power draw. Which made the impossible come true: I am now rooting for Qualcomm to kick ARM's butt, both in court and in SoCs.
  • name99 - Monday, May 29, 2023 - link

    Oh FFS, always the conspiracy theories!
    It’s really much simpler — Apple’s small cores are much larger than ARM’s small cores. ARM seems to be thinking that their mid cores (A720) can play the role of Apple’s small cores, and that may be to some extent true in that Apple can split work between big and small in a way that Android cannot, given that Apple knows much more about what each app is doing (better API for specifying this, and much more control of the OS and drivers).

    Much more interesting is how this is all about essentially catching up to Apple’s M series. Which is fine, but if you look at what Apple is doing, the action is all at the high end. I’ve said it before and will say it again; Apple has IBM-level performance in its sights. The most active work over the past three years is all about “scalability” in all its forms, how to design and connect together multiple Max class devices to various ends. The next year or two will be wild at the Apple high end!
  • Kangal - Monday, May 29, 2023 - link

    Thank you!

    However, I still welcome the development of a smaller and slower ARM core, if it means small power draw and small silicon area. There is a market for that outside of phones; in embedded devices, watches, wearables, and ultra low power gadgets.

    We used to have something like Cortex-A7 (tiny), Cortex-A9 (small), Cortex-A17 (medium). Then we had Cortex-A35 (tiny), Cortex-A53 (small), Cortex-A73 (medium). But we never got a successor for the Cortex-A35, so perhaps a very undervolted Cortex-A520 will work. Just like how ARM justified using an overclocked Cortex-A515 as a legitimate successor to the Cortex-A53 range.

    Almost all the attention goes to the (Medium) cores. It's their bread and butter. From the development of (2016) Cortex-A73, A75, A76, A77, A78, A710, A720 (2023).

    But as you said, the exciting things are happening at the high-end (LARGE) cores. It's started with the creation of a new category in the X1, X2, X3, X4 designs. They seem unfit in phones, okay in tablets, and necessary for ARMbooks. Even then, their performance is somewhere far from Apple's M1/M2/M3 and unfit to tackle AMD Zen3 / Intel 11th-gen x86 cores. Let alone their newest variants.
  • back2future - Tuesday, May 30, 2023 - link

    "Even then, their performance is somewhere far from Apple's M1/M2/M3 and unfit to tackle AMD Zen3 / Intel 11th-gen x86 cores."

    without sufficient support for desktop OS on desktop performance CPUs it reduces possibilities to binary translation/multiarch binaries and ISA specific OS from vendors, but not on ARM generally having a free choice for Windows/Linux/Unix variants that suit individual needs (work&development/media/gaming)

Log in

Don't have an account? Sign up now