Cortex-A520: LITTLE Core with Big Improvements

The third of the Armv9.2 cores is the Cortex-A520, which is little in design, but Arm promises big improvements over previous generations, particularly on power efficiency.

Addressing the biggest question right off the bat: no, Cortex-A520 is not an out-of-order core design. True to Arm's little core design ethos, it's still an in-order core – and in fact, Arm has even removed an ALU in the process.

Arm's smallest core for this generation is nominally a new core, but it is more a refinement of the Cortex-A510 than a completely new design. It has the lowest power-to-area ratio of all three announced Cortex Armv9.2 cores. The most significant differences come from power optimizations, with Arm claiming that the Cortex-A520 is 22% more energy efficient than the previous Cortex-A510 core at iso-process and iso-frequency. The little core in Arm's TCS23 catalog is primarily designed to handle low-intensity and background tasks, taking these loads off bigger cores such as the Cortex-A720 and Cortex-X4 and allowing better overall power efficiency within the cluster.

Many of Arm's efficiency gains come from small, microarchitectural-level changes, mostly around how it implements data prefetch and branch prediction. On the whole, not much has changed in the little core, but the small changes that have been made are all about improving efficiency.

One of the non-architectural areas of improvement is the introduction of the new QARMA3 Pointer Authentication Code (PAC) algorithm, which Arm claims reduces PAC's overhead to under 1%. QARMA3 is the cryptographic algorithm used to compute the codes that verify a pointer's integrity. It provides a secure and efficient way to detect tampering, so that unauthorized modifications to pointers are caught, adding a layer of hardware-level security. Arm is not only using QARMA3 PAC to boost security and integrity; it also lets the company squeeze out additional efficiency compared to using PAC with older algorithms.
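To make the idea of pointer authentication concrete, here is a minimal toy sketch in Python. It is not QARMA3 and not Arm's implementation: it swaps in HMAC-SHA-256 from the standard library as a stand-in keyed function, and it assumes a 48-bit virtual address with all 16 upper bits available for the code. The function names `sign` and `authenticate` are hypothetical and exist only for illustration.

```python
import hashlib
import hmac

VA_BITS = 48                  # assumption: 48-bit virtual addresses
PAC_BITS = 64 - VA_BITS       # assumption: every upper bit holds the code
VA_MASK = (1 << VA_BITS) - 1


def _pac(ptr: int, modifier: int, key: bytes) -> int:
    """Keyed code over (pointer, modifier); HMAC stands in for QARMA3."""
    msg = ptr.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Truncate the digest to PAC_BITS bits.
    return int.from_bytes(digest[:8], "little") >> (64 - PAC_BITS)


def sign(ptr: int, modifier: int, key: bytes) -> int:
    """Store the code in the unused upper bits of the pointer."""
    ptr &= VA_MASK
    return ptr | (_pac(ptr, modifier, key) << VA_BITS)


def authenticate(signed: int, modifier: int, key: bytes) -> int:
    """Recompute the code; return the plain pointer only if it matches."""
    ptr = signed & VA_MASK
    if (signed >> VA_BITS) != _pac(ptr, modifier, key):
        raise ValueError("pointer authentication failed")
    return ptr


key = b"demo-key"             # illustrative key, not a real key format
signed = sign(0x7FFF_1234_5678, modifier=0x10, key=key)
print(hex(authenticate(signed, modifier=0x10, key=key)))

# Flip one bit of the stored code, as a corrupted or forged pointer might have.
forged = signed ^ (1 << VA_BITS)
try:
    authenticate(forged, modifier=0x10, key=key)
except ValueError as err:
    print("rejected:", err)
```

The cost Arm is talking about is the keyed function itself: it runs every time a pointer is signed or checked, so a cheaper algorithm directly lowers PAC's overhead.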

Much like when Arm announced its Armv9 architecture in 2021, the small Cortex-A520 cores can be merged in pairs that share pipelines such as SVE, NEON, and FP to improve efficiency. This matters most for SVE2, which requires a larger area footprint than other execution units, so it makes more sense to share it between two small cores than to dedicate it to a single one. However, SoC vendors can still use single-core implementations in their designs if they wish.

Sometimes less is more, and in the case of the Cortex-A520, Arm has removed the third ALU pipeline, which it originally added to the Cortex-A5x DNA with the Cortex-A510. Arm's reasoning is that this saves power in the issue logic and in forwarding results, reducing the overall complexity of the pipeline. In practice, Arm has recovered enough of the lost performance through other improvements that it is willing to eat the hit from removing the ALU in order to minimize core size and maximize efficiency.

Ultimately, Arm is looking at a big-picture trade-off as well: reducing the power consumption of the Cortex-A520 frees up energy that can be allocated to the other cores, such as the Cortex-A720 and even the Cortex-X4 where applicable. This makes Armv9.2 IP versatile and scalable, since small savings in one place can be spent in other areas where and when they are needed.

Using SPEC2006_int_rate_1copy as its metric for performance and efficiency, generation on generation (and at iso-process and iso-frequency), Arm is claiming the Cortex-A520 delivers 8% more performance than the Cortex-A510 at similar levels of power consumption. Alternatively, at iso-performance, the Cortex-A520 can deliver a significant 22% power saving.
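These two figures describe different operating points, so they do not translate into the same performance-per-watt gain. As a rough sketch, the Python below converts each of Arm's claims into a perf/W ratio against the Cortex-A510. It takes the quoted 8% and 22% at face value and treats "similar" power as exactly equal power, which is an assumption; the helper name `perf_per_watt_gain` is hypothetical.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance per watt, given new/old performance and power ratios."""
    return perf_ratio / power_ratio


# Arm's claim at iso-power: 8% more performance, power assumed unchanged.
iso_power = perf_per_watt_gain(perf_ratio=1.08, power_ratio=1.00)

# Arm's claim at iso-performance: same performance at 22% less power.
iso_perf = perf_per_watt_gain(perf_ratio=1.00, power_ratio=1.00 - 0.22)

print(f"iso-power point:       {iso_power:.2f}x perf/W")
print(f"iso-performance point: {iso_perf:.2f}x perf/W")
```

Under these assumptions the iso-performance point works out to roughly 1.28x perf/W against 1.08x at iso-power, which is consistent with the usual pattern of a core being more efficient when it runs lower on its power curve.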

While seemingly small, these gains can add up in the grand scheme of things, especially across a four-core complex of Cortex-A520 cores. Although there are always diminishing returns from increasing core count when it comes to performance, having lower-powered and more efficient cores typically leaves more of the power budget for other areas to tap into, such as the big Cortex-X4 core, which needs more grunt to boost intensive and burst-reliant workloads.

52 Comments

  • Kangal - Monday, May 29, 2023 - link

    I also forgot to mention, we've had leaks for more than a year about ARMv9 and their Second-Gen cores. They were promised with a sizeable performance improvement at a reduced power draw.

    Turns out the rumours were not correct. Well sort of. We assumed the advances came just from the architecture but that's not it. We're seeing a modest improvement in the architecture, and the benefits coming from a number of other factors. They're relying on a new DynamIQ setup, more cache, faster memory, all mixing together to have an overall notable improvement. Going 64bit-only in microcode will have unseen benefits too. And the elephant in the room is the jump to TSMC-3NM node shrink, which will likely have frequency increases.

    So comparing the QC 8g1 (Samsung 5nm) to the (TSMC-5nm) QC 8g1+ and QC 8g2 (+5nm-TSMC) and (TSMC-3NM) QC 8g3 will be a mixed bag.
  • Kangal - Monday, May 29, 2023 - link

    A TCS23 (X4+720+520) with 1+5+2 configuration, only yields a +27% performance uplift at the same power, compared to TCS22 (X3+715+510) in 1+3+4 cluster.

    Something is miscalculated there!!!

    They either mean:
    1) TCS23 vs TCS23, with only difference being different configuration
    2) TCS22 vs TCS22, with only difference being different configuration
    3) TCS22 vs TCS23, and they meant PLUS an extra +27% performance on top of the architectural improvements
    4) It's not a typo, and they really did mean you ONLY get +27% uplift total. Which doesn't make sense since they claimed the X4 uses -40% less energy than X3, whilst the A720 uses -20% less energy than A715, and the A520 uses -22% less energy than A510. Logically speaking if you just multiply the efficiency gains by the core quantities you get an impressive figure. Unless you divide that by the total cores, that gives you an average drop by -23% energy, but that's not the total. Unless the engineers or the marketers are utterly incompetent there at ARM, and they meant this -23% figure gets increased to -27% figure (+4% efficiency gain) just based on the cluster configuration difference. That's not a great improvement, it's negligible, and not substantial enough to require a new silicon stamp (which explains MediaTek).

    1x40% + 5x20% + 2x22% = 184% / 8 = 23%
  • Doug_S - Monday, May 29, 2023 - link

    There are so many factors changing like more L2 cache, supporting more L3 cache, faster memory, better processes and they don't tell you anything about what the differences are.

    If they said X3 with x cache, y DRAM on process z was compared to X3 with x cache, y DRAM on process z then you could assume the performance uplift was due to their architecture. But they are turning all those knobs so who knows what improvement comes from the core versus what is around the core and what node it is on.
  • Doug_S - Monday, May 29, 2023 - link

    Ugh I meant X3 compared to X4 of course.
  • Kangal - Tuesday, May 30, 2023 - link

    I got that, but the architecture has been rather since the Cortex-A78. Just like how there was a long period of time since the release of the Cortex-A57 compared to the Cortex-A72. That extra time let ARM make a lot of big architectural improvements. In fact, it was pretty lengthy that we got Custom Cores developed by the likes of Nvidia, Qualcomm, Samsung, all which were vastly superior to the Cortex-A57 and they matched the subsequent release of the Cortex-A72.

    My biggest concern is the mistakes in their slides.
    If you have a 1+3+4 TCS22 design, and you do nothing but change the A510 to A520, you should see an upgrade of 22% per core. So 22% x4 should see a +88% uptick in performance. Now compare that to the mere 27% upgrade they said if you upgraded all the core types (X4 / A720 / A520) and you went with a larger chipset with the 1+5+2 design. Something is clearly amiss.

    Another solution to the riddle is they are using the problematic silicon from Samsung-5nm. Making a comparison between the flawed QC 8g1, against a new chipset using the same node, but upgrading the Core-types and the Cluster-design. Even then it's a bad excuse, because that would mean its barely competing against the 6-month old QC 8g2 (on TSMC node), and we collectively just ignore the existence of the MediaTek Dimensity chipsets.

    I think we will have to wait to hear the announcement of next-gen chips for 2024, in the form of QC 8g3 and MTK D9400. Let's see their claimed battery life improvement, their performance improvement, and deduce the efficiency from there. Look at which silicon they're building upon (TSMC 4nm vs 5nm). And finally look at the in depth reviews from the likes of Anandtech/Andrei, Geekerwan, and Golden Reviewer.
  • iphonebestgamephone - Wednesday, May 31, 2023 - link

    Techtechpotato instead of anandtech/andrei
  • Findecanor - Monday, May 29, 2023 - link

    ARM MTE and PAC are two different things, and I find it really silly to see them touted for use together.
    MTE steals eight pointer bits that would have been used for PAC, and on some implementations the bits for PAC would then be as few as 3.

    You would better pick one or the other, depending on your protection scheme.
  • syxbit - Monday, May 29, 2023 - link

    So Arm are still years behind Apple? And possibly will be slower than Nuvia too?
    I guess this just helps Qualcomm, as smaller companies have no choice. Either use slow off the shelf parts, or pay QCOMM for their superior chip (assuming Nuvia is as good as claimed).
  • DanNeely - Thursday, June 1, 2023 - link

    Blame some of the mainland China Android forks. They've spent years trying to pretend 64 bit only was never going to happen and were nowhere near ready when last year's X3 dropped support for 32 bit code.
  • Arnulf - Monday, May 29, 2023 - link

    "up for the 6/8-wide dispatch with of the X3"

    From ... width
