Cortex-A720: Middle Core, Big on Efficiency

Focusing on Arm's latest middle core, the Cortex-A720 hasn't changed much from last year's Cortex-A715 design, which was also Arm's first AArch64-only middle core. Arm has a set philosophy for its A700 family: increase performance through optimizations, deliver maximum power efficiency within set thermal limits, and optimize for actual use cases instead of blisteringly fast benchmark performance. The key aim is to improve performance while holding the line on power and area, all within an acceptable thermal envelope. Cost is also essential, with many entry-level mobile devices already on the market using the Cortex-A700 family for their main cores.

Like the Cortex-X4, the Cortex-A720 is built around the Armv9.2 ISA, and Arm has optimized the design so that the A720 delivers more performance than the Cortex-A715 within the same power budget. The Cortex-A700 family typically covers a much broader range of applications and caters to various markets, including, but not limited to, digital TVs (DTV), smartphones, and laptops. Having that flexibility in a more diverse space has its advantages, and Arm looks to capitalize on it with the Cortex-A720 acting as the 'workhorse' of the TCS23 core cluster.

Entry-level devices such as smartphones typically want to reduce cost but maximize performance and efficiency, and that's where cores such as the Cortex-A720 come into play. The Cortex-X4 is primarily allocated to flagship devices or those that require the most burst and sustained performance, such as top-tier smartphones, tablets, and laptops. The Cortex-A720 is the next step down, giving up the X4's high peak performance for a much smaller core size and correspondingly lower energy consumption.

For the Cortex-A720 in particular, Arm is also offering multiple configuration options. Along with the standard, highest-performing option, Arm has what they're terming an "entry-tier" configuration that shaves A720 down to the same size as Arm Cortex-A78, all while still offering a 10% uplift in overall performance. With some Arm customers being especially austere on die sizes, moves such as these are necessary to convince them to finally make the jump over to the Cortex-A7xx series and Armv9.

Arm's focus is to broaden the range of the entry-level market and expand on the possible use cases for its Cortex-A720 core so that it can be implemented into a wider variety of entry-level mobile devices and in lower-end markets.

One of the critical improvements to the Cortex-A720 compared to the previous A715 is faster branch mispredict recovery. A branch predictor guesses which way a branch will go before its outcome is known, so the core can keep fetching and speculatively executing instructions along the predicted path. When the guess is wrong, that speculative work has to be thrown away and the pipeline refilled from the correct path. Recovering from a mispredict faster shortens that stall, which reduces delays in instruction execution and can improve overall performance. It also helps pipeline efficiency: a misprediction disrupts the flow of instructions through the pipeline, and getting back on track sooner benefits not only performance but also overall power efficiency.

Arm has reduced the overall branch mispredict penalty on the A720 to 11 cycles, down from 12 on the Cortex-A715. It has also improved its 2-taken branch prediction technique, which lets the predictor handle up to two taken branches per cycle; this again adds efficiency to the pipeline and reduces the cost of mispredictions.
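To get a feel for what a one-cycle reduction in mispredict penalty is worth, here is a rough back-of-the-envelope sketch. Only the 12-cycle and 11-cycle penalties come from Arm's figures; the branch fraction, mispredict rate, and base CPI are hypothetical values picked purely for illustration, and the simple additive model ignores any overlap with other stalls.

```python
def mispredict_cpi_cost(branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles lost per instruction to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload parameters (assumptions, not Arm data).
BRANCH_FRACTION = 0.20   # 1 in 5 instructions is a branch
MISPREDICT_RATE = 0.02   # 2% of branches are mispredicted
BASE_CPI = 0.5           # cycles per instruction with perfect prediction

# Mispredict penalties quoted for each core.
PENALTY_A715 = 12
PENALTY_A720 = 11

cost_a715 = mispredict_cpi_cost(BRANCH_FRACTION, MISPREDICT_RATE, PENALTY_A715)
cost_a720 = mispredict_cpi_cost(BRANCH_FRACTION, MISPREDICT_RATE, PENALTY_A720)

# Speedup from the shorter penalty alone, everything else held equal.
speedup = (BASE_CPI + cost_a715) / (BASE_CPI + cost_a720)

print(f"A715 mispredict cost: {cost_a715:.4f} cycles/instruction")
print(f"A720 mispredict cost: {cost_a720:.4f} cycles/instruction")
print(f"Speedup from penalty reduction alone: {(speedup - 1) * 100:.2f}%")
```

Under these assumed numbers the gain is well under 1%, which fits the picture of the A720 as a collection of small refinements rather than one big change. Branch-heavy code with poorer prediction rates would see more.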

Another improvement is pipelined FDIV/FSQRT (floating-point division and square root). Pipelining these units allows FDIV and FSQRT operations to overlap in execution rather than each waiting for the previous one to finish, which can improve instruction throughput, and Arm claims to have achieved a significant speed boost without impacting the overall area. There are also faster transfers from the floating point side, including NEON and SVE2 (the latter introduced with Armv9), to integer. This also includes overall improvements to the issue queues and execution units, which simplify the forwarding of data to the AGUs.
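The throughput benefit of pipelining a long-latency unit can be sketched with a simple cycle count. The latency and initiation interval below are hypothetical placeholders, not Arm's figures for the A720, and the model assumes the operations are independent of one another.

```python
def cycles_unpipelined(n_ops, latency):
    """Each operation must finish before the next one can start."""
    return n_ops * latency


def cycles_pipelined(n_ops, latency, initiation_interval=1):
    """A new operation can start every `initiation_interval` cycles."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * initiation_interval


# Hypothetical numbers for illustration only.
LATENCY = 10   # cycles for one divide or square root
INTERVAL = 2   # cycles between issuing back-to-back operations
N_OPS = 8      # independent operations in flight

before = cycles_unpipelined(N_OPS, LATENCY)
after = cycles_pipelined(N_OPS, LATENCY, INTERVAL)

print(f"Unpipelined: {before} cycles")
print(f"Pipelined:   {after} cycles")
print(f"Throughput gain: {before / after:.2f}x")
```

Note that the latency of a single operation is unchanged in this model; the gain only shows up when several independent divides or square roots are queued close together.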

Within the memory system of the Cortex-A720, Arm has reduced the L2 cache latency to 9 cycles and claims up to 2x the memset(0) bandwidth within the L2 cache. Without going into much detail about its methods, Arm also claims generational improvements to the prefetcher's accuracy and coverage. The A720 also gains a new L2 spatial prefetch engine, a design feature that previously debuted on the Cortex-X cores.

Translating the refinements and improvements to performance, Arm estimates the performance uplift to be about 15% at iso-frequency, depending on the workload. Among other benchmarks, there are clear gains over the previous generation in SPECint2017 and improvements in internal testing with SPECint2006. For example, in SPECint2006's 403.gcc, the Cortex-A720 has a gain of around 5% over the Cortex-A715, with an even more significant improvement of about 6% in power efficiency.

Other performance metrics on offer include DRAM reads, which Arm has focused a lot of attention on making more efficient. These show minor gains overall, although SPECint2006's 483.xalancbmk shows a massive increase of up to 41% in DRAM read performance. While results always depend on the workload, Arm has made some clear forward progress with its latest Cortex-A720 CPU core microarchitecture.


52 Comments


  • Silver5urfer - Monday, May 29, 2023 - link

    It is not related to the UI, it is related to the worst practices in ARM, Apple.

    Disposable goods, non compute focused, rather a simplistic tool for the Technological dependency rather than using it like a computer and most importantly, owning your own data in the case of an ARM powered smartphone - Filesystem, Applications control Etc. None of these are present in iOS. And they are now incorporated into the Android heavily from the UI, Design philosophy, Technology.

    Axing 32Bit OS / Applications and forcing everyone to be on the Playstore mandated policies gives an edge to Android on axing the power user features, i.e targeting latest OS SDK means you are restricted heavily to an OS and its jail. Also they are hiding applications now on Playstore. That means old apps are now hard to find, and good apps do not work on latest OS (Timey app for eg), and lot of examples. Plus now modern Android blocks you even on Sideload notifying the SDK target version in normal terms such as this app won't work properly because Android 14 and up do not allow Android 6 below apps.

    Windows enjoys the superior user retention and proper computing because of it's legacy support, A Windows 3.1 .exe will work on Windows 10. But on Apple it's all outdated and even hardware, any x86 processor from Core2Quad which lacks SSE4 and AVX2 still runs modern games which utilize these features but can be made to work because of the power of x86 and Windows. That's how a superior computer is born but not guardrails and heavy restriction and placing consumer in the dark in the name of technology BS.
  • Eliadbu - Monday, May 29, 2023 - link

    Legacy support is overrated for vast majority of the user base, even on windows. its also a thing that can be achieved with emulation for the niche use cases. Most of the argument you gave had little to non to do with 32 bit support. This legacy support costs in silicon space, complexity and software upkeep - all of those resources can be used for actually useful things that will benefit most users.
  • TheinsanegamerN - Wednesday, May 31, 2023 - link

    LMAO legacy support is the only reason windows still exists.
  • iAPX - Monday, May 29, 2023 - link

    Intel is thinking about being 64bit-only too, with the X86S project.

    This is an interesting way, as 16bit and 32bit compatibility could be offered through software emulation in a VM (their proposal), naturally with impact on performances.
  • Silver5urfer - Monday, May 29, 2023 - link

    I hope that project doesn't fly but looking at modern Intel with their ARM clone of P+E to worst now P+E+LPE cores they may break the whole 32Bit Application world.

    Only HPC market can stop it but looking at how Windows 10 is now being retired by 2030 max (LTSC 1809 maximum lifecycle) add maybe ESU channel like Win7 to 2033 at best, after that I think Windows will also copy Apple hard they are already doing it hardcore as Windows 11 is the Win10S branched out because those internal designers are cultists of the Applesque systems lock down and uber simplification of power user nature, this makes entire generations of young population being dumbed down by the basic structure of the OS + Technology rather than innovative and explorative thinking process of the older era (XP, 7 etc)

    Windows 10 is the last Microsoft OS that has real support of all the older Windows applications, 11 discarded a lot of Shell32 / Win32 systems and ruined the NTKernel in the process and the CPU schedulers. They sabotaged the entire explorer.exe too, and with the modern fad AI introduction into the OS the telemetry will explode into exponential factor and with the complete dumbing down of the OS and the process, Atomization of the human thinking will lead to regressive computation. Really unfortunate.

    Emulation means there will be a performance penalty.
  • stephenbrooks - Monday, May 29, 2023 - link

    I'm interested by the ARM laptop direction (the 10 X4 plus 4 mid-core design). That could run a full OS like Windows or Linux.
  • eastcoast_pete - Monday, May 29, 2023 - link

    At least Gavin addressed the mini-elephant in the room for the small cores (thanks!): still no out-of-order design. Instead, an ALU is removed "for greater efficiency". By now, I am suspecting that ARM and Apple have some kind of understanding that ARM little cores won't, under any circumstance, be allowed to come anywhere close to challenging Apple's efficiency cores in Perf/W. Apple's efficiency cores have about twice the IPC of the little ARM cores and all at about the same power draw. Which made the impossible come true: I am now rooting for Qualcomm to kick ARM's butt, both in court and in SoCs.
  • name99 - Monday, May 29, 2023 - link

    Oh FFS, always the conspiracy theories!
    It’s really much simpler — Apple’s small cores are much larger than ARM’s small cores. ARM seems to be thinking that their mid cores (A720) can play the role of Apple’s small cores, and that may be to some extent true in that Apple can split work between big and small in a way that Android cannot, given that Apple knows much more about what each app is doing (better API for specifying this, and much more control of the OS and drivers).

    Much more interesting is how this is all about essentially catching up to Apple’s M series. Which is fine, but if you look at what Apple is doing, the action is all at the high end. I’ve said it before and will say it again; Apple has IBM-level performance in its sights. The most active work over the past three years is all about “scalability” in all its forms, how to design and connect together multiple Max class devices to various ends. The next year or two will be wild at the Apple high end!
  • Kangal - Monday, May 29, 2023 - link

    Thank you!

    However, I still welcome the development of a smaller and slower ARM core, if it means small power draw and small silicon area. There is a market for that outside of phones; in embedded devices, watches, wearables, and ultra low power gadgets.

    We used to have something like Cortex-A7 (tiny), Cortex-A9 (small), Cortex-A17 (medium). Then we had Cortex-A35 (tiny), Cortex-A53 (small), Cortex-A73 (medium). But we never got a successor for the Cortex-A35, so perhaps a very undervolted Cortex-A520 will work. Just like how ARM justified using an overclocked Cortex-A515 as a legitimate successor to the Cortex-A53 range.

    Almost all the attention goes to the (Medium) cores. It's their bread and butter. From the development of (2016) Cortex-A73, A75, A76, A77, A78, A710, A720 (2023).

    But as you said, the exciting things are happening at the high-end (LARGE) cores. It's started with the creation of a new category in the X1, X2, X3, X4 designs. They seem unfit in phones, okay in tablets, and necessary for ARMbooks. Even then, their performance is somewhere far from Apple's M1/M2/M3 and unfit to tackle AMD Zen3 / Intel 11th-gen x86 cores. Let alone their newest variants.
  • back2future - Tuesday, May 30, 2023 - link

    "Even then, their performance is somewhere far from Apple's M1/M2/M3 and unfit to tackle AMD Zen3 / Intel 11th-gen x86 cores."

    without sufficient support for desktop OS on desktop performance CPUs it reduces possibilities to binary translation/multiarch binaries and ISA specific OS from vendors, but not on ARM generally having a free choice for Windows/Linux/Unix variants that suit individual needs (work&development/media/gaming)
