Today in London as part of ARM's TechDay 2015 event we had the pleasure to get a better insight into ARM's new Cortex-A72 CPU. ARM had announced the Cortex-A72 in the beginning of February - leaving a lot of questions to be asked and sense of mystery in the air. The Cortex-A72 is a direct successor to the Cortex-A57 - taking the predecessor as a baseline in order to iterate and improve it.

On the naming side of the equation, moving from 'A57' to 'A72' rather than 'A59' or similar, ARM explains that it is purely a marketing decision as they wanted to give better differentiation between its higher-performance cores from the mid-tier and low-power cores. There seemed to be some confusion between the more power efficienct A53 and the more powerful A57, whereby users would assume they are similar, and thus moving its new big core into the A7x series.

We saw some absolute targeted performance numbers back during the February release, which promised some very interesting numbers that could be achieved over the A57. The problem was that it was not clear how much from performance and power efficiency came from the architectural changes and how much came from the the process on which these targeted performance data points are estimated from. It's clear that on the high-end ARM is promoting the A72 on the new FinFET processes from Samsung/GlobalFoundries and TSMC, which are referred to as 14nm and 16nm in the slides. Generally, due to the design and the node, the A72 will be able to achieve higher clocks than the A57, and we seem to be aiming around 2.5GHz on the 14/16nm nodes when high-end smartphones are concerned. Higher clocks may be present in server applications, where the A72 is also aimed at.

Probably the most interesting slide next to the actual performance metrics of the A72 is the apples-to-apples comparison of the A57 to the A72 on the same process node. When on the 28nm node, we see the A72 having a respectable 20% power reduction when compared to the A57. As a reminder - we're talking about absolute power at the same clock speed, which does not consider performance and thus not a representation of efficiency.

Notably, ARM is aiming for the A72 to be capable of extensive sustained performance at its target frequency. This is something that smaller form factor A57 designs (e.g. phones) have struggled with due to just how powerful A57 is, which has lead to more bursty designs that can only run A57 at its top speeds for short periods of time. We are presented with figures such as sustained 750mW operation per core on 16FF+ at clocks of ~2.5GHz.

While the power numbers are interesting we also have to put them into context of the achieved work. ARM has made several optimizations to the architecture to improve performance when compared to the A57. We'll get into more detail in just a bit - but what we are looking at is a general 16-30% increase on IPC depending on the kind of workload. Together with the power reduction, we now see how ARM is able to advertise such large efficiency gains for the same fixed workload.

A72 Architecture - The Upgrades Over A57

ARM seems to have managed to achieve an improvement in all three areas of the PPA metric; Performance, Power and Area - the trifecta of semiconductor design goals. This was achieved by doing a re-optimization of (almost) every logical block from the A57. There has been some considerable redesign in the CPU's architecture, some of which include a new branch-predictor and improvements in the decoder pipeline to allows for better throughoutput. 

On the level of the instruction fetch block we see a brand new branch-predictor that follows a new sophisticated algorithm that improves performance and reduces power through reduced misprediction and speculation, which has been cut down by 50% for mispredictions and 25% for speculation when compared to the A57. Superfluous branch-predictor accesses have also been suppressed - in workloads where the predictor is not able to do its job efficienctly it is then bypassed completely. There also has been general power optimization in the RAM-organization by coupling the different IP blocks better together, something ARM looks to provide with their own physical IP.

Moving down the pipeline, A72's decoder/rename capabilities have seen their own set of improvements.The decoder itself is still a 3-wide decoder, but ARM has gone through it to try to improve both performance and power consumption in other ways. To improve performance, the effective decode bandwidth has been increased, and the decoder has received some AArch64 instruction-fusion enhancements. Meanwhile power consumption has been tempered at multiple levels, including optimizing decoding directly, and in other power optimizations to the buffers and flow-control hardware that work around the decoder.

However it's on the dispatch/retire stage that the architecture sees the biggest improvements to performance. Going hand-in-hand with the decoder's ability to fuse instructions, ARM's dispatch unit can then break those ops back down into more granular micro-ops for feeding into the execution units, transforming it from a 3-wide to an effective 5-wide machine at the dispatch stage. The net result of this increases decoder throughput (by reducing the number of individual instructions decoded) while also increasing the total number of micro-ops created by the dispatcher and eventually executed per cycle. ARM is quoting an average of 1.08 micro-ops per instruction in code, which will aid the cases where in A57 the 3-wide dispatch unit was eventually dispatch limited. Again on the dispatch-level, ARM has done more extensive work on their register file by reducing the number of read-ports by introducting port-sharing and further reducing superfluous access.

ARM CPU Core Comparison
  Cortex-A15 Cortex-A57 Cortex-A72
ARM ISA ARMv7 (32-bit) ARMv8 (32/64-bit)
Decoder Width 3 ops
Maximum Pipeline Length 19 stages 16 stages
Integer Pipeline Length 14 stages
Branch Mispredict Penalty 15 cycles
Integer Add 2
Integer Mul 1
Load/Store Units 1 + 1 (Dedicated L/S)
Branch Units 1
FP/NEON ALUs 2x64-bit 2x128-bit
L1 Cache 32KB I$ + 32KB D$ 48KB I$ + 32KB D$
L2 Cache 512KB - 4MB 512KB - 2MB 512KB - 4MB
 

On the side of the execution units we see introduction of new, next-generation FP/Advanced SIMD units. The new units allow for much lower instruction latency as the FP pipeline length is reduced from 9 to 6.  FMUL is reduced from 5 cycles down to 3, FADD goes from 4 to 3, FMAC from 9 to 6, and the CVT units go from 4 to 2 units. The reduction of the FP pipeline length brings down the maximum pileline length of the architecture down from 19 to 16. 

The integer units also see an improvement, as the Radix-16 divider has seen its bandwidth doubled, while the CRC unit now becomes a pipelined block with just 1-cycle latency, a 3x increase in bandwidth over the A57. Again, we see a repeating pattern here as ARM claims it tried to squeeze the most power efficiency from all the units by improving the physical implementation.

Another large performance improvement over the A57 is found on the Load/Store unit. Here, ARM claims that bandwidth to L1/L2 has been improved by up to 30%. This was achieved by introducting a sophisticated L1/L2 data prefetcher which, again, is at the same time more efficient as improvements in the L1-hit pipeline, fowarding network, and way-predictor reduce the needed power. 

We've been generally impressed with what the A72 brings to the table. It's clear that new architecture is an evolutional upgrade ot the A57, and the improvements in performance, power, and area, when looked at from an aggregate view, bring substantial differences and upgrades when compared to the A57. With the A57 having come to market in Q3 of last year and it now shipping in high-volume SoCs such as the Snapdragon 810 and Exynos 7420, we are looking at the possibility of seeing its successor come to market in shipping devices in less than a year's time. The obvious partners that might ramp prodution the soonest are MediaTek and Qualcomm, at least if they are able to hit their target schedules. There should presumably still be un-announced parts from other ARM partners as well. It's clear that ARM has increased the cadence of releasing refreshes of its IP portfolio and the quick succession of the A72 seems to be part of that.

The A72 looks to be a logical update to the A57 addressing some weakpoints such as peak power and power efficiency combined with an ~10% area reduction. We already saw Mediatek showing off an A72 package at MWC, so it will be interesting to see how the IP actually performs in silicon and what ARM's partners will be able to do with the core and the time to market.

Comments Locked

92 Comments

View All Comments

  • sonicmerlin - Friday, April 24, 2015 - link

    Intel is not going to rebrand their Core M chips and sell them as Atoms. Not only would that be utterly stupid from a financial perspective (why would anyone buy a Skylake chip if they could get a Core M with similar performance for a tenth the cost? And why do you think Skylake will have such improved performance when Intel hasn't improved their performance in 5 years?), but ARM's chips are running at a much lower TDP than Intel's. That's why ARM gets the praise.
  • name99 - Friday, April 24, 2015 - link

    You think in a YEAR Intel will be selling Cannonlake? Good luck with that prediction.

    As for "much faster Skylake", apart from AVX-512, what makes you believe that? The recent history of Intel has been maybe 15% speed increases at major new architectures (from Nehalem to Sandy Bridge) and around 5% for the subsequent tweaks, eg Sandy to Ivy to Haswell to Broadwell.
    Even if Skylake falls into the 15% category, that's hardly "much faster".
  • jjj - Friday, April 24, 2015 - link

    Why do you treat Intel like they don't utterly suck? Or like greed is a virtue?
    Intel can't make a competitive Atom, it's always next year
    And maybe you should factor in die sizes.
    A72 is 3.3 mm2 on 28nm, bellow 1.9mm2 on Samsung 20nm and some 1.6nm2 on Samsung 14nm. You have any clue how huge Broadwell is, some 4 times bigger than A72 on 14nm. So great Intel has a core that costs 4 times more, they deserve a Nobel in economy.
    Intel can sell Broadwell at 40$ sure since it's 82mm2 while the ARM guys can do A72 at 10$ this year on 28nm and well bellow that later on smaller processes.
    Sure the Braadwell core is not made for this kind of load and there is nothing wrong with that but you don't get to claim that it can compete with a core that's a few times smaller in the real world.
  • name99 - Friday, April 24, 2015 - link

    OK. Call us when Core-M gets its first phone win.
    Until then, ARM's comparison is perfectly legit.
  • psychobriggsy - Friday, April 24, 2015 - link

    In terms of comparison with Atom, A9 was 'competitive' in terms of performance with the Atom cores of the time. A15 often exceeded (significantly). A57 should be in front of current Atom cores, and this should be further still. Intel had a process lead (used to reduce power consumption and have good turbo speeds), but that's reduced now with 14FF from Samsung (sure, there are 20nm elements still, but it's not a full node or two advantage for Intel now).
  • mczak - Friday, April 24, 2015 - link

    The Cortex-A15 never really exceeded performance of the (silvermont) atom, at least not significantly. Even though based on paper specs, it should have (as it's a wider design). Ultimately though in the form factors we're talking about performance was limited by power, and the A15 didn't seem to have an edge there.
    Likewise, the Cortex-A9 should have outperformed the then current slow bonnell core atom, but likewise didn't really (the execution core was definitely faster, but the atom had the edge in things like branch prediction, memory pipeline, overall this pretty much was a wash).
    But the A72 (when using a finfet process) should have a good chance of outperforming the atom I guess (since the new 14nm atom didn't improve all that much). I don't know if it's more energy efficient, but peak performance at least should probably be somewhat higher (as it should reach similar clocks but be faster per clock). I don't think though the differences will be too great (it's always difficult to tell due to the different architecture used, hence even when using the same software compilers etc. can have quite some influence).
  • Wilco1 - Saturday, April 25, 2015 - link

    Just like the A9 outperformed Silverthorne, Cortex-A15 outperforms Silvermont by a huge margin:

    http://browser.primatelabs.com/geekbench3/compare/...

    To get an idea how Atom will do compared to A72, here is how Galaxy S6 compares with the fastest Atom running Android:

    http://browser.primatelabs.com/geekbench3/compare/...

    Yes that shows A57 is ~80% faster clock for clock than Atom. Now consider that A72 will do 2.5GHz and is 16-30% faster per clock as well, so it is safe to say A72 will be at least TWICE as fast as Atom when it comes out in a few months.

    So A72 "has a good chance to outperform Atom" is the understatement of the year...
  • 68k - Saturday, April 25, 2015 - link

    I get the same relative performance in Geekbench when comparing Silvermont vs Cortex-A15, so my boards are not obviously broken. Yet, when running the same Debian version on both boards, I cannot find any non-trival "real" program where Silvermont is at least as fast as Cortex-A15 clock-for-clock. Silvermont beat A15 with a huge amount in programs with a lot of random memory accesses in data set that far exceed the size of the cache.

    Maybe Geekbench isn't the most reliable measurement of general purpose performance, especially not when comparing result from different ISA.
  • Wilco1 - Monday, April 27, 2015 - link

    Geekbench tracks real world performance reasonably well and gives similar results as SPEC when comparing different CPUs. It is not a random memory test of course, but it does exercise L2 and main memory, so having a good memory system and prefetchers helps a lot.

    Since you mention dev boards, in my experience ARM dev boards are lacking in memory performance (both latency and bandwidth) - performance in real devices tends to be much better. As an actual example a dev board I used gives less than 50% of the bandwidth of my phone when running the same benchmark. For Intel boards it may be the other way around - for example the latest 22nm Atoms are slower than the initial Silvermont dev boards - look at the memory test, especially multithreaded:

    http://browser.primatelabs.com/geekbench3/compare/...

    So when comparing performance devboards are not necessarily representative.
  • mczak - Saturday, April 25, 2015 - link

    You use the Tegra K1 for comparison, which isn't really very representative overall. This uses a later revision of the Cortex-A15 than most, and nvidia also managed to clock it higher (it is of course also probably the newest SoC still using a A15). That alone is good enough for roughly 20% more performance than what your typical Cortex-A15 SoC provides.
    Besides, geekbench isn't quite that indicative of real world performance. Yes, the execution core of the A15 is faster than Silvermont (just like Cortex-A9 was faster than Silverthorne), as it should be, but that did not really translate to any real world advantage, not least because the memory pipeline of the atoms was superior, which isn't really reflected in geekbench scores.
    There's imho no way the A72 will be twice as fast in any kind of real world task. If it reaches 50% faster that would be quite an achievement already.

Log in

Don't have an account? Sign up now