Cortex-A55 Microarchitecture

The Cortex-A55 is the next CPU microarchitecture in ARM’s Cambridge family of low-power, small-footprint cores, which also include the A5, A7, and A53. This new little core is an evolution of the A53, so the two cores share much in common. Using the A53 as the starting point makes sense—it has proven to be a successful design—but as the big cores evolved over time, the performance gap between the big cores and the A53 grew, altering the balance in ARM’s big.LITTLE scheme. And as ARM continues its push into new markets beyond mobile, it needs new features that the A53 lacks. The A55 addresses the performance issue with improvements to the memory system and other microarchitectural changes and adds key features by moving from the ARMv8.0 architecture to ARMv8.2.

At a high level, the A55 is still a dual-issue, in-order CPU with an 8-stage pipeline. According to ARM, 8 stages is still the sweet spot, because it’s not seeing significant frequency improvements when moving from 16/14nm to 10nm to 7nm (most of the process gains are with area scaling and reduced dynamic/leakage power). With 8 stages, the A55 should reach a similar peak frequency as A53. Moving to a shorter pipeline would reduce the max frequency without a significant improvement to power or area, while a longer pipeline would increase area and power consumption for only a small frequency gain.



The similarities between the two cores is evident when examining the diagrams above. There’s still a 2-wide decode stage that decodes most instructions in a single cycle. Another feature that carries over from A53 is symmetric dual-issue for most instructions, meaning both issue slots can feed instructions to any pipeline. We’ll cover the execution pipelines in more detail below, but the big change here is that the A55 moves to independent load and store AGUs that can perform loads and stores in parallel instead of a single, combined AGU like the A53.

The A53’s core already provides good throughput, but not having instructions or data ready to process, perhaps because of a mispredicted branch or a cache miss, nullifies this advantage. Without the ability to execute instructions out of order while the CPU waits to fetch an instruction or data from elsewhere in the memory hierarchy, the entire core can stall (all other instructions need to wait for the current instruction to finish), so keeping an in-order core fed with instructions and data is critical. This is why ARM focused heavily on improving the A55’s memory system performance.

Starting on the instruction side, the L1 I-cache is now 4-way set associative versus 2-way for the A53. It’s still VIPT (Virtually Indexed, Physically Tagged), which is commonly used for L1 caches as it reduces latency, but there’s a larger 15-entry L1 TLB that supports multiple page sizes. This simple-sounding change actually required a significant amount of work to alleviate issues with timing pressure. The size of the L1 I-cache is configurable with options for 16KB, 32KB, or 64KB, which is similar to the A53 that offers a range from 8KB to 64KB.

It seems that every new CPU microarchitecture comes with a new branch predictor, and the A55 is no exception. The new conditional predictor increases prediction accuracy by using neural network based algorithms. It also adds loop termination prediction, which avoids a mispredict occurring at the end of a loop. There are also new 0-cycle micro-predictors ahead of the main conditional predictor. These are not as accurate as the main predictor, but their ability to perform predictions back to back to back provide a significant performance uplift by reducing pipeline bubbles even in tight loops. There’s also an indirect predictor that’s only used when necessary, reducing the power penalty because indirect branches do not occur as frequently. This includes a 256-entry BTAC (Branch Target Address Cache).

Switching over to the data path, the A55 includes an improved data prefetcher that delivers higher bandwidth. It’s capable of detecting more complex cache miss patterns and can prefetch to the L1 or L3 caches. ARM expects these improvements to make a noticeable impact on mobile device UI performance.

The L1 D-cache is still 4-way set associative like the A53, but is now fully-exclusive instead of pseudo-exclusive, which saves area because data is not duplicated in the L2 cache. The size of the L1 D-cache is also configurable with 16KB, 32KB, or 64KB options, which is also similar to the A53’s range from 8KB to 64KB.

Another big change for the L1 D-cache is moving from a PIPT (Physically Indexed, Physically Tagged) to a VIPT access scheme. This reduces cache latency because the cache index lookup happens in parallel with the TLB translation, but also creates aliasing issues where several virtual addresses might reference the same physical address. The A55 deals with aliasing in hardware, however, making the VIPT cache appear like a PIPT cache to the programmer.

Further enhancements include a larger 16-entry micro-TLB for the L1 D-cache, a significant increase over the A53’s 10-entry micro-TLB, and a larger store buffer with higher bandwidth into the L1 to better cope with workloads that generate a lot of stores. ARM also reduced the L1 pointer chasing load-to-use latency from 3 cycles in A53 to 2 cycles in A55, providing a small performance bump when working with certain types of data structures.

Being compatible with DynamIQ means the A55 gets an integrated L2 cache that operates at core speed. This reduces L2 latency by 50% (12 cycles to 6 cycles) compared to the A53 and its shared L2. The L2 cache size options include 0KB, 64KB, 128KB, and 256KB. ARM expects 128KB to be the most common for mobile applications, but there should be a reasonable number of 256KB configurations too. The smaller sizes will appeal to the networking and embedded markets that desire lower area/cost or need more deterministic memory behavior. With the L2 now part of the core, the A55’s L2 TLB grows to 1024 entries up from 512 entries for A53.

The A55’s integrated L2 uses PIPT, which is simpler to implement and uses less power than VIPT. Unlike the VIPT L1, using PIPT for the L2 does not incur a performance penalty because its naturally higher latency means there’s sufficient time to get the physical address from the TLB before performing the tag compare.

The L2 is 4-way set associative like the L1, a design choice meant to minimize cache latency, which is important for reducing the impact of stalls on in-order cores. The downside to using a less associative cache is a greater chance to miss, but ARM feels comfortable with this compromise with an L3 cache sitting reasonably close to the core.

Having access to a (potentially) large L3 cache, something the A53 did not have, will also help improve performance. As previously stated, in-order cores are sensitive to cache misses that cause the core to stall, so they respond particularly well to increases in cache size and reductions in memory access latency. ARM says the A53/A55 sees roughly a 5% performance improvement for each doubling of cache size depending on workload.

We’ll end our tour through the data path and our discussion about the A55’s memory system improvements with the AGUs (Address Generation Units). According to ARM’s internal benchmarking, the CPU can spend a lot of time doing loads and stores, so improving performance here was crucial. Instead of sticking with the A53’s single, combined AGU, the A55 uses independent load and store AGUs that can perform loads and stores in parallel, enabling a higher issue rate into the memory system. Note that the A55’s AGUs are not capable of performing both loads and stores (just one or the other) like the AGUs in the A73/A75, a trade off between performance and complexity.

Shifting our focus to the execution pipes, we find 2 ALU/INT units that are functionally the same as the A53’s. Both ALUs can perform basic operations such as additions and shifts, but only one ALU handles integer multiplication and multiply-accumulate operations, while the other focuses on integer division with a Radix-16 divider. So while the A53/A55 cannot perform 2 integer multiplies or divides in parallel, it can dual issue a MUL/MAC alongside a divide/add/shift. The ALUs are still 2 stages, but nearly all instructions complete in 1 cycle. There are also several forwarding paths built in, including paths within the same pipe or cross pipe, to reduce execution latency.

The 2 64-bit NEON/Floating Point pipes are still optional (some markets do not require them) and are served from a dedicated 128-bit register file like the A53. Each SIMD NEON pipe in the A53/A55 can perform 8 8-bit integer, 4 16-bit integer, 2 32-bit integer or single-precision floating-point (FP), or 1 64-bit integer or double-precision FP operations per cycle, giving programmers the flexibility to choose the right balance between precision and performance. As part of its support for the ARMv8.2 architecture, the A55 gains native support for half-precision FP16 operations too. The A53 can fetch FP16 values, but they need to be converted to FP32 before execution. This is no longer the case with A55, so the core performs less work when dealing with FP16 values that are frequently used in some image processing and machine learning applications that require 16-bit precision.

The A55 also includes a new INT8 dot product instruction, another useful addition for machine learning, that offers a 4x performance improvement over the A53 (which increases to about 5x when taking the memory system improvements into account). Instead of executing multiple instructions back to back to back like the A53, the A55 only needs to push a single instruction through the pipe.

One final noteworthy improvement is the ability to do a fused multiply–add (FMA) in a single pass instead of two for the A53, reducing latency from 8 cycles to 4.

The A55 has much in common with the A53, which is not surprising because the A53 already delivers good core throughput. While the A55 gets some improvements to the NEON/FP pipes, mostly from additional instructions courtesy of ARMv8.2, most of its performance gains come from changes to the memory system meant to reduce both the number of core stalls and their latency impact when they do occur.

Cortex-A75 Microarchitecture Final Thoughts
POST A COMMENT

104 Comments

View All Comments

  • Meteor2 - Monday, May 29, 2017 - link

    How? Reply
  • Paul A. Clayton - Monday, May 29, 2017 - link

    The A55 design is constrained not merely by area and power but also by configurability. Being able to vary the L1 cache sizes from 16 KiB to 64 KiB means that the pipeline structure and cycle time is not optimized for one size. Targeting multiple processes and design factors (e.g., SRAM libraries can be tuned for different performance/area/power tradeoffs) also constrains optimization.

    While ARM might have had in mind a particular implementation for optimization (for which it might provide hard cores), it is still limited to providing acceptable designs for other implementations. Some microarchitectural optimizations might strongly depend on implementation details which are outside of ARM's control.

    There are probably also higher-risk design possibilities that were not explored simply because the resources were not available. Having multiple design teams with similar targets typically would mean wasting effort, but such provides a potential for a better design. It would be difficult for ARM to charge for the cost of unused designs given that other designs are available.

    Targeting a broad range of workloads also means a design will tend to be worse than a design targeting a narrower range of workloads.
    Reply
  • Kevin G - Monday, May 29, 2017 - link

    Of course they could but would those changes have permitted it still be within the design constraints of the A55? Small die size and lower power are two characteristics that are not compromised for the A55. Faster is easy to do with more power but considering that the A55 is the little core, higher power consumption is to be avoided. Similarly a faster core might be done with a larger die area. There are trade offs here but the pressure from ARM's customers is to keep this as small as possible.

    Considering those constraints, I considering any improvements to be rather impressive. If there is a silver bullet that ARM could have used to make it faster/smaller/consumer less power in these designs without violating the constraints they have in place, I'd like know what it was.
    Reply
  • tipoo - Monday, May 29, 2017 - link

    Alas, still waiting to find out how different Apples Zephyr is from standard Little cores like it. It's nearly twice as big. Reply
  • jjj - Monday, May 29, 2017 - link

    "What will be the goal for the next core, which will be coming from ARM’s Austin team that produced the A72? "

    That was my main question too but my hope was that the next core is aiming for much higher IPC. They need it for server and dual big core configs in mobile on 7nm.
    Or maybe they don't quite need it really, A75 is really fast and if the next core adds 15-20% higher IPC combined with higher clocks enabled by the process, that's quite a lot and rather amazing from a perf density perspective.

    Not much talk about area, any clue how A75 + DynamIQ compare to previous solutions - ofc the cache part is easy to factor in.

    It is interesting that A75 scales better with higher clocks, any guesses for clocks at 2W? A laptop with 4b4L would be rather nice.

    A55 not targeting higher clocks seems a bit odd, would mean that power goes down if folks move from A53 on 16FF to A55 on 7nm so maybe ARM has another update before 7nm.
    Reply
  • Meteor2 - Monday, May 29, 2017 - link

    I think these are still mobile CPUs. It's up to Cavium et al to do ARM ISA-compatible designs for servers. ARM's not that bothered; the mobile market is far larger. Reply
  • jjj - Monday, May 29, 2017 - link

    ARM is very eager to go server and just a year ago ARM was targeting 25% share in server by 2020. This gen does highlight infrastructure as they call it, a large segment where they've been gaining share and the next step is server.
    7nm is where it starts really, TSMC has the HPC version of the process and ARM needs to be ready too with the core that follows A75.
    What's is unclear is the strategy. A75 is already desktop class so they could just increase IPC some more but maybe they can aim higher. It seems that the Austin team got an extra year to work on the next core so that's 3 years, could be an entirely new design.
    Reply
  • Kevin G - Monday, May 29, 2017 - link

    ARM in the server space is sound much like the hype of Linux on the desktop: always 'next year'.

    The challenges ARM designs have had have been to simply get out to market. AMD's Seattle chip is indeed out but suffered two years of delays and most of the design wins have evaporated due to it. AMD's K12 efforts are MIA right now. Similarly Cavium's ThunderX line is interesting but not the game changer it was hyped to be. Broadcom has exited the ARM server market after promoting an interesting design (SMT on ARM!). Applied Micro's efforts for ARM servers have been lost to corporate mergers. Caldexa folded years ago.

    The one interesting ray of hope is that there are indeed some customers like Microsoft, Facebook, Google and Amazon who are interesting in ARM's low power nature to certain workloads. Microsoft has a version of Windows Server running on ARM but is not releasing it publicly, rather keeping it tied to their Azure cloud services. I have yet to hear where MS has gotten their ARM hardware from though. Google has dipped their toe into chip development for their deep learning efforts and it would be a straight forward process to piece together their own server designs from licensed IP blocks now that they have the in-house expertise to do it (saying they can and them doing it are two different things). In the end, the big cloud providers who could have spurred the ARM server space for everyone may keep the ARM server idea private to themselves while the rest of the market gets to deal with x86. Considering that x86 is perceived as higher power and higher cost, this serves the cloud providers well as it give incentive for companies to migrate to their cloud solutions instead of looking at ARM alternatives.

    The other difficulty for ARM in the market place right now is that Intel preemptively released their response: the Xeon D. Intel was doing a performance/watt play there and it paid for for the low end server market. In most cases, the Xeon D for a pure single socket server was a better choice than the Xeon E5 1xxxx or Xeon E3 line up. I suspect that Intel management sees Xeon D as 'too good' and thus hasn't been quick to bring an updated Sky Lake version to market.
    Reply
  • Wilco1 - Monday, May 29, 2017 - link

    Please read: http://www.anandtech.com/show/11189/appliedmicro-x... - it says both Vulcan and XGene are alive. You forgot to mention QC's Centric (48 cores on 10nm, available this year). There are also 64-core/256GB DRAM beasts made by HiSilicon. Reply
  • jjj - Monday, May 29, 2017 - link

    If you assume a 15-20% IPC gain over A75 for ARM's 7nm core and clock it past 4GHz for server, that's somewhat the worst case scenario for where ARM is in server in 2018-2019. We can assume DinamIQ evolves a bit by then too.
    That wouldn't be bad at all and ARM has extraordinary perf density. They might deliver more than that, we'll see.
    Reply

Log in

Don't have an account? Sign up now