The ARM Diaries, Part 2: Understanding the Cortex A12

Name: The ARM Diaries, Part 2: Understanding the Cortex A12
Item: The ARM Diaries, Part 2: Understanding the Cortex A12
Author: Anand Lal Shimpi

by Anand Lal Shimpi on July 17, 2013 12:30 PM EST

Posted in
CPUs
Arm
SoCs
Cortex A12

65 Comments | Add A Comment

65 Comments

Back End Improvements

The front end of the Cortex A12 is a bit more efficient than the Cortex A9, but the bulk of the performance gains really come from improvements to the execution side of the core. Similar to the Cortex A15, ARM introduced multiple independent issue queues ahead of the functional units. It’s important to get nomenclature right here. Instructions are decoded into micro-ops, renamed instructions are dispatched into the issue queues and then micro-ops are issued from the issue queues when their operands are available. Everything up to the issue queue is handled in order, while issuing can be handled out of order in the Cortex A12 (in most cases, more on this later).

Whereas the Cortex A9 had a single issue queue ahead of all functional units, the Cortex A12 moves to three independent issue queues. The A9’s issue queue could hold 4 decoded instructions, while each issue queue in Cortex A12 is larger than that. The move to larger independent issue queues alone should help with increasing IPC.

The three issue queues are as follows: one for integer, one for FP/NEON and one for loads and stores. ARM provided bits and pieces of an architectural block diagram for the Cortex A12. I reconstructed one as best as I could below. The blue blocks indicate in-order components of the design, while the pink/salmon blocks are out-of-order. You can toggle between the A12 and A9 diagrams to see how things have changed.

Cortex A12 retains the two integer pipelines of the Cortex A9, but adds support for integer divides (like the A7 and A15, other A-series architectures generally lacked support for hardware int divides). The rest of the integer execution capabilities are unchanged.

The FP/NEON units are vastly improved on the Cortex A12. When the Cortex A9 was first introduced, NEON code was rarely used which even lead NVIDIA to dropping NEON support altogether in Tegra 2. Times quickly changed as NEON code is widely used in Android and mobile applications.

The Cortex A12 design retains separate physical register files for integer and FP operations, but the RFs are larger than in Cortex A9.

Although Cortex A9 was considered an out-of-order microarchitecture, all FP and NEON instructions were executed in-order. With Cortex A12, ARM moves to a fully out-of-order architecture, at least as far as non-memory-ops are concerned. The FP/NEON issue queue now dual-issues into two FP/NEON pipes, both of which operate fully out-of-order. The FP/NEON pipes are also more tightly coupled, allowing for quicker data movement between FP and Integer units.

The improvements to the FP/NEON side are expected to show up quite nicely in benchmarks. ARM shared performance data using an FFMPEG workload on simulated Cortex A9 and Cortex A12 designs at the same frequency with the same number of cores (1):

A 48% increase in NEON performance isn’t unexpected at all given the magnitude of improvements to this part of the execution engine.

The final issue queue feeds the two load-store pipelines with two AGUs, once again a doubling from what was present in the Cortex A9 design. Each pipeline is equally capable (load/store agnostic) and mostly out-of-order (limits on what you can re-order if there are address dependencies between loads). By comparison, the load/store pipe in Cortex A9 was fully in-order.

Introduction to Cortex A12 & The Front End Performance Expectations & Final Words

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

65 Comments

View All Comments

lmcd - Wednesday, July 17, 2013 - link
Where do various Kraits fit in?
Wilco1 - Wednesday, July 17, 2013 - link
The Krait 800 as used in Galaxy S4 fits between A9 and A9R4 (it is ~5% better IPC than A9 - it makes up for that by clocking very high).
tuxRoller - Wednesday, July 17, 2013 - link
Can you provide a reference for these values?
The geekbench numbers are all over the place even for the same device (for instance, you see iphone 5 results that vary by 6%, while gs4 can vary by 16% easily).
Death666Angel - Thursday, July 18, 2013 - link
Not sure what geekbench measures and saves. But considering the multitude of options for tweaking an Android system is quite easy. Just change a some governors around and use a different scheduler and you can get quite the range of results.
tuxRoller - Sunday, July 21, 2013 - link
That's kinda my point.
Where exactly is he getting his numbers from.
michael2k - Wednesday, July 17, 2013 - link
It really does sound like the Swift or Krait cores, but widely available now to the Rockchips and Mediateks. Even if it comes out next year, it means $200 smartphones with the raw performance of an iPhone 5 or Galaxy S4 while Apple and Samsung sell something similar for $450. The real question then is how Qualcomm, Samsung, and Apple will push their architectures other than more die-shrinks. Apple still has the option of moving to 4 core as well as BIG.little, and Qualcomm still has the option of BIG.little as well, but where is Exynos going to head? 8 core BIG.little (for 16 total cores?) Asymmetric B.l with 8 big cores and 4 small cores? Something else altogether?
fteoath64 - Friday, July 19, 2013 - link
Great point regarding Big.Little design in the SoC. There are many ways to implement Big.Little design on the wafer. I think only the rudimentary one has been used and this does not really link as much to OS optimizations as we would like. It takes effort/complexity and great code in drivers and kernel changes to take advantage of the design in order to maximise what the hardware can do. And there is the variant that could go Big.Medium.Little. If you look at the frequency charts of typical use, the Medium state do take a lot of the time duration while the Big takes very little (only in the spikes) then the Little takes care of near idle time. Having a Medium module takes space but might be worth the effort in power savings more than just Big.Little switching. The switching cost in power is negligible but time sustained power use on a certain frequency do have good savings (eg 30% in Medium state vs 5% on Big state). Optimizing the OS to change state is important as the efficiency and time savings are there to be had. The slower it does it, the more power it draws for a given duration. Another software optimizing is to split threads to specific core types or core number to optimise performance. eg Little core does all I/O since that is slow while FP/INT goes to Big, or INT split between Big and Little. Dynamic switching to keep one Big core active for several seconds longer might be a benefit if it gets triggered soon after, ie Why switch when a delay in switch solves the problem!. OF course a huge simulation is needed to find the optimal design points that are worth implementing. It is an iterative process. The same goes for GPU cores to get active and boost frequency on demand. For now, they kick fully blast when the game wants it. A great feedback way would be an FPS counter to throttle down the gpus since > 60fps is pretty useless unless you are running 120fps 3D displays. For that cap it at 120fps when the is mode is used. Due to the time to release durations, I am certain many compromised were made just to get the silicon out. ARM vendors are not like Intel who can afford the wait on a release because they had a monopoly on the chip in the market. Competition ensure that ARM evolves quickly and efficiently. This is where you can see Qualcomm succeeding while Tegra falters. Samsung is trying to find their secret sauce toward success with Octa-core design. I think next iteration might be a good one for them coupled with process node improvements they will use.
I see a 2:3:4 design as optimum. 2Big 3Medium 4 Little. Here is how it should work:
Full Bore: 2Big 2Medium and 1 Little active (PentaCore design).
Medium operation: 3Medium and 2 Little active (Still holding PentaCore threading)
Step Down1: 1Medium 2 Little.
Idle: 1 Little only. Note Little takes ALL the I/O traffic.
roberto.tomas - Wednesday, July 17, 2013 - link
Looks pretty clear to me that there will be an A55 at 14nm or at least 10nm. The A12 is technically replacing the A9, right at the start of the next gen of chips which are all 64 bit. It doesn't do them any good to have a high end and low end chip that is 64 bit, and a mid range chip that is only 32 bit. But the power/performance claims are very close to the A15... so this is basically replacing the A15, from that perspective.

The A57 will expire sometime at/after 14nm, and new designs will come out. At that time, an A55 that replaces it would make sense, fulfilling the same roll as the A12 at 32-bit.
Qwertilot - Wednesday, July 17, 2013 - link
I'm sure I remember reading somewhere (some interview?) that they decided that it just didn't make sense (yet) to go 64 bit for the sorts of devices that the A12 will be targeting. The A57 obviously has to go 64 bit to support servers and the like, and that presumably means that the A53 has to follow in order to be matched for bigLittle purposes for high end smart phones/tablets etc.

As michael2k refers to above, the A12 is aimed more at mid/in time, low end phones and the like. Much less reason to push 64 bit there just yet. ARM have to support this sort of thing but I guess the business model means that they can too.
WhitneyLand - Wednesday, July 17, 2013 - link
The NEON improvements are compelling, but it would be nice to peek behind the curtain of the 48% improvement claims on FFMPEG.

To start FFMPEG covers a vast amount of functionality, but certain FFMPEG codecs like h.264 are much more relevant than the obscure ones. So which codecs were used, and are the improvements seen in encoding or decoding, or both?

As we learned with AVX and x264, it's not always easy to realize big gains in real life scenarios with new SIMD hardware.

If there's interest in an article benchmarking x264 on the A9/A15/Krait (to tide us over until the A12 arrives) let me know, been trying to find a way to contribute to AT. :)

The ARM Diaries, Part 2: Understanding the Cortex A12

Back End Improvements

Post Your Comment

65 Comments

View All Comments

lmcd - Wednesday, July 17, 2013 - link

Wilco1 - Wednesday, July 17, 2013 - link

tuxRoller - Wednesday, July 17, 2013 - link

Death666Angel - Thursday, July 18, 2013 - link

tuxRoller - Sunday, July 21, 2013 - link

michael2k - Wednesday, July 17, 2013 - link

fteoath64 - Friday, July 19, 2013 - link

roberto.tomas - Wednesday, July 17, 2013 - link

Qwertilot - Wednesday, July 17, 2013 - link

WhitneyLand - Wednesday, July 17, 2013 - link

Log in

Don't have an account? Sign up now