Performance Expectations & Final Words

ARM’s Cortex A9 was first released to licensees back in 2009, with design work beginning on the core years before that. To say that the smartphone market has changed tremendously over the past several years would be an understatement. Many of the assumptions that held at the time of the Cortex A9’s development no longer apply: there’s far more NEON/FP code in use on mobile platforms, memory accesses are more frequent, and performance demands in general are much heavier. While the Cortex A9 was a good design for its time, its weaknesses on the FP and memory fronts needed addressing. Thankfully, Cortex A12 modernizes the segment.

Although ARM referred to Cortex A9 as an out-of-order design, in reality it supported out-of-order integer execution with in-order FP and memory operations. ARM’s Cortex A12 moves to an almost completely OoO design, and all aspects of the design have been improved as well. Although the Cortex A9 is expected to continue to ramp in frequency over the next year as designs transition to 28nm HPM and beyond, Cortex A12 should deliver much better performance in a more energy efficient manner.

At the same frequency (looking just at IPC), ARM expects roughly a 40% uplift in performance over Cortex A9. The power efficiency and area implications are more interesting. ARM claims that on the same process node, a Cortex A12 design should deliver the same or better power efficiency than a Cortex A9. It achieves this by throwing more die area at the problem: ARM expects a Cortex A12 implementation to be up to 40% larger than a Cortex A9. Just as the Cortex A15 line’s growing performance created the need for the Cortex A9/A12 tier below it, the growing size of that tier in turn drives demand for the even smaller Cortex A7/A53 family.
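Plugging ARM’s claimed figures into simple ratios makes the tradeoff concrete. This is a minimal illustrative sketch only; the numbers are the claims above, not measurements:

    # Illustrative only: ARM's claimed A12-vs-A9 figures as ratios.
    # Baseline Cortex A9 = 1.0 in every metric.
    a9_perf, a9_area = 1.0, 1.0

    a12_perf = a9_perf * 1.40  # ~40% higher performance at the same frequency
    a12_area = a9_area * 1.40  # up to ~40% larger on the same process node

    # "Same or better power efficiency" means power grows no faster than
    # performance, so perf/W stays >= 1.0x. Perf per unit area, however,
    # is roughly flat under these claims:
    print(f"perf/area (A12 vs A9): {a12_perf / a12_area:.2f}x")  # -> 1.00x

In other words, under ARM’s own numbers the A12 buys its efficiency with silicon: performance per watt holds or improves, while performance per square millimeter stays roughly constant.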

ARM’s unique business model allows for the extreme targeting and customization of its microprocessor IP portfolio. If one of its cores gets too large (or power hungry), there’s always a smaller/more energy efficient option downstream.

The Cortex A12 IP was finalized a couple of weeks ago and is now available to licensees for integration. The first designs will likely ship in silicon in a bit over a year, with the first devices implementing Cortex A12 showing up in late 2014 or early 2015. The biggest unknown is whether the design will arrive too late. Qualcomm’s Krait 300 core already gives the smartphone market an alternative, and the question is whether the mobile world will still need a Cortex A12 when it shows up. We always like to say that there are no bad products, just bad pricing, and a more aggressively priced alternative to a Snapdragon 600 class SoC may well entice some customers. Until then, the latest revision of the Cortex A9 core (r4) is expected to carry the torch for ARM, and ARM tells us we may also see more power optimized implementations of Cortex A15 in the interim.

Comments

  • lmcd - Wednesday, July 17, 2013 - link

    Where do various Kraits fit in?
  • Wilco1 - Wednesday, July 17, 2013 - link

    The Krait in the Snapdragon 600, as used in the Galaxy S4, fits between A9 and A9 r4 (its IPC is only ~5% better than A9's; it makes up for that by clocking very high).
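    For rough context, single-core performance scales as IPC x clock, so a small IPC edge can be swamped by frequency. A minimal sketch with assumed numbers: the ~5% IPC figure is the claim above, the 1.9GHz clock is the Galaxy S4's Snapdragon 600, and the A9 clock is a typical shipping value, not a measured baseline:

        # Hypothetical illustration: relative perf ~ IPC ratio * clock ratio.
        krait_ipc_vs_a9 = 1.05  # the ~5% IPC claim above
        krait_clock_ghz = 1.9   # Snapdragon 600 in the Galaxy S4
        a9_clock_ghz = 1.4      # typical Cortex A9 shipping clock (assumed)

        rel = krait_ipc_vs_a9 * (krait_clock_ghz / a9_clock_ghz)
        print(f"Krait vs A9 single-core: ~{rel:.2f}x")  # ~1.43x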
  • tuxRoller - Wednesday, July 17, 2013 - link

    Can you provide a reference for these values?
    The Geekbench numbers are all over the place, even for the same device (for instance, you see iPhone 5 results that vary by 6%, while the GS4 can easily vary by 16%).
  • Death666Angel - Thursday, July 18, 2013 - link

    Not sure what Geekbench measures and saves, but tweaking an Android system is quite easy given the multitude of options. Just change some governors around and use a different scheduler and you can get quite a range of results.
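    The governor point is easy to demonstrate on a rooted device: the active cpufreq governor lives in sysfs, and switching it alone can move scores. A minimal sketch using the standard Linux cpufreq paths (writing requires root, and the governors available vary by kernel):

        # Read and (with root) change the cpufreq governor on cpu0 via
        # the standard Linux cpufreq sysfs interface.
        CPU0 = "/sys/devices/system/cpu/cpu0/cpufreq/"

        with open(CPU0 + "scaling_available_governors") as f:
            print("available:", f.read().strip())
        with open(CPU0 + "scaling_governor") as f:
            print("current:", f.read().strip())

        # Pinning the core at max frequency with "performance" is one of
        # the tweaks that can swing benchmark results noticeably.
        with open(CPU0 + "scaling_governor", "w") as f:
            f.write("performance")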
  • tuxRoller - Sunday, July 21, 2013 - link

    That's kinda my point.
    Where exactly is he getting his numbers from?
  • michael2k - Wednesday, July 17, 2013 - link

    It really does sound like the Swift or Krait cores, but available to the Rockchips and MediaTeks of the world. Even if it comes out next year, it means $200 smartphones with the raw performance of an iPhone 5 or Galaxy S4 while Apple and Samsung sell something similar for $450. The real question then is how Qualcomm, Samsung, and Apple will push their architectures beyond more die shrinks. Apple still has the option of moving to four cores as well as big.LITTLE, and Qualcomm still has the option of big.LITTLE too, but where is Exynos going to head? Eight-core big.LITTLE (for 16 total cores)? Asymmetric big.LITTLE with eight big cores and four small cores? Something else altogether?
  • fteoath64 - Friday, July 19, 2013 - link

    Great point regarding big.LITTLE design in the SoC. There are many ways to implement big.LITTLE on the wafer. I think only the rudimentary one has been used so far, and it does not tie into OS optimizations as much as we would like. It takes effort/complexity, great driver code and kernel changes to take advantage of the design and maximise what the hardware can do.

    And there is the variant that could go Big.Medium.Little. If you look at the frequency charts of typical use, a Medium state would cover most of the time, Big takes very little (only the spikes), and Little takes care of near-idle time. Having a Medium module takes space, but it might be worth more in power savings than plain big.LITTLE switching: the switching cost in power is negligible, while sustained operation at the right frequency yields real savings (e.g. 30% in a Medium state vs 5% on a Big state). How quickly the OS changes state matters too; the slower it switches, the more power it draws for a given duration.

    Another software optimization is to split threads across core types or counts to optimise performance, e.g. a Little core handles all I/O since that is slow, while FP/INT work goes to Big, or INT is split between Big and Little. Keeping one Big core active for several seconds longer might pay off if it gets triggered again soon after, i.e. why switch when delaying the switch solves the problem? Of course a huge amount of simulation is needed to find the design points worth implementing; it is an iterative process.

    The same goes for GPU cores going active and boosting frequency on demand. For now, they kick in at full blast when the game wants it. A good feedback mechanism would be an FPS counter that throttles the GPUs down, since >60 fps is pretty useless unless you are driving a 120fps 3D display; in that mode, cap it at 120fps.

    Due to time-to-release pressures, I am certain many compromises were made just to get the silicon out. ARM vendors are not like Intel, who could afford to sit on a release because they had a monopoly in their chip market. Competition ensures that ARM evolves quickly and efficiently. This is where you can see Qualcomm succeeding while Tegra falters. Samsung is still looking for its secret sauce with the octa-core design; I think the next iteration, coupled with process node improvements, might be a good one for them.
    I see a 2:3:4 design as optimal: 2 Big, 3 Medium, 4 Little. Here is how it should work (a toy sketch of this state selection follows the list):
    Full bore: 2 Big, 2 Medium and 1 Little active (penta-core).
    Medium operation: 3 Medium and 2 Little active (still penta-core threading).
    Step down: 1 Medium, 2 Little.
    Idle: 1 Little only. Note the Little core takes ALL the I/O traffic.
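    A toy sketch of that state-selection policy (the load thresholds are entirely made up for illustration; a real implementation would live in the kernel's scheduler/cpufreq code, not userspace Python):

        # Toy Big/Medium/Little state selector; load is utilization in [0, 1].
        def select_state(load):
            if load > 0.80:
                return "2 Big + 2 Medium + 1 Little"  # full bore (spikes)
            if load > 0.40:
                return "3 Medium + 2 Little"          # sustained medium load
            if load > 0.10:
                return "1 Medium + 2 Little"          # step down
            return "1 Little"                         # near idle; handles all I/O

        for load in (0.95, 0.55, 0.20, 0.03):
            print(f"load {load:.2f} -> {select_state(load)}")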
  • roberto.tomas - Wednesday, July 17, 2013 - link

    Looks pretty clear to me that there will be an A55 at 14nm, or 10nm at the latest. The A12 is technically replacing the A9, right at the start of the next generation of chips, which are all 64-bit. It doesn't do them any good to have 64-bit high-end and low-end chips and a mid-range chip that is only 32-bit. But the power/performance claims are very close to the A15... so from that perspective this is basically replacing the A15.

    The A57 will expire sometime at/after 14nm, and new designs will come out. At that point an A55 that replaces it would make sense, fulfilling the same role as the A12 does at 32-bit.
  • Qwertilot - Wednesday, July 17, 2013 - link

    I'm sure I remember reading somewhere (some interview?) that they decided it just didn't make sense (yet) to go 64-bit for the sorts of devices the A12 will be targeting. The A57 obviously has to go 64-bit to support servers and the like, and that presumably means the A53 has to follow in order to be matched for big.LITTLE purposes in high-end smartphones/tablets etc.

    As michael2k notes above, the A12 is aimed more at mid-range and, in time, low-end phones and the like. There's much less reason to push 64-bit there just yet. ARM has to support this sort of thing, but I guess the business model means they can, too.
  • WhitneyLand - Wednesday, July 17, 2013 - link

    The NEON improvements are compelling, but it would be nice to peek behind the curtain of the 48% improvement claim on FFmpeg.

    To start, FFmpeg covers a vast amount of functionality, and certain codecs like H.264 are much more relevant than the obscure ones. So which codecs were used, and are the improvements seen in encoding, decoding, or both?

    As we learned with AVX and x264, it's not always easy to realize big gains in real life scenarios with new SIMD hardware.

    If there's interest in an article benchmarking x264 on the A9/A15/Krait (to tide us over until the A12 arrives), let me know - I've been trying to find a way to contribute to AT. :)
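    One way to probe the encode/decode question is to time a single codec's decode path in isolation; FFmpeg's null muxer and -benchmark flag make this straightforward. A minimal sketch (the input filename is a placeholder):

        # Time H.264 decode only: decode a clip and discard the frames.
        import subprocess

        subprocess.run([
            "ffmpeg", "-benchmark",  # print CPU/real time stats at the end
            "-i", "input.mp4",       # placeholder: any H.264 clip
            "-f", "null", "-",       # null muxer: decode, output nowhere
        ], check=True)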
