How do you keep increasing performance in a power-constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies, but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option: put a bunch of high performance cores alongside some lower performance but low power cores and switch between them as necessary.

NVIDIA recently revealed it was doing something similar to this with its upcoming Tegra 3 (Kal-El) SoC. NVIDIA outfitted its next-generation SoC with five CPU cores, although only a maximum of four are visible to the OS. If you’re running light tasks (background checks for email, SMS/MMS and Twitter updates while your phone is locked) then a single low power Cortex A9 core services those needs while the higher performance A9s remain power gated. Request more of the OS (e.g. unlock your phone and load a webpage) and the low power A9 goes to sleep and the four high performance cores wake up.
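The switching behavior described above can be sketched as a small policy function. This is only an illustration, not NVIDIA's actual logic: the load metric, the `light_threshold` value and the idea of waking main cores in proportion to load are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreState:
    companion_awake: bool  # the single low power Cortex A9
    main_cores_awake: int  # high performance A9s that are not power gated


def select_cores(load: float, light_threshold: float = 0.25) -> CoreState:
    """Choose which cores to wake for a load measured in 'busy main cores'.

    Both the load metric and light_threshold are hypothetical stand-ins.
    """
    if load < 0:
        raise ValueError("load must be non-negative")
    if load <= light_threshold:
        # Light background work: companion core only, main cores power gated.
        return CoreState(companion_awake=True, main_cores_awake=0)
    # Heavier work: companion core sleeps and main cores wake (four at most).
    return CoreState(companion_awake=False, main_cores_awake=min(4, math.ceil(load)))


for load in (0.1, 1.0, 2.3, 9.0):
    print(load, select_cores(load))
```

Note that the sketch never has the companion core and the main cores awake at the same time, which matches the "maximum of four visible to the OS" constraint.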

While NVIDIA’s solution uses identical cores simply built using different transistors (LP vs. G), the premise doesn’t change if you move to physically different cores. ARM didn’t really have a suitable low power core to offer NVIDIA, so NVIDIA settled on a lower power Cortex A9. Today, ARM is expanding the Cortex family to include a low power core that can either be used by itself or as an ISA-compatible companion core in Cortex A15 based SoCs. It’s called the ARM Cortex A7.

Architecture

Starting with the Cortex A9, ARM moved to an out-of-order execution core (instructions can be reordered around dependencies for improved parallelism) - a transition that we saw in the x86 space back in the days of the Pentium Pro. The Cortex A15 continues the trend as an OoO core but increases the width of the machine. The Cortex A7, however, takes a step back and is another simple in-order core capable of issuing up to two instructions in parallel. This should sound a lot like the Cortex A8, but the A7 is different in a number of areas.

The A8 is a very old design, with work on the core originally beginning in 2003. Although ARM offered easily synthesizable versions of the core, in order to hit higher clock speeds you needed to include a lot of custom logic. The custom design requirements on the A8 not only lengthened time to market but also increased development costs, limiting the A8’s overall reach. The Cortex A7, on the other hand, would have to be fully synthesizable while still delivering good performance. ARM could leverage process technology advancements over the past few years to deliver the necessary clock speeds and competitive power consumption, but it needed a revised architecture to meet the cost and time to market requirements.

The Cortex A7 features an 8-stage integer pipeline and is capable of dual-issue. Unlike the Cortex A8, however, the A7 cannot dual-issue floating point or NEON instructions. There are other instructions that turn the A7 into a single-issue machine as well. The integer execution cluster is quite similar to the Cortex A8’s, although the FPU is fully pipelined and more compact than the one in its older brother.

Limiting issue width for more complex instructions helps keep die size in check, which was a definite goal for the core. ARM claims a single Cortex A7 core will measure only 0.5mm² on a 28nm process. On an equivalent process node, ARM expects customers will be able to implement an A7 in 1/3 to 1/2 the die area of a Cortex A8. For reference, an A9 core uses about the same (if not a little less) die area as an A8, while an A15 is a bit bigger than both.
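To put those ratios in perspective, here is a quick back-of-the-envelope calculation. It assumes ARM's 0.5mm² claim and the 1/3 to 1/2 ratio both apply at the same 28nm node, so the result is the area of a hypothetical 28nm Cortex A8 rather than a measured figure.

```python
from fractions import Fraction

A7_AREA_MM2 = Fraction(1, 2)  # ARM's claimed A7 core area on 28nm


def implied_a8_area(a7_area: Fraction, a7_to_a8_ratio: Fraction) -> Fraction:
    """Area of an A8 on the same node if the A7 is a7_to_a8_ratio of its size."""
    return a7_area / a7_to_a8_ratio


# ARM's stated range: an A7 takes 1/3 to 1/2 the die area of an A8.
smallest = implied_a8_area(A7_AREA_MM2, Fraction(1, 2))
largest = implied_a8_area(A7_AREA_MM2, Fraction(1, 3))
print(f"Implied A8 area at 28nm: {float(smallest)} to {float(largest)} mm^2")
```

In other words, the claim implies an A8-class core would need somewhere between 1.0 and 1.5mm² on the same process.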

Architecture Comparison

| | ARM11 | ARM Cortex A7 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|---|
| Decode | single-issue | partial dual-issue | 2-wide | 2-wide | 2-wide | 3-wide |
| Pipeline Depth | 8 stages | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages |
| Out of Order Execution | N | N | N | Y | Partial | Y |
| Pipelined FPU | Y | Y | N | Y | Y | Y |
| NEON | N/A | Y (64-bit wide) | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide) |
| Process Technology | 90nm | 40nm/28nm | 65nm/45nm | 40nm | 40nm | 28nm |
| Typical Clock Speeds | 412MHz | 1.5GHz (28nm) | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz |

Despite the limited dual issue capabilities, ARM is hoping for better performance per clock and better overall performance out of the Cortex A7 compared to the Cortex A8. Branch prediction performance is improved partly by using a more modern predictor, and partly because the shallower pipeline lessens the mispredict penalty. The Cortex A7 features better prefetching algorithms to help improve efficiency. ARM also includes a very low latency L2 cache (10 cycles) with its Cortex A7 design, although actual latency can be configured by the partner during implementation.
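The effect of a shallower pipeline on branch mispredicts can be illustrated with a simple first-order model. Everything numeric here is an assumption for illustration: the branch fraction and mispredict rate are made-up values, and treating the mispredict penalty as equal to the full pipeline depth (8 stages for the A7, 13 for the A8) is a crude simplification, since real penalties depend on where in the pipeline branches resolve.

```python
def mispredict_cpi(branch_fraction: float, mispredict_rate: float,
                   penalty_cycles: int) -> float:
    """Extra cycles per instruction lost to branch mispredicts."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 5% are mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# Assumption: penalty is roughly the pipeline depth of each core.
a7_cost = mispredict_cpi(BRANCH_FRACTION, MISPREDICT_RATE, penalty_cycles=8)
a8_cost = mispredict_cpi(BRANCH_FRACTION, MISPREDICT_RATE, penalty_cycles=13)

print(f"A7 mispredict cost: {a7_cost:.3f} cycles/instruction")
print(f"A8 mispredict cost: {a8_cost:.3f} cycles/instruction")
```

Even with an identical predictor, the shorter pipeline loses fewer cycles per mispredict in this model; a better predictor would lower `MISPREDICT_RATE` on top of that.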

Note that in decode-bound scenarios, the Cortex A7 will offer the same if not lower performance than a Cortex A8 due to its limited dual-issue capabilities. The mildly useful DMIPS/MHz ratings of these cores are below:

Estimated Core Performance

| | ARM11 | ARM Cortex A7 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|---|
| DMIPS/MHz | 1.25 | 1.9 | 2.0 | 2.5 | 2.1 | 3.3 |
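Multiplying the DMIPS/MHz ratings by the typical clock speeds from the Architecture Comparison table gives a rough per-core throughput estimate. This is only a sketch: DMIPS is a synthetic metric, and where the table lists two clocks for the Cortex A8 (600MHz/1GHz) the higher one is assumed here.

```python
# DMIPS/MHz ratings from the Estimated Core Performance table.
DMIPS_PER_MHZ = {
    "ARM11": 1.25,
    "Cortex A7": 1.9,
    "Cortex A8": 2.0,
    "Cortex A9": 2.5,
    "Scorpion": 2.1,
    "Krait": 3.3,
}

# Typical clocks in MHz from the Architecture Comparison table.
# Assumption: the A8 is taken at its higher listed clock (1GHz).
TYPICAL_MHZ = {
    "ARM11": 412,
    "Cortex A7": 1500,
    "Cortex A8": 1000,
    "Cortex A9": 1200,
    "Scorpion": 1000,
    "Krait": 1500,
}


def estimated_dmips(core: str) -> float:
    """Rough single-core DMIPS at the core's typical clock speed."""
    return DMIPS_PER_MHZ[core] * TYPICAL_MHZ[core]


for core in DMIPS_PER_MHZ:
    print(f"{core:<10} {estimated_dmips(core):7.0f} DMIPS")
```

By this estimate a 1.5GHz Cortex A7 lands well ahead of a 1GHz Cortex A8 despite its slightly lower per-clock rating, because the clock speed advantage outweighs the 1.9 vs. 2.0 DMIPS/MHz gap.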

The big news is that the Cortex A7 is 100% ISA compatible with the Cortex A15. This includes the new virtualization instructions, integer divide support and 40-bit memory addressing. Any code running on an A15 can run on a Cortex A7, just slower. This is a very important feature, as it enables SoC vendors to build chips with both Cortex A7 and Cortex A15 cores, switching between them depending on workload requirements. ARM calls this a big.LITTLE configuration.
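Why ISA compatibility matters for switching can be shown with a simple feature-set check: a running task can only move to another core if that core supports every feature the task may already be using. The sketch below is an illustration only. The feature names are informal labels for the three items mentioned above, not a complete or official list, and `can_migrate` is a hypothetical helper.

```python
# Informal labels for the features named in the text (not an exhaustive list).
A15_FEATURES = frozenset({"virtualization", "integer_divide", "40bit_addressing"})
A7_FEATURES = frozenset({"virtualization", "integer_divide", "40bit_addressing"})


def can_migrate(source: frozenset, target: frozenset) -> bool:
    """A task may move if the target core supports everything the source does."""
    return source <= target


# Matching feature sets: migration is safe in both directions.
print(can_migrate(A15_FEATURES, A7_FEATURES))  # big -> LITTLE
print(can_migrate(A7_FEATURES, A15_FEATURES))  # LITTLE -> big

# Hypothetical companion core lacking integer divide: code that already
# executed a divide instruction on the big core could not safely move to it.
lesser_core = A7_FEATURES - {"integer_divide"}
print(can_migrate(A15_FEATURES, lesser_core))
```

Because the two feature sets are identical, the check passes in both directions, which is what lets software treat the two core types interchangeably.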


76 Comments


  • psychobriggsy - Thursday, October 20, 2011 - link

    For seamless *running* application migration between the different core types, they should both support the exact same instruction set extensions, which currently Atom and SB don't. I don't think that AMD's Bobcat and Bulldozer do either.

    I wouldn't say no to a chip comprising a Bulldozer module or two (like Trinity), and a couple of Bobcat cores as well for lower-power modes. This would surely save a lot of power over even Bulldozer in its lowest operational clock/power state.

    However neither AMD nor Intel can compete in power against this ARM technology - A15 for power (around Bobcat performance per core) and A7 for power saving (around 1GHz Atom performance per core I would imagine). As soon as Intel takes a step towards lower power with Atom, ARM moves the goalposts. Even an Atom core implemented at 22nm can't compete with a 28nm 0.5mm^2 core... which is practically free in terms of silicon (even with a small L2 cache added on top).
  • gostan - Wednesday, October 19, 2011 - link

    Wouldn't a dynamic frequency design (like SpeedStep) be a better implementation, rather than having two different architectures exchanging data and handling different tasks?
  • mythun.chandra - Wednesday, October 19, 2011 - link

    DVFS is in use in almost all current-gen SoCs. This certainly does bring power savings with it, but given the present nature of workloads on most mobile devices, the CPU is either in standby (most of the time) or ramped up fully (for most of the remainder). Having cores running at different frequency steps, while a good idea on paper, can prove detrimental to performance if not implemented correctly.

    Having a low-power 'companion' core shows power savings more readily, especially given the extremes in mobile CPU workload (standby-to-full-clock). The companion core is capable of running the exact set of tasks as the main-core(s), albeit at lower performance levels. This is completely transparent to the OS and software layers above since they are in fact the exact same architecture (or instruction set, to be clearer).
  • metafor - Wednesday, October 19, 2011 - link

    Even at the lowest frequency and voltage, a complex core will still use more power than a simple core. Take a Cortex A5 compared to a Cortex A15 -- even if you step down the voltage to minimum (~0.7V) on the Cortex A15, it would still consume more power than the Cortex A5 at max speed.

    And that's not even accounting for the power savings operating an A5 at lower voltage/frequency would do.
  • bnolsen - Wednesday, October 19, 2011 - link

    There are issues like transistor leakage, etc. that larger cores cannot fully overcome just by clocking down. This is why there's a move to unbalanced MP.
  • fteoath64 - Thursday, October 20, 2011 - link

    @gostan: "Wouldn't a dynamic frequency design (like SpeedStep) be a better implementation?"

    No! You cannot change the number of pipelines in the CPU, nor the components it needs (cache, EU, IU, FPU, etc.). So the number of transistors that need current stays the same, even when each one draws less. If the number of transistors is 1/3, then you get 3X savings! So multiple simpler cores save way more power, and it scales well.
  • Rick83 - Friday, October 21, 2011 - link

    Yet when Intel demoed their Claremont prototype they were able to demonstrate scaling by a factor of 1000.
    This renders the multi-chip approach an expensive crutch.
  • rupaniii - Wednesday, October 19, 2011 - link

    I remember NEC subscribing to much the same philosophy many years ago when they started doing embedded multi-core development.
    Did ARM tread on similar ground or is it me?
  • Guspaz - Wednesday, October 19, 2011 - link

    People have been doing this with ARM designs for ages anyhow, although not necessarily for power efficiency reasons.

    Nintendo has done it since the GBA. The GBA shipped with an ARM7 and Z80 and the DS shipped with an ARM7 and ARM9. The 3DS was the first to go homogeneous, with two ARM11 cores.

    To go off on a bit of a tangent, the 3DS's CPU is rather disappointing, as two 266MHz ARM11 chips is pretty pathetic, with similar performance to a first-gen iPhone. The PS Vita's quad-core Cortex A9 probably has 10-15x the performance... Makes me kind of regret buying a 3DS ;)
  • iwod - Wednesday, October 19, 2011 - link

    While the A7 is at best 1/3 the die size of the A8, the article doesn't state its power compared to the A8. And I don't understand where the 5x power efficiency is coming from. I am guessing it will be able to deliver double the performance of the A8 while using half the power. (While that is amazing, it is still only 4x power efficiency!)

    It talks about powering individual cores up and down. What about having the A7 constantly running tasks on phones, such as signal, phone calls, email, etc., and only using the A9 if there is a need, i.e. delegating tasks to that core only?

    The most amazing thing is that the A15 and A7 would appear to be the same to applications. That is unlike the current Atom and Sandy Bridge, where SB supports additional instructions and features. This puts Atom even further away from getting to the A7's level.

    We all thought that with further tweaking and a 22nm die shrink, Atom would only be one or two steps away from ARM on mobile phones. Not anymore with the Cortex A7.

    And we have PowerVR 6 coming out soon plus their Power VR RTX (hardware ray tracing).

    I wonder when ARM will start to tackle the server market.
