big.LITTLE: Heterogeneous ARM MP

The Cortex A15 is going to be a significant step forward in performance for ARM architectures. ARM hopes it will be enough to actually begin to threaten the low end of the x86 space, which gives you an idea of just how powerful these cores are going to be. The A15 will also find its way into smartphones and tablets, ultimately replacing the Cortex A9s used by high-end devices today. 

For heavy workloads, the Cortex A15 is expected to be more power efficient than the A9. The core may draw more instantaneous power, but it will do so for a shorter period of time, allowing the CPU(s) to get to sleep sooner and reducing average power.
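The race-to-sleep argument comes down to simple arithmetic: energy is power multiplied by time, and average power is that energy spread over the whole window. The sketch below works through it with made-up numbers. The wattages, durations and sleep power are illustrative assumptions, not ARM figures, and `average_power` is a hypothetical helper.

```python
def average_power(active_w, active_s, sleep_w, window_s):
    """Average power (W) over a window: active for active_s, asleep for the rest."""
    if active_s > window_s:
        raise ValueError("task does not finish inside the window")
    energy_j = active_w * active_s + sleep_w * (window_s - active_s)
    return energy_j / window_s


# Hypothetical numbers, chosen only to illustrate the argument.
WINDOW_S = 10.0   # length of the observation window
SLEEP_W = 0.01    # power drawn while asleep

# The faster core draws twice the power but finishes the heavy task in a third of the time.
fast_heavy = average_power(active_w=2.0, active_s=2.0, sleep_w=SLEEP_W, window_s=WINDOW_S)
slow_heavy = average_power(active_w=1.0, active_s=6.0, sleep_w=SLEEP_W, window_s=WINDOW_S)

print(f"fast core: {fast_heavy:.3f} W average, slow core: {slow_heavy:.3f} W average")
```

With these assumed inputs the faster core ends up with the lower average power, which is the behavior ARM expects from the A15 on heavy workloads.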

As ARM has often argued (particularly against Intel), however, these big out-of-order microprocessor architectures are inefficient at dealing with lightweight mobile workloads. In particular, things like background tasks running on your phone while it’s locked in your pocket simply don’t demand the performance of a Cortex A15. ARM further argues that the power consumed by an A15 running these tasks, even though only for a short period of time, is greater than it would be on a much simpler in-order architecture. This is where the A7 comes into play.
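One way to read ARM's argument is as a break-even condition. For a fixed piece of work, the big core uses less energy only if its speedup exceeds its power ratio. The sketch below is a deliberate simplification that ignores idle power and wake-up costs; the function name and the example ratios are assumptions for illustration, not measured values.

```python
def big_core_saves_energy(power_ratio: float, speedup: float) -> bool:
    """Return True if the big core uses less energy for a fixed piece of work.

    power_ratio: big-core active power / little-core active power
    speedup:     little-core runtime / big-core runtime
    Energy ratio (big / little) works out to power_ratio / speedup.
    """
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("ratios must be positive")
    return power_ratio / speedup < 1.0


# Hypothetical values: a compute-bound task speeds up a lot on the big core,
# while a light background task barely benefits from the extra hardware.
heavy = big_core_saves_energy(power_ratio=4.0, speedup=5.0)
light = big_core_saves_energy(power_ratio=4.0, speedup=1.5)

print(f"big core wins on heavy task: {heavy}, on light task: {light}")
```

Under these assumptions the big core only pays off when the workload can actually use its extra performance, which is the gap a simple in-order core is meant to fill.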

Although the Cortex A7 is fully capable of being used on its own (and it most definitely will be), ARM’s partners are free to integrate Cortex A7 cores alongside Cortex A15 cores in a big.LITTLE (or little.BIG?) configuration. 

Since the A7 and A15 execute the same ARM instruction set, any application running on one core can just as easily be migrated to the other. In the example configuration, there are a pair of A15s and a pair of A7s on a single SoC, yet the OS believes there are only two cores in the machine. ARM’s own power management firmware determines which core cluster to activate depending on the performance states requested by the OS. If the OS wants a high performance state, the firmware brings up the A15 cores at a high p-state. If it wants a low performance state, the chip puts the A15s to sleep and schedules everything on the A7s. Cache coherency is guaranteed via the CCI-400 interconnect, so any data invalidated by one core cluster will be reflected in the other cluster’s cache. ARM claims it can switch between core clusters in this configuration in as little as 20 microseconds.
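The cluster-switching behavior described above can be sketched as a toy model. This is not ARM's firmware: the `ClusterSwitcher` class, the 0 to 100 performance scale and the cut-over threshold are all assumptions made for illustration. Only the two OS-visible cores and the 20 microsecond figure come from ARM's description.

```python
from dataclasses import dataclass

SWITCH_LATENCY_US = 20  # ARM's claimed best-case cluster switch time


@dataclass
class ClusterSwitcher:
    """Toy model of firmware choosing which cluster backs the OS-visible cores."""
    threshold: int = 50   # hypothetical cut-over point on a 0-100 scale
    active: str = "A7"    # cluster currently running the workload
    switches: int = 0     # number of cluster migrations so far

    def request(self, perf_level: int) -> str:
        """Handle an OS performance-state request and return the active cluster."""
        if not 0 <= perf_level <= 100:
            raise ValueError("perf_level must be between 0 and 100")
        target = "A15" if perf_level > self.threshold else "A7"
        if target != self.active:
            # The other cluster is put to sleep; the OS never sees the change.
            self.active = target
            self.switches += 1
        return self.active

    def visible_cores(self) -> int:
        """The OS sees two cores regardless of which cluster is awake."""
        return 2

    def switch_overhead_us(self) -> int:
        """Best-case total time spent migrating between clusters."""
        return self.switches * SWITCH_LATENCY_US


fw = ClusterSwitcher()
trace = [fw.request(level) for level in (10, 80, 90, 20)]
print(trace, fw.switches, fw.switch_overhead_us())
```

A real implementation would also need hysteresis so that a workload hovering near the threshold does not bounce between clusters, but that is outside what ARM has described here.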

If everything works the way ARM has described it, a big.LITTLE configuration should be perfectly transparent to the OS (similar to what NVIDIA is promising with Kal-El). ARM did add that SoC vendors are free to expose all cores to the OS if they would like, although doing so would obviously require OS awareness of the different core types.

Core Configurations, Process Technology & Final Words

ARM’s Cortex A7 will be available in one- to four-core configurations, both as the primary CPU in an SoC and in a big.LITTLE configuration alongside some A15s. ARM expects that we will see some 40nm A7 designs as early as the end of next year for use in low-end smartphones (~$100). Most smartphone configurations, even at these price points, will likely use dual-core A7 implementations. It’s only in emerging markets that ARM is expecting to see single-core Cortex A7 smartphones. This is pretty big news, as it means that even value smartphones will be dual-core by 2013.

Costs will keep the A7 on 40nm for a while, although the cores will also be offered at 28nm for integration into A15 designs as well as for even higher performance/lower power implementations.

I have to say that I’m pretty excited about the Cortex A7 announcement across the board. It looks like this core will not only enable much better performance at the value end of the device spectrum but also bring battery life improvements at the high end. Chip architects have argued for years that we were going to see heterogeneous computing as the next phase in the evolution of microprocessors; it’s fascinating to see that we may get the first consumer application of it in ultra-mobile devices.

76 Comments

  • psychobriggsy - Thursday, October 20, 2011

    For seamless *running* application migration between the different core types, they should both support the exact same instruction set extensions, which currently Atom and SB don't. I don't think that AMD's Bobcat and Bulldozer do either.

    I wouldn't say no to a chip comprising a Bulldozer module or two (like Trinity), and a couple of Bobcat cores as well for lower-power modes. This would surely save a lot of power over even Bulldozer in its lowest operational clock/power state.

    However neither AMD nor Intel can compete in power against this ARM technology - A15 for power (around Bobcat performance per core) and A7 for power saving (around 1GHz Atom performance per core I would imagine). As soon as Intel takes a step towards lower power with Atom, ARM moves the goalposts. Even an Atom core implemented at 22nm can't compete with a 28nm 0.5mm^2 core... which is practically free in terms of silicon (even with a small L2 cache added on top).
  • gostan - Wednesday, October 19, 2011

    Wouldn't a dynamic frequency design (like SpeedStep) be a better implementation, rather than having two different architectures exchanging data and handling different tasks?
  • mythun.chandra - Wednesday, October 19, 2011

    DVFS is in use in almost all current-gen SoCs. This certainly does bring with it power saving, but given the present nature of workloads on most mobile devices, the CPU is either in standby (most of the time) or ramped up fully (for most of the remainder). Having cores running at different frequency steps, while a good idea on paper, can prove detrimental to performance if not implemented correctly.

    Having a low-power 'companion' core shows power savings more readily, especially given the extremes in mobile CPU workload (standby-to-full-clock). The companion core is capable of running the exact set of tasks as the main-core(s), albeit at lower performance levels. This is completely transparent to the OS and software layers above since they are in fact the exact same architecture (or instruction set, to be clearer).
  • metafor - Wednesday, October 19, 2011

    Even at the lowest frequency and voltage, a complex core will still use more power than a simple core. Take a Cortex A5 compared to a Cortex A15 -- even if you step down the voltage to minimum (~0.7V) on the Cortex A15, it would still consume more power than the Cortex A5 at max speed.

    And that's not even accounting for the power savings that operating an A5 at lower voltage/frequency would bring.
  • bnolsen - Wednesday, October 19, 2011

    There are issues like transistor leakage, etc. that larger cores cannot fully overcome just by clocking down. This is why there's a move to unbalanced MP.
  • fteoath64 - Thursday, October 20, 2011

    @gostan: "Wouldn't a dynamic frequency design (like SpeedStep) be a better implementation?"

    No! You cannot change the number of pipelines in the CPU, nor the components it needs (cache, EU, IU, FPU, etc.). So the number of transistors needing current is the same even at lower current. If the number of transistors is 1/3, then you get 3x savings! So multiple simpler cores save far more power; it scales well.
  • Rick83 - Friday, October 21, 2011

    Yet when Intel demoed their Claremont prototype, they were able to demonstrate scaling by a factor of 1000.
    This renders the multi-chip approach an expensive crutch.
  • rupaniii - Wednesday, October 19, 2011

    I remember NEC subscribing to much the same philosophy many years ago when they started doing embedded multi-core development.
    Did ARM tread on similar ground or is it me?
  • Guspaz - Wednesday, October 19, 2011

    People have been doing this with ARM designs for ages anyhow, although not necessarily for power efficiency reasons.

    Nintendo has done it since the GBA. The GBA shipped with an ARM7 and Z80 and the DS shipped with an ARM7 and ARM9. The 3DS was the first to go homogeneous, with two ARM11 cores.

    To go off on a bit of a tangent, the 3DS's CPU is rather disappointing, as two 266MHz ARM11 chips is pretty pathetic, with similar performance to a first-gen iPhone. The PS Vita's quad-core Cortex A9 probably has 10-15x the performance... Makes me kind of regret buying a 3DS ;)
  • iwod - Wednesday, October 19, 2011

    The A7 is at best 1/3 the die size of the A8, but it doesn't state the power compared to the A8, and I don't understand where the 5x power efficiency is coming from. I am guessing it will be able to deliver double the performance of the A8 while using half the power. (While that is amazing, it is still only 4x power efficiency!)

    It talks about powering individual cores up and down. What about having the A7 constantly running tasks on phones, such as signal, phone calls, email, etc., and only using the A9 if there is a need, i.e. delegating tasks to that core only?

    The most amazing thing is that the A15 and A7 would appear to be the same to applications. That is unlike the current Atom and Sandy Bridge, where SB supports additional instructions and features. This puts Atom even further away from getting to A7 level.

    We all thought with further tweaking, and 22nm die shrink, Atom would only be one or two steps away from ARM on Mobile Phones. Not anymore with Cortex A7.

    And we have PowerVR 6 coming out soon plus their Power VR RTX ( Hardware Ray Trace ).

    I wonder when ARM will start to tackle the server market.