big.LITTLE: Heterogeneous ARM MP

The Cortex A15 is going to be a significant step forward in performance for ARM architectures. ARM hopes it will be enough to actually begin to threaten the low end of the x86 space, which gives you an idea of just how powerful these cores are going to be. The A15 will also find its way into smartphones and tablets, ultimately replacing the Cortex A9s used by high-end devices today. 

For heavy workloads, the Cortex A15 is expected to be more power efficient than the A9. The core may draw more instantaneous power, but it will do so for a shorter period of time, allowing the CPU(s) to get to sleep sooner and reducing average power.
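To make the race-to-sleep argument concrete, here is a minimal sketch of the arithmetic. Every power and time figure below is a made-up placeholder chosen for illustration, not an ARM measurement, and the `average_power` helper is a hypothetical name of my own.

```python
def average_power(active_w, active_s, idle_w, window_s):
    """Average power over a window: active for active_s, idle for the rest."""
    if active_s > window_s:
        raise ValueError("active time cannot exceed the window")
    energy_j = active_w * active_s + idle_w * (window_s - active_s)
    return energy_j / window_s


# Hypothetical figures (assumptions, not measurements):
WINDOW_S = 10.0  # length of the observation window in seconds
IDLE_W = 0.05    # power drawn while asleep, assumed equal for both cores

# Slower core: lower peak power, but busy for most of the window.
slow_core = average_power(active_w=0.5, active_s=8.0, idle_w=IDLE_W, window_s=WINDOW_S)

# Faster core: twice the peak power, but finishes the same work much sooner.
fast_core = average_power(active_w=1.0, active_s=3.0, idle_w=IDLE_W, window_s=WINDOW_S)

print(f"slow core average: {slow_core:.3f} W")
print(f"fast core average: {fast_core:.3f} W")
```

With these assumed numbers the faster core comes out ahead only because its speedup is larger than its increase in power draw. If the work finished no faster, the higher peak power would simply raise the average.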

As ARM has often argued (particularly against Intel), however, these big out-of-order microprocessor architectures are inefficient at dealing with lightweight mobile workloads. In particular, things like background tasks running on your phone while it’s locked in your pocket simply don’t demand the performance of a Cortex A15. ARM further argues that the power consumed by an A15 running these tasks, even though only for a short period of time, is greater than it would be on a much simpler in-order architecture. This is where the A7 comes into play.

Although the Cortex A7 is fully capable of being used on its own (and it most definitely will be), ARM’s partners are free to integrate Cortex A7 cores alongside Cortex A15 cores in a big.LITTLE (or little.BIG?) configuration. 

Since the A7 and A15 are equally capable of executing the same ARM instruction set, any application running on one core can just as easily be migrated to run on the other. In the example above there is a pair of A15s and a pair of A7s on a single SoC. In this particular configuration, the OS only believes there are two cores in the machine. ARM’s own power management firmware determines which core cluster to activate depending on the performance states requested by the OS. If the OS wants a high performance state, the firmware brings up the A15 cores at a high p-state. If it wants a low performance state, the chip will put the A15s to sleep and schedule everything on the A7s. Cache coherency is guaranteed via the CCI-400 interconnect, so any data invalidated by one core cluster will be reflected in the other cluster’s cache. ARM claims it can switch between core clusters in this configuration in as little as 20 microseconds.
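The cluster-switching behavior described above can be sketched as a toy model. This is not ARM's firmware: the `ClusterSwitcher` class, the 0 to 100 performance scale, and the threshold of 50 are all assumptions made for illustration. Only the 20 microsecond figure comes from ARM's claim, and it is a best case.

```python
from dataclasses import dataclass

SWITCH_LATENCY_US = 20  # ARM's claimed best-case cluster switch time


@dataclass
class ClusterSwitcher:
    """Toy model: the OS sees one set of cores, firmware picks the cluster."""
    threshold: int = 50  # hypothetical cutoff on an assumed 0-100 scale
    active: str = "A7"   # assume the low-power cluster is active at start
    switches: int = 0    # number of cluster migrations so far

    def request(self, perf_level: int) -> str:
        """Handle a performance request from the OS; return the active cluster."""
        if not 0 <= perf_level <= 100:
            raise ValueError("perf_level must be between 0 and 100")
        target = "A15" if perf_level >= self.threshold else "A7"
        if target != self.active:
            # Migrate to the other cluster; the old one can then sleep.
            self.active = target
            self.switches += 1
        return self.active

    def switch_overhead_us(self) -> int:
        """Total time spent switching, assuming the best-case latency."""
        return self.switches * SWITCH_LATENCY_US


switcher = ClusterSwitcher()
trace = [switcher.request(level) for level in (10, 80, 90, 20)]
print(trace)
print(switcher.switch_overhead_us(), "microseconds spent switching")
```

A real implementation would need some hysteresis around the threshold, since every migration costs time and power and a workload hovering near the cutoff would otherwise bounce between clusters.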

If everything works the way ARM has described it, a big.LITTLE configuration should be perfectly transparent to the OS (similar to what NVIDIA is promising with Kal-El). ARM did add that SoC vendors are free to expose all cores to the OS if they would like, although doing so would obviously require OS awareness of the different core types.

Core Configurations, Process Technology & Final Words

ARM’s Cortex A7 will be available in one- to four-core configurations, both as the primary CPU in an SoC and in a big.LITTLE configuration alongside some A15s. ARM expects that we will see some 40nm A7 designs as early as the end of next year for use in low-end smartphones (~$100). Most smartphone configurations, even at these price points, will likely use dual-core A7 implementations. It’s only in emerging markets that ARM is expecting to see single-core Cortex A7 smartphone devices. This is pretty big news as it means that even value smartphones will be dual-core by 2013.

Costs will keep the A7 on 40nm for a while, although the cores will also be offered at 28nm for integration into A15 designs as well as for even higher performance/lower power implementations.

I have to say that I’m pretty excited about the Cortex A7 announcement across the board. It looks like this core will not only enable much better performance at the value end of the device spectrum but it should bring battery life improvements at the high end as well. Chip architects have argued for years that we were going to see heterogeneous computing as the next phase in the evolution of microprocessors. It’s fascinating to see that we may get the first consumer application of it in ultra mobile devices.


76 Comments


  • lancedal - Thursday, October 20, 2011 - link

    True, the OSs do profiling, but it is hardly accurate enough to guarantee real-time performance. Yes, the API would help, but not many applications, if any, specify the resources they need, as that would depend on the system. It's just not practical to require software developers to specify the resources needed given how many developers and applications we have today. Big guys like Pandora, sure, but not the millions of little guys out there.
    Deciding when to switch back and forth between the LITTLE and BIG core is hard because it's not free. It costs power and performance (latency). If you switch too often, then you end up costing more power. The problem is there are no fixed criteria for when to switch.
    If you have the little core handle "system tasks" and the big core handle applications (like Tegra-3), then it may work. However, that only helps standby power and won't do much to extend web-browsing time.
  • sarge78 - Wednesday, October 19, 2011 - link

    http://www.arm.com/products/processors/cortex-a/co...

    The Cortex-A5 processor is the smallest, lowest power ARM multicore processor capable of delivering the Internet to the widest possible range of devices: from ultra low cost handsets, feature phones and smart mobile devices, to pervasive embedded, consumer and industrial devices. The Cortex-A5 processor is fully application compatible with the Cortex-A8, Cortex-A9, and Cortex-A15 processors, enabling immediate access to an established developer and software ecosystem including Android, Adobe Flash, Java Platform Standard Edition (Java SE), JavaFX, Linux, Microsoft Windows Embedded, Symbian and Ubuntu. Cortex-A5 benefits include:

    - Full application compatibility with the Cortex-A8, Cortex-A9, and Cortex-A15 processors
    - Provides a high-value migration path for the large number of existing ARM926EJ-S™ and ARM1176JZ-S™ processor licensees.
    - 1/3 the power and area of Cortex-A9, with full instruction set compatibility.

    Why didn't nVidia use a cortex A5 for Kal-El?
  • ET - Thursday, October 20, 2011 - link

    The Cortex-A5 was announced in 2009, and apparently there hasn't been much demand for it (according to one article). At 1.57 DMIPS / MHz (according to the ARM page) it's significantly weaker than the A8, and I figure that was one problem. My guess is that the Cortex-A7 is a response to that, with higher clock rates and performance that should surpass A8 in most cases.
  • Lucian Armasu - Saturday, October 22, 2011 - link

    But it wasn't supposed to replace Cortex A8, but ARM11, which is still in all low-end Android phones today, and I hate it. Cortex A5 with close to Cortex A8 performance, and 3x more efficient, would've been a really nice replacement for ARM11.
  • bjacobson - Wednesday, October 19, 2011 - link

    wait, so why is the A7 better if 7 is a smaller number than 8? O.o
  • Manabu - Sunday, October 23, 2011 - link

    Because 8 is an even number, and ARM was cursed so that every even-numbered architecture they make is bad. You don't hear about ARM6, ARM8 or ARM10, but ARM7, ARM9 and ARM11 are still very much alive everywhere (low end, Tegra2, etc). The Cortex A8 was a bit more successful because of the new instruction set and raw power, but it was still a bad design. We probably won't be hearing about any new SoC using it in the future.

    Cortex A9, A5, A15 and now A7. ARM is on a roll now, as they stopped being stubborn and are side-stepping the even numbers. ;-)

    The Cortex A8 is bigger and theoretically faster clock for clock than the Cortex A7, even if in practice it will likely be slower because of its laughably slow FPU, lower efficiency and lower core counts. And as it isn't faster than A9, 7 is the logical number to use.
  • iwod - Thursday, October 20, 2011 - link

    Anand, could you do an article on xx-bit CPUs? With a 64-bit x86 CPU we get two major benefits: memory addressing space, and extra registers for faster performance. But other than that, how many programs actually use 64-bit integer and floating point?

    ARM A7 / A15 seem to provide 40-bit addressing, 1TB of memory or 250 times more than the current 4GB limit. I remember Intel also had 40-bit memory addressing, but it required software, OS and BIOS working together, and it didn't work very well for software development. Is this still the case with ARM?
  • mihaimm - Thursday, October 20, 2011 - link

    I can't wait to boot Ubuntu on those. With little tweaks we'll be able to have nice threads go to A7 and others to A15 automatically. Quad cores at 1-1.5 GHz should be enough for most anything on Linux. And if we get it packaged with 543MP2 (and good drivers) this would kill x86.
  • introiboad - Thursday, October 20, 2011 - link

    Even better than 543MP2, if it runs Ubuntu and by the time this comes to market, we'll have Rogue.
  • french toast - Thursday, October 20, 2011 - link

    hey i posted a comment earlier, replying to someone else, asking a question or two, nothing impolite or anything like that, and when i looked back to see if i got an answer my comment had been removed!?? why??
