How do you keep increasing performance in a power constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option available - put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

NVIDIA recently revealed it was doing something similar to this with its upcoming Tegra 3 (Kal-El) SoC. NVIDIA outfitted its next-generation SoC with five CPU cores, although only a maximum of four are visible to the OS. If you’re running light tasks (background checking for email, SMS/MMS, twitter updates while your phone is locked) then a single low power Cortex A9 core services those needs while the higher performance A9s remain power gated. Request more of the OS (e.g. unlock your phone and load a webpage) and the low power A9 goes to sleep and the 4 high performance cores wake up. 

While NVIDIA’s solution uses identical cores simply built using different transistors (LP vs. G), the premise doesn’t change if you move to physically different cores. For NVIDIA, ARM didn’t really have a suitable low power core thus it settled on a lower power Cortex A9. Today, ARM is expanding the Cortex family to include a low power core that can either be used by itself or as an ISA-compatible companion core in Cortex A15 based SoCs. It’s called the ARM Cortex A7.

Architecture

Starting with the Cortex A9, ARM moved to an out-of-order execution core (instructions can be reordered around dependencies for improved parallelism) - a transition that we saw in the x86 space back in the days of the Pentium Pro. The Cortex A15 continues the trend as an OoO core but increases the width of the machine. The Cortex A7 however takes a step back and is another simple in-order core capable of issuing up to two instructions in parallel. This should sound a lot like the Cortex A8, however the A7 is different in a number of areas.

The A8 is a very old design with work originally beginning on the core in 2003. Although ARM offered easily synthesizable versions of the core, in order to hit higher clock speeds you needed to include a lot of custom logic. The custom design requirements on A8 not only lengthened time to market but also increased development costs, limiting the A8’s overall reach. The Cortex A7 on the other hand would have to be fully synthesizable while being able to deliver good performance. ARM could leverage process technology advancements over the past few years to deliver clock speed and competitive power consumption, but it needed a revised architecture to meet the cost and time to market requirements.

The Cortex A7 features an 8-stage integer pipeline and is capable of dual-issue. Unlike the Cortex A8 however, the A7 cannot dual-issue floating point or NEON instructions. There are other instructions that turn the A7 into a single-issue machine as well. The integer execution cluster is quite similar to the Cortex A8, although the FPU is fully pipelined and more compact than its older brother. 

Limiting issue width for more complex instructions helps keep die size in check, which was a definite goal for the core. ARM claims a single Cortex A7 core will measure only 0.5mm2 on a 28nm process. On an equivalent process node ARM expects customers will be able to implement an A7 in 1/3 - 1/2 the die area of a Cortex A8. As a reference, an A9 core uses about the same (if not a little less) die area as an A8 while an A15 is a bit bigger than both.

Architecture Comparison
  ARM11 ARM Cortex A7 ARM Cortex A8 ARM Cortex A9 Qualcomm Scorpion Qualcomm Krait
Decode single-issue partial dual-issue 2-wide 2-wide 2-wide 3-wide
Pipeline Depth 8 stages 8 stages 13 stages 8 stages 10 stages 11 stages
Out of Order Execution N N N Y Partial Y
Pipelined FPU Y Y N Y Y Y
NEON N/A Y (64-bit wide) Y (64-bit wide) Optional MPE (64-bit wide) Y (128-bit wide) Y (128-bit wide)
Process Technology 90nm 40nm/28m 65nm/45nm 40nm 40nm 28nm
Typical Clock Speeds 412MHz 1.5GHz (28nm) 600MHz/1GHz 1.2GHz 1GHz 1.5GHz

Despite the limited dual issue capabilities, ARM is hoping for better performance per clock and better overall performance out of the Cortex A7 compared to the Cortex A8. Branch prediction performance is improved partly by using a more modern predictor, and partly because the shallower pipeline lessens the mispredict penalty. The Cortex A7 features better prefetching algorithms to help improve efficiency. ARM also includes a very low latency L2 cache (10 cycles) with its Cortex A7 design, although actual latency can be configured by the partner during implementation.

Note that in decoding bound scenarios, the Cortex A7 will offer the same if not lower performance than a Cortex A8 due to its limited dual-issue capabilities. The mildly useful DMIPS/MHz ratings of ARM’s various cores are below:

Estimated Core Performance
  ARM11 ARM Cortex A7 ARM Cortex A8 ARM Cortex A9 Qualcomm Scorpion Qualcomm Krait
DMIPS/MHz 1.25 1.9 2.0 2.5 2.1 3.3

The big news is the Cortex A7 is 100% ISA compatible with the Cortex A15, this includes the new virtualization instructions, integer divide support and 40-bit memory addressing. Any code running on an A15 can run on a Cortex A7, just slower. This is a very important feature as it enables SoC vendors to build chips with both Cortex A7 and Cortex A15 cores, switching between them depending on workload requirements. ARM calls this a big.LITTLE configuration.

big.LITTLE: Heterogeneous ARM MP
POST A COMMENT

76 Comments

View All Comments

  • Snowstorms - Thursday, October 20, 2011 - link

    these guys are so close to competing for desktop x86 space, they are already down to 28nm and are using the exact same ASML as Intel and AMD Reply
  • Pessimism - Thursday, October 20, 2011 - link

    Does anyone else find their naming conventions infuriating?
    11->7->8->9->15 HUH?
    plus they label their instruction sets with an A and a similar but different number... and the other licensees of ARM have their own gimpy naming conventions with one for the base core and one for the chip and so on and so forth...
    Reply
  • iwod - Thursday, October 20, 2011 - link

    11 is ARM 11 which is different to Cortex Series

    If we inflate number based on timing they A7 would have been named A16.

    A8 is replaced by A9 and will be subsequently by A15.

    A7 is more like an replacement for the ultra low power A5. It just happen to be even more powerful then A8 therefore they used A8 as comparison.
    Reply
  • mczak - Thursday, October 20, 2011 - link

    I don't think A15 is really a replacement of A9.
    Each member of the Cortex-A family seems to have its place, except the A8 (which is the oldest and obsolete).
    A5: very small, low power - though once you include larger L1 caches and NEON it seems A7 would be a better fit as it's hardly smaller anymore.
    A7: small, low power, probably highest perf/power and perf/area of the whole family. Unlike A5 l1 cache sizes are fixed and NEON always included, and with all features of the A15 (including virtualization for instance).
    A8: a dud, unlike all others not MP capable and with nonpipelined FPU. Worst efficiency of the family (by a large margin) in perf/area and perf/power. I don't think there's any reason at all why you'd want to use this in a new design. In some areas it might be faster than A7 (I think NEON might be twice as fast).
    A9: similar size to A8 but quite a bit more advanced (out-of-order) and with higher efficiency.
    A15: quite a beast compared to A9, much more complex and faster, but much bigger - the first to target low-power servers too. Efficiency might be similar to A9, not sure.

    Of course, this completely ignores the timeframe - A8 was the only option for quite some time, and apart from that only A9 has made it to devices yet (I think we should see A5 soon enough - MSM7227A has Cortex-A5 and possibly quite a few low-end smartphones might use it).
    Reply
  • ET - Thursday, October 20, 2011 - link

    I do see the A15 as a replacement for the A9. The high end was A9 and will be A15. The mid range was A8 and will be A7. Both will offer significant performance increases. A9 will survive a little longer (as will the A8, probably), but I don't think it has a real place between the A15 and A7.

    As for the names, ARM Cortex family names reflects core complexity or size, far as I understand, not how new the core is.
    Reply
  • C300fans - Thursday, October 20, 2011 - link

    Intel has his ACE, atom E6xx, which has already been proved to be much more power efficient than ARM. For example, Sony PRS T1 is using intel 1Ghz processor(Atom E640) in his e-book reader runing on Andorid. Reply
  • mczak - Thursday, October 20, 2011 - link

    Good point though there should be really a rather large difference in performance (and in chip size) between a A7 and a A15, hence I think there's still some room for A9 (which should still be a fair bit faster than A7 after all) - say for a low-end smartphone in 2013. I guess though this will rather depend on the licensing cost differences (afaik the less complex designs are cheaper) between A7/A9/A15. Reply
  • iwod - Thursday, October 20, 2011 - link

    I dont think the replacement are intentional. it is the market overlaping when one product's power and effcieny leap forward.

    It is like a Pentium SandyBridge was never meant to replace the good old Core 2 Duo, but it is just SandyBridge being so much better and cheaper to happens to replaced it.
    Reply
  • nofumble62 - Thursday, October 20, 2011 - link

    is that ARM cannot boost core performance without sucking up more power. The power versus performance chart shows a straight line. So they have to employ this trick to maintain power efficiency, while getting enough horse power for difficult tasks.

    Believe Intel has done this sort of thing already. What else is new?

    What's next for ARM? double the core count?
    Reply
  • iwod - Thursday, October 20, 2011 - link

    No one can boost performance without sucking more power. But ARM still have many more things to do with IPC. And as a matter of fact, a Quad Core Cortex A15 @ 2.5Ghz is very capable, and faster then a Core 2 Duo. Reply

Log in

Don't have an account? Sign up now