ARM Cortex A9: What I'm Excited About

NVIDIA won't talk about Tegra GPU architecture, but ARM is more than willing to talk about the Cortex A9.

I'm not used to seeing so much pipeline variance between microprocessor cores. The ARM11 core was introduced in 2003 and featured a single-issue, 8-stage integer pipeline. Floating point was optional. The Cortex A8 was announced in 2005 and doubled the width of the front end: it has a dual-issue, in-order, 13-stage integer pipeline. Doubling issue width increased IPC (instructions per clock) and the deeper pipeline gave it frequency headroom.

The Cortex A9 goes back down to an 8-stage pipeline. It's still a dual-issue pipeline, but instructions can execute out of order. What's even more ridiculous are the frequencies you can get out of this core. TI is going to be shipping a 750MHz and 1GHz SoC based on the Cortex A9. NVIDIA's Tegra 2 will run at up to 1GHz. And even ARM is willing to supply Cortex A9 designs that can run at up to 2GHz on TSMC's 40nm process. Privately I've heard that designs scaling beyond 2GHz, especially at 28nm, are going to be possible.

This is huge for two reasons. The Cortex A9 has a shallower pipeline than the A8, so it loses fewer cycles to each mispredicted branch and does more per clock. It also has an out-of-order execution engine, which raises the work done per clock even further. At the same clock speed, A9 should destroy A8. ARM estimates that the A8 can do up to 2 DMIPS per MHz (or 2000 DMIPS at 1GHz), whereas the A9 can do 2.5 DMIPS per MHz (2500 DMIPS at 1GHz). Given that most A8 implementations have been at or below 600MHz (1200 DMIPS), and TI's A9s are running at 750MHz or 1GHz (1875 DMIPS or 2500 DMIPS), I'd expect anywhere from a 30 - 100% performance improvement over existing Cortex A8 designs.
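To make the arithmetic behind those figures explicit, here is a minimal sketch in Python. The DMIPS-per-MHz values and clock speeds are the ones quoted above; the function and constant names are purely illustrative. Keep in mind that DMIPS is a peak synthetic figure, so these ratios are paper numbers rather than a prediction of real-world gains.

```python
# DMIPS/MHz figures are ARM's estimates as quoted in the text.
A8_DMIPS_PER_MHZ = 2.0
A9_DMIPS_PER_MHZ = 2.5


def dmips(dmips_per_mhz: float, mhz: float) -> float:
    """Peak Dhrystone MIPS for a core at a given clock speed."""
    return dmips_per_mhz * mhz


def improvement_percent(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Baseline: a typical 600MHz Cortex A8 implementation.
a8_600 = dmips(A8_DMIPS_PER_MHZ, 600)

# TI's announced Cortex A9 clock speeds.
for mhz in (750, 1000):
    a9 = dmips(A9_DMIPS_PER_MHZ, mhz)
    gain = improvement_percent(a9, a8_600)
    print(f"A9 @ {mhz}MHz: {a9:.0f} DMIPS, {gain:.0f}% over A8 @ 600MHz")

# Same-clock comparison isolates the per-clock (IPC) advantage.
same_clock = improvement_percent(A9_DMIPS_PER_MHZ, A8_DMIPS_PER_MHZ)
print(f"Same clock: {same_clock:.0f}% more DMIPS per MHz")
```

On paper, the per-clock advantage alone is 25%; the rest of the gap comes from the higher clock speeds.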

That's just for a single core though. At 40nm there's enough room to cram two of these out of order cores on a single SoC. That's what NVIDIA's doing at first with Tegra 2. Two cores together running multithreaded code and now you're looking at multiples of Cortex A8 performance. I'm talking iPhone to 3GS levels of performance improvement. And then some.

The shallower pipeline is very important for keeping power consumption low. Mispredicted branches have a much lower performance and power impact on shallow pipelines than they do on deep ones.
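A rough way to see why pipeline depth matters is to count the cycles thrown away on mispredicted branches. The sketch below is a back-of-the-envelope model, not a description of how either core actually behaves: it assumes the full pipeline depth is lost on every mispredict (real cores usually resolve branches before the final stage), and the branch fraction and mispredict rate are made-up illustrative values, not measurements.

```python
# Illustrative assumptions, not measured values.
BRANCH_FRACTION = 0.20   # assumed share of instructions that are branches
MISPREDICT_RATE = 0.10   # assumed share of branches that are mispredicted

A9_PIPELINE_DEPTH = 8    # stages, from the text
A8_PIPELINE_DEPTH = 13   # stages, from the text


def flush_cycles_per_1000_instr(branch_fraction: float,
                                mispredict_rate: float,
                                pipeline_depth: int) -> float:
    """Cycles lost to pipeline flushes per 1000 instructions.

    Simplification: every mispredict costs one full pipeline refill.
    """
    mispredicts = 1000 * branch_fraction * mispredict_rate
    return mispredicts * pipeline_depth


a9_lost = flush_cycles_per_1000_instr(BRANCH_FRACTION, MISPREDICT_RATE,
                                      A9_PIPELINE_DEPTH)
a8_lost = flush_cycles_per_1000_instr(BRANCH_FRACTION, MISPREDICT_RATE,
                                      A8_PIPELINE_DEPTH)

print(f"8-stage pipeline:  ~{a9_lost:.0f} cycles lost per 1000 instructions")
print(f"13-stage pipeline: ~{a8_lost:.0f} cycles lost per 1000 instructions")
```

Under these assumptions the 13-stage design wastes about 13/8 as many cycles on flushes as the 8-stage one, and every flushed instruction is work that burned power for nothing.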

Each core in a Cortex A9 MPCore has its own private L1 instruction and data caches. I'd expect these to be 32KB in size (each) just as they are today on the A8s. The L2 cache is shared by all cores on the SoC. A shared L2 makes sense, especially with a dual-core design. The architecture can scale up to 8MB of L2, but that seems a bit excessive. I'd expect L2 sizes to stay at around 256KB or 512KB. The L2 can run at the CPU's clock speed, or for extremely high clocked versions of the A9 it can run at a divided clock.

What we're seeing is a repetition of the sort of evolution we had in the desktop microprocessor, just on a much smaller scale. The Pentium processor was Intel's last high end in-order chip. The Pentium Pro brought out of order execution into the mix. ARM took that same evolutionary step going from the Cortex A8 to A9.

The world is very different today than it was when the Pentium Pro first came out. Multithreaded code is far more commonplace and thus we see that ARM's first out-of-order processor is also multi-core capable. Technically ARM11 could be used in multi-core environments; it just wasn't (at least not commonly). Even NVIDIA's Tegra 1 used the ARM11 MPCore processor, but only used one core on its SoC. Cortex A9 will change all of that. The first implementations announced by TI as well as NVIDIA are dual-core designs. The next stage in smartphone evolution is enabling usable multitasking through interfaces like what we saw on the Palm Pre. In order to enable good performance in smartphone multitasking you'll need multiple cores.

There is of course a single core version of the Cortex A9. ARM suggests that the single core A9 is a great upgrade path for ARM11 designs. You get full backwards compatibility on code, an extremely small core (most ARM11 designs were 130nm, at 40nm a single A9 core is very space efficient) and much higher performance.

NEON Optional

With the Cortex A8, ARM introduced its own vector FP instruction set called NEON (think of it like ARM's SSE). A8 processors included a NEON core, but with Cortex A9 partners can choose either an ARM FPU or NEON. The FPU based Cortex A9s will most likely be single core implementations designed to be ARM11 replacements. The FPU will be smaller to implement than a full NEON unit and thus save cost/power.

55 Comments

  • strikeback03 - Friday, January 8, 2010 - link

    That would be a gigantic phone. I'd personally like to see something with this kind of processing power (minus the video acceleration) and small enough to use something like a 2.5-3" screen
  • yyrkoon - Thursday, January 7, 2010 - link

    on a smart phone, or a mp3 player . . . hmm am I missing something here ? Is this absolutely necessary ? Personally, I don't think so.

    Also comparing ARM with an Atom processor is like comparing apples to oranges isn't it ? One is x86, the other is not.

    Personally, I would be more interested in seeing how viable nVidias Tegra 2 would be used in other SoC embedded applications, or if nVidia will make derivatives that are more suitable for other than smartphone / mp3 player applications. Based just on the ARM technology, I would have to say these are going to be well suited for any low power application provided they perform well.
  • FaaR - Thursday, January 7, 2010 - link

    1080P decode support on a thing like this is... Well, virtually useless, really.

    When it's been shown that most people can't see the difference between blu-ray video and regular 'ol DVDs even on big-screen TVs, and the vast majority of people just don't see the point of HD video, then what the hell are we going to use this thing for? Watch BR rips on a 3" LCD screen, no I don't think so.

    Plug it in to your big screen TV in the living room? Please. Don't you have a stationary player for that?

    I've no idea who exactly this product is intended for.

    And the dual A9 cores, well, I'm sure they're great - compared to whatever came before them anyway, but dual A9 cores, quad A9 cores or a quadrillion A9 cores doesn't really matter as long as they don't run any really useful software and THEY DON'T. As long as a portable isn't x86 compatible it'll never be more than a toy. Yeah sure, you can "do stuff" with an Iphone or whatever, but it's still just toy apps and it will stay that way until x86 becomes a realistic alternative in the mobile marketspace. Atom is a joke right now, it's slow AND power hungry. Maybe in another 5 years, who knows...
  • Visual - Friday, January 8, 2010 - link

    it is nice if the device can decode the video in real time, even if it doesn't show it in its full resolution. then you don't need to re-encode stuff specially for the device, if size isn't a constraint - like if you are watching it from a network share.

    but the main advantage of tegra 2 isn't just some stupid video decode. its all-round general purpose cpu performance, and 3d acceleration, at very low power usage. the modular design allows it to use as little power as the current usage pattern of the device requires so it will make fantastic handheld game console/phone/media player hybrids

    and you whining that there aren't useful apps is just stupid. x86 isn't the world, you know - properly developed apps can be ported to anything, and once the platform is in people's hands, the apps will be too.
  • FlyTexas - Friday, January 8, 2010 - link

    Huh?

    What are you smoking???

    If you can't tell the very obvious difference between Blu-Ray and DVD video on a large 1080P LCD, then you're blind...

    None of that makes your other point invalid, 1080P isn't needed for a 4" screen, but it is nice that it can do it.
  • GeorgeH - Thursday, January 7, 2010 - link

    Looking at that reference board it doesn't appear that it would be all that difficult to make a mini-ITX Atom alternative. You'd have to run Linux on it, but for an HTPC, NAS, or other low power single-purpose application spending $100+ on a fully featured Windows license is a little bit silly anyway.

    If it is faster than Atom (and better at HTPC-centric video tasks) and can be had for much less than an i3 system (especially if i3 can't be passively cooled) I'd think NVIDIA would be jumping at the chance to show Intel up a bit. Even if the actual marketshare and economic gains were minimal, it seems to me that the "mindshare" gains could be huge.
  • yyrkoon - Thursday, January 7, 2010 - link

    Personally, I do not see it happening. There is a reason why companies like SGI moved from RISC to x86 hardware. However, with that said there is simply no reason why these SoCs could not be used in an external NAS / SAN system with the right software to back it up. x86 has the advantage of running desktop class Windows, even if only for gaming, which is a larger market than most think.

    Still, as a novice embedded designer, I see lots of potential in Tegra 2, but a lot of it would be unnecessary for my, and possibly others', purposes. Smart phones and MP3 players, sure, but not for a lot of other things. Perhaps if the graphics core were CUDA compliant and offered good number crunching performance . . .
  • altarity - Thursday, January 7, 2010 - link

    LOL... I seriously didn't know this when I posted earlier:

    http://blog.boxee.tv/2010/01/07/boxee-box-internal...
  • sprockkets - Thursday, January 7, 2010 - link

    Now if only it wasn't shaped so weird...

    Is that really the final look?
