Cortex A15 Architecture

I want to go deeper into ARM’s Cortex A15 but I’ll have to save that for another time. At a high level you’re looking at a much deeper, much wider architecture than the Cortex A9. The integer pipeline is significantly deeper (15 stages vs. 9 stages), however branch prediction has been improved considerably to hopefully offset the difference.

The front end is 50% wider and has double the instruction fetch bandwidth of the Cortex A9, which helps increase instruction level parallelism. In order to capitalize on the 3-wide machine, ARM dramatically increased the size of the reorder buffer and all associated data structures within the machine. While the Cortex A9 could keep around 32 - 40 decoded instructions in its reorder buffer, Cortex A15 can hold 128 - an increase of up to 4x. The larger ROB alone gives you a good idea of the magnitude of difference between the Cortex A9 and A15. While the former was a natural evolution over the Cortex A8, ARM’s Cortex A15 is really a leap forward both in performance and power consumption - clearly aimed at something much more than just smartphones.

Getting to the execution core, A15 continues the trend of being considerably wider than A9. There are more execution ports and more execution units, all of which help to increase ILP/single threaded performance. ARM went to multiple, independent issue queues in order to keep frequencies high. Each issue queue can accept up to three instructions and all issue queues can dispatch in parallel.

The A15 can execute instructions out of order like the A9, however its abilities grow quite a bit. All FP/NEON instructions had to be executed in-order on Cortex A9, but they can now be executed OoO in the A15. Despite the beefier OoO execution engine, the Cortex A15 can’t reorder all memory operations (independent loads can be executed out of order, but stores can’t be completed ahead of loads).

The Cortex A15 moves back to an integrated L2 cache structure, rather than a separate IP block as was the case with the Cortex A9. L1 and L2 cache latencies remain largely unchanged, although I do believe A15 does see a 1 - 2 cycle penalty over A9 in a few cases. The level 2 TLB and other data structures grow in size considerably in order to feed the hungrier machine.

Although the L1 caches remain the same size as NVIDIA’s Cortex A9 (32KB I + 32KB D), the the L2 cache grows to 2MB. The 2MB L2 is shared by all four cores (the companion core has its own private 512KB L2), and any individual core can occupy up to the entire 2MB space on its own. Alternatively, all four cores can evenly share and access the large L2.

Introduction & Power The Cortex A9 r4p1 & Tegra Clock Speeds
Comments Locked

75 Comments

View All Comments

  • Krysto - Monday, February 25, 2013 - link

    S600 is just a slightly overclocked S4 Pro with the same GPU.

    The real competitor of Tegra 4 will be S800. We'll see if it wins in CPU performance (it might not), and I think there's a high chance it will lose in GPU performance, as Adreno 330 is only 50% faster than Adreno 320 I think, and Tegra 4 is about twice as fast.

    Qualcomm has always had slower graphics performance than Nvidia actually. The only "gap" they found in the market was last fall with the Adreno 320, when Nvidia didn't have anything good to show. But Tegra 3 beat S4 with its Adreno 225.
  • watersb - Monday, February 25, 2013 - link

    I'm amazed at the depth of this NVIDIA data-dump. Brilliant work.

    Anand's observation re: die size, cost strategy, position in the market and how this buys them time to consolidate... Wow.

    Clearly, Nvidia is in this game for the long haul.
  • djgandy - Monday, February 25, 2013 - link

    So OpenGL ES 3.0 doesn't matter, but quad core A15 does? Why do people suck up to Nvidia and their marketing BS so much?

    T4i still single channel memory? What a joke configuration.
  • djgandy - Monday, February 25, 2013 - link

    Also a 9 page article about a mobile SoC without a single reference to the word "battery".
  • varad - Monday, February 25, 2013 - link

    Read the article before you write such comments. The very first page is "Introduction & Power" where they do mention some numbers and their thoughts.
  • djgandy - Tuesday, February 26, 2013 - link

    Yeah its all smoke and mirrors under lab test conditions. Where is the real battery life? Is this not for battery powered devices?
  • Krysto - Monday, February 25, 2013 - link

    Personally, I think all 2013 GPU's should have support for OpenGL ES 3.0 and OpenCL. I was stunned to find out Tegra 4 was not going to support it as they haven't even switched to a unified shader architecture.

    That being said, Anand is probably right that it was the right move for Nvidia, and they are just going to wait for the Maxwell architecture to streamline the same custom ARMv8 CPU from Tegra 5 to Project Denver across product line-ups, and also the same Maxwell GPU cores.

    If that's indeed their plan, then switching Tegra 4 to Kepler this year, only to switch again to Maxwell next year wouldn't have made any sense. GPU architectures barely change even every 2-3 years, let alone 1 year. It wouldn't have been cost effective for them.

    I do hope they aren't going to delay the transition again with Tegra 5 though, and I also do hope they follow Qualcomm's strategy with S4 last year of switching IMEMDIATELY to the 20nm process, instead of continuing on 28nm with Tegra 5, like they did with Tegra 3 on 40nm. But I fear Nvidia will repeat the same mistake.

    If they put Tegra 5 on 20nm, and make it 120mm2 in size, with Maxwell GPU core, I don't think even Apple's A8X will stand against it next year in terms of GPU performance (and of course it will get beaten easily in CPU performance, just like this year).
  • djgandy - Tuesday, February 26, 2013 - link

    Tegra is smaller because it lacks features and also memory bandwidth. The comparison is not really fair to assume you can just throw more shaders at the problem. You'll need wider memory bus for a start. You'll need more TMU's and in the future it's probably smart to have a dedicate ROP unit. Then also are you seriously going to just stick with FP20 and not support ES 3.0 and OpenCL? OEMs see OpenCL as a de facto feature these days, not because it is widely used but because it opens up future possibilities. Nvidia has simply designed an SoC for gaming here.

    Your post focuses on performance, but these are battery powered devices. The primary design goal is efficiency, and it would appear that is why apple went swift and not A15. A15 is just too damn power hungry, even for a tablet.
  • metafor - Tuesday, February 26, 2013 - link

    If the silicon division of Apple were its own business, they'd be in the red. Very few silicon providers can afford to make 120mm^2 chips and still make a profit; let alone one with as little bargaining clout in the mobile space as nVidia.

    Numbers are great but at the end of the day, making money is what matters.
  • milli - Monday, February 25, 2013 - link

    nVidia is trying hard but Tegra still isn't making them any money ...

Log in

Don't have an account? Sign up now