Cortex A15 Architecture

I want to go deeper into ARM’s Cortex A15 but I’ll have to save that for another time. At a high level you’re looking at a much deeper, much wider architecture than the Cortex A9. The integer pipeline is significantly deeper (15 stages vs. 9 stages); however, branch prediction has been improved considerably to hopefully offset the difference.
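To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes a mispredicted branch costs roughly one pipeline's worth of cycles, and the branch fraction and A9 mispredict rate are made-up illustrative values, not ARM figures. Only the 15 and 9 stage depths come from the text above.

```python
# Rough model: cycles per instruction lost to branch mispredictions.
# ASSUMPTION: a mispredict flushes about one full pipeline, so the cost
# in cycles is approximated by the pipeline depth.

A9_DEPTH = 9    # integer pipeline stages (from the article)
A15_DEPTH = 15  # integer pipeline stages (from the article)

BRANCH_FRACTION = 0.20     # ASSUMED: share of instructions that are branches
A9_MISPREDICT_RATE = 0.05  # ASSUMED: illustrative value, not a measured one


def mispredict_cpi_penalty(branch_fraction, mispredict_rate, flush_cycles):
    """Average cycles per instruction lost to mispredicted branches."""
    return branch_fraction * mispredict_rate * flush_cycles


def break_even_mispredict_rate(old_rate, old_depth, new_depth):
    """Mispredict rate at which the deeper pipeline loses no more cycles
    per instruction than the shallower one did."""
    return old_rate * old_depth / new_depth


a9_penalty = mispredict_cpi_penalty(BRANCH_FRACTION, A9_MISPREDICT_RATE, A9_DEPTH)
a15_same_predictor = mispredict_cpi_penalty(BRANCH_FRACTION, A9_MISPREDICT_RATE, A15_DEPTH)
a15_target_rate = break_even_mispredict_rate(A9_MISPREDICT_RATE, A9_DEPTH, A15_DEPTH)

print(f"A9 penalty per instruction:            {a9_penalty:.3f} cycles")
print(f"A15 penalty with an A9-class predictor: {a15_same_predictor:.3f} cycles")
print(f"A15 mispredict rate needed to break even: {a15_target_rate:.3f}")
```

Under these assumptions the break-even mispredict rate scales with the depth ratio (9/15), so the A15's predictor would need to miss about 40% less often than the A9's just to stand still.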

The front end is 50% wider and has double the instruction fetch bandwidth of the Cortex A9, which helps increase instruction level parallelism. In order to capitalize on the 3-wide machine, ARM dramatically increased the size of the reorder buffer and all associated data structures within the machine. While the Cortex A9 could keep around 32 - 40 decoded instructions in its reorder buffer, Cortex A15 can hold 128 - an increase of up to 4x. The larger ROB alone gives you a good idea of the magnitude of difference between the Cortex A9 and A15. While the former was a natural evolution over the Cortex A8, ARM’s Cortex A15 is really a leap forward both in performance and power consumption - clearly aimed at something much more than just smartphones.
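The ratios quoted above are easy to check. The short Python sketch below uses only the figures in the paragraph (a 3-wide A15 front end that is 50% wider than the A9's, 32 to 40 ROB entries on A9, 128 on A15); the variable and function names are mine.

```python
# Figures taken from the paragraph above; names are illustrative only.
A15_FRONT_END_WIDTH = 3        # "3-wide machine"
FRONT_END_WIDTH_INCREASE = 0.5  # "50% wider" than Cortex A9
A9_ROB_ENTRIES = (32, 40)      # "around 32 - 40 decoded instructions"
A15_ROB_ENTRIES = 128


def growth_factor(old_entries, new_entries):
    """How many times larger the new structure is than the old one."""
    return new_entries / old_entries


# A front end that is 50% wider and ends up 3-wide implies a 2-wide A9.
implied_a9_width = A15_FRONT_END_WIDTH / (1 + FRONT_END_WIDTH_INCREASE)

# The ROB growth depends on which end of the A9 range you compare against.
rob_growth_low = growth_factor(max(A9_ROB_ENTRIES), A15_ROB_ENTRIES)
rob_growth_high = growth_factor(min(A9_ROB_ENTRIES), A15_ROB_ENTRIES)

print(f"Implied A9 front-end width: {implied_a9_width:.0f}-wide")
print(f"ROB growth: {rob_growth_low:.1f}x to {rob_growth_high:.1f}x")
```

So "up to 4x" is the comparison against a 32-entry A9 configuration; against a 40-entry one the growth is closer to 3.2x.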

Getting to the execution core, A15 continues the trend of being considerably wider than A9. There are more execution ports and more execution units, all of which help to increase ILP/single threaded performance. ARM went to multiple, independent issue queues in order to keep frequencies high. Each issue queue can accept up to three instructions and all issue queues can dispatch in parallel.

The A15 can execute instructions out of order like the A9, but its abilities grow quite a bit. All FP/NEON instructions had to be executed in-order on Cortex A9, but they can now be executed OoO in the A15. Despite the beefier OoO execution engine, the Cortex A15 can’t reorder all memory operations (independent loads can be executed out of order, but stores can’t be completed ahead of loads).
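The memory ordering restriction is easier to see with a toy example. The Python sketch below is a deliberately simplified model of the one rule stated above, not a description of ARM's actual load/store unit: it treats every operation as independent and only checks that no store completes before a load that precedes it in program order. The function name and the list-based representation are assumptions made for illustration.

```python
def completion_order_allowed(program, completion):
    """Check a proposed completion order against the simplified rule.

    program:    list of "load" / "store" strings in program order.
    completion: list of indices into `program`, in the order they complete.

    Returns False if any store completes before a load that comes earlier
    in program order. Loads are free to pass other loads. Nothing else
    (store-to-store order, address dependencies) is modelled here.
    """
    # Map each operation's program index to its position in completion order.
    completes_at = {op_index: when for when, op_index in enumerate(completion)}

    for store_index, kind in enumerate(program):
        if kind != "store":
            continue
        for earlier_index in range(store_index):
            is_earlier_load = program[earlier_index] == "load"
            if is_earlier_load and completes_at[store_index] < completes_at[earlier_index]:
                return False
    return True


program = ["load", "load", "store"]

# The two independent loads swap places: fine under this rule.
loads_swapped = completion_order_allowed(program, [1, 0, 2])

# The store completes before both earlier loads: not allowed.
store_first = completion_order_allowed(program, [2, 0, 1])

print("loads swapped allowed:", loads_swapped)
print("store first allowed:  ", store_first)
```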

The Cortex A15 moves back to an integrated L2 cache structure, rather than a separate IP block as was the case with the Cortex A9. L1 and L2 cache latencies remain largely unchanged, although I believe the A15 sees a 1 - 2 cycle penalty over A9 in a few cases. The level 2 TLB and other data structures grow in size considerably in order to feed the hungrier machine.

Although the L1 caches remain the same size as NVIDIA’s Cortex A9 (32KB I + 32KB D), the L2 cache grows to 2MB. The 2MB L2 is shared by all four cores (the companion core has its own private 512KB L2), and any individual core can occupy up to the entire 2MB space on its own. Alternatively, all four cores can evenly share and access the large L2.
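As a quick sanity check on what that sharing means per core, the Python sketch below works out the two extremes described above. The numbers come from the paragraph; the helper name is hypothetical, and an "even share" here is just total capacity divided by core count, not a claim that the hardware partitions the cache that way.

```python
SHARED_L2_KB = 2 * 1024  # 2MB L2 shared by the four main cores
MAIN_CORES = 4
COMPANION_L2_KB = 512    # private L2 of the companion core


def l2_footprint_range_kb(total_kb, cores):
    """Return (even share per core, most a single core can occupy) in KB."""
    return total_kb // cores, total_kb


even_share_kb, single_core_max_kb = l2_footprint_range_kb(SHARED_L2_KB, MAIN_CORES)

print(f"Even split across {MAIN_CORES} cores: {even_share_kb}KB each")
print(f"One core with the cache to itself: {single_core_max_kb}KB")
print(f"Companion core private L2: {COMPANION_L2_KB}KB")
```

With all four cores loaded evenly, each main core ends up with the same 512KB of L2 that the companion core has to itself.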

75 Comments

  • tipoo - Sunday, February 24, 2013 - link

    Under 500 in Sunspider, about twice as fast as anything else ARM. But then again, it's a few months newer than that, and actually still not shipping. And as usual with Nvidia they're early to each party (first to dual core, first to quad core), but not always the best performing. We'll see if other Cortex A15 designs beat it.

    I'd love to see four of those cores paired with SGXs upcoming 600/Rogue series.
  • jeffkibuule - Sunday, February 24, 2013 - link

    SunSpider is so software sensitive that a Tegra 3 @ 1.2GHz on Windows RT beats a Snapdragon S4 Pro @ 1.5GHz on Nexus 4 using Chrome. It's a terrible benchmark because it's so dependent on underlying kernel optimizations in the Android phone market.
  • tipoo - Sunday, February 24, 2013 - link

    True, other benchmarks are similarly impressive though.
  • karasaj - Sunday, February 24, 2013 - link

    Psh it has nothing on my desktop! 125ms on sunspider... Nvidia so behind.

    Anyways, still looks impressive. I really want to see some Krait 600/800 benchmarks.
  • tipoo - Sunday, February 24, 2013 - link

    The fact that they're getting well below an order of magnitude slower than desktops is impressive in itself too. Even with iPad 2 level performance I still was reluctant to do most of my web browsing on a tablet for the performance. Maybe with Tegra 4 and beyond hardware speed that will change.
  • Mumrik - Sunday, February 24, 2013 - link

    As someone with heavily tabbed browsing habits, I don't think I'll ever make that jump (and I own a tablet).
  • tipoo - Sunday, February 24, 2013 - link

    Also true, that's my other thing. I like to open a bunch of background tabs and have them ready as I go through each one. Right now, tablets don't do background loading, as far as I know, and if they did they wouldn't be powerful enough to keep the main tab smooth while doing it.
  • Tarwin - Monday, February 25, 2013 - link

    Tablets DO do background loading, as long as they're Android. The only performance issues I've seen are from lack of RAM on my phone and lack of bandwidth on the phone and tablet, but those things affect any computer as well. One observation to be made: they do load in the background, but things like audio and video playback will pause if you switch to another tab.
  • von Krupp - Monday, February 25, 2013 - link

    Even Windows Phone 7.5 and 8 do background loading. I haven't used it, but I'd wager that RT does as well, if even the gimpy mobile OS can.
  • tuxRoller - Sunday, February 24, 2013 - link

    As someone who had, until recently, over 40 tabs open on my chrome browser (Nexus 4), the critical problem has been memory. With enough memory, and good enough task management, these problems tend to go away.
    Of course, maybe you are that 0.00001% who has hundreds or thousands of tabs open, in which case I pity any computer you are likely to own.
