NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

Name: NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On
Item: NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

by Anand Lal Shimpi & Brian Klug on February 24, 2013 3:00 PM EST

75 Comments | Add A Comment

75 Comments

Cortex A15 Architecture

I want to go deeper into ARM’s Cortex A15 but I’ll have to save that for another time. At a high level you’re looking at a much deeper, much wider architecture than the Cortex A9. The integer pipeline is significantly deeper (15 stages vs. 9 stages), however branch prediction has been improved considerably to hopefully offset the difference.

The front end is 50% wider and has double the instruction fetch bandwidth of the Cortex A9, which helps increase instruction level parallelism. In order to capitalize on the 3-wide machine, ARM dramatically increased the size of the reorder buffer and all associated data structures within the machine. While the Cortex A9 could keep around 32 - 40 decoded instructions in its reorder buffer, Cortex A15 can hold 128 - an increase of up to 4x. The larger ROB alone gives you a good idea of the magnitude of difference between the Cortex A9 and A15. While the former was a natural evolution over the Cortex A8, ARM’s Cortex A15 is really a leap forward both in performance and power consumption - clearly aimed at something much more than just smartphones.

Getting to the execution core, A15 continues the trend of being considerably wider than A9. There are more execution ports and more execution units, all of which help to increase ILP/single threaded performance. ARM went to multiple, independent issue queues in order to keep frequencies high. Each issue queue can accept up to three instructions and all issue queues can dispatch in parallel.

The A15 can execute instructions out of order like the A9, however its abilities grow quite a bit. All FP/NEON instructions had to be executed in-order on Cortex A9, but they can now be executed OoO in the A15. Despite the beefier OoO execution engine, the Cortex A15 can’t reorder all memory operations (independent loads can be executed out of order, but stores can’t be completed ahead of loads).

The Cortex A15 moves back to an integrated L2 cache structure, rather than a separate IP block as was the case with the Cortex A9. L1 and L2 cache latencies remain largely unchanged, although I do believe A15 does see a 1 - 2 cycle penalty over A9 in a few cases. The level 2 TLB and other data structures grow in size considerably in order to feed the hungrier machine.

Although the L1 caches remain the same size as NVIDIA’s Cortex A9 (32KB I + 32KB D), the the L2 cache grows to 2MB. The 2MB L2 is shared by all four cores (the companion core has its own private 512KB L2), and any individual core can occupy up to the entire 2MB space on its own. Alternatively, all four cores can evenly share and access the large L2.

Introduction & Power The Cortex A9 r4p1 & Tegra Clock Speeds

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

75 Comments

View All Comments

klmccaughey - Sunday, February 24, 2013 - link
Definitely. All good for us too! :)
twotwotwo - Sunday, February 24, 2013 - link
> In the PC industry we learned that there’s no real downside to quad-core as long as you can power gate individual cores, and turbo up to higher frequencies when fewer than four cores are active, there’s no real tradeoff other than cost.

I'm not completely sure, because there are always other possible uses for die area.

You could do the big/little thing with A7 'companion' cores, like Samsung. You could use even more area for GPU, like Apple. Wiki suggests you could double the L2 cache to 4MB (though more cache would always be eating power, even with only one core turned on).

But in favor of quad-core: software might start using cores a little more effectively w/time--Google and Apple are apparently trying to make WebKit able to do things like HTML parsing and JavaScript garbage collection in the background, and Microsoft's browser team backgrounds JavaScript compilation. And the other uses of space are also only sort-of useful, and cores (like GHz) are handy for marketing. I can't say I know what the right tradeoff for NVidia is, only that there were were other seemingly-interesting options.
guidryp - Sunday, February 24, 2013 - link
"there are always other possible uses for die area"

Yes, in the case of Tegra 3, they could certainly have used extra GPU power more than 4 CPU cores. But they seem to have remedied that this time.
twotwotwo - Monday, February 25, 2013 - link
Def possible, and what they disclosed in this presentation would suggest they've handled it.

All that's working against them, GPU-wise, is that user expectations increased since last gen, and Mali/PowerVR improved. So now T4i needs to drive 1080p phone screens and T4 needs to drive screens like the Nexus 10's, if they want to be the most bleeding-edge, anyway.

But they did talk about large integer-factor improvements in the GPU, so maybe they haven't merely built the GPU that would've been nice to have last gen, but moved up enough to be great this gen.
sosadsohappy - Sunday, February 24, 2013 - link
Samsung has just said it is doing A15-A7 pairing. Saying out the future plans just to keep the crowd excited is not new. That does not rule out the possibility of Qualcomm or Nvidia going for similar big.LITTLE designs. They are for the next-gen I would think. (Tell me if I am wrong but have anyone sampled big.LITTLE based SoC yet?)

And talking about die area, what is impressive about Nvidia is how their chips are always smaller. Quad-core A15 is about 80mm^2 while you can check for the sizes of Qualcomm's or Apple's chips! FWIW Apple's are not in 28nm but still they don't scale equally.

I am excited to see the 60mm^2 (right?) chip (Tegra4i). If it is what they claim, it should have great battery life for a smartphone.
s44 - Monday, February 25, 2013 - link
4+1 is Nvidia's version of big.LITTLE. The 1 low-power A15 is about the same die space as the 4 A7s on the next Exynos...
sosadsohappy - Monday, February 25, 2013 - link
Yes. The only difference is that the big.LITTLE will sport different architectures on the big and LITTLE while NV's 4+1 will have the same arch (A15 for both).

And personally I think 4+1 is better as of now until we have Atlas and Apollo combination of big.LITTLE because (correct me if I'm wrong) A7 does not have as much of memory parallelism, it is to weak as well...

No matter what, it has been impressive that Nvidia chips have significantly lower die size than the competition's dual-core chips!
Krysto - Monday, February 25, 2013 - link
Too weak? For what? Receiving notifications? We'll see if Tegra 4 is more energy efficient than Samsung's Exynos 5 Octa later this year. Then we might get a better idea whether Nvidia or ARM's implementation is better.

And I agree. Nvidia managed to have the same graphics performance + a quad core Cortex A15 CPU in 80mm2 vs Apple with a dual core CPU and same graphics performance in 120 mm2. That's pretty impressive, even if it arrives half a year late.

I still wish Nvidia would actually want to compete at the high-end though, with a 120mm2 chip, and beat Apple. It annoys me that they are still trying to build only "good enough for most people" chips. They should be trying to be the king of mobile graphics. They are freaking Nvidia, and they can't even beat a mobile GPU maker? Come on, Nvidia.
name99 - Monday, February 25, 2013 - link
> In the PC industry we learned that there’s no real downside to quad-core as long as you can power gate individual cores, and turbo up to higher frequencies when fewer than four cores are active, there’s no real tradeoff other than cost.

Sony Ericsson recently released a paper claiming this was not true, even apart from the die area issues. In particular they claimed that with current technology, coupling capacitance, ground plane issues, communication (with the L2, including coherence) and suchlike, quad-core imposed something like a 25% reduction in peak MHz possible for two cores, compared to those same two cores isolated rather than on a quad-core die.

Now obviously any company publication is talking up its book, but I imagine they're not going to make a statement that is blatantly false in a technical publication, implying there is some truth to what they say.
Wilco1 - Wednesday, February 27, 2013 - link
Given Tegra 4i achieves 2.3GHz in a quad core with shared L2, way more than Krait which uses per-CPU L2, I think the claim that a shared L2 is clock limiting seems more marketing than substance.

NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

Cortex A15 Architecture

Post Your Comment

75 Comments

View All Comments

klmccaughey - Sunday, February 24, 2013 - link

twotwotwo - Sunday, February 24, 2013 - link

guidryp - Sunday, February 24, 2013 - link

twotwotwo - Monday, February 25, 2013 - link

sosadsohappy - Sunday, February 24, 2013 - link

s44 - Monday, February 25, 2013 - link

sosadsohappy - Monday, February 25, 2013 - link

Krysto - Monday, February 25, 2013 - link

name99 - Monday, February 25, 2013 - link

Wilco1 - Wednesday, February 27, 2013 - link

Log in

Don't have an account? Sign up now