The GPU: Apple's Gift to Game Developers

The GPU side of the A5 is really what's most exciting. As we mentioned in our iPad 2 GPU Performance analysis, the A5 includes a dual-core PowerVR SGX 543 - also known as the SGX 543MP2. In our earlier article we showed the SGX 543MP2 easily beating both an iPad 1 and the Tegra 2 based Motorola Xoom.

To understand why the SGX 543MP2 has such a performance advantage we need to first remember that NVIDIA's Tegra 2 is nearly a year late. NVIDIA's first competitive ultra mobile GPU was supposed to be shipping in products in the first half of 2010; instead it found itself shipping in 2011. While NVIDIA is good at designing GPUs, it's not so good that it can release a product and maintain a two-year performance advantage over the competition. Let's look at the architecture, shall we?

NVIDIA's Tegra 2 features a DirectX 9-class GPU. NVIDIA used to call it the GeForce ULP (Ultra Low Power) but now it's just GeForce. As a DX9 class GPU we're dealing with a conventional, non-unified shader architecture. While all OpenGL ES 2.0 GPUs can execute pixel and vertex shader instructions, the GeForce in Tegra 2 runs pixel and vertex shaders on separate groups of hardware.

NVIDIA calls each pixel and vertex shader ALU a core. The Tegra 2 has four pixel shader cores and four vertex shader cores. The four pixel shader ALUs make up a single Vec4 and the same goes for the four vertex shader ALUs. NVIDIA wouldn't elaborate on what limitations exist when dispatching operations to the cores. All pixel shader operations happen at 20 bits per component of precision while all vertex shader operations happen at 32 bits per component.

Each core is capable of executing one multiply+add (MAD) operation per clock. Do the math and that works out to be a peak rate of 8 MADs per clock for the entire GPU. The maximum operating frequency for the Tegra 2 GeForce GPU is 300MHz, however device vendors may run the GPU at a lower frequency to save on power. At 300MHz this works out to be 4.8 GFLOPS (counting a MAD as two FLOPs).
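To make that arithmetic explicit, here is a minimal sketch in Python. The helper name `peak_gflops` is ours and purely illustrative; it simply assumes every core issues one MAD on every clock, which is a theoretical ceiling rather than something real shader code will sustain.

```python
def peak_gflops(mads_per_clock: int, clock_mhz: float) -> float:
    """Theoretical peak GFLOPS, counting each multiply+add (MAD) as two FLOPs."""
    flops_per_clock = mads_per_clock * 2
    # MHz -> GHz, so the result is in billions of FLOPs per second.
    return flops_per_clock * clock_mhz / 1000.0


# Tegra 2 GeForce: 4 pixel shader cores + 4 vertex shader cores,
# each capable of 1 MAD per clock.
tegra2_mads = (4 + 4) * 1
tegra2_peak = peak_gflops(tegra2_mads, 300)

print(f"Tegra 2: {tegra2_mads} MADs/clock, {tegra2_peak:.1f} GFLOPS at 300MHz")
```

A vendor running the GPU below 300MHz scales the result linearly, so the same call with a lower clock gives the reduced peak.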

Imagination Technologies' PowerVR SGX 543MP2 is fundamentally a bigger GPU than the GeForce in NVIDIA's Tegra 2. Let's go through the math.

The SGX 543 features four USSE2 pipes. This is a unified shader architecture, so both vertex and pixel shader code run on the same set of hardware. The benefit of this approach is better performance in peaky situations, where you're running mostly vertex or mostly pixel shader code rather than a balance that's perfectly tailored to your architecture. The Tegra 2 will only run at peak efficiency if it encounters a mix of 50% vertex and 50% pixel shader code. The PowerVR SGX series, by contrast, doesn't leave execution pipes idle simply because the instruction mix is lopsided.
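A rough way to see the cost of a lopsided mix is the toy model below. It is our own simplification, not anything NVIDIA or Imagination has published: it assumes shader work divides perfectly, that the vertex and pixel pools run in parallel, and it ignores texturing, memory stalls and scheduling entirely. The function name `split_utilization` is hypothetical.

```python
def split_utilization(vertex_fraction: float,
                      vertex_units: int = 4,
                      pixel_units: int = 4) -> float:
    """Fraction of ALU capacity kept busy on a split (non-unified) design.

    vertex_fraction is the share of total shader work that is vertex work;
    the remainder is pixel work. A unified design scores 1.0 in this model
    for any mix, because every unit can take either kind of work.
    """
    pixel_fraction = 1.0 - vertex_fraction
    # Each pool works through its own share; the slower pool sets the pace.
    vertex_time = vertex_fraction / vertex_units
    pixel_time = pixel_fraction / pixel_units
    frame_time = max(vertex_time, pixel_time)
    # Time the same work would take if all units could share it.
    ideal_time = 1.0 / (vertex_units + pixel_units)
    return ideal_time / frame_time


for vf in (0.5, 0.25, 0.1, 0.0):
    print(f"{vf:4.0%} vertex work -> {split_utilization(vf):.0%} of peak")
```

Under these assumptions a 4+4 split only reaches its peak at a 50/50 mix, and falls to half of peak when the workload is all pixel or all vertex shading.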

Each USSE2 pipe has a 4-wide vector ALU capable of cranking out 4 MADs per clock. Two of these pipes are enough to equal the peak throughput of what NVIDIA built in Tegra 2, but the PowerVR SGX 543 has four of them. As for the MP2? Go ahead and double that number again. The SGX 543MP2 is simply two 543s placed next to one another.

All of this works out to be 16 MADs per clock for the SGX 543 and 32 MADs per clock for the SGX 543MP2. At 200MHz that's 12.8 GFLOPS and at 250MHz we're talking about 16 GFLOPS.

Mobile SoC GPU Comparison

| | PowerVR SGX 530 | PowerVR SGX 535 | PowerVR SGX 540 | PowerVR SGX 543 | PowerVR SGX 543MP2 | GeForce ULP | Kal-El GeForce |
|---|---|---|---|---|---|---|---|
| SIMD Name | USSE | USSE | USSE | USSE2 | USSE2 | Core | Core |
| # of SIMDs | 2 | 2 | 4 | 4 | 8 | 8 | 12 |
| MADs per SIMD | 2 | 2 | 2 | 4 | 4 | 1 | ? |
| Total MADs | 4 | 4 | 8 | 16 | 32 | 8 | ? |
| GFLOPS @ 200MHz | 1.6 | 1.6 | 3.2 | 6.4 | 12.8 | 3.2 | ? |
| GFLOPS @ 300MHz | 2.4 | 2.4 | 4.8 | 9.6 | 19.2 | 4.8 | ? |

At its lowest expected clock speed, the 543MP2 already has over twice the compute power of the Tegra 2's GPU at its highest operating frequency. Take into account that the A5 likely has more memory bandwidth than Tegra 2, and that the SGX 543MP2 is a tile-based architecture with lower bandwidth requirements, and the performance numbers we talked about last time shouldn't be all that surprising.
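The comparison table and the "over twice" claim can be re-derived from the SIMD counts alone. The sketch below is illustrative only: the `GPUS` list and the `gflops` helper are our own names, and Kal-El is left out because its per-core throughput isn't known.

```python
# (name, number of SIMDs, MADs per SIMD per clock), as listed in the table.
GPUS = [
    ("PowerVR SGX 530", 2, 2),
    ("PowerVR SGX 535", 2, 2),
    ("PowerVR SGX 540", 4, 2),
    ("PowerVR SGX 543", 4, 4),
    ("PowerVR SGX 543MP2", 8, 4),
    ("GeForce ULP", 8, 1),
]


def gflops(simds: int, mads_per_simd: int, clock_mhz: float) -> float:
    """Theoretical peak GFLOPS, counting one MAD as two FLOPs."""
    return simds * mads_per_simd * 2 * clock_mhz / 1000.0


for name, simds, mads in GPUS:
    print(f"{name:<20} {simds * mads:>2} MADs/clock  "
          f"{gflops(simds, mads, 200):5.1f} @ 200MHz  "
          f"{gflops(simds, mads, 300):5.1f} @ 300MHz")

# SGX 543MP2 at its lowest expected clock vs. Tegra 2 at its highest.
mp2_low = gflops(8, 4, 200)
tegra2_high = gflops(8, 1, 300)
ratio = mp2_low / tegra2_high
print(f"543MP2 @ 200MHz vs. Tegra 2 @ 300MHz: {ratio:.2f}x")
```

These are peak figures only; they say nothing about memory bandwidth or the tile-based rendering advantage mentioned above.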

The real competition for the SGX 543MP2 will be NVIDIA's Kal-El. That part is expected to ship on time and will feature a boost in core count: from 8 to 12. The ratio of pixel to vertex shader cores is not known at this point but I'm guessing it won't be balanced anymore. NVIDIA is promising 3x the GPU performance out of Kal-El so I suspect that we'll see an increase in throughput per core.

GPU Performance

Taken from our iPad 2 GPU Performance Preview:

As always we turn to GLBenchmark 2.0, a benchmark crafted by a bunch of developers who either have or had experience doing development work for some of the big dev houses in the industry. We'll start with some of the synthetics.

Over the course of PC gaming evolution we noticed a significant increase in geometry complexity. We'll likely see a similar evolution with games in the ultra mobile space, and as a result this next round of ultra mobile GPUs will seriously ramp up geometry performance.

Here we look at two different geometry tests amounting to the (almost) best and worst case triangle throughput measured by GLBenchmark 2.0. First we have the best case scenario - a textured triangle:

Geometry Throughput - Textured Triangle Test

The original iPad could manage 8.7 million triangles per second in this test. The iPad 2? 29 million. An increase of over 3x. Developers with existing titles on the iPad could conceivably triple geometry complexity with no impact on performance on the iPad 2.

Now for the more complex case - a fragment lit triangle test:

Geometry Throughput - Fragment Lit Triangle Test

The performance gap widens. While the PowerVR SGX 535 in the A4 could barely break 4 million triangles per second in this test, the PowerVR SGX 543MP2 in the A5 manages just under 20 million. There's just no competition here.

I mentioned an improvement in texturing performance earlier. The GLBenchmark texture fetch test puts numbers to that statement:

Fill Rate - Texture Fetch

We're talking about nearly a 5x increase in texture fetch performance. This has to be due to more than an increase in the amount of texturing hardware. An improvement in throughput? Increase in memory bandwidth? It's tough to say without knowing more at this point.

Apple iPad vs. iPad 2

| Test | Apple iPad (PowerVR SGX 535) | Apple iPad 2 (PowerVR SGX 543MP2) |
|---|---|---|
| Array test - uniform array access | 3412.4 kVertex/s | 3864.0 kVertex/s |
| Branching test - balanced | 2002.2 kShaders/s | 11412.4 kShaders/s |
| Branching test - fragment weighted | 5784.3 kFragments/s | 22402.6 kFragments/s |
| Branching test - vertex weighted | 3905.9 kVertex/s | 3870.6 kVertex/s |
| Common test - balanced | 1025.3 kShaders/s | 4092.5 kShaders/s |
| Common test - fragment weighted | 1603.7 kFragments/s | 3708.2 kFragments/s |
| Common test - vertex weighted | 1516.6 kVertex/s | 3714.0 kVertex/s |
| Geometric test - balanced | 1276.2 kShaders/s | 6238.4 kShaders/s |
| Geometric test - fragment weighted | 2000.6 kFragments/s | 6382.0 kFragments/s |
| Geometric test - vertex weighted | 1921.5 kVertex/s | 3780.9 kVertex/s |
| Exponential test - balanced | 2013.2 kShaders/s | 11758.0 kShaders/s |
| Exponential test - fragment weighted | 3632.3 kFragments/s | 11151.8 kFragments/s |
| Exponential test - vertex weighted | 3118.1 kVertex/s | 3634.1 kVertex/s |
| Fill test - texture fetch | 179116.2 kTexels/s | 890077.6 kTexels/s |
| For loop test - balanced | 1295.1 kShaders/s | 3719.1 kShaders/s |
| For loop test - fragment weighted | 1777.3 kFragments/s | 6182.8 kFragments/s |
| For loop test - vertex weighted | 1418.3 kVertex/s | 3813.5 kVertex/s |
| Triangle test - textured | 8691.5 kTriangles/s | 29019.9 kTriangles/s |
| Triangle test - textured, fragment lit | 4084.9 kTriangles/s | 19695.8 kTriangles/s |
| Triangle test - textured, vertex lit | 6912.4 kTriangles/s | 20907.1 kTriangles/s |
| Triangle test - white | 9621.7 kTriangles/s | 29771.1 kTriangles/s |
| Trigonometric test - balanced | 1292.6 kShaders/s | 3249.9 kShaders/s |
| Trigonometric test - fragment weighted | 1103.9 kFragments/s | 3502.5 kFragments/s |
| Trigonometric test - vertex weighted | 1018.8 kVertex/s | 3091.7 kVertex/s |
| Swapbuffer Speed | 600 | 599 |
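The multipliers quoted in the text fall straight out of the table above. As a quick sanity check, the sketch below divides the iPad 2 result by the original iPad result for a handful of rows; the `RESULTS` dictionary is just our hand-copied subset of the table, not an export from GLBenchmark itself.

```python
# test name -> (original iPad, iPad 2), copied from the table above.
RESULTS = {
    "Triangle test - textured": (8691.5, 29019.9),
    "Triangle test - textured, fragment lit": (4084.9, 19695.8),
    "Fill test - texture fetch": (179116.2, 890077.6),
    "Branching test - vertex weighted": (3905.9, 3870.6),
}

# Speedup is simply the iPad 2 score divided by the original iPad score.
speedups = {name: ipad2 / ipad1 for name, (ipad1, ipad2) in RESULTS.items()}

for name, factor in speedups.items():
    print(f"{name:<40} {factor:.2f}x")
```

Not every row scales: the vertex weighted branching test comes out essentially flat between the two devices, so the gains are not uniform across the synthetic suite.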

Enough with the synthetics - how much of an improvement does all of this yield in the actual GLBenchmark 2.0 game tests? Oh it's big.

GLBenchmark 2.0 Egypt

Without AA, the Egypt test runs at 5.4x the frame rate of the original iPad. It's even 3.7x the speed of the Tegra 2 in the Xoom running at 1280 x 800 (granted that's an iOS vs. Android comparison as well).

GLBenchmark 2.0 Egypt - FSAA

With AA enabled the iPad 2 advantage grows to 7x. In a game with the complexity of the Egypt test the original iPad wouldn't be remotely playable while the iPad 2 could run it smoothly.

The Pro test is a little more reasonable, showing a 3 - 4x increase in performance compared to the original iPad:

GLBenchmark 2.0 PRO

GLBenchmark 2.0 PRO - FSAA

While we weren't able to reach the 9x figure claimed by Apple (I'm not sure that you'll ever see 9x running real game code), a range of 3 - 7x in GLBenchmark 2.0 is more reasonable. In practice I'd expect something less than 5x but that's nothing to complain about.

189 Comments

  • JarredWalton - Sunday, March 20, 2011

    Considering the source (ARMflix), you need to take that video with a huge grain of salt. It looks like they're running some Linux variant on the two systems (maybe Chromium?), and while the build may be the same, that doesn't mean it's optimized equally well for Atom vs. A9.

    Single-core Atom at 1.6GHz vs. dual-core A9 at 500MHz surfing the web is fine and all, but when we discuss Atom being faster than A9 we're talking about raw performance potential. A properly optimized web browser and OS experience with high-speed Internet should be good on just about any modern platform. Throw in some video playback as well, give us something more than a script of web pages in a browser, etc.

    Now, none of this means ARM's A9 is bad, but to show that it's as fast as Atom when browsing some web pages is potentially meaningless. What we really need to know is what one platform can do well that the other can't handle properly. Where does A9 fall flat? Where does Atom stumble?

    For me, right now, Atom sucks at anything video related. Sorry, but YouTube and Hulu are pretty important tools for me. That also means iOS has some concerns, as it doesn't support Flash at all, and there are enough places where Flash is still used that it creates issues. Luckily, I have plenty of other devices for accessing the web. In the end, I mostly play Angry Birds on my iPod Touch while I'm waiting for someone. :-)
  • Wilco1 - Sunday, March 20, 2011

    The article is indeed wrong to suggest that the A9 has only half the performance of an Atom. There are cases where a netbook with a single core Atom might be faster, for example if it runs at a much higher frequency, uses hyperthreading, and has a fast DDR3 memory system. However, in terms of raw CPU performance the out-of-order A9 is significantly faster than the in-order Atom. Benchmark results such as CoreMark confirm this: a single core Atom cannot beat an A9 at the same frequency, even with hyperthreading. So it would be good to clarify that netbooks are faster because they use higher frequency CPUs and a faster memory system - as well as a larger battery...
  • somata - Sunday, March 27, 2011

    CoreMark is nearly as meaningless as MIPS. Right now the best cross-platform benchmark we have is Geekbench. It uses portable, multi-threaded, native code to perform real tasks. My experience with Geekbench on the Mac/PC over the years indicates that Geekbench scores correlate pretty well to average application performance (determined by my personal suite of app benchmarks). Of course there will be outliers, but Geekbench does a pretty good job at representing typical code.

    Given that, the fact that a single-core 1.6GHz Atom (with HT) scores about 28% higher than the iPad's dual-core 1GHz A9s in the integer suite leaves me little doubt that the Atom, despite being in-order, has as good or better per-clock performance than the A9s.

    Even the oft-maligned PowerPC G4 totally outclasses the dual A9s, with 43% better integer performance at 1.42GHz... and that's just with a single core competing against two!
  • tcool93 - Sunday, March 20, 2011

    Tablets do have their advantages despite what the article claims. For one thing, their battery life far outlives any Netbook or Notebook. They also run a lot cooler, unlike Notebooks and Netbooks, which you can fry an egg on. Maybe they aren't as portable as a phone, but who wants to look at the super tiny print on a phone?

    Tablets don't replace computers, and never will. They are nice to sit in bed with at night and browse the web or read books on, or play a simple game on. Anything that doesn't require a lot of typing.

    Even a 10" tablet screen isn't real big to read text, but it's MUCH easier to zoom in on text to read it with tablets. Unlike any Notebook/Netbook, where it's a huge pain to zoom in.
  • tcool93 - Sunday, March 20, 2011

    I do think the benchmarks shown here show that there is quite an improvement over the iPad 1, despite what many seem to claim about there not being much of an upgrade.
  • secretmanofagent - Sunday, March 20, 2011

    Anand,
    Appreciate the article, and appreciating that you're responding to the readers as well. All three of you said that it didn't integrate into your workflow, and I have a similar problem (which has prevented me from purchasing one). One thing I'm very curious about: What is your opinion on what would have been the Courier concept? Do you feel that is the direction that tablets should have taken, or do you think that Apple's refining as opposed to paradigming is the way to go?
  • VivekGowri - Sunday, March 20, 2011

    I still despise Microsoft for killing the Courier project. Honestly, I'd have loved to see the tablet market go that direction - a lot more focused on content creation instead of a very consumption-centric device like the iPad. A $400-500 device running that UI, an ARM processor, and OneNote syncing ability would have sold like hotcakes to students. If only...
  • tipoo - Sunday, March 20, 2011

    Me too, the Courier looked amazing. They cancel that, yet go ahead with something like the Kin? Hard to imagine where their heads are at.
  • Anand Lal Shimpi - Monday, March 21, 2011

    While I've seen the Courier video, and it definitely looked impressive, it's tough to say how that would've worked in practice.

    I feel like there are performance limitations that are at work here. Even though a pair of A9s are quick, they are by no means fast enough. I feel like as a result, evolutionary refinement is the only way to go about getting to where we need to be. Along the way Apple (and its competitors) can pick up early adopters to help fund the progress.

    I'm really curious to see which company gets the gaming side of it down. Clearly that's a huge market.

    Take care,
    Anand
  • Azethoth - Monday, March 21, 2011

    Gaming side is a good question. Apple will have an advantage there due to limited hardware specs to code to. They are a lot more like a traditional console that way vs Android which will be anything but.

    Are actual game controls like in the PSP phone necessary?

    I am also curious what additional UI tech will eventually make it to the pad space:
    * Speech, although it is forever not there yet.
    * 3D maybe if it's not a fad (glasses free)
    * Some form of the Kinect maybe to manipulate the 3D stuff and do magical Kinect gestures and incantations we haven't dreamed up yet.
    * Haptic as mentioned earlier in the thread.

    Speech could make a pad suitable for hip bloggers like the AnandTech posse.
