Technical Comparisons

As has quickly become tradition for us, to close out our look at the Midgard architecture we want to spend a bit of time comparing it to other SoC GPU architectures. As this is not a performance or benchmark article we aren’t going to dwell on the subject too much, but we find it’s helpful to get a high-level overview of theoretical performance.

To do this we’ll take a quick look at theoretical performance for FP32 FLOPS, along with pixel and texel throughput. As this is a purely theoretical comparison it doesn’t (and can’t) take into account architectural efficiency, nor can it take into account real-world clockspeeds. Nonetheless it gives us something of a baseline.

To that end we asked ARM what a reasonable high-end Mali-T760 configuration might look like. T760 can scale up to 16 shader cores, but as we’ve seen in these scalable designs it’s very rare for anyone to build a SoC that actually takes the number of cores up to the architecture’s limit. And since T760 was only released to customers back in October of 2013, there are only a handful of designs announced so far and none of them are particularly high-end. With that in mind, ARM suggested that a Mali-T760 MP10 would be a reasonable approximation of a high-end shipping configuration, so that is what we’ve gone with.

GPU Specification Comparison

| | NVIDIA K1 | Imagination PVR GX6650 | ARM Mali-T760 MP10 | AMD A4-1350 |
|---|---|---|---|---|
| FP32 ALUs | 192 | 192 | 100 | 128 |
| FP32 FLOPs/Clock | 384 | 384 | 200 (340) | 256 |
| Pixels/Clock (ROPs) | 4 | 12 | 10 | 4 |
| Texels/Clock | 8 | 12 | 10 | 8 |
| GFLOPS @ 300MHz | 115.2 | 115.2 | 60 (102) | 76.8 |
| Architecture | Kepler | Rogue (6XT) | Midgard (T700) | GCN 1.1 |

Briefly, we can see that as far as theoretical shading performance is concerned, our hypothetical Mali-T760 MP10 would push 60 GFLOPS at 300MHz when counting MADs (20 FLOPS/clock/core). Or when using ARM’s preferred metric of MAD plus a dot product (34 FLOPS/clock/core) this becomes 102 GFLOPS. How you count ends up being important here, as it means the theoretical throughput of the T760MP10 is either close to something like AMD’s A4-1350, or close to the very high end configurations that NVIDIA and Imagination will be peddling.
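The arithmetic behind these figures is simple enough to sketch in a few lines of Python. This is only an illustration of how the theoretical numbers fall out of the per-clock rates; the function names are ours, and the 300MHz clock is the normalization point used for this comparison rather than a real shipping clockspeed.

```python
# A minimal sketch of the theoretical-throughput arithmetic.
# Function names are hypothetical; 300 MHz is only the normalization
# clock for this comparison, not a real-world clockspeed.

def flops_per_clock(cores: int, flops_per_clock_per_core: float) -> float:
    """Total FP32 FLOPs per clock for a multi-core GPU configuration."""
    return cores * flops_per_clock_per_core


def theoretical_gflops(total_flops_per_clock: float, clock_mhz: float) -> float:
    """Theoretical GFLOPS = FLOPs/clock * clock (MHz) / 1000."""
    return total_flops_per_clock * clock_mhz / 1000.0


CLOCK_MHZ = 300

# Mali-T760 MP10: 10 shader cores, counted two different ways.
t760_mad = theoretical_gflops(flops_per_clock(10, 20), CLOCK_MHZ)      # MADs only
t760_mad_dot = theoretical_gflops(flops_per_clock(10, 34), CLOCK_MHZ)  # MAD + dot product

# The other configurations, using their FP32 FLOPs/clock totals.
k1 = theoretical_gflops(384, CLOCK_MHZ)
gx6650 = theoretical_gflops(384, CLOCK_MHZ)
a4_1350 = theoretical_gflops(256, CLOCK_MHZ)

for name, value in [
    ("Mali-T760 MP10 (MAD)", t760_mad),
    ("Mali-T760 MP10 (MAD + dot)", t760_mad_dot),
    ("NVIDIA K1", k1),
    ("PVR GX6650", gx6650),
    ("AMD A4-1350", a4_1350),
]:
    print(f"{name}: {value:.1f} GFLOPS @ {CLOCK_MHZ}MHz")
```

Changing only the per-core counting convention (20 versus 34 FLOPS/clock/core) is what moves the T760MP10 from A4-1350 territory to near the K1 and GX6650.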

On the other hand T760MP10’s pixel and texel throughput looks very good, exceeding both our AMD and NVIDIA configurations on both metrics and, at 10 pixels per clock versus 4, more than doubling their pixel throughput. Pixel throughput is going to be especially important going forward as these SoCs get paired with increasingly high resolution displays. The TV industry has in recent years become a big SoC consumer and 4K TVs are growing in popularity, so being able to push a lot of pixels will in turn be helpful for driving such displays. However ARM’s efficiency technologies such as Transaction Elimination and AFBC will also have to play a big part here, as writing that many pixels per clock raw would consume a large amount of memory bandwidth, something SoCs rarely have to spare.
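To put a rough number on that bandwidth concern, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are ours rather than ARM’s: uncompressed 32-bit (4 byte) pixels, the same 300MHz normalization clock used in the table, and every pixel pipeline writing on every clock. The function names are hypothetical.

```python
# Back-of-the-envelope estimate of raw (uncompressed) pixel write bandwidth.
# Assumptions, not vendor figures: 4 bytes per pixel, a 300 MHz clock,
# and every pixel pipeline writing on every clock.

def raw_write_bandwidth_gbs(pixels_per_clock: int, clock_mhz: float,
                            bytes_per_pixel: int = 4) -> float:
    """Peak bytes written per second, in GB/s (1 GB = 1e9 bytes)."""
    return pixels_per_clock * clock_mhz * 1e6 * bytes_per_pixel / 1e9


def display_write_bandwidth_gbs(width: int, height: int, refresh_hz: float,
                                bytes_per_pixel: int = 4) -> float:
    """Bandwidth to write every pixel of a display once per refresh, in GB/s."""
    return width * height * refresh_hz * bytes_per_pixel / 1e9


# T760 MP10 at 10 pixels/clock, writing flat out.
peak = raw_write_bandwidth_gbs(pixels_per_clock=10, clock_mhz=300)

# A single full-screen pass of a 4K (3840x2160) display at 60 Hz.
one_pass_4k = display_write_bandwidth_gbs(3840, 2160, 60)

print(f"Peak raw pixel writes: {peak:.1f} GB/s")
print(f"One 4K60 full-screen pass: {one_pass_4k:.2f} GB/s")
```

Under these assumptions the peak raw write rate works out to 12 GB/s, and real scenes touch each pixel more than once per frame, which is why skipping redundant writes and compressing the frame buffer matter so much on bandwidth-constrained SoCs.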

Final Words

With apologies in advance to ARM, wrapping up this article the first thing that comes to mind is something we wrote when looking at Imagination’s Rogue architecture earlier this year: “So it’s with some hope and a bit of luck that this might get the ball rolling with the other SoC GPU vendors, getting them to open up their doors a bit more so that we can see what’s inside their designs.”

It’s safe to say then that we have indeed been lucky about getting other SoC GPU vendors to open up about their architectures. ARM’s decision to come take a seat at the “open architecture” table has given us a great opportunity to see into the heart of another SoC GPU and to better understand and appreciate just what’s going on under the hood when we look at Mali powered products. Plus, with ARM opening up on its GPU architecture, we have been given the chance to see what just may be the least conventional GPU of the modern era.

When ARM first began to brief me on the Midgard architecture, they told me that it would be something unlike anything else we’ve seen before, and while I believed them I don’t believe that description is quite strong enough to get across just how surprised I was by ARM’s autonomous, TLP insensitive shader design. It took the better part of a few days even after the briefing to really internalize just what they had done, and while it seems simple (and very cool) in retrospect, going for an unorthodox architecture certainly throws you for a loop at first after spending several years covering the world of wavefront-driven architectures.

As for Midgard and its resulting products, this stands to be an interesting and exciting time for ARM. The finalization of OpenGL ES 3.1 and the announcement of the Android Extension Pack means that some of the functionality that ARM has had to sit on thus far is finally going to be exposed and used. And meanwhile with 64bit Android coming up and ARM’s 64bit Cortex-A5x processors similarly near, ARM can begin exploiting some of that shared 64bit development that ARMv8 and Midgard went through.

At the same time however ARM will face the same struggle for market share that the other SoC GPU vendors face. As we’ve discussed in the past, the SoC GPU market is full of competitors, some who make their own SoCs and hence won’t be ARM GPU customers, and others who are in the licensing business just as much as ARM. With the latest generation Mali-T700 series parts ARM already has some T760 wins with MediaTek, who will be using T760 with their mid-range Cortex-A53 SoCs. But I’d also love to see what a flagship-caliber device would look like with a T760, so hopefully we’ll get that chance over the next year.

This incidentally is all the more reason to be open right now, as it’s that much easier to convince your immediate customers and even build a brand among end users when they can freely learn more about your products and how they operate. To that end the “open architecture” table remains open, and as we shift to the next generation of SoCs and next wave of SoC GPUs, with any luck this won’t be the last time we get to learn more about the GPUs that are increasingly in our everyday devices.

66 Comments

  • LemmingOverlord - Thursday, July 3, 2014 - link

    Quick suggestion: considering Adreno is one of the most widespread GPU architectures for mobile, could you edit the table in the last page to include Adreno 3xx/4xx GPUs?

    Thanks!
  • Anand Lal Shimpi - Tuesday, July 8, 2014 - link

    Unfortunately Qualcomm refuses to disclose much detail about their GPU architectures. I completely disagree with their position and have worked on Qualcomm for years to get them to open up but at this point it's a meaningless effort.
  • da_asmodai - Thursday, July 3, 2014 - link

    How about adding Qualcomm's Adreno 420 to the comparison?
  • Anand Lal Shimpi - Thursday, July 3, 2014 - link

    I wish we could - Qualcomm refuses to disclose any deeper architectural details about any modern Adreno GPU architectures.
  • Krysto - Thursday, July 3, 2014 - link

    Their loss. Plus, neither Adreno 420 nor their upcoming CPU's look that interesting or competitive anyway. Adreno 420 should still give only about HALF the performance of Tegra K1's GPU.
  • ChefJeff789 - Thursday, July 3, 2014 - link

    Really? That's disappointing... I'm really looking forward to a time when ARM, nVidia, and AMD all compete on an architectural level in their GPUs, if it ever comes. The one-horse race with Intel in the desktop CPU space has been pretty lackluster for the past few years, in terms of performance increases. nVidia's Maxwell architecture seems pretty amazing in terms of efficiency, and I'm not yet convinced AMD will be able to compete. They have yet to impress with their APU and mobile processor efficiencies.
  • frostyfiredude - Thursday, July 3, 2014 - link

    Important to note that NVidia's TK1 will be achieving that double GFLOPS performance of the Adreno 420 at a clock speed of around 950MHz. At that performance level the TDP is listed at <10W, so it's not exactly comparable to the S805 and Adreno 420 which target a TDP half as high. What I can see happening is the TK1 being able to stretch its legs and thus being superior in large tablets but being too thermally crippled in phones and small tablets to reach those levels. Based on the previews I found, Adreno is more efficient in its shader resource usage, closing that further.
  • lmcd - Thursday, July 3, 2014 - link

    That's actually pretty bad math there -- if the TK1 achieves double perf at double power, it should achieve the same perf at 1/4 power (well, not quite since it isn't as simple as the basic E&M I learned, but yeah).

    And by your logic still, why would the K1 not fit in phones and tablets even as the 420 manages?
  • tuxRoller - Friday, July 4, 2014 - link

    Power is linear to f, but squares with V. I don't know that we can say that at half the f you can halve the V. Actually, that's almost certainly not the case, as it's not the case with any common processor tmk.
  • tuxRoller - Thursday, July 3, 2014 - link

    The Adreno 420 provides around 220 GFLOPS. The 430 will then be over 300 GFLOPS. These are not counting changes in clock speed that could raise or lower performance.
