ARM’s Mali Midgard Architecture Explored

Author: Ryan Smith

by Ryan Smith on July 3, 2014 11:00 AM EST


Technical Comparisons

As has quickly become tradition for us, to close out our look at the Midgard architecture we want to spend a bit of time comparing it to other SoC GPU architectures. As this is not a performance or benchmark article we aren’t going to dwell on the subject too much, but we find it’s helpful to get a high level overview of theoretical performance.

To do this we’ll take a quick look at theoretical performance for FP32 FLOPS, along with pixel and texel throughput. As this is a purely theoretical comparison it doesn’t (and can’t) take into account architectural efficiency, nor can it take into account real-world clockspeeds. But nonetheless it gives us something of a baseline.

To that end we asked ARM what a reasonable high-end Mali-T760 configuration might look like. T760 can scale up to 16 shader cores, but as we’ve seen in these scalable designs it’s very rare for anyone to build a SoC that actually takes the number of cores up to the architecture’s limit. And since T760 was only released to customers back in October of 2013, there are only a handful of designs announced so far and none of them are particularly high-end. ARM suggested that a Mali-T760 MP10 would be a reasonable approximation of a high-end shipping configuration, so that is what we’ve gone with.

GPU Specification Comparison
	NVIDIA K1	Imagination PVR GX6650	ARM Mali-T760 MP10	AMD A4-1350
FP32 ALUs	192	192	100	128
FP32 FLOPs/Clock	384	384	200 (340)	256
Pixels/Clock (ROPs)	4	12	10	4
Texels/Clock	8	12	10	8
GFLOPS @ 300MHz	115.2 GFLOPS	115.2 GFLOPS	60 (102) GFLOPS	76.8 GFLOPS
Architecture	Kepler	Rogue (6XT)	Midgard (T700)	GCN 1.1

Briefly, we can see that as far as theoretical shading performance is concerned, our theoretical Mali-T760 MP10 would push 60 GFLOPS at 300MHz when counting MADs (20 FLOPS/clock/core). Or when using ARM’s preferred metric of MAD plus a dot product (34 FLOPS/clock/core) this becomes 102 GFLOPS. How you count ends up being important here, as it means the theoretical throughput of the T760 MP10 is either close to something like AMD’s A4-1350, or close to the very high-end configurations that NVIDIA and Imagination will be peddling.
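
For anyone who wants to check the arithmetic, the GFLOPS figures in the table can be reproduced with a short Python sketch. The function and variable names are our own for illustration, and the 300MHz clock is the normalized figure from the table rather than a shipping clockspeed.

```python
# Peak theoretical FP32 throughput: cores x FLOPs/clock/core x clock.
# Names are illustrative only; 300MHz is the table's normalized clock,
# not a real-world clockspeed.

def peak_gflops(cores: int, flops_per_clock_per_core: int, clock_mhz: float) -> float:
    """Return peak GFLOPS for a GPU at the given clockspeed."""
    flops_per_clock = cores * flops_per_clock_per_core
    # FLOPs/clock x MHz gives MFLOPS; divide by 1000 for GFLOPS.
    return flops_per_clock * clock_mhz / 1000.0

CLOCK_MHZ = 300

# Mali-T760 MP10: 10 shader cores, counted two different ways.
t760_mad_only = peak_gflops(10, 20, CLOCK_MHZ)    # MADs only
t760_arm_metric = peak_gflops(10, 34, CLOCK_MHZ)  # MAD plus dot product

print(f"T760 MP10, MAD only:          {t760_mad_only:.1f} GFLOPS")
print(f"T760 MP10, MAD + dot product: {t760_arm_metric:.1f} GFLOPS")
```

The same function covers the other columns if the whole GPU is treated as a single "core" with the table's FLOPs/clock figure, e.g. `peak_gflops(1, 384, CLOCK_MHZ)` for the K1 column.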

On the other hand, the T760 MP10’s pixel and texel throughput looks very good. At 10 pixels and 10 texels per clock it exceeds both our AMD and NVIDIA configurations on both counts, and it more than doubles their pixel throughput. Pixel throughput is going to be especially important going forward as these SoCs get paired with increasingly high resolution displays. The TV industry has in recent years become a big SoC consumer and 4K TVs are growing in popularity, so being able to push a lot of pixels will be helpful for driving such displays. However ARM’s efficiency technologies such as Transaction Elimination and AFBC will also have to play a big part here, as writing that many pixels per clock raw would consume a large amount of memory bandwidth, something SoCs rarely have to spare.
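
To put a rough number on that bandwidth concern, here is a back-of-the-envelope Python sketch. It rests on assumptions of ours that are not ARM figures: 32-bit (4 byte) pixels, the table's normalized 300MHz clock, no compression or Transaction Elimination, and every pixel written exactly once. The function names are likewise hypothetical.

```python
# Rough estimate of uncompressed framebuffer write traffic.
# Assumptions (ours, not ARM's): 4 bytes per pixel, 300MHz clock,
# no AFBC or Transaction Elimination, no overdraw.

def raw_write_bandwidth_gbs(pixels_per_clock: int, clock_mhz: float,
                            bytes_per_pixel: int = 4) -> float:
    """Return GB/s needed to write every pixel the ROPs can produce."""
    pixels_per_second = pixels_per_clock * clock_mhz * 1e6
    return pixels_per_second * bytes_per_pixel / 1e9

def display_pixel_rate_mpix(width: int, height: int, refresh_hz: int) -> float:
    """Return the megapixels per second a display consumes at a refresh rate."""
    return width * height * refresh_hz / 1e6

t760_raw = raw_write_bandwidth_gbs(10, 300)      # T760 MP10: 10 pixels/clock
uhd_rate = display_pixel_rate_mpix(3840, 2160, 60)  # a 4K panel at 60Hz

print(f"T760 MP10 peak raw writes: {t760_raw:.1f} GB/s")
print(f"4K at 60Hz pixel rate:     {uhd_rate:.1f} Mpixels/s")
```

Under these assumptions the peak raw write rate comes out at 12 GB/s, which is the sort of figure that makes bandwidth-saving features matter on a mobile memory bus.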

Final Words

With apologies in advance to ARM, wrapping up this article the first thing that comes to mind is something we wrote when looking at Imagination’s Rogue architecture earlier this year: “So it’s with some hope and a bit of luck that this might get the ball rolling with the other SoC GPU vendors, getting them to open up their doors a bit more so that we can see what’s inside their designs.”

It’s safe to say then that we have indeed been lucky about getting other SoC GPU vendors to open up about their architectures. ARM’s decision to come take a seat at the “open architecture” table has given us a great opportunity to see into the heart of another SoC GPU and to better understand and appreciate just what’s going on under the hood when we look at Mali powered products. Plus, in opening up on their GPU architecture, ARM has given us the chance to see what just may be the least conventional GPU of the modern era.

When ARM first began to brief me on the Midgard architecture, they told me that it would be something unlike anything else we’ve seen before, and while I believed them I don’t believe that description is quite strong enough to get across just how surprised I was by ARM’s autonomous, TLP insensitive shader design. It took the better part of a few days even after the briefing to really internalize just what they had done, and while it seems simple (and very cool) in retrospect, going for an unorthodox architecture certainly throws you for a loop at first after spending several years covering the world of wavefront-driven architectures.

As for Midgard and its resulting products, this stands to be an interesting and exciting time for ARM. The finalization of OpenGL ES 3.1 and the announcement of the Android Extension Pack mean that some of the functionality that ARM has had to sit on thus far is finally going to be exposed and used. And meanwhile with 64-bit Android coming up and ARM’s 64-bit Cortex-A5x processors similarly near, ARM can begin exploiting some of that shared 64-bit development that ARMv8 and Midgard went through.

At the same time however ARM will face the same struggle for market share that the other SoC GPU vendors face. As we’ve discussed in the past, the SoC GPU market is full of competitors, some who make their own SoCs and hence won’t be ARM GPU customers, and others who are in the licensing business just as much as ARM. With the latest generation Mali-T700 series parts ARM already has some T760 wins with MediaTek, who will be using T760 with their mid-range Cortex-A53 SoCs. But I’d also love to see what a flagship-caliber device would look like with a T760, so hopefully we’ll get that chance over the next year.

This incidentally is all the more reason to be open right now, as it’s that much easier to convince your immediate customers and even build a brand among end users when they can freely learn more about your products and how they operate. To that end the “open architecture” table remains open, and as we shift to the next generation of SoCs and next wave of SoC GPUs, with any luck this won’t be the last time we get to learn more about the GPUs that are increasingly in our everyday devices.


66 Comments


seanlumly - Thursday, July 3, 2014 - link
Given the "exotic" ILP and the 128-bit VLIW SIMD, the Mali looks like an impressive performer. If a Mali T760MP10 is indeed a part fit for smartphone-level power consumption, then a quick linear-scaling -- given GFXBench scores of a T760MP4 -- would imply that such a GPU is very competitive with something like the Adreno 420 and certainly impressive if scaled up further to a tablet-level power consumption. If, however, an MP10 consumes roughly as much as a K1 or GX6650, then I'm very sceptical about its competitive performance.

I find the distribution of ALUs to memory units strange given mobile bandwidth limits. The Mali T760 in an 8-core configuration clocked at 600MHz will allow for 19.2 GB/s of load/store access (256 bits/clock; source: ARM). This is quite high memory bandwidth, and an increase in GPU clocks or cores will likely yield idle load/store units doing nothing but taking up valuable die area. Operations with high varying access, cache reads, and tile read/writes will of course make good use of these additional units, but it still seems like overkill on all but very memory-access heavy apps. ARM would know best, though I'm suspicious that so many load/store units are needed for common workloads. I would guess that in common scenarios, bandwidth to main memory would be exhausted long before all of the memory units were fully utilized on a Mali T760 of high core count (eg. MP12-16).

Decoupling the load/stores and texture units as their own "core" may allow more appropriate scaling to fit the bandwidth of the target system. A system with an ultra-high resolution, could be endowed with more load/stores and/or texture units. A system with a lower resolution could use less and opt for more ALUs in the same space. This would be similar to big.LITTLE (different cores for different targeted workloads). In this scenario, the memory unit cores could be scaled independently of the ALU cores, perfectly tailored to the target system.
EdvardS - Friday, July 4, 2014 - link
Remember that there is a cache system between those units and the SoC memory controller. Bandwidth to the cores is quite different from bandwidth to the DDR memory.
seanlumly - Friday, July 4, 2014 - link
Indeed! Bandwidth to cache, tile memory, varying data, or textures would likely be more relevant with a tile-based renderer that often exploits spatial locality when processing batches of pixels. This is especially true with modern screen-space effects that do multiple dependent reads per display pixel (eg. SSAO), but are strongly confined to buffer fragments surrounding the target pixel. Such situations would value having many LS/Tex units at little penalty.

But I do still like the idea of an independently scalable "memory" core (containing load/store/texture pipes) to complement a "math" core (containing ALUs). A high-performance system targeting a 720p display will likely consume far less bandwidth than one targeting a 4K display, and as such, it would be nice to trade LS/Tex units for more ALUs in such a case.

Such an arrangement may also enable ARM more leeway when making predictions about a new architecture -- no doubt the Midgard arch was in development many years before it saw implementation in a retail product, which means that ARM would have had to guess trends (eg. resolution) far in advance to attain the right balance of on-chip units per core; independently scalable memory-cores would be more forgiving if the trends turned out not to match the initial predictions.
seanlumly - Friday, July 4, 2014 - link
Actually, I am starting to understand the motivation behind the ratio of ALUs to Memory units in a Midgard core. I notice in GFXBench 3.0 "Manhattan", that the Mali T760MP4 (Rockchip rk3288) performs incredibly well at 720p, but its performance drops off more than proportionately as the resolution increases. This may imply that in these higher than 720p scenarios, the 4-core variant of the GPU may not be able to keep up with the memory demands, as computation should scale very close to proportionally.

Thus the 1080p offscreen score for the Mali T760 MP4 in the GFXBench 3.0 (offline) database may be misleading as the MP4 may be a bit small for this resolution, and thus the performance may be low relative to its competition. A T760 MP8 would likely more than double the performance for a doubling of the resolution, pushing something like the Mali T760 MP8 well beyond the competition, at what I suspect are similar levels of power and die-size. I predict that a T760 MP8 would get slightly north of 16fps in GFXBench 3.0 (assuming adequate bandwidth to DDR). Even an MP6 variant of the GPU (as was the case with the T628 MP6 in the Exynos 5420) should put it more-or-less on-par with the competition!

The Mali performs even better in the GFXBench 2.7 "T-Rex" test, where a small 4-core Mali T760 MP4 surpasses the competition at 720p and even sub-720p resolutions in some instances! This is incredible. In this case, it seems that the test is more computation bound, as there is a more proportional scaling between performance and resolution.

I hope that future GPUs consider using the T760 in higher-core-count configurations. I still like the idea of a Memory-core, though I have little doubt that a Mali GPU of evenly matched size can go toe-to-toe with the competition.
Frenetic Pony - Friday, July 4, 2014 - link
Every time I read an overview of a SoC GPU I am so, so glad I don't do anything with mobile stuff. "We support tessellation! I mean, don't actually do it. Ever. But you know, it's supported."
kkb - Friday, July 4, 2014 - link
How come there is no comparison with intel GPUs like the ones in Baytrail?
darkich - Friday, July 4, 2014 - link
Because there is no comparison phrase.
That GPU is completely inferior compared to latest Mali, PowerVR and Adreno architectures
Krysto - Friday, July 4, 2014 - link
Word.
kkb - Monday, July 7, 2014 - link
well.. I don't really agree. Please look at the AT review from last week or so.. http://www.anandtech.com/show/8197/samsung-galaxy-...
MEMO pad is a baytrail product and definitely performs better than MALI devices.
darkich - Monday, July 7, 2014 - link
Get your facts and reading skills in order.

Firstly, the GPU in Memo Pad is definitely not performing better than even the Mali T628, in fact those very tests show it trades blows with it, mostly due to much lower resolution screen.

Secondly, do you realize that the T760 is MUCH faster than T628?

You can see here that it is basically comparable to the Tegra K1 and even the intimidating Series 6XT doesn't trounce it.

Rest assured that any of these three, as well as the Adreno 420, is way above the ULP HD graphics chip
