ARM’s Mali Midgard Architecture Explored

by Ryan Smith on July 3, 2014 11:00 AM EST

The Midgard Architecture

Diving into the Mali architecture, we’ll start with a high level overview. What we’re looking at here is a single Midgard shader core, which despite the “shader” name actually contains a whole lot more. A shader core in this context contains the actual shader hardware, in the form of one of Midgard’s “tri pipe” shader blocks, but also a triangle setup unit, rasterizer, Z & stencil hardware, a ROP/blender, tiling hardware, and a compute thread creator specifically for feeding the tri pipe with compute workloads.

Midgard – and Mali-T760 in particular in our case – is designed to scale up and down in performance and size on the basis of these shader cores. While they contain a large part of the hardware for a GPU, they are not on their own a complete GPU. They are instead designed to be laid down in multiples, and then fed by a collection of supporting hardware to make a complete GPU. This combination of shared hardware and scalable core counts allows Midgard designs to scale from a single core up to a huge 16 cores in the largest Mali-T760MP16 configuration.

The shared hardware in Midgard is primarily concerned with managing the interaction of the shader cores, followed by providing the L2 cache and all further memory interfaces for accessing main memory and/or the CPU cache. In the case of Mali-T760 there is 1 task management unit and memory management unit, but 2 sets of L2 cache and the AMBA interface that connects the GPU to the rest of the system.

Jumping back to the core itself, let’s talk about the tri pipe for a bit. We’ll discuss arithmetic and thread execution on the next page, but from a high level a tri pipe is composed of one or more ALU blocks, a load/store unit, and a texture unit. The inclusion of ALUs, load/stores, and texturing in a single block is fairly typical for a GPU. However the variable number of ALU blocks is uncommon. Many GPU designs maintain the same structure throughout and replicate what amounts to the tri pipe, rather than varying the number of units in the tri pipe itself.

As we’ve stated before, for our purposes we’re primarily looking at the Mali-T760. On the T760 ARM uses 2 ALU blocks per tri pipe, which is the most common configuration that you will see for Midgard. However ARM also has Midgard designs that have 1 ALU block or 4 ALU blocks per tri pipe, which is one of the reasons why GPUs such as T760, T720, and T678 can look so similar and yet behave so differently.

ARM Mali Midgard Arithmetic Pipeline Count (Per Core)
GPU	Arithmetic Pipelines
T628	2
T678	4
T720	1
T760	2

Without being fully exhaustive, among various Midgard designs T628 and T760 are 2 ALU designs, while T720 is a 1 ALU design, and T678 is a 4 ALU design.

As one would expect, the different number of arithmetic pipelines per tri pipe has a knock-on effect on performance in all aspects, due to the changing ratio between the number of arithmetic pipelines and the number of load/store units and texture units. T678, for example, would be fairly shader-heavy, whereas the 2 ALU designs strike a closer balance between shading and texturing. Among the various Midgard designs ARM has experimented with several configurations, and with the T700 series they have settled on 2 ALU designs for the high-end T760 and 1 ALU for the mid-range T720 (although ARM likes to point out that T720 has some further optimizations just for this 1 ALU configuration).
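To put that changing ratio into numbers, here is a minimal sketch that uses the pipeline counts from the table above along with the one texture unit and one load/store unit per tri pipe described earlier. The function and constant names are our own, purely for illustration.

```python
# Arithmetic pipelines per shader core, taken from the table in the article.
ALU_PIPES_PER_CORE = {"T628": 2, "T678": 4, "T720": 1, "T760": 2}

# The article describes a single texture unit per tri pipe.
TEXTURE_UNITS_PER_CORE = 1


def alu_to_texture_ratio(gpu: str) -> float:
    """Return arithmetic pipelines per texture unit for one shader core."""
    return ALU_PIPES_PER_CORE[gpu] / TEXTURE_UNITS_PER_CORE


for gpu in sorted(ALU_PIPES_PER_CORE):
    print(f"{gpu}: {alu_to_texture_ratio(gpu):.0f} ALU pipes per texture unit")
```

The same ratio applies to the load/store unit, since there is also one of those per tri pipe.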

In any case, each arithmetic pipeline is independent, working on its own set of threads. This is unlike some superscalar architectures, where additional arithmetic pipelines would be used for the execution of further non-blocking instructions in a thread.

The texturing unit meanwhile is fairly simple. Midgard can process 1 bilinear filtered texel per clock, or 1 trilinear filtered texel over 2 clocks (to account for the second mipmap access). Note that the number of texture units and the number of ROPs per shader core is identical in the case of our T760, so in any core configuration there is a 1:1 ratio between texels and pixels.
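As a quick worked example of those rates, the sketch below computes peak texel throughput from the core count and clockspeed. The 8 core, 600MHz configuration is a hypothetical one picked for illustration and not a specific shipping part, and the function name is our own.

```python
def texels_per_second(cores: int, clock_hz: float, trilinear: bool = False) -> float:
    """Peak filtered texels per second for a Midgard GPU.

    Each shader core has one texture unit that produces 1 bilinear texel
    per clock, or 1 trilinear texel every 2 clocks.
    """
    texels_per_clock = 0.5 if trilinear else 1.0
    return cores * clock_hz * texels_per_clock


# Hypothetical configuration: 8 cores at 600 MHz (values chosen for illustration).
cores, clock_hz = 8, 600e6
print("bilinear Gtexels/s:", texels_per_second(cores, clock_hz) / 1e9)
print("trilinear Gtexels/s:", texels_per_second(cores, clock_hz, trilinear=True) / 1e9)
```

Given the 1:1 ratio between texture units and ROPs on T760, the bilinear figure doubles as the peak pixel rate under the same assumptions.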

Meanwhile, on one final stop while we’re looking at the functional blocks of Midgard, let’s talk about what we don’t see: a tessellation block. As we discussed before Midgard will support tessellation – T760 even supports Direct3D FL 11_1 – but nowhere will you see a tessellation block in the Midgard architecture.

As it turns out this isn’t an oversight on ARM’s part, nor has tessellation itself been left out, but rather it is an intentional design choice for Midgard. Midgard is fully capable of tessellation, it just does so without a fixed function unit to handle the geometry creation process.

Overview: Fixed Function Tessellator

In lieu of that Midgard implements tessellation on its shader hardware. Not just the hull and domain shader stages, mind you (which are done on shader hardware anyhow), but the actual geometry generation takes place there too. This is without a doubt a highly unusual situation, as we have never seen a tessellation-capable GPU without a dedicated tessellation unit thus far, making Midgard the first (and so far only) GPU we know of with such a design.

In discussing this design decision, ARM has told us that at the end of the day this choice is based on balancing the use of limited die space with benefits of a dedicated tessellation unit, and just how well the rest of the GPU can consume tessellated polygons. As it turns out the Midgard shader engine can tessellate at a higher rate than the rest of the pipeline can consume the additional polygons, meaning that a dedicated tessellation unit in the case of Midgard would not improve tessellation performance since it would be bottlenecked by the rest of the pipeline just as well. Furthermore any die space that ARM did dedicate to a tessellation unit would be die space they couldn’t dedicate to something else, such as additional shader hardware.
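ARM’s argument boils down to a pipeline running no faster than its slowest stage. The sketch below illustrates that reasoning only; the stage names and rates are entirely made up and are not ARM figures.

```python
def sustained_polygon_rate(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its rate; a pipeline runs no faster than that."""
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Made-up rates in millions of polygons per second, for illustration only.
shader_tessellation = {"tessellation": 400.0, "setup_and_raster": 250.0}
fixed_function_tessellation = {"tessellation": 1000.0, "setup_and_raster": 250.0}

print(sustained_polygon_rate(shader_tessellation))
print(sustained_polygon_rate(fixed_function_tessellation))
```

With these invented numbers both configurations are limited by the same downstream stage, which is the situation ARM describes: a faster tessellator would not raise the sustained rate.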

Ultimately whether this was the right decision or not is only something ARM can know – only they would have the simulation data for hardware tessellation versus shader tessellation – but this single omission already makes Midgard a highly unusual (and not to mention unique) GPU architecture. But from an outsider’s perspective we are left wondering what the power cost of this decision is. There’s no reason to doubt ARM’s space/performance argument, but as traditional GPU logic goes, doing anything in flexible hardware is going to be more power intensive than a highly specialized fixed function unit, which is why we have so many of them in SoC GPUs and desktop GPUs alike. Performance aside, is shader tessellation significantly more power intensive than a theoretical hardware tessellation Midgard GPU? And if so, what does that do for power bottlenecking?

Moving on from our look at the fixed function units of the shader core to a broader execution picture, when it comes to graphics and like many other SoC-class GPUs, Midgard is a tile based architecture. This in a nutshell involves the GPU (specifically the tiler) breaking up scenes into individual tiles to be processed on a tile-by-tile basis, rather than rendering the scene at once, generally on a per-polygon basis. By breaking up a scene into tiles and then working on them one piece at a time, this allows Midgard to keep a constrained portion of the frame as the active workload, and in turn keep that tile in local memory on the GPU, rather than having to go to the slower main memory.
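To make the idea of tiling concrete, here is a minimal sketch of the binning step, where each triangle is assigned to every tile its bounding box touches. This is a generic illustration rather than Midgard’s actual tiler; the 16 pixel tile size and all of the names are assumptions on our part.

```python
from collections import defaultdict

TILE_SIZE = 16  # pixels per tile edge; an assumed value for illustration


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile_size)
        ty0 = max(0, int(min(ys)) // tile_size)
        tx1 = min(max_tx, int(max(xs)) // tile_size)
        ty1 = min(max_ty, int(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


# Two example triangles on a 64x64 render target.
triangles = [[(2, 2), (10, 3), (5, 12)], [(10, 10), (40, 12), (20, 30)]]
bins = bin_triangles(triangles, 64, 64)
for tile in sorted(bins):
    print(tile, bins[tile])
```

Once the bins are built, each tile can be rendered using only its own triangle list, which is what allows the working set to stay in local memory.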

SoCs are ultimately constrained in every way possible – power, die size, and memory bandwidth – but it’s the last item more than anything else that can impact performance, making effective memory bandwidth usage especially important. Main memory accesses not only eat up valuable and limited memory bandwidth, but require burning further power to do so.

For Midgard ARM uses what they call a Hierarchical Tiling system, which as the name implies implements tiles on a hierarchical basis. The use of a hierarchy in this case allows Midgard to use variable tile sizes, based on the idea of further decomposing tiles (going down the hierarchy) until the tile complexity is at the desired size (or they’ve reached the minimum tile complexity). This ultimately allows Midgard to only use small tiles where it’s necessary, and conserve resources by working on large tiles where there is little scene complexity.
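ARM has not detailed the exact algorithm, but the general idea of decomposing tiles until their complexity is manageable can be sketched as a simple recursive subdivision. Everything here is an assumption made for illustration: the primitive-count threshold, the 16 pixel minimum tile size, and the use of square, power-of-two regions.

```python
def overlaps(box, region):
    """True if two axis-aligned boxes (x0, y0, x1, y1) with exclusive upper bounds intersect."""
    return (box[0] < region[2] and box[2] > region[0]
            and box[1] < region[3] and box[3] > region[1])


def subdivide(region, boxes, candidates=None, max_prims=2, min_size=16):
    """Return a list of (region, primitive_indices) leaf tiles.

    A square region is split into four quadrants while it holds more than
    max_prims primitives and is still larger than min_size pixels per edge.
    """
    if candidates is None:
        candidates = range(len(boxes))
    x0, y0, x1, y1 = region
    inside = [i for i in candidates if overlaps(boxes[i], region)]
    if len(inside) <= max_prims or (x1 - x0) <= min_size:
        return [(region, inside)]
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    quadrants = ((x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1))
    leaves = []
    for quadrant in quadrants:
        leaves.extend(subdivide(quadrant, boxes, inside, max_prims, min_size))
    return leaves


# Three small primitives (as bounding boxes) clustered in one corner of a 64x64 region.
boxes = [(1, 1, 5, 5), (6, 2, 10, 8), (3, 9, 9, 14)]
leaves = subdivide((0, 0, 64, 64), boxes)
for region, prims in leaves:
    print(region, prims)
```

In this example only the busy corner is broken down to the smallest tiles, while the empty areas remain as large tiles, which is the resource saving described above.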

Making the best use of tiling does require some cooperation on the part of OSes and programs – Windows actually has a driver flag to indicate whether a GPU is a tiling GPU – but it offers some very important and necessary memory bandwidth optimizations for SoC-class GPUs. Making use of tiling just ultimately requires managing scene complexity to keep the tiling system and the memory footprint of the tiles under control, which is another situation where Hierarchical Tiling flexes its muscles.

On the subject of tiling and bandwidth efficiency, Midgard also further cuts down on work and bandwidth wastage by rejecting work as early and as frequently as it can. The Early Z unit in each shader core can identify some polygons that would otherwise be covered and not visible, ejecting them before they make it to shading and texturing. And when that’s still not enough, Midgard (as of T620 and later) implements a technique called Forward Pixel Kill to further reject pixels. Forward Pixel Kill is at a basic level to pixels what Early Z is to polygons, identifying and rejecting covered pixels early enough to keep them from rendering. By keeping a small FIFO buffer and watching what pixels are being colored, Midgard can reject threads of pixels in mid-flight if it determines that the location they’re to be drawn is now covered by an opaque pixel.
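The FIFO behavior can be sketched with a small simulation. This is a simplified model of the idea and not ARM’s implementation: we assume fragments arrive in draw order having already passed the early depth test, so a later opaque fragment covers any earlier one at the same location. The `Fragment` type, the function name, and the FIFO depth of 4 are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Fragment:
    ident: int
    x: int
    y: int
    opaque: bool
    killed: bool = False


def run_forward_pixel_kill(fragments, fifo_depth=4):
    """Return the ids of fragments that survive to be shaded.

    Fragments are assumed to arrive in draw order and to have already
    passed the early depth test, so a later opaque fragment covers any
    earlier fragment at the same location.
    """
    fifo = deque()
    shaded = []

    def retire_oldest():
        oldest = fifo.popleft()
        if not oldest.killed:
            shaded.append(oldest.ident)

    for frag in fragments:
        if frag.opaque:
            # Kill queued fragments that this opaque fragment now covers.
            for pending in fifo:
                if (pending.x, pending.y) == (frag.x, frag.y):
                    pending.killed = True
        fifo.append(frag)
        if len(fifo) > fifo_depth:
            retire_oldest()

    while fifo:
        retire_oldest()
    return shaded


fragments = [
    Fragment(0, 5, 5, opaque=True),
    Fragment(1, 6, 5, opaque=True),
    Fragment(2, 5, 5, opaque=True),   # covers fragment 0 while it is still queued
    Fragment(3, 6, 5, opaque=False),  # not opaque, so fragment 1 must still be shaded
]
print(run_forward_pixel_kill(fragments))
```

Note that in this model a fragment can only be rejected while it is still in the FIFO, so a shallower buffer catches fewer covered pixels.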

Moving on, let’s talk about an area where ARM’s CPU background has come into play for their GPU designs: bitness and memory access. Developed alongside Midgard was ARM’s ARMv8 (aka ARM64) architecture. Just as one of the biggest design goals of ARMv8 was to extend it to a 64bit architecture, the same choice was made for Midgard. As a result Midgard is fully 64bit; it can process 64bit integers, double precision floating point numbers (FP64), and even natively use 64bit memory addresses. ARM wanted Midgard to be just as 64bit ready as ARMv8 would be.

The presence of 64bit operations is in and of itself fairly rare for an SoC GPU – 64bit operations are important for certain classes of compute workloads, but are seldom-to-never used for graphics operations – however ARM’s plans extend beyond just computational precision. Because Midgard can use 64bit memory addresses, Midgard can use the same 64bit address space as ARMv8 CPUs. This ability allows Midgard to achieve I/O coherency with ARMv8 CPUs, allowing it to read directly from the CPU’s cache. Even with this Midgard and ARMv8 are not fully cache coherent – this feature is mostly about letting the GPU gorge on the CPU cache to more efficiently pass data from the CPU to GPU – but full cache coherency is coming.

In fact, as important as ARM’s decisions have been in making Midgard’s shader cores compute capable, 64bit addressing and I/O coherency are equally important. ARM is after all a member of the Heterogeneous System Architecture (HSA) group, and has for some time now reiterated their intentions to back and invest in GPU computing both on a stand-alone basis and on a tightly integrated CPU/GPU basis. Midgard to that end is still in the stand-alone compute stage, but it incorporates optimizations that help performance right now while setting a clear roadmap towards an HSA-capable ARM CPU + GPU design. At the end of the day ARM needs to save power, and letting the GPU work on hardware-appropriate workloads is one such way to achieve this.

Finally, speaking of power let’s talk about Midgard’s power management capabilities. It would be disingenuous to merely state that Midgard is power optimized, as it implies that other SoC GPUs are somehow not power optimized when in fact everyone is just as badly power bound, but going forward power consumption is only going to get more and more important as energy efficiency improvements from new fab nodes are not what they once were.

All told then, Midgard implements a number of tricks to reduce power consumption. Every rendering optimization we’ve talked about thus far is a power optimization, but on top of that Midgard implements both clock gating and finely grained power gating to either shut off power usage (or at least reduce it) for every inactive block as the situation allows. Each shader core sits on its own power domain, allowing cores to be shut down in the case of light workloads.

Even within those shader cores, the various blocks can be clock gated to bring down their power consumption. Inside the Midgard arithmetic unit the scalar and vector units can be separately clock gated, right down to the individual SIMD lanes of the vector unit itself.

Ultimately there’s only so much ARM can do to reduce power usage when a block is needed. In those cases ARM relies on achieving a balance between (relatively power hungry) clockspeeds and wide GPUs, and further utilizing workload rejection technologies to do what they can (and we’ll look at a couple more of those in a moment). But in lieu of the ability to further reduce power consumption under load, ARM is giving Midgard’s blocks a chance to rest whenever they can.

66 Comments

seanlumly - Thursday, July 3, 2014 - link
Given the "exotic" ILP and the 128-bit VLIW SIMD, the Mali looks like an impressive performer. If a Mali T760MP10 is indeed a part fit for smartphone-level power consumption, then a quick linear-scaling -- given GFXBench scores of a T760MP4 -- would imply that such a GPU is very competitive with something like the Adreno 420 and certainly impressive if scaled up further to a tablet-level power consumption. If, however, an MP10 consumes roughly as much as a K1 or GX6650, then I'm very sceptical about its competitive performance.

I find the distribution of ALUs to memory units strange given mobile bandwidth limits. The Mali T760 in an 8-core configuration clocked at 600MHz will allow for 19.2 GB/s of load/store access (256-bits/clock: source ARM). This is quite high memory bandwidth, and an increase in GPU clocks or cores will likely yield idling load/store units doing nothing but taking up valuable die area. Operations with high varying access, cache reads, and tile read/writes will of course make good use of these additional units, but it still seems like overkill on all but very memory-access heavy apps. ARM would know best, though I'm suspicious that so many load/store units are needed for common workloads. I would guess that in common scenarios, bandwidth to main memory would be exhausted long before all of the memory units were fully utilized on a Mali T760 of high core count (eg. MP12-16).

Decoupling the load/store and texture units into their own "core" may allow more appropriate scaling to fit the bandwidth of the target system. A system with an ultra-high resolution could be endowed with more load/store and/or texture units. A system with a lower resolution could use fewer and opt for more ALUs in the same space. This would be similar to big.LITTLE (different cores for different targeted workloads). In this scenario, the memory unit cores could be scaled independently of the ALU cores, perfectly tailored to the target system.
EdvardS - Friday, July 4, 2014 - link
Remember that there is a cache system between those units and the SoC memory controller. Bandwidth to the cores is quite different from bandwidth to the DDR memory.
seanlumly - Friday, July 4, 2014 - link
Indeed! Bandwidth to cache, tile memory, varying data, or textures would likely be more relevant with a tile-based renderer that often exploits spatial locality when processing batches of pixels. This is especially true with modern screen-space effects that do multiple dependent reads per display pixel (eg. SSAO), but are strongly confined to buffer fragments surrounding the target pixel. Such situations would value having many LS/Tex units at little penalty.

But I do still like the idea of an independently scalable "memory" core (containing load/store/texture pipes) to complement a "math" core (containing ALUs). A high-performance system targeting a 720p display will likely consume far less bandwidth than one targeting a 4K display, and as such, it would be nice to trade LS/Tex units for more ALUs in such a case.

Such an arrangement may also give ARM more leeway when making predictions about a new architecture -- no doubt the Midgard arch was in development many years before it saw implementation in a retail product, which means that ARM would have had to guess trends (eg. resolution) far in advance to attain the right balance of on-chip units per core; independently scalable memory-cores would be more forgiving if the trends turned out not to match the initial predictions.
seanlumly - Friday, July 4, 2014 - link
Actually, I am starting to understand the motivation behind the ratio of ALUs to Memory units in a Midgard core. I notice in GFXBench 3.0 "Manhattan", that the Mali T760MP4 (Rockchip rk3288) performs incredibly well at 720p, but its performance drops off more than proportionately as the resolution increases. This may imply that in these higher than 720p scenarios, the 4-core variant of the GPU may not be able to keep up with the memory demands, as computation should scale very close to proportionally.

Thus the 1080p offscreen score for the Mali T760 MP4 in the GFXBench 3.0 (offline) database may be misleading as the MP4 may be a bit small for this resolution, and thus the performance may be low relative to its competition. A T760 MP8 would likely more than double the performance for a doubling of the resolution, pushing something like the Mali T760 MP8 well beyond the competition, at what I suspect are similar levels of power and die-size. I predict that a T760 MP8 would get slightly north of 16fps in GFXBench 3.0 (assuming adequate bandwidth to DDR). Even an MP6 variant of the GPU (as was the case with the T628 MP6 in the Exynos 5420) should put it more-or-less on-par with the competition!

The Mali performs even better in the GFXBench 2.7 "T-Rex" test, where a small 4-core Mali T760 MP4 surpasses the competition at 720p and even sub-720p resolutions in some instances! This is incredible. In this case, it seems that the test is more computation bound, as there is a more proportional scaling between performance and resolution.

I hope that future GPUs consider using the T760 in higher-core-count configurations. I still like the idea of a Memory-core, though I have little doubt that a Mali GPU of evenly matched size can go toe-to-toe with the competition.
Frenetic Pony - Friday, July 4, 2014 - link
Every time I read an overview of a SoC GPU I am so, so glad I don't do anything with mobile stuff. "We support tessellation! I mean, don't actually do it. Ever. But you know, it's supported."
kkb - Friday, July 4, 2014 - link
How come there is no comparison with intel GPUs like the ones in Baytrail?
darkich - Friday, July 4, 2014 - link
Because there is no comparison phrase.
That GPU is completely inferior compared to latest Mali, PowerVR and Adreno architectures
Krysto - Friday, July 4, 2014 - link
Word.
kkb - Monday, July 7, 2014 - link
well.. I don't really agree. Please look at the AT review from last week or so.. http://www.anandtech.com/show/8197/samsung-galaxy-...
MEMO pad is a baytrail product and definitely performs better than MALI devices.
darkich - Monday, July 7, 2014 - link
Get your facts and reading skills in order.

Firstly, the GPU in Memo Pad is definitely not definitely performing better than even the Mali T628, in fact those very tests show it trades blows with it, mostly due to much lower resolution screen.

Secondly, do you realize that the T760 is MUCH faster than T628?

You can see here that it is basically comparable to the Tegra K1 and even the intimidating Series 6XT doesn't trounce it.

Rest assured that any of these three, as well as the Adreno 420, is way above the ULP HD graphics chip
