A Brief History of Mali

ARM as a CPU designer of course needs no introduction. The vast majority of the world’s smartphones and tablets are powered either by an ARM-designed CPU or by a CPU based on ARM’s instruction sets. In the world of SoC CPUs, ARM is without a doubt the 800lb gorilla.

However in the world of SoC GPUs, while ARM is a major competitor, they are just one of several. In fact, from a technology perspective they’re among the newest entrants, with roots that go back over a decade but still not as far back as those of the other major competitors. This is something of a point of pride for ARM’s GPU team, and as we’ll see, it results in a GPU design that’s not quite like anything else we’ve seen so far.

Like all good GPU stories, the earliest history of ARM’s GPU division goes back to the late 1990s, when the group that would eventually become that division was first formed. Originally a research project at the Norwegian University of Science and Technology, the core Mali group was spun off to form Falanx Microsystems in 2001. At first Falanx was attempting to break into the PC video card market, a risky venture in the post-3dfx era that saw several other PC GPU manufacturers get shaken out – S3, Rendition, Revolution, and Imagination among them – and ultimately a venture that crashed and burned as Falanx was unable to raise funding.

During a period of development which they call their “scrappy phase,” Falanx, due to their limited resources, pivoted from designing PC GPUs to designing SoC-class GPUs and licensing those designs to SoC integrators, a much easier field to get into. From this change in direction came the first Mali GPUs, and ultimately Falanx’s first customers. Among these was Zoran, who used the Mali-55 for their Approach 5C SoC, which in turn ended up in a couple of notable products including LG’s Viewty. But with that said, Falanx never saw great success on their own, never landing the “big fish” they were hoping for.


LG's Viewty, One of Mali's First Wins

Ultimately, as the SoC industry began to heat up on the back of growing cell phone sales, ARM purchased Falanx in 2006. This gave ARM a capable (if previously underutilized) GPU division to create GPUs to go along with their growing CPU business. From a business perspective the two companies were a solid match: Falanx was already in the business of licensing GPUs, so for Falanx this was largely a continuation of the status quo. And for what was becoming ARM’s GPU division this was just as much of a net win as it was for ARM, since it gave the Mali team access to ARM’s engineering resources and validation, capabilities that were harder to come by as a struggling third-party GPU designer.

Now as a part of ARM, the Mali team released their first OpenGL ES 2.0 design in 2007, the Mali-200. Mali-200 and its immediate successors (Mali-300, Mali-400, and Mali-450) were based on the team’s Utgard architecture. Utgard was a non-unified GPU (discrete pixel and vertex shaders) designed for SoC-class graphics, and over the years it received various upgrades to improve performance and scalability, most notably with Mali-400, the first Mali product able to use multiple cores.

Even today Utgard is arguably the Mali team’s most successful GPU architecture, thanks in large part to its no-frills ES 2.0 design and the resulting low die space requirements. Along with driving high-end SoCs of its era, including those that powered devices such as the Samsung Galaxy S II, Utgard remains a popular mid-range GPU, with Mali-450 securing a spot just this week in Samsung’s forthcoming Galaxy S5 Mini.


Mali-450 Lives On In Samsung's Galaxy S5 Mini

Now with a team of nearly 500 people and having shipped 400 million Mali GPUs in 2013 (or as close to “shipping” as an IP licensor can get), the Mali team’s latest architecture and the subject of today’s article is Midgard. Midgard ships as the basis of ARM’s Mali-T600 and Mali-T700 family GPUs, and while it was initially introduced as a high-end architecture, it is transitioning to cover both the high end and the mid-range through the recent introduction of more space/power/cost optimized designs. After replacing Utgard at the high end last year, this year Midgard will finally replace Utgard in the mid-range market as well.

Comments

  • toyotabedzrock - Friday, July 4, 2014 - link

    Didn't Imagination also buy from ATI? Maybe that is why they are concerned with patents?
  • Kevin G - Saturday, July 5, 2014 - link

    There likely is a cross licensing agreement in place. Literally to build any modern GPU a company has to cross license patents. It works out to be zero sum in that money isn't exchanged but one company can use the other's patents (and future patents) royalty free.

    This also puts a rather high barrier to entry in the market.
  • nolaviz - Thursday, July 3, 2014 - link

    Ah, the history... I led the integration of MALI55 into Zoran's APPROACH-5C. Sweet memories :)
  • skiboysteve - Thursday, July 3, 2014 - link

    Very interesting. Complete 100% opposite of the AMD architecture.

    Also, the fact that they can power down shader cores and even individual ALUs makes this pure-ILP, zero-TLP architecture really shine, because the compiler only needs to optimize work at the ALU level, not at the shader or wavefront level. So then based upon demand they just feed more or fewer of these ALU-optimized packets of work into the GPU and power down the rest of the ALUs. It makes complete sense. Make each ALU independent, compile work on a per-ALU basis, then scale your ALU utilization at run time... and also scale your chip portfolio on ALU count. Perfect scalability in price, performance and power combined with straightforward driver work.

    Awesome!
  • rootheday - Thursday, July 3, 2014 - link

    Transaction Elimination might work for "render to texture", where the texture is then used later in the scene, but it doesn't make sense to me for the final render target that is shown on the screen. Typically your final render target is part of a swap chain with 2 or 3 different buffers, so the CRC computed for a tile in frame N would need to match the CRC for frame N-2 (or maybe N-3), not N-1.
  • EdvardS - Thursday, July 3, 2014 - link

    So each buffer has its own CRC buddy attached to it - problem solved :-) Longer buffer chains just reduce your temporal coherency a bit. (See the sketch below the comments.)
  • hexgrid - Thursday, July 3, 2014 - link

    Relying on a 32-bit hash for transaction elimination is a very bad idea. Assuming a uniform distribution of hash results, the birthday paradox means you've got a better than 50% chance of collision if you have more than ~5000 items being hashed. The compartmentalized comparisons (it looks like they only compare against the same tile) mean the collision rate will be lower, but there will be collisions, and in some cases they will look terrible. (The birthday-bound math is sketched below the comments.)

    If there's a glitch, there's an excellent chance it will persist for a while. For example, let's say I'm bringing up a "pause" menu. If the title overlay of the pause menu happens to hash to the same value as the background that it replaced, that one tile of the pause menu title won't appear. Next frame, it still won't appear, because again the hash value hasn't changed. Until something happens to actually change the title or make it disappear, the glitch will persist.

    The "fix" will be to do GUI overlays in translucency and keep the background animating somehow to prevent cache misses from sticking around, but it's an ugly hardware hack that will force software workarounds.
  • EdvardS - Thursday, July 3, 2014 - link

    What if the CRC is 64 bit for each small tile? Quite overkill and very hard to break.
  • mkozakewich - Sunday, July 6, 2014 - link

    If the pause screen fades in, wouldn't that change the image data enough that it would generate entirely different CRCs for each frame until the animation ended?

    At any rate, my rule of thumb would be that if it doesn't happen more often than a keyframe is dropped in a video, it'll be fine.
  • Krysto - Thursday, July 3, 2014 - link

    I'd like you guys to do more 30x GFX loops, to show whether these GPUs only put up high numbers at first but then quickly get throttled. From what I've noticed, Imagination is the only one that doesn't throttle in that test. But I'm curious to see if K1 and T760 do.
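
To make the transaction elimination discussion above a bit more concrete, here is a minimal sketch of the per-buffer signature bookkeeping EdvardS describes, written as plain C. This is purely illustrative on our part; the structures, the tile size, and the software CRC-32 routine are assumptions for the sake of the example, not ARM's actual implementation, which computes and compares signatures in hardware as tiles are written back.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TILE_PIXEL_BYTES (16 * 16 * 4)              /* one 16x16 RGBA tile */
    #define TILES_PER_FRAME  ((1280 / 16) * (720 / 16)) /* e.g. a 720p render target */
    #define SWAP_CHAIN_LEN   3                          /* triple buffering */

    /* One signature table per swap chain buffer, so a tile rendered in frame N is
       compared against the previous contents of the *same* buffer (frame N-3 here),
       never against a different buffer in the chain. */
    static uint32_t tile_crc[SWAP_CHAIN_LEN][TILES_PER_FRAME];

    /* Plain software CRC-32, standing in for whatever signature the hardware computes. */
    static uint32_t crc32_of_tile(const uint8_t *data, size_t size)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < size; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
        }
        return ~crc;
    }

    /* Returns 1 if the tile matches what is already stored in this buffer and the
       write-back can be skipped, or 0 if the signature is updated and the tile
       must be written out to memory. */
    static int tile_write_eliminated(int buffer, int tile, const uint8_t *pixels)
    {
        uint32_t crc = crc32_of_tile(pixels, TILE_PIXEL_BYTES);
        if (tile_crc[buffer][tile] == crc)
            return 1;
        tile_crc[buffer][tile] = crc;
        return 0;
    }

    int main(void)
    {
        static uint8_t pixels[TILE_PIXEL_BYTES];             /* an all-black tile */

        tile_write_eliminated(0, 0, pixels);                 /* frame N: tile written back */
        printf("%d\n", tile_write_eliminated(0, 0, pixels)); /* frame N+3, unchanged: 1 */
        pixels[0] = 0xFF;
        printf("%d\n", tile_write_eliminated(0, 0, pixels)); /* frame N+6, changed: 0 */
        return 0;
    }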
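And on the hash collision concern: as a back-of-the-envelope reference, the standard birthday-bound approximation for an ideal, uniformly distributed b-bit hash puts the probability of at least one collision among n hashed values at roughly 1 - exp(-n(n-1)/2^(b+1)). The short program below is our own illustration of that formula, not a statement about how often the real CRC collides in practice, especially since, as noted above, the comparison appears to be per tile position, where a stale tile only persists if its new contents happen to hash to exactly the old signature.

    #include <math.h>
    #include <stdio.h>

    /* Birthday-bound approximation: probability that at least two of n values
       collide, assuming an ideal uniformly distributed hash over 2^bits outputs. */
    static double collision_probability(double n, int bits)
    {
        double buckets = ldexp(1.0, bits);   /* 2^bits */
        return 1.0 - exp(-n * (n - 1.0) / (2.0 * buckets));
    }

    int main(void)
    {
        /* For a 32-bit hash the 50% crossover sits near n ~ 77,000 values;
           a few thousand tiles hashed per frame gives a far smaller per-frame
           chance, though over enough frames a collision eventually becomes likely. */
        printf("n =  5000: p = %.4f\n", collision_probability(5000.0, 32));
        printf("n = 77000: p = %.4f\n", collision_probability(77000.0, 32));
        return 0;
    }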
