The Mali G76 µarch - Fine Tuning It

Section by Ryan Smith

While the biggest change in the G76 is by far Arm's vastly wider cores, it's not the only change to come to the Bifrost architecture. The company has also undertaken a few smaller changes to further optimize the architecture and improve performance and efficiency.

First off, within their ALUs Arm has added support for Int8 dot products. These operations are becoming increasingly important in machine learning inference: the dot product is a core operation in processing neural networks, and despite its limited precision, Int8 is still accurate enough for basic inference in a number of cases. To be sure, even the original Bifrost already natively supported Int8 data types, including packing four of them into a single lane, but G76 becomes the first to be able to execute an Int8 dot product in a single cycle.
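Conceptually, the new operation collapses four 8-bit multiplies and their summation into one step. A minimal C sketch of the arithmetic involved (the function name and scalar loop are purely illustrative, not Arm's hardware interface):

```c
#include <stdint.h>

/* Illustrative only: the result a single-cycle Int8 dot product
 * unit would produce for one lane -- four int8 x int8 products
 * summed into a 32-bit accumulator. Real hardware does this in
 * one cycle per lane; this scalar loop just shows the math. */
int32_t dot4_i8(const int8_t a[4], const int8_t b[4], int32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

Since four Int8 values pack into one 32-bit lane, a unit like this lets each lane consume a packed operand pair per cycle instead of issuing four separate multiply-accumulates.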

As a result, Arm is touting 2.7x the machine learning performance of the previous generation. This will of course depend on the workload – particularly the framework and model used – so it's just a high-level approximation. But Arm is betting big on machine learning, so significantly speeding up GPU machine learning inference gives Arm's customers another option for efficiently processing these neural networks.

Meanwhile, in part as a consequence of the better scalability of Mali-G76’s core design, Arm has also taken a look at other aspects of GPU scalability to improve performance. Their research found that another potential scaling bottleneck is the tiler, which could block the rest of the GPU if it stalled during a polygon writeback. As a result, Arm has moved from an in-order writeback mechanism to an out-of-order writeback mechanism, allowing for polygons to be written back with more flexibility by bypassing those writeback stalls. Unfortunately Arm is being somewhat mum here on how this was implemented – generally changing an in-order process to out-of-order is not a simple task – so we haven’t been given much other information on the matter.
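Arm hasn't disclosed how the change was implemented, but the general idea behind replacing in-order writeback with out-of-order writeback can be sketched abstractly: instead of stalling whenever the oldest polygon isn't ready, any completed entry may retire. A toy model of the two policies (entirely our own construction, not Arm's design):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a writeback queue where each slot holds a polygon
 * that is either ready to be written back or still pending. */
typedef struct { int poly_id; bool ready; } Slot;

/* In-order: only the head entry may retire, so one stalled
 * polygon blocks everything behind it. Returns the retired
 * poly_id, or -1 on a stall. */
int retire_in_order(Slot *q, size_t n, size_t *head)
{
    if (*head < n && q[*head].ready)
        return q[(*head)++].poly_id;
    return -1; /* head not ready: the whole queue stalls */
}

/* Out-of-order: retire any ready entry, bypassing stalled ones.
 * Returns the retired poly_id, or -1 if nothing is ready. */
int retire_out_of_order(Slot *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (q[i].ready) {
            q[i].ready = false; /* mark retired */
            return q[i].poly_id;
        }
    }
    return -1;
}
```

The point of the toy model is the contrast: with the same queue state, the in-order policy stalls on an unready head while the out-of-order policy keeps draining completed work behind it.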

Arm has also made a subtle but important change to how its tile buffers can be used, in an effort to keep more traffic local to the GPU core. In certain cases, it's now possible for applications that run out of color tile buffer space to spill over into the depth tile buffer. Arm specifically cites workloads involving heavy use of multiple render targets without MSAA as driving this change; the lack of MSAA means that the depth tile buffer is only sparingly used, while the multiple render targets chew through the color tile buffer rather quickly. The net result is that it cuts down on the number of trips that need to be made to main memory, which is a rather expensive operation.
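The allocation decision can be pictured as a simple fallback: if a render pass exhausts the color tile buffer and the depth buffer has spare capacity (as in MRT-heavy, non-MSAA workloads), allocate from the depth buffer before resorting to main memory. A rough sketch with made-up names and policy, not Arm's actual allocator:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tile-buffer state; fields and policy are
 * illustrative only. */
typedef struct {
    size_t color_used, color_cap;
    size_t depth_used, depth_cap;
} TileBuffers;

typedef enum { ALLOC_COLOR, ALLOC_DEPTH_SPILL, ALLOC_MAIN_MEMORY } AllocTarget;

/* Decide where a new render-target allocation of `bytes` lands. */
AllocTarget place_render_target(TileBuffers *tb, size_t bytes, bool msaa)
{
    if (tb->color_used + bytes <= tb->color_cap) {
        tb->color_used += bytes;
        return ALLOC_COLOR;
    }
    /* Without MSAA the depth tile buffer is lightly used, so
     * spill into its spare capacity and stay on-chip. */
    if (!msaa && tb->depth_used + bytes <= tb->depth_cap) {
        tb->depth_used += bytes;
        return ALLOC_DEPTH_SPILL;
    }
    return ALLOC_MAIN_MEMORY; /* the expensive off-chip trip */
}
```

Each allocation that lands in `ALLOC_DEPTH_SPILL` rather than `ALLOC_MAIN_MEMORY` is a round trip to DRAM avoided.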

Speaking of spilling, G76’s thread local storage mechanism has also been optimized in how it handles register spills. The GPU now attempts to group data chunks from spills together so that they can be more easily fetched in the future. This is as opposed to earlier GPUs, where register spills were scattered based on which SIMD lane the data ultimately belonged to.
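The layout difference can be illustrated with spill-slot addressing: in a lane-scattered layout, one thread's consecutive spilled registers land a full SIMD-width apart, while a grouped layout keeps them contiguous so a later refill is one dense fetch. A simplified address calculation (both layouts and the SIMD width are our own illustration, not G76's actual scheme):

```c
#include <stddef.h>

#define LANES 8  /* illustrative SIMD width, not G76's actual width */

/* Lane-scattered layout: consecutive spilled registers of one
 * lane end up LANES slots apart, so refilling several registers
 * touches many strided locations. */
size_t spill_addr_scattered(size_t reg, size_t lane)
{
    return reg * LANES + lane;
}

/* Grouped layout: one lane's spilled registers are contiguous,
 * so refilling several registers is a single dense fetch. */
size_t spill_addr_grouped(size_t reg, size_t lane, size_t regs_per_thread)
{
    return lane * regs_per_thread + reg;
}
```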

Comments

  • levizx - Friday, June 1, 2018 - link

    " the size of a wavefront is typically a defining feature of an architecture. For long-lived architectures, especially in the PC space, wavefront sizes haven’t changed for years.."

    That's self-contradictory, if something stays the same across years of different μarch, it's by definition NOT a defining feature.
  • levizx - Friday, June 1, 2018 - link

    "Arm is touting a 2.7x increase in machine learning performance"

    No they are not. They are claiming 2.7x the performance, 1.7x increase.
  • Quantumz0d - Friday, June 1, 2018 - link

    I remember how bad the S8s Exynos GPU was, plus older Kirin SoCs power guzzlers. If it was delivering performance that would be still okay but in this age of slim era glass backed phones Multicore configurations will end up throttling. Still a progress is welcomed.
  • newblar - Monday, June 4, 2018 - link

    I always wondered why ARM didn't just buy imagination technologies on the cheap so they could get their GPU tech.
