The Mali G76 µarch - Fine Tuning It

Section by Ryan Smith

While the biggest change in the G76 is by far Arm’s vastly wider cores, it’s not the only change to come to the Bifrost architecture. The company has also undertaken a few smaller changes to further optimize the architecture and improve performance efficiency.

First off, within their ALUs Arm has added support for Int8 dot products. These operations are becoming increasingly important in machine learning inference: the dot product is a critical operation in processing neural networks, and despite its limited precision, Int8 is still deep enough for basic inference in a number of cases. To be sure, even the original Bifrost already natively supported Int8 data types, including packing 4 of them into a single lane, but G76 becomes the first to be able to use them in a dot product in a single cycle.
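To make the operation concrete, here is a minimal Python sketch of what a packed Int8 dot product computes: four signed 8-bit values are packed into one 32-bit word, and two such words are multiplied element-wise and summed into an accumulator. The function names and the byte order are assumptions chosen for illustration; this models the arithmetic only and says nothing about how Arm's ALUs or instruction set actually implement it.

```python
def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word, lowest byte first.

    The byte order is an assumption made for this sketch.
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for i, v in enumerate(values):
        if not -128 <= v <= 127:
            raise ValueError("value out of Int8 range")
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_int8x4(word):
    """Recover the four signed 8-bit integers from a packed 32-bit word."""
    out = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out


def dot4_int8(a_word, b_word, acc=0):
    """Multiply the four Int8 pairs, sum the products, and add the accumulator."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
result = dot4_int8(a, b)  # 1*5 + (-2)*6 + 3*(-7) + (-4)*8
```

Without a dedicated instruction, the four multiplies and three adds above would be issued as separate operations; the appeal of hardware support is folding them into one.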

As a result, Arm is touting a 2.7x increase in machine learning performance. This will of course depend on the workload – particularly the framework and model used – so it’s just a high-level approximation. But Arm is betting big on machine learning, so significantly speeding up GPU machine learning inference gives Arm’s customers another option for efficiently processing these neural networks.

Meanwhile, in part as a consequence of the better scalability of Mali-G76’s core design, Arm has also taken a look at other aspects of GPU scalability to improve performance. Their research found that another potential scaling bottleneck is the tiler, which could block the rest of the GPU if it stalled during a polygon writeback. As a result, Arm has moved from an in-order writeback mechanism to an out-of-order writeback mechanism, allowing for polygons to be written back with more flexibility by bypassing those writeback stalls. Unfortunately Arm is being somewhat mum here on how this was implemented – generally changing an in-order process to out-of-order is not a simple task – so we haven’t been given much other information on the matter.
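Since Arm has not described the mechanism, the following Python sketch is only a toy model of why in-order writeback can hurt. It assumes a single write port that retires one polygon per cycle and a hypothetical list of cycles at which each polygon becomes ready to write back; none of these parameters come from Arm.

```python
def in_order_finish_times(ready):
    """Writeback strictly in submission order: a late polygon blocks all that follow."""
    finish = []
    t = 0
    for r in ready:
        t = max(t, r) + 1  # wait for the previous writeback AND for this polygon
        finish.append(t)
    return finish


def out_of_order_finish_times(ready):
    """Writeback in readiness order: stalled polygons are bypassed by ready ones."""
    finish = [0] * len(ready)
    t = 0
    for r, i in sorted((r, i) for i, r in enumerate(ready)):
        t = max(t, r) + 1
        finish[i] = t
    return finish


# Hypothetical ready cycles: the second polygon stalls until cycle 10.
ready = [0, 10, 1, 2]
in_order = in_order_finish_times(ready)
out_of_order = out_of_order_finish_times(ready)
```

In this made-up trace the stalled polygon finishes at the same cycle either way, but the two polygons behind it no longer wait for it, and the whole batch completes sooner. A real tiler presumably has to do extra bookkeeping so that results remain correct when completion order differs from submission order, which is part of why such a change is not simple.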

Arm has also made a subtle but important change to how their tile buffers can be used in an effort to keep more traffic local to the GPU core. In certain cases, it’s now possible for applications that run out of color tile buffer space to spill over into the depth tile buffer. Arm is specifically citing workloads involving heavy use of multiple render targets without MSAA for driving this change; the lack of MSAA means that the depth tile buffer is used only sparingly, while the multiple render targets quickly chew through the color tile buffer. The net result of this is that it cuts down on the number of trips that need to be made to main memory, which is a rather expensive operation.
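The accounting behind this can be sketched in a few lines of Python. The buffer capacities and demands below are invented numbers in arbitrary units, and the function is a hypothetical model rather than a description of how the hardware decides; it simply shows how letting color data borrow unused depth buffer space reduces what has to leave the core.

```python
def offchip_units(color_needed, depth_needed, color_cap, depth_cap, allow_spill):
    """Return how much tile data does not fit on-chip under this toy model."""
    color_overflow = max(0, color_needed - color_cap)
    depth_overflow = max(0, depth_needed - depth_cap)
    if allow_spill:
        # Color data may borrow whatever the depth buffer is not using.
        spare_depth = max(0, depth_cap - depth_needed)
        color_overflow = max(0, color_overflow - spare_depth)
    return color_overflow + depth_overflow


# Hypothetical MRT-heavy, non-MSAA case: lots of color data, little depth data.
without_spill = offchip_units(6144, 1024, 4096, 4096, allow_spill=False)
with_spill = offchip_units(6144, 1024, 4096, 4096, allow_spill=True)
```

With these assumed figures the color overflow fits entirely in the idle part of the depth buffer, so nothing needs to go off-chip once spilling is allowed.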

Speaking of spilling, G76’s thread local storage mechanism has also been optimized for how it handles register spills. Now the GPU will attempt to group data chunks from spills together so that they can be more easily fetched in the future. This is as opposed to how earlier GPUs did it, where register spills were scattered based on which SIMD lane the data ultimately belonged to.
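One way to picture the difference is as two address layouts for spilled registers. The Python sketch below is an assumption-laden illustration: the 8 lanes, 16 spill slots per lane, 4-byte words, and 64-byte cache lines are all hypothetical values, and the two layout formulas are a guess at what "scattered by lane" and "grouped" could mean, not Arm's documented scheme.

```python
def lane_major_address(lane, slot, slots_per_lane, word_bytes=4):
    """Older-style layout (assumed): each lane owns its own contiguous spill region."""
    return (lane * slots_per_lane + slot) * word_bytes


def grouped_address(lane, slot, lanes, word_bytes=4):
    """Grouped layout (assumed): all lanes' copies of one spilled register sit together."""
    return (slot * lanes + lane) * word_bytes


def cache_lines_touched(addresses, line_bytes=64):
    """Count the distinct cache lines covered by a set of byte addresses."""
    return len({a // line_bytes for a in addresses})


LANES = 8            # hypothetical warp width
SLOTS_PER_LANE = 16  # hypothetical number of spill slots per lane
SLOT = 3             # reload this one spilled register for every lane

scattered = [lane_major_address(l, SLOT, SLOTS_PER_LANE) for l in range(LANES)]
grouped = [grouped_address(l, SLOT, LANES) for l in range(LANES)]

scattered_lines = cache_lines_touched(scattered)
grouped_lines = cache_lines_touched(grouped)
```

Under these assumptions, reloading one spilled register for the whole warp touches a separate cache line per lane in the scattered layout, but only a single line when the data is grouped.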


25 Comments

  • ET - Monday, June 4, 2018

    How 'significantly cheaper' would you expect such a card to be compared to a $70 discrete GPU?

    Based on the expected GFXBench score and further extrapolation, the G76MP20 could perform about the same as the 1030, and it's possible that it could work with slower RAM and save there, but still, I don't see how it could be a really successful or high margin product. There would be need for a complete product line reaching significantly higher performance to make this more than a curiosity.
  • eastcoast_pete - Monday, June 4, 2018

    I would really appreciate if you could provide a link to a vendor's site that lists a 1030 card for $ 70. The cheapest I have seen them was for ~ $ 120. If I can get one for $ 70 - we have a deal, even if it is the even further throttled DDR4 version. $ 70 is about what that card is really worth.

    Unrelated to this: My question arose from a situation I believe a number of us have: a HTPC that's otherwise Ok (in my case, around a Haswell i5), but cannot for the life of it decode 2160p HEVC at 30 fps or faster. If nothing else, a 1030 class card does at least have HDMI 2.0 out. For a new build, I would probably give the Ryzen 2400G a spin.
  • ET - Wednesday, June 6, 2018

    I think I can post again. Spam filter blocked me yesterday from posting anything at all. I'll try the part without dollar signs first.

    If you just want video, why would you need a GeForce 1030 level GPU? Video is a different ARM IP anyway, not part of the G76.

    I do see a small market for a very low power USB GPU that's simply a mobile CPU with some low power RAM. All that basically needs is drivers, and preferably BIOS support. That would allow for example creating Ryzen based PCs without having to stick a GPU in the case, and would work for people like you with old hardware who want support for newer standards, including for laptop owners who want video out and for whom a GPU upgrade is impractical.
  • ET - Wednesday, June 6, 2018

    Okay, now for the tricky part.

    I indeed see that the 1030 has gone up in price. I can find it for $ 90 at Amazon and Newegg, so it's not as bad as you say, and there's a DDR4 version for $ 77, which may be okay if what you're looking for is video playback and not 3D performance. However, I don't think a G76 part would solve the GPU market prices problem. If it's good enough, its price will go up like the rest of them. If it's not, its market share will be rather small. I think (as I posted in the other part) that a low power USB card would have a larger market. It would be a more convenient add-on, which could be applied to more configurations.
  • darkich - Friday, June 1, 2018

    16.9fps/W vs 11.9fps/W (Snapdragon 845), and you "don't think it will catch up with the competition".
  • vladx - Friday, June 1, 2018

    Indeed the author/s seem quite biased.
  • Andrei Frumusanu - Saturday, June 2, 2018

    There's a process node difference between that comparison. An eventual Snapdragon 855 will surpass it.
  • vladx - Saturday, June 2, 2018

    Jumping to such conclusions doesn't sit well with being an impartial party.
  • jospoortvliet - Monday, June 4, 2018

    Oh come on you think they should assume the next snapdragon is not improved to be seen as impartial?

    They point out that the projection is that this MALI will be 15% faster than the current snapdragon. But it comes out next year and this will have to compete with the next snapdragon, not the 845. Totally sane to point out that given their history it seems a stretch to same that Qualcomm will only improve their new high end SOC by 15% or less...
  • jospoortvliet - Monday, June 4, 2018

    Same -> assume
