Arm's New Mali-G77 & Valhall GPU Architecture: A Major Leap

by Andrei Frumusanu on May 27, 2019 12:00 AM EST

The Mali-G77 Microarchitecture

Having covered the execution engine, which is responsible for arithmetic processing, we should note that it is only one part of the wider core design. Here Arm has kept the overall design quite similar to previous-generation GPUs, albeit with important changes in several blocks.

A shader core still contains the execution engine, load/store unit with cache, attribute unit, varying unit, texture mapping unit and pixel backend, as well as various other 3D fixed function blocks.

The biggest change here is in the texture unit block, which has doubled its throughput compared to the already doubled unit we found on the Mali-G76.

From a high-level functionality standpoint, the new TMU looks quite similar to its predecessor; however, we find some very significant changes in the throughput of the new design.

The design is partitioned into two “paths”, a hit-path and a miss-path, which handle accesses that hit in the texture cache and accesses that miss it, respectively. The hit-path is naturally the shorter, more latency-optimised path.

On the hit-path, the texture cache itself has been improved: it is now 32KB in size and capable of 16 texels/cycle of throughput. The filtering unit has also been improved and now supports one quad per cycle for bilinear texturing, or half a quad per cycle for trilinear texturing, both 2x the G76’s throughput.

Interestingly, Arm says that the new TMU is roughly the same area as its predecessor yet still enables this doubling of capability, which is quite a nice engineering feat.

Fundamentally, this large increase in the texturing capability of a core changes the ALU:Tex ratio of the GPU. Even though ALU capability has increased by 33%, the doubling of the TMU throughput means that we’re now back to a lower ratio, more in favour of texture throughput, whereas past GPUs focused on increasing compute performance. Arm deemed this a necessary change for workloads that are now starting to tax this aspect of GPUs more.
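To put a number on that shift, here is a minimal sketch that uses only the two per-core scaling factors quoted above (ALU +33%, texture throughput 2x). The function name and constants are illustrative, not anything Arm publishes, and the result is a relative figure rather than an absolute ALU:Tex ratio.

```python
# Relative change in the ALU:Tex ratio from G76 to G77, derived only from
# the per-core scaling factors quoted in the article.

def relative_alu_tex_ratio(alu_scale: float, tex_scale: float) -> float:
    """Return the new ALU:Tex ratio as a fraction of the old one."""
    if alu_scale <= 0 or tex_scale <= 0:
        raise ValueError("scaling factors must be positive")
    return alu_scale / tex_scale


ALU_SCALE = 1.33  # "ALU capability has increased by 33%"
TEX_SCALE = 2.0   # texture unit throughput doubled

ratio = relative_alu_tex_ratio(ALU_SCALE, TEX_SCALE)
print(f"G77 ALU:Tex ratio relative to G76: {ratio:.3f}x")
```

Under these figures the ratio falls to roughly two thirds of what it was on the G76.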

It should be noted that while the texture filtering throughput has increased, the actual pixel backend throughput has not. Here a shader core is still only able to output 2 pixels per clock, so we now have a 2:1 texel:pixel ratio, whereas in the past it remained 1:1.
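The 2:1 figure can be checked with a short calculation. This sketch assumes a quad is a 2x2 block (4 texels) and takes the G76 bilinear rate as half a quad per cycle, which follows from the "2x of G76" statement earlier. The helper name is hypothetical.

```python
from fractions import Fraction

TEXELS_PER_QUAD = 4  # assumption: a quad is a 2x2 block


def texel_pixel_ratio(quads_per_cycle: Fraction, pixels_per_cycle: int) -> Fraction:
    """Bilinear texels filtered per cycle divided by pixels written per cycle."""
    if pixels_per_cycle <= 0:
        raise ValueError("pixels_per_cycle must be positive")
    return quads_per_cycle * TEXELS_PER_QUAD / pixels_per_cycle


# G77: one bilinear quad per cycle, 2 pixels per clock from the pixel backend.
g77 = texel_pixel_ratio(Fraction(1), 2)
# G76: half that filtering rate, same 2 pixels per clock.
g76 = texel_pixel_ratio(Fraction(1, 2), 2)

print(f"G77 texel:pixel = {g77}:1, G76 texel:pixel = {g76}:1")
```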

Another redesign among the shader core blocks is the load-store cache block. Functionally it’s the same as in the past, but it has now been redesigned with more throughput in mind. Within the same area, the number of pipeline stages has been halved, further reducing the latency of the core’s operation. The bandwidth has been widened to a full cacheline width, which should be a doubling over its predecessor.

The actual cache is 16KB in size and 4-way set associative, and is said to be very useful for ML workloads.
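For a sense of what that geometry implies, the sketch below derives the set count from the quoted size and associativity. The 64-byte line size is an assumption taken from the article's description of 64B as "a full cacheline"; Arm has not detailed the indexing scheme, so the bit counts are only what a conventional set-associative layout would give.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> tuple[int, int, int]:
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    if size_bytes <= 0 or ways <= 0 or line_bytes <= 0:
        raise ValueError("all parameters must be positive")
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line_bytes")
    sets = size_bytes // (ways * line_bytes)
    # Power-of-two check so the bit counts below are exact.
    if sets & (sets - 1) or line_bytes & (line_bytes - 1):
        raise ValueError("sets and line size must be powers of two")
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


LSC_SIZE_BYTES = 16 * 1024  # 16KB, from the article
LSC_WAYS = 4                # 4-way set associative, from the article
LINE_BYTES = 64             # assumption: 64B cacheline

sets, offset_bits, index_bits = cache_geometry(LSC_SIZE_BYTES, LSC_WAYS, LINE_BYTES)
print(f"{sets} sets, {offset_bits} offset bits, {index_bits} index bits")
```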

Putting all the pieces together and zooming out from a shader core to the GPU level, we again see much that is familiar in how Arm organises its overall block. The architecture supports scaling from 1 to 32 shader cores, although the microarchitecture of the G77 currently only supports up to 16 cores. Furthermore, the smallest design that Arm currently makes RTL-ready is a 7-core configuration, as the company deems that customers going for smaller configurations would be better served by different IP (such as the G52, or maybe a future unannounced IP in the same range).

The L2 cache still consists of up to four slices, each from 256KB to 1MB in size. Currently, most vendors have gone with 2MB configurations and I don’t think any licensee has ever implemented 4MB. In terms of bandwidth, the L2-to-LSC bandwidth has also doubled from 32B/cycle to 64B/cycle (a full cacheline), while the external bandwidth depends on whether the vendor implements a 128-bit or 256-bit AXI interface to each of the L2 slices.
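The configuration space described here is small enough to enumerate. The following sketch computes total L2 capacity and peak external bytes per cycle from the slice count, slice size and AXI width. The function names are hypothetical, the 4 x 512KB split is just one possible way to reach 2MB, and the bandwidth figure is per interface clock cycle, since no clock frequency is given in this section.

```python
VALID_AXI_WIDTHS_BITS = (128, 256)  # per-slice interface widths named in the article


def l2_capacity_kb(slices: int, slice_kb: int) -> int:
    """Total L2 capacity in KB for a given slice count and slice size."""
    if not 1 <= slices <= 4:
        raise ValueError("the article describes up to four slices")
    if not 256 <= slice_kb <= 1024:
        raise ValueError("slice size ranges from 256KB to 1MB")
    return slices * slice_kb


def external_bytes_per_cycle(slices: int, axi_bits: int) -> int:
    """Peak external bytes per interface cycle, one AXI port per slice."""
    if not 1 <= slices <= 4:
        raise ValueError("the article describes up to four slices")
    if axi_bits not in VALID_AXI_WIDTHS_BITS:
        raise ValueError("AXI width must be 128 or 256 bits")
    return slices * axi_bits // 8


# One possible 2MB arrangement (assumption): four slices of 512KB each.
print(l2_capacity_kb(4, 512), "KB total L2")
print(external_bytes_per_cycle(4, 128), "B/cycle with 128-bit AXI")
print(external_bytes_per_cycle(4, 256), "B/cycle with 256-bit AXI")
```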

42 Comments

darkich - Monday, May 27, 2019
40% more performance just from design improvements?
That's ridiculous, if true..
spaceship9876 - Monday, May 27, 2019
I really hope they release a Mali-G32 replacement for the G31 with this new architecture, a smaller die with lower power consumption and better performance would be great for entry level phones.
KECHEES - Tuesday, May 28, 2019
And come to think of it, The other Mali gpu was fab on 8nm. so given that 7nm euv is supposedly 50% more efficient, we should be looking at a staggering performance improvement that's way above Arm's 40% target
ballsystemlord - Monday, May 27, 2019
Spelling and grammar corrections (Hint: have someone read what you're writing so that you don't make so many dumb mistakes).

"Valhall and the new Mali-G77 follow up on the last three generation of Mali GPUs with some significant improvements in performance,..."
Missing s:
"Valhall and the new Mali-G77 follow up on the last three generations of Mali GPUs with some significant improvements in performance,..."

"...the new ISA is said to be more compiler friendly and adapted and designed to better aligned with modern APIs such as Vulkan."
Missing "be":
"...the new ISA is said to be more compiler friendly and adapted and designed to be better aligned with modern APIs such as Vulkan."

"Dwelling deeper into the structure of the execution engine,..."
Very awkward, try delving:
"Delving deeper into the structure of the execution engine,..."

"One single has more instances on the primary datapath, and less instances of the control and I-cache,..."
Single what, engine? Maybe "EE"?
"One single EE has more instances on the primary datapath, and less instances of the control and I-cache,..."

"On the hit-path, the texture cache itself has been improved and is now 32KB and is able of 16 texels/cycle throughput."
Missing words, maybe:
"On the hit-path, the texture cache itself has been improved and is now 32KB and is able to process 16 texels/cycle throughput."

"Arm states that fundamentally frequency between the G76 and G77 shouldn't change much at all, an internally Arm still targets an 850MHz sign-off."
"and" not "an"
"Arm states that fundamentally frequency between the G76 and G77 shouldn't change much at all, and internally Arm still targets an 850MHz sign-off."
warreo - Monday, May 27, 2019
Not to say we should excuse journalists for less than stellar writing, but having read his stuff for a long time, with Andrei you have to accept the good (technical expertise) with the "could use improvement" (writing/word choice). There's no one out there that offers the kind of analysis and insights Andrei does, so as a reader I continue to read his articles with great interest and don't let the typos and writing bother me.
phoenix_rizzen - Tuesday, May 28, 2019
I don't mind the typos and wording issues and grammar issues ... if this was a blog where the content was written and posted directly by the author.

What really bugs me is that Anandtech (and Ars, and other news sites) supposedly have editors on staff, yet these issues still slip through. :( There was a time when articles would pass through two or three stages of proofing to make sure these kinds of things didn't make it to press. But, it seems even for-pay "newspapers" these days are lacking in the QA/proofing department, so there's not much we can expect from for-free news sites. :(
Andrei Frumusanu - Tuesday, May 28, 2019
Thanks for the corrections.
eastcoast_pete - Monday, May 27, 2019
As mentioned in my post on Andrei's A77 article, I believe that at least some of these efforts are also to help establish ARM's designs as believable competition in the ultraportable space. With the graphics, that won't apply to Qualcomm, but is vital for Huawei and Samsung, as they rely on ARM-designed GPUs. A hexa- or octacore A77 with 12 or 16 of these might just be able to go head-to-head with Intel's low power chips.
Andrei Frumusanu - Tuesday, May 28, 2019
Currently the big issue with Mali and ultra-portable is the fact that Arm has no plans for Windows drivers. Thus aside from ChromeOS devices, they're not really targeting that form-factor as much on the GPU as they are on the CPU (because Qualcomm uses the CPU).
eastcoast_pete - Wednesday, May 29, 2019
Andrei, that's an important point. Also shows that MS is not as full-throated in its Windows-on-ARM as they let on. While I believe that some of the existing graphics support in Windows for QC's Adreno House is due to QC doing a lot of the heavy lifting, I don't believe that ARM would say no to a collaborative effort with MS to get MALI supported in Windows.