ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali

Name: ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali
Item: ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali
Author: Ryan Smith

by Ryan Smith on May 30, 2016 7:00 AM EST

Posted in
GPUs
Arm
Mali
Mali G
Bifrost

57 Comments | Add A Comment

57 Comments

The Bifrost Core: Decoupled

Finally moving up to the 500ft view, we have the logical design of a single Bifrost core. Augmenting the changes we’ve discussed so far at the quad/execution engine level, ARM has made a number of changes to how the rest of the architecture works, and how all of this fits together as a whole.

First and foremost, a single Bifrost core contains 3 quad execution engines. This means that a single core is at any time executing up to 12 FMAs, spread over the aforementioned 3 quads. These quads are in turn fed by the core’s thread management frontend (now called a Quad Manager), which combined with the other frontends issues work to all of the functional units throughout the core.

As we’ve now seen the quad execution engines, insightful readers might have noticed that the execution engines are surprisingly sparse. They contain ALUs, register files, and little else. In most other architectures – including Midgard – there are more functional units organized within the execution engines, and this is not the case for Bifrost. Instead the load/store unit, texture unit, and other units have been evicted from the execution engines and placed as separate units along the control fabric.

Along with the shift from ILP to TLP, this is one of the more significant changes in Bifrost as compared to Midgard. Not unlike the TLP shift then, much of this change is driven by resource utilization. These units aren’t used as frequently as the ALUs, and this is especially the case as shader programs grow in length. As a result rather than placing this hardware within the execution engines and likely having it underutilized, ARM has moved them to separate units that are shared by the whole core.

The one risk here is now that there’s contention for these resources, but in practice it should not be much of an issue. Comparatively speaking, this is relatively similar to NVIDIA’s SMs, where multiple blocks of ALUs share load/store and texture units. Meanwhile this should also simplify core design a bit; only a handful of units have L2 cache data paths, and all of those units are now outside of execution engines.

Overall these separated units are not significantly different from their Midgard counterparts, and the big change here is merely their divorce from the execution engines. The texture unit, for example, still offers the same basic feature sets and throughput as Midgard’s, according to ARM.

Meanwhile something that has seen a significant overhaul compared to Midgard is ARM’s geometry subsystem. Bifrost still uses hierarchical tiling to bin geometry into tiles to work on it. However ARM has gone through quite a bit of effort here to reduce the memory usage of the tiler, as high resolution screens and higher geometry complexity was pushing up the memory usage of the tiler, and ultimately hurting performance and power efficiency.

Bifrost implements a much finer grained memory allocation system, one that also does away entirely with minimum allocation requirements. This keeps memory consumption down by reducing the amount of overhead from otherwise oversized buffers.

But perhaps more significant is that ARM has implemented a micro-triangle discard accelerator into Bifrost. By eliminating sub-pixel triangles that can’t be seen early on, ARM no longer needs to store those tringles in the tiler, further reducing memory needs. Overall, ARM is reporting that Bifrost’s tiler changes are reducing tiler memory consumption by up to 95%.

Along similar lines, ARM has also targeted vertex shading memory consumption for optimization. New to Bifrost is a feature ARM is calling Index-Driven Position Shading, which takes advantage of some of the aforementioned tiler changes to reduce the amount of memory bandwidth consumed there. ARM’s estimates put the total bandwidth savings for position shading at around 40%, given that only certain steps of the process can be optimized.

Finally, at the opposite end of the rendering pipeline we have Bifrost’s ROPs, or as ARM labels them, the blending unit and the depth & stencil unit. While these units take a similar direction as the texture unit – there are no major overhauls here – ARM has confirmed that Bifrost’s blending unit does offer some new functionality not found in Midgard’s. Bifrost’s blender can now blend FP16 targets, whereas Midgard was limited to integer targets. The inclusion of floating point blends not only saves ARM a conversion – Midgard would have to covert FP16s to integer RGBA – but the native FP16 blend means that precision/quality should be improved as well.

FP16 blends have a throughput of 1 pixel/clock, just like integer blends, so these are full speed. On that note, Bifrost’s ROP hardware does scale with the core count, so virtually every aspect of the architecture will scale up with larger configurations. Given what Mali-G71 can scale to, this means that the current Bifrost implementation can go up to 32px/clock.

The Bifrost Quad: Replacing ILP with TLP Putting It Together: Mali-G71

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

57 Comments

View All Comments

mdriftmeyer - Tuesday, May 31, 2016 - link
Aren't you glad you commented yesterday? See the update to HSA.
prisonerX - Wednesday, June 1, 2016 - link
You're confusing OpenCL C with OpenCL. SPIR-V is an intermediate language also supported by OpenCL.
pencea - Monday, May 30, 2016 - link
How about the review for the GTX 1080? It's been days since the card came out. Other major sites have already posted their reviews on both the 1080 and 1070, while AnandTech still haven't posted one yet. Only a pathetic preview.
Quake - Monday, May 30, 2016 - link
Quantiy over quality my friend. Anandtech is known to write some concise, detailed and thorough articles. For me, sites like Engadget are tabloid newspapers while Anandtech is a respected newspaper that takes it time to write some thorough and intelligent reviews.
Tabalan - Monday, May 30, 2016 - link
True, but they could split review into 2 parts - 1st one would consists of tests, benchmarks, etc (like other websites do), other would be about uarchi on it's own. This way almost everyone would be happy.
funkforce - Monday, May 30, 2016 - link
Engadget?! Come on! PCWorld, PCGamesHardware, HardOCP, KitGuru, Hothardware, Tomshardware, TechSpot, HardwareCanucks, TweakTown all posted their review, many almost as thourough as Anandtech... 13 days ago! And we could forgive Mr Smith if it was a one time thing, but it's been like this every GPU review since he took over as Editor in Chief.
When Anand was in charge, no review was late like this and it still was as thorough.

Now I LOVE Ryan's writing, it's the best bar none, he is and awesome guy for sure!
But please! For the love of GOD, step down as Editor and focus on writing only, delivering on time without a boss to push you is not your thing...
Just check earlier reviews, same thing, 1-2 weeks late, even though promised so many times it would come out a week or two before actually published. (Except GTX 960 which never got published at all after 7 weeks of promises and then just silence)
Alexa shows this website has lost an insane amount of readers in 1 year. I have been here for almost 20 years and I just want AT to be great again. Please someone do something! Anyone?! <3 Please save AT!
r3loaded - Tuesday, May 31, 2016 - link
Because while other sites will be satisfied with a review that covers "zomg runs Crysis 3 and GTA V on max settings at xyz FPS, temps and noise are pretty good", Anandtech doesn't really roll that way. They won't be satisfied with their review until they've completed a deep dive on the Pascal architecture, the merits of GDDR5X and how it compares with GDDR5 and HBM/HBM2, and quantifying frame latency and consistency.

Other reviews are written by gamers and computer enthusiasts. Anandtech reviews are written by computer engineers.
prisonerX - Tuesday, May 31, 2016 - link
1080 whiners like you are really tedious. I hope they cancel the review.
jjj - Monday, May 30, 2016 - link
Any clue about cache sizes and if a reduction there is factored into the perf density math? Also wondering about thermal , some of the mentioned changes will help but some more details would be nice.
allanmac - Monday, May 30, 2016 - link
Nice review. Delivering a full-featured Vulkan/SPIR-V 1.1 GPU to the masses is something we're all ready for.

ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali

The Bifrost Core: Decoupled

Post Your Comment

57 Comments

View All Comments

mdriftmeyer - Tuesday, May 31, 2016 - link

prisonerX - Wednesday, June 1, 2016 - link

pencea - Monday, May 30, 2016 - link

Quake - Monday, May 30, 2016 - link

Tabalan - Monday, May 30, 2016 - link

funkforce - Monday, May 30, 2016 - link

r3loaded - Tuesday, May 31, 2016 - link

prisonerX - Tuesday, May 31, 2016 - link

jjj - Monday, May 30, 2016 - link

allanmac - Monday, May 30, 2016 - link

Log in

Don't have an account? Sign up now