Tricks of the Trade: Transaction Elimination and Frame Buffer Compression

While we have spent some time covering various techniques ARM uses to improve efficiency in Midgard, we wanted to spend a bit more time talking about two specific techniques in general that we find especially cool: transaction elimination and frame buffer compression.

Going back once again to what we said earlier about rendering and power efficiency, any rendering work ARM can eliminate before it’s completed not only improves performance by freeing up resources, but it also frees up power by not having to spend it on said redundant work. This is especially the case for anything that wants to hit system memory, as compared to the on-die caches and memories available to the GPU, system memory is slow and expensive to operate from a power perspective.

For their final two tricks then, having already eliminated as much rendering work as possible through other means, ARM’s last tricks involve minimizing the amount of data from rendered tiles and pixels that needs to hit system memory. The first of these tricks is Transaction Elimination (TE), which is based on the idea that if a scene (or parts of it) do not change, then it makes no sense to spend power and bandwidth rewriting those identical screen portions.

To accomplish this, ARM relies on their tiling system to break down the scene for them, and from there they can begin comparing tiles that are waiting for finalization (ROP/blending) to the tiles that are already in the frame buffer from the previous frame. Using a simple cyclic redundancy check to compare the tiles, if the tile to be rendered is found to be identical to the tile already there, the tile can be skipped and the memory bandwidth saved. Altogether of all of ARM’s various tricks, this is among the simplest conceptually.

The effectiveness of Transaction Elimination in turn depends on the content. A generally static workload such as a movie will have a high degree of redundancy overall (notably when the camera is not moving), while a game may have many moving elements but will still have redundant elements that can be skipped. As a result ARM can save anywhere between almost nothing and over 99% for a highly static workload, with the average more than offsetting the roughly 1.5% overhead from computing and comparing the CRCs.

Of course Transaction Elimination does have one drawback besides its low overhead, and that is CRC collisions. During a CRC collision a pair of tiles that are different will compute to the same CRC value, and as such Transaction Elimination will consider them identical and throw away the new tile. With a standard CRC value being 64bits, such a collision is rare but not impossible, and indeed will statistically occur sooner or later. In which case Transaction Elimination has no fallback method; it is judge, jury, and executioner as it were, and the new tile will be lost.

As a result Transaction Elimination is interestingly imprecise in a world of precision. When a collision occurs the displayed tile will be wrong, but only for as long as there is a collision, which in turn should only be for 1 frame, or 1/60th of a second.

Moving on, when worse comes to worse and ARM does need to write a new tile, on the Mali-T700 series GPUs they can turn to ARM Frame Buffer Compression (AFBC) to minimize the amount of memory bandwidth they spend on that operation. By using a lossless compression algorithm to write out and store a frame, memory bandwidth is saved on both the writing of the frame and in the reading of it.

AFBC requires that both the GPU and the Display Controller support the technology, as the frame remains compressed the entire time until decompressed for display/consumption. Interestingly this means that the GPU needs to be able to compress as well as decompress, as it can reuse its own frames either in frame buffer objects (where a frame is rendered to a texture) or in Transaction Elimination. This becomes a secondary vector of saving bandwidth since it results in similar bandwidth savings for the frame even if the frame is never touched by the display controller itself. A similar principle applies to ARM’s video decoders (VPUs) which can use AFBC to compress a frame before shipping it off to the GPU.

On that note, it’s worth pointing out that while AFBC is an ARM technology, for interoperability purposes ARM does license it out to other display controller designers. ARM puts together their own display controllers, but because SoC integrators can use one of many display controllers it’s to ARM’s own benefit that everyone else be able to read AFBC as well as ARM can.

Midgard’s Execution Model: ILP, not TLP Technical Comparisons & Final Words
Comments Locked

66 Comments

View All Comments

  • tuxRoller - Monday, July 7, 2014 - link

    Yup, that is indeed what I said earlier.
    I don't think anand is interested in investigating that avenue, however. For one, he might fear that such published info would blacklist him from future qualcomm info dumps (pretty far fetched, imho, but AT likes the corporate relationships its been able to cultivate). For another, AT doesn't seem to be terribly interested in oss.
  • prabindh - Monday, July 7, 2014 - link

    Do the GFLOPS give a measure of how much performance an OpenCL kernel can provide in ARM Mali architecture ? Simple fp16/32 kernels doing a MAD and writing to global memory do not seem to match the GFLOPS calculated here. Are there other HW LD/ST limits in the pipeline (assuming no system memory Bandwidth limitations) ?
  • Amadiro - Friday, July 11, 2014 - link

    So how do the dFdx/dFdy operators work for such a non-wavefront based design? Does using them imply a huge overhead/stall/context-switch of some sort on this kind of architecture?
  • MrSpadge - Wednesday, August 13, 2014 - link

    Thanks ARM, Ryan and Anandtech for this very interesting article! This finally gives this "yet another SoC GPU irrelevant to me" a face. And I think ARM could employ this design for some very nice GP-GPU (co-)processors:

    - FP64 performance rivaling the best
    - not relying on TLP could open up the GPU for entire new application ranges where they just didn't make sense yet
    - ARM can scale the amount of compute per core easily

    I know building a massive GP-GPU chip is anything but trivial.. but this seems an architecture worthy of this!
  • manoj1919 - Friday, September 25, 2015 - link

    Hi Anand,
    Nice article!.I am working with mali GPU, and I am a researcher trying to figure out power management capabilities of MALI GPU. last part of this article seems to be covering a bit of that. can you please point me to the source of ARM's slide showing gating techniques of MALI GPU(last figure in article).
    Thanks in advance,
    Manoj
  • gregware - Monday, February 20, 2017 - link

    Interesting article, thanks!

Log in

Don't have an account? Sign up now