Imagination’s PowerVR Rogue Series 6/6XT USCs Dissected

Diving right into the heart of the matter, it’s probably best to start with a partial explanation for why Imagination is being so forthcoming about their architecture after all this time.

Imagination’s principal blog post, Graphics cores: trying to compare apples to apples, opens up with an argument over just what a “core” is and how it should be counted. Imagination doesn’t name any names, but from the context of their blog it’s clear that they’re worried about being dragged into a core war and losing based on who’s counting cores and how. Time and time again we’ve seen chip and device makers promote their products both by their clockspeeds and their number of cores, leading to both MHz wars and core wars.

Both of these pieces of data are in fact essential to understanding a product, but consumers – who rightfully shouldn’t need a computer engineering degree to understand their purchases – will latch on to one or both numbers, which in turn rewards whoever can push a product with the highest clockspeed or the highest number of cores. The latter can result in some very creative accounting, such as the Motorola X8, where components that aren’t traditionally called a core end up in the core count.

In the SoC GPU space there thankfully isn’t any creative accounting going on, but the nature of the Rogue architecture combined with Imagination’s previous method of counting cores would leave Imagination shooting themselves in the foot if they were to get into a core war. If nothing else, the fact that Imagination has previously used the term “core” to refer to a single USC – which depending on the specific design may be replicated several times – means that they end up with relatively low core counts.

Compounding the matter is that at the pipeline or shader core level (there’s that word again), Rogue has several FP32 (or FP16) ALUs where other architectures may have only one. So even if one were to count pipelines, Imagination would come up with just 16 cores in a USC.

Their competition and impetus – and again Imagination hasn’t named any names – would appear to be NVIDIA, who just recently announced their Tegra K1 SoCs. A K1 contains 192 CUDA cores, each much narrower than a Rogue pipeline. So even though a high-end Rogue design would have 6 or so USCs, that would only be 96 pipelines versus 192 CUDA cores. Imagination clearly doesn’t want to be dismissed based on this metric, hence their interest in arguing what a core is, and more importantly part of the reason for their desire to open up and discuss their architecture so that we can understand why this difference matters. Imagination may only have 96 pipelines, but they have many more ALUs in each one of those pipelines.
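
To put some rough numbers on the counting problem, here is a quick sketch based on the figures in this article (the per-pipeline ALU counts are covered in the USC walkthrough below; none of this is a product spec):

```python
# Core counting, three different ways. Figures are from this article: ~6 USCs
# in a high-end Rogue design, 16 pipelines per USC, 2 FP32 ALUs per pipeline,
# versus 192 CUDA cores (1 FP32 ALU each) in Tegra K1.
ROGUE_USCS         = 6
PIPES_PER_USC      = 16
FP32_ALUS_PER_PIPE = 2
K1_CUDA_CORES      = 192

rogue_pipelines = ROGUE_USCS * PIPES_PER_USC            # 96 pipelines
rogue_fp32_alus = rogue_pipelines * FP32_ALUS_PER_PIPE  # 192 FP32 ALUs

# Counted by USC "cores" Rogue looks tiny (6 vs 192); counted by pipelines it
# still trails (96 vs 192); counted by FP32 ALUs the two designs are comparable.
print(ROGUE_USCS, rogue_pipelines, rogue_fp32_alus, K1_CUDA_CORES)
```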

The Rogue

And that brings us to the Rogue architecture. For today’s article and blog releases, Imagination has given us some surprisingly deep details on the Rogue USC, from which we can finally paint a picture of how thread execution works on Rogue, and furthermore gather some idea on what Rogue’s strengths are and what weaknesses it faces.

We’ll begin by differentiating between Rogue as in PowerVR Series 6, and Series 6XT. Series 6XT is the recently announced refresh of the Series 6 family, intended to further improve PowerVR performance through several optimizations to the design. At the time of that announcement we didn’t have very many details on what those optimizations were; now we have a much better idea.

It turns out that Imagination has been doing some tinkering under the hood, and while the number of FP32 slots remains unchanged, the FP16 slots have been altered. Wait, did we say FP16 slots?! Oh yes!

It’s a shame we don’t have similar low-level details for other SoC architectures, but from a PC perspective it’s extremely interesting that Imagination has FP16 slots at all. As a quick refresher, on the PC side GPUs have been using FP32 operations exclusively for quite some time, and all of their ALUs have been FP32 for years. Mobile, however, trails the PC in functionality and shader complexity, so while FP16 usage is rare in PC games and applications, it can be very common in the mobile space.

The tradeoff of course is precision for performance. A 16-bit floating point number is fairly imprecise – 16 bits only gives us 65K (2^16) distinct values – hence it’s rarely used in the PC space in favor of far more precise 32-bit floating point numbers, which offer roughly 4 billion (2^32). But a 16-bit value is smaller than a 32-bit value, requiring less memory to store, less bandwidth to move, less energy to compute, and fewer transistors to operate on. So in the mobile space FP16 is still in use, both on the software side and the hardware side, with developers taking care not to let the lower precision cause rendering errors.
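
As a quick illustration of that precision gap, here is a minimal sketch using NumPy’s float16 and float32 types as a stand-in for the GPU’s FP16 and FP32 ALUs (an approximation of the behavior, not a hardware model):

```python
import numpy as np

pi = 3.14159265358979
print(np.float32(pi))   # roughly 7 significant decimal digits survive
print(np.float16(pi))   # roughly 3 survive; the nearest FP16 value is 3.140625

# FP16's range is also small: the largest finite value is 65504,
# and there are only 2**16 = 65536 bit patterns to go around.
print(np.float16(70000))           # overflows to inf
print(np.finfo(np.float16).bits)   # 16
```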

The interesting part on the hardware side is the implementation of dedicated FP16 units, versus just running the values on FP32 units. In the PC space we do the latter, but of course FP16 operations are relatively rare there. The catch is that it’s not the most efficient way to execute FP16 operations; we’re firing up a lot of extra transistors and burning extra power to do it. In the PC space this is a reasonable tradeoff, but in the mobile space it’s clear that Imagination has decided otherwise.
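
In code terms, “running FP16 values on FP32 units” amounts to widening the operands, computing at full width, and rounding the result back down. A small sketch of the idea, again using NumPy types as a software analogue:

```python
import numpy as np

a, b, c = np.float16(1.5), np.float16(2.25), np.float16(0.125)

# Promote to FP32, do the multiply-add at full width, then narrow to FP16.
# The result is fine; the cost is that a 32-bit datapath did 16-bit work.
result = np.float16(np.float32(a) * np.float32(b) + np.float32(c))
print(result)   # 3.5
```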

The end result being that Rogue has dedicated FP16 units, not unlike how NVIDIA has dedicated FP64 CUDA cores on their desktop architectures. FP16 units take up additional die space – you have to place them alongside your FP32 units – but when you need to compute an FP16 number, you don’t have to fire up a more expensive FP32 unit to do the work. Hence the inclusion of FP16 units is a tradeoff between die size and performance, with the additional performance coming from the reduction in power consumption and heat when operating on FP16 values.

That little segue aside, let’s get back to the difference between Series 6 and Series 6XT. With Series 6, Imagination has an interesting setup where their FP16 ALUs can process up to 3 operations in one cycle.

As a refresher, for a more traditional ALU we say that one ALU can achieve 2 FLOPs in one cycle because we typically count the densest operation, the multiply-add (MAD), otherwise seen as a multiply-accumulate (MAC) or fused multiply-add (FMA). A MAD is a special case where one ALU can do both the multiply and the add in a single cycle, one of a handful of scenarios where multiple operations can occur at once. MADs and their variants are relatively common in graphics work, hence this is the number we usually quote for GPU performance – and being the bigger number it makes manufacturers happier, too.
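
For the sake of illustration, here is that accounting written out (the clockspeed below is a placeholder, not a real Rogue spec):

```python
# A MAD (d = a*b + c) is one instruction but two floating point operations,
# so an ALU that can issue one MAD per clock is credited with 2 FLOPs/clock.
def mad(a, b, c):
    return a * b + c

print(mad(2.0, 3.0, 1.0))   # 7.0 - one issue slot, two FLOPs

# Peak FLOPS figures simply assume an unbroken stream of MADs:
fp32_alus_per_usc = 16 * 2                 # 16 pipelines x 2 FP32 ALUs each
clock_ghz = 0.6                            # hypothetical clock
print(fp32_alus_per_usc * 2 * clock_ghz)   # GFLOPS per USC on FP32 MADs
```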

With that said, Imagination’s FP16 ALUs are more intricate than their FP32 ALUs, especially under Series 6. Unfortunately Imagination is unable to share the complete details of why this is (as we've noted before, they’re not yet fully open), but it’s clear from this diagram that the Series 6 FP16 ALU has been optimized for some kind of 3-operation instruction. What that instruction is we don’t know, but presumably it was important enough for Imagination to give their FP16 ALUs the ability to execute it in one cycle.

But from the perspective of an individual pipeline it’s interesting to see that with Series 6XT Imagination has shaken things up by "adding" 2 more FP16 ALUs. Keeping in mind that these are logical diagrams and not physical diagrams – in reality these ALUs are almost certainly tied together in a limited fashion rather than standing alone – Imagination has changed things so that Series 6XT can process more 2-operation FP16 instructions such as MAD. Without more details it’s hard to say for sure what’s going on, but if these “new” FP16 ALUs are really just modifications of the old FP16 ALUs, then it would be a safe assumption that Imagination can still execute their 3-operation instructions across what’s logically 2 ALUs, making these new FP16 ALUs a wider variant of the previous design.
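
Putting rough numbers on the change: if we assume the Series 6 pipeline carries 2 FP16 ALUs (so that Series 6XT’s “two more” make 4 – an assumption on our part, not something Imagination has spelled out), the per-pipeline FP16 math works out as follows:

```python
# FP16 FLOPs per pipeline per clock, under the assumptions stated above.
series6_fp16   = 2 * 3   # 2 FP16 ALUs, each handling a 3-operation instruction
series6xt_fp16 = 4 * 2   # 4 FP16 ALUs, each handling a 2-operation MAD

print(series6_fp16, series6xt_fp16)   # 6 vs 8 FP16 FLOPs per clock
```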

Otherwise from this point of view Series 6 and Series 6XT are nearly identical. We have the 2 FP32 ALUs, and then for special functions – transcendentals and the like – a dedicated special function unit exists off to the side. On that note, for the rest of this article we will primarily be focusing on Series 6XT since it’s the newer architecture, but with the exception of FP16 operations it will be equally applicable to both Series 6 and 6XT.

Comments

  • MrPoletski - Sunday, March 9, 2014 - link

    The reason PowerVR left the desktop market was simply that ST Microelectronics sold their graphics division. The KYRO was an extremely successful card and would have continued to be. IIRC VIA tried to buy it up and carry on selling the KYRO 3, but could not reach a licence deal with STMicro – who claimed copyright on the chip design (the non-PowerVR parts).
  • Scali - Monday, February 24, 2014 - link

    They could... But Imagination is just like ARM: they don't build GPUs themselves, they only license the designs.
    It has been possible for years to license a PowerVR design and scale it up to an interesting desktop GPU. It's just that so far, no company has done that. Probably too big a risk to take, trying to compete with giants such as nVidia and AMD.
    The last desktop PowerVR cards mainly failed because of poor software support. Aside from the drivers not being all that mature, there was also the problem that many games made assumptions that simply would not hold on a TBDR architecture, and rendering bugs were the result.
    If you were to build a PowerVR-based desktop solution today, chances are that you run into quite a lot of incompatibilities with existing games.
  • iwod - Monday, February 24, 2014 - link

    I didn't want the word Apple in it, to refrain from trolls and flame wars, so I didn't write it out clearly the first time.
    The sole reason why PowerVR failed in the first place was their drivers, and the same reason is why most other GPU companies failed as well. Much like S3. Drivers in the GPU market mean literally everything. It doesn't matter if the GPU is insanely great if it doesn't run any of the latest games; error upon error and it simply won't sell. Unlike CPUs, where you actually get down to the metal with programming.

    Nvidia famously pointed out they have many more software engineers than hardware engineers. Writing decent performing drivers takes time and money, hence why not many GPU manufacturers survive. Most of them don't have enough resources to scale. Same goes for PowerVR. I still remember the Kyro graphics card I loved, until it didn't work on the games I wanted to play.

    But this time it is different. The mobile market has already exceeded the PC market and will likely exceed the total GPUs shipped in PCs + consoles combined! Since the drivers you write for mobile iOS can in many cases effectively be used on Mac OS X as well, using PowerVR on the Mac makes an appealing case.

    Maybe the industry leaders view tablets / mobile phones + consoles as the next trend, while PC & Mac simply step back from gaming?
  • Scali - Tuesday, February 25, 2014 - link

    "The sole reason why PowerVR failed in the first place were their Drivers"

    As I said, it was not necessarily the drivers themselves. A nice example is 3DMark2001. Some scenes did not work correctly because of illegal assumptions about z-buffer contents. When 3DMark2001SE was released, one of the changes was that it now worked correctly on Kyro cards.

    It is unclear where PowerVR stands today, since both their hardware and the 3D APIs and engines have changed massively. The only thing we know for sure is that there are various engines and games that work correctly on iPhone/iPad.
  • Sushisamurai - Monday, February 24, 2014 - link

    Typo: "one thread per shader care, which like the shader cores are grouped together into what we call wavefronts." Should be shader core?

    In "Background: how GPU's work"
  • Ryan Smith - Monday, February 24, 2014 - link

    Indeed it was. Thank you for pointing that out.
  • chinmaythosar - Monday, February 24, 2014 - link

    I wonder how AAPL will handle the FP16 cores ... they are moving to 64-bit in CPUs and they would have hoped to move to FP64 in GPUs ... it would have given them a real talking point in the keynote for the iPad 6 (or whatever they call it) .. "next-gen 192 core GPU FP64 architecture .. 4x graphics power" etc etc .. :P
  • MrSpadge - Saturday, March 1, 2014 - link

    Not sure what AAPL is, but pure FP64 for graphics would be horrible. You don't need the precision, but you waste lots of die space and power.
  • xeizo - Monday, February 24, 2014 - link

    They would be more competitive and interesting to use if they published open drivers instead of "open architecture" pics .... :(
  • Eckej - Monday, February 24, 2014 - link

    Couple of small errors/typos:
    Under How Rogues Get Executed: Wavefronts & Superscalar ILP - the diagram should probably have 16, not 20 pipelines - looks like an extra row slipped in!

    The page before: " With Series 6, Imagination has an interesting setup where there FP16 ALUs can process up to 3 operations in one cycle." There should read their.

    Bottom of page 2 "And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfired Shading Cluster." - Unfired should read Unified.

    Sorry to be picky.
