How Rogues Get Executed: Wavefronts & Superscalar ILP

Now that we’ve seen the basic makeup of a single Rogue pipeline, let’s expand our view to the wider USC.

A single Rogue USC is composed of 16 pipelines, making the design a 16-wide array. This array, along with a texture unit, comprises one “cluster” when we’re talking about a multi-cluster (multiple USC) Rogue setup. In a setup with multiple USCs, the texture unit is shared between each pair of USCs.

We don’t have a great deal of information on the texture units themselves, but we do know that a Rogue texture unit can fetch 4 32-bit bilinear texels per clock. So for a top-end 6 USC part with 3 texture units (one per pair of USCs), we’d be looking at a texture rate of 12 texels per clock.
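To keep that arithmetic in one place, here’s a minimal sketch of the texel rate math under the figures above; the function and the 600MHz example clock are our own illustrative choices, not numbers from Imagination.

```python
# Back-of-the-envelope texel rate, using the figures quoted above:
# one texture unit per pair of USCs, each fetching 4 bilinear 32-bit
# texels per clock. The 600 MHz clock below is purely a hypothetical
# example, not a published Rogue clockspeed.
TEXELS_PER_TEXTURE_UNIT_PER_CLOCK = 4

def peak_texel_rate(num_uscs: int, clock_hz: float) -> float:
    """Peak bilinear texel fillrate in texels per second."""
    texture_units = num_uscs // 2  # one texture unit shared by each USC pair
    return texture_units * TEXELS_PER_TEXTURE_UNIT_PER_CLOCK * clock_hz

# A top-end 6 USC part: 3 texture units * 4 texels = 12 texels/clock,
# or 7.2 Gtexels/s at a hypothetical 600 MHz.
print(peak_texel_rate(6, 600e6))
```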

Now by PC standards the Rogue pipeline/USC setup is a bit unusual due to its width. Both AMD’s and NVIDIA’s architectures are fairly narrow at this level, possessing just a small number of ALUs per shader core/pipeline. The consequence of Rogue having multiple ALUs per pipeline is that some degree of instruction level parallelism (ILP) needs to be extracted from each thread in order to feed as many ALUs as possible. Extracting ILP in turn requires that a thread contain instructions with no dependencies on each other, so that they can be executed in parallel. Many (but not all) instructions can be paired up this way, so it’s worth noting that the efficiency of a USC is going to depend in part on the instruction mix within a thread. This ability to issue multiple independent instructions from a single thread in the same cycle is what makes the design superscalar.
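As a toy illustration of what extracting ILP actually involves, the sketch below checks whether two instructions from one thread can be co-issued by looking for a read-after-write dependency. It is purely illustrative and is not Imagination’s compiler or scheduler.

```python
# Toy co-issue check: two instructions from one thread can execute in
# parallel only if the second does not read the result of the first
# (no read-after-write dependency). Not Imagination's actual scheduler.
def can_co_issue(first, second) -> bool:
    """Each instruction is a (dest_register, set_of_source_registers) pair."""
    first_dest, _ = first
    _, second_sources = second
    return first_dest not in second_sources

# r2 = r0 * r1 and r5 = r3 + r4 are independent: both FP32 ALUs can be fed.
print(can_co_issue(("r2", {"r0", "r1"}), ("r5", {"r3", "r4"})))  # True

# r2 = r0 * r1 and r3 = r2 + r4 are dependent: the second must wait.
print(can_co_issue(("r2", {"r0", "r1"}), ("r3", {"r2", "r4"})))  # False
```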

For the sake of comparison, AMD’s Graphics Core Next is not a superscalar design at all, while NVIDIA’s Kepler is superscalar in a similar manner to Rogue. Each of NVIDIA’s CUDA cores has only 1 FP32 ALU, but additional banks of CUDA cores can be co-issued further instructions, conditions permitting. So Rogue has a similar reliance on ILP within a thread, needing it to achieve maximum efficiency.

What makes Rogue all the more interesting is just how wide it is. For FP32 operations it’s only 2-wide, but if we throw in the FP16 operations we’re technically looking at a 6-wide design. The odds of having FP16 and FP32 operations ready to co-issue in such a manner are far lower than the odds of having just a pair of FP32 instructions to co-issue, so in practice Rogue is very unlikely to achieve 100% utilization of a pipeline.
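One rough way to see why is to count issue slots. The sketch below assumes a Series 6XT-style pipeline with 2 FP32 and 4 FP16 ALUs (the 6-wide case above); the instruction counts fed into it are made-up examples.

```python
# Issue-slot utilization for one pipeline, assuming 2 FP32 + 4 FP16 ALUs
# (the "6-wide" Series 6XT case discussed above). Inputs are the number
# of independent instructions of each type available in a given cycle.
FP32_SLOTS = 2
FP16_SLOTS = 4

def slot_utilization(independent_fp32: int, independent_fp16: int) -> float:
    used = min(independent_fp32, FP32_SLOTS) + min(independent_fp16, FP16_SLOTS)
    return used / (FP32_SLOTS + FP16_SLOTS)

print(slot_utilization(2, 0))  # pure FP32 code, both FP32 ALUs busy: ~0.33
print(slot_utilization(0, 4))  # pure FP16 code, all 4 FP16 ALUs busy: ~0.67
print(slot_utilization(2, 4))  # the rare ideal FP32+FP16 mix: 1.0
```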

That said, the split between FP16 and FP32 units makes it clear that Imagination expects to be using one or the other most of the time rather than both, so as far as the design goes this is not unexpected. For FP32 instructions it’s a simpler 2-wide setup, while FP16 instructions are going to be trickier, as full utilization of the FP16 units requires a 4-instruction setup (say, 4 MADs following one another). The fact that Series 6XT has 4 FP16 units despite that is interesting, as it implies that the extra die space was worth it compared to the Series 6 setup of 2 FP16 units.

With that out of the way, let’s talk about how work is dispatched to the pipelines within a USC. Each pipeline works on one thread at a time, the same as any other modern GPU architecture. Consequently we’d expect the wavefront size to be 16 threads.

However there’s an interesting fact we found out about the USCs: they don’t run at the same clockspeed throughout. The ALUs themselves run at the published clockspeed for the GPU, but the frontends that feed them – the decoders and operand collectors – do not. Imagination has not specified what rate they run at, but the only thing that makes sense is ½ the rate of the ALUs. So a 300MHz USC would have its decoder frontend running at 150MHz, and so on.


An example of a wavefront executing. Instructions per thread not to scale

Consequently we believe that the size of a wavefront is not 16 threads, but rather 32 threads, executed over 2 cycles of the ALUs. This is not the first time we’ve seen this design – NVIDIA did something similar for their now-retired Fermi architecture – but it isn’t something we were expecting to see again. With the idiosyncrasies of the SoC space, however, this is apparently something that still makes sense. Imagination did tell us that there are tangible power savings from doing this, and since SoC GPUs are power limited in most cases anyhow, this is essentially the higher performance option. Go faster by going slower.
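Here’s a minimal sketch of that dispatch pattern under our assumption of a half-rate frontend: 32 threads spread across 16 pipelines over two ALU cycles, which is why the decoder only needs to issue a new instruction every other ALU cycle.

```python
# Sketch of the wavefront execution described above: 32 threads spread
# across 16 pipelines over two ALU cycles, with the decoder (assumed to
# run at half the ALU clock) issuing one instruction per wavefront per
# decode cycle.
PIPELINES_PER_USC = 16
WAVEFRONT_SIZE = 32

def wavefront_schedule():
    """Yield (alu_cycle, pipeline, thread) for one instruction of one wavefront."""
    for thread in range(WAVEFRONT_SIZE):
        yield thread // PIPELINES_PER_USC, thread % PIPELINES_PER_USC, thread

# ALU cycle 0 runs threads 0-15; ALU cycle 1 runs threads 16-31.
schedule = list(wavefront_schedule())
assert schedule[0] == (0, 0, 0) and schedule[-1] == (1, 15, 31)

# For a 300 MHz USC, the assumed half-rate frontend decodes at 150 MHz.
alu_clock_hz = 300e6
frontend_clock_hz = alu_clock_hz / 2
```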

Finally, this brings us to the highest level, the USC array. Each USC in an array receives its own wavefronts to work on, so the number of wavefronts actively being executed will be identical to the number of USCs in a design. For a high-end 6 USC design we’d be looking at 6 wavefronts in flight, whereas for a smaller 2 USC design it would be just 2 wavefronts.
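Tying those numbers together, the following sketch shows how the amount of work in flight scales with the size of the USC array, using the 16-pipeline USC and 32-thread wavefront figures from above.

```python
# Occupancy of the USC array: one wavefront in active execution per USC,
# using the 16-pipeline / 32-thread figures established above.
PIPELINES_PER_USC = 16
WAVEFRONT_SIZE = 32

def array_occupancy(num_uscs: int) -> dict:
    return {
        "active_wavefronts": num_uscs,                   # one per USC
        "threads_in_flight": num_uscs * WAVEFRONT_SIZE,  # 32 threads each
        "threads_per_alu_clock": num_uscs * PIPELINES_PER_USC,
    }

print(array_occupancy(2))  # small 2 USC design
print(array_occupancy(6))  # high-end 6 USC design
```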

Comments

  • MrPoletski - Sunday, March 9, 2014

    The reason PowerVR left the desktop market was simply that ST Microelectronics sold their graphics division. The KYRO was an extremely successful card and would have continued to be. IIRC VIA tried to buy it up and carry on selling the KYRO 3, but could not reach a licence deal with STMicro, who claimed copyright on the chip design (the non-PowerVR parts).
  • Scali - Monday, February 24, 2014

    They could... But Imagination is just like ARM: they don't build GPUs themselves, they only license the designs.
    It has been possible for years to license a PowerVR design and scale it up to an interesting desktop GPU. It's just that so far, no company has done that. Probably too big a risk to take, trying to compete with giants such as nVidia and AMD.
    The last desktop PowerVR cards mainly failed because of poor software support. Aside from the drivers not being all that mature, there was also the problem that many games made assumptions that simply would not hold on a TBDR architecture, and rendering bugs were the result.
    If you were to build a PowerVR-based desktop solution today, chances are that you'd run into quite a lot of incompatibilities with existing games.
  • iwod - Monday, February 24, 2014

    I didn't want the word Apple in it, to refrain from trolls and flame wars, so I didn't write it out clearly the first time.
    The sole reason why PowerVR failed in the first place was their drivers, and the same reason is why most other GPU companies failed as well, much like S3. Drivers in the GPU market mean literally everything. It doesn't matter if a GPU is insanely great; if it doesn't run any of the latest games, and throws error upon error, it simply won't sell. Unlike CPUs, where you actually program down to the metal.

    Nvidia famously pointed out they have many more software engineers than hardware engineers. Writing decent performing drivers takes time and money, hence why not many GPU manufacturers survive. Most of them don't have enough resources to scale. Same goes for PowerVR. I still remember the Kyro graphics card I loved, until it didn't work on the games I wanted to play.

    But this time it is different. The mobile market has already exceeded the PC market and will likely exceed the total GPUs shipped in PCs + consoles combined! And the drivers you write for mobile iOS can in many cases effectively be used on Mac OS X as well. That is why using PowerVR on the Mac makes an appealing case.

    Maybe the industry leaders view tablets / mobile phones + consoles as the next trend, while PC & Mac simply relinquish gaming?
  • Scali - Tuesday, February 25, 2014

    "The sole reason why PowerVR failed in the first place were their Drivers"

    As I said, it was not necessarily the drivers themselves. A nice example is 3DMark2001. Some scenes did not work correctly because of illegal assumptions about z-buffer contents. When 3DMark2001SE was released, one of the changes was that it now worked correctly on Kyro cards.

    It is unclear where PowerVR stands today, since both their hardware and the 3D APIs and engines have changed massively. The only thing we know for sure is that there are various engines and games that work correctly on iPhone/iPad.
  • Sushisamurai - Monday, February 24, 2014

    Typo: "one thread per shader care, which like the shader cores are grouped together into what we call wavefronts." Should be shader core?

    In "Background: how GPU's work"
  • Ryan Smith - Monday, February 24, 2014

    Indeed it was. Thank you for pointing that out.
  • chinmaythosar - Monday, February 24, 2014

    I wonder how AAPL will handle the FP16 cores ... they are moving to 64-bit in CPUs and they would have hoped to move to FP64 in GPUs ... it would have given them a real talking point in the keynote for the iPad 6 (or whatever they call it) .. "next-gen 192 core GPU FP64 architecture .. 4x graphics power etc etc" .. :P
  • MrSpadge - Saturday, March 1, 2014

    Not sure what AAPL is, but pure FP64 for graphics would be horrible. You don't need the precision but would waste lots of die space and power.
  • xeizo - Monday, February 24, 2014

    They would be more competitive and interesting to use if they published open drivers instead of "open architecture" pics .... :(
  • Eckej - Monday, February 24, 2014

    Couple of small errors/typos:
    Under How Rogues Get Executed: Wavefronts & Superscalar ILP - the diagram should probably have 16, not 20 pipelines - looks like an extra row slipped in!

    The page before: " With Series 6, Imagination has an interesting setup where there FP16 ALUs can process up to 3 operations in one cycle." The word "there" should read "their".

    Bottom of page 2 "And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfired Shading Cluster." - Unfired should read Unified.

    Sorry to be picky.
