How Rogues Get Executed: Wavefronts & Superscalar ILP

Now that we’ve seen the basic makeup of a single Rogue pipeline, let’s expand our view to the wider USC.

A single Rogue USC comprises 16 pipelines, making the design a 16-wide array. This, along with a texture unit, makes up one “cluster” when we’re talking about a multi-cluster (multiple USC) Rogue setup. In a setup with multiple USCs, each texture unit is shared between a pair of USCs.

We don’t have a great deal of information on the texture units themselves, but we do know that a Rogue texture unit can fetch 4 32-bit bilinear texels per clock. So for a top-end 6 USC part with 3 texture units, we’d be looking at a texture rate of 12 texels per clock.
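
As a quick sanity check on that arithmetic, here is a minimal sketch (assuming one texture unit per pair of USCs and 4 bilinear texels per unit per clock, as described above; the helper name is ours):

```c
#include <stdio.h>

/* Peak bilinear texel rate for a hypothetical Rogue configuration, assuming
 * one texture unit shared by each pair of USCs and 4 texels/clock per unit. */
static int peak_texels_per_clock(int num_uscs)
{
    int texture_units = num_uscs / 2;
    return texture_units * 4;
}

int main(void)
{
    printf("6 USC part: %d texels/clock\n", peak_texels_per_clock(6)); /* 12 */
    printf("2 USC part: %d texels/clock\n", peak_texels_per_clock(2)); /* 4 */
    return 0;
}
```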

Now by PC standards the Rogue pipeline/USC setup is a bit unusual due to its width. Both AMD’s and NVIDIA’s architectures are fairly narrow at this level, possessing just a small number of ALUs per shader core/pipeline. The consequence of Rogue having multiple ALUs per pipeline is that some degree of instruction level parallelism (ILP) has to be extracted from each thread in order to feed as many of those ALUs as possible. Extracting ILP in turn requires that a thread contain instructions with no dependencies on each other, so that they can be executed in parallel. Many (but not all) instructions qualify, so it’s worth noting that the efficiency of a USC is going to depend in part on the mix of instructions in a thread. A design that issues multiple instructions from a single thread in this manner is called superscalar.
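
To illustrate the difference ILP makes, here is a minimal sketch in plain C standing in for shader code (the ALU assignments in the comments are our illustration of the principle, not documented compiler or scheduler behaviour):

```c
/* Two independent multiply-adds: with no data dependency between them, a
 * pipeline with two FP32 ALUs could in principle co-issue them in one cycle. */
float ilp_friendly(float a, float b, float c, float d, float e, float f)
{
    float x = a * b + c;    /* could go to FP32 ALU 0 */
    float y = d * e + f;    /* independent of x, could go to FP32 ALU 1 */
    return x + y;
}

/* A dependent chain: each operation needs the previous result, so only one
 * ALU can be kept busy and the second issue slot goes unused. */
float ilp_hostile(float a, float b, float c)
{
    float x = a * b + c;
    float y = x * a + b;    /* depends on x */
    return y * c + x;       /* depends on y */
}
```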

For the sake of comparison, AMD’s Graphics Core Next is not a superscalar design at all, while NVIDIA’s Kepler is superscalar in a similar manner. NVIDIA’s CUDA cores only have 1 FP32 ALU each, but additional banks of CUDA cores can be co-issued further instructions from the same thread, conditions permitting. So Rogue has a similar reliance on ILP within a thread, needing it to achieve maximum efficiency.

What makes Rogue all the more interesting is just how wide it is. For FP32 operations it’s only 2-wide, but if we throw in the FP16 operations we’re technically looking at a 6-wide design. Having FP16 and FP32 operations ready to co-issue in such a manner is far rarer than having just a pair of FP32 instructions to co-issue, so again Rogue is technically very unlikely to achieve 100% utilization of a pipeline.

That said, the split between FP16 and FP32 units makes it clear that Imagination expects to be using one or the other most of the time rather than both, so as far as the design goes this is not unexpected. For FP32 instructions it’s a simpler 2-wide setup, while FP16 instructions are going to be trickier, as full utilization of the FP16 units requires 4 independent instructions ready to issue (say 4 MADs back to back). The fact that Series 6XT has 4 FP16 units despite that is interesting, as it implies that the extra die space was considered worthwhile compared to the Series 6 setup of 2 FP16 units.
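
To make that utilization requirement concrete, here is another hedged sketch in plain C (in a real OpenGL ES shader these would be mediump/FP16 values; the slot assignments in the comments are again our illustration):

```c
/* Four independent multiply-adds: in principle enough independent FP16 work
 * to fill all four FP16 units of a Series 6XT pipeline in a single issue. */
float fp16_friendly(float a, float b, float c, float d)
{
    float r0 = a * b + c;   /* FP16 unit 0 */
    float r1 = b * c + d;   /* FP16 unit 1 */
    float r2 = c * d + a;   /* FP16 unit 2 */
    float r3 = d * a + b;   /* FP16 unit 3 */
    return r0 + r1 + r2 + r3;
}

/* The same four operations as a dependent chain: only one FP16 unit can be
 * kept busy at a time, leaving the other three idle. */
float fp16_hostile(float a, float b, float c, float d)
{
    float r = a * b + c;
    r = r * b + d;
    r = r * c + a;
    return r * d + b;
}
```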

With that out of the way, let’s talk about how work is dispatched to the pipelines within a USC. Each pipeline works on one thread at a time, the same as any other modern GPU architecture. Consequently we’d expect the wavefront size to be 16 threads.

However there’s an interesting fact that we found out about the USCs, and that is that they don’t run at the same clockspeed throughout. The ALUs themselves run at the published clockspeed for the GPU, but the frontends that feed them (the decoders and operand collectors) do not. Imagination has not specified what rate they run at, but the only thing that makes sense is ½ the rate of the ALUs. So a 300MHz USC would have its decoder frontend running at 150MHz, and so on.


An example of a wavefront executing. Instructions per thread not to scale

Consequently we believe that the size of a wavefront is not 16 threads, but rather 32 threads, executed over 2 cycles of the ALUs. This is not the first time we’ve seen this design – NVIDIA did something similar for their retired Fermi architecture – but this isn’t something we were expecting to see again. But with the idiosyncrasies of the SoC space, this is apparently something that still makes sense. Imagination did tell us that there are tangible power savings from doing this, and since SoC GPUs are power limited in most cases anyhow, this is essentially the higher performance option. Go faster by going slower.
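
Here is a minimal sketch of how a 32-thread wavefront could map onto 16 pipelines over two ALU cycles, assuming a simple linear split across the two cycles (the exact interleave is our guess, not something Imagination has documented):

```c
#include <stdio.h>

#define PIPELINES_PER_USC 16
#define WAVEFRONT_SIZE    32  /* 16 pipelines x 2 ALU cycles per decoded instruction */

int main(void)
{
    /* Walk one wavefront and show which pipeline and which ALU cycle each
     * thread would land on under the assumed split. */
    for (int thread = 0; thread < WAVEFRONT_SIZE; thread++) {
        int cycle    = thread / PIPELINES_PER_USC;  /* 0 or 1 */
        int pipeline = thread % PIPELINES_PER_USC;
        printf("thread %2d -> pipeline %2d, ALU cycle %d\n", thread, pipeline, cycle);
    }
    return 0;
}
```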

Finally, this brings us to the highest level, the USC array. Each USC in an array receives its own wavefronts to work on, so the number of wavefronts actively being executed will be identical to the number of USCs in a design. For a high-end 6 module design, we’d be looking at 6 wavefronts, whereas for a smaller 2 module design it would be just 2 wavefronts.
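
Putting those numbers together, here is a short sketch of what that implies for work in flight at the array level (assuming the 32-thread wavefront size argued for above; the summary function is ours):

```c
#include <stdio.h>

#define WAVEFRONT_SIZE 32  /* assumed: 16 pipelines x 2 ALU cycles */

/* Hypothetical array-level summary: one wavefront actively executing per USC. */
static void describe_design(int num_uscs)
{
    int active_wavefronts = num_uscs;
    int active_threads    = num_uscs * WAVEFRONT_SIZE;
    printf("%d USCs: %d active wavefronts, %d threads\n",
           num_uscs, active_wavefronts, active_threads);
}

int main(void)
{
    describe_design(6);  /* high-end design */
    describe_design(2);  /* smaller design */
    return 0;
}
```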

Comments

  • Scali - Monday, February 24, 2014

    "the alternative to DirectX is OpenGL. Is it more suited for TBDR architecture than "brute force" ones?"

    In theory, D3D used to be more suited to TBDR than OpenGL, because it had explicit BeginScene() and EndScene() markers. But those have been dropped after D3D9.
    I can't really think of anything off the top of my head that would make one API more suited than the other these days. They're both very similar: just bind your textures/buffers/shaders, update the constants, and fire off your geometry.

    "How is PowerVR going to fight against an architecture as flexible as nvidia ones that can also be used for CUDA computing and thus being adopted into markets (and for other tasks) PowerVR cannot with their current architecture?"

    PowerVR supports OpenCL, so they too can do the GPGPU-game. It all depends on who delivers the best package in terms of features, performance, power usage, etc.
  • Jhwzz - Tuesday, February 25, 2014

    >Again, as someone already asked, tile based rendering was used on the desktop but was soon
    >abandoned as it could not give any real advantages over the raw power of other architectures
    >that grew much faster than what PowerVR could optimize their algorithms, making tile based
    >rendering less and less profitable. What makes that scenario different than what we are
    >witnessing in this period where mobile resolutions are growing to be even bigger than desktop
    >monitors and that game complexity is going to increase with the arrival of these really powerful
    >GPUs (K1 in primis)?

    It is a common misconception that PowerVR's desktop parts were not competitive or had compatibility problems. The last card they produced, the Kyro II, was actually very competitive with the offerings from both NVidia and ATi. The claims of incompatibility were largely unfounded marketing FUD from competitors, with later drivers running the majority of content without problems. Further, they did not leave the market because they could not compete on performance; instead their partner at the time, STM, decided to pull out of the market for unstated reasons, although this was most likely because they were not able to invest the amount of money needed to take the high ground.

    Mobile is VERY different to the desktop space. NV and ATi were able to brute force their way to the top as both power consumption and memory bandwidth had extremely wide envelopes; this is not the case in the mobile space. With Kyro II, PowerVR demonstrated that they were able to compete with considerably higher specification parts (for clock & memory BW) from both NV and AMD, with considerably lower memory BW and power requirements. Although NV and AMD have evolved, so have PowerVR, and as such there is no reason to assume that they don't still have advantages.

    >It seems PowerVR is behaving a bit like 3DFx did at the time, till it died. They were using their
    >advanced but old technology to the extreme, so they rendered at 16bit instead of 32, used 16bit
    >Zbuffer instead of 24 and many more "tricks" that were forced to try to hide what was quite
    >clear: 3DFX didn't have the right architecture to compete with new companies like nvidia
    >and ATI that started their story with the right step and much more powerful architectures
    >(TNT2 simply destroyed Voodoo3 under all points, and beware, I was an unfortunate Voodoo3 owner).

    This simply makes no sense in the context of the market these cores are being targeted at. At the fundamental level the primary API currently used in mobile is OGLES2.0, which does not mandate anything higher than FP16 within fragment shaders. This means that the vast majority of current mobile content only uses FP16 in fragment shaders; in these circumstances do you think it makes sense to throw area at higher precision paths? Of course it doesn’t! Further, it’s not like the PowerVR architecture looks like a slouch at FP32.

    At the end of the day the truth will only be seen in benchmarks on actual devices, not in marketing claims and FUD from various companies.
  • Jhwzz - Tuesday, February 25, 2014

    BTW you do realise that NV runs many of these benchmarks at 16 bit Z and 16 bit frame buffer, don't you? They do this because they become even less competitive when forced to use 32 bits. So who is actually using old "technology" to hide real deficiencies in their architectures?
  • MrSpadge - Saturday, March 1, 2014

    If you can't see a difference due to using 16 bit somewhere along the rendering path, then it's actually the smart thing to do. It saves power, which can be better used elsewhere (higher clocks). Well, that was actually 3DFX's argument for why Glide with 16 bit + dithering was the smart choice back then. But "they've got the bigger bits" won (along with other advantages of the early nVidias).
  • MrPoletski - Sunday, March 9, 2014

    It might be the smart thing to do, but when it's in a benchmark that's supposed to be at 32 bits then I call that cheating!
  • allanmac - Monday, February 24, 2014

    If a new GPU architecture "deep dive" doesn't include the number of registers per multiprocessor then it's bordering on worthless.

    Intel, AMD, and NVIDIA all publish these numbers, so the other mobile GPU vendors should as well.

    Please dig up these numbers so that we can begin to compare these next gen mobile GPUs.

    I suspect ARM, ImgTech and QCOM simply won't tell you... but you might be able to find the answer through a series of OpenCL tests.
  • boostern - Monday, February 24, 2014

    OpenCL tests on a GPU that isn't even in production? Before you say "test it on the iPhone 5S", there are no public OpenCL libraries available on iOS, as far as I know.
  • allanmac - Monday, February 24, 2014

    The first option is to ask the vendor.
  • boostern - Monday, February 24, 2014

    Yes, maybe :D
  • MrSpadge - Saturday, March 1, 2014

    While this is surely important, the raw number of registers doesn't tell you that much either. In fact, it could be very misleading. How many registers you need first and foremost depends on the out-of-order window (in CPUs) or, here, the number of threads in flight, which is something they didn't tell us either. Also, different cache sizes and latencies would determine how bad it is to run out of register space.
