VLIW4: Finding the Balance Between TLP, ILP, and Everything Else

To properly frame why AMD went with a VLIW4 design we’d have to first explain why AMD went with a VLIW5 design. And to do that we’d have to go back even further to the days of DirectX 9, and thus that is where we will start.

Back in the days of yore, when shading was new and pixel and vertex shaders were still separate entities, AMD (née ATI) settled on a VLIW5 design for their vertex shaders. Based on their data this was deemed the ideal configuration for a vertex shader block, as it allowed them to process a 4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time.
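
To make that concrete, here is a minimal sketch of the sort of work a classic DX9 vertex shader performs, written as ordinary C rather than actual shader code (the variable names are purely illustrative): four independent multiply-adds for the dot product, plus one unrelated scalar operation that can ride along in the fifth slot.

```c
#include <stdio.h>

int main(void) {
    float pos[4] = {1.0f, 2.0f, 3.0f, 1.0f};   /* vertex x, y, z, w            */
    float row[4] = {0.5f, 0.5f, 0.5f, 1.0f};   /* one row of the transform     */
    float light  = 0.8f, intensity = 1.2f;

    /* Four independent multiply-adds -> the w/x/y/z slots of a VLIW5 bundle */
    float dot = pos[0]*row[0] + pos[1]*row[1]
              + pos[2]*row[2] + pos[3]*row[3];

    /* One unrelated scalar multiply -> the 5th (t) slot, issued the same clock */
    float shade = light * intensity;

    printf("dot = %f, shade = %f\n", dot, shade);
    return 0;
}
```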

Fast forward to 2007 and the introduction of AMD’s Radeon HD 2000 series (R600), where AMD introduced their first unified shader architecture for the PC. AMD went with a VLIW5 design once more: even though this was their first DX10 part, it still made sense to build something that could optimally handle DX9 vertex shaders. This was also well before GPGPU had a significant impact on the market, as AMD had at best toyed around with the idea late in the X1K series’ lifetime (and well after R600 was started).

Now let us jump to 2008, when Cayman was first being drawn up. GPGPU computing is still fairly new – NVIDIA is at the forefront of a market that only amounts to a few million dollars at best – and DX10 games are still relatively rare. With 2+ years to bring up a GPU, AMD has to be looking ahead at where things will be in 2010. Their predictions are that GPGPU computing will finally become important, and that DX9 games will fade in importance relative to DX10/11 games. It’s time to reevaluate VLIW5.

This brings us to the present day and the launch of Cayman. GPGPU computing is taking off, and DX10 & DX11 alongside Windows 7 are gaining momentum while DX9 is well past its peak. AMD’s own internal database of games tells them an interesting story: the average slot utilization is 3.4, meaning that on average a 5th streaming processor is going unused in games. VLIW5, which made so much sense for DX9 vertex shaders, is now becoming too wide, while scalar and narrow workloads are increasing in number. The stage is set for a narrower Streaming Processor Unit; enter VLIW4.

As you may recall from a number of our discussions on AMD’s core architecture, AMD’s architecture is heavily invested in Instruction Level Parallelism (ILP), that is, having instructions within a single thread that have no dependencies on each other and can therefore be executed in parallel. With VLIW5 the best case scenario is that 5 instructions can be scheduled together on every SPU every clock, a scenario that rarely happens. We’ve already touched on how in games AMD is seeing an average of 3.4 slots filled, which is actually pretty good, but still works out to only about 68% of the theoretical peak of 5. Ultimately extracting ILP from a workload is hard, leading to a wide delta between the best and worst case scenarios.
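
To make the ILP problem concrete, here is a minimal C sketch (ordinary CPU code standing in for shader code, with made-up function names): the first function is a serial dependency chain where each operation must wait for the one before it, so only one of an SPU’s slots can be filled per clock, while the second exposes four independent operations that a VLIW compiler could pack into a single bundle.

```c
/* Serial dependency chain: each line needs the previous result, so a
 * VLIW5 SPU can fill only one of its five slots per clock here. */
float dependent_chain(float a, float b) {
    float t = a * b;   /* clock 1 */
    t = t + a;         /* must wait for t: clock 2 */
    t = t * t;         /* must wait again: clock 3 */
    return t;
}

/* Four independent operations: no result feeds another, so a compiler
 * could bundle all four into the x/y/z/w slots of one VLIW issue. */
float independent_ops(float a, float b, float c, float d) {
    float r0 = a * b;
    float r1 = c * d;
    float r2 = a + c;
    float r3 = b + d;
    return r0 + r1 + r2 + r3;
}
```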

Meanwhile all of this is in stark contrast to Thread Level Parallelism (TLP), which looks for threads that can be run at the same time without having any interdependencies. This is where NVIDIA has focused their energies at the high-end, as GF100/GF110 are both scalar architectures that rely on TLP to achieve efficient operation.
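
For contrast, a rough sketch of the TLP view, again in plain C with an illustrative function name: every iteration below is an independent work item, so a scalar, TLP-oriented design stays busy by interleaving work items whenever one of them stalls, rather than by hunting for parallelism inside a single item.

```c
/* Each iteration is an independent "thread" of work. Inside one item the
 * operations form a dependency chain, but no item depends on any other,
 * which is exactly the parallelism a TLP-oriented GPU exploits. */
void shade_pixels(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i) {
        float t = in[i] * 0.5f;   /* dependent chain within one item */
        t = t + in[i];
        out[i] = t * t;           /* ...but every i is independent    */
    }
}
```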

Ultimately the realization is that AMD’s VLIW5 architecture is not the best architecture going forward. Up until now it has made sense as a high-efficiency, gaming-oriented design, and even today in a gaming part like the 6800 series it’s still a reasonable choice. But AMD needs a new architecture for the future, not only as something that’s going to better fit their 3.4 slot average, but also as something that is better designed for compute workloads. AMD’s choice is an overhauled version of their existing architecture. Overall it’s built on a solid foundation, but VLIW5 is too wide to meet their future goals.

The solution is to shrink their VLIW5 SPU to a VLIW4 SPU. Specifically, the solution is to remove the t-unit, the architecture’s 5th and largest SP, which is capable of regular INT/FP operations as well as being responsible for transcendental operations. In the case of regular INT/FP operations this means an SPU is reduced from being able to process 5 operations at once to 4. In the case of transcendentals, meanwhile, an SPU now ties together 3 of its SPs to process 1 transcendental in the same period of time. This represents a much more severe reduction in theoretical performance, as an SPU can only process 1 transcendental + 1 INT/FP operation per clock, as opposed to 1 transcendental + 4 INT/FP operations (or other combinations) on VLIW5.
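
A rough sketch of what that means in practice, written as plain C rather than real shader code (sinf standing in for any transcendental): on VLIW5 the transcendental goes to the t-unit while the four multiply-adds fill the w/x/y/z slots, so everything below could issue together, whereas on VLIW4 the transcendental occupies three of the four SPs, leaving room for only one of the multiply-adds alongside it each clock.

```c
#include <math.h>

float mixed_work(float a, float b, float c, float d, float x) {
    float s  = sinf(x);      /* transcendental: t-unit on VLIW5, 3 SPs on VLIW4 */
    float m0 = a * b + c;    /* four independent multiply-adds                  */
    float m1 = b * c + d;
    float m2 = c * d + a;
    float m3 = d * a + b;
    return s + m0 + m1 + m2 + m3;
}
```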

There are a number of advantages to this change. As far as compute is concerned, the biggest advantage is that much of the space previously allocated to the t-unit can now be scrounged up to build more SIMDs. Cypress had 20 SIMDs while Cayman has 24; on average Cayman’s shader block is 10% more efficient per mm² than Cypress’s, taking into account the fact that Cayman’s SPs are a bit larger than Cypress’s in order to pick up the workload the t-unit would have handled. The SIMD count is in turn tied to a number of attributes: the number of texture units, the number of threads that can be in flight at once, and the number of FP64 operations that can be completed per clock. The latter is particularly important for AMD’s compute efforts, as they can now retire FP64 FMA/MUL operations at 1/4th their FP32 rate – in the case of a full Cayman, up to 384 per clock. Technically speaking they’re no faster per SPU, but with this layout change they have more SPUs to work with, improving their overall performance.


Fewer SPs per SIMD = More Space For More SIMDs
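
As a quick sanity check on those figures, a trivial C calculation is below. The 16 VLIW units (SPUs) per SIMD is an assumption carried over from the rest of the architecture rather than something stated in this section, but it is unchanged from Cypress.

```c
#include <stdio.h>

int main(void) {
    /* Full Cayman: 24 SIMDs x 16 SPUs x 4 SPs = 1536 stream processors */
    int simds = 24, spus_per_simd = 16, sps_per_spu = 4;
    int sps = simds * spus_per_simd * sps_per_spu;

    /* FP64 FMA/MUL at 1/4 the FP32 rate -> 384 results per clock */
    int fp64_per_clock = sps / 4;

    printf("Cayman SPs: %d, FP64 FMA/MUL per clock: %d\n", sps, fp64_per_clock);
    return 0;
}
```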

There are even ancillary benefits within the individual SPUs. While the SP count changed, the register file did not, leading to less pressure on each SPU’s registers now that only 4 SPs vie for register space. Even scheduling is easier: there are fewer SPs to schedule, and because they’re all alike the scheduler no longer has to take into consideration the difference between the w/x/y/z units and the t-unit.

Meanwhile in terms of gaming the benefits are similar. Games that were already failing to fully utilize the VLIW5 design now have additional SIMDs to take advantage of, and as rendering is still an embarrassingly parallel operation as far as threading is concerned, it’s very easy to further divide the rendering workload into more threads to take advantage of this change. The extra SIMDs also mean that Cayman has additional texturing horsepower over Cypress, and the overall compute:texture ratio has been reduced – a beneficial situation for any games that are texture/filtering bound more than they’re compute bound.

Of course any architectural change involves tradeoffs, so it’s not a pure improvement. For gaming the tradeoff is that Cayman isn’t going to be well suited to VLIW5-style vertex shaders; generally speaking, games built around such shaders already run incredibly fast, and to the extent they’re even GPU-bound in the first place, they’re not going to gain much from Cayman. The other big tradeoff is when transcendental operations are paired with vector operations, as Cypress could handle both in one clock while Cayman will take two. It’s AMD’s belief that these operations are rare enough that the loss of performance in this one situation is worth it for the gain in performance everywhere else.

It’s worth noting that AMD still considers VLIW4 to be a risky/experimental design, or at least this is their rationale for going with it first on Cayman while sticking to VLIW5 elsewhere. At this point we’d imagine the real experiment to already be over, as AMD would already be well in the middle of designing Cayman’s 28nm successor, so they undoubtedly know if they’ll be using VLIW4 in the future.

Finally, the switch to a new VLIW architecture means the AMD driver team has to do some relearning. While VLIW4 is quite similar to VLIW5, it’s not by any means identical, which is both good and bad for performance purposes. The bad news is that many of AMD’s VLIW5-centric shader compiler tricks are no longer valid, so at the start shader compiler performance is going to be worse while AMD learns how to best program a VLIW4 design. The good news is that as they do learn, there’s the potential for sizable performance increases throughout the lifetime of the 6900 series. That doesn’t mean they’re guaranteed, but we certainly expect at least some improvement in shader performance as the months wear on.

On that note, these VLIW changes do mean that some code is going to have to be rewritten to better deal with the reduction in VLIW width. AMD’s shader compiler goes through a number of steps to try to optimize code, but if kernels were written specifically to organize instructions to flow through AMD’s shaders in a 5-wide fashion, then there’s only so much AMD’s compiler can do. Of course code doesn’t have to be written that way, but until now it was the best way to maximize ILP and hence shader performance.
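
A hypothetical illustration of the problem, again in plain C rather than an actual GPU kernel (the function names and the simple scale operation are made up for the example): a loop hand-unrolled by a factor of five exposes exactly five independent operations per iteration, a perfect fit for VLIW5 bundles, but on VLIW4 those five operations straddle two bundles with three slots idle in the second. Re-unrolling by four (or a multiple of four) restores the fit – something only the programmer, not the compiler, can do if the factor of five is baked into the code or data layout.

```c
/* Unrolled by 5: five independent multiplies per iteration, ideal for VLIW5
 * but an awkward fit for VLIW4 (one full bundle plus a mostly-empty one). */
void scale_unrolled_by_5(float *data, int n, float k) {
    for (int i = 0; i + 5 <= n; i += 5) {
        data[i+0] *= k;  data[i+1] *= k;  data[i+2] *= k;
        data[i+3] *= k;  data[i+4] *= k;
    }
}

/* Unrolled by 4: the same work re-grouped to match VLIW4's four slots. */
void scale_unrolled_by_4(float *data, int n, float k) {
    for (int i = 0; i + 4 <= n; i += 4) {
        data[i+0] *= k;  data[i+1] *= k;
        data[i+2] *= k;  data[i+3] *= k;
    }
}
```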

VLIW5 (Cypress SPU, per clock):

  • 4 32-bit FP MAD
  • Or 2 64-bit FP MUL or ADD
  • Or 1 64-bit FP MAD
  • Or 4 24-bit INT MUL or ADD
  • Plus 1 transcendental or 1 32-bit FP MAD

VLIW4 (Cayman SPU, per clock):

  • 4 32-bit FP MAD/MUL/ADD
  • Or 2 64-bit FP ADD
  • Or 1 64-bit FP MAD/FMA/MUL
  • Or 4 24-bit INT MAD/MUL/ADD
  • Or 4 32-bit INT ADD/Bitwise
  • Or 1 32-bit INT MAD/MUL
  • Or 1 64-bit INT ADD
  • Or 1 transcendental plus 1 32-bit FP MAD

Comments

  • AnnonymousCoward - Wednesday, December 15, 2010 - link

    First of all, 30fps is choppy as hell in a non-RTS game. ~40fps is a bare minimum, and >60fps all the time is hugely preferred since then you can also use vsync to eliminate tearing.

    Now back to my point. Your counter was "you know that non-AA will be higher than AA, so why measure it?" Is that a point? Different cards will scale differently, and seeing 2560+AA doesn't tell us the performance landscape at real-world usage which is 2560 no-AA.
  • Dug - Wednesday, December 15, 2010 - link

    Is it me, or are the graphs confusing?
    Some leave out cards on certain resolutions, but add some in others.

    It would be nice to have a dynamic graph link so we can make our own comparisons.
    Or a drop down to limit just ati, single card, etc.

    Either that or make a graph that has the cards tested at all the resolutions so there is the same number of cards in each graph.
  • benjwp - Wednesday, December 15, 2010 - link

    Hi,

    You keep using Wolfenstein as an OpenGL benchmark. But it is not. The single player portion uses Direct3D9. You can check this by checking which DLLs it loads or which functions it imports or many other ways (for example most of the idTech4 renderer debug commands no longer work).

    The multiplayer component does use OpenGL though.

    Your best bet for an OpenGL gaming benchmark is probably Enemy Territory Quake Wars.
  • Ryan Smith - Wednesday, December 15, 2010 - link

    We use WolfMP, not WolfSP (you can't record or playback timedemos in SP).
  • 7Enigma - Wednesday, December 15, 2010 - link

    Hi Ryan,

    What benchmark do you use for the noise testing? Is it Crysis or Furmark? Along the same line of questioning, I do not think you can use Furmark the way you have the graphs set up, because it looks like you have left PowerTune on (which will throttle the power consumption) while using numbers from NVIDIA's cards where you have faked the drivers into not throttling. I understand one is a program cheat and another a TDP limitation, but it seems a bit wrong to not compare them in the unmodified position (or VERBALLY mention this had no bearing on the test and they should not be compared).

    Overall nice review, but the new cards are pretty underwhelming IMO.
  • Ryan Smith - Thursday, December 16, 2010 - link

    Hi 7Enigma;

    For noise testing it's FurMark. As is the case with the rest of our power/temp/noise benchmarks, we want to establish the worst case scenario for these products and compare them along those lines. So the noise results you see are derived from the same tests we do for temperatures and power draw.

    And yes, we did leave PowerTune at its default settings. How we test power/temp/noise is one of the things PowerTune made us reevaluate. Our decision is that we'll continue to use whatever method generates the worst case scenario for that card at default settings. For NVIDIA's GTX 500 series, this means disabling OCP because NVIDIA only clamps FurMark/OCCT, and to a level below most games at that. Other games like Program X that we used in the initial GTX 580 article clearly establish that power/temp/noise can and do get much worse than what Crysis or clamped FurMark will show you.

    As for the AMD cards the situation is much more straightforward: PowerTune clamps everything blindly. We still use FurMark because it generates the highest load we can find (even with it being reduced by over 200MHz), however because PowerTune clamps everything, our FurMark results are the worst case scenario for that card. Absolutely nothing will generate a significantly higher load - PowerTune won't allow it. So we consider it accurate for the purposes of establishing the worst case scenario for noise.

    In the long run this means that results will come down as newer cards implement this kind of technology, but then that's the advantage of such technology: there's no way to make the card louder without playing with the card's settings. For the next iteration of the benchmark suite we will likely implement a game-based noise test, even though technologies like PowerTune are reducing the dynamic range.

    In conclusion: we use FurMark, we will disable any TDP limiting technology that discriminates based on the program type or is based on a known program list, and we will allow any TDP limiting technology that blindly establishes a firm TDP cap for all programs and games.

    -Thanks
    Ryan Smith
  • 7Enigma - Friday, December 17, 2010 - link

    Thanks for the response Ryan! I expected it to be lost in the slew of other posts. I highly recommend (as you mentioned in your second to last paragraph) that a game-based benchmark is used along with the Furmark for power/noise. Until both adopt the same TDP limitation it's going to put the NVIDIA cards in a bad light when comparisons are made. This could be seen as a legitimate beef for the fanboys/trolls, and we all know the less ammunition they have the better. :)

    Also to prevent future confusion it would be nice to have what program you are using for the power draw/noise/heat IN the graph title itself. Just something as simple as "GPU Temperature (Furmark-Load)" would make it instantly understandable.

    Thanks again for the very detailed review (on 1 week nonetheless!)
  • Hrel - Wednesday, December 15, 2010 - link

    I really hope these architecture changes lead to better minimum FPS results. AMD is ALWAYS behind Nvidia on minimum FPS, and in many ways that's the most important measurement since min FPS determines if the game is playable or not. I don't care if it maxes out at 122 FPS; if when the shit hits the fan I get 15 FPS, I won't be able to accurately hit anything.
  • Soldier1969 - Wednesday, December 15, 2010 - link

    I'm disappointed in the 6970, it's not what I was expecting over my 5870. I will wait to see what the 6990 brings to the table next month. I'm looking for a 30-40% boost from my 5870 at the 2560 x 1600 res I game at.
  • stangflyer - Wednesday, December 15, 2010 - link

    Now that we see the power requirements for the 6970, and that it needs more power than the 5870, how would they make a 6990 without really cutting off the performance like the 5970?

    I had a 5970 for a year b4 selling it 3 weeks ago in preparation of getting 570 in sli or 6990.
    It would obviously have to be 2x8 pin power! Or they would have to really use that powertune feature.

    I liked my 5970 as I didn't have the stuttering issues (or I don't notice them), and I actually have no issues with Eyefinity as I have matching Dell monitors with native DP inputs.

    If I was only on one screen I would not even be thinking upgrade but the vram runs out when using aa or keeping settings high as I play at 5040x1050. That is the only reason I am a little shy of getting the 570 in sli.

    Don't see how they can make a 6990 without really killing the performance of it.

    I used my 5970 at 5870 and beyond speeds on games all the time though.
