VLIW4: Finding the Balance Between TLP, ILP, and Everything Else

To properly frame why AMD went with a VLIW4 design we’d have to first explain why AMD went with a VLIW5 design. And to do that we’d have to go back even further to the days of DirectX 9, and thus that is where we will start.

Back in the days of yore, when shading was new and pixel and vertex shaders were still separate entities, AMD (née ATI) settled on a VLIW5 design for their vertex shaders. Based on their data this was deemed the ideal configuration for a vertex shader block, as it allowed them to process a 4 component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time.

Fast forward to 2007 and the introduction of AMD’s Radeon HD 2000 series (R600), where AMD introduced their first unified architecture for the PC. AMD went with a VLIW5 design once more, as even though the product was their first DX10 product it still made sense to build something that could optimally handle DX9 vertex shaders. This was also well before GPGPU had a significant impact on the market, as AMD had at best toyed around with the idea late in the X1K series’ lifetime (and well after R600 was started).

Now let us jump to 2008, when Cayman’s predecessors were being drawn up. GPGPU computing is still fairly new – NVIDIA is at the forefront of a market that only amounts to a few million dollars at best – and DX10 games are still relatively rare. With 2+ years to bring up a GPU, AMD has to be looking forward at where things will be in 2010. Their predictions are that GPGPU computing will finally become important, and that DX9 games will fade in importance to DX10/11 games. It’s time to reevaluate VLIW5.

This brings us to the present day and the launch of Cayman. GPGPU computing is taking off, and DX10 & DX11 alongside Windows 7 are gaining momentum while DX9 is well past its peak. AMD’s own internal database of games tells them an interesting story: the average slot utilization is 3.4 out of 5, meaning that on average a 5th streaming processor is going unused in games. VLIW5, which made so much sense for DX9 vertex shaders, is now becoming too wide, while scalar and narrow workloads are increasing in number. The stage is set for a narrower Streaming Processor Unit; enter VLIW4.

As you may recall from a number of our discussions on AMD’s core architecture, AMD’s architecture is heavily invested in Instruction Level Parallelism (ILP): finding instructions within a single thread that have no dependencies on each other and can therefore be executed in parallel. With VLIW5 the best case scenario is that 5 instructions can be scheduled together on every SPU every clock, a scenario that rarely happens. We’ve already touched on how in games AMD is seeing an average of 3.4, which is actually pretty good but at 3.4 out of 5 slots is still only 68% efficient. Ultimately extracting ILP from a workload is hard, leading to a wide delta between the best and worst case scenarios.
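To put the 3.4 figure in perspective, here is a minimal Python sketch of the slot utilization arithmetic. The function name is ours, and the 4-wide number is a hypothetical upper bound rather than a measurement: a workload that averages 3.4 operations per bundle on a 5-wide SPU would not necessarily keep that average on a 4-wide one, since any bundle that filled all 5 slots has to be split.

```python
def slot_utilization(avg_ops_per_bundle: float, width: int) -> float:
    """Fraction of issue slots doing useful work per bundle."""
    if not 0 <= avg_ops_per_bundle <= width:
        raise ValueError("average must lie between 0 and the bundle width")
    return avg_ops_per_bundle / width


# AMD's quoted game average on a 5-wide (VLIW5) SPU.
vliw5 = slot_utilization(3.4, 5)

# Hypothetical upper bound: the same average on a 4-wide (VLIW4) SPU.
# This assumes the average survives the narrowing, which it may not.
vliw4_upper = slot_utilization(3.4, 4)

print(f"VLIW5: {vliw5:.0%}")                      # prints "VLIW5: 68%"
print(f"VLIW4 (upper bound): {vliw4_upper:.0%}")  # prints "VLIW4 (upper bound): 85%"
```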

Meanwhile all of this is in stark contrast to Thread Level Parallelism (TLP), which looks for threads that can be run at the same time without having any interdependencies. This is where NVIDIA has focused their energies at the high end, as GF100/GF110 are both scalar architectures that rely on TLP to achieve efficient operation.

Ultimately the realization is that AMD’s VLIW5 architecture is not the best architecture going forward. Up until now it has made sense as a high efficiency gaming-oriented design, and even today in a gaming part like the 6800 series it’s still a reasonable choice. But AMD needs a new architecture for the future, not only as something that’s going to better fit their 3.4 shader average, but something that is better designed for compute workloads. AMD’s choice is an overhauled version of their existing architecture. Overall it’s built on a solid foundation, but VLIW5 is too wide to meet their future goals.

The solution is to shrink their VLIW5 SPU to a VLIW4 SPU. Specifically, the solution is to remove the t-unit, the architecture’s 5th and largest SP, which is capable of regular INT/FP operations as well as being responsible for transcendental operations. In the case of regular INT/FP operations this means an SPU is reduced from being able to process 5 operations at once to 4. In the case of transcendentals, an SPU now ties together 3 SPs to process 1 transcendental in the same period of time. This is a much more severe reduction in theoretical performance, as an SPU can only process 1 transcendental + 1 INT/FP operation per clock as opposed to 1 transcendental + 4 INT/FP operations (or any variation).
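The issue rules described above can be turned into a rough best-case model. The Python sketch below is our own simplification, not AMD's scheduler: it assumes every operation is independent and packs perfectly, and the function names are hypothetical. It counts how many clocks a single SPU needs to issue a mix of transcendentals and regular INT/FP operations.

```python
from math import ceil


def clocks_vliw5(transcendentals: int, alu_ops: int) -> int:
    """Best-case clocks on a VLIW5 SPU.

    Per clock: the t-unit takes 1 transcendental or 1 regular op,
    and the other four SPs take 1 regular op each.
    """
    return max(transcendentals, ceil((transcendentals + alu_ops) / 5))


def clocks_vliw4(transcendentals: int, alu_ops: int) -> int:
    """Best-case clocks on a VLIW4 SPU.

    Per clock: either 4 regular ops, or 1 transcendental (which ties
    up 3 SPs) plus 1 regular op on the remaining SP.
    """
    leftover = max(0, alu_ops - transcendentals)
    return transcendentals + ceil(leftover / 4)


# One transcendental paired with a 4-component vector operation:
print(clocks_vliw5(1, 4))  # 1
print(clocks_vliw4(1, 4))  # 2

# Pure regular work, 20 independent operations:
print(clocks_vliw5(0, 20))  # 4
print(clocks_vliw4(0, 20))  # 5
```

Under these assumptions the transcendental-plus-vector case doubles in cost per SPU, while pure INT/FP work only loses the fifth slot.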

There are a number of advantages to this change. As far as compute is concerned, the biggest advantage is that much of the space previously allocated to the t-unit can now be scrounged up to build more SIMDs. Cypress had 20 SIMDs while Cayman has 24; on average Cayman’s shader block is 10% more efficient per mm² than Cypress’s, taking into account the fact that Cayman’s SPs are a bit larger than Cypress’s to pick up the workload the t-unit would handle. The SIMDs are further tied to a number of attributes: the number of texture units, the number of threads that can be in flight at once, and the number of FP64 operations that can be completed per clock. The latter is particularly important for AMD’s compute efforts, as they can now retire FP64 FMA/MUL operations at 1/4th their FP32 rate, in the case of a full Cayman up to 384/clock. Technically speaking they’re no faster per SPU, but with this layout change they have more SPUs to work with, improving their performance.


Fewer SPs per SIMD = More Space For More SIMDs
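The SIMD counts and the 384/clock FP64 figure can be cross-checked with a little arithmetic. The Python sketch below assumes 16 SPUs per SIMD, a figure that is not stated in this section but matches the published totals of 1600 SPs for Cypress and 1536 for Cayman; the helper name is ours.

```python
SPUS_PER_SIMD = 16  # assumption: not stated in the text above


def shader_config(simds: int, sps_per_spu: int) -> dict:
    """Derive SPU and SP totals from the SIMD count and the VLIW width."""
    spus = simds * SPUS_PER_SIMD
    return {"simds": simds, "spus": spus, "sps": spus * sps_per_spu}


cypress = shader_config(simds=20, sps_per_spu=5)  # VLIW5
cayman = shader_config(simds=24, sps_per_spu=4)   # VLIW4

# FP64 FMA/MUL at 1/4th the FP32 rate, i.e. one per SPU per clock.
cayman_fp64_per_clock = cayman["sps"] // 4

print(cypress)                # {'simds': 20, 'spus': 320, 'sps': 1600}
print(cayman)                 # {'simds': 24, 'spus': 384, 'sps': 1536}
print(cayman_fp64_per_clock)  # 384
```

Cayman ends up with slightly fewer SPs in total but 20% more SPUs, which is where the extra FP64 throughput comes from.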

There are even ancillary benefits within the individual SPUs. While the SP count changed the register file did not, leading to less pressure on each SPU’s registers as now only 4 SPs vie for register space. Even scheduling is easier as there are fewer SPs to schedule and the fact that they’re all alike means the scheduler no longer has to take into consideration the difference between the w/x/y/z units and the t-unit.

Meanwhile in terms of gaming the benefits are similar. Games that were already failing to fully utilize the VLIW5 design now have additional SIMDs to take advantage of, and as rendering is still an embarrassingly parallel operation as far as threading is concerned, it’s very easy to further divide the rendering workload into more threads to take advantage of this change. The extra SIMDs mean that Cayman has additional texturing horsepower over Cypress, and the overall compute:texture ratio has been reduced, a beneficial situation for any games that are texture/filtering bound more than they’re compute bound.

Of course any architectural change involves tradeoffs, so it’s not a pure improvement. For gaming the tradeoff is that Cayman isn’t going to be well suited to VLIW5-style vertex shaders; generally speaking games using such shaders already run incredibly fast, but if they’re even GPU-bound in the first place they’re not going to gain much from Cayman. The other big tradeoff is when transcendental operations are paired with vector operations, as Cypress could handle both in one clock while Cayman will take two. It’s AMD’s belief that these operations are rare enough that the loss of performance in this one situation is worth it for the gain in performance everywhere else.

It’s worth noting that AMD still considers VLIW4 to be a risky/experimental design, or at least this is their rationale for going with it first on Cayman while sticking to VLIW5 elsewhere. At this point we’d imagine the real experiment to already be over, as AMD would already be well in the middle of designing Cayman’s 28nm successor, so they undoubtedly know if they’ll be using VLIW4 in the future.

Finally, the switch to a new VLIW architecture means the AMD driver team has to do some relearning. While VLIW4 is quite similar to VLIW5 it’s not by any means identical, which is both good and bad for performance purposes. The bad news is that it means many of AMD’s VLIW5-centric shader compiler tricks are no longer valid; at the start shader compiler performance is going to be worse while AMD learns how to better program a VLIW4 design. The good news is that in time they’re going to learn how to better program a VLIW4 design, meaning there’s the potential for sizable performance increases throughout the lifetime of the 6900 series. That doesn’t mean they’re guaranteed, but we certainly expect at least some improvement in shader performance as the months wear on.

On that note these VLIW changes do mean that some code is going to have to be rewritten to better deal with the reduction of VLIW width. AMD’s shader compiler goes through a number of steps to try to optimize code, but if kernels were written specifically to organize instructions to go through AMD’s shaders in a 5-wide fashion, then there’s only so much AMD’s compiler can do. Of course code doesn’t have to be written that way, but it is the best way to maximize ILP and hence shader performance.
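To illustrate why code organized in a 5-wide fashion fits poorly on a 4-wide SPU, here is a toy Python sketch of a greedy, in-order bundle packer. It is not AMD's shader compiler, and the kernel shape (two stages of 5 independent operations, where the second stage needs every result of the first) is a made-up example.

```python
def pack_bundles(deps: dict, width: int) -> list:
    """Greedily pack ops into bundles of at most `width` slots.

    `deps` maps each op name to the set of ops it depends on, listed in
    program order. An op can issue only after all of its dependencies
    have issued in an earlier bundle.
    """
    done = set()
    remaining = list(deps)
    bundles = []
    while remaining:
        bundle = [op for op in remaining if deps[op] <= done][:width]
        if not bundle:
            raise ValueError("dependency cycle detected")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


# Hypothetical kernel hand-organized for VLIW5: two stages of 5 ops each.
stage_a = [f"a{i}" for i in range(5)]
stage_b = [f"b{i}" for i in range(5)]
deps = {op: set() for op in stage_a}
deps.update({op: set(stage_a) for op in stage_b})

for width in (5, 4):
    bundles = pack_bundles(deps, width)
    utilization = len(deps) / (len(bundles) * width)
    print(width, len(bundles), f"{utilization:.1%}")
# prints "5 2 100.0%" then "4 4 62.5%"
```

In this contrived case the 4-wide packing needs twice as many bundles, because each stage spills one operation into a nearly empty bundle. No reordering helps here; the code itself would have to be restructured around groups of 4.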

VLIW5:

  • 4 32-bit FP MAD
  • Or 2 64-bit FP MUL or ADD
  • Or 1 64-bit FP MAD
  • Or 4 24-bit Int MUL or ADD
  • Plus 1 transcendental or 1 32-bit FP MAD

VLIW4:

  • 4 32-bit FP MAD/MUL/ADD
  • Or 2 64-bit FP ADD
  • Or 1 64-bit FP MAD/FMA/MUL
  • Or 4 24-bit INT MAD/MUL/ADD
  • Or 4 32-bit INT ADD/Bitwise
  • Or 1 32-bit MAD/MUL
  • Or 1 64-bit ADD
  • Or 1 transcendental plus 1 32-bit FP MAD

168 Comments


  • AnnihilatorX - Thursday, December 16, 2010 - link

    I disagree with you rarson

    This is what sets Anandtech apart, it has quality over quantity.
    Anandtech is the ONLY review site which offers me comprehensive information on the architecture, with helpful notes on the expected future gaming performance. It mentions AMD intended the 69xx to run on 32nm, and made sacrifices. If you go to Guru3D's review, the editor in the conclusion stated that he doesn't know why the performance lacks the wow factor. Anandtech answered that question with the process node.

    If you want to read reviews only, go onto google and search for 6850 review, or go to DailyTech's daily recent hardware review post, you can find over 15 plain reviews. Even easier, just use the Quick Navigation menu or the Table of Contents on the freaking first page of the article. This laziness does not entice sympathy.
  • Quidam67 - Thursday, December 16, 2010 - link

    Rarson's comments may have been a little condescending in their tone, but I think the criticism was actually constructive in nature.

    You can argue the toss about whether the architecture should be in a separate article or not, but personally speaking, I actually would prefer it was broken out. I mean, for those who are interested, simply provide a hyper-link, that way everyone gets what they want.

    In my view, a review is a review and an analysis on architecture can complement that review but should not actually be a part of the review itself. A number of other sites follow this formula, and provide both, but don't merge them together as one super-article, and there are other benefits to this if you read on.

    The issue of spelling and grammar is trivial, but in fact could be symptomatic of a more serious problem, such as the sheer volume of work Ryan has to perform in the time-frame provided, and the level of QA being squeezed in with it. Given the nature of NDAs, perhaps it might take the pressure off if the review did come first, and the architecture second, so the time-pressures weren't quite so restrictive.

    Lastly, employing a professional proof-reader is hardly an insult to the original author. It's no different than being a software engineer (which I am) and being backed up by a team of quality test analysts. It certainly makes you sleep better when stuff goes into production. Why should Ryan shoulder all the responsibility?
  • silverblue - Thursday, December 16, 2010 - link

    I do hope you're joking. :) (can't tell at this early time)
  • Arnulf - Thursday, December 16, 2010 - link

    "... unlike Turbo which is a positive feedback mechanism."

    Turbo is a negative feedback mechanism. If it was a positive feedback mechanism (= a consequence of an action resulting in further action in same direction) the CPU would probably burn up almost instantly after Turbo triggered as its clock would increase indefinitely, ever more following each increase, the higher the temperature, the higher the frequency. This is not how Turbo works.

    Negative feedback mechanism is a result of an action resulting in reaction (= action in the opposite direction). In the case of CPUs and Turbo it's this reaction to temperature that keeps CPU frequency under control. The higher the temperature, the lower the frequency. This is how Turbo and PowerTune work.

    The fact that Turbo starts at lower frequency and ramps it up and that PowerTune starts at higher frequency and brings it down has no bearing on whether the mechanism of control is called "positive" or "negative" feedback.

    Considering your fondness for Wikipedia (as displayed by the reference in the article) you might want to check out these:

    http://en.wikipedia.org/wiki/Negative_feedback
    http://en.wikipedia.org/wiki/Positive_feedback

    and more specifically:

    http://en.wikipedia.org/wiki/Negative_feedback#Con...
  • Ryan Smith - Thursday, December 16, 2010 - link

    Hi Arnulf;

    Fundamentally you're right, so I won't knock you. I guess you could say I'm going for a very loose interpretation there. The point I'm trying to get across is that Turbo provides a performance floor, while PowerTune is a performance ceiling. People like getting extra performance for "free" more than they like "losing" performance. Hence one experience is positive and one is negative.

    I think in retrospect I should have used positive/negative reinforcement instead of feedback.
  • Soda - Thursday, December 16, 2010 - link

    Has anyone noticed the missing edge of the board's 8-pin power connector?

    Apparently AMD made a mistake in the reference design of the board and didn't calculate the space needed by the cooler.

    If you look closely on the power connector in http://images.anandtech.com/doci/4061/6970Open.jpg you'll notice the missing edge.

    For a full story on the matter you can go to http://www.hardwareonline.dk/nyheder.aspx?nid=1060...
    For the english speaking people I suggest the googlish version here http://translate.google.com/translate?hl=da&sl...

    There are some pictures to back up the claim of the mistake AMD made here.

    Though it hasn't been confirmed by AMD if this is only a mistake on the review boards or all cards of the 69xx series.
  • versesuvius - Thursday, December 16, 2010 - link

    I have a 3870, on a 17 inch monitor, and everything is fine as far as games go. The hard disk gets in the way sometimes, but that is just about it. All the games run fine. No problem at all. Oh, there's more: They run better on the lousy XBOX. Why the new GPU then? Giant monitors? Three of them? Six of them? (The most fun I had on Anandtech was looking at pictures of AT people trying to stabilize them on a wall). Oh, the "Compute GPU"? Wouldn't that fit on a small PCI card, and act like the old 486 coprocessor, for those who have some use for it? Or is it just a silly excuse for not doing much at all, or rather not giving much to the customers, and still charge the same? The "High End"! In an ideal world the prices of things go down, and more and more people can afford them. That lovely capitalist idea was turned on its head, sometime in the eighties of the last century, and instead the notion of value was reinvented. You get more value, for the same price. You still have to pay $400 for your graphics card, even though you do not need the "Compute GPU", and you do not need the aliased superduper antialiasing that nobody yet knows how to achieve in software. Can we have a cheap 4870? No that is discontinued. The 58 series? Discontinued. There are hundreds of thousands or to be sure, millions of people who will pay 50 dollars for one. All ATI or Nvidia need to do is to fine tune the drivers and reduce power consumption. Then again, that must be another "High End" story. In fact the only tale that is being told and retold is "High End"s and "Fool"s, (i.e. "We can do whatever we want with the money that you don't have".) Until better, saner times. For now, long live the console. I am going to buy one, instead of this stupid monstrosity and its equally stupid competitive monstrosity. Cheaper, and gets the job done in more than one way.

    End of Rant.
    God Bless.
  • Necc - Thursday, December 16, 2010 - link

    So True.
  • Ananke - Thursday, December 16, 2010 - link

    Agree. I have a 5850 and it does work fine, and I got it on day one at a huge discount, but still - it is kind of worthless. Our entertainment comes more exclusively from consoles, and a discrete high-end card that commands a price tag above $100 is worthless. It is a nice touch, but I have no application for it in everyday life, and several months later it is already outdated or discontinued.

    My guess is that graphics integrated into the CPU will take over, and the mass market discrete cards will have the fate of the dinosaurs very soon.
  • Quidam67 - Thursday, December 16, 2010 - link

    Wonderfully subversive commentary. Loved it.

    Still, the thing I like about the High end (I'll never buy it until my Mortgage is done with) is that it filters down to the middle/low end.

    Yes, lots of discontinued product lines but for example, I thought the HD5770 was a fantastic product. Gave ample performance for mainstream gamers in a small form-factor (you can even get it in single slot) with low heat and power requirements meaning it was a true drop-in upgrade to your existing rig, with a practical upgrade path to Crossfire X.

    As for the xbox, that hardware is so outdated now that even the magic of software optimisation (a seemingly lost art in the world of PC's) cannot disguise the fact that new games are not going to look any better, or run any faster, than those that came out at launch. Was watching GT5 in demo the other day and with all the hype about how realistic it looks (and plays) I really couldn't get past the massive amount of Jaggies on screen. Also, very limited damage modelling, and in my view that's a nod towards hardware limitations rather than a game-design consideration.
