Stream Processor Implementation

Going Deeper: Single Instruction, Multiple Data

SIMD (single instruction, multiple data) is the concept of running one instruction across many pieces of data at once. This is fundamental to the implementation of graphics hardware: multiple vertices, primitives, or pixels need to have the same shader program run on them. Building hardware that applies one operation at a time to massive amounts of data makes processing each piece of data very efficient.

In SIMD hardware, multiple processing units are tied together. The hardware issues one instruction to the SIMD hardware and all the processing units perform that operation on unique data. All graphics hardware is built on this concept at some level. Implementing hardware this way avoids the complexity of requiring each SP to manage not only the data coming through it, but the instructions it will be running as well.
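To make the idea concrete, here is a minimal software sketch of a SIMD issue: one operation is handed to a group of lanes, and each lane applies it to its own data. The function name simd_issue and the four-lane example are our own illustration, not a model of any particular GPU.

```python
import operator
from typing import Callable, List


def simd_issue(op: Callable[[float, float], float],
               a: List[float], b: List[float]) -> List[float]:
    """Apply a single instruction (op) to every lane's own pair of operands."""
    if len(a) != len(b):
        raise ValueError("every lane needs one operand from each input")
    # One instruction, many data elements: the op is the same for all lanes.
    return [op(x, y) for x, y in zip(a, b)]


# Hypothetical four-lane example: the same Add is issued once for all lanes.
lanes_a = [1.0, 2.0, 3.0, 4.0]
lanes_b = [10.0, 20.0, 30.0, 40.0]
sums = simd_issue(operator.add, lanes_a, lanes_b)
```

Note that the lanes never choose their own instruction; only the data differs, which is what keeps the control logic simple.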

Going Deeper: Very Long Instruction Word

Normally when we think about instructions on a processor, we think about a single operation, like Add or Multiply. But imagine if you wanted to run multiple instructions at once on a parallel array of hardware. You might come up with a technique similar to VLIW (Very Long Instruction Word), which allows you to take simple operations and, if they are not dependent on each other, stick them together as one instruction.

Imagine we have five processing units that operate in parallel. Utilizing this hardware would require us to issue independent instructions to each of the five units. Working out which instructions are independent is hard to do while code is running. VLIW allows us to take the determination of instruction dependence out of the hardware and put it in the compiler. The compiler can then build a single instruction that consists of as much independent processing work as possible.

VLIW is a good way of exploiting parallelism without adding hardware complexity, but it can create a huge headache for compiler designers when dealing with dependencies. Luckily, graphics hardware lends itself well to this type of processing, but as shaders get more complex and interesting we might see more dependent instructions in practice.
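As a rough illustration of what the compiler has to do, the sketch below packs a list of simple operations into 5-wide bundles, starting a new bundle whenever an operation reads a result produced in the current one. This is a deliberately simplified, in-order greedy packer of our own; it assumes each register is written only once, and it is not how AMD's shader compiler actually works.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    dest: str               # register written by this operation
    srcs: Tuple[str, ...]   # registers read by this operation


def pack_vliw(ops: List[Op], width: int = 5) -> List[List[Op]]:
    """Greedily pack ops, in program order, into bundles of at most `width`."""
    bundles: List[List[Op]] = []
    current: List[Op] = []
    written = set()  # destinations produced inside the current bundle
    for op in ops:
        depends = any(src in written for src in op.srcs)
        # Close the bundle if it is full or this op needs one of its results.
        if current and (depends or len(current) == width):
            bundles.append(current)
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current)
    return bundles


# Four independent component multiplies fit in a single VLIW instruction.
independent = [Op(f"r.{c}", (f"a.{c}", f"b.{c}")) for c in "xyzw"]
# A dependent chain cannot be packed: each op needs the previous result.
chain = [Op("t0", ("a", "b")), Op("t1", ("t0", "c")), Op("t2", ("t1", "d"))]
```

Here pack_vliw(independent) yields one bundle with four of its five slots filled, while pack_vliw(chain) yields three bundles with a single operation each, which is the dependency problem described above.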

Bringing it Back to the Hardware: AMD's R600

AMD implements their R600 shader core using four SIMD arrays. These SIMD arrays are issued 5-wide (6 with a branch) VLIW instructions. These VLIW instructions operate on 16 threads (vertices, primitives or pixels) at a time. In addition to all this, AMD interleaves two different VLIW instructions from different shaders in order to maximize pipeline utilization on the SIMD units. Our understanding is that this is in order to ensure that all the data from one VLIW instruction is available to a following dependent VLIW instruction in the same shader.

Based on this hardware, we can do a little math and see that R600 is capable of issuing up to four different VLIW instructions (up to 20 distinct shader operations), working on a total of 64 different threads. Each thread can have up to five different operations working on it as defined by the VLIW instruction running on the SIMD unit that is processing that specific thread.
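The arithmetic behind those figures can be written out as a short sketch. The constants are the numbers quoted above; the function name and dictionary keys are our own labels.

```python
# Figures quoted in the article for R600.
SIMD_ARRAYS = 4         # SIMD arrays in the shader core
VLIW_WIDTH = 5          # operations per VLIW instruction (branch slot excluded)
THREADS_PER_SIMD = 16   # threads each VLIW instruction operates on


def issue_summary(simd_arrays: int = SIMD_ARRAYS,
                  vliw_width: int = VLIW_WIDTH,
                  threads_per_simd: int = THREADS_PER_SIMD) -> dict:
    """Return the per-issue totals implied by the figures above."""
    return {
        # One VLIW instruction per SIMD array.
        "vliw_instructions": simd_arrays,
        # Each VLIW instruction can carry up to vliw_width distinct operations.
        "distinct_operations": simd_arrays * vliw_width,
        # Each SIMD array works on threads_per_simd threads.
        "threads": simd_arrays * threads_per_simd,
    }


summary = issue_summary()
```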

For pixel processing, AMD assigns threads to SIMD units in 8x8 blocks (64 pixels) processed over multiple clocks. This block size is small enough to keep branch granularity fine (each group of 64 pixels must follow the same code path), and large enough to exploit locality of reference in tightly packed pixels (in other words, pixels that are close together often need to load similar data/textures). There are apparently cases where branch granularity jumps to 128 pixels, but we don't yet have data on when or why this happens.
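Why branch granularity matters can be sketched with a simple cost model. We assume here, as is typical for SIMD hardware but not confirmed for R600 in this article, that a block whose pixels disagree on a branch ends up paying for both sides of it. The function name and the cost units are hypothetical.

```python
from typing import Sequence

BLOCK_PIXELS = 64  # one 8x8 block, the branch granularity quoted above


def block_branch_cost(takes_branch: Sequence[bool],
                      cost_then: int, cost_else: int) -> int:
    """Cost for one block under the assumed 'pay for every path used' model."""
    if len(takes_branch) != BLOCK_PIXELS:
        raise ValueError("expected one flag per pixel in the 8x8 block")
    any_then = any(takes_branch)       # at least one pixel takes the branch
    any_else = not all(takes_branch)   # at least one pixel does not
    return (cost_then if any_then else 0) + (cost_else if any_else else 0)


# All 64 pixels agree: only one side of the branch is paid for.
uniform = block_branch_cost([True] * BLOCK_PIXELS, cost_then=10, cost_else=30)
# A single pixel disagrees: under this model the block pays for both sides.
divergent = block_branch_cost([True] * 63 + [False], cost_then=10, cost_else=30)
```

Under this model, the smaller the block, the less often neighboring pixels drag each other down a path they do not need.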

If it seems like all this reads in a very complicated way, don't worry: it is complex. While AMD has gone to great lengths to build hardware that can efficiently handle parallel data, dependencies pose a problem to realizing peak performance. The compiler might not be able to extract five operations for every VLIW instruction. In the worst case scenario, we could effectively see only one SP per block operating with only four VLIW instructions being issued. This drops our potential operations per clock rate down from 320 at peak to only 64.
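The peak and worst-case numbers fall out of a one-line calculation, sketched here using the figures from this page. The function name is ours, and treating a four-component operation as filling four of the five slots is our own simplification.

```python
def ops_per_clock(filled_slots: float,
                  simd_arrays: int = 4,
                  threads_per_simd: int = 16,
                  vliw_width: int = 5) -> float:
    """Operations per clock when each VLIW instruction fills `filled_slots`."""
    if not 0 <= filled_slots <= vliw_width:
        raise ValueError("filled_slots must be between 0 and the VLIW width")
    return simd_arrays * threads_per_simd * filled_slots


peak = ops_per_clock(5)             # every slot filled
worst = ops_per_clock(1)            # only one operation per VLIW instruction
four_component = ops_per_clock(4)   # assumed: a 4-component op fills 4 slots
utilization = worst / peak          # fraction of peak in the worst case
```

With every slot filled this gives 320 operations per clock, and with one slot filled it gives 64, or one fifth of peak.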

On the bright side, we will probably not see a shader program that causes R600 to run at its worst case performance. Because vertices and colors are still four components each, we will likely see utilization closer to peak in many common cases.

86 Comments

  • yyrkoon - Tuesday, May 15, 2007 - link

    See, the problem here is: guys like you are so bent on saving that little bit of money, by buying a lesser brand name, that you do not even take the time to research your hardware. Use Newegg, and read the user reviews, and if that is not enough for you, go to the countless other resources all over the internet.
  • yyrkoon - Tuesday, May 15, 2007 - link

    Blame the crappy OEM you bought the card from, not NVIDIA. Get an EVGA card, and embrace a completely different aspect on video card life.

    MSI may make some decent motherboards, but their other components have serious issues.
  • LoneWolf15 - Thursday, May 17, 2007 - link

    Um, since 95% of nvidia-GPU cards on the market are the reference design, I'd say your argument here is shaky at best. EVGA and MSI both use the reference design, and it's even possible that cards with the same GPU came off the same production line at the same plant.
  • DerekWilson - Thursday, May 17, 2007 - link

    it is true that the majority of parts are based on reference designs, but that doesn't mean they all come from the same place. I'm sure some of them do, but to say that all of these guys just buy completed boards and put their name on them all the time is selling them a little short.

    at the same time, the whole argument of which manufacturer builds the better board on a board component level isn't something we can really answer.

    what we would suggest is that it's better to buy from OEMs who have good customer service and long, extensive warranties. this way, even if things do go wrong, there is some recourse for customers who get bad boards or have bad experiences with drivers and software.
  • cmdrdredd - Monday, May 14, 2007 - link

    you're wrong. 99% of people buying these high end cards are gaming. Those gamers demand and deserve the best possible performance. If a card that uses MORE power and costs MORE (x2900xt vs 8800gts) and performs generally the same or slower what is the point? Fact is...ATI's high end is in fact slower than mid range offerings from Nvidia and consumes a lot more power. Regardless of what you think, people are buying these based on performance benchmarks in 99% of all cases.
  • AnnonymousCoward - Tuesday, May 15, 2007 - link

    No, you're wrong. Did you overlook the emphasis he put on "NOT ALWAYS"?

    You said 99% use for gaming--so there's 1%. Out of the gamers, many really want LCD scaling to work, so that games aren't stretched horribly on widescreen monitors. Some gamers would also like TVout to work.

    So he was right: faster is NOT ALWAYS better.
  • erwos - Monday, May 14, 2007 - link

    It'd be nice to get the scoop on the video decode acceleration present on these boards, and how it stacks up to the (excellent) PureVideo HD found in the 8600 series.
  • imaheadcase - Tuesday, May 15, 2007 - link

    I agree! They need to do a whole article on video acceleration on a range of cards and show the pluses and cons of each card in respective areas. A lot of people like myself like to watch videos and game on cards, but like the option open to use the advanced video features.

  • Turnip - Monday, May 14, 2007 - link

    "We certainly hope we won't see a repeat of the R600 launch when Barcelona and Agena take on Core 2 Duo/Quad in a few months...."


    Why, that's exactly what I had been thinking :)

    Phew! I made it through the whole thing though, I even read all of those awfully big words and everything! :)

    Thanks guys, another top review :)
  • Kougar - Monday, May 14, 2007 - link

    First, great article! I will be going back to reread the very in-depth analysis of the hardware and features, something that keeps me an avid Anandtech reader. :)

    Since it was mentioned that overclocking will be included in a future article, I would like to suggest that if possible watercooling be factored into it. So far one review site has already done a watercooled test with a low-end watercooling setup, and without mods achieved 930MHz on the Core, which indirectly means 930MHz shaders if I understand the hardware.

    I'm sure I am not the only reader extremely interested to see if all R600 needs is a ~900-950MHz overclock to offer some solid GTX level performance... or if it would even help at all. Again thanks for the consideration, and the great article! Now off to find some Folding@Home numbers...
