Stream Processor Implementation

Going Deeper: Single Instruction, Multiple Data

SIMD (single instruction, multiple data) is the concept of running one instruction across lots of data. This is fundamental in the implementation of graphics hardware: multiple vertices, primitives, or pixels will need to have the same shader program run on them. Building hardware to do one operation at a time on massive amounts of data makes processing each piece of data very efficient.

In SIMD hardware, multiple processing units are tied together. The hardware issues one instruction to the SIMD hardware and all the processing units perform that operation on unique data. All graphics hardware is built on this concept at some level. Implementing hardware this way avoids the complexity of requiring each SP to manage not only the data coming through it, but the instructions it will be running as well.

Going Deeper: Very Long Instruction Word

Normally when we think about instructions on a processor, we think about a single operation, like Add or Multiply. But imagine if you wanted to run multiple instructions at once on a parallel array of hardware. You might come up with a technique similar to VLIW (Very Long Instruction Word), which allows you to take simple operations and, if they are not dependent on each other, stick them together as one instruction.

Imagine we have five processing units that operate in parallel. Utilizing this hardware would require us to issue independent instructions on each of the five units. This is hard to determine while code is running. VLIW allows us to take the determination of instruction dependence out of the hardware and put it in the complier. The compiler can then build a single instruction that consists of as much independent processing work as possible.

VLIW is a good way of exploiting parallelism without adding hardware complexity, but it can create a huge headache for compiler designers when dealing with dependencies. Luckily, graphics hardware lends itself well to this type of processing, but as shaders get more complex and interesting we might see more dependent instructions in practice.

Bringing it Back to the Hardware: AMD's R600

AMD implements their R600 shader core using four SIMD arrays. These SIMD arrays are issued 5-wide (6 with a branch) VLIW instructions. These VLIW instructions operate on 16 threads (vertices, primitives or pixels) at a time. In addition to all this, AMD interleaves two different VLIW instructions from different shaders in order to maximize pipeline utilization on the SIMD units. Our understanding is that this is in order to ensure that all the data from one VLIW instruction is available to a following dependent VLIW instruction in the same shader.

Based on this hardware, we can do a little math and see that R600 is capable of issuing up to four different VLIW instructions (up to 20 distinct shader operations), working on a total of 64 different threads. Each thread can have up to five different operations working on it as defined by the VLIW instruction running on the SIMD unit that is processing that specific thread.

For pixel processing, AMD assigns threads to SIMD units in 8x8 blocks (64 pixels) processed over multiple clocks. This is to enable a small branch granularity (each group of 64 pixels must follow the same code path), and it's large enough to exploit locality of reference in tightly packed pixels (in other words, pixels that are close together often need to load similar data/textures). There are apparently cases where branch granularity jumps to 128 pixels, but we don't have the data on when or why this happens yet.

If it seems like all this reads in a very complicated way, don't worry: it is complex. While AMD has gone to great lengths to build hardware that can efficiently handle parallel data, dependencies pose a problem to realizing peak performance. The compiler might not be able to extract five operations for every VLIW instruction. In the worst case scenario, we could effectively see only one SP per block operating with only four VLIW instructions being issued. This drops our potential operations per clock rate down from 320 at peak to only 64.

On the bright side, we will probably not see a shader program that causes R600 to run at its worst case performance. Because vertices and colors are still four components each, we will likely see utilization closer to peak in many common cases.

Different Types of Stream Processors Next Up: NVIDIA's G80
Comments Locked

86 Comments

View All Comments

  • dragonsqrrl - Thursday, August 25, 2011 - link

    You forgot c).

    -if you're an ATI fanboy
  • vijay333 - Monday, May 14, 2007 - link

    http://www.randomhouse.com/wotd/index.pperl?date=1...">http://www.randomhouse.com/wotd/index.pperl?date=1...

    "the expression to call a spade a spade is thousands of years old and etymologically has nothing whatsoever to do with any racial sentiment."
  • strikeback03 - Wednesday, May 16, 2007 - link

    What about in Euchre, where a spade can be a club (and vice versa)?
  • johnsonx - Monday, May 14, 2007 - link

    Just wait until AT refers to AMD's marketing budget as 'niggardly'...
  • bldckstark - Monday, May 14, 2007 - link

    What do shovels have to do with race?
  • Stan11003 - Monday, May 14, 2007 - link

    My big hope out all of this that the ATI part forces the Nvidia parts lower so I can use my upgrade option from EVGA to get a nice 8800 GTX instead of my 8800 GTS ACS3 320. However with a quad core and a decent 2GB I have no gaming issues at all. I play at 1600x1200(when that become a low rez?) and everything is butter smooth. Without newer titles all this hardware is a waist anyways.
  • Gul Westfale - Monday, May 14, 2007 - link

    the article says that the part is not a failure, but i disagree. i switched from a radeon 1950pro to an nvidia geforce 8800GTS 320MB about a mont ago, and i paid only $350US for it. now i see that it still outperforms the new 2900...

    one of my friends wanted to wait to buy a new card, he said he hoped that the ATI part was going to be faster. now he says he will just buy the 8800GTS 320, since ATI have failed.

    if they can bring out a part that competes well with the 8800GTS and price it similarly or lower then it would be worth buying, but until then i will stick with nvidia. better performance, better price, and better drivers... why would anyone buy the ATI card now?
  • ncage - Monday, May 14, 2007 - link

    My conclusion is to wait. All of the recent GPU do great with dx9...the question is how will they do with dx10? I think its best to wait for dx10 titles to come out. I think crysis would be a PERFECT test.
  • wingless - Monday, May 14, 2007 - link

    I agree with you. Crysis is going to be the benchmark for these DX10 cards. Its hard to tell both Nvidia and AMD's DX10 performance with these current, first generation DX10 titles (most of which have a DX9 version) because they don't fully take advantage of all the power on both the G80 or R600 yet. Its true that Crysis will have a DX9 version as well but the developer stated there are some big differences in code. I'm an Nvidia fanboy but I'm disappointed with the Pure Video and HDMI support on the 8800 series cards. ATI got this worked out with their great AVIVO and their nice HDMI implementation but for now Nvidia is still the performance champ with "simpler" hardware. The G80 and R600 continue the traditions of their manufacturers. Nvidia has always been about raw power and all out speed with few bells and whistles. ATI is all about refinement, bells and whistles, innovations, and unproven new methods which may make or break them.

    All I really want to wait for is to see how developers embrace CUDA or ATI's setup for PHYSICS PROCESSING! Both companies seem to have well thought out methods to do physics and I cant wait to see that showdown. AGEIA and HAVOK need to hop on-board and get some software support for all this good hardware potential they have to play with. Physics is the next big gimmick and you know how much we all love gimmicks (just like good 'ole 3D acceleration 10 years ago).
  • poohbear - Monday, May 14, 2007 - link

    they dont make a profit from high end parts that's why they're not bothering w/ it? that's AMD's story? so why bother having an FX line w/ their cpus?

Log in

Don't have an account? Sign up now