Stream Processor Implementation

Going Deeper: Single Instruction, Multiple Data

SIMD (single instruction, multiple data) is the concept of running one instruction across lots of data. This is fundamental in the implementation of graphics hardware: multiple vertices, primitives, or pixels will need to have the same shader program run on them. Building hardware to do one operation at a time on massive amounts of data makes processing each piece of data very efficient.

In SIMD hardware, multiple processing units are tied together. The hardware issues one instruction to the SIMD hardware and all the processing units perform that operation on unique data. All graphics hardware is built on this concept at some level. Implementing hardware this way avoids the complexity of requiring each SP to manage not only the data coming through it, but the instructions it will be running as well.
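The issue-once, execute-everywhere idea can be sketched in a few lines (an illustrative model, not real GPU code — the lanes and the "instruction" here are invented for the example):

```python
# Illustrative SIMD model: one instruction is issued once, and every
# processing unit (lane) applies it to its own unique operands.

def simd_execute(instruction, lanes):
    """Issue one instruction; each lane runs it on its own data."""
    return [instruction(a, b) for a, b in lanes]

# Four lanes, each holding its own pair of operands
# (think per-vertex or per-pixel values).
lanes = [(1, 2), (3, 4), (5, 6), (7, 8)]

# A single "add" is issued to the array, but executes on all four lanes.
result = simd_execute(lambda a, b: a + b, lanes)
print(result)  # [3, 7, 11, 15]
```

The key point the model captures is that instruction management happens once for the whole array, so each individual SP only has to worry about its data.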

Going Deeper: Very Long Instruction Word

Normally when we think about instructions on a processor, we think about a single operation, like Add or Multiply. But imagine if you wanted to run multiple instructions at once on a parallel array of hardware. You might come up with a technique similar to VLIW (Very Long Instruction Word), which allows you to take simple operations and, if they are not dependent on each other, stick them together as one instruction.

Imagine we have five processing units that operate in parallel. Utilizing this hardware would require us to issue independent instructions on each of the five units. This is hard to determine while code is running. VLIW allows us to take the determination of instruction dependence out of the hardware and put it in the compiler. The compiler can then build a single instruction that consists of as much independent processing work as possible.
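A toy model of that compiler-side packing might look like the following (the three-operand instruction format and the greedy strategy are invented for illustration; a real VLIW compiler does far more sophisticated scheduling):

```python
# Toy VLIW packer: group operations into one wide instruction (a "bundle")
# only when no operation reads or writes a register written by another
# operation already in that bundle.

def pack_vliw(ops, width=5):
    """Greedily pack independent ops (dest, src1, src2) into bundles."""
    bundles = []
    for dest, src1, src2 in ops:
        placed = False
        for bundle in bundles:
            written = {d for d, _, _ in bundle}
            # Independent of this bundle: touches no register it writes.
            if len(bundle) < width and not ({dest, src1, src2} & written):
                bundle.append((dest, src1, src2))
                placed = True
                break
        if not placed:
            bundles.append([(dest, src1, src2)])
    return bundles

# r2 = r0+r1 and r5 = r3+r4 are independent -> packed into one bundle.
# r6 = r2+r5 depends on both results -> forced into a second bundle.
ops = [("r2", "r0", "r1"), ("r5", "r3", "r4"), ("r6", "r2", "r5")]
print(len(pack_vliw(ops)))  # 2
```

Even in this tiny example, the dependent third operation halves utilization: the second bundle issues with four of its five slots empty.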

VLIW is a good way of exploiting parallelism without adding hardware complexity, but it can create a huge headache for compiler designers when dealing with dependencies. Luckily, graphics hardware lends itself well to this type of processing, but as shaders get more complex and interesting we might see more dependent instructions in practice.

Bringing it Back to the Hardware: AMD's R600

AMD implements their R600 shader core using four SIMD arrays. These SIMD arrays are issued 5-wide (6 with a branch) VLIW instructions. These VLIW instructions operate on 16 threads (vertices, primitives or pixels) at a time. In addition to all this, AMD interleaves two different VLIW instructions from different shaders in order to maximize pipeline utilization on the SIMD units. Our understanding is that this is in order to ensure that all the data from one VLIW instruction is available to a following dependent VLIW instruction in the same shader.

Based on this hardware, we can do a little math and see that R600 is capable of issuing up to four different VLIW instructions (up to 20 distinct shader operations), working on a total of 64 different threads. Each thread can have up to five different operations working on it as defined by the VLIW instruction running on the SIMD unit that is processing that specific thread.

For pixel processing, AMD assigns threads to SIMD units in 8x8 blocks (64 pixels) processed over multiple clocks. This is to enable a small branch granularity (each group of 64 pixels must follow the same code path), and it's large enough to exploit locality of reference in tightly packed pixels (in other words, pixels that are close together often need to load similar data/textures). There are apparently cases where branch granularity jumps to 128 pixels, but we don't have the data on when or why this happens yet.
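The cost of that branch granularity can be sketched as follows (an assumed SIMD-predication model for illustration, not a description of AMD's actual scheduler):

```python
# Branch granularity sketch: a block of pixels must follow the same code
# path, so the block executes every path that at least one pixel takes,
# masking off the pixels that didn't take it.

def passes_needed(block_conditions):
    """Number of code paths a block must execute, one pass per path."""
    return len(set(block_conditions))

# Uniform block: all 64 pixels take the same branch -> one pass.
uniform = [True] * 64

# Divergent block: a single differing pixel forces both paths.
divergent = [True] * 63 + [False]

print(passes_needed(uniform))    # 1
print(passes_needed(divergent))  # 2
```

This is why small branch granularity matters: the smaller the block, the less likely a lone divergent pixel is to drag 63 neighbors through an extra code path.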

If it seems like all this reads in a very complicated way, don't worry: it is complex. While AMD has gone to great lengths to build hardware that can efficiently handle parallel data, dependencies pose a problem to realizing peak performance. The compiler might not be able to extract five operations for every VLIW instruction. In the worst case scenario, we could effectively see only one SP per block operating with only four VLIW instructions being issued. This drops our potential operations per clock rate down from 320 at peak to only 64.
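The peak-versus-worst-case arithmetic above can be written out directly:

```python
# The R600 issue math from the article, as a small calculation.

simd_arrays = 4         # SIMD arrays, one VLIW instruction issued to each
threads_per_array = 16  # threads (vertices/primitives/pixels) per array
vliw_width = 5          # shader operations per VLIW instruction, at best

peak = simd_arrays * threads_per_array * vliw_width
worst = simd_arrays * threads_per_array * 1  # only 1 of 5 slots filled

print(peak)   # 320 operations per clock at peak
print(worst)  # 64 when dependencies leave four slots empty
```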

On the bright side, we will probably not see a shader program that causes R600 to run at its worst case performance. Because vertices and colors are still four components each, we will likely see utilization closer to peak in many common cases.

86 Comments

  • wjmbsd - Monday, July 2, 2007 - link

    What is the latest on the so-called Dragonhead 2 project (aka, HD 2900 XTX)? I heard it was just for OEMs at first...anyone know if the project is still going and how the part is benchmarking with newest drivers?
  • teainthesahara - Monday, May 21, 2007 - link

    After this failure of the R600 and the likely overrated (and probably late) Barcelona/Agena processors, I think that Intel will finally bury AMD. Paul Ottelini is rubbing his hands with glee at the moment and rightfully so. AMD now stands for mediocrity. Oh dear, what a fall from grace... To be honest, Nvidia don't have any real competition on the DX10 front at any price point. I cannot see AMD processors besting Intel's Core 2 Quad lineup in the future, especially when 45nm and 32nm become the norm, and they don't have a chance in hell of beating Nvidia. Intel and Nvidia are turning the screws on Hector Ruiz. Shame AMD brought down such a great company like ATI.
  • DerekWilson - Thursday, May 24, 2007 - link

    To be fair, we really don't have any clue how these cards compete on the DX10 front as there are no final, real DX10 games on the market to test.

    We will try really hard to get a good idea of what DX10 will look like on the HD 2000 series and the GeForce 8 Series using game demos, pre-release code, and SDK samples. It won't be a real reflection of what users will experience, but we will certainly hope to get a glimpse at performance.

    It is fair to say that NVIDIA bests AMD in current game performance. But really there are so many possibilities with DX10 that we can't call it yet.
  • spinportal - Friday, May 18, 2007 - link

    From the last posting of results for the GTS 320MB round-up
    http://www.anandtech.com/video/showdoc.aspx?i=2953... (Prey @ AnandTech - 8800GTS320)
    we see that the 2900XT review chart pushes the nVidia cards down about 15% across the board.
    http://www.anandtech.com/video/showdoc.aspx?i=2988... (Prey @ AnandTech - ATI2900XT)
    The only difference in systems is software drivers as the cpu / mobo / mem are the same.

    Does this mean ATI should be getting a BIGGER THRASHING BEAT-DOWN than the reviewer is stating?
    A $400 ATI 2900XT performing only as well as a $300 nVidia 8800 GTS 320MB?

    It's $100 short and 6 months late, along with 100W of extra fuel.

    This is not your uncle's 9700 Pro...
  • DerekWilson - Sunday, May 20, 2007 - link

    We switched Prey demos -- I updated our benchmark.

    Both numbers are accurate for the tests I ran at the time.

    Our current timedemo is more stressful and thus we see lower scores with this test.
  • Yawgm0th - Wednesday, May 16, 2007 - link

    The prices listed in this article are way off.

    Currently, 8800GTS 640MB retails for $350-380, $400+ for OC or special versions. 2900XT retails for $430+. In the article, both are listed as $400, and as such the card is given a decent review in the conclusion.

    Realistically, this card provides slightly inferior performance to the 8800GTS 640MB at a considerably higher price point -- $80-$100 more than the 8800GTS. I mean, it's not like the 8800Ultra, but for the most part this card has little use outside of AMD and/or ATI fanboys. I'd love for this card to do better as AMD needs to be competing with Nvidia and Intel right now, but I just can't see how this is even worth looking at, given current prices.
  • DerekWilson - Thursday, May 17, 2007 - link

    really, this article focuses on architecture more than product, and we went with MSRP prices...

    we will absolutely look closer at price and price/performance when we review retail products.
  • quanta - Tuesday, May 15, 2007 - link

    As I recall, the Radeon HD 2900 only has DVI ports, and nowhere does the DVI documentation specify that it can carry audio signals. Unless the card comes with an adapter that accepts audio input, it seems the audio portion of R600 is rendered useless.
  • DerekWilson - Wednesday, May 16, 2007 - link

    the card does come with an adapter of sorts, but the audio input is from the dvi port.

    you can't use a standard DVI to HDMI converter for this task.

    when using AMD's HDMI converter the data sent out over the DVI port does not follow the DVI specification.

    the bottom line is that the DVI port is just a physical connector carrying data. i could take a DVI port and solder it to a stereo and use it to carry 5.1 audio if I wanted to ... wouldn't be very useful, but I could do it :-)

    While connected to a DVI device, the card operates the port according to the DVI specification. When connected to an HDMI device through the special converter (which is not technically "dvi to hdmi" -- it's amd proprietary to hdmi), the card sends out data that follows the HDMI spec.

    you can look at it another way -- when the HDMI converter is connected, just think of the dvi port as an internal connector between an I/O port and the TMDS + audio device.
  • ShaunO - Tuesday, May 15, 2007 - link

    I was at an AMD movie night last night where they discussed the technical details of the HD 2900 XT and also showed the Ruby Whiteout DX10 Demo rendered using the card. It looked amazing and I had high hopes until I checked out the benchmark scores. They're going to need more than free food and popcorn to convince me to buy an obsolete card.

    However there is room for improvement of course. Driver updates, DX10 and whatnot. The main thing for me personally will be driver updates, I will be interested to see how well the card improves over time while I save my pennies for my next new machine.

    Everyone keeps saying "DX10 performance will be better, yadda yadda," but I also want to be able to play the games I have now and older games without having to rely on DX10 games to give me better performance. Nothing like totally underperforming in DX9 games and then only being equal or slightly better in DX10 games compared to the competition. I would rather have a decent performer all-round. Even so, we don't know for sure whether DX10 games are going to bring any performance increase over the competition; it's all speculation right now, and that's all we can do: speculate.

    Shaun.
