Different Types of Stream Processors

The first thing we need to do when looking at the R600 shader core is to define our terms. AMD and NVIDIA build and refer to their Stream Processors (SPs) differently, and that makes counting them a little more difficult. Throughout our explanation, it will help to remember from our G80 coverage that a thread refers to a vertex, primitive or pixel, not to a stream of instructions as it would on a CPU.

Stream Processors: The NVIDIA Way

G80 has 128 SPs (on the 8800 GTX; the 8800 GTS models have 96), and each one can do only one of a small number of things at a time. An SP can perform a standard FP operation (like a MADD), a special function operation (like sine), or an integer operation. In some cases it can squeeze out an extra MUL, but more often than not this MUL isn't accessible. Each SP operates on an individual thread (be it a vertex, primitive or pixel).

This gives us a total of up to 128 threads being processed per clock. It is important to realize that the 128 SPs aren't entirely independent: we can't run 128 different instructions in one clock, even though we can run instructions on 128 different threads. We'll delve a little deeper into this shortly, but the short version is that the same instruction must run on a group of threads at once, and the size of that group depends on the type of shader running.

For NVIDIA hardware, the minimum number of threads that must be processed using the same instruction is 16 (for vertex threads). NVIDIA's block diagrams show that each group of 16 SPs shares texture, register, and cache resources, so this makes sense. Pixel shaders, which are more important from a performance perspective, must run one instruction on 32 pixels at a time. What we can extrapolate from this is that NVIDIA can issue up to eight separate instructions across all of its 128 SPs (only four if working on pixels) per clock.

128 SPs / 16 Threads per Instruction per Clock = 8 Vertex Instructions per Clock

128 SPs / 32 Threads per Instruction per Clock = 4 Pixel Instructions per Clock
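The division above can be written as a tiny Python function. This is a sketch of the arithmetic only: the function and constant names are hypothetical, and applying the same 16- and 32-thread batch sizes to the 96-SP 8800 GTS is our assumption rather than something stated above.

```python
def instructions_per_clock(num_sps: int, threads_per_instruction: int) -> int:
    """Distinct instructions issued per clock when every instruction must be
    applied to a batch of threads_per_instruction threads, one thread per SP."""
    if num_sps % threads_per_instruction:
        raise ValueError("SP count must be a multiple of the batch size")
    return num_sps // threads_per_instruction

VERTEX_BATCH = 16  # minimum threads per instruction for vertex shaders
PIXEL_BATCH = 32   # threads per instruction for pixel shaders

# The 96-SP 8800 GTS line assumes the same batch sizes as the GTX.
for model, sps in (("8800 GTX", 128), ("8800 GTS", 96)):
    print(f"{model}: {instructions_per_clock(sps, VERTEX_BATCH)} vertex / "
          f"{instructions_per_clock(sps, PIXEL_BATCH)} pixel instructions per clock")
```

The larger the batch an instruction must cover, the fewer distinct instructions the chip can have in flight each clock, which is why pixel work halves the figure.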

Stream Processors: AMD's R600

Things are a little different on R600. AMD tells us that there are 320 SPs, but these aren't directly comparable to G80's 128. First of all, most of the SPs are simpler and can't handle special function operations: in every block of five SPs, only one can perform a special function operation (or a regular floating point operation in its place). This special function SP is also the only one able to handle integer multiply, while the other four can perform simpler integer operations.

This isn't a huge deal, because straight floating point MAD and MUL performance is by far the limiting factor in shader performance today. The bigger difference is that AMD executes only one thread (vertex, primitive or pixel) across each group of five SPs.

What this means is that each of the five SPs in a block must run instructions from one thread. While AMD can run up to five scalar instructions from that thread in parallel, these instructions must be completely independent of one another, since they execute at the same time. This can place a heavy burden on AMD's compiler to extract parallel operations from shader code. While AMD has gone to great lengths to make sure every block of five SPs is always busy, it's much harder to ensure that every SP within each block is always busy.
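To make the packing problem concrete, here is a deliberately simplified Python model of one R600 block. It assumes only the constraints described above (at most five scalar ops per clock from one thread, at most one of them a special function) plus the rule that an op cannot share a clock with an op whose result it needs. The names `Op`, `pack` and `utilization` are hypothetical, and the greedy in-order packing is an illustration, not AMD's actual compiler.

```python
from dataclasses import dataclass

BUNDLE_WIDTH = 5   # five SPs per block, all working on one thread
SPECIAL_SLOTS = 1  # only one SP per block handles special functions

@dataclass
class Op:
    name: str
    special: bool = False            # e.g. sine; needs the special function SP
    deps: frozenset = frozenset()    # names of ops whose results this op needs

def pack(ops):
    """Greedily pack one thread's scalar ops into 5-wide bundles (one per clock).
    An op joins a bundle only if all its dependencies sit in earlier bundles."""
    done, remaining, bundles = set(), list(ops), []
    while remaining:
        bundle, specials = [], 0
        for op in list(remaining):
            if len(bundle) == BUNDLE_WIDTH:
                break
            if not op.deps <= done:
                continue  # an input isn't computed yet
            if op.special and specials == SPECIAL_SLOTS:
                continue  # special function slot already taken this clock
            bundle.append(op)
            specials += op.special
            remaining.remove(op)
        if not bundle:
            raise ValueError("unsatisfiable or cyclic dependencies")
        done.update(op.name for op in bundle)
        bundles.append([op.name for op in bundle])
    return bundles

def utilization(ops):
    """Fraction of SP slots doing useful work across the packed bundles."""
    return len(ops) / (len(pack(ops)) * BUNDLE_WIDTH)

# Independent work: a 4-component multiply plus an unrelated sine.
vec = [Op("mul_x"), Op("mul_y"), Op("mul_z"), Op("mul_w"), Op("sin", special=True)]
# A dependent chain: each op needs the previous result.
chain = [Op("mul"), Op("add", deps=frozenset({"mul"})),
         Op("mad", deps=frozenset({"add"})),
         Op("sin", special=True, deps=frozenset({"mad"}))]
print(pack(vec), utilization(vec))      # one bundle, utilization 1.0
print(pack(chain), utilization(chain))  # four bundles, utilization 0.2
```

The independent case fills every slot in a single clock, while the dependent chain needs four clocks and leaves four of the five SPs idle in each, which is exactly the situation the compiler has to work hardest to avoid.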

If we take a step back, we can determine how many threads AMD is able to work on per clock. With 320 SPs grouped into blocks of five, and one thread per block, we get 320 / 5 = 64 threads per clock. And here's where it starts to get complicated. Before we go back and compare this to NVIDIA's architecture, let's go a little deeper into the implementation.
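The thread count, and why per-clock comparisons get complicated, can be sketched the same way. This Python snippet assumes the simple model described so far (one thread per G80 SP, one thread per five-SP R600 block), ignores G80's occasional extra MUL, and ignores clock-speed differences between the two chips; the function and constant names are hypothetical.

```python
G80_SPS = 128          # 8800 GTX
R600_SPS = 320
R600_BLOCK_WIDTH = 5   # five SPs per block, one thread per block

def threads_per_clock(total_sps: int, sps_per_thread: int) -> int:
    """Threads processed per clock when each thread occupies sps_per_thread SPs."""
    return total_sps // sps_per_thread

def r600_ops_per_clock(avg_slots_filled: float) -> float:
    """Scalar ops per clock across all R600 blocks, given how many of the five
    slots in each block the compiler fills on average (1.0 to 5.0)."""
    if not 1.0 <= avg_slots_filled <= R600_BLOCK_WIDTH:
        raise ValueError("a block issues between 1 and 5 scalar ops per clock")
    return threads_per_clock(R600_SPS, R600_BLOCK_WIDTH) * avg_slots_filled

print(threads_per_clock(G80_SPS, 1))                  # 128 threads per clock
print(threads_per_clock(R600_SPS, R600_BLOCK_WIDTH))  # 64 threads per clock
for slots in (1.0, 2.0, 5.0):
    print(slots, r600_ops_per_clock(slots))           # 64.0, 128.0, 320.0 ops
```

Under these assumptions, R600 only matches G80's 128 scalar ops per clock when the compiler fills two of the five slots on average, and only pulls ahead when it fills more.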

86 Comments

  • dragonsqrrl - Thursday, August 25, 2011

    You forgot c).

    -if you're an ATI fanboy
  • vijay333 - Monday, May 14, 2007

    http://www.randomhouse.com/wotd/index.pperl?date=1...

    "the expression to call a spade a spade is thousands of years old and etymologically has nothing whatsoever to do with any racial sentiment."
  • strikeback03 - Wednesday, May 16, 2007

    What about in Euchre, where a spade can be a club (and vice versa)?
  • johnsonx - Monday, May 14, 2007

    Just wait until AT refers to AMD's marketing budget as 'niggardly'...
  • bldckstark - Monday, May 14, 2007

    What do shovels have to do with race?
  • Stan11003 - Monday, May 14, 2007

    My big hope out of all this is that the ATI part forces the Nvidia parts lower so I can use my upgrade option from EVGA to get a nice 8800 GTX instead of my 8800 GTS ACS3 320. However, with a quad core and a decent 2GB I have no gaming issues at all. I play at 1600x1200 (when did that become low res?) and everything is butter smooth. Without newer titles, all this hardware is a waste anyway.
  • Gul Westfale - Monday, May 14, 2007

    the article says that the part is not a failure, but i disagree. i switched from a radeon 1950pro to an nvidia geforce 8800GTS 320MB about a month ago, and i paid only $350US for it. now i see that it still outperforms the new 2900...

    one of my friends wanted to wait to buy a new card, he said he hoped that the ATI part was going to be faster. now he says he will just buy the 8800GTS 320, since ATI have failed.

    if they can bring out a part that competes well with the 8800GTS and price it similarly or lower then it would be worth buying, but until then i will stick with nvidia. better performance, better price, and better drivers... why would anyone buy the ATI card now?
  • ncage - Monday, May 14, 2007

    My conclusion is to wait. All of the recent GPUs do great with dx9...the question is how will they do with dx10? I think it's best to wait for dx10 titles to come out. I think crysis would be a PERFECT test.
  • wingless - Monday, May 14, 2007

    I agree with you. Crysis is going to be the benchmark for these DX10 cards. It's hard to judge either Nvidia's or AMD's DX10 performance with the current, first-generation DX10 titles (most of which have a DX9 version) because they don't yet take full advantage of the power of either the G80 or the R600. It's true that Crysis will have a DX9 version as well, but the developer stated there are some big differences in code. I'm an Nvidia fanboy, but I'm disappointed with the PureVideo and HDMI support on the 8800 series cards. ATI got this worked out with their great AVIVO and their nice HDMI implementation, but for now Nvidia is still the performance champ with "simpler" hardware. The G80 and R600 continue the traditions of their manufacturers. Nvidia has always been about raw power and all-out speed with few bells and whistles. ATI is all about refinement, bells and whistles, innovations, and unproven new methods which may make or break them.

    All I really want to wait for is to see how developers embrace CUDA or ATI's setup for PHYSICS PROCESSING! Both companies seem to have well-thought-out methods to do physics and I can't wait to see that showdown. AGEIA and HAVOK need to hop on board and get some software support for all this good hardware potential they have to play with. Physics is the next big gimmick and you know how much we all love gimmicks (just like good 'ole 3D acceleration 10 years ago).
  • poohbear - Monday, May 14, 2007

    they don't make a profit from high-end parts, so that's why they're not bothering w/ it? that's AMD's story? so why bother having an FX line w/ their cpus?
