Many people, especially in huge articles like the GT200 launch article, skip over the very text-heavy pages I tend to write, especially as I have a flair for low-level technical detail that not everyone enjoys.

In this recent foray into GPU architecture guesswork, we spent some time speculating about G80 and GT200 SP pipeline depth. Our guess was 8 stages, based on the depth of other architectures at that speed and the potential for wasted power with very deep pipelines. It turns out that we may have guessed way too low on this one (Anand: ahem, actually someone came up with 15).

One of our readers, Denis Riedijk, pointed us to NVIDIA's own forums and CUDA programming guide. These sources reveal that properly hiding instruction latency requires 6 active warps per SM. The math on this comes out to an effective latency of 24 cycles before a warp can be scheduled to process the next instruction in its instruction stream. Each warp takes 4 cycles to process in an SM (4 threads from a warp are processed on each of the 8 SPs) and 6 * 4 is 24. You can also look at it as 6 warps * 32 threads/warp = 192 threads, and 192 threads / 8 SPs = 24 threads per SP, which at a throughput of 1 instruction per cycle comes to 24 cycles.
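
The arithmetic above is simple enough to check in a few lines. This is just a sketch in Python using the figures quoted in this paragraph (32 threads per warp, 8 SPs per SM, one instruction per SP per cycle). The function and constant names are my own, and the round-robin issue order is an assumption, not something NVIDIA's documentation states.

```python
# Figures taken from the paragraph above; the names are illustrative only.
THREADS_PER_WARP = 32
SPS_PER_SM = 8


def cycles_per_warp(threads_per_warp=THREADS_PER_WARP, sps_per_sm=SPS_PER_SM):
    """Cycles an SM spends issuing one instruction for a whole warp,
    assuming each SP handles one thread-instruction per cycle."""
    return threads_per_warp // sps_per_sm


def cycles_covered(active_warps):
    """Cycles that pass before the scheduler gets back to the same warp,
    assuming (hypothetically) simple round-robin issue across active warps."""
    return active_warps * cycles_per_warp()


print(cycles_per_warp())   # 4
print(cycles_covered(6))   # 24
```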

My first thought was that their scheduling hardware might not be able to handle scheduling fewer warps fast enough, or that the way they manage local memory might require a delay for some other reason to cover read-after-write dependencies. But reading through the threads Denis pointed us to really seems to indicate that it might just be pipeline depth that gives us this latency. From NVIDIA's Mark Harris in one of the threads:

"The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc,) for a whole warp."
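
Working backwards from Harris's numbers lands on the same warp count. A rough sketch, assuming the latency is covered purely by issuing other warps back to back with no other stalls (the helper name is hypothetical):

```python
import math


def warps_to_hide(latency_clocks, clocks_per_warp_instruction):
    """Smallest number of warps whose back-to-back issue time is at least
    the instruction latency (assumes no other source of stalls)."""
    return math.ceil(latency_clocks / clocks_per_warp_instruction)


# 22 clocks of latency, 4 clocks to issue one instruction for a warp.
print(warps_to_hide(22, 4))  # 6
```

Six warps at 32 threads each is the 192 threads per SM mentioned earlier.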

There's also an indication of the size of G80/GT200's SP register file in the CUDA forums. Harris mentions that one way of hiding ALU latency is by ensuring at most 25% of the available register space is in use, or 42 registers per thread. That would put G80 at 168 registers or GT200 at 336 registers per thread.
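
For what it's worth, here is one way a number like 42 can fall out. This sketch assumes the register file is a single pool per multiprocessor that gets split evenly among however many threads are resident, and it uses the per-multiprocessor register counts listed in the CUDA Programming Guide (8192 for G80, 16384 for GT200). Whether that is how Harris arrived at his figure is my assumption, and real hardware allocates registers with some granularity that this ignores.

```python
def registers_per_thread(registers_per_sm, resident_threads):
    """Registers each thread could use if the per-multiprocessor register
    file were split evenly across resident threads (rounded down)."""
    return registers_per_sm // resident_threads


THREADS_FOR_6_WARPS = 6 * 32  # 192 threads, the latency-hiding figure

print(registers_per_thread(8192, THREADS_FOR_6_WARPS))   # 42
print(registers_per_thread(16384, THREADS_FOR_6_WARPS))  # 85
print(registers_per_thread(8192, 32))                    # 256
```

Under this model, the more threads you keep resident, the fewer registers each one gets.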

Here are some links to the relevant threads in case you guys want to read through them yourselves. It's definitely interesting stuff.

Which brings us to a broader point. NVIDIA is going to have to give CUDA developers more detail in order for them to effectively use the hardware. Certainly we don't believe Intel gives away as much technical detail as they do because they are so benevolent: developers need to know the details in order to get the most out of their code, and this is more and more true as you reach up into the HPC markets that NVIDIA is targeting. Companies that pay hundreds of thousands for compute clusters aren't interested in just throwing away compute power: they want and need to get the most out of every cycle.

While I do hope that NVIDIA will continue to move in the direction of giving us more detail, or at least of giving us the detail they are already publicly sharing with developers, we certainly do have a better idea of where to look when we want low-level technical information now. Looks like I'm going to have to sit down and start writing all those CUDA apps I've been putting off.


  • Aileur - Wednesday, June 18, 2008

    I have found the cuda forums to be a great place to learn.
    Many of the contributors that wrote the programs in the SDK participate on the forums and I'd like to think they know their stuff!

    As for registers per thread. If we accept there are 8192 registers available per multiprocessor, and if we want to run at least one full warp of 32 threads, that would put the maximum of registers per thread at 256. I guess we could run only 1 thread and have a full 8192 registers to a thread but that would obviously be completely useless.

    I guess what I'm saying is that I don't think there is a "registers per thread" value. There is a registers per multiprocessor fixed (per card) value and your launch configuration decides how many registers a kernel can hope to be able to use. On the other hand, a given kernel knows how many registers it needs (and unlike general purpose cpus, it NEEDS those registers as there is no caching mechanism), so you have to generate a launch configuration that agrees with this value.

    Hope to see you on the cuda forums soon!
  • jibbo79 - Thursday, June 19, 2008

    Anyone with interest in these specs should read the CUDA Programming Guide doc.

    For devices with compute capability 1.0 (eg GeForce 8800)
    - The maximum number of threads per block is 512
    - The number of registers per multiprocessor is 8192
    - The maximum number of active blocks per multiprocessor is 8
    - The maximum number of active warps per multiprocessor is 24
    - The maximum number of active threads per multiprocessor is 768

    For devices with compute capability 1.2 (eg GeForce GTX 280/260)
    - The maximum number of threads per block is 512
    - The number of registers per multiprocessor is 16384
    - The maximum number of active blocks per multiprocessor is 8
    - The maximum number of active warps per multiprocessor is 32
    - The maximum number of active threads per multiprocessor is 1024
  • Denis Riedijk - Thursday, June 19, 2008

    GTX260 & 280 are compute capability 1.3 actually, but the numbers are correct.
  • jibbo79 - Thursday, June 19, 2008

    Yes, but 1.3 only adds double precision and is completely unrelated to register counts.
  • Denis Riedijk - Friday, June 20, 2008

    When receiving the card my first impression was that they doubled the register count because of double support, since it takes 2 registers per double. But since there was apparently an (internal?) separate compute capability it might indeed be unrelated.
  • Zak - Wednesday, June 18, 2008

    "Some of us enjoy the technical stuff even if we don't fully understand it,"

    Yup, and we sometimes learn something geeky from them:)

  • SiliconDoc - Monday, July 28, 2008

    Yes, and some of us can't help thinking with a bad attitude, "The b****rds, they're always holding back, making it all harder than it should be, the proprietary/patent prenatalist protectors".
  • Gannon - Wednesday, June 18, 2008

    Some of us enjoy the technical stuff even if we don't fully understand it. I think it would be great if one could link to articles / books / references on the web that would enable one to look into it on one's own time and understand it.

    I know reading your articles I come across terms and I think "If only I had a link to look further into this".

    No doubt on a GPU most people are interested in gaming performance and whether it's worth their $; that's what the majority of the market wants to know.

    Most people do not have an interest in technical minutiae; they care as much about GPU design or architecture as they do what kind of butter knife they use. They don't care about how the knife was made; all they want to know is: Does it get the job done at a price that is affordable?
  • skiboysteve - Wednesday, June 18, 2008

    good book over the stuff from a microprocessor architecture class ...
  • DerekWilson - Wednesday, June 18, 2008

    the book ...

    that's where it's at baby. hennessy and patterson really need to tackle GPU architecture, but if you start with CPUs you'll definitely be in a position to understand GPUs as well.

    i'd say if you want to learn more, check out the above book and look into graphics programming introductions. i prefer opengl, but to be fair i haven't done anything with dx10 yet.

    i would love to link concepts to things ... but that'd generate quite a bit of traffic to wikipedia (since it'd take a significant amount of time for us to do it all ourselves), but they really aren't even the best source for people who want to learn and don't already mostly understand what's happening ...
