Many people skip over the very text-heavy pages I tend to write, especially in huge articles like the GT200 launch piece, and especially since I have a flair for low-level technical detail that not everyone enjoys.

In that recent foray into GPU architecture guesswork, we spent some time speculating about G80 and GT200 SP pipeline depth. Our guess was 8 stages, based on the pipeline depths of other architectures running at similar clock speeds and on the power that very deep pipelines waste. It turns out we may have guessed way too low on this one (Anand: ahem, actually someone came up with 15).

One of our readers, Denis Riedijk, pointed us to NVIDIA's own forums and CUDA programming guide. These sources reveal that properly hiding instruction latency requires 6 active warps per SM, which works out to an effective latency of 24 cycles before a warp can be scheduled to process the next instruction in its instruction stream. Each warp takes 4 cycles to work through an SM (each of the 8 SPs handles 4 of the warp's 32 threads), and 6*4 = 24. You can also look at it this way: 6 warps * 32 threads/warp = 192 threads, 192 threads / 8 SPs = 24 threads per SP, and at a throughput of one instruction per cycle that comes out to 24 cycles.
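
To make that arithmetic concrete, here's a quick sketch of my own (not NVIDIA's code) that just spells out the division; the inputs are the figures quoted above: 8 SPs per SM, 32 threads per warp, and 6 active warps to cover latency.

```cuda
// Back-of-the-envelope version of the latency math above. The inputs are
// the figures quoted from NVIDIA's forums and programming guide; the rest
// is just the division spelled out.
#include <cstdio>

int main() {
    const int sps_per_sm       = 8;   // scalar processors per SM (G80/GT200)
    const int threads_per_warp = 32;
    const int warps_to_hide    = 6;   // active warps needed to hide ALU latency

    // One warp-instruction occupies the SM's 8 SPs for 32/8 = 4 cycles.
    const int cycles_per_warp_instr = threads_per_warp / sps_per_sm;

    // Six warps round-robined at 4 cycles each = 24 cycles of covered latency.
    const int effective_latency = warps_to_hide * cycles_per_warp_instr;

    printf("cycles per warp-instruction: %d\n", cycles_per_warp_instr);   // 4
    printf("effective latency covered:   %d cycles\n", effective_latency); // 24
    return 0;
}
```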

My first thought was that their scheduling hardware might not be able to schedule fewer warps quickly enough, or that the way they manage local memory might require extra delay to cover read-after-write dependencies. But reading through the threads Denis pointed us to really seems to indicate that it might just be pipeline depth that gives us this latency. From NVIDIA's Mark Harris in one of the threads:

"The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc,) for a whole warp."

There's also an indication of the size of G80/GT200's SP register file in the CUDA forums. Harris mentions that one way of hiding ALU latency is to ensure that at most 25% of the available register space is in use, which he puts at 42 registers per thread. That would put G80 at 168 registers per thread, and GT200 (with its doubled register file) at 336.
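
For reference, the arithmetic behind those figures is nothing more than the following, taking Harris's 42-registers-at-25% comment at face value (the comments below go back and forth on whether register space is better described per thread or per multiprocessor):

```cuda
// The arithmetic behind the per-thread register figures above, taking the
// "42 registers is 25% of register space" comment at face value.
#include <cstdio>

int main() {
    const int regs_at_25_percent  = 42;
    const int g80_regs_per_thread = regs_at_25_percent * 4;       // 42 / 0.25 = 168
    const int gt200_regs_per_thread = g80_regs_per_thread * 2;    // GT200 doubles the register file

    printf("G80:   %d registers per thread\n", g80_regs_per_thread);
    printf("GT200: %d registers per thread\n", gt200_regs_per_thread);
    return 0;
}
```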

Here are some links to the relevant threads in case you guys want to read through them yourselves. It's definitely interesting stuff.

Which brings us to a broader point. NVIDIA is going to have to give CUDA developers more detail if it wants them to use the hardware effectively. Certainly we don't believe Intel gives away as much technical detail as it does out of benevolence: developers need those details to get the most out of their code, and that becomes ever more true as you reach up into the HPC markets NVIDIA is targeting. Companies that pay hundreds of thousands of dollars for compute clusters aren't interested in throwing away compute power: they want and need to get the most out of every cycle.

While I do hope NVIDIA will continue to move in the direction of giving us more detail, or at least the detail it is already publicly sharing with developers, we certainly have a better idea now of where to look when we want low-level technical information. Looks like I'm going to have to sit down and start writing all those CUDA apps I've been putting off.

Comments

  • IntelUser2000 - Saturday, June 21, 2008 - link

    Well, for one thing, Intel does a far better job than AMD/ATI/NVIDIA of documenting the various parts of its hardware. It's almost impossible to find exact current draw and TDP figures for ATI/NVIDIA chipsets.

    To each their own.
  • Denis Riedijk - Thursday, June 19, 2008 - link

    There are indeed 8192 (or 16384) registers per multiprocessor. So they are shared by all the warps running on that multiprocessor. Now comes an interesting part. Say you can have 10 warps (320 threads) running on a multiprocessor, and you have 2 warps per block (64 threads). Then you have 5 blocks per MP. And when a block is finished, it gets quickly replaced by the next block that needs to run.
    So the scheduler is constantly juggling threads around to keep the ALUs busy, and when a set of warps is done, it quickly fetches the next set to keep everything nice and warm.

    Some tests done by people on the CUDA forums indicate that this swapping in of new blocks happens very fast indeed.
  • soloman02 - Wednesday, June 18, 2008 - link

    As an ASEET degree holder and a BSEE student, I love these technical articles, even if programming isn't my thing. I used to go to Tom's Hardware, but their articles no longer include the technical stuff that makes us EEs, CEs, and other geeks all warm and fuzzy inside. So I come to Anand now.

    So keep up the good work, and don't sell out like Tom's Hardware did.
  • Ztx - Wednesday, June 18, 2008 - link

    "Some of us enjoy the technical stuff even if we don't fully understand it,"

    Yup, and we sometimes learn something geeky from them:)

    ----
    ^
    I agree with them. Keep writing the articles, Derek; they are VERY informative! :D
  • Aileur - Wednesday, June 18, 2008 - link

    I'm not so sure where you got your register count from.

    It is stated explicitly in the CUDA programming guide as 8192 per multiprocessor.

    As for the last comment about NVIDIA opening up, pretty much all the info needed to make the most out of the hardware is already present in the programming guide.
    NVIDIA also has a visual profiler that runs your code and profiles your occupancy and memory transactions (which are most of the time the bottleneck in kernels).
  • Aileur - Wednesday, June 18, 2008 - link

    Oh, and the way to hide latency is not to use more registers (as you seem to have hinted), but to use fewer. Since the number of registers per MP is fixed, the fewer you use, the more blocks can run on a given MP.

    When you have more blocks running, you can hide the latency better since you have a bigger pool of blocks to pick from.

    Or maybe we don't have the same definition of "register space". You might be referring to occupancy, or the number of active warps.
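
    (A minimal sketch of that trade-off, assuming the 8192-registers-per-MP figure from the programming guide and a hypothetical 64-thread block; only the register limit is modeled here.)

```cuda
// Sketch of the register/occupancy trade-off described above, assuming
// 8192 registers per multiprocessor (G80-class, per the programming guide)
// and a hypothetical 64-thread block. Only the register limit is modeled;
// real hardware also caps resident blocks/threads per MP.
#include <cstdio>

int main() {
    const int regs_per_mp       = 8192;
    const int threads_per_block = 64;

    for (int regs_per_thread = 10; regs_per_thread <= 40; regs_per_thread += 10) {
        // Fewer registers per thread -> more blocks fit -> more warps for the
        // scheduler to switch to while others wait out instruction latency.
        int blocks_per_mp = regs_per_mp / (regs_per_thread * threads_per_block);
        printf("%2d regs/thread -> %2d resident block(s) per MP\n",
               regs_per_thread, blocks_per_mp);
    }
    return 0;
}
```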
  • DerekWilson - Wednesday, June 18, 2008 - link

    yeah, sorry, the register space bit was something i forgot to put in originally and the update was a text message i sent to anand -- we got our lines crossed and it should have read as it reads now.

    which is to say that using 25% or less of your register space will help hide latency.

    ...

    on your original comment, the 8k registers are not physically available hardware resources. developers can use that many in software, but i can guarantee they'll be optimized out by compilers/assemblers and swapped into and out of memory when physical register space runs low.

    the comment in the thread really does suggest that the 42 registers make up 25% of the physical register file on G80. i suppose i could have misunderstood or harris could have been representing things wrong ...
  • Aileur - Wednesday, June 18, 2008 - link

    I don't know... it was always my understanding (from developing CUDA software) that there are 8192 registers per MP. That does sound like a ridiculously huge number of registers, though.

    That number is the basis for calculating the maximum number of threads in a thread block. The nvcc compiler can be asked to create a "cubin" file in which the number of registers needed by a kernel (per thread) is displayed. 8192 divided by that number gives the maximum number of threads that can be in a thread block. Exceed that number and the kernel will not launch, and a CUDA "invalid launch parameters" exception will be raised.

    Page 63 of the CUDA programming guide for the CUDA 2.0 beta gives a similar equation.

    Maybe you're right and there is some swapping magic occurring down the line, but that is not how I understood it.
  • Aileur - Wednesday, June 18, 2008 - link

    Sorry for replying to myself again!
    in http://forums.nvidia.com/index.php?showtopic=66238...
    if, as he says, you use 64 threads in a block with 42 registers per thread, and that number represents 25% of the total register space, that amounts to (64*42)/0.25 ≈ 10,752.
    not 8192, but still the same order of magnitude.
  • DerekWilson - Wednesday, June 18, 2008 - link

    no problem at all ... reply to yourself all you want :-)

    and that's an interesting point ... i was thinking of register space per thread, but i was going on about how context is per warp myself, which would put register space per warp rather than per thread anyway -- it makes sense that threads in a warp would share register space.

    if you multiply my number by 64 you get yours ... which makes sense as he was talking about 64 thread blocks ...

    and super insane numbers of registers do make sense when you realize that register space is defined per warp too ...

    my numbers should still be right on a per thread basis though ...

    i have to finish reading through the cuda manuals and guides and see if i can't start talking to nvidia tech support rather than PR :-)
