Many people, especially in huge articles like the GT200 launch article, skip over the very text-heavy pages I tend to write, and I know my flair for low-level technical detail is not something everyone enjoys.
In this recent foray into GPU architecture guesswork, we spent some time speculating about G80 and GT200 SP pipeline depth. Our guess was 8 stages, based on the depth of other architectures at that speed and the potential for wasted power with very deep pipelines. It turns out that we may have guessed way too low on this one (Anand: ahem, actually someone came up with 15).
One of our readers, Denis Riedijk, pointed us to NVIDIA's own forums and CUDA programming guide. These sources reveal that properly hiding instruction latency requires 6 active warps per SM. The math on this comes out to an effective latency of 24 cycles before a warp can be scheduled to process the next instruction in its instruction stream. Each warp takes 4 cycles to process in an SM (4 threads from a warp are processed on each of the 8 SPs), and 6 * 4 is 24. You can also look at it as 6 warps * 32 threads/warp = 192 threads, and 192 threads / 8 SPs = 24 threads per SP; at a throughput of 1 instruction per cycle per SP, that is 24 cycles.
My first thought was that their scheduling hardware might not be able to handle scheduling fewer warps fast enough, or that the way they manage local memory might require a delay for some other reason to cover read-after-write dependencies. But reading through the threads Denis pointed us to really seems to indicate that it might just be pipeline depth that gives us this latency. From NVIDIA's Mark Harris in one of the threads:
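To make the arithmetic above easy to poke at, here is a minimal Python sketch of both ways of arriving at 24 cycles. The function names are made up for illustration, and the model assumes the SM simply rotates through its active warps in order, which is our simplification rather than anything NVIDIA has documented.

```python
# Figures quoted in the article for one G80/GT200 SM.
THREADS_PER_WARP = 32
SPS_PER_SM = 8


def cycles_per_warp():
    """Cycles an SM needs to push one instruction through a whole warp."""
    return THREADS_PER_WARP // SPS_PER_SM


def cycles_covered(active_warps):
    """Cycles that pass before the first warp gets its next turn.

    Assumes simple round-robin issue across the active warps
    (an illustrative assumption, not a documented scheduling policy).
    """
    return active_warps * cycles_per_warp()


def threads_per_sp(active_warps):
    """Threads each SP has to work through for one instruction per warp."""
    return active_warps * THREADS_PER_WARP // SPS_PER_SM


print(cycles_per_warp())   # 4
print(cycles_covered(6))   # 24
print(threads_per_sp(6))   # 24
```

Both routes land on the same number because they are the same division done in a different order: 6 * (32 / 8) versus (6 * 32) / 8.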
"The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc,) for a whole warp."
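Harris's two numbers line up with the 6-warp recommendation if you round up to whole warps. The sketch below is only a back-of-the-envelope check: the helper name is hypothetical, and it assumes each active warp contributes exactly 4 clocks of cover.

```python
import math


def min_warps_to_hide(latency_clocks, clocks_per_warp):
    """Smallest whole number of warps whose issue time covers the latency."""
    return math.ceil(latency_clocks / clocks_per_warp)


warps = min_warps_to_hide(22, 4)   # 22 / 4 = 5.5, rounded up to whole warps
threads = warps * 32               # 32 threads per warp

print(warps)      # 6
print(threads)    # 192
print(warps * 4)  # 24 clocks covered, slightly more than the ~22 needed
```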
There's also an indication of the size of G80/GT200's SP register file in the CUDA forums. Harris mentions that one way of hiding ALU latency is by ensuring at most 25% of the available register space is in use, or 42 registers per thread. If 42 registers is a quarter of what is available, that would put G80 at 168 registers per thread, or GT200, at double that, at 336 registers per thread.
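The register figures follow from scaling the 25% number back up to 100%. A quick sketch of that step, with a hypothetical helper name, and with the factor of two for GT200 taken as an assumption that matches the 336 figure quoted above:

```python
def implied_total(registers_in_use, percent_in_use):
    """Scale a partial register count up to the full budget it implies."""
    return registers_in_use * 100 // percent_in_use


g80_registers = implied_total(42, 25)

# Assumption: GT200 has twice G80's register space, per the 336 figure above.
gt200_registers = 2 * g80_registers

print(g80_registers)    # 168
print(gt200_registers)  # 336
```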
Which brings us to a broader point. NVIDIA is going to have to give CUDA developers more detail in order for them to use the hardware effectively. Certainly we don't believe Intel gives away as much technical detail as it does out of sheer benevolence: developers need to know the details in order to get the most out of their code, and this is more and more true as you reach up into the HPC markets that NVIDIA is targeting. Companies that pay hundreds of thousands for compute clusters aren't interested in just throwing away compute power: they want and need to get the most out of every cycle.
While I do hope that NVIDIA will continue to move in the direction of giving us more detail, or at least of giving us the detail they are already publicly sharing with developers, we certainly do have a better idea of where to look when we want low-level technical information now. Looks like I'm going to have to sit down and start writing all those CUDA apps I've been putting off.