A More Efficient Architecture

GPUs, like CPUs, work on streams of instructions called threads. While high-end CPUs work on as many as 8 complicated threads at a time, GPUs handle many more threads in parallel.

The table below shows just how many threads each generation of NVIDIA GPU can have in flight at the same time:

                        Fermi    GT200    G80
Max Threads in Flight   24576    30720    12288

Fermi can't actually support as many threads in parallel as GT200. NVIDIA found that the majority of compute cases were bound by shared memory size, not thread count in GT200. Thus thread count went down, and shared memory size went up in Fermi.
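Each total in the table is the number of SMs multiplied by the threads each SM can keep resident. Here's a minimal sketch of that arithmetic in Python. The SM counts and per-SM thread limits are assumptions taken from NVIDIA's published specs for the top part of each generation; they are not numbers given in this article.

```python
# SM counts and per-SM thread limits are assumptions from NVIDIA's published
# specifications for the largest chip of each generation.
GPUS = {
    "G80":   {"sms": 16, "threads_per_sm": 768},
    "GT200": {"sms": 30, "threads_per_sm": 1024},
    "Fermi": {"sms": 16, "threads_per_sm": 1536},
}


def max_threads_in_flight(gpu):
    """Total resident threads: SM count times the per-SM thread limit."""
    spec = GPUS[gpu]
    return spec["sms"] * spec["threads_per_sm"]


for name in GPUS:
    print(name, max_threads_in_flight(name))
# G80 12288
# GT200 30720
# Fermi 24576
```

Under these assumptions, Fermi keeps more threads resident per SM than GT200 but has far fewer SMs, which is why the chip-wide total drops.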

NVIDIA groups 32 threads into a unit called a warp (taken from the weaving term warp, referring to a group of parallel threads). In GT200 and G80, half of a warp was issued to an SM every clock cycle. In other words, it took two clocks to issue a full 32 threads to a single SM.

In previous architectures, the SM dispatch logic was closely coupled to the execution hardware. If you sent threads to the SFU, the entire SM couldn't issue new instructions until those instructions were done executing. If the only execution units in use were in your SFUs, the vast majority of your SM in GT200/G80 went unused. That's terrible for efficiency.

Fermi fixes this. There are two independent dispatch units at the front end of each SM in Fermi. These units are completely decoupled from the rest of the SM. Each dispatch unit can select and issue half of a warp every clock cycle. The threads can be from different warps in order to optimize the chance of finding independent operations.
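To put rough numbers on the front end, here's a small Python sketch of peak issue rate per SM. It takes the half-warp-per-clock figure from the text; modeling GT200/G80 as a single issue path and Fermi as two is a simplifying assumption, and the function names are made up for illustration.

```python
WARP_SIZE = 32
HALF_WARP = WARP_SIZE // 2  # each issue path handles half a warp per clock


def threads_issued_per_clock(dispatch_units):
    """Peak threads an SM's front end can issue in one clock."""
    return dispatch_units * HALF_WARP


def warp_instructions_issued(dispatch_units, clocks):
    """Peak whole-warp instructions issued over a number of clocks."""
    return threads_issued_per_clock(dispatch_units) * clocks // WARP_SIZE


# Assumption: one issue path per SM for GT200/G80, two for Fermi.
for name, units in (("GT200/G80", 1), ("Fermi", 2)):
    print(name, threads_issued_per_clock(units),
          warp_instructions_issued(units, clocks=8))
# GT200/G80 16 4
# Fermi 32 8
```

This is a peak figure only. It ignores whether the scheduler can actually find two independent warps to issue from.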

There's a full crossbar between the dispatch units and the execution hardware in the SM. Each unit can dispatch threads to any group of units within the SM (with some limitations).

The inflexibility of NVIDIA's threading architecture is that every thread in the warp must be executing the same instruction at the same time. If they are, then you get full utilization of your resources. If they aren't, then some units go idle.
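The cost of threads in a warp going different ways can be estimated with a toy model. The Python sketch below assumes the usual SIMT behavior, where each path taken by at least one thread is run once for the whole warp with the other lanes masked off. The article only says that some units go idle, so treat the serialization model, the function name, and the path lengths as assumptions.

```python
from collections import Counter

WARP_SIZE = 32


def warp_utilization(thread_paths, path_lengths):
    """Fraction of lane-slots doing useful work for one warp.

    thread_paths: one path label per thread (32 entries).
    path_lengths: number of instructions on each path, keyed by label.
    """
    if len(thread_paths) != WARP_SIZE:
        raise ValueError("a warp has exactly 32 threads")
    counts = Counter(thread_paths)
    # Every path taken by at least one thread occupies all 32 lanes.
    slots = sum(path_lengths[p] for p in counts) * WARP_SIZE
    # Only the threads actually on a path do useful work during it.
    useful = sum(path_lengths[p] * n for p, n in counts.items())
    return useful / slots


lengths = {"if": 10, "else": 10}  # hypothetical instruction counts
print(warp_utilization(["if"] * 32, lengths))                 # 1.0
print(warp_utilization(["if"] * 24 + ["else"] * 8, lengths))  # 0.5
```

With two equally long paths, any split at all halves utilization in this model, no matter how few threads take the second path.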

A single SM can execute:

Fermi           FP32   FP64   INT   SFU   LD/ST
Ops per clock   32     16     32    4     16

If you're executing FP64 instructions, the entire SM can only run at 16 ops per clock. You can't dual issue FP64 and SFU operations.
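One way to read the per-SM rates is as the number of clocks needed to push a full 32-thread warp through each type of unit. The Python sketch below does that division using the ops-per-clock figures from the table. It is a throughput estimate only, and it ignores latency, pipelining and dispatch limits.

```python
WARP_SIZE = 32

# Per-SM throughput for Fermi, taken from the table above.
OPS_PER_CLOCK = {"FP32": 32, "FP64": 16, "INT": 32, "SFU": 4, "LD/ST": 16}


def clocks_per_warp(op):
    """Clocks for one SM to run a single instruction of this type for all 32 threads."""
    return WARP_SIZE // OPS_PER_CLOCK[op]


for op in OPS_PER_CLOCK:
    print(op, clocks_per_warp(op))
# FP32 1
# FP64 2
# INT 1
# SFU 8
# LD/ST 2
```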

The good news is that the SFU doesn't tie up the entire SM anymore. One dispatch unit can send 16 threads to the array of cores, while another can send 16 threads to the SFU. After two clocks, the dispatchers are free to send another pair of half-warps out again. As I mentioned before, in GT200/G80 the entire SM was tied up for a full 8 cycles after an SFU issue.
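The difference between the coupled and decoupled designs can be shown with a toy timing model. In the Python sketch below, the coupled SM stalls on every instruction, so busy times add up, while the decoupled SM lets each unit type drain its own queue in parallel. The instruction mix and durations are illustrative placeholders, and the decoupled model assumes fully independent instructions and no dispatch limits.

```python
def coupled_clocks(instructions):
    """The SM blocks on every instruction, so busy times simply add up."""
    return sum(clocks for _unit, clocks in instructions)


def decoupled_clocks(instructions):
    """Each unit type drains its own queue; the slowest queue sets the total."""
    per_unit = {}
    for unit, clocks in instructions:
        per_unit[unit] = per_unit.get(unit, 0) + clocks
    return max(per_unit.values(), default=0)


# Hypothetical mix: four issues to the cores (2 clocks each) and one SFU
# operation (8 clocks). These durations are placeholders, not measurements.
work = [("cores", 2)] * 4 + [("sfu", 8)]
print(coupled_clocks(work), decoupled_clocks(work))  # 16 8
```

In this example the SFU work hides entirely behind the core work once the two are decoupled, which is the best case.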

The flexibility is nice, or rather, the inflexibility of GT200/G80 was horrible for efficiency and Fermi fixes that.

415 Comments

  • SiliconDoc - Thursday, October 1, 2009 - link

    Sweet ! Nice pick, looks like carbon fiber at the bracket end.

    Wowzie, a real honker based on THOUSANDS OF DOLLARS of tech and core per part.

    I feel SO PRIVILEGED to have a chance at the gaming segment version, all that massive power jammed into a gaming card !

    Whoo! U P S C A L E !
  • justaviking - Thursday, October 1, 2009 - link

    Look at every bright area of high contrast. All the spotlight reflections have a red ring around them. So the thumb, in front of the highly reflective gold connectors, also has the same halo effect. I think that it's as much evidence of a digital camera as it is Photoshop manipulation.

    With that said, it could also be a non-functional mock-up. Holding a mock-up or prototype in your hand is not the same as benchmarking a production (ready for consumer release) product.
  • papapapapapapapababy - Thursday, October 1, 2009 - link

    look at those irregular borders closely (above the watch). also, the shadows (finger) are off. that's a (terrible) shop.
  • v1001 - Thursday, October 1, 2009 - link

    All they did was blacken out the background more. Probably was more noise and distraction going on that they didn't want in there.
  • justaviking - Thursday, October 1, 2009 - link

    OK, so assuming it's a fake (and I'm not saying it isn't), I have three questions:

    1) Where did you get the photo?
    2) Why do it? (And "Who did it?", but that's closely related to Q1.)
    3) Where did they get the photo of the hardware, which they then put into the person's hand?

    Combining #2 and #3) If the card is from a real photo of real hardware, then what was the value of photoshopping it into someone's hand?

    I'm not trying to argue, just trying to understand.
  • papapapapapapapababy - Thursday, October 1, 2009 - link

    more fakes! source: bit-tech ( this one is even "better")

    http://i34.tinypic.com/34inz9j.jpg


    also, not mine ( from xnews)

    http://img28.imageshack.us/img28/2883/tesafilm.png
  • papapapapapapapababy - Thursday, October 1, 2009 - link

    also below the card... what's that sloppy white trim in the middle of a shadow? JAaAAA
  • UNCjigga - Thursday, October 1, 2009 - link

    Seriously? I have a 1080p monitor and Radeon 4670 with UVD2, but my PS3 with 1080p output to the same monitor looks MUCH better at upscaling DVDs (night and day difference.) PowerDVD does have a better upscaling tech, but that's using software decoding. Can somebody port ffdshow/libmpeg2 for CUDA and ATI Stream (or DirectCompute?) kthxbye
  • Pastuch - Thursday, October 1, 2009 - link

    I buy two videocards per year on average. I've owned an almost equal number of ATI/Nvidia cards. I loved my geforce 8800 GTX despite it costing a fortune but since then it's been ALL downhill. I've had driver issues with home theater PCs and Nvidia drivers. I've been totally disappointed with Nvidia's performance with high def audio formats. The fact that the entire ATI 48xx line can do 7.1 audio pass-through while only a handful of Nvidia videocards can even do 5.1 audio passthrough is just sad. The world is moving to hometheater gaming PCs and Nvidia is dragging arse.

    The fact that 5850 can do bitstreaming audio for $250 RIGHT NOW and is the second fastest 1 GPU solution for gaming makes it one hell of a product in my eyes. You no longer need an Asus Xonar or Auzentech soundcard, saving me $200. Hell with the money I saved I could almost buy a SECOND 5850! Let's see if the new Nvidia cards can do bitstreaming... if they can't then Nvidia won't be getting any more of my money.

    P.S. Thanks Anand for inspiring me to build the hometheater of my dreams. Gaming on a 110 Inch screen is the future!
  • SiliconDoc - Thursday, October 1, 2009 - link

    Well that's very nice, and since this has been declared the home of "only game fps and bang for that buck" matters, and therefore PhysX, ambient occlusion, CUDA, and other nvidia advantages, and your "outlier" htpc desires are WORTHLESS according to the home crowd, I guess they can't respond without contradicting themselves, so I will considering I have always supported added value, and have been attacked for it.
    --
    Yes, throw out your $200 sound cards, or sell them, and plop that heat monster into the tiny unit, good luck. Better spend some on after market cooling, or the raging videocard fan sound will probably drive you crazy. So another $100 there.
    Now the $100 you got for the used soundcard is gone.
    I also wonder what sound chip you're going to use then when you aren't playing a movie or whatever, I suppose you'll use your motherboard sound chip, which might be a lousy one, and definitely is lousier than the Auzentech you just sold or tossed.
    So how exactly does "passthrough" save you a dime ?
    If you're going to try to copy Anand's basement theatre projection, I have to wonder why you wouldn't use the digital or optical output of the high end soundcard... or your motherboard's, if indeed it has a decent soundchip on it, which isn't exactly likely.
    -
    Maybe we'll all get luckier, and with TESLA like massive computing power, we'll get an NVIDIA Blu-ray dvd movie player converter that runs on the holy grail of the PhysX haters, openCL and/or direct compute, and you'll have to do with the better sound of your add on sound cards, anyway, instead of using a videocard as a transit device.
    I can't imagine "cable management" as an excuse either, with a 110" curved screen home built theater room...
    ---
    Feel free to educate me.
