GF110: Fermi Learns Some New Tricks

We’ll start our in-depth look at the GTX 580 with a look at GF110, the new GPU at the heart of the card.

There have been rumors about GF110 for some time now, and while they were never very specific, it was obvious NVIDIA would have to follow up GF100 with something similar to it on 40nm to carry them through the rest of the process’s lifecycle. So for some time now we’ve been speculating on what we might see in GF100’s follow-up part – an outright bigger chip was unlikely given GF100’s already large die size, but NVIDIA has a number of tricks they can use to optimize things.

Many of those tricks we’ve already seen in GF104, and had you asked us a month ago what we thought GF110 would be, we would have guessed some kind of fusion of GF104 and GF100. Primarily our bet was on the 48 CUDA core SM making its way over to a high-end part, bringing with it GF104’s higher theoretical performance and enhancements such as superscalar execution and additional special function and texture units per SM. What we got wasn’t quite what we were imagining – GF110 is much more heavily rooted in GF100 than GF104, but that doesn’t mean NVIDIA hasn’t learned a trick or two.



GF100/GF110 Architecture

Fundamentally GF110 is the same architecture as GF100, especially when it comes to compute. 512 CUDA cores are divided up among 4 GPCs, and in turn each GPC contains 1 raster engine and 4 SMs. At the SM level each SM contains 32 CUDA cores, 16 load/store units, 4 special function units, 4 texture units, 2 warp schedulers with 1 dispatch unit each, 1 PolyMorph unit (containing NVIDIA’s tessellator), and then the 48KB+16KB L1 cache, registers, and other glue that binds an SM together. At this level NVIDIA relies on thread-level parallelism (TLP) to keep a GF110 SM occupied with work. Attached to this are the ROPs and L2 cache, with 768KB of L2 cache serving as the gatekeeper between the SMs and the 6 64-bit memory controllers. Ultimately GF110’s compute performance per clock remains unchanged from GF100 – at least compared to a hypothetical GF100 part with all of its SMs enabled.
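The unit counts above multiply out straightforwardly; here is a quick sketch in Python (the constant names are ours, purely for illustration):

```python
# GF110 unit hierarchy as described: 4 GPCs x 4 SMs, 32 cores per SM.
GPCS = 4
SMS_PER_GPC = 4
CORES_PER_SM = 32
TEX_UNITS_PER_SM = 4

sms = GPCS * SMS_PER_GPC                # 16 SMs total
cuda_cores = sms * CORES_PER_SM         # 512 CUDA cores
texture_units = sms * TEX_UNITS_PER_SM  # 64 texture units

print(sms, cuda_cores, texture_units)   # 16 512 64
```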

On the graphics side however, NVIDIA has been hard at work. They did not port over GF104’s shader design, but they did port over GF104’s texture hardware. Previously with GF100, each texture unit could compute 1 texture address and fetch 4 32bit/INT8 texture samples per clock, 2 64bit/FP16 texture samples per clock, or 1 128bit/FP32 texture sample per clock. GF104’s texture units improved this to 4 samples/clock for both 32bit and 64bit formats, and it’s these texture units that have been brought over to GF110. GF110 can now do 64bit/FP16 filtering at full speed versus half speed on GF100, and this is the first of the two major steps NVIDIA took to increase GF110’s clock-for-clock performance over GF100.

NVIDIA Texture Filtering Speed (Per Texture Unit)
               GF110           GF104           GF100
32bit (INT8)   4 Texels/Clock  4 Texels/Clock  4 Texels/Clock
64bit (FP16)   4 Texels/Clock  4 Texels/Clock  2 Texels/Clock
128bit (FP32)  1 Texel/Clock   1 Texel/Clock   1 Texel/Clock
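The table translates directly into peak filtering throughput: per-unit rate times unit count times clock. A small sketch (the 700MHz clock is an illustrative figure of ours, not a spec from the text):

```python
# Per-texture-unit filtering rates from the table (texels/clock).
rates = {
    "GF110": {"INT8": 4, "FP16": 4, "FP32": 1},
    "GF104": {"INT8": 4, "FP16": 4, "FP32": 1},
    "GF100": {"INT8": 4, "FP16": 2, "FP32": 1},
}

def texel_rate(gpu, fmt, tex_units, clock_mhz):
    """Peak filtered texels/second = per-unit rate x units x clock."""
    return rates[gpu][fmt] * tex_units * clock_mhz * 1e6

# Full-speed FP16 on GF110 is 2x GF100 at equal unit counts and clocks.
ratio = texel_rate("GF110", "FP16", 64, 700) / texel_rate("GF100", "FP16", 64, 700)
print(ratio)  # 2.0
```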

Like most optimizations, the impact of this one is going to be felt more on newer games than older games. Games that make heavy use of 64bit/FP16 texturing stand to gain the most, while older games that rarely (if at all) used 64bit texturing will gain the least. Also note that while 64bit/FP16 texturing has been sped up, 64bit/FP16 rendering has not – the ROPs still need 2 cycles to digest 64bit/FP16 pixels, and 4 cycles to digest 128bit/FP32 pixels.
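The ROP limitation can be expressed the same way: pixel throughput per clock falls off with the number of cycles each format takes to digest. A sketch (the 48-ROP count and the 1-cycle INT8 rate are our illustrative assumptions, not figures from the text):

```python
# Cycles per pixel per ROP, per render target format; FP16 and FP32
# rendering are unchanged from GF100 (2 and 4 cycles respectively).
rop_cycles = {"INT8": 1, "FP16": 2, "FP32": 4}

def pixels_per_clock(rops, fmt):
    """Peak ROP pixel throughput per clock for a given format."""
    return rops / rop_cycles[fmt]

print(pixels_per_clock(48, "FP16"))  # 24.0 - half the INT8 rate
print(pixels_per_clock(48, "FP32"))  # 12.0 - a quarter of it
```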

It’s also worth noting that this means the texture:compute ratio split between NVIDIA’s two designs remains. Compared to GF100, GF104 doubled up on texture units while only increasing the shader count by 50%; the net result was that per SM, 32 texels were processed for every 96 instructions computed (seeing as how the shader clock is 2x the base clock), giving us a 1:3 ratio. GF100 and GF110 on the other hand retain the 1:4 (16:64) ratio. Ultimately at equal clocks GF104 and GF110 differ widely in shading performance, but with 64 texture units total in both designs, both have equal texturing performance.
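The per-SM arithmetic behind those ratios can be sketched out directly from the figures above:

```python
# Texture:compute ratio per SM: texels fetched per base clock vs
# shader ops per base clock (the hot clock runs at 2x the base clock).
def tex_compute_ratio(tex_units, texels_per_unit, cores):
    texels = tex_units * texels_per_unit
    ops = cores * 2  # two hot-clock ops per base clock per core
    return texels, ops

print(tex_compute_ratio(8, 4, 48))  # GF104 SM: (32, 96) -> 1:3
print(tex_compute_ratio(4, 4, 32))  # GF100/GF110 SM: (16, 64) -> 1:4
```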

Moving on, GF110’s second trick is brand-new to GF110, and it goes hand-in-hand with NVIDIA’s focus on tessellation: improved Z-culling. As a quick refresher, Z-culling is a method of improving GPU performance by throwing out pixels that will never be seen early in the rendering process. By comparing the depth and transparency of a new pixel to existing pixels in the Z-buffer, it’s possible to determine whether that pixel will be seen or not; pixels that fall behind other opaque objects are discarded rather than rendered any further, saving on compute and memory resources. GPUs have had this feature for ages, and after a spurt of development early last decade under branded names such as HyperZ (AMD) and Lightspeed Memory Architecture (NVIDIA), Z-culling hasn’t been promoted in great detail since then.
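The core idea of Z-culling can be captured in a toy depth test – a pixel that falls behind an opaque pixel already in the Z-buffer is rejected before any shading work is spent on it (entirely our simplified construction; real hardware works on groups of pixels):

```python
# Toy Z-cull: smaller depth = closer to the camera in this convention.
def z_cull(z_buffer, x, y, candidate_depth):
    """Return True if the pixel survives, False if it is culled."""
    return candidate_depth < z_buffer[y][x]

z_buffer = [[0.5, 1.0],
            [1.0, 0.2]]
print(z_cull(z_buffer, 0, 0, 0.3))  # True  - closer than 0.5, keep it
print(z_cull(z_buffer, 1, 1, 0.9))  # False - behind 0.2, cull it
```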


Z-Culling In Action: Not Rendering What You Can't See

For GF110 this is changing somewhat as Z-culling is once again being brought back to the surface, although not with the zeal of past efforts. NVIDIA has improved the efficiency of the Z-cull units in their raster engine, allowing them to retire additional pixels that were not caught in the previous iteration of their Z-cull unit. Without getting too deep into details, internal rasterizing and Z-culling take place in groups of pixels called tiles; we don’t believe NVIDIA has reduced the size of their tiles (which Beyond3D estimates at 4x2); instead we believe NVIDIA has done something to better reject individual pixels within a tile. NVIDIA hasn’t come forth with too many details beyond the fact that their new Z-cull unit supports “finer resolution occluder tracking”, so this will have to remain a mystery for another day.
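NVIDIA hasn’t detailed the mechanism, but the coarse-versus-fine distinction we’re speculating about can be illustrated with a toy 4x2 tile (this example is entirely our construction, not NVIDIA’s implementation):

```python
# Coarse Z-cull rejects a tile only when every pixel in it is behind
# the occluder; finer per-pixel tracking can still discard individual
# pixels in tiles that coarse culling must let through.
def coarse_tile_cull(tile_depths, occluder_depth):
    return all(d >= occluder_depth for row in tile_depths for d in row)

def fine_pixel_cull(tile_depths, occluder_depth):
    return [[d >= occluder_depth for d in row] for row in tile_depths]

# A 4x2 tile where one pixel (0.3) pokes out in front of the occluder:
tile = [[0.9, 0.9, 0.9, 0.3],
        [0.9, 0.9, 0.9, 0.9]]
print(coarse_tile_cull(tile, 0.5))                # False - whole tile passes
print(sum(map(sum, fine_pixel_cull(tile, 0.5))))  # 7 pixels still culled
```

A small triangle touching a tile tends to leave most of that tile's pixels occluded, which is why per-pixel rejection pays off most with heavy tessellation.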

In any case, the importance of this improvement is that it’s particularly weighted towards small triangles, which are fairly rare in traditional rendering setups but can be extremely common in heavily tessellated scenes. In other words, improving the Z-cull unit primarily serves to improve tessellation performance by allowing NVIDIA to better reject pixels on small triangles. This should offer some benefit even in games with fewer, larger triangles, but as framed by NVIDIA the benefit is likely less pronounced.

In the end these are probably the most aggressive changes NVIDIA could make in such a short period of time. Considering the GF110 project really only kicked off in earnest in February, NVIDIA only had around half a year to tinker with the design before it had to be taped out. As GPUs get larger and more complex, the amount of tweaking that can get done inside such a short window is going to continue to shrink – and this is a far cry from the days when we used to get major GPU refreshes inside of a year.

159 Comments


  • mac2j - Tuesday, November 09, 2010 - link

    Actually the new ATI naming makes a bit more sense.

    It's not a new die shrink, but the 6xxx cards all share some features not found at all in the 5xxx series, such as DisplayPort 1.2 (which could become very important if 120 and 240Hz monitors ever catch on).

    Also the Cayman 69xx parts are in fact a significantly original design relative to the 58xx parts.

    Nvidia to me is the worst offender ... because a 580 is just a fully-enabled 480 with the noise and power problems fixed.
  • Sihastru - Tuesday, November 09, 2010 - link

    If you think that stepping up the spec on the output ports warrants skipping a generation when naming your product, see that mini-HDMI port on the 580 – it's HDMI 1.4 compliant... the requirements for 120Hz displays are met.

    The GF110 is not a GF100 with all the shaders enabled. It only looks that way to the uninitiated. GF110 has much more in common with GF104.

    GF110 has three types of transistors, graded by leakage, while the GF100 has just two. This gives you the ability to clock the core higher while having a lower TDP. It is smaller in size than GF100, while maintaining the 40nm fab node. The GTX580 has a power draw limitation system on the board; the GTX480 does not...

    What else... support for full-speed FP16 texture filtering, which enhances performance in texture-heavy applications. New tile formats which improve Z-cull efficiency...

    So how does DisplayPort 1.2 warrant the 68x0 name for AMD but the few changes above do not warrant the 5x0 name for nVidia?

    I call BS.
  • Griswold - Wednesday, November 10, 2010 - link

    I call your post bullshit.

    The 580 comes with the same old video engine as the GF100 – if it were so close to GF104, it would have that video engine and all the goodies and improvements it brings over the one in the 480 (and 580).

    No, the GTX580 is a fixed GF100, and most of what you listed supports that, because it fixes what was broken with the 480. That's all.
  • Sihastru - Wednesday, November 10, 2010 - link

    I'm not sure what you mean... maybe you're right... but I'm not sure... If you're referring to bitstreaming support, just wait for a driver update, the hardware supports it.

    See: http://www.guru3d.com/article/geforce-gtx-580-revi...

    "What is also good to mention is that HDMI audio has finally been solved. The stupid S/PDIF cable to connect a card to an audio codec, to retrieve sound over HDMI is gone. That also entails that NVIDIA is not bound to two channel LPCM or 5.1 channel DD/DTS for audio.

    Passing on audio over the PCIe bus brings along enhanced support for multiple formats. So VP4 can now support 8 channel LPCM, lossless format DD+ and 6 channel AAC. Dolby TrueHD and DTS Master Audio bit streaming are not yet supported in software, yet in hardware they are (needs a driver update)."

    NEVER rely just on one source of information.

    Fine, if a more powerful card than the GTX480 can't be named the GTX580, then why is it OK for a card slower than the HD5870 to be named the HD6870... screw technology, screw refinements, talk numbers...

    Whatever...
  • Ryan Smith - Wednesday, November 10, 2010 - link

    To set the record straight, the hardware does not support full audio bitstreaming. I had NV themselves confirm this. It's only HDMI 1.4a video + the same audio formats that GTX 480 supported.
  • B3an - Wednesday, November 10, 2010 - link

    You can all argue all you want, but at the end of the day, for marketing reasons alone, NV really didn't have much of a choice but to name this card the 580 instead of 485 after ATI gave their cards the 6xxx series names. Which don't deserve a new series name either.
  • chizow - Tuesday, November 09, 2010 - link

    No, ATI's new naming convention makes no sense at all. Their x870 designation has always been reserved for their single-GPU flagship part ever since the HD3870, and this naming convention has held true through both the HD4xxx and HD5xxx series. But the 6870 clearly isn't the flagship of this generation; in fact, it's slower than the 5870, while the 580 is clearly faster than the 480 in every aspect.

    To further complicate matters, ATI also launched the 5970 as a dual-GPU part, so single-GPU Cayman being a 6970 will be even more confusing, and it will also undoubtedly be slower than the 5970 in all titles that have working CF profiles.

    If anything, Cayman should be the 5890 and Barts the 5860, but as we've seen from both camps, marketing names are often inconvenient and short-sighted when they are originally designated......
  • Galid - Tuesday, November 09, 2010 - link

    We're getting into philosophy there. Know what a sophism is? An argument that seems strong but isn't, because there's a flaw in it. The new 2011 Honda isn't necessarily better than the 2010 just because it's newer.

    They name it differently because it's changed and they want you to believe it's better, but history has proved that's not always the case. So the argument that a newer generation means better is a false one. Not everything new has to be better in every way to live up to its name.

    But it's my opinion.
  • Galid - Tuesday, November 09, 2010 - link

    It seems worse, but that rebranding is all OK in my mind since the 6870 comes in at a cheaper price than the 5870. So everyone can be happy about it. Nvidia did worse, rebranding some of the 8xxx series into 9xxx chips for a higher price with almost no changes and no more performance. The 9600GT comes to mind...

    What is the 9xxx series? A remake of a ''better'' 8xxx series. What is the GTS 3xx series? A remake of the GTX 2xx series. What is the GTX 5xx... and so on. Who cares? If it's priced well, it's all OK. When I see someone going to Staples to get a 9600GT at $80 when I know I can get a 4850 for almost the same price, I say WTF!!!

    The GTX580 deserves whatever name they want to give it. Whether anyone wants to make sense of all that naming is up to them. But whoever pays, say, $100 for a card should get performance according to that, and that seems more important than everything else to me!
  • Taft12 - Tuesday, November 09, 2010 - link

    In this article, Ryan does exactly what you are accusing him of not doing! It is you who needs to be asked WTF is wrong.
