GF100’s Gaming Architecture

Although it feels like ages ago, NVIDIA announced the Fermi architecture back in September of 2009, focusing on the compute abilities of the GPU that would be GF100. Today’s announcement is about filling in the blanks: where does the graphics hardware fit into the design that NVIDIA revealed earlier?

As a quick refresher, the GF100 is composed of 512 SPs, which NVIDIA now calls CUDA cores. The CUDA cores are organized into blocks of 32 called Streaming Multiprocessors (SMs), which in turn are laid out so that 4 of them are in a Graphics Processing Cluster (GPC). Finally, there are 4 GPCs in a single GF100 chip. 32 x 4 x 4 = 512 CUDA cores.
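The hierarchy above is simple multiplication, and a short sketch makes it easy to check. The constant and function names here are our own, chosen for illustration; only the counts come from NVIDIA's description.

```python
# Unit counts as described for GF100; the names are illustrative, not NVIDIA's.
CORES_PER_SM = 32   # CUDA cores in one Streaming Multiprocessor
SMS_PER_GPC = 4     # SMs in one Graphics Processing Cluster
GPCS_PER_CHIP = 4   # GPCs in a full GF100


def total_sms():
    """Number of SMs on a full chip."""
    return SMS_PER_GPC * GPCS_PER_CHIP


def total_cuda_cores():
    """Number of CUDA cores on a full chip."""
    return CORES_PER_SM * total_sms()


print(total_sms(), total_cuda_cores())  # prints: 16 512
```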

| | GF100 | GT200x2 (GTX 295) | GT200 (GTX 285) | G92 (9800+ GTX) |
|---|---|---|---|---|
| Stream Processors | 512 | 2 x 240 | 240 | 128 |
| Texture Address / Filtering | 64 / 256 | 2 x 80 / 80 | 80 / 80 | 64 / 64 |
| ROPs | 48 | 2 x 28 | 32 | 16 |
| Core Clock | ? | 576MHz | 648MHz | 738MHz |
| Shader Clock | ? | 1242MHz | 1476MHz | 1836MHz |
| Memory Clock | ? GDDR5 | 999MHz (1998MHz data rate) GDDR3 | 1242MHz (2484MHz data rate) GDDR3 | 1100MHz (2200MHz data rate) GDDR3 |
| Memory Bus Width | 384-bit | 2 x 448-bit | 512-bit | 256-bit |
| Frame Buffer | ? | 2 x 896MB | 1GB | 512MB |
| Transistor Count | 3B | 2 x 1.4B | 1.4B | 754M |
| Manufacturing Process | TSMC 40nm | TSMC 55nm | TSMC 55nm | TSMC 55nm |
| Price Point | $? | $500 | $400 | $150 - 200 |

When NVIDIA first unveiled GF100, we did not know anything about the ROPs, texture units, or any of the fixed-function graphics units that are customary in a GPU. Today we have that information, and can discuss where it all goes.

While GF100 resembles GT200 in a number of ways as a compute GPU, as a gaming GPU it’s very close to being a complete departure from GT200. The big change here is that the single block of fixed-function hardware as we know it is gone. Virtually every piece of hardware has been split up and moved down a level, and can now be found as part of a GPC or an SM. NVIDIA’s GF100 block diagram does a good job of showing this.

What’s left of the fixed function hardware is the GigaThread Engine, which is what NVIDIA is calling their scheduler. To understand where our fixed function hardware went, we must meet the newest execution units: The PolyMorph Engine and the Raster Engine.

We’ll start with the Raster Engine. The Raster Engine is the combination of all the raster related hardware that we would have found previously in the fixed function pipeline. It does edge/triangle setup, rasterization, and z-culling in a pipelined manner. Each GPC has its own Raster Engine. NVIDIA has very little to say about the Raster Engine, but very little has changed from GT200 besides the fact that there are now 4 of them (one for each GPC) as opposed to 1 larger block. Each rasterizer can do 8 pixels per clock, for a total of 32 pixels per clock over the entirety of GF100.
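As a quick check on the rasterization figure, and on how it would scale with GPC count, here is a minimal sketch. The per-engine rate and the one-engine-per-GPC layout come from the text above; the function name and the 2-GPC configuration are hypothetical, used only to illustrate the scaling.

```python
# 8 pixels per clock per Raster Engine, one Raster Engine per GPC (from the text).
PIXELS_PER_CLOCK_PER_RASTER_ENGINE = 8


def raster_pixels_per_clock(gpc_count):
    """Peak rasterized pixels per clock for a chip with gpc_count GPCs."""
    return gpc_count * PIXELS_PER_CLOCK_PER_RASTER_ENGINE


print(raster_pixels_per_clock(4))  # full GF100: prints 32
print(raster_pixels_per_clock(2))  # hypothetical 2-GPC part: prints 16
```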

The much more important of the two new execution units is the PolyMorph Engine, which is what NVIDIA is calling the execution unit that handles geometry for GF100. The PolyMorph Engine is responsible for handling Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output. It’s here where the Tessellator is located, one of the biggest changes that DX11 is bringing to GPU design. Unlike the Raster Engine, each SM has a PolyMorph Engine, giving the GF100 16 PolyMorph Engines altogether.

While the PolyMorph Engine may sound simple in its description, don’t let it fool you. NVIDIA didn’t just move their geometry hardware to a different place, clone it 15 times, and call it a day. This was previously fixed-function hardware where a single unit sat in a pipeline and did its share of the work. By splitting up the fixed-function pipeline like this, NVIDIA in actuality created a lot of work for themselves. Why? Out of order execution.

OoO is something we usually reserve for CPUs, where high-end CPUs are built to execute instructions out of order in order to extract more performance out of them through instruction level parallelism. OoO is very hard to accomplish, because you can only execute certain instructions ahead of other ones while maintaining the correct result for your data. Execute an add instruction that relies on a previous operation before that’s done, and you have problems. GF100 isn’t a full OoO design, so we’re not going to cover OoO in-depth here, but if you’d like to know more please see this article.

At any rate, internally each PolyMorph Engine is still a simple in-order design. NVIDIA hasn’t gone so far as to make a PolyMorph Engine an OoO design – but because there are 16 of them when there used to be just 1, OoO hazards can occur just as they would in an OoO executing CPU. NVIDIA now has to keep track of what each PolyMorph Engine is doing with respect to the other 15, and put the brakes on any of them that get too far ahead in order to maintain the integrity of results.

To resolve the hazards of OoO, GF100 has a private communication channel just for the PolyMorph Engines that allows them to stay together on a task in spite of being spread apart. The fact of the matter is that all of the work that goes into making a design like this work correctly is an immense amount of effort, and NVIDIA’s engineers are quite proud of this effort. They have taken the monolithic design of prior GPUs’ geometry units, and made it parallel. We can’t overstate how much of an engineering challenge this is.
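NVIDIA has not described how that communication channel works, so the following is not GF100's mechanism. It is only a generic sketch of the underlying problem: work finishes on parallel units in an arbitrary order, yet results must be released in the order the work was submitted. The technique shown is a plain reorder buffer, and the function name and the `(sequence number, result)` tuple format are our own assumptions.

```python
import heapq


def retire_in_order(completions):
    """Yield results in submission order.

    `completions` is an iterable of (sequence_number, result) pairs arriving
    in whatever order the parallel units happened to finish. Sequence numbers
    are assumed to be unique and to start at 0.
    """
    pending = []   # min-heap of finished work that cannot be released yet
    next_seq = 0   # the next sequence number allowed to retire
    for seq, result in completions:
        heapq.heappush(pending, (seq, result))
        # Release everything that is now contiguous with what was retired.
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1


# Unit finishing order: 2, 0, 1, 3. Retirement order is still 0, 1, 2, 3.
finished = [(2, "tri2"), (0, "tri0"), (1, "tri1"), (3, "tri3")]
print(list(retire_in_order(finished)))  # prints ['tri0', 'tri1', 'tri2', 'tri3']
```

Note that a result which finishes early simply waits in the buffer; real hardware has finite buffering, which is why units that get too far ahead have to be stalled.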

However all of this work came at a cost, and not just the significant engineering resources NVIDIA threw at GF100. The other cost was time – we believe that the PolyMorph Engine is the single biggest reason that GF100 didn’t make it out last year. It’s the single biggest redesign of any component in GF100, and is something that NVIDIA had to start virtually from scratch on. When NVIDIA told us that designing a big GPU is hard, this is what they had in mind.

Now why did NVIDIA put themselves through all of this? Because in their eyes, they had to. The use of a fixed-function pipeline in their eyes was a poor choice given the geometric complexity that a tessellator would create, and hence the entire pipeline needed to be rebalanced. By moving to the parallel design of the PolyMorph Engine, NVIDIA’s geometry hardware is no longer bound by any limits of the pipelined fixed-function design (such as bottlenecks in one stage of the pipeline), and for better or for worse, they can scale their geometry and raster abilities with the size of the chip. A smaller GF100 derivative will not have as many PolyMorph or Raster units as GF100, and as a result won’t have the same level of performance; G92 derivatives and AMD’s designs both maintain the same fixed function pipeline through all chips, always offering the same level of performance.

Speaking of performance, we’ll get to that in a bit, but for now we’ll finish our discussion of GF100’s graphics units. Each SM has 4 texture units, which have changed somewhat from the GT200. In GT200, the texture units were at the TPC level; here they are per SM. The texture units are also affected by the cache reorganization of GF100; for L1 cache they still have the same 12KB of texture cache per set of 4 texture units, while they all share GF100’s 768KB L2 cache, which is now used by all execution units and not just textures. Each unit can compute 1 texture address and fetch 4 texture samples per clock, a significantly higher texture fetch rate than on the GT200. Furthermore, in conjunction with DX11, NVIDIA’s texture units now support DX11’s new compressed texture formats. Finally, texture units are now tied to the shader clock and not the core clock. They run at 1/2 the shader clock speed of GF100.
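The per-unit texture rates above multiply out to the 64 / 256 figures quoted for GF100. A small sketch, with names of our own choosing and the counts taken from the article:

```python
# Counts from the article; names are illustrative.
SM_COUNT = 16                  # 4 GPCs x 4 SMs
TEXTURE_UNITS_PER_SM = 4
ADDRESSES_PER_UNIT_PER_CLOCK = 1
SAMPLES_PER_UNIT_PER_CLOCK = 4


def texture_rates():
    """Return (texture units, addresses per clock, samples per clock)."""
    units = SM_COUNT * TEXTURE_UNITS_PER_SM
    return (
        units,
        units * ADDRESSES_PER_UNIT_PER_CLOCK,
        units * SAMPLES_PER_UNIT_PER_CLOCK,
    )


print(texture_rates())  # prints (64, 64, 256)
```

These are per-clock figures in the texture units' own clock domain, which runs at half the shader clock.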

Last but not least, this brings us to the ROPs. The ROPs have been reorganized: there are now 48 of them in 6 partitions of 8, with a 64-bit memory channel serving each partition. The ROPs now share the L2 cache with the rest of GF100, while under GT200 they had their own L2 cache. Each ROP can do 1 regular 32-bit pixel per clock, 1 FP16 pixel over 2 clocks, or 1 FP32 pixel over 4 clocks, giving the GF100 the ability to retire 48 regular pixels per clock. The ROPs are clocked together with the L2 cache.
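The ROP layout also ties directly to the 384-bit memory bus, and the per-format rates follow from the clocks-per-pixel figures. A minimal sketch of that arithmetic; the names and the format labels are ours, and the FP16 and FP32 totals are simply derived from the per-ROP rates quoted above:

```python
# Layout and per-ROP rates from the article; names are illustrative.
ROP_PARTITIONS = 6
ROPS_PER_PARTITION = 8
CHANNEL_WIDTH_BITS = 64  # one memory channel per ROP partition

# Clocks one ROP needs to retire one pixel of each format.
CLOCKS_PER_PIXEL = {"int32": 1, "fp16": 2, "fp32": 4}


def total_rops():
    return ROP_PARTITIONS * ROPS_PER_PARTITION


def memory_bus_width_bits():
    return ROP_PARTITIONS * CHANNEL_WIDTH_BITS


def pixels_per_clock(pixel_format):
    """Peak pixels retired per ROP clock across the whole chip."""
    return total_rops() // CLOCKS_PER_PIXEL[pixel_format]


print(total_rops(), memory_bus_width_bits())            # prints: 48 384
print([pixels_per_clock(f) for f in CLOCKS_PER_PIXEL])  # prints: [48, 24, 12]
```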

That leaves us on a final note: clocks. The core clock has been virtually done away with on GF100, as almost every unit now operates at or on a fraction of the shader clock. Only the ROPs and L2 cache operate on a different clock, which is best described as what’s left of the core clock. The shader clock now drives the majority of the chip, including the shaders, the texture units, and the new PolyMorph and Raster Engines. Specifically, the texture units, PolyMorph Engine, and Raster Engine all run at 1/2 shader clock (which NVIDIA is tentatively calling the "GPC Clock"), while the L1 cache and the shaders themselves run at the full shader clock. Don’t be surprised if GF100 overclocking is different from GT200 overclocking as a result.
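To make the clock relationships concrete, here is a small sketch that derives the dependent domains from a shader clock. NVIDIA has not announced GF100's clocks, so the 1400MHz input is purely a placeholder of ours, as are the function and key names. The ROP/L2 clock is independent of the shader clock and so is left out.

```python
def derive_clocks(shader_mhz):
    """Derive the shader-linked clock domains described in the article."""
    return {
        # Shaders and L1 cache run at the full shader clock.
        "shaders_and_l1_mhz": shader_mhz,
        # Texture units, PolyMorph Engines and Raster Engines run at half
        # the shader clock (the tentatively named "GPC Clock").
        "gpc_mhz": shader_mhz / 2,
    }


# 1400MHz is a made-up placeholder, not an announced GF100 clock.
print(derive_clocks(1400.0))
```

Because the GPC clock is a fixed fraction of the shader clock in this scheme, raising the shader clock would raise texture, geometry and raster rates along with it.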


115 Comments

  • dentatus - Monday, January 18, 2010

    Absolutely. Really, the GT200/RV700 generation of DX10 cards was inarguably 'won' (i.e most profitable) for AMD/ATI by cards like the HD4850. But the overall performance crown (i.e highest in-generation performance) was won off the back of the GTX295 for nvidia.

    But I agree with chizow that nvidia has ultimately been "winning" (the performance crown) each generation since the G80.
  • chizow - Monday, January 18, 2010

    Not sure how you can claim AMD "inarguably" won DX10 with 4850 using profits as a metric. How many times did AMD turn a profit since RV770 launched? Zero. They've posted 12 straight quarters of losses last time I checked. Nvidia otoh has turned a profit in many of those quarters and most recently Q3 09 despite not having the fastest GPU on the market.

    Also, the fundamental problem people don't seem to understand with regard to AMD and Nvidia die size and product distribution is that they overlap completely different market segments. Again, this simply serves as a referendum in the differences in their business models. You may also notice these differences are pretty similar to what AMD sees from Intel on the CPU side of things....

    Nvidia GT200 die go into all high-end and mainstream parts like GTX 295, 285, 275, 260 that sell for much higher prices. AMD RV770 die went into 4870, 4850, and 4830. The latter two parts were competing with Nvidia's much cheaper and smaller G92 and G96 parts. You can clearly see that the comparison between die/wafer sizes isn't a valid one.

    AMD has learned from this btw, and this time around it looks like they're using different die for their top tier parts (Cypress) and their lower tier parts (Redwood, Cedar) so that they don't have to sell their high-end die at mainstream prices.
  • Stas - Tuesday, January 19, 2010

    [quote]Not sure how you can claim AMD "inarguably" won DX10 with 4850 using profits as a metric. How many times did AMD turn a profit since RV770 launched? Zero. They've posted 12 straight quarters of losses last time I checked. Nvidia otoh has turned a profit in many of those quarters and most recently Q3 09 despite not having the fastest GPU on the market. [/quote]
    AMD also makes CPUs... they also lost market due to Intel's high end domination... they lost money on ATI... If it wasn't for success of the HD4000 series, AMD would've been in deep shit. Just think before you post.
  • Calin - Tuesday, January 19, 2010

    Hard to make a profit paying the rates of a 5 billion credit - but if you want to take it this way (total profits), why wouldn't we take total income?
    AMD/ATI:
    PERIOD ENDING 26-Sep-09 27-Jun-09 28-Mar-09 27-Dec-08
    Total Revenue 1,396,000 1,184,000 1,177,000 1,227,000
    Cost of Revenue 811,000 743,000 666,000 1,112,000
    Gross Profit 585,000 441,000 511,000 115,000

    NVidia
    PERIOD ENDING 25-Oct-09 26-Jul-09 26-Apr-09 25-Jan-09
    Total Revenue 903,206 776,520 664,231 481,140
    Cost of Revenue 511,423 619,797 474,535 339,474
    Gross Profit 391,783 156,723 189,696 141,666

    Not looking so good for the "winner of the generation", though. As for the die size and product distribution, all I'm looking at is the retail video card offer, and every price bracket I choose have both NVidia and AMD in it.
  • knutjb - Wednesday, January 20, 2010

    You missed my point. I wasn't talking about AMD as a whole I was talking about ATI as a division within AMD. If a company bleeds that much and still survives some part of the company must be making some money and that is the ATI division. ATI is making money. Your macro numbers mean zip.

    The model ATI is using is putting out competitive cards from a company, AMD, that is bleeding badly. What generation card is easier to sell the new and improved one with more features, useful or not, or the last generation chip?
  • beck2448 - Tuesday, January 19, 2010

    Those numbers are ludicrous. AMD hasn't made a profit in years. ATI's revenue is about 30% of Nvidia's.
  • knutjb - Monday, January 18, 2010

    ATI is what has been floating AMD with its profits. ATI has decided to make smaller incremental developmental steps that lower end production costs.

    Nvidia takes a long time to create a monolithic monster that required massive amounts of capital to develop. They will not recoup this investment off gamers alone because most don't have that much cash to put one of those cards in their machines. It is needed for marketing so they can push lower level cards implying superiority, real or not, they are a heavy marketing company. This chip is directed at their GPU server market and that is where they hope to make their money hoping it can do both really well.

    ATI on the other hand by making smaller steps, but at a higher cycle of product development, have focused on the performance/mainstream market. With lower development costs they can turn out new cards that payback development costs back quicker allowing them to put that capital back into new products. Look at the 4890 and 4870. They both share similar architecture but the 4890 is a more refined chip. It was a product that allowed ATI to keep Nvidia reacting to ATI's products.

    Nvidia's marketing requires them to have the fastest card on the market. ATI isn't trying to keep the absolute performance crown but hold onto the price/performance crown. Every time they put out a slightly faster card it forces Nvidia to respond. Nvidia receives lower profits from having to drop card prices. I don't think this chip will be able to function on the 8800 model because AMD/ATI is now on stronger financial footing than they have been in the past couple years and Nvidia being late to market is helping ATI line their pockets with cash. The 5000 series is just marginally better, but is better than Nvidia's current offerings.

    Will Nvidia release just a single high end card or several tiers of cards to compete across the board? I don't think one card will really help the bottom line over the longer term.
  • StormyParis - Monday, January 18, 2010

    I'm not sure what "winning" means, nor, really what a generation is.

    you can win on highest performance, highest marketshare, highest profit, best engineering...

    a generation may also be a DirectX iteration, a chip release cycle (in which case, each manufacturer has its own), a fiscal year...

    Anyhoo, I don't really care, as long as i'm regularly getting better, cheaper cards. I'll happily switch back to nVidia
  • chizow - Monday, January 18, 2010

    I clearly defined what I considered a generation, historically the rest of the metrics measured over time (market share, mind share, profits, value-add features, game support) tend to follow suit.

    For someone like you that doesn't care about who's winning a generation it should be simple enough, buy whatever is best that suits your price:performance requirements when you're ready to buy.

    For those who want to make an informed decision once every 12-16 months per generation to avoid those niggling uncertainties and any potential buyer's remorse, they would certainly want to consider both IHV's offerings before making that decision.
  • Ahmed0 - Monday, January 18, 2010

    How can you "win" if your product isnt intended for a meaningful number of customers. Im sure ATi could pull out the biggest, most expensive, hottest and fastest card in the world as well but theres a reason why they dont.

    Really, the performance crown isnt anything special. The title goes from hand to hand all the time.
