Original Link: https://www.anandtech.com/show/1568




Introduction

Imagine if support for current generation graphics technology didn't require spending more than $79. Sure, performance wouldn't be good at all, and resolution would be limited to the lower end. But the latest games would all run with the latest features. All the excellent water effects in Half-Life 2 would be there. Far Cry would run in all its SM 3.0 glory. Any game coming out until a good year into the DirectX 10 timeframe would run (albeit slowly) feature-complete on your impressively cheap card.

A solution like this isn't targeted at the hardcore gamer, but at the general purpose user. This is the solution that keeps people from buying hardware that's obsolete before they get it home. The idea is that being cheap doesn't need to translate to being "behind the times" in technology. This gives casual consumers the ability to see what having a "real" graphics card is like. Games will look much better running on a full DX9 SM 3.0 part that "supports" 128MB of RAM (we'll talk about that later) than on an Intel integrated solution. Shipping higher volume with cheaper cards and getting more people into gaming translates to raising the bar on the minimum requirements for game developers. The sooner NVIDIA and ATI can get current generation parts into the game-buying world's hands, the sooner all game developers can write games for DX9 hardware at a base level rather than as an extra.

In the past, we've seen parts like the GeForce 4 MX, which was just a repackaged GeForce 2. Even today, we have the X300 and X600, which are based on the R3xx architecture but share the naming convention of the R4xx. It really is refreshing to see NVIDIA take a stand and create a product lineup that can run games the same way from the top of the line to the cheapest card out there (the only difference being speed and the performance hit of applying filtering). We hope (if this part ends up doing well and finding a good price point for its level of performance) that NVIDIA will continue to maintain this level of continuity through future chip generations. We hope that ATI will follow suit with their lineup next time around. Relying on previous generation high end parts to fulfill current low end needs is not something that we want to see over the long term.

We've actually already taken a look at the part that NVIDIA will be bringing out in two new flavors. The 3 vertex/4 pixel/2 ROP GeForce 6200 that came out only a couple of months ago is being augmented by two lower performance versions, both bearing the moniker GeForce 6200 with TurboCache.



It's passively cooled, as we can see. The single memory module of this board is peeking out from beneath the heatsink on the upper right. NVIDIA has indicated that a higher performance version of the 6200 with TurboCache will follow to replace the current shipping 6200 models. Though this is better than announcing non-existent parts such as the X700 XT, we would rather not see short-lived products hit the market. In the end, such anomalies only serve to waste the time of NVIDIA's partners and confuse customers.

For now, the two parts that we can expect to see will be differentiated by their memory bandwidth. The part priced at "under $129" will be a "13.6 GB/s" setup, while the "under $99" card will sport "10.8 GB/s" of bandwidth. Both will have core and memory clocks of 350/350. The interesting part is the bandwidth figure. In both cases, 8 GB/s of that bandwidth comes from the PCI Express bus. For the 10.8 GB/s part, the extra 2.8 GB/s comes from 16MB of local memory connected on a single 32-bit channel running at a 700MHz data rate. The 13.6 GB/s version of the 6200 with TurboCache simply gets an extra 32-bit channel with another 16MB of RAM. We've seen pictures of boards with 64MB of onboard RAM, pushing bandwidth up even further. We don't know when we'll see a 64MB product ship, or what the pricing would look like.
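For the curious, those headline numbers fall straight out of the math. Here's a quick sketch using only the figures quoted above; actual sustained throughput will, of course, be lower than these theoretical peaks:

```python
# Back-of-the-envelope TurboCache bandwidth figures, using the quoted specs.
PCIE_X16_GBPS = 8.0  # 4 GB/s up + 4 GB/s down over PCI Express x16

def local_bandwidth_gbps(bus_width_bits, data_rate_mhz):
    """Local memory bandwidth: bus width (in bytes) times effective data rate."""
    return (bus_width_bits / 8) * data_rate_mhz * 1e6 / 1e9

# 16MB card: one 32-bit channel at a 700MHz data rate
print(local_bandwidth_gbps(32, 700) + PCIE_X16_GBPS)  # 10.8 GB/s
# 32MB card: two 32-bit channels (64-bit total) at 700MHz
print(local_bandwidth_gbps(64, 700) + PCIE_X16_GBPS)  # 13.6 GB/s
```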

So, to put it all together, either 112 or 96 MB of framebuffer is stored in system RAM and accessed via the PCI Express bus. Local graphics RAM holds the front buffer (what's currently on screen) and other high priority (low latency) data. If more memory is needed than is available locally, it is allocated dynamically from system RAM. The local graphics memory that is not set aside for high priority tasks is then used as a sort of software-managed cache. And thus, the name of the product is born.

The new technology here is the ability to write directly from the GPU to system RAM. We've been able to perform reads from system RAM for quite some time, though technologies like AGP texturing were slow and never delivered on their promises. With a few exceptions, the GPU is able to see system RAM as a normal framebuffer, which is very impressive for PCI Express and current memory technology.

But it's never that simple. There are some very interesting problems to deal with when using system RAM as a framebuffer; this is not simply a driver-based software solution. The foremost and ever pressing issue is latency. Going from the GPU, across the PCI Express bus, through the memory controller, into system RAM, and all the way back is a very long round trip. Considering that graphics cards are used to having instant access to data, something is going to have to give. And sure, the PCI Express bus may offer 8 GB/s (4 up and 4 down, less in terms of actual utilization), but we are only going to get 6.4 GB/s out of the RAM. And that's assuming zero CPU utilization of memory and nothing else going on in the system other than what we're doing with the graphics card.
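To put rough numbers on that, here's an illustrative calculation. The 6.4 GB/s figure is the theoretical peak of dual channel DDR400 (what our test platform uses); the CPU share is a made-up fraction purely to show how quickly the GPU's slice of system memory shrinks:

```python
# Illustrative only: the PCI Express link can move 8 GB/s total (4 each way),
# but the GPU's slice of system RAM is capped by the memory itself. Dual
# channel DDR400 tops out at 6.4 GB/s, and the CPU draws from the same pool.
DDR400_DUAL_CHANNEL_GBPS = 6.4
PCIE_LINK_GBPS = 8.0

def gpu_visible_system_bandwidth(cpu_share):
    """Bandwidth left for the GPU once the CPU takes a (hypothetical) cut."""
    leftover = DDR400_DUAL_CHANNEL_GBPS * (1 - cpu_share)
    return min(PCIE_LINK_GBPS, leftover)  # whichever is lower is the real cap

print(gpu_visible_system_bandwidth(0.00))  # 6.4 GB/s -- best case, idle CPU
print(gpu_visible_system_bandwidth(0.25))  # 4.8 GB/s -- CPU using a quarter of RAM bandwidth
```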

Let's take a closer look at why anyone would want to use system RAM as a framebuffer, and how NVIDIA has tried to solve the problems that lie within.

UPDATE: We received an email from NVIDIA updating us on a change that they have made to the naming of their TurboCache products. It seems that they have listened to us and are including physical memory sizes on marketing/packaging. Here's what the product names will look like:

GeForce 6200 w/ TurboCache supporting 128MB, including 16MB of local TurboCache: $79
GeForce 6200 w/ TurboCache supporting 128MB, including 32MB of local TurboCache: $99
GeForce 6200 w/ TurboCache supporting 256MB, including 64MB of local TurboCache: $129
We were off on pricing a little bit, as the $129 figure we heard was actually for the 64MB/256MB part, and the 64-bit version we tested (which supports only 128MB) actually hits the price point we were looking for.




Architecting for Latency Hiding

RAM is cheap. But it's not free. Making and selling extremely inexpensive video cards means as few components as possible. It also means as simple a board as possible. Fewer memory channels, fewer connections, and less chance for failure. High volume parts need to work, and they need to be very cheap to produce. The fact that TurboCache parts can get by with either one or two 32-bit RAM chips is key to their cost effectiveness. This likely allows a board with fewer layers than higher end parts, which can have 256-bit connections to RAM.

But saving costs isn't the only motivation. NVIDIA could just stick 16MB on a board and call it a day. The problem that NVIDIA sees with this is the limitation that would be placed on the types of applications end users would be able to run. Modern games require framebuffers of about 128MB in order to run at 1024x768 and hold all the extra data they need. This includes things like texture maps, normal maps, shadow maps, stencil buffers, render targets, and everything else that can be read from or drawn into by a GPU. There are some games that just won't run unless they can allocate enough space in the framebuffer. And not running is not an acceptable solution. This brings us to the logical combination of cost savings and compatibility: expand the framebuffer by extending it over the PCI Express bus. But as we've already mentioned, main memory appears much "further" away than local memory, so it takes much longer to either get something from or write something to system RAM.
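As a rough illustration of why 16MB or 32MB of local RAM alone won't cut it, consider a hypothetical 1024x768 memory budget. The surface sizes below are simple arithmetic; the texture and render target figures are ballpark assumptions rather than measurements from any particular game:

```python
# A rough, hypothetical 1024x768 memory budget. Surface sizes are simple
# arithmetic; the texture/render-target numbers are ballpark assumptions.
def surface_mb(width, height, bytes_per_pixel=4):
    return width * height * bytes_per_pixel / 2**20

w, h = 1024, 768
front_buffer   = surface_mb(w, h)      # ~3 MB, 32-bit color
back_buffer    = surface_mb(w, h)      # ~3 MB
depth_stencil  = surface_mb(w, h)      # ~3 MB, 24-bit Z + 8-bit stencil
render_targets = 2 * surface_mb(w, h)  # assume a couple of render-to-texture surfaces
textures       = 80                    # ballpark: tens of MB of textures, normal maps, etc.

total = front_buffer + back_buffer + depth_stencil + render_targets + textures
print(total)  # ~95 MB -- far beyond what 16MB or 32MB of local RAM can hold
```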

Let's take a look at the NV4x architecture without TurboCache first, and then talk about what needs to change.



The most obvious thing that will need to be added in order to extend the framebuffer to system memory is a connection from the ROPs to system memory. This will allow the TurboCache-based chip to draw straight from the ROPs into things like back or stencil buffers in system RAM. We would also want to connect the pixel pipelines to system RAM directly. Not only do we need to read textures in normally, but we may also want to write out textures for dynamic effects.

Here's what NVIDIA has offered for us today:



The parts in yellow have been rearchitected to accommodate the added latency of working with system memory. The only part that we haven't talked about yet is the Memory Management Unit (MMU). This block handles the requests from the pixel pipelines and ROPs and acts as the interface to system memory from the GPU. In NVIDIA's words, it "allows the GPU to seamlessly allocate and de-allocate surfaces in system memory, as well as read and write to that memory efficiently." This unit works in tandem with a part of the ForceWare driver called the TurboCache Manager, which allocates and balances system and local graphics memory. The MMU provides the system level functionality and hardware interface from the GPU to the system and back, while the TurboCache Manager handles all the "intelligence" and "efficiency" behind the operations.

As far as memory management goes, the only thing that is always required to be in local memory is the front buffer. Other data often ends up being stored locally, but only the physical display itself is required to have known, reliable latency. The rest of local memory is treated as something like a framebuffer and a cache of system RAM at the same time. The exact architecture of this cache isn't clear - whether it's write-back, write-through, holds a copy of the most recently used 16MB or 32MB of data, or is something else altogether. The fact that system RAM can be dynamically allocated and deallocated down to nothing indicates that the role of local graphics memory as a cache is very adaptive.
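Since NVIDIA hasn't published the actual policy, the following is only a simplified, hypothetical sketch of how a TurboCache-Manager-style allocator might place surfaces: the front buffer is pinned locally, latency-sensitive surfaces stay local while space remains, and everything else spills to system RAM:

```python
# A simplified, hypothetical sketch of TurboCache-style surface placement.
# NVIDIA hasn't detailed the real policy; this only illustrates the idea that
# the front buffer is pinned locally and other surfaces are placed by priority.
class TurboCacheManagerSketch:
    def __init__(self, local_mb):
        self.local_free = local_mb
        self.placements = {}

    def allocate(self, name, size_mb, priority):
        """priority: lower number = more latency-sensitive."""
        if name == "front_buffer":
            # The front buffer must always live in local memory.
            self.local_free -= size_mb
            self.placements[name] = "local"
        elif priority == 0 and size_mb <= self.local_free:
            # Latency-critical surfaces go local while space remains...
            self.local_free -= size_mb
            self.placements[name] = "local"
        else:
            # ...everything else spills to system RAM over PCI Express.
            self.placements[name] = "system"
        return self.placements[name]

mgr = TurboCacheManagerSketch(local_mb=16)
print(mgr.allocate("front_buffer", 3, priority=0))   # local
print(mgr.allocate("z_buffer", 3, priority=0))       # local
print(mgr.allocate("texture_pool", 60, priority=1))  # system
```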

NVIDIA hasn't gone into the details of how their pixel pipes and ROPs have been rearchitected, but we have a few guesses.

It's all about pipelining and bandwidth. A GPU needs to draw hundreds of thousands of independent pixels every frame. This is completely different from a CPU, which is mostly built around dependent operations where one instruction waits on data from another. NVIDIA has touted the NV4x as a "superscalar" GPU, though they haven't quite gone into the details of how many pipeline stages it has in total. The trick to hiding latency is to make proper use of bandwidth by keeping a high number of pixels "in flight" rather than waiting for data to come back before starting work on the next pixel.

If our pipeline is long enough, our local caches are large enough, and our memory manager is smart enough, we get a cycle of processing pixels and reading/writing data over the bus such that we are always using bandwidth and never waiting. The only down time should be the time that it takes to fill the pipeline for the first frame, which wouldn't be too bad in the grand scheme of things.
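A back-of-the-envelope way to think about this is Little's Law: the number of pixels that must be in flight is roughly throughput multiplied by latency. The 2 ROP, 350MHz figures come from the part's own specs; the latency values below are placeholders, since the real round trip depends entirely on the chipset and memory controller:

```python
# Illustrative use of Little's Law: pixels "in flight" ~= throughput x latency.
# The 2-ROP, 350MHz numbers come from the 6200's specs; the round-trip latency
# figures are placeholders, not measured values.
ROPS = 2
CORE_MHZ = 350
peak_pixels_per_sec = ROPS * CORE_MHZ * 1e6  # ~700 Mpixels/s drawn to the framebuffer

def pixels_in_flight(round_trip_latency_ns):
    """How many pixels must be in progress to keep the pipeline busy."""
    return peak_pixels_per_sec * round_trip_latency_ns * 1e-9

print(pixels_in_flight(300))  # ~210 pixels if system RAM answers in 300ns
print(pixels_in_flight(600))  # ~420 pixels if the round trip doubles
```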

The big uncertainty is the chipset. There are no guarantees on when the system will get data back from system memory to the GPU. Worst case latency can be horrible. It's also not going to be easy to design around worst case latencies, as NVIDIA is a competitor to other chipset makers, who aren't going to want to share that kind of information. The bigger the guess at the latency that they need to cover, the more pixels NVIDIA needs to keep in flight. Basically, covering more latency is equivalent to increasing die size, so there is some point of diminishing returns for NVIDIA. But the larger the presence they have on the chipset side, the better off they'll be with their TurboCache parts as well. Even on the 915 chipset from Intel, bandwidth is limited across the PCI Express bus. Rather than a full 4GB/s down and 4GB/s up, Intel offers only 3GB/s down and 1GB/s up, leaving the TurboCache architecture with this:



Graphics performance is going to be dependent on memory bandwidth. The majority of the bandwidth that the graphics card has available comes across the PCI Express bus, and with a limited setup such as this, the 6200 TurboCache will see a performance hit. NVIDIA has reported seeing something along the lines of a 20% performance improvement by moving from a 3-down, 1-up architecture to a 4-down, 4-up system. We have yet to verify these numbers, but it is not unreasonable to imagine this kind of impact.
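For perspective, here's a rough comparison of the raw bandwidth available to the 64-bit TurboCache card in the two configurations. Raw bandwidth doesn't translate directly into frame rate, so this only frames NVIDIA's roughly 20% claim rather than verifying it:

```python
# Rough comparison of total bandwidth available to the 64-bit TurboCache card
# on a 3GB/s-down / 1GB/s-up chipset versus a full 4+4 PCI Express x16 link.
LOCAL_64BIT_GBPS = 5.6  # two 32-bit channels at a 700MHz data rate

limited = LOCAL_64BIT_GBPS + 3.0 + 1.0  # Intel 915-style link
full    = LOCAL_64BIT_GBPS + 4.0 + 4.0  # full 4 GB/s each way

print(limited, full, f"{full / limited - 1:.0%}")  # 9.6 vs 13.6 GB/s, ~42% more raw bandwidth
```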

The final thing to mention is system performance. NVIDIA only maps free idle pages from RAM using "an approved method for Windows XP." It doesn't lock anything down, and everything is allocated and freed on the fly. In order to support 128MB of framebuffer, 512MB of system RAM must be installed. With more system RAM, the card can map more framebuffer. NVIDIA chose 512MB as a minimum because PCI Express OEM systems are shipping with no less. Under normal 2D operation, as there is no need to store any more data than can fit in local memory, no extra system resources will be used. Thus, there should be no adverse system level performance impact from TurboCache.




The Test


Performance Test Configuration
Processor: AMD Athlon 64 4000+
Motherboard: ASUS nForce4
Video Cards: ATI Radeon X300
             ATI Radeon X300 SE
             NVIDIA GeForce 6200 TurboCache (32-bit)
             NVIDIA GeForce 6200 TurboCache (64-bit)
             NVIDIA GeForce 6200 (128-bit)
Video Drivers: ATI - Catalyst 4.11
               NVIDIA - 71.20




Doom 3 Performance

Under Doom 3, we see quite a performance hit. The fact that id Software makes heavy use of the stencil buffer for its shadowing means that there's more pressure on bandwidth per pixel produced than in other shader-heavy games. In Half-Life 2, for instance, we see less of a performance impact when moving from the old 128-bit 6200 to the new version.

Doom 3 Performance
Doom 3 Resolution Scaling

Our resolution scaling graph shows that the performance gap between the X300 and 6200 parts closes slightly as resolution increases. Running High Quality looks great even at 640x480, but we would suggest Medium Quality at 800x600 for playing the game.




Far Cry Performance

Under Far Cry, the TurboCache parts do as well as or better than their competition from ATI, this time at a lower price point. The 64-bit TurboCache part can't ever quite catch the 128-bit GeForce 6200 in performance, though.

Far Cry 1.3 Performance
Far Cry Resolution Scaling

With the exception of the X300 SE at higher resolutions, all of these cards seem to scale the same. The sweet spot for the higher class of cards is going to be 800x600 for Far Cry, but the low end parts are going to have to stick with 640x480 for smooth playability.




Half-Life 2 Performance

Here is the raw data we collected from our Half-Life 2 performance analysis. The data shows fairly consistent performance across all the levels we test.

Half-Life 2 1024x768 Performance (fps)
                           at_canals_08  at_coast_05  at_coast_12  at_prison_05  at_c17_12
GeForce 6200 (128-bit)         43.6         65.23        50.4         41.57        45.84
GeForce 6200 TC (64-bit)       38.26        61.81        48.2         38.92        42.11
Radeon X300                    34.3         57.54        39.14        32.62        40.69
GeForce 6200 TC (32-bit)       31.66        51.09        39.72        30.69        34.1
Radeon X300 SE                 29.59        52.42        34.91        30.55        37.01


The X300 SE and the new 32-bit TurboCache card are very evenly matched here. The original 6200 leads every time, but the TC versions do hold their own fairly well. The regular X300 isn't quite able to keep up with the 64-bit version of the 6200 TurboCache, especially in the more GPU-limited levels. This comes across in a higher average performance at high resolutions.

Again, unlike Doom 3, the TurboCache parts are able to keep up with the 128-bit 6200 part fairly well. This comes down to the amount of memory bandwidth required to process each pixel; HL2 is more evenly balanced between being GPU dependent and memory bandwidth dependent.
Half-Life 2 Average Performance
Half-Life 2 Resolution Scaling

We can see from the resolution scaling chart that at resolutions other than 1024x768, the competition between the X300 series and the 6200 TurboCache parts is a wash. It is impressive that all of these cards run HL2 at very playable framerates in all our tests.




Unreal Tournament 2004 Performance

Here, we see the GeForce 6200, the 64-bit 6200 TurboCache, and the X300 all doing well at the top end in UT2K4. The 32-bit TurboCache holds a good lead over the X300 SE on the low end of this performance test.

Unreal Tournament 2004 Performance
Unreal Tournament 2004 Resolution Scaling

The three high end parts run a very tight race across the board. Interestingly, the 32-bit and 64-bit TurboCache parts don't scale the same way from 640x480 to 800x600.




Final Words

On a technical level, we really like TurboCache. The design solves price/performance problems that have been around for quite a while. Making some real use of the bandwidth offered by the PCI Express bus is a promising move that we didn't expect to see happen this early on or in this fashion.

At launch, NVIDIA's marketing position could have been a little bit misleading. Since our review originally hit the web, the wording that will be on the packages for TurboCache products has gone through a bit of a change. The original positioning was centered around a setup along these lines:


NVIDIA has defined a strict set of packaging standards around which the GeForce 6200 with TurboCache supporting 128MB will be marketed. The boxes must carry text indicating that a minimum of 512MB of system RAM is necessary for the full 128MB of graphics RAM support. A disclosure of the actual amount of onboard RAM must be displayed as well, which is something that we strongly support. It is understandable that board vendors are nervous about how this marketing will go over, no matter what wording or information is included on the package. We feel that it's advantageous for vendors to have faith in the intelligence of their customers to understand the information given to them. The official names of the TurboCache boards will be:

GeForce 6200 w/ TurboCache supporting 128MB, including 16MB of local TurboCache
GeForce 6200 w/ TurboCache supporting 128MB, including 32MB of local TurboCache
GeForce 6200 w/ TurboCache supporting 256MB, including 64MB of local TurboCache

It is conceivable that game developers could want to use the bandwidth of PCI Express for their own concoctions. Depending on how smart the driver is and how tricky the developer is, this may prove somewhat at odds with a TurboCache chip. Reading from and writing to the framebuffer from the CPU hasn't been an option in the past, but as systems and processors become faster, there are some interesting things that can be done with this type of processing. We'll have to see if anyone comes up with a game that uses technology like this; TurboCache could either help or hurt such an approach.

The final topic we need to address with the new 6200 TurboCache part is price. We are seeing GeForce 6200 128-bit parts with 400MHz data rate RAM going for about $110 on Newegg. With NVIDIA talking about bringing the new 32MB 64-bit TurboCache part out at $99 and the 16MB 32-bit part out at $79, we see them right on target with price/performance. There will also be a 64MB 64-bit TC part (supporting 256MB) available for $129 coming down the pipeline at some point, though we don't have that part in our labs just yet.

When Anand initially reviewed the GeForce 6200 part, he mentioned that pricing would need to fall closer to $100 to be competitive. Now that it has, and we have admittedly lower performance parts coming out, we are glad to see NVIDIA pricing its new parts to match.

We will have to wait and see where street prices end up falling, but at this point, the 32-bit memory bus version of the 6200 with TurboCache is the obvious choice over the X300 SE. The 32MB 64-bit 6200 TC part is also the clear winner over the standard X300. When we get our hands on the 64MB version of the TurboCache part, we'll have to take another look at how the 128-bit 6200 stacks up at its current street price.

The GeForce 6200 with TurboCache supporting 128MB will be available in OEM systems in January. It won't be available for retail purchase online until sometime in February. Since this isn't really going to be an "upgrade part", but rather a PCI Express only card, it will likely sell to more OEM customers first anyway. As budget PCI Express components become more readily available to the average consumer, we may see these parts move off of store shelves, but the majority of sales are likely to remain OEM.

AGP versions of 6600 cards are coming down the pipe, but the 6200 is still slated to remain PCI Express only. As TurboCache parts require less in the way of local memory, and thus power, heat and size, it is only logical to conclude where they will end up in the near future.

At the end of the day, with NVIDIA's revised position on marketing, leadership over ATI in performance, and full support of the GeForce 6 series feature set, the 6200 with TurboCache is a very nice fit for the value segment. We are very impressed with what has been done with this edgy idea. The next thing that we're waiting to see is a working implementation of virtual memory for the graphics subsystem. The entire graphics industry has been chomping at the bit for that one for years now.
