Architecting for Latency Hiding

RAM is cheap, but it's not free. Making and selling extremely inexpensive video cards means as few components as possible and as simple a board as possible: fewer memory channels, fewer connections, and less chance of failure. High volume parts need to work, and they need to be very cheap to produce. The fact that TurboCache parts can get by with either one or two 32-bit RAM chips is key to their cost effectiveness. This probably also allows a board with fewer layers than higher end parts, which can have 256-bit connections to RAM.

But saving costs isn't the only motivation. NVIDIA could just stick 16MB on a board and call it a day. The problem NVIDIA sees with that approach is the limits it would place on the types of applications end users would be able to run. Modern games need on the order of 128MB of framebuffer to run at 1024x768 and hold all of the extra data they use. This includes things like texture maps, normal maps, shadow maps, stencil buffers, render targets, and everything else that can be read from or drawn into by a GPU. Some games simply won't run unless they can allocate enough space in the framebuffer, and not running is not an acceptable outcome. This brings us to the logical combination of cost savings and compatibility: expand the framebuffer by extending it over the PCI Express bus. But as we've already mentioned, main memory appears much "further" away than local memory, so it takes much longer to read from or write to system RAM.
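To give a rough sense of where that 128MB figure comes from, here's some back-of-the-envelope arithmetic. The per-buffer math follows directly from resolution and color depth; the asset sizes at the bottom are purely illustrative, not figures from NVIDIA.

```python
# Rough framebuffer budget at 1024x768 (our arithmetic, not NVIDIA's numbers).

def surface_mb(width, height, bytes_per_pixel):
    return width * height * bytes_per_pixel / (1024 * 1024)

res = (1024, 768)

front = surface_mb(*res, 4)   # 32-bit front buffer, ~3MB
back  = surface_mb(*res, 4)   # back buffer, ~3MB
depth = surface_mb(*res, 4)   # 24-bit depth + 8-bit stencil, ~3MB

print(f"color/depth buffers: {front + back + depth:.1f} MB")  # ~9MB

# The rest of the budget goes to assets and intermediate surfaces:
# texture maps, normal maps, shadow maps, and extra render targets.
# These numbers are hypothetical, just to show how the total climbs.
textures       = 90   # MB of resident textures
shadow_maps    = 16   # MB of shadow/render-to-texture surfaces
render_targets = 12   # MB of other offscreen targets

print(f"total: {front + back + depth + textures + shadow_maps + render_targets:.1f} MB")
```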

Let's take a look at the NV4x architecture without TurboCache first, and then talk about what needs to change.



The most obvious thing that needs to be added in order to extend the framebuffer to system memory is a connection from the ROPs to system memory. This allows a TurboCache-based chip to draw straight from the ROPs into things like back or stencil buffers in system RAM. We would also want to connect the pixel pipelines to system RAM directly: not only do textures need to be read in as usual, but we may also want to write out textures for dynamic effects.

Here's what NVIDIA has offered for us today:



The parts in yellow have been rearchitected to accommodate the added latency of working with system memory. The only part we haven't talked about yet is the Memory Management Unit (MMU). This block handles requests from the pixel pipelines and ROPs and acts as the GPU's interface to system memory. In NVIDIA's words, it "allows the GPU to seamlessly allocate and de-allocate surfaces in system memory, as well as read and write to that memory efficiently." This unit works in tandem with a part of the ForceWare driver called the TurboCache Manager, which allocates and balances system and local graphics memory. The MMU provides the system-level functionality and the hardware interface from GPU to system and back, while the TurboCache Manager handles all the "intelligence" and "efficiency" behind the operations.

As far as memory management goes, the only thing that is always required to be in local memory is the front buffer. Other data often ends up being stored locally as well, but only the surface being scanned out for display is required to have known, reliable latency. The rest of local memory is treated as something like a framebuffer and a cache of system RAM at the same time. It's not clear how this cache is architected - whether it's write-back or write-through, whether it holds a copy of the most recently used 16MB or 32MB, or whether it's something else altogether. The fact that system RAM can be dynamically allocated and deallocated all the way down to nothing indicates that local graphics memory's role as a cache is very adaptive.
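NVIDIA hasn't published the TurboCache Manager's actual policy, but the constraint described above can be sketched roughly as follows. The class name, heuristic, and numbers here are our own illustration, not NVIDIA's code; the only rule we know for certain is that the front buffer must stay local.

```python
# Illustrative placement sketch (our names and heuristic, not NVIDIA's).
# The one hard constraint: the front buffer, which is scanned out for
# display, must live in local memory where latency is known and reliable.
# Everything else can spill to system RAM reached over PCI Express.

LOCAL_MB = 16  # or 32, depending on the board

class TurboCacheManagerSketch:
    def __init__(self, local_mb=LOCAL_MB):
        self.local_free = local_mb

    def allocate(self, surface, size_mb):
        if surface == "front_buffer":
            # Scanout cannot tolerate unpredictable latency.
            self.local_free -= size_mb
            return "local"
        if size_mb <= self.local_free:
            # Prefer local memory while it lasts -- it is lower latency.
            self.local_free -= size_mb
            return "local"
        # Spill to system memory; the MMU maps free pages on demand and
        # releases them when the surface is freed.
        return "system"
```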

NVIDIA hasn't gone into the details of how their pixel pipes and ROPs have been rearchitected, but we have a few guesses.

It's all about pipelining and bandwidth. A GPU needs to draw hundreds of thousands of independent pixels every frame. This is completely different from a CPU, which is mostly built around dependent operations, where one instruction waits on data from another. NVIDIA has touted the NV4x as a "superscalar" GPU, though they haven't gone into the details of how many pipeline stages it has in total. The trick to hiding latency is to make proper use of bandwidth by keeping a high number of pixels "in flight" and not waiting for one pixel's data to come back before starting work on the next.

If our pipeline is long enough, our local caches are large enough, and our memory manager is smart enough, we get a cycle of processing pixels and reading/writing data over the bus such that we are always using bandwidth and never waiting. The only down time should be the time it takes to fill the pipeline for the first frame, which wouldn't be too bad in the grand scheme of things.
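A quick way to see how much work has to stay in flight is Little's Law: work in flight equals throughput times latency. The clock speed, pipe count, and latency below are our own hypothetical numbers, not NVIDIA's figures.

```python
# Back-of-the-envelope estimate (hypothetical numbers) of how many pixels
# must be in flight to hide a round trip to system memory over PCI Express.
# Little's Law: pixels_in_flight = pixel throughput * round-trip latency.

core_clock_hz    = 350e6   # hypothetical core clock
pixels_per_clock = 4       # hypothetical number of pixel pipes
latency_s        = 500e-9  # hypothetical round trip to system RAM

pixels_in_flight = core_clock_hz * pixels_per_clock * latency_s
print(f"pixels that must be in flight: {pixels_in_flight:.0f}")  # ~700

# Every extra nanosecond of worst-case latency the design must cover means
# more pixels tracked in flight, i.e. more on-die buffering.
```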

The big uncertainty is the chipset. There are no guarantees on when the GPU will get data back from system memory, and worst case latency can be horrible. It's also not going to be easy to design around worst case latencies, as NVIDIA is a competitor to other chipset makers, who aren't going to want to share that kind of information. The larger the latency NVIDIA has to guess it needs to cover, the more pixels it needs to keep in flight; covering more latency is essentially equivalent to increasing die size, so there is a point of diminishing returns for NVIDIA. But the larger a presence they have on the chipset side, the better off they'll be with their TurboCache parts as well. Even on the 915 chipset from Intel, bandwidth is limited across the PCI Express bus. Rather than a full 4GB/s down and 4GB/s up, Intel offers only 3GB/s down and 1GB/s up, leaving the TurboCache architecture with this:



Graphics performance is going to be dependent on memory bandwidth. The majority of the bandwidth that the graphics card has available comes across the PCI Express bus, and with a limited setup such as this, the 6200 TurboCache will see a performance hit. NVIDIA has reported something along the lines of a 20% performance improvement when moving from a 3-down, 1-up configuration to a 4-down, 4-up system. We have yet to verify these numbers, but it is not unreasonable to imagine this kind of impact.
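For reference, here is how the link numbers work out, using standard PCI Express 1.0 per-lane figures and the 915 numbers quoted above.

```python
# PCI Express 1.0 arithmetic: 2.5 Gbit/s per lane per direction with 8b/10b
# encoding gives 250 MB/s of usable bandwidth per lane per direction.
lanes = 16
per_lane_mb_s = 250

full_gb_s = lanes * per_lane_mb_s / 1000
print(f"full x16 link: {full_gb_s:.0f} GB/s down + {full_gb_s:.0f} GB/s up")
# => 4 GB/s each way, 8 GB/s aggregate

# With the 915's 3 GB/s down and 1 GB/s up, a TurboCache card loses half
# of the aggregate bus bandwidth it was designed around.
print(f"915 chipset: 3 + 1 = 4 GB/s aggregate vs {2 * full_gb_s:.0f} GB/s")
```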

The final thing to mention is system performance. NVIDIA only maps free, idle pages from RAM, using "an approved method for Windows XP." It doesn't lock anything down, and everything is allocated and freed on the fly. In order to support 128MB of framebuffer, 512MB of system RAM must be installed; with more system RAM, the card can map more framebuffer. NVIDIA chose 512MB as the minimum because PCI Express OEM systems are shipping with no less. Under normal 2D operation, there is no need to store more data than fits in local memory, so no extra system resources are used. Thus, there should be no adverse system-level performance impact from TurboCache.
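A simple accounting of how the advertised framebuffer splits between onboard memory and mapped system pages makes the footprint clearer. This is our illustration based on the figures in the article, not a statement of NVIDIA's exact allocation tiers.

```python
# Advertised framebuffer = onboard memory + pages mapped out of system RAM.
# Mapping only happens while a 3D application actually needs the space,
# and only on systems with at least 512MB of RAM installed.

advertised_mb = 128
onboard_mb    = 16          # 32 on the dual-chip board
system_ram_mb = 512

mapped_from_system = advertised_mb - onboard_mb
print(f"system pages mapped at peak: {mapped_from_system} MB "
      f"({100 * mapped_from_system / system_ram_mb:.0f}% of a 512MB system)")
# => 112 MB, about 22% of system RAM, used only while 3D work demands it.
```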

Comments

  • sphinx - Wednesday, December 15, 2004 - link

    I think this is a good offering from NVIDIA. Passively cooled is a VERY good solution in my line of work - one less thing I have to worry about silencing, as I use my PC to make money, not for playing games. Although I like to play an occasional game from time to time, don't get me wrong. I use my XBOX for gaming. When this card comes out I'll get one.
  • DerekWilson - Wednesday, December 15, 2004 - link

    #9, It'll only use 128MB if a full 128 is needed at the same time -- which isn't usually the case, but we haven't done an in-depth study on this yet. Also, keep in mind that we still tested at the absolute highest quality settings with no AA/AF (except Doom 3, which even used 8x AF as well). We were not seeing slide show framerates. The FX5200 doesn't even support all the features of the FX5900, let alone the 6200TC. Nor does the FX5200 perform as well at equivalent settings.

    IGP is something I talked to NVIDIA about. This solution really could be an Intel Extreme Graphics killer (in the integrated market). In fact, with the developments in the marketplace, Intel may finally get up and start moving to create a graphics solution that actually works. There are other markets where we could see TurboCache solutions show up as well.

    #11 ... The packaging issue is touchy. We'll see how vendors pull it off when it happens. The cards do run as if they had a full 128MB of RAM, so that's very important to get across. We do feel that talking about the physical layout of the card and the method of support is important as well.

    #8, 1600x1200x32 only requires that 7.5MB be stored locally. As was mentioned in the article, only the FRONT buffer needs to be local to the graphics card. This means that the depth buffer, back buffer, and other render surfaces can all be in system memory. I know it's kind of hard to believe, but this card can actually draw everything directly into system RAM from the pixel pipes and ROPs. When the buffers are swapped to display the back buffer, what's in system memory is copied into graphics memory.

    It really is very cool for a low performance budget part.

    And we might see higher performance versions of TurboCache in the future ... though NVIDIA isn't talking about them yet. It might be nice to have the possibility of an expanded framebuffer with more system RAM if the user wanted to enable that feature.

    TurboCache is actually a performance enhancing feature. It's just that it's enhancing the performance of a card with either 16MB or 32MB of onboard RAM and either a 32 or 64 bit memory bus ... :-)
  • DAPUNISHER - Wednesday, December 15, 2004 - link

    "NVIDIA has defined a strict set of packaging standards around which the GeForce 6200 with TurboCache supporting 128MB will be marketed. The boxes must have text, which indicates that a minimum of 512MB of system RAM is necessary for the full 128MB of graphics RAM support. It doesn't seem to require that a discloser of the actual amount of onboard RAM be displayed, which is not something that we support. It is understandable that board vendors are nervous about how this marketing will go over, no matter what wording or information is included on the package."

    More bullsh!t deceptive advertising to bilk uninformed consumers out of their money.
  • MAValpha - Wednesday, December 15, 2004 - link

    #7, I was thinking the same thing. This concept seems absolutely perfect for nForce5 IGP, should NVidia decide to go that route. And, once again, NVidia's approach to budget seems superior to ATI's, at least from an initial glance. A heavily-castrated 6200TC running off SHARED RAM STILL manages to outperform a full X300? Come on, ATI, get with it!
    I gotta wonder, though: this solution seems unbelievably dependent on "proper implementation of the PCIe architecture." This means that the card can never be coupled with HSI for older systems, and transitional boards will have trouble running the card (Gigabyte's PT880 with converted PEG, for example - the PT880 natively supports AGP). Does this mean that a budget card on a budget motherboard will suffer significantly?
  • mindless1 - Wednesday, December 15, 2004 - link

    IMO, even (as low as) $79 is too expensive. Taking 128MB of system memory away on a system budgeted to include one of these would typically leave 384MB, robbing the system of memory to pay nVidia et al. for a part without (much) memory.

    I tend to disagree with the slant of the article, too - it's not necessarily a good thing to try pushing modern gaming eye candy at the expense of performance. What looks good isn't a crisp and anti-aliased slideshow, but a playable game. Even someone just beginning at gaming can discern a lag when fragging it out.

    We're only looking at current games now; the bar for performance will be raised, but the cards are memory bandwidth limited due to the architecture. These might look like a good alternative for someone who went and paid $90 for an FX5200 from BestBuy last year, but in a budget system it's going to be tough to justify ~$80-100 when a few bucks more won't rob one of system memory or as much performance.

    Even so, historically we've seen that initial price points do fall, and it's better to see modern support than a rehash of the FX5xxx.
  • PrinceGaz - Wednesday, December 15, 2004 - link

    nVidia's marketing department must be really pleased with coming up with the name "TurboCache". It makes it sound like it's faster than a normal card without TurboCache, whereas in reality the opposite is true. Uninformed customers would probably choose a TurboCache version over a normal version, even if they were priced the same!
    ----

    Derek- does the 16MB 6200 have limitations on what resolutions can be used in games? I know you wouldn't want to run it at 1600x1200x32 in Far Cry for instance, but in older games like Quake 3 it should be fast enough.

    Thing is that the frame-buffer at 1600x1200x32 requires 7.3MB, so with double-buffering you're using up a total of 14.65MB leaving just 1.35MB for the Z-buffer and anything else it needs to keep in local memory, which might not be enough. I'm assuming the frame the card is currently displaying must be held in local memory, as well as the frame being worked on.

    The situation is even worse with anti-aliasing as the frame-buffer size of the frame being worked on is multiplied in size by the level of AA. At 1280x960x32 with 4xAA, the single frame-buffer alone is 18.75MB meaning it won't fit in the 16MB 6200. It might not even manage 1024x768 with 4xAA as the two frame buffers would total 15MB (12MB for the one being worked on, 3MB for the one being displayed).

    It will be interesting to know what the resolution limits for the 16MB (and 32MB) cards are, with and without anti-aliasing.
  • Spacecomber - Wednesday, December 15, 2004 - link

    I may be way off base with this question, but would this sort of GPU lend itself well to some sort of integrated, onboard graphics solution? Even if it isn't integrated directly into the main chipset (or chip for Nvidia), could it simply be soldered to the motherboard somewhere?

    Somehow this seems to make more sense to me for what to do with this technology than use it on a dedicated video card, especially if the price point is not that much less than a regular 6200.
  • bamacre - Wednesday, December 15, 2004 - link

    Great review.

    Wow, almost 50 fps on HL2 at 10x7, that is pretty good for a budget card.

    I'd like to see MS, ATI, and Nvidia get more people into PC gaming, that would make for better and cheaper games for those of us who are already loving it.
  • DerekWilson - Wednesday, December 15, 2004 - link

    Actually, nForce 4 + AMD systems are looking better than Intel non-925xe based systems for TurboCache parts. We haven't looked at the 925xe yet, though ... that could be interesting. But overhead hurts utilization a lot on a serial bus, and having more than 6.4GB/s from memory might not be that useful.

    The efficiency of getting bandwidth across the PCI Express bus will still be the main bottleneck in systems, though. Chipsets need to implement PCI Express properly and well. That's really the important part. The 915 chipset is an example of what not to do.
  • jenand - Wednesday, December 15, 2004 - link

    TurboCache and HyperMemory cards should do better on Intel based systems, as they do not need to go via the HTT to get to the memory. So I agree with #3 - show us some i925X(E) tests. I'm not expecting higher scores on the Intel systems, however, just a larger gain from this type of technology.
