NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10

Name: NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10
Item: NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10
Author: Anand Lal Shimpi & Derek Wilson

by Anand Lal Shimpi & Derek Wilson on November 8, 2006 6:01 PM EST

Posted in
GPUs

111 Comments | Add A Comment

111 Comments

Virtual Memory

Microsoft is taking tighter control of graphics memory with it's new driver model, and thus is able to provide virtual memory support for the graphics memory subsystem. What this means is that games no longer need to worry about running out of graphics memory. When software needs to write something to local memory, and local memory is full, Windows will be able to kick out something off the graphics card and put it in system memory (this is called paging) until it is needed. This happens without the software's intervention or knowledge. If system memory becomes full, data will be kicked out to the hard drive. Of course, if something like this happens the performance will definitely suffer.

Virtual memory isn't as much a performance enhancing tool as it is a way to remove the burden on the developer to manage memory usage around a hard limit of available space. Certainly, lots of paging will degrade performance, but lower performance is generally better than a crash. On the flip side, it is possible that virtual memory could increase performance by effectively replacing local graphics memory size with unused PCIe bandwidth. This has been the idea behind TurboCache and HyperMemory, but with the added advantage that the graphics driver doesn't need to worry about object or texture management between local and system memory.

Engineers have been wanting to see virtualized graphics memory for years, as operating on really huge data sets is made significantly easier when the software developer doesn't have to manage moving data in and out of graphics memory by hand. We've seen some limited benefits of utilizing both local and system memory on low memory TurboCache and HyperMemory cards. With game developers reaching towards ever larger data sets, high end parts will soon begin to benefit from virtualized graphics memory as well. Building the hardware to accommodate the possibility of higher latencies due to paging and allowing the OS to manage all the memory in the system will definitely help developers focus on building better games rather than better memory managers. That's not to say that memory management won't still be important to game developers. Making sure space and bandwidth are used efficiently are important factors in performance, but the ability to forget about hard limits in local memory will make it easier to take one efficient approach regardless of onboard memory.

Hardware Virtualization

Lately, all the big boys of computing have been infatuated with the idea of virtualization. It makes a whole lot of sense, really. With the advent of multi-core CPUs, AMD and Intel need to find ways to take full advantage of their processing power. Single thread execution time will never disappear as a factor in computing, and some algorithms just can't be parallelized.

Obviously, encouraging users to multitask is a simple way to provide a benefit to multi-core computing. The next step is to encourage developers to write highly multithreaded applications. Beyond that is to allow the user to run multiple operating systems on one set of hardware. One example of how this may be beneficial is in the use of a single system as a normal PC during its use as a home theater / DVR box. Another example is one we've already seen: Mac users running both Windows and OS X on Intel based Macs using a virtual machine manager like Parallels.

In order to really achieve the capabilities hardware providers would like to promote, more work must be done by hardware, software, and operating system providers. One of the major advances necessary is the virtualization of the graphics subsystem. With DirectX10 and the new WDDM (Windows Display Driver Model), graphics hardware is required to support virtualization. This is not a simple request, as games will no longer be guaranteed exclusive access to the hardware while running. We can potentially share game rendering with something like physics calculations on the same GPU. Or we could run a Folding@Home GPU client in the background while we play a game. On the extreme, multiple full screen 3d applications could be running concurrently.

Drivers and hardware will have to support context switching on a massive scale due the huge number of pipelines and registers supported in DX10 class hardware. With the advent of features like TurboCache and HyperMemory (and now graphics memory virtualization), hardware developers are already prepared to handle much larger latencies than we've seen in the past. The ability to preempt a process on the GPU will only increase the potential latency that will need to be addressed.

This is another major step in bringing the GPU closer in functionality to the CPU. More attention must be paid not only to instruction and thread scheduling, but the scheduling of multiple programs. This is no small task when such a high number of pipelines need to be managed. We are very interested in discovering how well NVIDIA has implemented this feature, but we won't be able to test this until we have access to an operating system, API, and software that support it as well.

Index All GPUs are Created Equal: Say Goodbye to Cap Bits

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

111 Comments

View All Comments

Nightmare225 - Sunday, November 26, 2006 - link
Are the FPS posted in this article, Minimum FPS, Average FPS, or Maximum? Thanks!
multiblitz - Monday, November 20, 2006 - link
I enjoyed your reviews always a lot as they inclueded the video-capbilities for a HTPC on previous cards. Unfortunately this was this time not the case. Hopefully there will be a 2. Part covering this as well ? If so, it would be nice to make a compariosn on picture quality as well against the filters of ffdshow, as nvidia is now as well supporting postprocessing filters...
DerekWilson - Tuesday, November 21, 2006 - link
What we know right now is that 8800 gets a 128 out of 130 on HQV tests.

We haven't quite put together an HTPC look at 8800, but this is a possibility for the future.
epsil0n - Sunday, November 19, 2006 - link
I am not agree with this:

"It isn't surprising to see that NVIDIA's implementation of a unified shader is based on taking a pixel shader quad pipeline, and breaking up the vector units into 4 scalar units. Now, rather than 4 pixel quads, we see 16 SPs per "quad" or block of stream processors. Each block of 16 SPs shares 4 texture address units, 8 texture filter units, and an L1 cache."

If i understood well this sentence tells that given 4 pixels the numbers of SPs involved in the computation are 16. Then, this assumes that each component of the pixel shader is computed horizontally over 16 SP (4pixel x 4rgba = 16SP). But, are you sure??

I didn't found others articles over the web that speculate about this. Reading others articles the main idea that i realized is that a shader is computed by one and only one SP. Each vector instruction (inside the shader) is "mapped" as a sequence of scalar operations (a dot product beetwen two vectors is mapped as 4 MUD/ADD operations). As a consequence, in this scenario 4 pixels are computed only by 4 SPs.
DerekWilson - Tuesday, November 21, 2006 - link
Honestly, NVIDIA wouldn't give us this level of detail. We certainly pressed them about how vertices and pixels map to SPs, but the answer we got was always something about how dynamic the hardware is able to dynamically schedule the SPs optimally according to what needs to be done.

They can get away with being obscure about how they actually process the data because it could happen either way and provide the same effect to the developer and gamer alike.

Scheduling the simultaneous processing one vec4 MAD operation on 4 quads (16 pixels) over 4 groups of 4 SPs will take 4 clock cycles (in terms of throughput). Processing the same 16 pixels on 16 SPs will also take 4 clock cycles.

But there are reasons to believe that things happen the way we described. Loading components of 16 different "threads" (verts, pixels or whatever) would likely be harder on the cache than loading all 4 components of 4 different threads. We could see them schedule multiple ops from 4 threads to fill up each block of shaders -- like computing 4 consecutive scalar operations for 4 threads on 16 SPs.

At the same time, it might be easier to maximize SP utilization if 16 threads were processed on one block of SPs every clock.

I think the answer to this question is that NVIDIA knows, they didn't tell us, and all we can do is give it our best guess.
xtknight - Thursday, November 16, 2006 - link
This has been AT's best article in awhile. Tons of great, concise info.

I have a question about the gamma corrected AA. This would be detrimental if you've already calibrated your display, correct (assuming the game heeds to the calibration)? Do you know what gamma correction factor the cards use for 'gamma corrected AA'?
DerekWilson - Monday, November 20, 2006 - link
I don't know if they dynamically adjust gamma correction based on monitor (that would be nice though) ...

if they don't they likely adjusted for a gamma of either (or between) 2.2 or 2.5.

Also, thanks :-) There was a lot more we wanted to pack in, but I'm glad to see that we did a good job with what we were able to include.

Thanks,
Derek Wilson
bjacobson - Sunday, November 12, 2006 - link
This comment is unrelated, but could you implement some system where after rating a comment, on reload the page goes back to the comment I was just at? Otherwise I rate something halfway down and then have to spend several seconds finding where I just was. Just a little nuissance.

Thanks for the great article, fun read.
neo229 - Friday, November 10, 2006 - link

quote:
Both cards are extremely quiet during operation...

This is a very suspect quote. A card that requires two PCIe power connectors is going to dissipate a lot of heat. More heat means there must be a faster, louder fan or more substantial and costly heat sink. The extra costs associated with providing a truly quiet card mean that the bulk of manufacturers go with the loud fan option.
DerekWilson - Friday, November 10, 2006 - link
If manufacturers go with the NVIDIA reference design, then we will see a nice large heatsink with a huge quiet fan.

Really, it does move a lot of air without making a lot of noise ... Are there any devices we can get to measure the airflow of a cooling solution?

We are also seeing some designs using water cooling and theres even one with a thermo-electric (peltier) cooler on it. Manufacturers are going to great lengths to keep this thing running cool without generating much noise.

None of the 8 retail cards we are testing right now generate nearly the noise of the X1950 XTX ... We are working on a retail roundup right now, and we'll absolutely have noise numbers for all of these cards at load.

NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10

Virtual Memory

Hardware Virtualization

Post Your Comment

111 Comments

View All Comments

Nightmare225 - Sunday, November 26, 2006 - link

multiblitz - Monday, November 20, 2006 - link

DerekWilson - Tuesday, November 21, 2006 - link

epsil0n - Sunday, November 19, 2006 - link

DerekWilson - Tuesday, November 21, 2006 - link

xtknight - Thursday, November 16, 2006 - link

DerekWilson - Monday, November 20, 2006 - link

bjacobson - Sunday, November 12, 2006 - link

neo229 - Friday, November 10, 2006 - link

DerekWilson - Friday, November 10, 2006 - link

Log in

Don't have an account? Sign up now