NVIDIA's Dirty Dealing with DX10.1 and How GT200 Doesn't Support it

I know many people were hoping to see DX10.1 implemented in GT200 hardware, but that is not the case. NVIDIA has opted to skip including some of the features of DX10.1 in this generation of their architecture. We are in a situation similar to the one with DX9, where SM2.0 hardware was able to do the same things as SM3.0 hardware, albeit at reduced performance or efficiency. DX10.1 does not enable a new class of graphics quality or performance, but it does give developers more options to simplify their code, and it does enhance performance when coding certain effects and features.

It's useful to point out that, in spite of the fact that NVIDIA doesn't support DX10.1 and DX10 offers no caps bits, NVIDIA does enable developers to query their driver on support for a feature. This is how they can support multisample readback and any other DX10.1 feature that they chose to expose in this manner. Sure, part of the point of DX10 was to eliminate the need for developers to worry about varying capabilities, but that doesn't mean hardware vendors can't expose those features in other ways. Supporting DX10.1 is all or nothing, but enabling features beyond DX10 that happen to be part of DX10.1 is possible, and NVIDIA has done this for multisample readback and can do it for other things.

While we would love to see NVIDIA and AMD both adopt the same featureset, just as we wish AMD had picked up SM3.0 in R4xx hardware, we can understand the decision to exclude support for the features DX10.1 requires. NVIDIA is well within reason to decide that the ROI on implementing hardware for DX10.1 is not high enough to warrant it. That's all fine and good.

But then PR, marketing and developer relations get involved and what was a simple engineering decision gets turned into something ridiculous.

We know that G80 and R600 both supported some of the DX10.1 featureset. Our goal at the least has been to determine which, if any, features were added to GT200. We would ideally like to know what DX10.1 specific features GT200 does and does not support, but we'll take what we can get. After asking our question, this is the response we got from NVIDIA Technical Marketing:

"We support Multisample readback, which is about the only dx10.1 feature (some) developers are interested in. If we say what we can't do, ATI will try to have developers do it, which can only harm pc gaming and frustrate gamers."

The policy decision that has led us to run into this type of response at every turn is reprehensible. Aside from being blatantly untrue at any level, it leaves us to wonder why we find ourselves even having to respond to this sort of a statement. Let's start with why NVIDIA's official position holds no water and then we'll get on to the bit about what it could mean.

The statement that multisample readback is the only thing some developers are interested in is untrue: cube map arrays come in quite handy for simplifying and accelerating multiple applications. Necessary? No. But useful? Yes. Separate per-MRT blend modes could become useful as deferred shading continues to evolve, and part of what would be great about supporting these features is that they allow developers and researchers to experiment. I get that not many devs will get up in arms about int16 blends, but some DX10.1 features are interesting, and, more to the point, would be even more compelling if both AMD and NVIDIA supported them.

Next, the idea that developers in collusion with ATI would actively try to harm PC gaming and frustrate gamers is false (and reeks of paranoia). Developers are interested in doing the fastest, most efficient thing to get their desired result with as little trouble to themselves as possible. They will take or leave a technique depending on whether it makes sense. The goal of a developer is to make the game as enjoyable as possible for as many gamers as possible, and enabling the same experience on both AMD and NVIDIA hardware is vital. Games won't come out with either one of the two major GPU vendors unable to run the game properly because it is bad for the game and bad for the developer.

Just like NVIDIA made an engineering decision about support for DX10.1 features, every game developer must weigh the ROI of implementing a specific feature or using a certain technique. With NVIDIA not supporting DX10.1, doing anything DX10.1 becomes less attractive to a developer because they need to write a DX10 code path anyway. Unless a DX10.1 code path is trivial to implement, produces the same result as DX10, and provides some benefit on hardware supporting DX10.1, there is no way it will ever make it into games. That is, unless there is some sort of marketing deal in place with a publisher to unbalance things, which is a fundamental problem with going beyond developer relations and tech support and designing marketing campaigns based on how many games display a particular hardware vendor's logo.

The idea that NVIDIA is going to somehow hide the capabilities of their hardware from AMD is also naive. The competition, through the use of X-rays, electron microscopes and other tools of reverse engineering, is going to be the first to discover all the ins and outs of how a piece of silicon works once it hits the market. NVIDIA knows AMD will study GT200 because NVIDIA knows it would be foolish for them not to have an RV670 core on their own chopping block. AMD will know how best to program GT200 before developers do and independently of any blanket list of features we happen to publish on launch day.

So who really suffers from NVIDIA's flawed policy of silence and deception? The first to feel it are the hardware enthusiasts who love learning about hardware. Next in line are the developers, because they don't even know what features NVIDIA is capable of offering. Of course, there is AMD, which won't be able to sell developers on support for features that could make its hardware perform better because NVIDIA hardware doesn't support them (even if it does). Finally, there are the gamers, who cannot and will never know what could have been if a developer had easy access to just one more tool.

So why would NVIDIA take this less than honorable path? The possibilities are endless, but we're happy to help with a few suggestions. It could just be as simple as preventing AMD from getting code into games that runs well on their hardware (as may have happened with Assassin's Creed). It could be that the features NVIDIA does support are incredibly subpar in performance: just because you can do something doesn't mean you can do it well and admitting support might make them look worse than denying it. It could be that the fundamental architecture is incapable of performing certain basic functions and that reengineering from the ground up would be required for DX10.1 support.

NVIDIA insists that if it reveals its true feature set, AMD will buy off a bunch of developers with its vast hoards of cash to enable support for DX10.1 code NVIDIA can't run. Oh wait, I'm sorry, NVIDIA is worth twice as much as AMD, which is billions in debt and struggling to keep up with its competitors on the CPU and GPU side. So we ask: who do you think is more likely to start buying off developers to the detriment of the industry?


108 Comments


  • skiboysteve - Tuesday, June 17, 2008

    FANTASTIC write up on fine-grained TMT. I was unaware of this threading technique and was always thinking of this in class or whenever someone would talk about hyperthreading. this technique was literally in my head for well over a year and I didn't know what it was called or that it even had a name. I always thought there had to be a more elegant way than hyperthreading to do multithreading down at the chip level without doing the OS style time slicing.

    i was sitting there wondering how the hell they schedule and run these SPs and then bam whole page about it

    really appreciate the effort that goes into researching the core of these chips. i know not everyone likes it but for guys that are educated and work in the field it's really interesting
  • DerekWilson - Tuesday, June 17, 2008

    remember though that this type of fine-grained TMT only has payoffs in systems running millions of threads concurrently.

    on an OS you'll see hundreds or even thousands of threads on heavily used systems, but there still wouldn't be enough concurrent action to justify this type of architecture for general purpose computing.

    of course, as developers push towards an effort to thread their code as much as possible, who knows what architectures might be worth exploring on the desktop ...
  • coder0000 - Tuesday, June 17, 2008

    Very well written! A couple of points:

    1) Last week at WWDC Apple announced OpenCL as an alternative to CUDA. It's a C99 based HLL for creating compute kernels that can be deployed to GPUs and CPUs. Today Khronos officially announced a working group for this, and NV is a part of the committee. As such, your wish for an industry standardized compute language similar to CUDA that runs on all platforms and vendors' HW may not be so far off.

    2) I believe your interpretation of how multiple threads simultaneously execute in an SM is incorrect. Per thread context switching is not free, and you would never be able to execute a different thread every cycle in the manner described. There is far too much context that needs to be swapped out, and there would be significant power implications for doing that, in addition to the latency. Instead, I believe what NV is claiming is that any given SP executes a single thread. All threads in the SM can all be a single warp, but you can also have multiple threads (one per SP) all executing simultaneously in an SM.
  • DerekWilson - Tuesday, June 17, 2008

    1) I haven't had a good chance to look at OpenCL, but I certainly hope that if it's everything everyone is saying it is in the comments here that it takes off in a bigger way than CUDA :-)

    2) it does not context switch per thread -- warps define a context, and you have 32 threads grouped together. these threads all share the same instruction stream, which is why if threads in a warp take different directions on a branch all 32 threads must follow both paths.

    NVIDIA has flat out stated that every schedule clock a new warp is scheduled and that it takes 4 clock cycles to process one warp on an SM. For both of these to be true, we conclude that the scheduler alternates scheduling SPs and SFUs on alternating clocks, which means the SPs would be scheduled every 4 clocks relative to themselves.

    On 8 SPs per SM, you somehow need to execute 32 threads in 4 clock cycles. This makes sense if you execute 4 threads per SP in some way. The details at this point are fuzzy though.

    regardless, if an SP executes 4 different threads from the same warp, there is no need to context switch to execute any of these threads -- again, threads in the same warp share context.
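
    The arithmetic in this comment can be written out as a small toy model. This is only a sketch under the numbers quoted above (32 threads per warp, 8 SPs per SM); the function names are hypothetical, and the model ignores SFUs, memory latency and everything else about the real scheduler, whose details are not public here.

```python
import math

WARP_SIZE = 32   # threads per warp, as stated in the comment
SPS_PER_SM = 8   # SPs per SM, as stated in the comment


def cycles_per_warp_instruction(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """Clocks for one SM to run a single instruction across a whole warp."""
    return math.ceil(warp_size / sps)


def threads_per_sp(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """How many threads of the same warp each SP has to work through."""
    return warp_size // sps


def divergent_branch_cycles(taken_len, not_taken_len, lanes_taking):
    """Toy cost of an if/else for one warp (assumed model, not NVIDIA's).

    taken_len / not_taken_len: instruction counts of the two branch paths.
    lanes_taking: how many of the warp's threads take the 'if' side.
    A path is only executed if at least one thread needs it, but when
    threads disagree the whole warp steps through both paths.
    """
    per_instruction = cycles_per_warp_instruction()
    cycles = 0
    if lanes_taking > 0:
        cycles += taken_len * per_instruction
    if lanes_taking < WARP_SIZE:
        cycles += not_taken_len * per_instruction
    return cycles
```

    In this model `cycles_per_warp_instruction()` and `threads_per_sp()` both come out to 4, matching the "4 clock cycles" and "4 threads per SP" figures, and a single thread taking the other side of a branch is enough to make the warp pay for both paths.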
  • skiboysteve - Tuesday, June 17, 2008

    could be a large explanation of the 2x register file size. and remember that the SP doesn't have to worry about the context switch, the SM handles having the data in the right place
  • anandtech02148 - Monday, June 16, 2008

    From this conclusion, Amd seems to be the shrewd player, let nvidia and intel duke it out in the high voltage, heat, meaningless speed gpu while Amd can pull something like its first dualcore or athlon64 for the win.
    this new beast from Nvidia will have how many developers making games for it right away? i'm guestimating maybe 2yrs-4yrs down the road we'll see a decent title that take full advantage of this hardware.
    by then Amd will have something of a midrange that can more than handle the games.
    2 things nvidia could work on that it already has, the ps3 market, and small graphic devices to improve profits. shrink the ps3 gpu further so Sony can shrink its machine and sell more.

  • PrinceGaz - Monday, June 16, 2008

    The GT200 core may be a technical masterpiece in terms of actually making something that big which is fully functional on GTX280 cards, but it seems to me the penalty of fabbing it at 65nm negates much of the benefit of such a wide GPU.

    They've had to drop the clock speeds throughout presumably because of the ridiculous amount of heat such a large core generates, which means the ~60% performance advantage in current games over the G80 core at similar clock-speeds is somewhat reduced.

    Given that ATI are not producing their 55nm cores in AMD's fabs but instead are getting them churned out reliably elsewhere, nVidia have made a mistake this time around in having their high-end product rely on previous-generation fabrication as it makes it run too hot to allow the clock-speeds needed for it to be the product it should be. There is always a risk in transitioning to a smaller fab technology, and nVidia suffered badly in the past by doing so too early, but with a chip the size of the GT200, they really should have gone to 55nm even if it meant a delay of a month or three, whilst the smaller cut-down derivatives were rolled out first.
  • ekpyr - Monday, June 16, 2008

    Great article, but what about the microstuttering issues present in Nvidia's 9800GX2 cards (both SLI and Quad-SLI)? There is very little discussion on this, but I've seen some benchmarks where the FPS floor is 4fps with the 9800GX2s. Can you add a subjective review of whether or not the actual gameplay is smoother with the GTX280s across these games? Aggregate numbers may say one thing, but I've returned a 9800 GX2 Quad-SLI setup because it was unable to handle the incredible amount of texture loading that was done in Age of Conan (2560x1600 4xAA 'High' settings = 4fps). The 8800 GTX Tri-SLI configuration I am currently using is more resilient to microstuttering with its increased bus and memory capacities, but I'm very curious about the GTX280s and their increased memory and bus on texture-heavy games like Age of Conan.
  • DerekWilson - Monday, June 16, 2008

    the only game that came close to having this issue with quad sli for us was oblivion.

    in that game at high res lag and stutter are unbearable and the game is unplayable.

    we didn't notice any stuttering issues with a single GX2.

    i'm working on some analysis tools to show details like this better in future articles.
  • TheJian - Monday, June 16, 2008

    I find it humorous that nobody discusses the fact that the shrink has already taped out and will likely be out in two months or just after. This humongous chip was only released so that when AMD releases in the next few weeks they will be behind still in single GPU cards. This is basically what Intel does to AMD every time AMD has a better chip. For all intents and purposes this is a PAPER release of what will come in 2-2.5 months (In Intel's case they just show you what will be out 6 months from now, and a large portion of people don't buy an AMD because Intel might be ahead by xmas...LOL - works like a charm every time AMD is ahead). THE DIE SHRUNK CHIP! Most likely with faster speeds. I suspect they'll come with "ULTRA" version first (and stick it on top of the price heap, so as to not kill all FAT cards in the channel already) and then filter down as these big suckers leave the channel. That's if they even plan to sell more than a few of these to begin with at 65nm. It's only out there so AMD won't look any good in two weeks.

    MIND SHARE is everything, which is why Intel's KING of the paper launch when behind strategy. They've even gone to doing it for all chips no matter what now. Nehalem scores 6 months before availability. AMD's marketers have no clue and should be fired. You have to play the same DIRTY game as your enemy or you've already lost. If AMD had half a brain in their head they'd paper launch an ultra or 2x4870 version for the same reason...LOL. Then claim "our 4870x2 makes nvidia look like crap for $600"...ROFL. Who cares when it's available, just say it. Having said that, Nvidia will wipe the floor with them in 2 months anyway on a 2xGTX280 that's die shrunk. Which is all they are doing today...BUYING TIME!