Compute

On the other hand compared to our Kepler cards GTX 980 introduces a bunch of benefits. Higher CUDA core occupancy is going to be extremely useful in compute benchmarks. So will the larger L2 cache and the 96KB per SMM of shared memory. Even more important, compares to GK104 (GTX 680/770) GTX 980 inherits the compute enhancements that were introduced in GK110 (GTX 780/780 Ti) including changes that relieved pressure on register file bandwidth and capacity. So although GTX 980 is not strictly a compute card – it is first and foremost a graphics card – it has a lot of resources available to spend on compute.

As always we’ll start with LuxMark2.0, the official benchmark of SmallLuxGPU 2.0. SmallLuxGPU is an OpenCL accelerated ray tracer that is part of the larger LuxRender suite. Ray tracing has become a stronghold for GPUs in recent years as ray tracing maps well to GPU pipelines, allowing artists to render scenes much more quickly than with CPUs alone.

Compute: LuxMark 2.0

Out of the gate GTX 980 takes off like a rocket. AMD’s cards could easily best even GTX 780 Ti here, but GTX 980 wipes out AMD’s lead and then some. At 1.6M samples/sec, GTX 980 Ti is 15% faster than R9 290X and 54% faster than GTX 780 Ti. This, as it’s important to remind everyone, is for a part that technically only has 71% of the CUDA cores of GTX 780 Ti. So per CUDA core, GTX 980 delivers over 2x the LuxMark performance of GTX 780 Ti. Meanwhile against GTX 680 and GTX 780 the lead is downright silly. GTX 980 comes close to tripling its GK104 based predecessors.

I’ve spent some time pondering this, and considering that GTX 750 Ti looked very good in this test as well it’s clear that Maxwell’s architecture has a lot to do with this. I don’t know if NVIDIA hasn’t also been throwing in some driver optimizations here, but a big part is being played by parts of the architecture. GTX 750 Ti and GTX 980 both share the general architecture and 2MB of L2 cache, while it seems like we can run out GTX 980’s larger 96KB shared memory since GTX 750 Ti did not have that. This may just come down to those CUDA core occupancy improvements, especially if you start comparing GTX 980 to GTX 780 Ti.

For our second set of compute benchmarks we have CompuBench 1.5, the successor to CLBenchmark. We’re not due for a benchmark suite refresh until the end of the year, however as CLBenchmark does not know what to make of GTX 980 and is rather old overall, we’ve upgraded to CompBench 1.5 for this review.

Compute: CompuBench 1.5 - Face Detection

The first sub-benchmark is Face Detection, which like LuxMark puts GTX 980 in a very good light. It’s quite a bit faster than GTX 780 Ti or R9 290X, and comes close to trebling GTX 680.

Compute: CompuBench 1.5 - Optical Flow

The second sub-benchmark of Optical Flow on the other hand sees AMD put GTX 980 in its place. GTX 980 fares only as well as GTX 780 Ti here, which means performance per CUDA core is up, but not enough to offset the difference in cores. And it doesn’t get GTX 980 anywhere close to beating R9 290X. As a computer vision test this can be pretty memory bandwidth intensive, so this may be a case of GTX 980 succumbing to its lack of memory bandwidth rather than a shader bottleneck.

Compute: CompuBench 1.5 - Particle Simulation 64K

The final sub-benchmark of the particle simulation puts GTX 980 back on top, and by quite a lot. NVIDIA does well in this benchmark to start with – GTX 780 Ti is the number 2 result – and GTX 980 only improves on that. It’s 35% faster than GTX 780 Ti, 73% faster than R9 290X, and GTX 680 is nearly trebled once again. CUDA core occupancy is clearly a big part of these results, though I wonder if the L2 cache and shared memory increase may also be playing a part compared to GTX 780 Ti.

Our 3rd compute benchmark is Sony Vegas Pro 12, an OpenGL and OpenCL video editing and authoring package. Vegas can use GPUs in a few different ways, the primary uses being to accelerate the video effects and compositing process itself, and in the video encoding step. With video encoding being increasingly offloaded to dedicated DSPs these days we’re focusing on the editing and compositing process, rendering to a low CPU overhead format (XDCAM EX). This specific test comes from Sony, and measures how long it takes to render a video.

Compute: Sony Vegas Pro 12 Video Render

Traditionally a benchmark that favored AMD, the GTX 980 doesn’t manage to beat the R9 290X, but it closes the gap significantly compared to GTX 780 Ti. This test is a mix of simple shaders and blends, so it’s likely we’re seeing a bit of both here. More ROPs for more blending, and improved shader occupancy for when the task is shader-bound.

Moving on, our 4th compute benchmark is FAHBench, the official Folding @ Home benchmark. Folding @ Home is the popular Stanford-backed research and distributed computing initiative that has work distributed to millions of volunteer computers over the internet, each of which is responsible for a tiny slice of a protein folding simulation. FAHBench can test both single precision and double precision floating point performance, with single precision being the most useful metric for most consumer cards due to their low double precision performance. Each precision has two modes, explicit and implicit, the difference being whether water atoms are included in the simulation, which adds quite a bit of work and overhead. This is another OpenCL test, utilizing the OpenCL path for FAHCore 17.

Compute: Folding @ Home: Explicit, Single PrecisionCompute: Folding @ Home: Implicit, Single Precision

This is another success story for the GTX 980. In both single precision tests the GTX 980 comes out on top, holding a significant lead over the R9 290X. Furthermore we’re seeing some big performance gains over GTX 780 Ti, and outright massive gains over GTX 680, to the point that GTX 980 comes just short of quadrupling GTX 680’s performance in single precision explicit. This test is basically all about shading/compute, so we expect we’re seeing a mix of improvements to CUDA core occupancy, shared memory/cache improvements, and against GTX 680 those register file improvements.

Compute: Folding @ Home: Explicit, Double Precision

Double precision on the other hand is going to be the GTX 980’s weak point for obvious reasons. GM204 is a graphics GPU first and foremost, so it only has very limited 1:32 rate FP64 performance, leaving it badly outmatched by anything with a better rate. This includes GTX 780/780 Ti (1:24), AMD’s cards (1:8 FP64), and even ancient GTX 580 (1:8). If you want to do real double precision work, NVIDIA clearly wants you buying their bigger, compute-focused products such as GTX Titan, Quadro, and Tesla.

Wrapping things up, our final compute benchmark is an in-house project developed by our very own Dr. Ian Cutress. SystemCompute is our first C++ AMP benchmark, utilizing Microsoft’s simple C++ extensions to allow the easy use of GPU computing in C++ programs. SystemCompute in turn is a collection of benchmarks for several different fundamental compute algorithms, with the final score represented in points. DirectCompute is the compute backend for C++ AMP on Windows, so this forms our other DirectCompute test.

Compute: SystemCompute v0.5.7.2 C++ AMP Benchmark

Once again NVIDIA’s compute performance is showing a strong improvement, even under DirectCompute. 17% over GTX 780 Ti and 88% over GTX 680 shows that NVIDIA is getting more work done per CUDA core than ever before. Though this won’t be enough to surpass the even faster R9 290X.

Overall, while NVIDIA can’t win every compute benchmark here, the fact that they are winning so many and by so much – and otherwise not terribly losing the rest – shows that NVIDIA and GM204 have corrected the earlier compute deficiencies in GK104. As an x04 part GM204 may still be first and foremost consumer graphics, but if it’s faced with a compute workload most of the time it’s going to be able to power on through it just as well as it does with games and other graphical workloads.

It would be nice to see GPU compute put to better use than it is today, and having strong(er) compute performance in consumer parts is going to be one of the steps that needs to happen for that outcome to occur.

Synthetics Power, Temperature, & Noise
Comments Locked

274 Comments

View All Comments

  • TheJian - Saturday, September 20, 2014 - link

    http://blogs.nvidia.com/blog/2014/09/19/maxwell-an...
    Did I miss it in the article or did you guys just purposely forget to mention NV claims it does DX12 too? see their own blog. Microsoft's DX12 demo runs on ...MAXWELL. Did I just miss the DX12 talk in the article? Every other review I've read mentions this (techpowerup, tomshardware, hardocp etc etc). Must be that AMD Center still having it's effect on your articles ;)

    They were running a converted elemental demo (converted to dx12) and Fable Legends from MS. Yet curiously missing info from this site's review. No surprise I guess with only an AMD portal still :(

    From the link above:
    "Part of McMullen’s presentation was the announcement of a broadly accessible early access program for developers wishing to target DX12. Microsoft will supply the developer with DX12, UE4-DX12 and the source for Epic’s Elemental demo ported to run on the DX12-based engine. In his talk, McMullen demonstrated Maxwell running Elemental at speed and flawlessly. As a development platform for this effort, NVIDIA’s GeForce GPUs and Maxwell in particular is a natural vehicle for DX12 development."

    So maxwell is a dev platform for dx12, but you guys leave that little detail out so newbs will think it doesn't do it? Major discussion of dx11 stuff missing before, now up to 11.3 but no "oh and it runs all of dx12 btw".

    One more comment on 980: If it's a reference launch how come other sites already have OC versions (IE, tomshardware has a Windforce OC 980, though stupidly as usual they downclocked it and the two OC/superclocked 970's they had to ref clocks...ROFL - like you'd buy an OC card and downclock them)? IT seems to be a launch of OC all around. Newegg even has them in stock (check EVGA OC version):
    http://www.newegg.com/Product/Product.aspx?Item=N8...
    And with a $10 rebate so only $559 and a $5 gift card also.
    "This model is factory overclocked to 1241 MHz Base Clock/1342 MHz Boost Clock (1126 MHz/1216 MHz for reference design)"

    Who would buy ref for $10 diff? IN fact the ref cards are $569 at newegg, so you save buying the faster card...LOL.
  • cactusdog - Saturday, September 20, 2014 - link

    TheJian, Wow, Did you read the article? Did you read the conclusion? AT says the 980 is "remarkable" , "well engineered", "impeccable design" and has "no competition" They covered almost all of Nvidia marketing talking points and you're going to accuse them of a conspiracy? Are you fking retarded??
  • Daniel Egger - Saturday, September 20, 2014 - link

    It would be nice to rather than just talk about about the 750 Ti to also include it in comparisons to see it clearer in perspective what it means to go from Maxwell I to Maxwell II in terms of performance, power consumption, noise and (while we are at it) performance per Watt and performance per $.

    Also where're the benchmarks for the GTX 970? I sure respect that this card is in a different ballpark but the somewhat reasonable power output might actually make the GTX 970 a viable candidate for an HTPC build. Is it also possible to use it with just one additional 6 Pin connector (since as you mentioned this would be within the specs without any overclocking) or does it absolutely need 2 of them?
  • SkyBill40 - Saturday, September 20, 2014 - link

    As was noted in the review at least twice, they were having issues with the 970 and thus it won't be tested in full until next week (along with the 980 in SLI).
  • MrSpadge - Saturday, September 20, 2014 - link

    Wow! This makes me upgrade from a GTX660Ti - not because of gaming (my card is fast enough for my needs) but because of the power efficiency gains for GP-GPU (running GPU-Grid under BOINC). Thank you nVidia for this marvelous chip and fair prices!
  • jarfin - Saturday, September 20, 2014 - link

    i still CANT understand amd 'uber' option.
    its totally out of test,bcoz its just 'oc'd' button,nothing else.
    its must be just r290x and not anantech 'amd canter' way uber way.

    and,i cant help that feeling,what is strong,that anatech is going badly amd company way,bcoz they have 'amd center own sector.
    so,its mean ppl cant read them review for nvidia vs radeon cards race without thinking something that anatech keep raden side way or another.
    and,its so clear thats it.

    btw
    i hope anantech get clear that amd card R9200 series is just competition for nvidia 90 series,bcoz that every1 kow amd skippedd 8000 series and put R9 200 series for nvidia 700 series,but its should be 8000 series.
    so now,generation of gpu both side is even.

    meaning that next amd r9 300 series or what it is coming amd company battle nvidia NEXT level gpu card,NOT 900 series.

    there is clear both gpu card history for net.

    thank you all

    p.s. where is nvidia center??
  • Gigaplex - Saturday, September 20, 2014 - link

    Uber mode is not an overclock. It's a fan speed profile change to reduce thermal throttling (underclock) at the expense of noise.
  • dexgen - Saturday, September 20, 2014 - link

    Ryan, Is it possible to see the average clock speeds in different tests after increasing the power and temperature limit in afterburner?

    And also once the review units for non-reference cards come in it would be very nice to see what the average clock speeds for different cards with and without increased power limit would be. That would be a great comparison for people deciding which card to buy.
  • silverblue - Saturday, September 20, 2014 - link

    Exceptional by NVIDIA; it's always good to see a more powerful yet more frugal card especially at the top end.

    AMD's power consumption could be tackled - at least partly - by some re-engineering. Do they need a super-wide memory bus when NVIDIA are getting by with half the width and moderately faster RAM? Tonga has lossless delta colour compression which largely negates the need for a wide bus, although they did shoot themselves in the foot by not clocking the memory a little higher to anticipate situations where this may not help the 285 overcome the 280.

    Perhaps AMD could divert some of their scant resources towards shoring up their D3D performance to calm down some of the criticism because it does seem like they're leaving performance on the table and perhaps making Mantle look better than it might be as a result.
  • Luke212 - Saturday, September 20, 2014 - link

    Where are the SGEMM compute benchmarks you used to put on high end reviews?

Log in

Don't have an account? Sign up now