GPU 2016 Benchmark Suite & The Test

As this is the first high-end card release for 2016, we have gone ahead and updated our video card benchmarking suite. Unfortunately Broadwell-E launched just a bit too late for this review, so we’ll have to hold off on updating the underlying platform to Intel’s latest and greatest for a little while longer yet.

For the 2016 suite we have retained Grand Theft Auto V, Battlefield 4, and of course, Crysis 3. Joining these games are 6 new games: Rise of the Tomb Raider, DiRT Rally, Ashes of the Singularity, The Witcher 3, The Division, and the 2016 rendition of Hitman.

AnandTech GPU Bench 2016 Game List
Game Genre API(s)
Rise of the Tomb Raider Action DX11
DiRT Rally Racing DX11
Ashes of the Singularity RTS DX12
Battlefield 4 FPS DX11
Crysis 3 FPS DX11
The Witcher 3 RPG DX11
The Division FPS DX11
Grand Theft Auto V Action/Open World DX11
Hitman (2016) Action/Stealth DX11 + DX12

As was the case in 2015, the API used will be based on the best API available for a given card. Rise of the Tomb Raider and Hitman both support DirectX 11 + DirectX 12; in the case of Tomb Raider the DX12 path was until last week a regression – a new patch changed things too late for this article – and meanwhile the best API for Hitman depends on whether we’re looking at an AMD or NVIDIA card. For now Tomb Raider is benchmarked using DX11 and Hitman on both DX11 and DX12. Meanwhile Ashes of the Singularity is essentially tailor made for DirectX 12, as the first DX12 game to be designed for it as opposed to porting over a DX11 engine, so it is being run under DX12 at all times.

Meanwhile from a design standpoint our benchmark settings remain unchanged. For lower-end cards we’ll look at 1080p at various quality settings when practical, and for high-end cards we’ll be looking at 1080p and above at the highest quality settings.

The Test

As for our hardware testbed, it remains unchanged from 2015, being composed of an overclocked Core i7-4960X housed in an NZXT Phantom 630 Windowed Edition case.

CPU: Intel Core i7-4960X @ 4.2GHz
Motherboard: ASRock Fatal1ty X79 Professional
Power Supply: Corsair AX1200i
Hard Disk: Samsung SSD 840 EVO (750GB)
Memory: G.Skill RipjawZ DDR3-1866 4 x 8GB (9-10-9-26)
Case: NZXT Phantom 630 Windowed Edition
Monitor: Asus PQ321
Video Cards: NVIDIA GeForce GTX 1080 Founders Edition
NVIDIA GeForce GTX 1070 Founders Edition
NVIDIA GeForce GTX 980 Ti
NVIDIA GeForce GTX 980
NVIDIA GeForce GTX 970
NVIDIA GeForce GTX 780
NVIDIA GeForce GTX 680
AMD Radeon RX 480
AMD Radeon Fury X
AMD Radeon R9 Nano
AMD Radeon R9 390X
AMD Radeon R9 390
AMD Radeon HD 7970
Video Drivers: NVIDIA Release 368.39
AMD Radeon Software Crimson 16.7.1 (RX 480)
AMD Radeon Software Crimson 16.6.2 (All Others)
OS: Windows 10 Pro
Meet the GeForce GTX 1080 & GTX 1070 Founders Edition Cards Rise of the Tomb Raider
Comments Locked

200 Comments

View All Comments

  • patrickjp93 - Wednesday, July 20, 2016 - link

    That doesn't actually support your point...
  • Scali - Wednesday, July 20, 2016 - link

    Did I read a different article?
    Because the article that I read said that the 'holes' would be pretty similar on Maxwell v2 and Pascal, given that they have very similar architectures. However, Pascal is more efficient at filling the holes with its dynamic repartitioning.
  • mr.techguru - Wednesday, July 20, 2016 - link

    Just Ordered the MSI GeForce GTX 1070 Gaming X , way better than 1060 / 480. NVidia Nail it :)
  • tipoo - Wednesday, July 20, 2016 - link

    " NVIDIA tells us that it can be done in under 100us (0.1ms), or about 170,000 clock cycles."

    Is my understanding right that Polaris, and I think even earlier with late GCN parts, could seamlessly interleave per-clock? So 170,000 times faster than Pascal in clock cycles (less in total time, but still above 100,000 times faster)?
  • Scali - Wednesday, July 20, 2016 - link

    That seems highly unlikely. Switching to another task is going to take some time, because you also need to switch all the registers, buffers, caches need to be re-filled etc.
    The only way to avoid most of that is to duplicate the whole register file, like HyperThreading does. That's doable on an x86 CPU, but a GPU has way more registers.
    Besides, as we can see, nVidia's approach is fast enough in practice. Why throw tons of silicon on making context switching faster than it needs to be? You want to avoid context switches as much as possible anyway.

    Sadly AMD doesn't seem to go into any detail, but I'm pretty sure it's going to be in the same ballpark.
    My guess is that what AMD calls an 'ACE' is actually very similar to the SMs and their command queues on the Pascal side.
  • Ryan Smith - Wednesday, July 20, 2016 - link

    Task switching is separate from interleaving. Interleaving takes place on all GPUs as a basic form of latency hiding (GPUs are very high latency).

    The big difference is that interleaving uses different threads from the same task; task switching by its very nature loads up another task entirely.
  • Scali - Thursday, July 21, 2016 - link

    After re-reading AMD's asynchronous shader PDF, it seems that AMD also speaks of 'interleaving' when they switch a graphics CU to a compute task after the graphics task has completed. So 'interleaving' at task level, rather than at instruction level.
    Which would be pretty much the same as NVidia's Dynamic Load Balancing in Pascal.
  • eddman - Thursday, July 21, 2016 - link

    The more I read about async computing in Polaris and Pascal, the more I realize that the implementations are not much different.

    As Ryan pointed out, it seems that the reason that Polaris, and GCN as a whole, benefit more from async is the architecture of the GPU itself, being wider and having more ALUs.

    Nonetheless, I'm sure we're still going to see comments like "Polaris does async in hardware. Pascal is hopeless with its software async hack".
  • Matt Doyle - Wednesday, July 20, 2016 - link

    Typo in the lead sentence of HPC vs. Consumer: Divergence paragraph: "Pascal in an architecture that..."

    "is" instead of "in"
  • Matt Doyle - Wednesday, July 20, 2016 - link

    Feeding Pascal page, "GDDR5X uses a 16n prefetch, which is twice the size of GDDR5’s 8n prefect."

    Prefect = prefetch

Log in

Don't have an account? Sign up now