Exploring DirectX 12: 3DMark API Overhead Feature Test

by Ryan Smith & Ian Cutress on March 27, 2015 8:00 AM EST

Discrete GPU Testing

We’ll kick things off with our discrete GPUs, which should present us with a best case scenario for DirectX 12 from a hardware standpoint. With the most powerful CPUs feeding the most powerful GPUs, we have both the ability to generate a massive number of draw calls and the ability to consume them in equally large numbers, so this is where DirectX 12 will be at its best.

We’ll start with a look at CPU scaling on our discrete GPUs. How much benefit do we see going from 2 to 4 and finally 6 CPU cores?

3DMark API Overhead D3D12 CPU Scaling

The answer on the CPU side is quite a lot. Whereas Star Swarm generally topped out at 4 cores – after which it was often GPU limited – we see gains all the way up to 6 cores on our most powerful cards. This is a simple but important reminder of the fact that the API Overhead Feature Test (AOFT) is a synthetic test designed specifically to push draw calls and avoid all other bottlenecks as much as possible, leading to increased CPU scalability.

With that said, it’s clear that we’re reaching the limits of our GPUs with 6 cores. While the gains from 2 to 4 cores are rather significant, the gain from 4 to 6 cores (which also comes with a slight bump in clockspeed) is much more muted, even with our most powerful cards. Meanwhile anything slower than a Radeon R9 285 is showing no real scaling from 4 to 6 cores, indicating a rough cutoff right now of how powerful a card needs to be to take advantage of more than 4 cores.

Moving on, let’s take a look at the actual API performance scaling characteristics at 6, 4, and 2 cores.

3DMark API Overhead GPU Scaling - 6 Cores

6 cores of course is a best case scenario for DirectX 12 – it’s the least likely to be CPU-bound – and we see first-hand the incredible increase in draw call throughput by switching from DirectX 11 to DirectX 12 or Mantle.

Somewhat unexpectedly, the greatest gains and the highest absolute performance are achieved by AMD’s Radeon R9 290X. As we saw in Star Swarm and continue to see here, AMD’s DirectX 11 throughput is relatively poor, topping out at 1.1M draw calls per second for both DX11ST and DX11MT. AMD simply isn’t able to push much more than that many calls through their drivers, and without real support for DX11 multi-threading (e.g. DX11 Driver Command Lists), they gain nothing from the DX11MT test.

But on the opposite side of the coin, this means they have the most to gain from DirectX 12. The R9 290X sees a 16.8x increase in draw call throughput switching from DX11 to DX12. At 18.5 million draw calls per second this is the highest draw call rate out of any of our cards, and we have good reason to suspect that we’re GPU command processor limited at this point. Which is to say that our CPU could push yet more draw calls if only a GPU existed that could consume that many calls. On a side note, 18.5M calls would break down to just over 300K calls per frame at 60fps, which is a similarly insane number compared to today’s standards, where draw calls per frame in most games are rarely over 10K.
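The per-frame arithmetic above can be reproduced with a short Python sketch. This is illustrative only: the function name `calls_per_frame` and the constant names are our own, the throughput figure is the one quoted in the text, and the 10K value is the rough "typical game" ceiling mentioned above rather than a measured number.

```python
def calls_per_frame(calls_per_second: float, fps: float) -> float:
    """Draw calls available within a single frame at a fixed frame rate."""
    return calls_per_second / fps


# R9 290X DX12 throughput quoted in the text (6-core run).
DX12_CALLS_PER_SEC = 18.5e6
# Rough per-frame draw call count for current games, per the text (an approximation).
TYPICAL_CALLS_PER_FRAME = 10_000

per_frame_60 = calls_per_frame(DX12_CALLS_PER_SEC, 60)
headroom = per_frame_60 / TYPICAL_CALLS_PER_FRAME

print(f"Draw calls per frame at 60fps: {per_frame_60:,.0f}")
print(f"Headroom over a 10K-call frame: {headroom:.0f}x")
```

The same helper can be called with 30 or 144 in place of 60 to see how the per-frame budget shifts with the target frame rate.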

Meanwhile we see a reduction in gains going from the 290X to the 285 and finally to the 7970. As we mentioned earlier we appear to be command processor limited, and each one of these progressively weaker GPUs appears to contain a similarly weaker command processor. Still, even the “lowly” 7970 can push 11.6M draw calls per second, which is a 10.5x (order of magnitude) increase in draw call performance over DirectX 11.

Mantle on the other hand presents an interesting aside. As AMD’s in-house API (and forerunner to Vulkan), the AMD cards do even better on Mantle than they do on DirectX 12. At this point the difference is somewhat academic – what are you going to do with 20.3M draw calls over 18.5M – but it goes to show that Mantle can still squeeze out a bit more at times. It will be interesting to see whether this holds as Windows 10 and the drivers are finalized, and even longer term whether these benefits are retained by Vulkan.

As for the NVIDIA cards, NVIDIA sees neither quite the awesome relative performance gains from DirectX 12 nor enough absolute performance to top the charts, but here too we see the benefits of DirectX 12 in full force. At 1.9M draw calls per second in DX11ST and 2.2M draw calls per second in DX11MT, NVIDIA starts out in a much better position than AMD does; in the latter they essentially can double AMD’s DX11MT throughput (or alternatively have half the API overhead).

Once DX12 comes into play though, NVIDIA’s throughput rockets through the roof as well. The GTX 980 sees an 8.2x increase over DX11ST, and a 7x increase over DX11MT. On an absolute basis the GTX 980 is consuming 15.5M draw calls per second (or about 250K per frame at 60fps), showing that even the best DX11 implementation can’t hold a candle to this early DirectX 12 implementation. The benefits of DirectX 12 really are that great for draw call performance.
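To put the relative gains quoted so far side by side, the ratios can be recomputed from the throughput figures in the text. The sketch below is a convenience for checking the arithmetic, not part of the benchmark; the table layout and the `gain` helper are our own, and the 7970’s DX11 figure is assumed to equal the 1.1M ceiling cited for AMD’s DX11 driver.

```python
# Draw call rates in millions per second, as quoted in the text (6-core runs).
# Assumption: the 7970's DX11 rate matches the 1.1M ceiling cited for AMD.
RESULTS = {
    "R9 290X": {"dx11st": 1.1, "dx11mt": 1.1, "dx12": 18.5},
    "7970": {"dx11st": 1.1, "dx11mt": 1.1, "dx12": 11.6},
    "GTX 980": {"dx11st": 1.9, "dx11mt": 2.2, "dx12": 15.5},
}


def gain(card: str, baseline: str) -> float:
    """Ratio of DX12 throughput to the chosen DX11 baseline for one card."""
    rates = RESULTS[card]
    return rates["dx12"] / rates[baseline]


for card in RESULTS:
    print(
        f"{card}: {gain(card, 'dx11st'):.1f}x over DX11ST, "
        f"{gain(card, 'dx11mt'):.1f}x over DX11MT"
    )
```

Laid out this way, it is easier to see that AMD’s larger multipliers come as much from a lower DX11 starting point as from a higher DX12 ceiling.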

Like AMD, NVIDIA seems to be command processor limited here. GPU-Z reports 100% GPU usage in the DX12 test, indicating that by NVIDIA’s internal metrics the card is working as hard as it can. Meanwhile though not charted, I also tested a GTX Titan X here, which achieved virtually the exact same results as the GTX 980. In the absence of more evidence to support being CPU bound, I have to assume that the GM200 GPU uses a similar command processor to the GM204 based GTX 980, leading to a similar bottleneck. Which would make some sense, as the GM200 is by all practical measurements a supersized version of GM204.

Moving down the NVIDIA lineup, we see performance decrease as we work towards the GTX 680 and GTX 750 Ti. The latter is a newer product, based on the GM107 GPU, but ultimately it is a smaller and lower performing GPU than the GTX 680. Regardless, we are hitting the lower command processor throughput limits of these cards, and seeing the maximum DX12 throughput decrease accordingly. This means that the relative gains are smaller – DX11 performance is virtually the same as GTX 980 since the CPU is the limit there – but even GTX 750 Ti sees a 3.8x increase in throughput over DX11ST.

Finally, it’s here where we’re seeing a distinct case of the DX11 test producing variable results. For the NVIDIA cards we have seen our DX11 results fluctuate between 1.4M and 1.9M draw calls per second. Of all of our runs 1.9M is more common – not to mention it’s close to the score we get on NVIDIA’s public WDDM 1.3 drivers – so it’s what we’re publishing here. However for whatever reason, 1.4M becomes more common with fewer cores even though the bottleneck was (and remains) single-core performance.

3DMark API Overhead GPU Scaling - 4 Cores

As for performance scaling with 4 cores, it’s very similar to what we saw with 6 cores. As we noted in our CPU-centric look at our data, only the fastest cards benefit from 6 cores, so the performance we see with 4 cores is quite similar to what we saw before. AMD of course still sees the greatest gains, while overall the gap between AMD and NVIDIA is compressed some.

Interestingly Mantle’s performance advantage melts away here. DirectX 12 is now the fastest API for all AMD cards, indicating that DX12 scales out better to 4 cores than Mantle, but perhaps not as well to 6 cores.

3DMark API Overhead GPU Scaling - 2 Cores

Finally with 2 cores many of our configurations are CPU limited. The baseline changes a bit – DX11MT ceases to be effective since 1 core must be reserved for the display driver – and the fastest cards have lost quite a bit of performance here. Nonetheless, the AMD cards can still hit 10M+ draw calls per second with just 2 cores, and the GTX 980/680 are close behind at 9.4M draw calls per second. Which is again a minimum 6.7x increase in draw call throughput versus DirectX 11, showing that even on relatively low performance CPUs the draw call gains from DirectX 12 are substantial.

Overall then, with 6 CPU cores in play AMD appears to have an edge in command processor performance, allowing them to sustain a higher draw call throughput than NVIDIA. That said, as we know the real world performance of the GTX 980 easily surpasses the R9 290X, which is why it’s important to remember that this is a synthetic benchmark. Meanwhile at 2 cores where we become distinctly CPU limited, AMD appears to still have an edge in DirectX 12 throughput, an interesting role reversal from their poorer DirectX 11 performance.

113 Comments

silverblue - Saturday, March 28, 2015 - link
Well, varying results aside, I've heard of scores in the region of eight million. That would theoretically (if other results are anything to go off) put it around the level of a mildly-overclocked i3 (stock about 7.5m). Definitely worth bearing in mind the more-than-six-cores scaling limitation showcased by this test - AMD's own tests show this happening to the 8350, meaning that the Mantle score - which can scale to more cores - should be higher. Incidentally, the DX11 scores seem to be in the low 600,000s with a slight regression in the MT test. I saw these 8350 figures in some comments somewhere but forgot where so I do apologise for not being able to back them up, however the Intel results can be found here:

http://www.pcworld.com/article/2900814/tested-dire...

I suppose it's all hearsay until a site actually does a CPU comparison involving both Intel and AMD processors. Draw calls are also just a synthetic; I can't see AMD's gaming performance leaping through the stratosphere overnight, and Intel stands to benefit a lot here as well.
silverblue - Saturday, March 28, 2015 - link
Sorry, stock i3 about 7.1m.
oneb1t - Saturday, March 28, 2015 - link
my fx-8320@4.7ghz + R9 290x does 14.4mil :) in mantle
Laststop311 - Friday, March 27, 2015 - link
I think AMD APU's are the biggest winner here. Since draw calls help lift cpu bottlenecks and the apu's have 4 weaker cores the lack of dx11 to be able to really utilize multi core for draw calls means the weak single threaded performance of the apus could really hold things back here. DX12 will be able to shift the bottleneck back to the igpu of the apu's for a lot of games and really help make more games playable at 1080p with higher settings or at least same settings and smoother.

If only AMD would release an updated version of the 20 cu design for the ps4 using GCN 1.3 cores + 16GB of 2nd generation 3d HBM memory directly on top that the cpu or gpu could use, not only would you have a rly fast 1080p capable gaming chip you could design radically new motherboards that omit ram slots entirely. Could have new mini itx boards that have room for more sata ports and usb headers and fan headers and more room available for vrm's and cool it with good water cooling like the thermaltake 3.0 360mm rad AIO and good TIM like the coollaboratory liquid metal ultra. Or you could even take it the super compact direction and even create a smaller board than mini-itx and turn it into an ultimate htpc. And as well as the reduced size your whole system would benefit from the massive bandwidth (1.2TB/sec) and reduced latency. The memory pool could respond in real time to add more space for the gpu as necessary and since apu's are really only for 1080p that will never be a problem. I know this will probably never happen but if it did i would 100% build my htpc with an apu like that
Laststop311 - Saturday, March 28, 2015 - link
As a side question, is there some contractual agreement that will not allow AMD to sell these large 20 cu designed APU's on the regular pc market? Does sony have exclusive rights to the chip and the techniques used to make such a large igpu? Or is it die size and cost that scares AMD from making the chip for the PC market, as there would be a much higher price compared to current apu's? I'm sure 4 excavator cores can't be much bigger than 8 jaguar so if it's doable with 8 jaguar it should be doable with 4 excavator, especially if they put it on the 16/14nm finfet node?
silverblue - Saturday, March 28, 2015 - link
I'm sure Sony would only be bothered if AMD couldn't fulfill their orders. A PC built to offer exactly the same as the PS4 would generally cost more anyway.

They can't very well go from an eight FPU design to one with two/four depending on how you look at it, even if the clocks are much higher. I think you'd need to wait for the next generation of consoles.
FriendlyUser - Saturday, March 28, 2015 - link
I really hope the developers put this to good use. I am also particularly excited about multicore scaling, since single threaded performance has stagnated (yes, even in the Intel camp).
jabber - Saturday, March 28, 2015 - link
I think this shows that AMD has got a big boost from being the main partner with Microsoft on the Xbox. It's meant that AMD got a major seat at the top DX12 table from day one for a change. I hope to see some really interesting results now that it appears finally AMD hardware has been given some optimisation love other than Intel.
Tigran - Saturday, March 28, 2015 - link
>>> Finally with 2 cores many of our configurations are CPU limited. The baseline changes a bit – DX11MT ceases to be effective since 1 core must be reserved for the display driver – and the fastest cards have lost quite a bit of performance here. None the less, the AMD cards can still hit 10M+ draw calls per second with just 2 cores, and the GTX 980/680 are close behind at 9.4M draw calls per second. Which is again a minimum 6.7x increase in draw call throughput versus DirectX 11, showing that even on relatively low performance CPUs the draw call gains from DirectX 12 are substantial. <<<

Can you please explain how this can be? I thought the main advantage of the new APIs is that the workload is spread across all CPU cores (instead of one in DX11). If so, shouldn't the performance double in 2-core mode? Why is there a 6.7x increase in draw calls instead of 2x?
Tigran - Saturday, March 28, 2015 - link
Just to make it clear: I know there is such an advantage of Mantle and DX12 as direct addressing of the GPU, w/o the CPU. But this test is about draw calls, requested from CPU to GPU. How can we boost the number of draw calls apart from using an additional CPU core?
