DirectX 12 vs. Mantle, Power Consumption

Although the bulk of our coverage today is going to be focused on DirectX 12 versus DirectX 11, we also wanted to take a moment to stop and look at DirectX 12 and how it compares to AMD’s Mantle. Mantle offers an interesting point of contrast, both because it has been in beta longer than DirectX 12 and because it’s an even lower-level API than DirectX 12. Since Mantle only needs to work on AMD’s GPUs and can be tweaked for AMD’s architectures, it offers AMD the chance to exploit their GPUs in a few additional ways that a common, cross-vendor API like DirectX 12 cannot.

Star Swarm - Direct3D 12 vs. Mantle (4 Cores) - Extreme Quality

With 4 cores we find that AMD achieves better results with Mantle than DirectX 12 across the board. The gains are never very great – a few percent here and there – but they are consistent and just outside our window of variability for the Star Swarm benchmark. With such a small gain there are a number of factors that could explain this outcome – better-developed drivers, a better-developed application, the further benefit of working with a known hardware platform – so we can’t credit any one factor. But it’s safe to say that at least in this one instance, at this time, Star Swarm’s Mantle rendering path produces even better results than its DirectX 12 path on AMD cards.

Star Swarm - Direct3D 12 vs. Mantle (2 Cores) - Extreme Quality

On the other hand, Mantle doesn’t seem to accommodate a two-core situation as well, with the 290X seeing a small but distinct performance regression when switching from DirectX 12 to Mantle. Though we didn’t have time to look at an AMD APU for this article, it would be interesting to see whether this regression also occurs on their 2M/4C parts; AMD is banking heavily on low-level APIs like Mantle to help level the CPU playing field with Intel, so if Mantle needs 4 CPU cores to fully spread its wings with faster cards, that might be a problem.

Star Swarm CPU Batch Submission Time (4 Cores) - D3D vs. Mantle - Extreme Quality

Diving deeper, we can see that part of the explanation for our Mantle performance regression may come from the batch submission process. DirectX 12 is unexpectedly well ahead of Mantle here, with batch submission taking on average a bit more than half as long as it does under Mantle. As batch submission times are highly correlated with CPU bottlenecking in Star Swarm, this implies that DirectX 12 would bottleneck later than Mantle in this instance. That said, since we’re so strongly GPU-bound right now, it’s not at all clear whether either API would be CPU bottlenecked any time soon.
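
To make the link between submission time and CPU bottlenecking concrete, the following is a minimal sketch rather than Star Swarm's actual instrumentation: the frame loop, the placeholder SubmitBatches/RenderRestOfFrame functions, and the frame count are all illustrative assumptions. It simply times the CPU-side submission work of each frame and reports what fraction of total frame time it consumes; as that fraction approaches 100%, a faster GPU can no longer help.

#include <chrono>
#include <cstdio>

// Minimal sketch: time the CPU-side batch submission of each frame and compare it
// against total frame time. If submission alone consumes most of the frame, the CPU
// submission path, not the GPU, is the likely bottleneck. SubmitBatches and
// RenderRestOfFrame are placeholders standing in for real engine work.
using Clock = std::chrono::steady_clock;

void SubmitBatches()     { /* record and submit command buffers (placeholder) */ }
void RenderRestOfFrame() { /* simulation, present, etc. (placeholder) */ }

int main() {
    const int kFrames = 1000;                 // measurement window (assumed)
    double submitMsTotal = 0.0, frameMsTotal = 0.0;

    for (int i = 0; i < kFrames; ++i) {
        auto frameStart = Clock::now();
        SubmitBatches();
        auto submitEnd  = Clock::now();       // submission time = frameStart..submitEnd
        RenderRestOfFrame();
        auto frameEnd   = Clock::now();

        submitMsTotal += std::chrono::duration<double, std::milli>(submitEnd - frameStart).count();
        frameMsTotal  += std::chrono::duration<double, std::milli>(frameEnd - frameStart).count();
    }

    double avgSubmitMs = submitMsTotal / kFrames;
    double avgFrameMs  = frameMsTotal / kFrames;
    std::printf("avg batch submission: %.2f ms, avg frame: %.2f ms (%.0f%% of frame time)\n",
                avgSubmitMs, avgFrameMs, 100.0 * avgSubmitMs / avgFrameMs);
    return 0;
}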

Update: Oxide Games has emailed us this evening with a bit more detail about what's going on under the hood, and why Mantle batch submission times are higher. When working with large numbers of very small batches, Star Swarm is capable of throwing enough work at the GPU such that the GPU's command processor becomes the bottleneck. For this reason the Mantle path includes an optimization routine for small batches (OptimizeSmallBatch=1), which trades GPU power for CPU power, doing a second pass on the batches in the CPU to combine some of them before submitting them to the GPU. This bypasses the command processor bottleneck, but it increases the amount of work the CPU needs to do (though note that in AMD's case, it's still several times faster than DX11).

This feature is enabled by default in our build, and the combining of those small batches is the likely reason that the Mantle path holds a slight performance edge over the DX12 path on our AMD cards. The tradeoff is that in a 2 core configuration, the extra CPU workload from the optimization pass is just enough to cause Star Swarm to start bottlenecking at the CPU again. For the time being this is a user-adjustable feature in Star Swarm, and Oxide notes that in any shipping game the small batch feature would likely be turned off by default on slower CPUs.
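
As a rough illustration of what such a combining pass can look like (this is not Oxide's code; the Batch structure, the state key, and the size threshold below are assumptions made purely for the sketch), the idea is to walk the recorded batches on the CPU and merge adjacent small ones that share render state, so the GPU's command processor sees fewer, larger submissions:

#include <cstdint>
#include <vector>

// Hypothetical small-batch combining pass, sketched for illustration only.
// Each Batch is assumed to carry a state key (a hash of pipeline state and
// bindings) and a draw count; adjacent batches with identical state that are
// both "small" get folded together before submission to the GPU.
struct Batch {
    uint64_t stateKey;   // assumed hash of pipeline state + resource bindings
    uint32_t drawCount;  // number of draws carried by this batch
};

std::vector<Batch> CombineSmallBatches(const std::vector<Batch>& in,
                                       uint32_t smallThreshold = 8) {
    std::vector<Batch> out;
    for (const Batch& b : in) {
        if (!out.empty() &&
            out.back().stateKey == b.stateKey &&
            out.back().drawCount < smallThreshold &&
            b.drawCount < smallThreshold) {
            out.back().drawCount += b.drawCount;   // merge into the previous batch
        } else {
            out.push_back(b);                      // state change or large batch: keep separate
        }
    }
    return out;   // fewer batches to submit, at the cost of this extra CPU pass
}

int main() {
    // Five recorded batches collapse to three after the pass: {1,5}, {2,50}, {2,2}.
    std::vector<Batch> recorded{{1, 2}, {1, 3}, {2, 50}, {2, 1}, {2, 1}};
    std::vector<Batch> combined = CombineSmallBatches(recorded);
    (void)combined;
    return 0;
}

The pass is pure CPU work, which lines up with Oxide's description of the tradeoff: the GPU's command processor has fewer submissions to chew through, but a 2-core CPU may not have the spare cycles to pay for the extra pass.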

Star Swarm CPU Batch Submission Time (4 Cores) - Small Batch Optimization

Star Swarm - Direct3D 12 vs. Mantle (4 Cores) - Small Batch Optimization

If we turn off the small batch optimization feature, what we find is that Mantle’s batch submission time drops nearly in half, to an average of 4.4ms. With the second pass removed, Mantle and DirectX 12 take roughly the same amount of time to submit batches in a single pass. However, as Oxide noted, there is a performance hit; the Mantle rendering path’s performance goes from being ahead of DirectX 12 to trailing it. So given sufficient CPU power to pay the price for batch optimization, that optimization can have a significant impact (16%) on performance under Mantle.

Star Swarm System Power Consumption (6 Cores)

Finally, we wanted to take a quick look at power consumption among cards and APIs. To repeat what we said earlier, Star Swarm is an imperfect, non-deterministic benchmark, and coupled with the in-development status of DirectX 12, everything here is subject to change. However, we thought this was interesting enough to include in our evaluation.

As expected, the increased throughput from DirectX 12 and Mantle drives up system power consumption. With the CPU no longer the bottleneck, the GPU never gets a chance to idle and video card power consumption ramps up to full load.

Comments

  • alaricljs - Friday, February 6, 2015

    It takes time to devise such tests, more time to validate that the test is really doing what you want, and yet more time to DO the testing... and meanwhile I'm pretty sure they're not going to just drop everything else that's in the pipeline.
  • monstercameron - Friday, February 6, 2015

    and amd knows that well but maybe nvidia should also...maybe?
  • JarredWalton - Friday, February 6, 2015

    As an owner -- an owner that actually BOUGHT my 970, even though I could have asked for one -- of a GTX 970, I can honestly say that the memory segmentation issue isn't much of a concern. The reality is that when you're running settings that are coming close to 3.5GB of VRAM use, you're also coming close to the point where the performance is too low to really matter in most games.

    Case in point: Far Cry 4, or Assassin's Creed Unity, or Dying Light, or Dragon Age: Inquisition, or pretty much any other game I've played/tested, the GTX 980 is consistently coming in around 20-25% faster than the GTX 970. In cases where we actually come close to the 4GB VRAM on those cards (e.g. Assassin's Creed Unity at 4K High or QHD Ultra), both cards struggle to deliver acceptable performance. And there are dozens of other games that won't come near 4GB VRAM that are still providing unacceptable performance with these GPUs at QHD Ultra settings (Metro: Last Light, Crysis 3, Company of Heroes 2 -- though that uses SSAA so it really kills performance at higher quality settings).

    Basically, with current games finding a situation where GTX 980 performs fine but the GTX 970 performance tanks is difficult at best, and in most cases it's a purely artificial scenario. Most games really don't need 4GB of textures to look good, and when you drop texture quality from Ultra to Very High (or even High), the loss in quality is frequently negligible while the performance gains are substantial.

    Finally, I think it's worth noting again that NVIDIA has had memory segmentation on other GPUs, though perhaps not quite at this level. The GTX 660 Ti has a 192-bit memory interface with 2GB VRAM, which means there's 512MB of "slower" VRAM on one of the channels. That's one fourth of the total VRAM and yet no one really found cases where it mattered, and here we're talking about 1/8 of the total VRAM. Perhaps games in the future will make use of precisely 3.75GB of VRAM at some popular settings and show more of an impact, but the solution will still be the same: twiddle a few settings to get back to 80% of the GTX 980 performance rather than worrying about the difference between 10 FPS and 20 FPS, since neither one is playable.
  • shing3232 - Friday, February 6, 2015

    Those people who own two 970 will not agree with you.
  • JarredWalton - Friday, February 6, 2015

    I did get a second one, thanks to Zotac (I didn't pay for that one, though). So sorry to disappoint you. Of course, there are issues at times, but that's just the way of multiple GPUs, whether it be SLI or CrossFire. I knew that going into the second GPU acquisition.

    At present, I can say that Far Cry 4 and Dying Light are not working entirely properly with SLI, and neither are Wasteland 2 or The Talos Principle. Assassin's Creed: Unity seems okay to me, though there is a bit of flicker perhaps on occasion. All the other games I've tried work fine, though by no means have I tried "all" the current games.

    For CrossFire, the list is mostly the same with a few minor additions. Assassin's Creed: Unity, Company of Heroes 2, Dying Light, Far Cry 4, Lords of the Fallen, and Wasteland 2 all have problems, and scaling is pretty poor on at least a couple other games (Lichdom: Battlemage and Middle-Earth: Shadow of Mordor scale, but more like by 25-35% instead of 75% or more).

    Overall, GTX 970 SLI and R9 290X CF are basically tied at both 4K and QHD testing in my results across quite a few games, with NVIDIA taking a slight lead at 1080p and lower. In fact for single GPUs, 290X wins on average by 10% at 4K (but neither card is typically playable except at lower quality settings), while the difference is 1% or less at QHD Ultra.
  • Cryio - Saturday, February 7, 2015

    "Overall, GTX 970 SLI and R9 290X CF are basically tied at both 4K and QHD testing in my results across quite a few games, with NVIDIA taking a slight lead at 1080p and lower."

    By virtue of *every* benchmark I've seen on the internets, on literally every game, 4K, maxed out settings, CrossFire of 290Xs are faster than both SLI 970s *and* 980s.

    In 1080p and 1440p, by all intents and purposes, 290Xs trade blows with 970s and the 980s reign supreme. But at 4K, the situation completely shifts and the 290Xs come on top.
  • JarredWalton - Saturday, February 7, 2015

    Note that my list of games is all relatively recent stuff, so the fact that CF fails completely in a few titles certainly hurts -- and that's reflected in my averages. If we toss out ACU, CoH2, DyLi, FC4, LotF... then yes, it would do better, but then I'm cherry picking results to show the potential rather than the reality of CrossFire.
  • Kjella - Saturday, February 7, 2015

    Owner of 2x970s here. Reviews show that 2x780 Ti generally wins current games at 3840x2160 with only 3GB of memory, so it doesn't seem to matter much today; I've seen no non-synthetic benchmarks at playable resolutions/frame rates to indicate otherwise. Nobody knows what future games will bring, but I would have bought them as a "3.5 GB" card too, though of course I feel a little cheated that they're worse than the 980 GTX in a way I didn't expect.
  • JarredWalton - Saturday, February 7, 2015

    I don't have 780 Ti (or 780 SLI for that matter), but interestingly GTX 780 just barely ends up ahead of a single GTX 970 at QHD Ultra and 4K High/Ultra settings. There are times where 970 leads, but the times when 780 leads are by slightly higher margins. Effectively, GTX 970 is equal to GTX 780 but at a lower price point and with less power.
  • mapesdhs - Tuesday, February 10, 2015

    That's the best summary I've read on all this IMO, i.e. situations which would demonstrate the 970's RAM issue are where performance isn't good enough anyway, typically 4K gaming, so who cares? Right now, if one wants better performance at that level, then buy one or more 980, 290X, whatever, because two of any lesser card aren't going to be quick enough by definition.

    I bought two 980s, my first all-new GPU purchase since I bought two of EVGA's infamous GTX 460 FTW cards when they first came out. Very pleased with the 980s, they're excellent cards. Bought a 3rd for benchmarking, etc.; the three combined give 8731 for Fire Strike Ultra (result no. 4024577), I believe the highest S1155 result atm, but the fps numbers still aren't really that high.

    Truth is, by the time a significant number of people will be concerned about a typical game using more than 3.5GB RAM, GPU performance needs to be a heck of a lot quicker than a 970. It's a non-issue. None of the NVIDIA-hate I've seen changes the fact that the 970 is a very nice card, and nothing changes how well it performs as shown in initial reviews. I'll probably get one for my brother's bday PC I'm building, to go with a 3930K setup.

    Most of those complaining about all this are people who IMO have chosen to believe that NVIDIA did all of this deliberately, because they want that to be the case, irrespective of what actually happened, and no amount of evidence to the contrary will change their minds. The 1st Rule gets broken again...

    As I posted elsewhere, all those complaining about the specs discrepancy do however seem perfectly happy for AMD (and indeed NVIDIA) to market dual-GPU cards as having double RAM numbers, which is completely wrong, not just misleading. Incredible hypocrisy here.

    Ian.
