Choosing a Gaming CPU October 2013: i7-4960X, i5-4670K, Nehalem and Intel Update
by Ian Cutress on October 3, 2013 10:05 AM EST
The Importance of Data
In order to keep consistency, I want this article to contain all the information we had in the previous article rather than just reference back – I personally find applying statistics to the data we obtain (and how we obtain it) very important. The new CPUs will be highlighted, and any adjustments to our conclusions will also be published. I also want to answer some of the questions raised from our previous Gaming CPU articles.
Where to Begin?
One question when building or upgrading a gaming system is which CPU to choose - does it matter if I have a quad core from Intel, or a quad module from AMD? Perhaps something simpler will do the trick, and I can spend the difference on the GPU. What if you are running a multi-GPU setup, does the CPU have a bigger effect? This was the question I set out to help answer.
A few things before we start:
For the sake of expediency I could not select 10 different gaming titles across a variety of engines and then test them in seven or more different configurations per game and per CPU, nor could I test every different CPU made. As a result, on the gaming side, I limited myself to one resolution, one set of settings, and four very regular testing titles that offer time demos: Metro2033, Dirt 3, Civilization V and Sleeping Dogs. This is obviously not Skyrim, Battlefield 3, Crysis 3 or Far Cry 3, which may be more relevant in your set up.
The arguments for and against time demo testing as well as the arguments for taking FRAPs values of sequences are well documented (time demos might not be representative vs. consistency and realism of FRAPsing a repeated run across a field), however all of our tests can be run on home systems to get a feel for how a system performs. Below is a discussion regarding AI, one of the common usages for a CPU in a game, and how it affects the system. Out of our benchmarks, Dirt3 plays a game, including AI in the result, and the turn-based Civilization V has no concern for direct AI except for time between turns.
All this combines with my unique position as the motherboard senior editor here at AnandTech – the position gives me access to a wide variety of motherboard chipsets, lane allocations and a fair number of CPUs. GPUs are not necessarily in large supply on my side of the reviewing area, but ASUS and ECS have provided my test beds with HD7970s and GTX580s respectively, and these cards have been an essential part of my test bed for almost two years. The task set before me in this review would be almost a career in itself if we were to expand to more GPUs and more multi-GPU setups. Thus testing up to 4x 7970 and up to 2x GTX 580 is a more than reasonable place to start.
Where It All Began
The most important point to note is how this set of results came to pass. In 2012 I came across a few sets of testing by other review websites that floored me – simple CPU comparison tests for gaming which were spreading like wildfire among the forums, and some results contradicted the general prevailing opinion on the topic. These results were pulling all sorts of lurking forum users out of the woodwork to have an opinion, and being the well-adjusted scientist I am, I set forth to confirm the results were, at least in part, valid. What came next was a shock – some of the previous results posted online had no real explanation of the hardware setups. While the basic overview of hardware was supplied, there was no run down of settings used, and no attempt to justify the findings which had obviously caused quite a stir. Needless to say, I felt stunned by the lack of verbose testing, as well as by both the results and a lot of the conversation that followed, particularly from avid fans of Team Blue and Team Red. I planned to right this wrong the best way I know how – with science!
The other reason for pulling together the results in this article is perhaps the one I originally started with – the need to update drivers every so often. From Ivy Bridge release to Haswell I have been using Catalyst 12.3 and GeForce 296.10 WHQL on my test beds, even though the latest drivers were 13.1 and 310.90. This causes problems – older drivers are not optimized, readers sometimes complain if older drivers are used, and new games cannot be added to the test bed because they might not scale correctly due to the older drivers. So while there are some reviews on the internet that update drivers between tests and keep the old numbers (leading to skewed results), taking time out to retest a number of platforms for more data points solely on the new drivers is a large undertaking. For example, testing new drivers over six platforms (CPU/motherboard combinations) would mean: six platforms, four games, seven different GPU configurations, ~10 minutes per test plus 2+ hours to set up each platform and install a new OS/drivers/set up benchmarks. That makes 20-40+ hours of solid testing (if all goes well), or up to a full working week per CPU – more if I also test the CPU performance for a computational benchmark update, or exponentially more if I include multiple resolutions and setting options. If this is all that is worked on that week, it means no new content – so it happens rarely, perhaps once a year or before a big launch. This time was now, and when I started this testing, I was moving to Catalyst 13.1 and GeForce 310.90, which by the time the first part of the review went live had already been superseded! Now in the official Part 2, we present the testing results from 49 different CPU and motherboard combinations.
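As a rough check on that estimate, the arithmetic can be sketched in a few lines. The counts and times are the ones quoted above; the function name, and the `runs_per_test` parameter for repeated runs, are my own assumptions for illustration.

```python
def retest_hours(platforms=6, games=4, gpu_configs=7,
                 minutes_per_test=10, setup_hours=2.0, runs_per_test=1):
    """Estimate the hours needed to retest every platform on new drivers."""
    tests_per_platform = games * gpu_configs * runs_per_test
    bench_hours = tests_per_platform * minutes_per_test / 60
    per_platform = bench_hours + setup_hours
    return per_platform, per_platform * platforms

per_platform, total = retest_hours()
print(f"{per_platform:.1f} h per platform, {total:.0f} h for all platforms")

# If every test is repeated four times, one platform alone takes far longer.
per_platform_repeated, _ = retest_hours(runs_per_test=4)
print(f"{per_platform_repeated:.1f} h per platform with four runs per test")
```

Read as a single pass, six platforms come to about 40 hours; if each test is repeated four times, a single platform takes about 20 hours on its own. Either reading lands in the range quoted above, and neither allows for failed runs.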
Some initial AMD Testing from Part 1
What Does the CPU do in a Game?
A lot of game developers use customized versions of game engines, such as the EGO engine for driving games or the Unreal engine. The engine provides the underpinnings for a lot of the code, and the optimizations therein. The engine also decides what in the game gets offloaded onto the GPU.
Imagine the code that makes up the game as a linear sequence of events in order. In order to go through the game quickly, we need the fastest single core processor available. Of course, games are not like this – lots of the game can be parallelized, such as vector calculations for graphics. These vector calculations were of course the first to be moved from CPU to the GPU. Over time, more parts of the code have made the move – physics and compute being the main features in recent months and years.
The GPU is good at independent, simple tasks – calculating which color is in which pixel is an example of this, along with additional processing and post-processing features (FXAA and so on). If a task is linear, it lives on the CPU, such as loading textures into memory or negotiating which data to transfer between the memory and the GPUs. The CPU also takes control of independent complex tasks, as the CPU is the one that can perform complicated logic analysis.
Very few parts of a game come under this heading of ‘independent yet complex’. Anything suitable for the GPU but not ported over will be here, and the big one usually quoted is artificial intelligence. Deciding where an NPC is going to run, shoot or fly could be considered a very complex set of calculations, ideal for fast CPUs. The counter argument is that games have had complex AI for years – the number of times I personally was destroyed by a Dark Sim on Perfect Dark on the N64 is testament to either my uselessness or the fact that complex AI can be configured with not much CPU power. AI is unlikely to be a limiting factor in frame rates due to CPU usage.
What is most likely going to be the limiting factor is how the CPU can manage data. As engines evolve, they try to move data between the CPU, memory and GPUs less – if textures can be kept on the GPU, then they will stay there. But some engines are not as perfect as we would like them to be, resulting in the CPU as the limiting factor. As CPU performance increases, and those that write the engines in which games are made understand the ecosystem, CPU performance should be less of an issue over time. All roads point towards the PS4 of course, and its 8-core Jaguar processor. Is this all that is needed for a single GPU, albeit in an HSA environment?
Another angle I wanted to test beyond most other websites is multi-GPU. There is content online dealing mostly with single GPU setups, with a few for dual GPU. Even though the number of multi-GPU users is actually quite small globally, the enthusiast markets are clearly geared for it. We get motherboards with support for 4 GPU cards; we have cases that will support a dual processor board as well as four double-height GPUs. Then there are GPUs being released with two sets of silicon on a PCB, wrapped in a double or triple height cooler. More often than not on a forum, people will ask ‘what GPU for $xxx’ and some of the suggestions will be towards two GPUs at half the budget each, as this commonly offers more performance than a single GPU if the game and the drivers all work smoothly (at the cost of power, heat, and bad driver scenarios). The ecosystem supports multi-GPU setups, so I felt it right to test at least one four-way setup. Although with great power comes great responsibility – there was no point testing 4-way 7970s at 1080p. Typically in this price bracket, users will go for multi-monitor setups, along the lines of 5760x1080, or big monitor setups like 1440p, 1600p, or the mega-rich might try 4K. Ultimately the high end enthusiast, with cash to burn, is going to gravitate towards 4K, and I cannot wait until that becomes a reality. So for a median point in all of this, we are testing at 1440p and maximum settings. This will put the strain on our Core2Duo and Celeron G465 samples, but should be easy pickings for our multi-processor, multi-GPU beast of a machine.
A Minor Problem In Interpreting Results
Throughout testing for this review, there were clearly going to be some issues to consider. Chief among these is consistency, in particular when something like Metro 2033 decides to have an ‘easy’ run which reports 3% higher than normal. For that specific example we get around this by double testing, as the easy run typically appears in the first batch – so we run two or three batches of four and disregard the first batch.
The other, perhaps bigger, issue is interpreting results. If I get 40.0 FPS on a Phenom II X4-960T, 40.1 FPS on an i5-2500K, and then 40.2 FPS on a Phenom II X2-555 BE, does that make the results invalid? The important points to recognize here are statistics and system state.
- System State: We have all had times when booting a PC and it feels sluggish, but this sluggish behavior disappears on reboot. The same thing can occur with testing, and usually happens as a result of bad initialization or a bad cache optimization routine at boot time. As a result, we try and spot these circumstances and re-run. With more time we would take 100 different measurements of each benchmark, with reboots, and cross out the outliers. Time constraints outside of academia unfortunately do not give us this opportunity.
- Statistics: System state aside, frame rate values will often fluctuate around an average. This will mean (depending on the benchmark) that the result could be +/- a few percentage points on each run. So what happens if you have a run of 4 time demos, and each of them is 2% above the ‘average’ FPS? From the outside, as you will not know the true average, you cannot say if it is valid as the data set is extremely small. If we take more runs, we can find the variance (the technical version of the term), the standard deviation, and perhaps represent the mean, median and mode of a set of results. As always, the main constraint in articles like these is time – the quicker to publish, the less testing, the larger the error bars and the higher the likelihood that some results are going to be skewed because it just so happened to be a good/bad benchmark run. So the example given above of the X2-555 getting a better result is down to interpretation – each result might be +/- 0.5 FPS on average, and because they are all pretty similar we are actually more GPU limited. So it is more a question of whether the GPU has a good/bad run in this circumstance.
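To make the statistics point concrete, here is a minimal sketch of comparing the gap between two CPUs against the run-to-run spread. The FPS values are invented for illustration and are not measurements from this article, and the comparison is a crude rule of thumb rather than a formal significance test.

```python
import statistics

def summarize(runs):
    """Return the mean, median and sample standard deviation of FPS runs."""
    return {
        "mean": statistics.mean(runs),
        "median": statistics.median(runs),
        "stdev": statistics.stdev(runs),  # sample std dev, needs >= 2 runs
    }

def distinguishable(runs_a, runs_b):
    """Crude check: is the gap between means larger than the combined spread?"""
    a, b = summarize(runs_a), summarize(runs_b)
    gap = abs(a["mean"] - b["mean"])
    return gap > a["stdev"] + b["stdev"]

# Hypothetical FPS values, invented for illustration only.
i5_2500k = [40.1, 39.7, 40.5, 40.1]
x2_555 = [40.2, 40.6, 39.8, 40.2]

print(summarize(i5_2500k))
print(distinguishable(i5_2500k, x2_555))  # False: the gap is inside the noise
```

With only four runs per CPU the standard deviation itself is poorly known, which is exactly the small-sample problem described above.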
For this example, I batched 100 runs of my common WinRAR test in motherboard testing, on an i5-2500K CPU with a Maximus V Formula. Results varied between 71 seconds and 74 seconds, with a large gravitation towards the lower end. To represent this statistically, we normally use a histogram, which separates the results into ‘bins’ (e.g. 71.00 seconds to 71.25 seconds), with the bin size setting how finely the results are resolved. Here is an initial representation of the data (time vs. run number), and a few histograms of that data, using bin sizes of 1.00 s, 0.75 s, 0.5 s, 0.33 s, 0.25 s and 0.1 s.
As we get down to the lower bin sizes, there is a pair of large groupings of results between ~71 seconds and ~ 72 seconds. The overall average/mean of the data is 71.88 due to the outliers around 74 seconds, with the median at 72.04 seconds and standard deviation of 0.660. What is the right value to report? Overall average? Peak? Average +/- standard deviation? With the results very skewed around two values, what happens if I do 1-3 runs and get ~71 seconds and none around ~72 seconds?
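The binning described above can be sketched as follows. The ten timings are invented stand-ins for the real 100-run data set, and the function assumes timings are recorded to 0.01 s so that it can work in integer hundredths of a second and avoid floating point edge effects.

```python
from collections import Counter

def histogram(times, bin_size, start):
    """Count results per bin; each key is the lower edge of a bin in seconds."""
    start_cs = round(start * 100)    # work in hundredths of a second
    bin_cs = round(bin_size * 100)
    counts = Counter()
    for t in times:
        index = (round(t * 100) - start_cs) // bin_cs
        lower_edge = (start_cs + index * bin_cs) / 100
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))

# Hypothetical WinRAR timings in seconds, invented for illustration only.
times = [71.05, 71.10, 71.20, 71.30, 71.95, 72.00, 72.10, 72.05, 73.90, 71.15]

for size in (1.00, 0.25):
    print(size, histogram(times, size, start=71.0))
```

With 1.00 s bins the results look like one lump at 71 seconds; at 0.25 s the separate groupings near 71 and 72 seconds become visible, which is why the choice of bin size matters.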
Statistics is clearly a large field, and without a large sample size, most numbers can be one-off results that are not truly reflective of the data. It is important to ask yourself every time you read a review with a result – how many data points went into that final value, and what analysis was performed?
For this review, we typically take 4 runs of our GPU tests each, except Civilization V which is extremely consistent +/- 0.1 FPS. The result reported is the average of those four values, minus any results we feel are inconsistent. At times runs have been repeated in order to confirm the value, but this will not be noted in the results.
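As a sketch of that reporting step: average the runs after dropping any that look inconsistent. The article does not define "inconsistent" numerically, so the 3% tolerance around the median below is purely an assumption of mine, as are the function name and the sample values.

```python
import statistics

def reported_fps(runs, tolerance=0.03):
    """Average the runs, dropping any further than `tolerance` from the median.

    The 3% tolerance is an illustrative assumption, not the article's rule.
    """
    mid = statistics.median(runs)
    kept = [r for r in runs if abs(r - mid) <= tolerance * mid]
    return statistics.mean(kept), kept

# Hypothetical runs, invented for illustration: one run is clearly off.
average, kept = reported_fps([62.1, 61.8, 62.3, 55.0])
print(f"{average:.1f} FPS from {len(kept)} of 4 runs")
```

Using the median as the reference point means a single bad run cannot drag the reference towards itself, which a plain mean would allow.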
Reporting the Minimum FPS
A lot of readers have noted in the past that they would like to see minimum FPS values. The minimum FPS is a good measure to point to for the sake of ‘the worst gameplay experience’, but even with our testing, it would be an effort to go back and retest all scenarios and report it. I know a lot of websites do report minimum FPS, but it is important to realize that:
In a test that places AI in the center of the picture, it can be difficult to remain consistent. Take for example a run of Dirt 3 – this runs a standard race with several AI cars in which anything can happen. If in one of the runs there is a big six-car crash, lots of elements will be going on, resulting in a severe dip in FPS. In this run I get a minimum 6 FPS, whereas in others I get a minimum ~40 FPS. Which is the right number to report? Technically it would be 6 FPS, but then for any CPU that did not have a big crash pile-up, it would look better when theoretically it has not been put to the test.
If I had the time to run 100 tests of each benchmark, I would happily provide histograms of data representing how often the minimum FPS value fluctuated between runs. But that just is not possible when finding a balance between complete testing and releasing results for you all to see.
Many sites do offer a plot of FPS against time, to show what the average FPS looks like, where the dips are, and how bad the ‘minimum FPS’ value actually is. In reality, this data set has a large amount of adjacent point averaging, meaning that the FPS reported is actually the average FPS over the last 50-200 frames. If we were going for exact FPS, the time taken to render each frame would cause some of the data to jump about, especially in high pressure scenarios. In this regard, it is always important to question (especially if it is not specifically stated) how the benchmark software obtains the FPS data.
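A minimal sketch of the difference between per-frame FPS and adjacent point averaging follows. The frame-time trace is invented for illustration, and real benchmark tools may smooth differently (by time window rather than frame count, for example); the window of 50 frames is taken from the low end of the range mentioned above.

```python
from collections import deque

def instantaneous_fps(frame_times_ms):
    """FPS implied by each individual frame time."""
    return [1000.0 / t for t in frame_times_ms]

def smoothed_fps(frame_times_ms, window=50):
    """FPS averaged over the last `window` frames (adjacent point averaging)."""
    recent = deque(maxlen=window)
    out = []
    for t in frame_times_ms:
        recent.append(t)
        out.append(1000.0 * len(recent) / sum(recent))
    return out

# Hypothetical trace: steady 60 FPS with a single 100 ms stutter in the middle.
trace = [1000 / 60] * 100 + [100.0] + [1000 / 60] * 100

print(min(instantaneous_fps(trace)))        # the stutter shows as 10 FPS
print(min(smoothed_fps(trace, window=50)))  # smoothing hides most of the dip
```

The same stutter reads as a 10 FPS minimum frame by frame but only dips into the mid-50s once averaged over 50 frames, so two tools can report very different minimums from identical gameplay.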
While I admit that the time-demo benchmarks that are not AI dependent as such will have a more regular minimum FPS, the average FPS result allows inconsistencies within the run to be averaged out. Ideally perhaps we should be reporting the standard deviation (which would help eliminate those stray ultra-low FPS values), but that brings its own cavalcade of issues, such as whether the run is mainly higher than average or lower than average, and the data will most likely follow not a normal distribution but a skewed one.
Nevertheless, due to the requests, I will endeavor to report our minimum FPS data when this article gets a new driver and GPU update in 2014. Due to the level of testing already performed, the minimum FPS obtained from this point on would contain a lot of holes and I would not feel comfortable reporting patchy data. Stay tuned for our next driver update (also game update) for this data.
While FCAT is a great way to test frame rates, it needs to be set up accordingly and getting data is not a simple run and gun for benchmark results as one would like – even more complicated in terms of data retrieval and analysis than FRAPS, which personally I tend not to touch with a barge pole. While I understand the merits of such a system, it would be ideal if a benchmark mode used FCAT in its own overlay to report data.
Why Test at 1440p? Most Gamers play at 1080p!
Obviously one resolution is not a catch all situation. There will be users on the cheapest 1080p screen money can buy, and those using tri-monitor setups who want peak performance. Having a multi-GPU test at 1080p is a little strange, personally, and ideally for those high end setups you really need to be pushing the pixels. While 1440p is not the de-facto standard, it provides an ideal mid-point in analysis. Take for example the Steam survey:
What we see is 30.73% of gamers running at 1080p, but 4.16% of gamers are above 1080p (1.25% above 1200p). If that applies to all of the 4.6 million gamers currently on Steam, we are talking about ~200,000 individuals with setups bigger than 1080p playing games on Steam right now (~57,500 bigger than 1200p), who may or may not have to run at a lower resolution to get frame rates.
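The head-count estimate is simple enough to check directly. The percentages and the 4.6 million figure are those quoted above; the function name is my own.

```python
def gamers_above(total_users, share_percent):
    """Number of users represented by a survey share given in percent."""
    return round(total_users * share_percent / 100)

steam_users = 4_600_000  # concurrent Steam users quoted in the text

print(gamers_above(steam_users, 4.16))  # above 1080p
print(gamers_above(steam_users, 1.25))  # above 1200p
```

The first figure comes to a little over 191,000, which is where the rounded ~200,000 comes from, and the second matches the ~57,500 quoted.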
So 1080p is still the mainstay for gamers at large, but there is movement afoot to multi-monitor and higher resolution monitors. As a random point of data, personally my gaming rig does have a 1080p screen, but that is only because my two 1440p Korean panels are used for AnandTech review testing, such as this article.
I do have a desire to push this test fully into 4K when I can get my hands on a 4K panel, despite the potential lack of immediate relevance in modern gaming. The push towards higher resolutions in the monitor space is happening, slowly but surely.
The Bulldozer Challenge
Another purpose of this article was to tackle the problem surrounding Bulldozer and its derivatives, such as Piledriver and thus all Trinity and Richland APUs. The architecture is such that Windows 7, by default, does not accurately assign new threads to new modules – the ‘freshly installed’ stance is to double up on threads per module before moving to the next. By installing a pair of Windows Updates (which do not show in Windows Update automatically), we get an effect called ‘core parking’, which assigns the first series of threads each to its own module, giving each access to a pair of INT units and an FP unit, rather than having pairs of threads competing for the prize. This affects variable threaded loading the most, particularly from 2 to 2N-2 threads where N is the number of modules in the CPU (thus 2 to 6 threads in an FX-8150). It should come as no surprise that games fall into this category, so we want to test with and without the core parking updates in our benchmarks.
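The 2 to 2N-2 range can be reproduced with a toy model of the two assignment policies. This is a simplified sketch for illustration, not how the Windows scheduler is implemented, and the function names are my own.

```python
def assign_threads(threads, modules, spread):
    """Return how many threads land on each module.

    spread=False: fill both cores of a module before using the next
                  (the default behaviour described above).
    spread=True:  give each module one thread before doubling up
                  (the behaviour after the updates).
    """
    load = [0] * modules
    for i in range(threads):
        module = i % modules if spread else (i // 2) % modules
        load[module] += 1
    return load

def shared_modules(threads, modules, spread):
    """Number of modules where two or more threads share resources."""
    return sum(1 for n in assign_threads(threads, modules, spread) if n >= 2)

modules = 4  # four modules, as in an FX-8150
affected = [t for t in range(1, 2 * modules + 1)
            if shared_modules(t, modules, False) != shared_modules(t, modules, True)]
print(affected)  # thread counts where the two policies differ
```

For four modules the two policies differ at 2 through 6 threads: with one thread nothing is shared either way, and at 7 or 8 threads the modules are doubled up regardless of policy.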
Hurdles with NVIDIA and 3-Way SLI on Ivy Bridge/Haswell
Users who have been keeping up to date with motherboard options on Z77/Z87 will understand that there are several ways to put three PCIe slots onto a motherboard. The majority of sub-$250 motherboards will use three PCIe slots in a PCIe 3.0 x8/x8 + PCIe 2.0 x4 arrangement (meaning x8/x8 from the CPU and x4 from the chipset), allowing either two-way SLI or three-way Crossfire. Some motherboards will use a different lane allocation option such that we have a PCIe 3.0 x8/x4/x4 layout, giving three-way Crossfire but only two-way SLI. In fact in this arrangement, fitting the final x4 with a sound/raid card disables two-way SLI entirely.
This is due to a not widely publicized requirement of SLI – it needs at least an x8 lane allocation in order to work (either PCIe 2.0 or 3.0). Anything less than this on any GPU and you will be denied in the software. So putting in that third card will cause the second slot to drop to x4, disabling two-way SLI. There are motherboards that have a switch to change to x8/x8 + x4 in this scenario, but we are still capped at two-way SLI.
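The rule as stated can be written down as a one-line check. This sketch encodes only the x8 minimum described above; it is not NVIDIA's driver logic, the function name is hypothetical, and the lane lists are illustrative layouts rather than specific boards.

```python
MIN_SLI_LANES = 8  # per-GPU minimum stated in the text

def sli_allowed(gpu_lanes):
    """True if every GPU in the proposed SLI group has at least x8."""
    return len(gpu_lanes) >= 2 and all(lanes >= MIN_SLI_LANES for lanes in gpu_lanes)

# Lane widths for the GPUs in the proposed SLI group (illustrative layouts).
layouts = {
    "two cards at x8/x8": [8, 8],
    "three cards at x8/x4/x4": [8, 4, 4],
    "two cards at x8/x4 after the third slot is filled": [8, 4],
    "three cards at x8/x8/x8 (assumed layout behind a PCIe switch)": [8, 8, 8],
}

for name, lanes in layouts.items():
    print(f"{name}: {'SLI ok' if sli_allowed(lanes) else 'SLI refused'}")
```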
The only way to go to 3-way or 4-way SLI is via a PLX 8747 enabled motherboard, which greatly increases the cost of a motherboard build. This should be kept in mind when dealing with the final results.
It has come to my attention that even if the results were to come out X > Y, some users may call out that the better processor draws more power, which at the end of the day costs more money if you add it up over a year. For the purposes of this review, we are of the opinion that if you are gaming on a budget, then high-end GPUs such as the ones used here are not going to be within your price range. Simple fun gaming can be had on a low resolution, limited detail system for not much money – for example at a recent LAN I went to I enjoyed 3-4 hours of TF2 fun on my AMD netbook with integrated HD3210 graphics, even though I had to install the ultra-low resolution texture pack and mods to get 30+ FPS. But I had a great time, and thus the beauty of high definition graphics of the bigger systems might not be of concern as long as the frame rates are good. But if you want the best, you will pay for the best, even if it comes at a cost in electricity. Budget gaming is fine, but this review is designed to focus on high resolutions with maximum settings, which is not a budget gaming scenario.