One side I like to exploit on CPUs is the ability to compute, and whether a variety of mathematical loads can stress the system in a way that real-world usage might not.  For these benchmarks we use ones developed for testing MP servers and workstation systems back in early 2013, such as grid solvers and Brownian motion code.  Please head over to the first of such reviews, where the mathematics and small snippets of code are available.

3D Movement Algorithm Test

The algorithms in 3DPM employ either uniform or normally distributed random number generation, and vary in the number of trigonometric operations, conditional statements, generation and rejection steps, fused operations, etc.  The benchmark runs through six algorithms for a specified number of particles and steps, calculates the speed of each algorithm, then sums them all for a final score.  This is an example of a real-world situation that a computational scientist may find themselves in, rather than a pure synthetic benchmark.  The benchmark is also parallel between the particles simulated, and we test the single-threaded performance as well as the multi-threaded performance.  Results are expressed in millions of particles moved per second, and a higher number is better.
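
As a rough illustration of the style of work involved, below is a minimal single-threaded sketch of one particle-movement method built from normally distributed deviates. This is not the benchmark's actual code: the particle count, step count and the millions-of-moves-per-second tally are placeholder choices for illustration.

```cpp
// Minimal sketch of a 3DPM-style particle step: each particle takes a unit-length
// step in a random direction generated from three normal deviates (one of several
// possible generation methods). All parameters are illustrative, not the article's.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Particle { double x = 0.0, y = 0.0, z = 0.0; };

int main() {
    const std::size_t particles = 10000;   // assumed values, for illustration only
    const std::size_t steps     = 1000;

    std::mt19937_64 rng(12345);
    std::normal_distribution<double> gauss(0.0, 1.0);

    std::vector<Particle> p(particles);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) {
        for (auto& q : p) {
            // Three normal deviates give an isotropic direction once normalised.
            double dx = gauss(rng), dy = gauss(rng), dz = gauss(rng);
            double inv = 1.0 / std::sqrt(dx * dx + dy * dy + dz * dz);
            q.x += dx * inv;
            q.y += dy * inv;
            q.z += dz * inv;
        }
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double mps  = particles * static_cast<double>(steps) / secs / 1e6;
    std::printf("%.2f million particle moves per second\n", mps);
    return 0;
}
```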

Single threaded results:

For software that deals with one particle movement at a time and then discards it, there are very few memory accesses that go beyond the caches into main DRAM.  As a result, we see little differentiation between the memory kits, except perhaps the loose automatic setting at 3000 C12 causing a small decline.

Multi-Threaded:

With all the cores loaded, the caches should be more stressed with data to hold, although in the 3DPM-MT test we see less than a 2% difference in the results and no correlation that would suggest a direction of consistent increase.

N-Body Simulation

When a series of heavy mass elements are in space, they interact with each other through the force of gravity.  Thus when a star cluster forms, the interaction of every large mass with every other large mass defines the speed at which these elements approach each other.  When dealing with millions and billions of stars on such a large scale, the movement of each of these stars can be simulated through the physical theorems that describe the interactions.  The benchmark detects whether the processor is SSE2 or SSE4 capable and implements the relevant code path.  We run a simulation of 10240 particles of equal mass - the output for this code is in terms of GFLOPs, and the result recorded is the peak GFLOPs value.
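
For a sense of what such a time step involves, here is a stripped-down sketch of one all-pairs gravitational update. The initial layout, softening term, time step and the 20-flops-per-pair accounting are assumptions for illustration, and the SSE2/SSE4 dispatch the benchmark performs is omitted.

```cpp
// Rough sketch of one all-pairs gravitational step of the kind an N-body benchmark
// times. Masses are equal and folded into the time step; values are placeholders.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 10240;                 // particle count used in the article
    const float dt = 1e-3f, soft = 1e-4f;

    std::vector<float> x(n), y(n), z(n), vx(n, 0.f), vy(n, 0.f), vz(n, 0.f);
    for (int i = 0; i < n; ++i) {        // crude initial layout along a line
        x[i] = static_cast<float>(i); y[i] = 0.f; z[i] = 0.f;
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        float ax = 0.f, ay = 0.f, az = 0.f;
        for (int j = 0; j < n; ++j) {
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + soft;   // softened distance
            float inv  = 1.0f / std::sqrt(r2);
            float inv3 = inv * inv * inv;
            ax += dx * inv3; ay += dy * inv3; az += dz * inv3;
        }
        vx[i] += ax * dt; vy[i] += ay * dt; vz[i] += az * dt;
    }
    auto t1 = std::chrono::steady_clock::now();

    // ~20 floating point operations per pair is a common bookkeeping convention.
    double flops = 20.0 * double(n) * double(n);
    double secs  = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.2f GFLOPs\n", flops / secs / 1e9);
    return 0;
}
```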

Despite the co-interaction of many particles, the fact that a simulation of this scale can hold them all in the caches between time steps means that memory has no effect on the simulation.

Grid Solvers - Explicit Finite Difference

For any grid of regular nodes, the simplest way to calculate the next time step is to use the values of those around it.  This makes for easy mathematics and parallel simulation, as each node calculated is only dependent on the previous time step, not on the nodes around it at the current calculated time step.  By choosing a regular grid, we reduce the levels of memory access required compared to irregular grids.  We test both 2D and 3D explicit finite difference simulations with 2^n nodes in each dimension, using OpenMP as the threading operator in single precision.  The grid is isotropic and the boundary conditions are sinks.  We iterate through a series of grid sizes, and results are shown in terms of ‘million nodes per second’ where the peak value is given in the results – higher is better.
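
A cut-down 2D version of such an explicit step might look like the sketch below. The grid size, diffusion coefficient and step count are illustrative rather than the benchmark's actual parameters; the OpenMP pragma mirrors the threading described above.

```cpp
// Sketch of a 2D explicit finite difference step on a regular, isotropic grid with
// sink (zero) boundaries. Every new value depends only on the previous time step.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 10;                // 2^10 nodes per dimension (illustrative)
    const float alpha = 0.2f;             // inside the explicit stability limit
    std::vector<float> cur(n * n, 0.f), next(n * n, 0.f);
    cur[(n / 2) * n + n / 2] = 1.0f;      // a single hot node in the middle

    for (int step = 0; step < 100; ++step) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            for (int j = 1; j < n - 1; ++j) {
                int k = i * n + j;
                // Five-point stencil over the previous time step only.
                next[k] = cur[k] + alpha * (cur[k - 1] + cur[k + 1] +
                                            cur[k - n] + cur[k + n] - 4.0f * cur[k]);
            }
        }
        cur.swap(next);                   // boundaries stay at zero: they act as sinks
    }
    std::printf("centre value after 100 steps: %g\n", cur[(n / 2) * n + n / 2]);
    return 0;
}
```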

Two-Dimensional Grid:

In 2D we get a small bump at 1600 C9 in terms of calculation speed, with all other results being fairly equal.  Statistically this would be an outlier, although the result seemed repeatable.

Three Dimensions:

In three dimensions, the memory jumps required to access new rows of the simulation are far greater, resulting in L3 cache misses and accesses into main memory when the simulation is large enough.  At this boundary it seems that low CAS latencies work well, as do memory speeds > 2400 MHz.  2400 C12 seems a surprising result.

Grid Solvers - Implicit Finite Difference + Alternating Direction Implicit Method

The implicit method takes a different approach to the explicit method – instead of considering one unknown in the new time step to be calculated from known elements in the previous time step, we consider that an old point can influence several new points by way of simultaneous equations.  This adds to the complexity of the simulation – the grid of nodes is solved as a series of rows and columns rather than points, reducing the parallel nature of the simulation by a dimension and drastically increasing the memory requirements of each thread.  The upside, as noted above, is that the stability rules related to time steps and grid spacing are less stringent.  For this we simulate a 2D grid of 2^n nodes in each dimension, using OpenMP in single precision.  Again our grid is isotropic with the boundaries acting as sinks.  We iterate through a series of grid sizes, and results are shown in terms of ‘million nodes per second’ where the peak value is given in the results – higher is better.
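
To make the row-by-row structure concrete, below is a hedged sketch of the x-direction sweep of an alternating direction implicit step, with each interior row reduced to a tridiagonal system solved by the Thomas algorithm. The coefficient r and the grid size are placeholders, not the benchmark's exact scheme, but the per-thread working set it implies is why the memory requirements per thread climb.

```cpp
// Sketch of the x-direction half of an ADI step: every interior row becomes its own
// tridiagonal system, solved with the Thomas algorithm. Values are illustrative.
#include <cstdio>
#include <vector>

// Solve a tridiagonal system with constant lower (a), diagonal (b) and upper (c) bands.
static void thomas(std::vector<float>& d, float a, float b, float c) {
    const std::size_t n = d.size();
    std::vector<float> cp(n);
    cp[0] = c / b;
    d[0] /= b;
    for (std::size_t i = 1; i < n; ++i) {        // forward elimination
        float m = 1.0f / (b - a * cp[i - 1]);
        cp[i] = c * m;
        d[i] = (d[i] - a * d[i - 1]) * m;
    }
    for (std::size_t i = n - 1; i-- > 0; )       // back substitution
        d[i] -= cp[i] * d[i + 1];
}

int main() {
    const int n = 1 << 10;                       // 2^10 nodes per dimension (illustrative)
    const float r = 0.5f;                        // alpha * dt / dx^2, placeholder value
    std::vector<float> grid(static_cast<std::size_t>(n) * n, 0.0f);
    grid[(n / 2) * n + n / 2] = 1.0f;            // single hot node in the middle

    // Implicit in x, explicit in y; zero-valued sink boundaries drop out of each system.
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i) {
        std::vector<float> rhs(n - 2);
        for (int j = 1; j < n - 1; ++j) {
            int k = i * n + j;
            rhs[j - 1] = grid[k] + r * (grid[k - n] - 2.0f * grid[k] + grid[k + n]);
        }
        thomas(rhs, -r, 1.0f + 2.0f * r, -r);
        for (int j = 1; j < n - 1; ++j) grid[i * n + j] = rhs[j - 1];
    }
    std::printf("centre value after the x sweep: %g\n", grid[(n / 2) * n + n / 2]);
    return 0;
}
```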

2D Implicit:

Despite the nature of implicit calculations, it would seem that as long as 1333 MHz is avoided, results are fairly similar, with 1866 C8 being a surprise outlier.


Comments

  • HerrKaLeun - Saturday, September 28, 2013 - link

    This was a good review. But I see one major problem for practical applications:
    Whoever cares about performance doesn't use 8 GB of memory in the year 2013.
    Even for a cheap home build (no gaming, no CAD etc.) I used 16 GB a year ago, which cost only ~$70. When I run multiple applications in parallel (who doesn't?), W7/8 easily uses all memory for cache. Even with an SSD this is a speed advantage.

    So for real world applications (running virus scan in parallel to work, 18 browser windows, watching movies etc.) 8 GB are easily used up.

    I would imagine a 16 GB PC (let's say ~$100) runs circles around the $700 8 GB PC in the real world.

    Right now I run MSE and Malwarebytes while just using IE for browsing and I have none of my 16 GB left. The computer is not sluggish at all. I'm not sure how 8 GB RAM would work out.

    One could argue most applications don't require that much memory, but running a virus scan frequently should be done by all users.

    I think this test should be repeated with either 16 GB, or 24 GB for the triple-channel platform. People interested in a few % more also need more RAM.
  • Wwhat - Sunday, September 29, 2013 - link

    @HerrKaLeun you ask who doesn't use more than 8GB and say you got 16GB for about 70 dollars, but this article covers a lot of extremely highly specced RAM that, as stated, is quite expensive, and if you bought 8GB for several hundred dollars you aren't going to supplement it with cheap high-latency, low-speed off-the-shelf stuff, obviously.
  • malphadour - Sunday, September 29, 2013 - link

    HerrKaLeun you are talking rubbish!! I have an X58 running 6GB RAM and I never get anywhere near flooding it. 8GB is more than ample for 99% of users out there. I recently built a 16GB RAM rig for one of our engineers because he demanded it. To prove a point I benchmarked all our software (which includes a juicy construction CAD package) and recorded no more than a 3% performance increase going to 16GB, and I put most of that down to going from a single channel 8GB stick to dual channel for the 16GB. We tested render times, large drawing copies, plus program open and close times with every piece of software on the machine running. His argument was the same as yours, and incorrect. Hardware is way ahead of the curve at the moment vs software and it will be a while before the everyday user "needs" more than 8GB.
  • Wwhat - Monday, September 30, 2013 - link

    To be fair, I hear Battlefield 4 has a suggested setup of at least 8GB.
    Like always, the more RAM people have on average, the more software starts to require.
  • ShieTar - Monday, September 30, 2013 - link

    "So for real world applications (running virus scan in parallel to work, 18 browser windows, watching movies etc) 8 GB re easily used up."

    Because Windows will fill up all the memory it has before even starting any garbage collection algorithms. Even today, you should be able to do all those trivial applications on 2GB of memory.

    And anybody doing serious work or gaming will probably not run two major software packages at the same time. A few background programs (depending on how paranoid your company's IT department is), and a few trivial programs like browser, word processor, Excel, PDF may run on the side and use up 1GB to 2GB. But nobody in his right mind will start processing huge images in Photoshop while keeping his CAD models open in CATIA. A few nutjobs out there may run 16 installations of WoW on 16 screens with the same PC, but that's not really relevant to a general review.

    So if you go and have a look again at what is tested in this review, and once you understand that any reviewer worth his salary will not go and run a dozen pieces of software in parallel to the one piece of software he is benchmarking at that moment, it should be clear at the very least that repeating the above benchmarks with 16GB will give you absolutely no difference in the benchmark results whatsoever.
  • Chrispy_ - Sunday, September 29, 2013 - link

    So the three common scenarios are:
    :
    --- 1. You want an IGP ---
    Get the cheapest RAM. If you buy significantly better RAM, the cost of APU + RAM becomes more than the cost of a normal CPU + dGPU + cheap RAM, which obviously gives much higher performance.
    :
    --- 2. You want a single graphics card ---
    Spend the money you're *thinking* about spending on better RAM on a better graphics card. If you want a decent dGPU then you're most likely a gamer and even 1600MHz CL9 is fine, but you'll see a big improvement if you move from a $200 GTX660 to a $250 660Ti.
    :
    --- 3. You want more than one graphics card ---
    Divide RAM frequency by CAS latency to get the actual speed. I've been doing this for years and I'm glad Ian has finally mentioned this in an article.
  • ShieTar - Monday, September 30, 2013 - link

    I don't think anybody would disagree with the general direction of your comment, but you seem to overestimate the exact differences in cost for 8GB of RAM these days. A quick check (for Germany) gives me the following price differences for RAM frequency (relative to 1333):

    1600 : -0.50€ (No-Brainer)
    1866 : +1€
    2000 : +20€
    2133 : +10€
    2400 : + 8€
    2666 : +50€
    2933 : +170€

    So, for 8€ you can pick 2400 instead of 1600, which would give you a significant increase in performance should you ever find a piece of software that heavily depends on memory transfer rates. You are very unlikely to step up your CPU or GPU model for that kind of price difference.

    Latencies can be similar. For DDR3-1600, going from CL11 to CL9 will cost you about 2€ to 3€. Of course, at that point you still have a higher latency than DDR3-2400 with a CL11, so that seems to make the most sense right now for price to value ratio.
  • rootheday3 - Sunday, September 29, 2013 - link

    HD 4600 is likely not memory bottlenecked with 20 EUs at stock IGP frequencies. There is a reason that Intel didn't add the eDRAM to SKUs other than the 47W+ GT3e 40 EU SKUs, with their 4 samplers and 2 pixel back-ends. For a GT2 with half the assets, memory is not the issue - 1600MHz in dual channel is plenty. For people who were asking earlier in the thread, dual channel vs single channel is ~15-30% impact on GT2.

    If you want to see more sensitivity/scaling with memory, you would need to OC the IGP first.

    Or, as others said, test on SKUs that are more likely to stress memory - like GT3e (Iris Pro 5200). Note that HD 5000 (15W package TDP) and Iris 5100 (28W TDP) may be TDP bound on most workloads, so even there you may not see scaling with memory beyond ~1600-1866 dual channel.

    Note that Trinity/Richland are more sensitive to memory (especially on 65-100W desktop SKUs) because they don't have the LLC to buffer some of the bandwidth demands.
  • malphadour - Sunday, September 29, 2013 - link

    I have Mushkin 6-8-6-21 1600MHz which seems to be almost unique (I don't think I have seen anyone else make CL6 at this speed) - I would be interested to see if CL6 at 1600MHz was a match for much higher MHz.
  • malphadour - Sunday, September 29, 2013 - link

    I think the comment that 1600MHz is bad can be taken with a pinch of salt here. It depends on who the PC is for. If it is for normal use then 1600MHz CL9 is going to be fine all day long. Ian's point is, I think, aimed at the enthusiast who is benchmark chasing, in which case bigger is always better. It would be nice if the price of RAM had not doubled. I was buying 8GB 1600MHz CL9 for £29.99 not too long ago; for two recent builds it was £54.99, nearly twice the price in the UK :(
