One side I like to exploit on CPUs is the ability to compute, and whether a variety of mathematical workloads can stress the system in a way that real-world usage might not.  The benchmarks used here are ones developed for testing MP servers and workstation systems back in early 2013, such as grid solvers and Brownian motion code.  Please head over to the first of those reviews, where the mathematics and small snippets of code are available.

3D Movement Algorithm Test

The algorithms in 3DPM employ uniform or normal-distribution random number generation, and vary in the number of trigonometric operations, conditional statements, generation-and-rejection steps, fused operations, and so on.  The benchmark runs through six algorithms for a specified number of particles and steps, calculates the speed of each algorithm, then sums them all for a final score.  This is an example of a real-world situation that a computational scientist may find themselves in, rather than a purely synthetic benchmark.  The benchmark is parallel across the particles simulated, and we test both single-threaded and multi-threaded performance.  Results are expressed in millions of particles moved per second, and a higher number is better.
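
To give a flavour of the work involved, the sketch below shows what one 3DPM-style step might look like: each particle moves a fixed distance in a uniformly random direction on a sphere, which exercises the random number generator and the trigonometric units, and the loop over particles is the natural place to thread.  This is only an illustration with assumed names, not the benchmark's actual source.

// Hypothetical sketch of one 3DPM-style algorithm (not the benchmark source):
// move each particle a fixed step in a uniformly random direction on the unit
// sphere. Parallelism is across particles, as in the real test.
#include <cmath>
#include <omp.h>
#include <random>
#include <vector>

struct Particle { double x = 0.0, y = 0.0, z = 0.0; };

void move_particles(std::vector<Particle>& particles, int steps, double step_len) {
    const double pi = 3.14159265358979323846;
    #pragma omp parallel
    {
        // Per-thread RNG so threads do not share generator state.
        std::mt19937_64 rng(1234u + omp_get_thread_num());
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        #pragma omp for
        for (long i = 0; i < static_cast<long>(particles.size()); ++i) {
            Particle& p = particles[i];
            for (int s = 0; s < steps; ++s) {
                // Uniform direction on a sphere: cos(theta) uniform in [-1, 1],
                // phi uniform in [0, 2*pi).
                double cos_t = 2.0 * uni(rng) - 1.0;
                double sin_t = std::sqrt(1.0 - cos_t * cos_t);
                double phi   = 2.0 * pi * uni(rng);
                p.x += step_len * sin_t * std::cos(phi);
                p.y += step_len * sin_t * std::sin(phi);
                p.z += step_len * cos_t;
            }
        }
    }
}

A score in millions of particle moves per second then follows from the particle count times the step count, divided by the wall-clock time of the call.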

Single threaded results:

For software that deals with one particle movement at a time and then discards it, there are very few memory accesses that go beyond the caches into main DRAM.  As a result, we see little differentiation between the memory kits, except perhaps a loose automatic setting at 3000 C12 causing a small decline.

Multi-Threaded:

With all the cores loaded, the caches should be more stressed with data to hold, although in the 3DPM-MT test we see less than a 2% difference in the results and no correlation that would suggest a consistent trend in either direction.

N-Body Simulation

When a series of heavy mass elements are in space, they interact with each other through the force of gravity.  Thus when a star cluster forms, the interaction of every large mass with every other large mass defines the speed at which these elements approach each other.  When dealing with millions and billions of stars on such a large scale, the movement of each of these stars can be simulated through the physical theorems that describe the interactions.  The benchmark detects whether the processor is SSE2 or SSE4 capable, and implements the relevant code path.  We run a simulation of 10,240 particles of equal mass - the output for this code is in terms of GFLOPs, and the result recorded is the peak GFLOPs value.
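
As an illustration of why the working set stays small, the core of an all-pairs gravity step looks roughly like the scalar sketch below (assumed names; the benchmark itself dispatches to SSE2 or SSE4 code paths).  Every body only needs the positions and masses of the other bodies, which for 10,240 particles fits comfortably in cache.

// Simplified all-pairs gravity kernel (scalar sketch, not the benchmark's
// vectorised source). The acceleration on each body is accumulated from every
// other body, so the work per time step scales as O(N^2).
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { float x, y, z, mass; float ax, ay, az; };

void accumulate_accelerations(std::vector<Body>& bodies, float softening2) {
    const float G = 6.674e-11f;  // gravitational constant
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(bodies.size()); ++i) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (j == static_cast<std::size_t>(i)) continue;
            float dx = bodies[j].x - bodies[i].x;
            float dy = bodies[j].y - bodies[i].y;
            float dz = bodies[j].z - bodies[i].z;
            // Softening term avoids a divide-by-zero for coincident bodies.
            float r2    = dx * dx + dy * dy + dz * dz + softening2;
            float inv_r = 1.0f / std::sqrt(r2);
            float s     = G * bodies[j].mass * inv_r * inv_r * inv_r;
            ax += s * dx; ay += s * dy; az += s * dz;
        }
        bodies[i].ax = ax; bodies[i].ay = ay; bodies[i].az = az;
    }
}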

Despite the co-interaction of many particles, the fact that a simulation of this scale can hold them all in the caches between time steps means that memory speed has little effect on the simulation.

Grid Solvers - Explicit Finite Difference

For any grid of regular nodes, the simplest way to calculate the next time step is to use the values of the nodes around it.  This makes for easy mathematics and parallel simulation, as each node calculated depends only on the previous time step, not on the nodes around it in the current time step.  By choosing a regular grid, we avoid the extra levels of memory access that irregular grids require.  We test both 2D and 3D explicit finite difference simulations with 2^n nodes in each dimension, using OpenMP for threading, in single precision.  The grid is isotropic and the boundary conditions are sinks.  We iterate through a series of grid sizes, and results are shown in terms of ‘million nodes per second’ where the peak value is given in the results – higher is better.
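
As a sketch of what the 2D explicit update looks like (illustrative only, with assumed names rather than the benchmark source), each new node value is formed from its four neighbours at the previous time step, and the row loop threads cleanly with OpenMP:

// Explicit finite difference update for a 2D isotropic grid. 'old_grid' holds
// time step n and 'new_grid' receives time step n+1. Boundary nodes act as
// sinks and are simply left untouched here. alpha = D * dt / dx^2 must respect
// the explicit stability limit (alpha <= 0.25 in 2D).
#include <vector>

void explicit_step_2d(const std::vector<float>& old_grid,
                      std::vector<float>& new_grid,
                      int nx, int ny, float alpha) {
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            int idx = j * nx + i;
            new_grid[idx] = old_grid[idx] + alpha *
                (old_grid[idx - 1] + old_grid[idx + 1] +
                 old_grid[idx - nx] + old_grid[idx + nx] -
                 4.0f * old_grid[idx]);
        }
    }
}

In the 3D version two more neighbours at idx ± nx*ny join the stencil; those large strides are what push accesses out of the caches in the results below.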

Two-Dimensional Grid:

In 2D we get a small bump at 1600 C9 in terms of calculation speed, with all other results being fairly equal.  Statistically this would be an outlier, although the result seemed repeatable.

Three Dimensions:

In three dimensions, the memory jumps required to access new rows of the simulation are far greater, resulting in L3 cache misses and accesses into main memory when the simulation is large enough.  At this boundary it seems that low CAS latencies work well, as do memory speeds above 2400 MHz; 2400 C12 is a surprising result.

Grid Solvers - Implicit Finite Difference + Alternating Direction Implicit Method

The implicit method takes a different approach to the explicit method – instead of considering one unknown in the new time step to be calculated from known elements in the previous time step, we consider that an old point can influence several new points by way of simultaneous equations.  This adds to the complexity of the simulation – the grid of nodes is solved as a series of rows and columns rather than points, reducing the parallel nature of the simulation by a dimension and drastically increasing the memory requirements of each thread.  The upside, as noted above, is less stringent stability rules relating to time steps and grid spacing.  For this we simulate a 2D grid of 2^n nodes in each dimension, using OpenMP in single precision.  Again our grid is isotropic with the boundaries acting as sinks.  We iterate through a series of grid sizes, and results are shown in terms of ‘million nodes per second’ where the peak value is given in the results – higher is better.
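
To show where the extra per-thread memory goes (an illustrative sketch with assumed names, not the benchmark source), each row in an ADI half-step is a tridiagonal system, typically solved with the Thomas algorithm, which needs scratch arrays the length of the row:

// Thomas algorithm for one tridiagonal row system, as used in an ADI
// half-step. a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand
// side; the solution overwrites d. The scratch vectors are what raise each
// thread's memory footprint compared with the explicit solver.
#include <vector>

void thomas_solve(const std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, std::vector<float>& d) {
    const int n = static_cast<int>(d.size());
    std::vector<float> cp(n), dp(n);      // per-row scratch storage

    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {         // forward elimination
        float m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }
    d[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; --i)      // back substitution
        d[i] = dp[i] - cp[i] * d[i + 1];
}

In an ADI sweep this solve runs once per row (then once per column in the second half-step), so the parallelism is over rows rather than individual nodes.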

2D Implicit:

Despite the nature of implicit calculations, it would seem that as long as 1333 MHz is avoided, results are fairly similar, with 1866 C8 being a surprise outlier.

Comments

  • Rob94hawk - Friday, September 27, 2013 - link

    Avoid DDR3 1600 and spend more for that 1 extra fps? No thanks. I'll stick with my DDR3 1600 @ 9-9-9-24 and I'll keep my Haswell overclocked at 4.7 GHz, which is giving me more fps.
  • Wwhat - Friday, September 27, 2013 - link

    I have RAM that has an XMP profile, but I did NOT enable it in the BIOS. The reason is that it will run faster but jumps to 2T and goes up to 1.65 V from the default 1.5 V, apart from the other latencies going up, of course.
    Now 2T is known to not be a great plan if you can avoid it.
    So instead I simply tweak the settings to my own needs, because unlike this article's suggestion, you can (and overclockers will) do it manually instead of only having the SPD or XMP options.
    The difference is that you need to do some testing to see what is stable, which can be quite different from the advised values in the settings chip.
    So it's silly to ridicule people for not being some uninformed type with no idea except allowing the SPD/XMP to tell them what to do.
  • Hrel - Friday, September 27, 2013 - link

    Not done yet, but so far it seems 1866 CL 9 is the sweet spot for bang/buck.

    I'd also like to add that I absolutely LOVE that you guys do this kind of in-depth analysis. Remember when one of you did the PSU review? Actually going over how much the motherboard pulled at idle and load, same for memory on a per-DIMM basis, CPU, everything, hdd, add-in cards. I still have the specs saved for reference. That info is getting pretty old though, things have changed quite a bit since back then; when the northbridge was still on the motherboard :P

    Hint Hint ;)
  • repoman27 - Friday, September 27, 2013 - link

    Ian, any chance you could post the sub-timings you ended up using for each of the tested speeds?

    If you're looking at mostly sequential workloads, then CL is indicative of overall latency, but once the workloads become more random / less sequential, tRCD and tRP start to play a much larger role. If what you list as 2933 CL12 is using 12-14-14, then page-empty or page-miss accesses are going to look a lot more like CL13 or CL14 in terms of actual ns spent servicing the requests.

    Also, was CMD consistent throughout the tests, or are some timings using 1T and others 2T?

    There's a lot of good data in this article, but I constantly struggle with seeing the correlation between real world performance, memory bandwidth, and memory latency. I get the feeling that most scenarios are not bound by bandwidth alone, and that reducing the latency and improving the consistency of random accesses pays bigger dividends once you're above a certain bandwidth threshold. I also made the following chart, somewhat along the lines of those in the article, in order to better visualize what the various CAS latencies look like at different frequencies: http://i.imgur.com/lPveITx.png Of course real world tests don't follow the simple curves of my chart because the latency penalties of various types of accesses are not dictated solely by CL, and enthusiast memory kits are rarely set to timings such as n-n-n-3*n-1T where the latency would scale more consistently.
  • Wwhat - Sunday, September 29, 2013 - link

    Good comment I must say, and interesting chart.
  • Peroxyde - Friday, September 27, 2013 - link

    "#2 Number of sticks of memory"
    Can you please clarify? What should be that number? The highest possible? For example, to get 16GB, what is the best sticks combination to recommend? Thanks for any help.
  • erple2 - Sunday, September 29, 2013 - link

    I think that if you have a dual channel memory controller and have a single dimm, then you should fill up the controller with a second memory chip first.
  • malphadour - Sunday, September 29, 2013 - link

    Peroxyde, Haswell uses a dual channel controller, so in theory (and in some benchmarks I have seen) 2 sticks of 8GB RAM would give the same performance as 4 sticks of 4GB RAM. So go with the 2 sticks, as this allows you to fit more RAM in the future should you want to, without having to throw away old sticks. You could also get one 16GB stick of RAM, and benchmarks I have seen suggest that there is only about a 5% decrease in performance, though for the tiny saving in cost you might as well go dual channel.
  • lemonadesoda - Saturday, September 28, 2013 - link

    I'm reading the benchmarks, and what I see is that in 99% of tests the gains are technical and only measurable in the third significant digit. That means they make no practical, noticeable difference. The money is better spent on a different part of the system.
  • faster - Saturday, September 28, 2013 - link

    This is a great article. This is valuable, useful, and practical information for the system builders on this site. Thank you!
