One of the touted benefits of Haswell is the compute capability afforded by the IGP.  For anyone using DirectCompute or C++ AMP, the compute units of the HD 4600 can be exploited as easily as those of any discrete GPU, although efficiency might come into question.  As some of the benchmarks below show, parts of our computational software run faster on the IGP than on the CPU (particularly in the highly multithreaded scenarios).

Grid Solvers - Explicit Finite Difference on IGP

As before, we test both 2D and 3D explicit finite difference simulations with 2^n nodes in each dimension, using OpenMP as the threading operator in single precision.  The grid is isotropic and the boundary conditions are sinks.  We iterate through a series of grid sizes, and results are shown in terms of ‘million nodes per second’ where the peak value is given in the results – higher is better.
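
For readers unfamiliar with the technique, one explicit 2D update might be sketched as follows. This is not the benchmark code: it assumes a diffusion-style five-point stencil, and the names `step_2d`, `make_grid`, `million_nodes_per_second` and the `alpha` coefficient are hypothetical, chosen only for illustration.

```python
def make_grid(exponent):
    """Build a square grid with 2**exponent nodes per side and one hot node in the centre."""
    n = 2 ** exponent
    grid = [[0.0] * n for _ in range(n)]
    grid[n // 2][n // 2] = 1.0
    return grid


def step_2d(grid, alpha=0.1):
    """Apply one explicit finite-difference update (assumed five-point stencil).

    Boundary nodes act as sinks: they are held at zero and never updated.
    alpha is an assumed diffusion number; it must stay <= 0.25 for stability in 2D.
    """
    n = len(grid)
    new = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            neighbours = (grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1])
            new[i][j] = grid[i][j] + alpha * (neighbours - 4.0 * grid[i][j])
    return new


def million_nodes_per_second(nodes, steps, seconds):
    """Convert a timed run into the 'million nodes per second' metric."""
    return nodes * steps / seconds / 1e6
```

Each interior node only reads its four neighbours from the previous step, so every node can be updated independently, which is what makes this kind of solver a natural fit for many parallel threads.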

Two Dimensional:

The results on the IGP are 50% higher than those on the CPU, and it would seem that memory can make a difference as well.  As long as 1333 MHz is not chosen, there is at least a 2% gain to be had.  Otherwise, the next jump up is at 2666 MHz for another 2%, which might not be cost effective.

Three Dimensional:

The 3D results seem to be a little haphazard, with 1333 C7 and 2400 C9 both performing well.  1600 C11 is definitely out of the running, although anything at 2400 MHz or above affords a benefit of roughly 10%.

N-Body Simulation on IGP

As with the CPU compute, we run a simulation of 10240 particles of equal mass - the output for this code is in terms of GFLOPs, and the result recorded was the peak GFLOPs value.
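
As a rough illustration of how such a test turns into a GFLOPs figure, the sketch below computes all-pairs gravitational accelerations and converts a timed run into GFLOPs. It is not the benchmark code: the function names, the softening term and the figure of 20 floating point operations per interaction (a common counting convention) are all assumptions.

```python
import math


def accelerations(positions, mass=1.0, softening=1e-3):
    """Direct all-pairs accelerations for equal-mass particles (G taken as 1)."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            # The softening term avoids a division by zero for coincident particles.
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            inv = mass / (dist_sq * math.sqrt(dist_sq))
            acc[i][0] += dx * inv
            acc[i][1] += dy * inv
            acc[i][2] += dz * inv
    return acc


def gflops(n_bodies, steps, seconds, flops_per_interaction=20):
    """Estimate GFLOPs from the interaction count; 20 flops each is an assumed convention."""
    return n_bodies * n_bodies * flops_per_interaction * steps / seconds / 1e9
```

With 10240 particles there are roughly 105 million pairwise interactions per step, all computed from a small set of positions, which is why a workload like this tends to be bound by arithmetic rather than by memory.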

For a workload measured in FLOPs, performance does not seem to be affected by memory speed.

3D Particle Movement on IGP

Similar to our CPU Compute algorithm, we calculate the random motion in 3D of free particles involving random number generation and trigonometric functions.  For this application we take the fastest true-3D motion algorithm and test a variety of particle densities to find the peak movement speed.  Results are given in ‘million particle movements calculated per second’, and a higher number is better.
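
To give a feel for what a single particle movement involves, here is a minimal sketch of one way to pick a uniformly random 3D direction using random numbers and trigonometric functions. It is not one of the benchmark's own movement algorithms, and the names `random_unit_step` and `move_particle` are hypothetical.

```python
import math
import random


def random_unit_step(rng):
    """Return a unit-length step in a uniformly random 3D direction."""
    z = 2.0 * rng.random() - 1.0           # height on the unit sphere, in [-1, 1)
    theta = 2.0 * math.pi * rng.random()   # angle around the z axis
    r = math.sqrt(1.0 - z * z)             # radius of the circle at that height
    return r * math.cos(theta), r * math.sin(theta), z


def move_particle(steps, rng):
    """Move one particle from the origin for a fixed number of random unit steps."""
    x = y = z = 0.0
    for _ in range(steps):
        dx, dy, dz = random_unit_step(rng)
        x += dx
        y += dy
        z += dz
    return x, y, z
```

Each particle carries only a few values of state and never interacts with the others, so the work per particle is almost entirely arithmetic.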

Despite this result being over 35x the equivalent calculation on a fully multithreaded 4770K CPU (200 vs. 7000), again there seems little difference between memory speeds.  3000 C12 gets a small peak over the rest, similar to the n-Body test.

Matrix Multiplication on IGP

Matrix Multiplication occurs in a number of mathematical models, and is typically designed to avoid memory accesses where possible and optimize for a number of reads and writes depending on the registers available to each thread or batch of dispatched threads.  Here we have a crude MatMul implementation, and iterate through a variety of matrix sizes to find the peak speed.  Results are given in terms of ‘million nodes per second’ and a higher number is better.
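
As an illustration of what 'crude' might mean here, the sketch below is the textbook triple loop with no tiling or register blocking. It is an assumption about the general shape of such a kernel, not the code used in the benchmark, and the name `matmul` is hypothetical.

```python
def matmul(a, b):
    """Naive matrix multiply: every output element re-reads a full row of a and column of b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for k in range(inner):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19.0, 22.0], [43.0, 50.0]]`. An optimized version would keep sub-blocks of the matrices in registers or local memory to cut down on repeated reads.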

Matrix Multiplication on this scale seems to vary little between memory settings, although a shift towards the lower CL timings gives a marginally better result, albeit one that is statistically minor.

3D Particle Movement on IGP

Similar to our 3DPM Multithreaded test, except we run the fastest of our six movement algorithms with several million threads, each moving a particle in a random direction for a fixed number of steps.  Final results are given in million movements per second, and a higher number is better.

While there is a slight dip using 1333 C9, in general almost all of our memory timing settings perform roughly the same.  The peak shown using our memory kit at its XMP rated timings is presumably due more to the adjustments in BCLK which need to be made in order to hit this memory frequency.

89 Comments

  • ShieTar - Friday, September 27, 2013 - link

    I think you would have to propose a software benchmark which benefits from actually running from a Ramdisk. Testing the RD itself with some kind of synthetic HD-Benchmark will not give you much different results than a synthetic memory benchmark, unless the software implementation is rubbish.

    So if you want to see this happen, I suggest you explain to everybody what kind of software you use in combination with your Ramdisk, and why it benefits from it. And hope that this software is sufficiently relevant to get a large number of people interested in this kind of benchmark.
  • ShieTar - Friday, September 27, 2013 - link

    Two comments on the "Performance Index" used in this article:

    1. It is calculated as the inverse of the actual access latency (in nanoseconds). Using the inverse of a physically meaningful number will always make the relationship exhibit much more of a "diminishing return" than when using the physical attribute directly.

    2. As no algorithm should care directly about the latency, but rather about the combined time to get the full data set it requested, it would be interesting to understand the typical size of the data sets that the benchmarks request. If your software is randomly picking single bytes from the memory, you expect performance to only depend on the latency. On the other hand, if the software is reading complete rows (512 bytes), the bandwidth becomes more relevant than the latency.

    Of course figuring out the best performance metric for any kind of review can take a lot of time and effort. But when you do a review generating this large amount of data anyways, would it be possible to make the raw data available to the readers, so they can try to get their own understanding on the matter?
  • Death666Angel - Friday, September 27, 2013 - link

    First of all, great article and really good chart layout, very easy to read! :D
    But one thing seems strange, the WinRAR 3.93 test, 2800MHz/C12 performs better than 2800MHz/C11, but you call out ...C11 in the text as performing well, even though anyone can increase their latencies without incurring stability issues (that's my experience at least). Switched numbers? :)
  • willis936 - Friday, September 27, 2013 - link

    I too thought this was strange. You could see higher latencies clock for clock performing better which doesn't seem intuitive. I couldn't work out why those results were the way they were.
  • ShieTar - Friday, September 27, 2013 - link

    In reality, there really should be no reason why a longer latency should increase performance (unless you are programming some real-time code which depends on algorithm synchronization). Therefore it seems safe to interpret the difference as the measurement noise of this specific benchmark.
  • Urbanos - Friday, September 27, 2013 - link

    excellent article! i was waiting for one of these! great work, masterful :)
  • jaydee - Friday, September 27, 2013 - link

    Great work, I'd like to see a future article look at single-channel vs dual-channel RAM in laptops/mITX/NUC configurations. With only two SO-DIMM slots, people really have to evaluate whether they want to fill both slots, gaining dual-channel operation but knowing they'd have to replace both modules to upgrade, or go with a single SO-DIMM, losing dual channel but having an easier memory upgrade path down the road.

    Thanks and great work!
  • Hrel - Friday, September 27, 2013 - link

    How do you get such nice screenshots of the BIOS? They look much nicer than when people just use a camera so what did you use to take those screenshots?
  • merikafyeah - Friday, September 27, 2013 - link

    Probably used a video capture card. These are also used to objectively evaluate GPU frame-pacing in a way that software like FRAPS cannot.
  • Rob94hawk - Saturday, September 28, 2013 - link

    Modern BIOSes allow you to upload screenshots to USB. My MSI Z87 Gaming does it. No more picture taking. It's a great feature long overdue!
