As part of this review, we also ran our normal motherboard benchmarks.

WinRAR x64 3.93 - link

With 64-bit WinRAR, we compress the set of files used in the USB speed tests. WinRAR x64 3.93 attempts to use multithreading when possible, and provides as a good test for when a system has variable threaded load.  If a system has multiple speeds to invoke at different loading, the switching between those speeds will determine how well the system will do.

WinRar x64 3.93

WinRAR is another example where enabling HyperThreading is actually hurting the throughput of the system. But even with all 32 threads in the system, the lack of memory speed hurts the benchmark.

FastStone Image Viewer 4.2 - link

FastStone Image Viewer is a free piece of software I have been using for quite a few years now.  It allows quick viewing of flat images, as well as resizing, changing color depth, adding simple text or simple filters.  It also has a bulk image conversion tool, which we use here.  The software currently operates only in single-thread mode, which should change in later versions of the software.  For this test, we convert a series of 170 files, of various resolutions, dimensions and types (of a total size of 163MB), all to the .gif format of 640x480 dimensions.

FastStone Image Viewer 4.2

FastStone is relatively unaffected due to the single-threaded nature of the program.

Xilisoft Video Converter

With XVC, users can convert any type of normal video to any compatible format for smartphones, tablets and other devices.  By default, it uses all available threads on the system, and in the presence of appropriate graphics cards, can utilize CUDA for NVIDIA GPUs as well as AMD APP for AMD GPUs.  For this test, we use a set of 33 HD videos, each lasting 30 seconds, and convert them from 1080p to an iPod H.264 video format using just the CPU.  The time taken to convert these videos gives us our result.

Xilisoft Video Converter 7

With XVC having many threads is what wins the day, and having HT enabled made the process very fast indeed.  With HT on, we have 32 threads, meaning most of the videos were actually converted very quickly – the final 33rd video caused an extra delay at the end.  This is yet another example of an algorithm that can be ported to GPUs, as XVC offers both an AMD and NVIDIA option for conversion.

x264 HD Benchmark

The x264 HD Benchmark uses a common HD encoding tool to process an HD MPEG2 source at 1280x720 at 3963 Kbps.  This test represents a standardized result which can be compared across other reviews, and is dependant on both CPU power and memory speed.  The benchmark performs a 2-pass encode, and the results shown are the average of each pass performed four times.

x264 HD Pass 1

x264 HD Pass 2

In contrast to XVC, which splits its threads across many files, the x264 HD benchmark splits threads across one file.  As a result it seems that having HT off gives a subtle 13.7% boost in performance in the first pass and 11.0% boost in the second pass.  The results of the first pass makes the second pass a lot more efficient across all the threads due to fewer memory accesses.

n-Body Simulations System Benchmarks
Comments Locked

64 Comments

View All Comments

  • toyotabedzrock - Saturday, January 5, 2013 - link

    There is a large number of very smart people on Google+. You really should come join us.
  • JlHADJOE - Tuesday, January 8, 2013 - link

    Of course there are lots of smart people on G+! You're all google employees right? =P
  • Activate: AMD - Saturday, January 5, 2013 - link

    As a fellow chemist, I must say that you have to be some kind of nut to want to do computational/physical chemistry. If you need me, I'll be at the bench!

    Good article too!
  • engrpiman - Saturday, January 5, 2013 - link

    I didn't read the article in full but what I did read was top notch. I found your simulations and mathematics very interesting. I took a Physics class which was focused in writing code to run mathematical simulations . Using the given java lib. I wrote my own code to calculate PI. When I returned from the gym the program had calculated 3.1 . I then re-wrote the program from scratch and ditched the built in libs. and reran. I had 20 decimals in 30 sec it was an epic improvement.

    All in all I think your article could be very useful to me.

    Thanks for writing.
  • SodaAnt - Saturday, January 5, 2013 - link

    THIS is why I read Anandtech. I'll admit that I wasn't quite in the mood to read all the equations (I'll have to do that later), but really, these kind of reviews make my day.
  • Cardio - Saturday, January 5, 2013 - link

    Wonderful review...as always. Thanks
  • Hakon - Saturday, January 5, 2013 - link

    Hi Ian. Thanks for the nice article. I have one suggestion regarding the explicit finite difference code:

    You could try to reorder the loops such that the memory access is more cache friendly. Right now 'pos' is incremented by NX (or even NX*NY in 3D) which will generate a lot cache misses for large grids. If you switch the x and the y loop (in the 2D case) this can be avoided.
  • IanCutress - Saturday, January 5, 2013 - link

    Either way I order the loops, each point has to read one up, one down, one left and one right. My current code tries to keep three as consecutive reads and jump once, keeping the old jump in local memory. If I adjusted the loops, I could keep the one dimension in local memory, but I'd have to jump outside twice (both likely cache misses) to get the other data. I couldn't cache those two values as I never use them again in the loop iteration.

    When I did this code on the GPU, one method was to load an XY block into memory and iterate in the Z-dimension, meaning that each thread per loop iteration only loaded one element, with a few of them loading another for the halo, but all cache aligned.

    I hope that makes sense :)
    Ian
  • Hakon - Saturday, January 5, 2013 - link

    Yes, but when you access the array 'cA' at 'pos' the CPU will fetch the entire cache line (64 byte in case of your machine, i.e. 16 floats) of the corresponding memory address into the CPU cache. That means that subsequent accesses to say 'pos + 1', 'pos + 2' and so on will be served by the cache. Accessing an array in such a sequential manner is therefore fast.

    However, when you access an array in a nasty way, e.g. 'NX + x' -> '2*NX + x', -> '3*NX + x', then each such access implies a trip to main memory if NX happens to be large.

    That you need to move up / down and sideways in memory does not matter. When you write down the accesses of the code with the reordered loops you will notice that they just access three "lines" in memory in a cache friendly way.

    Not reusing the old values of the last iteration should not affect performance in a measurable way. Even if the compiler fails to see this optimization, the accesses will be served by the L1 cache.

    Btw, did you allocate the array having NUMA in mind, i.e. did you initialize your memory in an OpenMP loop with the same access pattern as used in the algorithms? I am a bit surprised by the bad performance of your dual Xeon system.
  • IanCutress - Saturday, January 5, 2013 - link

    Memory was allocated via the new command as it is 1D. When using a 2D array the program was much slower. I was unaware you could allocate memory in an OpenMP way, which thinking about it could make the 2D array quicker. I also tried writing the code using the PPL and lambdas, but that was also slower than a simple OpenMP loop.

    I'm coming at these algorithms from the point of view of a non-CompSci interested in hardware, and the others in the research group were chemists content to write single threaded code on multi-core machines. Transferring the OpenMP variations of that code from a 1P to a 2P, as the results show, give variable results depending on the algorithm.

    There are always ways to improve the efficiency of the code (and many ways to make it unreadable), but for a large part moving to the 2P system all depends on how your code performs. Please understand that my examples being within my limits of knowledge and representative of the research I did :) I know that SSE2/SSE4/AVX would probably help, but I have never looked into those. More often than not, these environments are all about research throughput, so rather than spend a few week to improve efficiency by 10% (or less), they'd rather spend that money getting a faster system which theoretically increases the same code throughput 100%.

    I'll have a look at switching the loops if I write an article similar to this in the future :)

    Ian

Log in

Don't have an account? Sign up now