Investigating the Opteron Performance Mystery

What really surprised us was the Opteron's abysmal performance in Stars Euler3D CFD. We did not believe the results and repeated the benchmark at least 10 times on the quad Opteron system. Let us delve a little deeper.

Notice that the Intel Xeons scale very well until the number of threads exceeds the physical core count: the performance of the 40-core E7-4870 setup only drops once we use 48 threads (with HT off). The Opteron, however, scales reasonably well only from 1 to 6 threads. Between 6 and 12 threads scaling is very mediocre, but at least performance still increases; from there on, the performance curve is essentially flat.

The Opteron Performance Remedy?

We contacted Charles of Caselab with our results, and he gave us a few clues:

1. The Euler3D CFD solver uses an unstructured grid (it has a spider-web appearance, with fluid states stored at the segment endpoints). Thus, adjacent physical locations do not (and cannot!) map to adjacent memory locations.

2. The memory performance benchmark relevant to Euler3D appears to be the random memory recall rate and NOT the adjacent-memory-sweep bandwidth.

3. Typical memory tests (e.g. Stream) are sequential, "block"-based tests. Euler3D effectively tests random access memory performance.

So sequential bandwidth is not the answer. In fact, in most "Stream-ish" benchmarks (including our own compiled binaries), the quad Opteron gets close to 100GB/s while the quad Xeon E7 only reaches between 37 and 55GB/s; so far only the Intel-compiled Stream binaries have been able to exceed 55GB/s. In short, we have a piece of FP-intensive software that performs a lot of random memory accesses.
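
To make the distinction concrete, the two access patterns can be sketched in a few lines of C. This is purely our own illustration, not code from Euler3D; the array size, the scrambled index array and the timing harness are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>       /* clock_gettime; may need -lrt on older glibc */

    #define N (64 * 1024 * 1024)   /* ~512MB of doubles per array, far beyond the caches */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double *a   = malloc((size_t)N * sizeof(double));
        double *b   = malloc((size_t)N * sizeof(double));
        size_t *idx = malloc((size_t)N * sizeof(size_t));
        if (!a || !b || !idx) return 1;

        for (size_t i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
            /* scrambled indices: neighbors in space are not neighbors in memory */
            idx[i] = (((size_t)rand() << 16) | (size_t)rand()) % N;
        }

        /* 1) sequential sweep: the pattern Stream-style benchmarks reward */
        double t0 = seconds(), sum1 = 0.0;
        for (size_t i = 0; i < N; i++)
            sum1 += a[i] * b[i];
        double t1 = seconds();

        /* 2) indexed gather: closer to what an unstructured-grid solver does */
        double sum2 = 0.0;
        for (size_t i = 0; i < N; i++)
            sum2 += a[idx[i]] * b[idx[i]];
        double t2 = seconds();

        printf("sequential %.2fs, gather %.2fs (%.1fx slower), checksums %g %g\n",
               t1 - t0, t2 - t1, (t2 - t1) / (t1 - t0), sum1, sum2);
        return 0;
    }

Compiled with a plain gcc -O2, the first loop measures the kind of bandwidth Stream reports, while the gather loop typically runs at a small fraction of that number; how small depends entirely on the memory subsystem, and that is exactly the property an unstructured-grid solver hammers.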

On the Opteron, performance starts to slow down once we use more than 12 threads. With 24 threads, let alone 48, the application spawns more threads than there are cores in the local socket, which means that remote memory accesses cannot be avoided. Could it be that performance is completely limited by the threads that have to go the furthest (two hops)? In other words, the threads working on local memory finish much faster, but the whole test cannot complete until the slowest threads (those working on remote memory) are done.
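
If that is what happens, wall-clock time is dictated by the unluckiest threads rather than by the average thread. Below is a minimal OpenMP sketch of how one could check for such an imbalance; the array size, the pass count and the deliberately single-node first-touch placement are our own assumptions, not how Euler3D allocates its data:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (64 * 1024 * 1024)   /* one shared array, ~512MB of doubles */

    int main(void)
    {
        int maxt = omp_get_max_threads();
        double *per_thread = calloc(maxt, sizeof(double));
        int nthreads = 0;

        /* The master thread touches every page first; with the usual first-touch
           policy the whole array ends up on the master's NUMA node, so threads on
           other nodes are forced into remote (1- or 2-hop) accesses. */
        double *data = malloc((size_t)N * sizeof(double));
        if (!data || !per_thread) return 1;
        for (size_t i = 0; i < N; i++)
            data[i] = 1.0;

        double grand = 0.0;
        double wall0 = omp_get_wtime();
        #pragma omp parallel reduction(+:grand)
        {
            #pragma omp single
            nthreads = omp_get_num_threads();

            int tid = omp_get_thread_num();
            double t0 = omp_get_wtime(), sum = 0.0;
            for (int pass = 0; pass < 16; pass++) {
                /* nowait: each thread runs its static chunk flat out and records
                   its own time instead of waiting at a barrier after every pass */
                #pragma omp for schedule(static) nowait
                for (size_t i = 0; i < N; i++)
                    sum += data[i] * data[i];
            }
            per_thread[tid] = omp_get_wtime() - t0;
            grand += sum;
        }
        double wall = omp_get_wtime() - wall0;

        double worst = 0.0, avg = 0.0;
        for (int i = 0; i < nthreads; i++) {
            if (per_thread[i] > worst) worst = per_thread[i];
            avg += per_thread[i] / nthreads;
        }
        printf("%d threads: avg %.3fs, worst %.3fs, wall clock %.3fs (checksum %g)\n",
               nthreads, avg, worst, wall, grand);
        free(per_thread);
        free(data);
        return 0;
    }

If the gap between the average and the worst per-thread time widens as the thread count grows while the wall-clock time tracks the worst thread, that is the signature of the kind of NUMA imbalance we suspect here.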

We decided to enable "Node Interleaving" in the BIOS of our Dell R815, which stripes the data across all four memory controllers. Interleaved accesses are slower than local-only accesses because three out of four operations have to traverse an HT link, but all threads should now experience more or less the same latency: we avoid the worst-case scenario where a few threads are stuck with 2-hop latency. Let us see if that helped.
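
As an aside, the same striping can be approximated in software on a Linux system, either for a whole process with numactl --interleave=all or per allocation via libnuma. A minimal sketch (the buffer size is a placeholder and error handling is trimmed):

    #include <stdio.h>
    #include <numa.h>        /* libnuma; link with -lnuma */

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t size = 1UL << 30;   /* 1GB, placeholder */

        /* Pages are striped round-robin across all NUMA nodes: the software
           equivalent of the BIOS "Node Interleaving" switch, trading the best
           case (all-local) for a uniform, averaged latency. */
        double *buf = numa_alloc_interleaved(size);
        if (!buf) return 1;

        /* ... run the memory-hungry kernel on buf here ... */

        numa_free(buf, size);
        return 0;
    }

Running an unmodified binary under numactl --interleave=all achieves the same thing for all of its allocations without touching the code; the BIOS switch simply applies the policy machine-wide.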

 

Comments

  • jaguarpp - Friday, September 30, 2011 - link

    What if, instead of using a full program, you create a small test program that is compiled for each platform? Something like:
    declare variables (ints, floats, arrays) to test different workloads,
    put the variables in loops, do some operations (sum, div) on the integers, then the floats, and so on, and measure the time it takes to exit each block.
    The hardest part will be making it threadable
    and getting access to different compilers; maybe a friend can help?
    Anyway, great article, I really enjoyed it even though I never get close to that class of hardware.
    Thanks very much for the read.
  • Michael REMY - Friday, September 30, 2011 - link

    Very interesting analysis, but... why use the score in Cinebench instead of a render time?

    Time results are more meaningful to common and pro users than an integer score!
  • MrSpadge - Friday, September 30, 2011 - link

    Because time is totally dependent on the complexity of your scene, output resolution etc. And the score can be directly translated into time if you know the time for any of the configurations tested.

    MrS
  • Casper42 - Friday, September 30, 2011 - link

    Go back to Quanta and see if they have a newer BIOS with the Core Disable feature properly implemented. I know the big boys are now implementing the feature, and it allows you to disable as many cores as you want, as long as it's done in pairs. So your 10c proc can be turned into a 2/4/6/8 core version as well.

    So for your first test, where you had to turn HT off because 80 threads was too much, you could instead turn off 2 cores per proc and synthetically create a 4p32c server, and then leave HT on for the full 64 threads.
  • alpha754293 - Sunday, October 2, 2011 - link

    "Hyper-Threading offers better resource utilization but that does not negate the negative performance effect of the overhead of running 80 threads. Once we pass 40 threads on the E7-4870, performance starts to level off and even drop."

    It isn't thread locking that limits the performance, and it isn't because it has to sync/coordinate 80 threads. It's because there are only 40 FPUs available to do the actual calculations.

    Unlike virtualization, where thread locking is a real possibility because there really isn't much in the way of underlying computation (I would guess that if you profiled the FPU workload, it wouldn't show up much), CFD, i.e. solving the Navier-Stokes equations, requires a HUGE computational effort.

    It also depends on how the parallelization is done, whether it's multi-threading, OpenMP, or MPI. Even then, different flavors of MPI can yield different results; and to make things even MORE complicated, how the domain is decomposed can also have a HUGE impact on performance. (See the studies performed by LSTC with LS-DYNA.)
  • alpha754293 - Sunday, October 2, 2011 - link

    Try running Fluent (another CFD code) and LS-DYNA.

    CAUTION: both are typically VERY time-intensive benchmarks, so you have to be very patient with them.

    If you need help in setting up standardized test cases, let me know.
  • alpha754293 - Sunday, October 2, 2011 - link

    I'm working on converting an older CFX model to Fluent for a full tractor-trailer aerodynamics run. The last time I ran it, it had about 13.5 million elements.
  • deva - Monday, October 3, 2011 - link

    If you want something that currently scales well, Terra Vista would be a good bet (although it is expensive).

    Have a look at the Multi Machine Build version.

    http://www.presagis.com/products_services/products...

    "...capability to generate databases of
    100+ GeoCells distributed to 256 individual
    compute processes with a single execution."

    That's the bit that caught my eye and made me think it might be useful to use as a benchmarking tool.

    Daniel.
  • mapesdhs - Tuesday, October 4, 2011 - link


    Have you guys considered trying C-ray? It scales very well with no. of cores, benefits from as
    many threads as one can throw at it, and the more complex version of the example render
    scene stresses RAM a bit as well (the small model doesn't stress RAM at all, deliberately so).
    I started a page for C-ray (Google for, "c-ray benchmark", 1st link) but discovered recently
    it's been taken up by the HPC community and is now part of the Phoronix Test Suite (Google
    for, "c-ray benchmark pts", 1st link again). I didn't create C-ray btw (creds to John Tsiombikas),
    just took over John's results page.

    Hmm, don't suppose you guys have the clout to borrow or otherwise have access to an SGI
    Altix UV? Would be fascinating to see how your tests scale with dozens of sockets instead of
    just four, eg. the 960-core UV 100. Even a result from a 40-core UV 10 would be interesting.
    Shared-memory system so latency isn't an issue.

    Ian.
  • shodanshok - Wednesday, October 5, 2011 - link

    Hi Johan,
    thank you for the very interesting article.

    The Hyper-Threading ON vs. OFF results somewhat surprise me, as Windows Server 2008 should be able to prioritize physical cores over logical ones. Was this the case, or did you see logical processors being used before the physical cores were fully utilized? If so, you probably hit a corner case where extensive hardware sharing (and contention) between two threads produces lower aggregate performance.

    Regards.
