Testing the Opteron HPC Remedy

The results of memory node interleaving are pretty spectacular, at least in terms of improving Opteron performance.

Once we disable NUMA, our Opteron server scales properly. Performance is multiplied by 3 when we run the benchmark with 48 threads. So memory interleaving does the trick, but since memory interleaving increases the traffic between the CPU nodes, we decided to test with HT assist (a 1MB snoop filter) on and off.

Stars Euler 3D CFD: maximum score revisited

Notice how this benchmark relies on the CPU interconnects: when we disable HT assist but leave interleaving on, we lose more than 25% performance. HT assist avoids many unnessary broadcasts on the HT interconnects. What is more, we did test the Xeon E7 with memory node interleaving (4-way) but this did not improve or decrease performance in any substantial way.

There's even more good news for the Opteron: the score on Cinebench R11.5 rendering improved from 25 (NUMA) to 26.3. (memory node interleaving). It's hardly spectacular, but that's still a nice and free of charge 5% performance boost, assuming you're running workloads that will benefit.

Investigating the Opteron Performance Mystery Final Analysis
Comments Locked

52 Comments

View All Comments

  • derrickg - Friday, September 30, 2011 - link

    Would love to see them benchmarked using such a powerful machine.
  • JohanAnandtech - Friday, September 30, 2011 - link

    Suggestions how to get this done?
  • derrickg - Friday, September 30, 2011 - link

    simple benchmarking: http://www.linuxhaxor.net/?p=1346

    I am sure there are much more advanced ways of taking benchmarks on chess engines, but I have long since dropped out of those circles. Chess engines usually scale very well from 1P and up.
  • JPQY - Saturday, October 1, 2011 - link

    Hi Johan,

    Here you have my link how people can test with Chess calculatings in a very simple way!

    http://www.xtremesystems.org/forums/showthread.php...

    If you are interested you can always contact me.

    Kind regards,
    Jean-Paul.
  • JohanAnandtech - Monday, October 3, 2011 - link

    Thanks Jean-Paul, Derrick, I will check your suggestions. Great to see the community at work :-).
  • fredisdead - Monday, April 23, 2012 - link

    http://www.theinquirer.net/inquirer/review/2141735...

    dear god, at last the truth. Interlagos is 30% faster

    hey anand, whats up with YOUR testing.
  • fredisdead - Monday, April 23, 2012 - link

    everybody, the opteron is 30% faster

    http://www.theinquirer.net/inquirer/review/2141735...

    follow thew intel ad bucks ... lol
  • anglesmith - Friday, September 30, 2011 - link

    i was in a similar situation on a 48 core opteron machine.

    without numa my app was twice slower than a 4 core i7 920. then did a test with same number of threads but with 2 sockets (24 cores), the app became faster than with 48 cores :~
    then found the issue is all with numa which is not a big issue if you are using a 2 socket machine.

    once i coded the app to be numa aware the app is 6 times faster.

    i know there are few apps that are both numa aware and scale to 50 or so cores but ...
  • tynopik - Friday, September 30, 2011 - link

    benhcmark

    like it Phenom
  • JoeKan - Friday, September 30, 2011 - link

    I'd llove to see single core workstations used as baseline comparisons. In using a server to render, I'd be wondering which would be more cost effective to render animations. Maybe use an animation sequence as a render performance test.

Log in

Don't have an account? Sign up now