The Big Question: Why?

The big question is why the Opteron performs so much better with memory node interleaving while it has no effect whatsoever on the Xeons. Only very detailed profiling could give us an accurate and complete answer, but that is a bit beyond the scope of this article (and our time constraints). However, we already have a few interesting clues:

  1. Enabling HT Assist improves performance by 32% (8.5 vs. 6.4), which indicates that snoop traffic is a serious bottleneck. That bottleneck is partly a result of memory node interleaving itself, which increases the data traffic between the sockets as data is striped over the memory nodes.
  2. The application is very sensitive to latency.

The Xeon E7 has a Global Coherence Engine with Directory Assisted Snoop (DAS). As described by David Kanter here, the Coherence Engine makes use of an innovative 2-hop protocol that achieves much lower snoop latency. Intel's Coherence Engine is quite a bit more advanced than the 3-hop protocol combined with a snoop filter that AMD uses on the current Opterons. This may be one explanation of why the Xeon E7 does not need memory node interleaving to get good performance in an application that spawns more threads than the core count of one CPU socket.

Conclusion

It is interesting to note that Cinebench also benefits from node interleaving, although the effect is a lot less extreme than what we witnessed in STARS Euler3D CFD. That could indicate there are quite a few (HPC) applications which could benefit from memory node interleaving, despite the fact that most operating systems are now well optimized for NUMA. We suspect that almost any application that spawns threads across four sockets and works on a common dataset will see some benefit from node interleaving on AMD's quad Opteron platform.
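For readers who want to experiment with this on their own hardware without a trip into the BIOS, Linux can approximate node interleaving on a per-process basis with the standard numactl tool. The sketch below is illustrative only; `./cfd_solver` is a placeholder name for whatever application you want to test.

```shell
# Inspect the NUMA topology: node count, per-node memory, inter-node distances
numactl --hardware

# Stripe this process's memory page by page across all nodes, roughly
# approximating the BIOS "node interleaving" setting for just this process
numactl --interleave=all ./cfd_solver

# For comparison: pin both threads and memory to a single socket/node
numactl --cpunodebind=0 --membind=0 ./cfd_solver
```

This makes it easy to compare default first-touch placement, interleaved placement, and single-node placement for the same binary, without changing firmware settings for the whole machine.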

That said, virtualization is not such an application, as most VMs are limited to 4-8 vCPUs. In such setups, the dataset can be kept locally with a bit of tuning, and since the release of vSphere 4.0, ESX is pretty good at this.

Looking at the performance results, the Xeons dominated the CFD benchmark, even with node interleaving enabled on the Opterons. However, this doesn't mean that the current 12-core Opteron is a terrible choice for HPC use. We know that the AMD Opteron performs very well in some important HPC benchmarks, as you can read here. That benchmark was compiled with an Intel Fortran compiler (ifort 10.0), and you might wonder why it was compiled that way. We asked Charles, the software designer, to answer that question:

"I spent some time with the gfortran compiler but the results were fairly bad. [...] That's why we pay big money for Intel's Fortran compiler!"

What that benchmark and this article show is how careful we must be when looking at performance results for many-threaded workloads and servers. If you just run the CFD benchmark on a typical server configuration, you might conclude that a 12-core Xeon is more than three times faster than a 48-core Opteron setup. However, after some tinkering we begin to understand what is actually going on, and while the final result still isn't particularly impressive (the 12-core/24-thread Xeon still bested the 48-core Opteron by 15%, and the quad Xeon E7-4870 is nearly twice as fast as the best Opteron result so far), there's still potential for improvement.

To Be Continued...

Remember, this is only our first attempt at HPC benchmarking. We'll be looking into more ambitious testing later, and we're hoping to incorporate your feedback. Let us know your suggestions for benchmarks and other tests you'd like to see us run on these servers (and upcoming hardware as well), and we'll work to make it happen.

52 Comments

  • derrickg - Friday, September 30, 2011 - link

    Would love to see them benchmarked using such a powerful machine.
  • JohanAnandtech - Friday, September 30, 2011 - link

    Suggestions how to get this done?
  • derrickg - Friday, September 30, 2011 - link

    simple benchmarking: http://www.linuxhaxor.net/?p=1346

    I am sure there are much more advanced ways of taking benchmarks on chess engines, but I have long since dropped out of those circles. Chess engines usually scale very well from 1P and up.
  • JPQY - Saturday, October 1, 2011 - link

    Hi Johan,

    Here you have my link how people can test with Chess calculatings in a very simple way!

    http://www.xtremesystems.org/forums/showthread.php...

    If you are interested you can always contact me.

    Kind regards,
    Jean-Paul.
  • JohanAnandtech - Monday, October 3, 2011 - link

    Thanks Jean-Paul, Derrick, I will check your suggestions. Great to see the community at work :-).
  • fredisdead - Monday, April 23, 2012 - link

    http://www.theinquirer.net/inquirer/review/2141735...

    dear god, at last the truth. Interlagos is 30% faster

    hey anand, whats up with YOUR testing.
  • fredisdead - Monday, April 23, 2012 - link

    everybody, the opteron is 30% faster

    http://www.theinquirer.net/inquirer/review/2141735...

    follow thew intel ad bucks ... lol
  • anglesmith - Friday, September 30, 2011 - link

    i was in a similar situation on a 48 core opteron machine.

    without numa my app was twice slower than a 4 core i7 920. then did a test with same number of threads but with 2 sockets (24 cores), the app became faster than with 48 cores :~
    then found the issue is all with numa which is not a big issue if you are using a 2 socket machine.

    once i coded the app to be numa aware the app is 6 times faster.

    i know there are few apps that are both numa aware and scale to 50 or so cores but ...
  • tynopik - Friday, September 30, 2011 - link

    benhcmark

    like it Phenom
  • JoeKan - Friday, September 30, 2011 - link

    I'd llove to see single core workstations used as baseline comparisons. In using a server to render, I'd be wondering which would be more cost effective to render animations. Maybe use an animation sequence as a render performance test.
