The Big Question: Why?

The big question is why the Opteron performs so much better with memory node interleaving while it has no effect whatsoever on the Xeons. Only very detailed profiling could give us a definitive answer, but that is beyond the scope of this article (and our time constraints). However, we already have a few interesting clues:

  1. Enabling HT assist improves performance by 32% (8.5 vs 6.4), which indicates that snoop traffic is a serious bottleneck. That bottleneck is partly a consequence of memory node interleaving itself: because the data is striped over the memory nodes, more traffic has to cross between the sockets.
  2. The application is very sensitive to latency.
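
To make the striping in the first clue concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not a description of how the Opteron's memory controller is actually implemented: the round-robin mapping, the block granularity, the node count of 4, and all function names are hypothetical and chosen only for illustration.

```python
# Toy model of memory node interleaving (illustrative assumptions only).
# Assumption: memory is split into equal blocks, and block i is placed
# on node i mod n_nodes (simple round-robin striping).


def home_node(block_index: int, n_nodes: int) -> int:
    """Return the node that holds a given block under round-robin striping."""
    return block_index % n_nodes


def blocks_per_node(n_blocks: int, n_nodes: int) -> list[int]:
    """Count how many blocks land on each node."""
    counts = [0] * n_nodes
    for block in range(n_blocks):
        counts[home_node(block, n_nodes)] += 1
    return counts


def remote_fraction(thread_node: int, n_blocks: int, n_nodes: int) -> float:
    """Fraction of blocks that are NOT local to a thread running on thread_node."""
    remote = sum(
        1 for block in range(n_blocks) if home_node(block, n_nodes) != thread_node
    )
    return remote / n_blocks


if __name__ == "__main__":
    n_nodes = 4      # hypothetical node count
    n_blocks = 4000  # hypothetical dataset size in blocks
    print(blocks_per_node(n_blocks, n_nodes))     # load per memory node
    print(remote_fraction(0, n_blocks, n_nodes))  # share of remote blocks for node 0
```

In this model the dataset is spread evenly over all memory nodes, but a thread on any one node finds only 1/n of the blocks locally. The rest must be fetched from another node, which is the extra inter-socket traffic described above.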

The Xeon E7 has a Global Coherence Engine with Directory Assisted Snoop (DAS). As described by David Kanter here, the Coherence Engine makes use of an innovative 2-hop protocol that achieves much lower snoop latency. Intel's Coherence Engine is quite a bit more advanced than the 3-hop protocol combined with the snoop filter that AMD uses on the current Opterons. This might be one explanation why the Xeon E7 does not need memory node interleaving to get good performance in an application that spawns more threads than the core count of one CPU socket.

Conclusion

It is interesting to note that Cinebench also benefits from node interleaving, although the effect is a lot less extreme than what we witnessed in STARS Euler3D CFD. That could indicate there are quite a few (HPC) applications which could benefit from memory node interleaving despite the fact that most operating systems are now well optimized for NUMA. We suspect that almost any application that spawns threads across four sockets and works on a common dataset will see some benefit from node interleaving on AMD's quad Opteron platform.

That said, virtualization is not such an application, as most VMs are limited to 4-8 vCPUs. In such setups, the dataset can be kept locally with a bit of tuning, and since the release of vSphere 4.0, ESX is pretty good at this.

Looking at the performance results, the Xeons dominated the CFD benchmark, even with interleaving enabled on the Opterons. However, this doesn't mean that the current 12-core Opteron is a terrible choice for HPC use. We know that the AMD Opteron performs very well in some important HPC benches, as you can read here. That benchmark was compiled with an Intel Fortran compiler (ifort 10.0), and you might wonder why it was compiled that way. We asked Charles, the software designer, to answer that question:

"I spent some time with the gfortran compiler but the results were fairly bad. [...] That's why we pay big money for Intel's Fortran compiler!"

What that benchmark and this article show is how careful we must be when looking at performance results for many-threaded workloads and servers. If you just run the CFD benchmark on a typical server configuration, you might conclude that a 12-core Xeon is more than three times faster than a 48-core Opteron setup. However, after some tinkering we begin to understand what is actually going on. The final result still isn't particularly impressive (the 12-core/24-thread Xeon still beats the 48-core Opteron by 15%, and the quad Xeon E7-4870 is nearly twice as fast as the best Opteron result so far), but there's still potential for improvement.

To Be Continued...

Remember, this is only our first attempt at HPC benchmarking. We'll be looking into more ambitious testing later, and we're hoping to incorporate your feedback. Let us know your suggestions for benchmarks and other tests you'd like to see us run on these servers (and upcoming hardware as well), and we'll work to make it happen.

52 Comments

  • MrSpadge - Friday, September 30, 2011 - link

    Agreed - performance of a single i7 2600 can be hard to beat, depending on the application. My Matlab code uses all physical cores through the Intel Math Kernel Library, yet is ~30% slower on 2 x X5570 (which is about the difference in clock speed, incidentally).

    MrS
  • JohanAnandtech - Friday, September 30, 2011 - link

    http://www.anandtech.com/show/4486/server-renderin...

    The Core i7-970 3.2 GHz is included. But indeed, it has been some time since we have used Backburner.

    Is this the kind of bench you are looking for?
    http://www.anandtech.com/show/2240/7

    Backburner scales extremely well, so I suspect that especially the Quad MC Dell is a very good choice compared to a workstation.
  • JoeKan - Friday, September 30, 2011 - link

    Yes - the backburner test is it. Although I use different rendering software, that test would be appropriate as the visualization rendering can properly represent real life usage and can stress the hardware at the same time.

    The test linked uses frames 20-29. I'd like to see a longer frame sequence.

    The reason I asked that a workstation be used as a base reference is because that gives us, the readers, a point of reference to compare against. I define a workstation as a single CPU box anyone can build with off the shelf components, like a i7-2600K, or a i7-970 - a performance CPU in the $300+ to $600 range. That allows one to compare performance on a per $ basis.

    Not a true 'workstation' as it does not use a Xeon, but it gives the ability to compare 'performance' to 'performance per buck' basis.

    By using a $1000+ class CPU for comparison the 'bang for the buck' comparison is distorted.
  • xxtypersxx - Friday, September 30, 2011 - link

    I love reading about the high end server hardware, it's like F1 compared to road cars.

    As for benchmarks, may I suggest the linux x64 Folding at Home client? We know it scales past at least 128 cores without issue and as many of us that fold are running server hardware anyway, it will attract a new audience to the reviews.
  • rehm - Friday, September 30, 2011 - link

    Hello,
    for CFD benchmarking you could also consider the code OpenFOAM. It scales very well and is gaining a lot of interest in industry and academia. Memory behaviour should be comparable to Fluent and it can be compiled with gcc and icc.

    Regards
  • JohanAnandtech - Friday, September 30, 2011 - link

    Very nice suggestion... but is there a sample solution/ benchmark we can measure? It is a bit hard for a hardware reviewer to come up with very specialized realworld tests :-).
  • ozztheforester - Friday, September 30, 2011 - link

    I am currently using a bunch of 2600k's for rendering. In the past I used some dual Xeon setups, but found them extremely inefficient in cost/performance ratio. Can you please let us know the cost and power consumption of this system?

    currently getting around 8.72 points on cinebench 11.5 on a 2600k pc @4.5ghz which is consuming less than 200 watts at full load and costing a bit less than 800usd

    also I would suggest using vray for multi thread benchmarks
  • sicofante - Friday, September 30, 2011 - link

    Why didn't you set up a scene in Maya or Softimage and then render it with Mental Ray? THAT would be a professional test, Cinebench is not.

    BTW, no matter how powerful, these Xeon E7 systems are a no-go for studios. They are plainly uneconomical. You can have a much more sensible setup by putting ordinary Xeons or overclocked Core i7s in many racks, i.e., a rendering farm.

    (Note: I build rendering farms for studios. Since 3D rendering grows almost linearly with frequency, what matters in the end is Euros/GHz, that is normalized GHz)
  • Phynaz - Friday, September 30, 2011 - link

    What studio renders on overclocked desktop cpu's?
  • confusis - Friday, September 30, 2011 - link

    My studio does. We can't yet step up to a higher end multi-socket rendering server (finances, start-up company) so we make do with Phenom II x4's. A desktop box is good value for money at our end of the company scale. Once we grow we'll be looking at Interlagos however
