Rendering and HPC Benchmark Session Using Our Best Servers

Author: Johan De Gelas

by Johan De Gelas on September 30, 2011 12:00 AM EST


The Big Question: Why?

The big question is why the Opteron performs so much better with memory node interleaving while interleaving has no effect whatsoever on the Xeons. Only very detailed profiling could give us a definitive answer, and that is beyond the scope of this article (and our time constraints). However, we already have a few interesting clues:

Enabling HT assist improves performance by 32% (8.5 vs 6.4), which indicates that snoop traffic is a serious bottleneck. That is also a result of using memory node interleaving, which increases the data traffic between the sockets as data is striped over the memory nodes.
The application is very sensitive to latency.
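The size of the HT assist gain can be checked directly from the two scores quoted above. The short Python sketch below is only an illustration; the function and variable names are our own, and it assumes both figures are scores where higher is better. The exact ratio works out to just under 33%, which is consistent with the roughly 32% quoted.

```python
def relative_gain(new: float, old: float) -> float:
    """Return the fractional improvement of `new` over `old`."""
    return new / old - 1.0


# Scores quoted in the text (assumed: higher is better).
ht_assist_on = 8.5
ht_assist_off = 6.4

gain = relative_gain(ht_assist_on, ht_assist_off)
print(f"HT assist gain: {gain:.1%}")  # prints "HT assist gain: 32.8%"
```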

The Xeon E7 has a Global Coherence Engine with Directory Assisted Snoop (DAS). As described by David Kanter here, the Coherence Engine makes use of an innovative 2-hop protocol that achieves much lower snoop latency. Intel's Coherence Engine is quite a bit more advanced than the 3-hop protocol combined with the snoop filter that AMD uses on the current Opterons. This might be one explanation why the Xeon E7 does not need memory node interleaving to get good performance in an application that spawns more threads than the core count of one CPU socket.

Conclusion

It is interesting to note that Cinebench also benefits from node interleaving, although the effect is a lot less extreme than what we witnessed in STARS Euler3D CFD. That could indicate there are quite a few (HPC) applications which could benefit from memory node interleaving despite the fact that most operating systems are now well optimized for NUMA. We suspect that almost any application that spawns threads across four sockets and works on a common dataset will see some benefit from node interleaving on AMD's quad Opteron platform.

That said, virtualization is not such an application, as most VMs are limited to 4-8 vCPUs. In such setups, the dataset can be kept locally with a bit of tuning, and since the release of vSphere 4.0, ESX is pretty good at this.

Looking at the performance results, the Xeons dominated the CFD benchmark, even with interleaving enabled on the Opterons. However, this doesn't mean that the current 12-core Opteron is a terrible choice for HPC use. We know that the AMD Opteron performs very well in some important HPC benches, as you can read here. That benchmark was compiled with an Intel Fortran compiler (ifort 10.0), and you might wonder why it was compiled that way. We asked Charles, the software designer, to answer that question:

"I spent some time with the gfortran compiler but the results were fairly bad. [...] That's why we pay big money for Intel's Fortran compiler!"

What that benchmark and this article show is how careful we must be when looking at performance results for many-threaded workloads and servers. If you just run the CFD benchmark on a typical server configuration, you might conclude that a 12-core Xeon is more than three times faster than a 48-core Opteron setup. However, after some tinkering we begin to understand what is actually going on, and while the final result still isn't particularly impressive (the 12-core/24-thread Xeon still bested the 48-core Opteron by 15%, and the quad Xeon E7-4870 is nearly twice as fast as the best Opteron result so far), there's still potential for improvement.

To Be Continued...

Remember, this is only our first attempt at HPC benchmarking. We'll be looking into more ambitious testing later, and we're hoping to incorporate your feedback. Let us know your suggestions for benchmarks and other tests you'd like to see us run on these servers (and upcoming hardware as well), and we'll work to make it happen.


52 Comments


Phynaz - Friday, September 30, 2011 - link
If you are overclocking in a business environment, what other moronic decisions has your company made?

When does the going out of business sale start?
MrSpadge - Friday, September 30, 2011 - link
Don't condemn him blindly. By overclocking they can get substantially more performance from a similar budget. That's more efficiency - if done right.

The question is "what happens in case of failure". If it's just a crashing machine, the rendering can be repeated by another one and this machine can be tuned down a bit. If it's a visual artefact during rendering, the rendering can be repeated by any machine and this machine tuned down a bit. What else could go wrong in rendering? Obviously you wouldn't want to OC your web server or database..

BTW: there was an article here some time ago, showing Cyrix doing their testing on OC'ed i7s.

MrS
Kvarta - Tuesday, October 4, 2011 - link
Don't be so sure. Recently you can see standard desktop CPUs beating expensive Xeons in professional applications. Example:
http://www.solidworks.com/sw/support/shareyourscor...
So you don't need to buy a very expensive DELL or other workstation; instead, go to the PC boutique around the corner :)
JohanAnandtech - Saturday, October 1, 2011 - link
This was not meant to be a professional rendering test. It was more an experiment to give the enthusiasts an idea of what these servers are capable of. If you have a suggestion on which animation we should use in our benchmarking scenarios, let me know. I have a solid background in the "web - database - virtualization" field (I have been active in the field for more than 10 years now, teaching and consulting), but rendering and HPC is something I only know from a benchmarking background :-).
WeaselITB - Friday, September 30, 2011 - link
I'll echo the other sentiments here. If a Xeon system renders something twice as fast as the Opteron system, but takes five times the power draw to do it, it's a net-win for the Opteron system. Performance / watt would be a useful metric in these comparisons, especially as systems like these will be going in data rooms where excess wattage = excess heat = excess money.

I would also be interested to know what a comparison would be between a "big iron" system like this versus a "traditional" render farm composed of some Core i7 machines.

Awesome review, though! I'm especially happy with the fact that you didn't just say "Oh, the Opteron kinda sucks in this test. Oh well." but actually took a look deeper into what's going on with the benchmark and the workload. THAT's the type of analysis that makes me keep coming back to AnandTech. :-)

Thanks,
-Weasel
JarredWalton - Friday, September 30, 2011 - link
Pretty sure perf/Watt isn't going to be in Opteron's favor, but there's a lot of stuff you need to account for. Johan did some measurements of power use on these servers previously (http://www.anandtech.com/show/4285/6), but as pointed out the Intel setup has a lot more RAS features and such that could be adding to the power use. Even so, the "load" power measured (using vAPUS, which may use less than something like 3D rendering) is around 875W (HT off) to 920W for the Intel E7-4870 server compared to around 590W for the Opteron 6174 server.

In terms of perf/Watt, if those figures are relatively close for the benchmarks Johan has done here, then the best scores in Cinebench give 0.0355 CB/Watt for E7-4870 vs. 0.0425 CB/Watt for the Opteron 6174 -- and again, note that the 64-thread limit (tested with 40) means CB11.5 isn't able to make maximum use of the Intel platform. For the second test, best-case we measure 0.0194 CFD/W for Intel compared to 0.0145 CFD/W for AMD. So AMD wins in 3D rendering by 20% and Intel wins in the Euler3D CFD test by 34%, at least given the current estimates.

My gut feeling is that if all other elements and features were identical, other than the necessary chipset and CPU differences (e.g. the PSU, amount of RAM, HDDs, fans, RAS features, etc.), the difference in power draw for the two platforms should be within 100W, not the up to 340W spread measured in the earlier article. (There's also a 310W difference at idle, which gives some indication of all the other things that appear to be running on the Intel setup, as normal idle power looking at just the CPUs should be very nearly equal.) So these figures I list here are specific to the Intel Quanta QSCC-4R and Dell PowerEdge R815 and may not hold for other AMD/Intel servers. In other words, take with a grain of salt.
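The 20% and 34% advantages in the comment above follow directly from the four perf/Watt figures it quotes. The Python sketch below only redoes that arithmetic; the dictionary layout and the helper name are our own, and the inputs carry the same caveats as the estimates they come from.

```python
def advantage(winner: float, loser: float) -> float:
    """Return how far `winner` exceeds `loser`, as a fraction."""
    return winner / loser - 1.0


# Perf/Watt estimates quoted in the comment above.
cinebench_per_watt = {"E7-4870": 0.0355, "Opteron 6174": 0.0425}
cfd_per_watt = {"E7-4870": 0.0194, "Opteron 6174": 0.0145}

amd_render_adv = advantage(cinebench_per_watt["Opteron 6174"],
                           cinebench_per_watt["E7-4870"])
intel_cfd_adv = advantage(cfd_per_watt["E7-4870"],
                          cfd_per_watt["Opteron 6174"])

print(f"Opteron perf/Watt lead in Cinebench: {amd_render_adv:.1%}")  # 19.7%
print(f"Xeon perf/Watt lead in Euler3D CFD:  {intel_cfd_adv:.1%}")   # 33.8%
```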
RandomUsername3245 - Friday, September 30, 2011 - link
The Intel compiler is a very good compiler for Intel CPUs, but in the past it was well known for producing poor quality binaries for non-Intel CPUs. I would still be wary of benchmarking Intel vs. AMD when running code compiled with Intel's compiler.

FWIW, I heard a while ago that Intel was "officially" going to stop artificially penalizing AMD CPUs that run Intel-compiled code.
James5mith - Friday, September 30, 2011 - link
Just a note:

We are doing some in house testing for high end database testing using solid state storage connected via infiniband to multisocket servers.

An example:

Dell R910
4x 8C/16T X7560 2.26GHz Xeon CPU (32C/64T total)
512GB RAM
2x 146GB 15K SAS hdd's in RAID1 (OS)
2x Mellanox QDR Infiniband 40Gbps adapters

Hooked up to some seriously fast external flash storage, we got around 6GB/s+. This allowed us to do massively multi-threaded workloads, like building an index on a 2TB database.

During these tests, we can max out all 64 Threads and put the entire box under 100% load. It was during these tests we found out that Dell has a flawed implementation of the Intel SpeedStep technology which keeps the fans from ramping up under load.

Without the fast storage, we could never have fully stress tested the box.
mczak - Friday, September 30, 2011 - link
I think part of why the Opteron has bad scaling without interleaving and the Xeon does not is not just due to the coherence engine.
Don't forget that while both have 4 sockets, the Opteron is an 8-node system. The article states that there are "4 memory controllers" and "3 out of 4 operations traverse the HT link" which isn't really true, as there are 8 memory controllers (and 7 out of 8 operations traverse HT, though some of them are internal HT links).
You can see that this makes a difference with the bad scaling from 6 to 12 threads (though not as bad as with even more threads...).
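The "3 out of 4" and "7 out of 8" figures in the comment above come from a simple model: if data is striped evenly over N memory nodes and a thread touches all of it uniformly, only 1/N of its accesses land on the local node. The Python sketch below illustrates that model only; the uniform-access assumption and the function name are ours, and real workloads with better locality will see a lower remote share.

```python
from fractions import Fraction


def remote_fraction(nodes: int) -> Fraction:
    """Share of accesses that leave the local node, assuming data is
    striped evenly over `nodes` memory nodes and accessed uniformly."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return Fraction(nodes - 1, nodes)


# 4 nodes: one memory controller per socket.
# 8 nodes: two dies (and two memory controllers) per Opteron socket.
for nodes in (4, 8):
    frac = remote_fraction(nodes)
    print(f"{nodes} nodes: {frac} of accesses are remote ({float(frac):.1%})")
```

Under this model the remote share rises from 75.0% to 87.5% when the same four sockets are treated as eight nodes.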
extide - Friday, September 30, 2011 - link
Don't forget the Xeon E7 is 4 sockets with 4 memory channels each.
