The Big Question: Why?

The big question is why the Opteron performs so much better with memory node interleaving while the setting has no effect whatsoever on the Xeons. Only very detailed profiling could give us a definitive answer, and that is beyond the scope of this article (and our time constraints). However, we already have a few interesting clues:

  1. Enabling HT Assist improves performance by 32% (8.5 vs. 6.4), which indicates that snoop traffic is a serious bottleneck. That snoop traffic is partly a consequence of memory node interleaving itself, which increases the data traffic between the sockets as the data is striped over the memory nodes.
  2. The application is very sensitive to latency.

The Xeon E7 has a Global Coherence Engine with Directory Assisted Snoop (DAS). As described by David Kanter here, the Coherence Engine makes use of an innovative 2-hop protocol that achieves much lower snoop latency, and it is quite a bit more advanced than the 3-hop protocol combined with the snoop filter that AMD uses on the current Opterons. This might be one explanation for why the Xeon E7 does not need memory node interleaving to perform well in an application that spawns more threads than a single CPU socket has cores.
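
To make the two memory policies a bit more tangible, the small sketch below (our own illustration, not the CFD benchmark) uses libnuma to walk once through a buffer allocated on the local memory node and once through a buffer interleaved across all nodes, which is roughly what the BIOS node interleaving option does for the whole system. The buffer size and stride are arbitrary values picked for the example, and the program has to be linked with -lnuma.

    /* Minimal sketch: node-local vs. interleaved allocation with libnuma.
       Build with: gcc -O2 numa_walk.c -lnuma (the file name is arbitrary). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF_SIZE (256UL * 1024 * 1024)  /* 256MB working set (arbitrary) */
    #define STRIDE   64                     /* one cache line */

    /* Touch every cache line in the buffer once and return elapsed seconds. */
    static double walk(volatile char *buf)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < BUF_SIZE; i += STRIDE)
            buf[i]++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        /* Pages come from the node the thread is running on... */
        char *local = numa_alloc_local(BUF_SIZE);
        /* ...versus pages striped round-robin over all memory nodes,
           which is roughly what the BIOS node interleaving option does. */
        char *interleaved = numa_alloc_interleaved(BUF_SIZE);
        if (!local || !interleaved) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        printf("node-local : %.3f s\n", walk(local));
        printf("interleaved: %.3f s\n", walk(interleaved));

        numa_free(local, BUF_SIZE);
        numa_free(interleaved, BUF_SIZE);
        return 0;
    }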

Conclusion

It is interesting to note that Cinebench also benefits from node interleaving, although the effect is far less extreme than what we witnessed in STARS Euler3D CFD. That could indicate there are quite a few (HPC) applications that could benefit from memory node interleaving, despite the fact that most operating systems are now well optimized for NUMA. We suspect that almost any application that spawns threads across four sockets and works on a common dataset will see some benefit from node interleaving on AMD's quad Opteron platform.

That said, virtualization is not such an application, as most VMs are limited to 4-8 vCPUs. In such setups, the dataset can be kept on the local node with a bit of tuning, and since the release of vSphere 4.0, ESX is pretty good at this.

Looking at the performance results, the Xeons dominated the CFD benchmark, even with interleaving enabled on the Opterons. However, this doesn't mean that the current 12-core Opteron is a terrible choice for HPC use. We know that the AMD Opteron performs very well in some important HPC benchmarks, as you can read here. That benchmark was compiled with an Intel Fortran compiler (ifort 10.0), and you might wonder why it was compiled that way. We asked Charles, the software designer, to answer that question:

"I spent some time with the gfortran compiler but the results were fairly bad. [...] That's why we pay big money for Intel's Fortran compiler!"

What that benchmark and this article show is how careful we must be when looking at performance results for many-threaded workloads and servers. If you just run the CFD benchmark on a typical server configuration, you might conclude that a 12-core Xeon is more than three times faster than a 48-core Opteron setup. However, after some tinkering we begin to understand what is actually going on, and while the final result still isn't particularly impressive (the 12-core/24-thread Xeon still bested the 48-core Opteron by 15%, and the quad Xeon E7-4870 is nearly twice as fast as the best Opteron result so far), there's still potential for improvement.

To Be Continued...

Remember, this is only our first attempt at HPC benchmarking. We'll be looking into more ambitious testing later, and we're hoping to incorporate your feedback. Let us know your suggestions for benchmarks and other tests you'd like to see us run on these servers (and on upcoming hardware as well), and we'll work to make it happen.

Comments

  • jaguarpp - Friday, September 30, 2011

    what if, instead of using a full program, you create a small test program that is compiled for each platform? something like:
    declare int, float, and array variables to test different workloads,
    put the variables in loops and do some operations (sum, div) on the integers, then the floats, and so on, and measure the time it takes to exit each block.
    the hardest part will be making it threadable
    and getting access to different compilers, maybe a friend?
    anyway, great article, i really enjoyed it even though i never get close to that class of hardware.
    thanks very much for the read
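
    A rough sketch of the kind of test program described above, assuming OpenMP is used to make it threadable (the loop count and the arithmetic are arbitrary placeholders, only there to keep the ALU and the FPU busy; build with, for example, gcc -O2 -fopenmp microbench.c):

    /* A rough micro-benchmark sketch: time an integer block and a floating-
       point block, threaded with OpenMP so it can be rebuilt with different
       compilers. The loop count and the math are arbitrary placeholders. */
    #include <omp.h>
    #include <stdio.h>

    #define N (1 << 26)  /* iterations per block (arbitrary) */

    int main(void)
    {
        long long int_sum = 0;
        double flt_sum = 0.0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:int_sum)
        for (long i = 0; i < N; i++)
            int_sum += (i * 3) ^ (i >> 2);                /* integer ALU work */
        double t1 = omp_get_wtime();

        #pragma omp parallel for reduction(+:flt_sum)
        for (long i = 0; i < N; i++)
            flt_sum += (double)i * 1.000001 / (i + 1.5);  /* FPU work */
        double t2 = omp_get_wtime();

        printf("int block  : %.3f s (checksum %lld)\n", t1 - t0, int_sum);
        printf("float block: %.3f s (checksum %.3f)\n", t2 - t1, flt_sum);
        return 0;
    }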
  • Michael REMY - Friday, September 30, 2011

    very interesting analysis, but... why use a score in Cinebench instead of a render time?

    Render times are more meaningful to common and pro users than an integer score!
  • MrSpadge - Friday, September 30, 2011

    Because time is totally dependent on the complexity of your scene, output resolution, etc., whereas the score can be directly translated into a time if you know the time for any of the configurations tested.
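
    For example, if one configuration scores 5 and renders a given scene in 100 seconds, a configuration that scores 10 should render the same scene in roughly 50 seconds; render time scales inversely with the score (the numbers here are made up for illustration).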

    MrS
  • Casper42 - Friday, September 30, 2011

    Go back to Quanta and see if they have a newer BIOS with the Core Disable feature properly implemented. I know the big boys are now implementing the feature, and it allows you to disable as many cores as you want as long as it's done in pairs. So your 10c proc can be turned into 2/4/6/8 core versions as well.

    So for your first test where you had to turn HT off because 80 threads was too much, you could instead turn off 2 cores per proc and synthetically create a 4p32c server and then leave HT on for the full 64 threads.
  • alpha754293 - Sunday, October 2, 2011

    "Hyper-Threading offers better resource utilization but that does not negate the negative performance effect of the overhead of running 80 threads. Once we pass 40 threads on the E7-4870, performance starts to level off and even drop."

    It isn't thread-locking that limits the performance, and it isn't because it has to sync/coordinate 80 threads. It's because there are only 40 FPUs available to do the actual calculations with.

    Unlike virtualization, where thread locking is a real possibility because there isn't much in the way of underlying computation (I would guess that if you profiled the FPU workload, it wouldn't show up much), CFD has to solve the Navier-Stokes equations, which requires a HUGE computational effort.

    It also depends on how the parallelization is done, whether it's multi-threading, OpenMP, or MPI. And even then, different flavors of MPI can yield different results; and to make things even MORE complicated, how the domain is decomposed can also make a HUGE impact on performance. (See the studies performed by LSTC with LS-DYNA.)
  • alpha754293 - Sunday, October 2, 2011

    Try running Fluent (another CFD code) and LS-DYNA.

    CAUTION: both are typically VERY time-intensive benchmarks, so you have to be very patient with them.

    If you need help in setting up standardized test cases, let me know.
  • alpha754293 - Sunday, October 2, 2011

    I'm working on trying to convert an older CFX model to Fluent for a full tractor-trailer aerodynamics run. The last time that I ran that, it had about 13.5 million elements.
  • deva - Monday, October 3, 2011

    If you want something that currently scales well, Terra Vista would be a good bet (although it is expensive).

    Have a look at the Multi Machine Build version.

    http://www.presagis.com/products_services/products...

    "...capability to generate databases of
    100+ GeoCells distributed to 256 individual
    compute processes with a single execution."

    That's the bit that caught my eye and made me think it might be useful to use as a benchmarking tool.

    Daniel.
  • mapesdhs - Tuesday, October 4, 2011


    Have you guys considered trying C-ray? It scales very well with the number of cores, benefits from as many threads as one can throw at it, and the more complex version of the example render scene stresses RAM a bit as well (the small model doesn't stress RAM at all, deliberately so). I started a page for C-ray (Google for "c-ray benchmark", 1st link) but discovered recently it's been taken up by the HPC community and is now part of the Phoronix Test Suite (Google for "c-ray benchmark pts", 1st link again). I didn't create C-ray btw (creds to John Tsiombikas), just took over John's results page.

    Hmm, don't suppose you guys have the clout to borrow or otherwise get access to an SGI Altix UV? It would be fascinating to see how your tests scale with dozens of sockets instead of just four, e.g. the 960-core UV 100. Even a result from a 40-core UV 10 would be interesting. It's a shared-memory system, so latency isn't an issue.

    Ian.
  • shodanshok - Wednesday, October 5, 2011

    Hi Johan,
    thank you for the very interesting article.

    The Hyper-Threading on vs. off results somewhat surprise me, as Windows Server 2008 should be able to prioritize physical cores over logical ones. Was this the case, or did you see logical processors being used before the physical cores were fully utilized? If so, you probably encountered a corner case where extensive hardware sharing (and contention) between two threads produces lower aggregate performance.

    Regards.
