Rendering and HPC Benchmark Session Using Our Best Servers

Name: Rendering and HPC Benchmark Session Using Our Best Servers
Item: Rendering and HPC Benchmark Session Using Our Best Servers
Author: Johan De Gelas

by Johan De Gelas on September 30, 2011 12:00 AM EST

52 Comments | Add A Comment

52 Comments

Investigating the Opteron Performance Mystery

What really surprised us was the Opteron's abysmal performance in Stars Euler3D CFD. We did not believe the results and repeated the benchmark at least 10 times on the quad Opteron system. Let us delve a little deeper.

Notice that the Intel Xeons scale very well until the number of threads is higher than the physical core count. The performance of the 40 core E7-4870 only drops when we use 48 threads (with HT off). The Opteron however only scales reasonably well from 1 to 6 threads. Between 6 and 12 threads scaling is very mediocre, but at least performance increases. From there, the performance curve is essentially flat.

The Opteron Performance Remedy?

We contacted Charles of Caselab with our results. He gaves us a few clues:

1. The Euler3d CFD solver uses an unstructured grid (spider web appearance with fluid states stored at segment endpoints). Thus, adjacent physical locations do not (cannot!) map to adjacent memory locations.

2. The memory performance benchmark relevant to Euler3D appears to be the random memory recall rate and NOT the adjacent-memory-sweep bandwidth.

3. Typical memory tests (e.g. Stream) are sequential "block'' based. Euler3D effectively tests random access memory performance.

So sequential bandwidth is not the answer. In fact, in most "Streamish" benchmarks (including our own compiled binaries), the Quad Opteron was close to 100GB/s while the Quad Xeon E7 got only between 37 and 55GB/s. So far it seems that only the Intel compiled stream binaries are able to achieve more than 55GB/s. So we have a piece of FP intensive software that performs a lot of random memory accesses.

On the Opteron, performance starts to slow down when we use more than 12 threads. With 24 or even better 48 threads the application spawns more threads than the available cores within the local socket. This means that remote memory accesses cannot be avoided. Could it be that the performance is completely limited by the threads that have to go the furthest (2 hops)? In others words, some threads working on local memory finish much faster, but the whole test cannot complete until the slowest threads (working on remote memory) finish.

We decided to enable "Node Interleaving" in the BIOS of our Dell R815. This means that data is striped across all four memory controllers. Interleaved accesses are slower than local-only accesses because three out of four operations traverse the HT link. However, all threads should now experience a latency that is more or less the same. We prevent the the worst-case scenario where few threads are seeing 2-hop latency. Let us see if that helped.

STARS Euler3D CFD Testing the Opteron Remedy

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

52 Comments

View All Comments

proteus7 - Tuesday, October 11, 2011 - link
STREAM triad on a 4S Xeon E7 should hit about 65GB/s, unless your memory, or UEFI/bios options are misconfigured. Firmware settings can have a HUGE difference on these systems.

Did you:
Enable Hemisphere mode?
Disable HT?
If running Windows, assume it was Server 2008 R2 SP1?
If running Windows, realize that only certain applications, compiled with specific flags will work on core counts over 64 (kgroup0). Not an issue if HT was off.
Enable prefetch modes in firmware?
ensure system firmware was set to max perf, and not powersaving modes?
if running windows, set power options to max performance profile? (default power profile on server drops perf substantially for short burst benchmarks)
TPC-E is also a great benchmark to run (need some SSD storage/Fusion I/O) HPCC/Linpack are good for HPC testing.
pventi - Monday, October 31, 2011 - link
As you can read from the icc manual when running on non INTEL processors the Non-Temporal pre-fetches are not implemented in the final machine code. This alone means it could be up to 27% faster.

Another reason why it's slower is because the "standard" HW configuration of the Opteron throttles the DRAM pre-fetchers when under load.
Under Linux this behaviour can be changed from shell and should add another 5~10% increase in performance.

So this benchmark should show ~ 30% higher number for the Opteron.

www.metarstation.com

Best Regards
Pierdamiano

Rendering and HPC Benchmark Session Using Our Best Servers

Post Your Comment

52 Comments

View All Comments

proteus7 - Tuesday, October 11, 2011 - link

pventi - Monday, October 31, 2011 - link

Log in

Don't have an account? Sign up now