Testing the Opteron HPC Remedy

The results of memory node interleaving are pretty spectacular, at least in terms of improving Opteron performance.

Once we disable NUMA via memory node interleaving, our Opteron server scales properly: performance triples when we run the benchmark with 48 threads. So memory interleaving does the trick, but since it increases the traffic between the CPU nodes, we also tested with HT Assist (a 1MB snoop filter) on and off.

[Chart: Stars Euler 3D CFD, maximum score revisited]

Notice how much this benchmark relies on the CPU interconnects: when we disable HT Assist but leave interleaving on, we lose more than 25% of our performance. HT Assist avoids many unnecessary broadcasts on the HT links. We also tested the Xeon E7 with 4-way memory node interleaving, but it neither improved nor hurt performance in any substantial way.
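
As an aside, you do not need a BIOS option to experiment with this allocation policy: on Linux, a process can ask for its memory to be interleaved across all nodes. The C sketch below is only an illustration and not part of the test setup above; it assumes libnuma is installed and the program is linked with -lnuma, and it simply spreads one large buffer round-robin over every node, which is roughly what the BIOS setting does for all of memory.

```c
/* Illustration only: interleave one large buffer across all NUMA nodes
 * with libnuma, approximating per process what the BIOS "memory node
 * interleaving" option does system-wide. Build: gcc -O2 demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t bytes = 1UL << 30;   /* 1 GB test buffer, arbitrary size */

    /* Pages of this allocation are placed round-robin over all nodes,
     * so no single memory controller becomes the bottleneck. */
    double *buf = numa_alloc_interleaved(bytes);
    if (buf == NULL) {
        perror("numa_alloc_interleaved");
        return 1;
    }

    /* Touch every page so the interleaved placement actually happens
     * (a page is only assigned to a node on first fault). */
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = (double)i;

    printf("Interleaved %zu MB across %d NUMA nodes\n",
           bytes >> 20, numa_num_configured_nodes());

    numa_free(buf, bytes);
    return 0;
}
```

The same policy can be applied to an unmodified binary with numactl, for example `numactl --interleave=all ./your_benchmark` (the binary name is just a placeholder).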

There's even more good news for the Opteron: the score on Cinebench R11.5 rendering improved from 25 (NUMA) to 26.3 (memory node interleaving). It's hardly spectacular, but that's still a nice, free-of-charge 5% performance boost, assuming you're running workloads that will benefit.

Comments

  • proteus7 - Tuesday, October 11, 2011

    STREAM triad on a 4S Xeon E7 should hit about 65GB/s, unless your memory or UEFI/BIOS options are misconfigured (a minimal triad kernel is sketched after the comments). Firmware settings can make a HUGE difference on these systems.

    Did you:
    Enable Hemisphere mode?
    Disable HT?
    If running Windows, I assume it was Server 2008 R2 SP1?
    If running Windows, realize that only certain applications, compiled with specific flags, will work on core counts over 64 (kgroup0)? Not an issue if HT was off.
    Enable the prefetch modes in firmware?
    Ensure the system firmware was set to max performance, not power-saving modes?
    If running Windows, set the power options to the max performance profile? (The default power profile on Server drops performance substantially for short-burst benchmarks.)
    TPC-E is also a great benchmark to run (you need some SSD storage/Fusion-io). HPCC/Linpack are good for HPC testing.
  • pventi - Monday, October 31, 2011

    As you can read in the ICC manual, when running on non-Intel processors the non-temporal prefetches are not implemented in the final machine code. This alone means it could be up to 27% faster.

    Another reason why it's slower is that the "standard" HW configuration of the Opteron throttles the DRAM prefetchers when under load.
    Under Linux this behaviour can be changed from the shell, and it should add another 5~10% to performance.

    So this benchmark should show a ~30% higher number for the Opteron.

    www.metarstation.com

    Best Regards
    Pierdamiano
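
For readers unfamiliar with the benchmark proteus7 mentions, the STREAM "triad" figure comes down to timing one simple kernel, a[i] = b[i] + scalar * c[i], over arrays far larger than the caches. The single-threaded C sketch below is only an illustration with an arbitrary array size; the real STREAM benchmark repeats the kernel, verifies the results and uses OpenMP across all cores, which is how a 4S Xeon E7 can approach numbers like the 65GB/s quoted above.

```c
/* Minimal single-threaded sketch of the STREAM "triad" kernel:
 * a[i] = b[i] + scalar * c[i]. Illustration only; the real STREAM
 * benchmark adds repetition, verification and OpenMP threading.
 * Build: gcc -O2 triad.c (add -lrt on older glibc). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 80000000L   /* ~80M doubles per array, well beyond any cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    const double scalar = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];    /* the triad kernel */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Triad touches three arrays per pass: two streamed reads, one write. */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("Triad: %.2f GB/s\n", gbytes / secs);

    free(a); free(b); free(c);
    return 0;
}
```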
