Testing the Opteron HPC Remedy

The results of memory node interleaving are pretty spectacular, at least in terms of improving Opteron performance.

Once we disable NUMA via memory node interleaving, our Opteron server scales properly: performance triples when we run the benchmark with 48 threads. So memory interleaving does the trick, but since it increases the traffic between the CPU nodes, we also tested with HT Assist (a 1MB snoop filter) on and off.

[Chart: Stars Euler 3D CFD, maximum score revisited]

Notice how much this benchmark relies on the CPU interconnects: when we disable HT Assist but leave interleaving on, we lose more than 25% of our performance. HT Assist avoids many unnecessary broadcasts on the HT links. We also tested the Xeon E7 with 4-way memory node interleaving, but it neither improved nor hurt performance in any substantial way.
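
As an aside, you do not need a BIOS option to experiment with this allocation policy: on Linux, a process can ask for its memory to be interleaved across all nodes. The C sketch below is only an illustration and not part of the test setup above; it assumes libnuma is installed and the program is linked with -lnuma, and it simply spreads one large buffer round-robin over every node, which is roughly what the BIOS setting does for all of memory.

```c
/* Illustration only: interleave one large buffer across all NUMA nodes
 * with libnuma, approximating per process what the BIOS "memory node
 * interleaving" option does system-wide. Build: gcc -O2 demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t bytes = 1UL << 30;   /* 1 GB test buffer, arbitrary size */

    /* Pages of this allocation are placed round-robin over all nodes,
     * so no single memory controller becomes the bottleneck. */
    double *buf = numa_alloc_interleaved(bytes);
    if (buf == NULL) {
        perror("numa_alloc_interleaved");
        return 1;
    }

    /* Touch every page so the interleaved placement actually happens
     * (a page is only assigned to a node on first fault). */
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = (double)i;

    printf("Interleaved %zu MB across %d NUMA nodes\n",
           bytes >> 20, numa_num_configured_nodes());

    numa_free(buf, bytes);
    return 0;
}
```

The same policy can be applied to an unmodified binary with numactl, for example `numactl --interleave=all ./your_benchmark` (the binary name is just a placeholder).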

There's even more good news for the Opteron: the score on Cinebench R11.5 rendering improved from 25 (NUMA) to 26.3 (memory node interleaving). It's hardly spectacular, but that's still a nice, free-of-charge 5% performance boost, assuming you're running workloads that will benefit.

Comments

  • proteus7 - Tuesday, October 11, 2011

    STREAM triad on a 4S Xeon E7 should hit about 65GB/s, unless your memory or UEFI/BIOS options are misconfigured (a minimal triad kernel is sketched after the comments). Firmware settings can make a HUGE difference on these systems.

    Did you:
    Enable Hemisphere mode?
    Disable HT?
    If running Windows, I assume it was Server 2008 R2 SP1?
    If running Windows, realize that only certain applications, compiled with specific flags, will work on core counts over 64 (kgroup0)? Not an issue if HT was off.
    Enable the prefetch modes in firmware?
    Ensure the system firmware was set to max performance, not power-saving modes?
    If running Windows, set the power options to the max performance profile? (The default power profile on Server drops performance substantially for short-burst benchmarks.)
    TPC-E is also a great benchmark to run (you need some SSD storage/Fusion-io). HPCC/Linpack are good for HPC testing.
  • pventi - Monday, October 31, 2011

    As you can read in the ICC manual, when running on non-Intel processors the non-temporal prefetches are not implemented in the final machine code. This alone means it could be up to 27% faster.

    Another reason why it's slower is that the "standard" HW configuration of the Opteron throttles the DRAM prefetchers when under load.
    Under Linux this behaviour can be changed from the shell, and it should add another 5~10% to performance.

    So this benchmark should show a ~30% higher number for the Opteron.

    www.metarstation.com

    Best Regards
    Pierdamiano
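
For readers unfamiliar with the benchmark proteus7 mentions, the STREAM "triad" figure comes down to timing one simple kernel, a[i] = b[i] + scalar * c[i], over arrays far larger than the caches. The single-threaded C sketch below is only an illustration with an arbitrary array size; the real STREAM benchmark repeats the kernel, verifies the results and uses OpenMP across all cores, which is how a 4S Xeon E7 can approach numbers like the 65GB/s quoted above.

```c
/* Minimal single-threaded sketch of the STREAM "triad" kernel:
 * a[i] = b[i] + scalar * c[i]. Illustration only; the real STREAM
 * benchmark adds repetition, verification and OpenMP threading.
 * Build: gcc -O2 triad.c (add -lrt on older glibc). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 80000000L   /* ~80M doubles per array, well beyond any cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    const double scalar = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];    /* the triad kernel */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Triad touches three arrays per pass: two streamed reads, one write. */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("Triad: %.2f GB/s\n", gbytes / secs);

    free(a); free(b); free(c);
    return 0;
}
```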
