Is Sandy Bridge-EP an Upgrade Path?

At the beginning of this review, I referred back to Johan’s article on the behind-the-scenes benefits that Sandy Bridge-EP offers over Westmere-EP, and condensed them into a list of what a non-CS student in a scientific field might have to consider:

- The improved core and µop cache on Sandy Bridge-EP should boost IPC noticeably in calculations that can take advantage of them, especially advanced trigonometric functions.
- The increase in L3 cache should reduce the number of trips out to main memory for values, although the improved memory bandwidth would also help in this regard.
- More cores are always welcome, and Turbo 2.0 would help with pre-release code testing, which often occurs in debug / single thread mode.
- An increase in memory limits would help various simulation scenarios, as well as make it easier to run VMs of different environments.
- The move up to PCIe 3.0 helps any GPGPU simulation that requires lots of memory transfers back and forth across the bus (matrix solving), as long as the GPU supports PCIe 3.0 (K10, K20X, FirePro, not Xeon Phi which uses PCIe 2.0).

Every scenario that an individual faces, whether in the office, the laboratory, or generic workplace #147, is going to be different; perhaps only slightly, but different nonetheless. We have to weigh up the pros and cons of the specific workload and make relative suggestions.

For the most part, any simulation with large parts that can be computed in parallel should be looking at GPUs, unless the threads are ‘dense’ (require lots of memory registers for the serial calculation) or are already optimized for SSE4/AVX. Double precision can also be a hurdle to GPU computing, but the NVIDIA GTX Titan makes the cost a lot more palatable on research grants. Lots of researchers will be dealing with Fortran code tens of thousands of lines long and 20 years old, meaning that porting to GPUs is not a reasonable option (unless you encourage the research supervisor to apply for a 3 year grant to convert the code). In these cases, make a note of how much memory the simulation needs. If it is under 2.5 MB, then load up on as many cores as you can get, as you will still be in L3 cache on the 20 MB L3 processors. For more than that, you will be dealing with accesses out to main memory, and unless you are comfortable dealing with NUMA based code and tools (which your Fortran probably is not geared for), a single fast processor is probably the best bet. Dual processor systems are best suited to MPI based Fortran, or to simulations that require more memory than a single processor can be equipped with.
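As a rough illustration, the rules of thumb in the paragraph above can be written down as a small decision function. This is only a sketch under stated assumptions: the `Workload` fields, the `recommend` function, and the order in which the rules are checked are my own framing rather than a formal procedure, and the 2.5 MB figure is read here as a per-thread share of the 20 MB L3 on an 8-core part.

```python
from dataclasses import dataclass

# Threshold quoted in the text; interpreted here (an assumption) as the
# per-thread share of a 20 MB L3 cache split across 8 cores.
L3_SHARE_MB = 2.5


@dataclass
class Workload:
    """Hypothetical description of a simulation; all field names are illustrative."""
    working_set_mb: float               # memory the simulation needs per thread
    highly_parallel: bool               # large parts can be computed in parallel
    dense_threads: bool                 # each thread needs lots of registers
    already_vectorized: bool            # already optimized for SSE4/AVX
    gpu_port_feasible: bool             # False for e.g. large legacy Fortran code
    uses_mpi: bool                      # code is MPI based
    numa_aware: bool                    # code and tools handle NUMA properly
    exceeds_single_socket_memory: bool  # needs more RAM than one CPU can address


def recommend(w: Workload) -> str:
    """Return a coarse hardware suggestion following the rules of thumb above."""
    # Parallel work goes to a GPU unless one of the stated exceptions applies.
    if (w.highly_parallel and w.gpu_port_feasible
            and not w.dense_threads and not w.already_vectorized):
        return "GPU"
    # MPI code, or more memory than one socket supports, favours two sockets.
    if w.uses_mpi or w.exceeds_single_socket_memory:
        return "dual processor system"
    # Small working sets stay in L3, so extra cores are worth having.
    if w.working_set_mb < L3_SHARE_MB:
        return "as many cores as possible"
    # Beyond L3, multiple sockets only make sense if the code is NUMA aware.
    if w.numa_aware:
        return "dual processor system"
    return "single fast processor"


legacy_fortran = Workload(
    working_set_mb=40.0, highly_parallel=True, dense_threads=False,
    already_vectorized=False, gpu_port_feasible=False, uses_mpi=False,
    numa_aware=False, exceeds_single_socket_memory=False,
)
print(recommend(legacy_fortran))
```

The legacy Fortran example lands on a single fast processor because its working set spills out of L3 and the code is not NUMA aware; flipping `uses_mpi` to `True` would move it to the dual processor suggestion instead.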

In terms of Westmere-EP vs. Sandy Bridge-EP for our benchmark suite, the relative numbers are:

Dual E5-2690 vs. Dual X5690
Price: +25% (before tax and additional seller markup)

| Benchmark | HT On | HT Off | Recommended Setup |
|---|---|---|---|
| 2D Explicit FD | +12.7% | +7.3% | GPU or Single Multicore CPU w/High Speed Memory |
| 3D Explicit FD | +7.7% | -10.3% | GPU or Single Multicore CPU w/High Speed Memory |
| 2D Implicit | +25.6% | +9.9% | Single CPU, High Mem Bandwidth |
| Brownian Motion, Single Thread | +2.4% | +2.8% | High Single CPU Speed |
| Brownian Motion, Multi Thread | +31.8% | +23.4% | GPU |
| n-Body | +29.0% | +47.7% | GPU |
| WinRar | +27.4% | +3.4% | High Mem Bandwidth |
| FastStone | +6.5% | +3.2% | High Single CPU Speed |
| Xilisoft Video | +14.3% | +24.4% | GPU or |
| x264 Pass 1 | -9.0% | +3.4% | Single CPU |
| x264 Pass 2 | +27% | +24.3% | Multi-CPU |

While we do not get a price-equivalent speed-up across the board, certain scenarios (Xilisoft, x264 Pass 2) benefit greatly from a dual processor Sandy Bridge-EP system over either Westmere-EP or GPU. Sometimes a GPU is not available, in which case the multi-threaded Brownian Motion benchmark gains a great deal from more cores. A limiting factor in many of these benchmarks is memory speed: if you do not need a Xeon, then the latest Intel/AMD processors can handle 2133+ MHz memory, which provides a tangible boost in the finite difference simulations and WinRar.
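To put a number on "price-equivalent", one option is to divide each relative speed-up by the relative price. The sketch below does this for the figures in the table above; the `perf_per_price` helper and the idea of using 1.0 as the break-even point are my own illustrative choices, not a metric used elsewhere in this review.

```python
PRICE_INCREASE = 0.25  # dual E5-2690 vs. dual X5690, from the table above

# Percentage speed-ups (HT On, HT Off), copied from the table above.
SPEEDUPS = {
    "2D Explicit FD": (12.7, 7.3),
    "3D Explicit FD": (7.7, -10.3),
    "2D Implicit": (25.6, 9.9),
    "Brownian Motion, Single Thread": (2.4, 2.8),
    "Brownian Motion, Multi Thread": (31.8, 23.4),
    "n-Body": (29.0, 47.7),
    "WinRar": (27.4, 3.4),
    "FastStone": (6.5, 3.2),
    "Xilisoft Video": (14.3, 24.4),
    "x264 Pass 1": (-9.0, 3.4),
    "x264 Pass 2": (27.0, 24.3),
}


def perf_per_price(speedup_pct: float, price_increase: float = PRICE_INCREASE) -> float:
    """Relative performance divided by relative price.

    A value above 1.0 means the speed-up outpaces the extra cost;
    below 1.0 means you pay proportionally more than you gain.
    """
    return (1 + speedup_pct / 100) / (1 + price_increase)


print(f"{'Benchmark':32s} {'HT On':>6s} {'HT Off':>7s}")
for name, (ht_on, ht_off) in SPEEDUPS.items():
    print(f"{name:32s} {perf_per_price(ht_on):6.2f} {perf_per_price(ht_off):7.2f}")
```

This only compares the two platforms against each other; it says nothing about whether a GPU or a single fast CPU would be the better buy for the same workload.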

If we come back to the original question, ‘Is moving from Westmere-EP to Sandy Bridge-EP a reasonable upgrade?’, in the majority of our scenarios it probably is not: either alternatives exist that perform better (single CPU, GPU, memory bandwidth) or the price difference is not worth the jump. Remember that most scenarios will have to absorb the whole cost of a new system, rather than just the cost of an upgrade, and factoring that into the cost/benefit analysis is a major part of the equation. That said, none of our scenarios need more than 96 GB of memory, PCIe 3.0, VMs for different environments, or advanced processor instruction sets, any of which could be vital to your work.

Ivy Bridge-EP is slated for the end of the year, meaning that those on Westmere-EP may want to wait and see what comes out from Intel. If you need a DP system now, then Sandy Bridge-EP is an obvious choice if you want to go down the Intel route, though NUMA related code may benefit more from a quad AMD system. If we get one in for another comparison point, we will let you know.

A final note of thanks to the Gigabyte server team for loaning us the CPUs and motherboard that made this testing possible.


  • SatishJ - Monday, March 04, 2013

    It would be only fair to compare the X5690 with the E5-2667. I suspect in this case the performance difference would not be earth-shattering. No doubt the E5-2690 excels, but then it has the advantage of more cores / threads.
  • wiyosaya - Monday, March 04, 2013

    There is a possible path forward for those dealing with "old" FORTRAN code. CUDA FORTRAN -

    I would expect some conversion issues; however, I would also expect them to be fewer than when converting to C++ or some other CUDA/OpenCL compliant language.

    As much as some of us might like it to be, FORTRAN is not dead, yet!
  • mayankleoboy1 - Monday, March 04, 2013

    1. Why not use 7Zip for the compression benchmark? Most HPC people would like to use FREE, highly threaded software for their work.

    2. Using a 3770K @ 5.4 GHz as a comparison point is foolish. Any Ivy Bridge processor above ~4.6 GHz on air is unrealistic. And for HPC, nobody will use an overclocked system.
  • Senti - Monday, March 04, 2013

    WinRar is interesting because it's very sensitive to the memory subsystem (7zip is less so), but 3.93 is absolutely useless as it utilizes about half of my CPU time, and as a result it turns into a power-saving impact benchmark before anything else. AT promised to upgrade sometime this year, but until then we'll continue to have one less useful benchmark.

    Not overclocking your cpu when you have good cooling is plain waste of resources. Of course I mean not extreme overclocks, but permanent maximum "turbo" frequency should be your minimum goal.
  • SetiroN - Monday, March 04, 2013

    Sorry but...

    -So what?
    Being "sensitive to memory" doesn't make a worse benchmark better, or free;

    -So what?
    Nobody will ever have good enough cooling to be able to compute daily at 5.4, which is FAR above max turbo anyway. Overclocked results are welcome, provided that I don't need an additional $500 phase change cooler and $100+ in monthly bills.
  • tynopik - Monday, March 04, 2013

    > Being "sensitive to memory" doesn't make a worse benchmark better, or free;

    It makes it better if your software is also sensitive to memory speed

    different benchmarks that measure different aspects of performance are a GOOD thing
  • Death666Angel - Monday, March 04, 2013

    I see the OC CPU as a data point for his statement that some workloads don't require multi socket CPU systems but rather a single high IPC CPU. It may or may not be unrealistic for the target demographic, but it does add a data point for or against such a thing.
  • IanCutress - Tuesday, March 05, 2013

    1. WinRar 3.93 has been part of my benchmark suite for the past 18 months (so lots of comparison numbers can be scrutinized), and is the one I personally use :) We should be updating to 4.2 for Haswell, though going back and testing the last few years of chipsets and various processors takes some time.

    2. My inhouse retail CPU does 4.9 GHz on air, easy :) But most of the OC numbers are courtesy of several HWBot overclockers at who volunteered to help as part of the testing. For them the bigger score the better, hence the overclocks on something other than ambient.

  • mayankleoboy1 - Monday, March 04, 2013

    How many real world workloads are using hand-coded AVX software ?
    How many use compiler optimized AVX software ?
    What is the perf difference between them?

    Not directly related to this article, but how much software is optimised for the FMA and BMI extensions in AMD Bulldozer/Piledriver?
  • Kevin G - Monday, March 04, 2013

    What is going on with the Explicit Finite Difference tests? The thing that stood out to me is the two results for the i7 3770K at 4.5 GHz, with memory speed being the differentiating factor. Going from 2000 MHz to 2600 MHz effective speed on the memory increased performance by ~13% in the 2D tests and ~6% in the 3D tests. Another thing worth pointing out is that the divider in Ivy Bridge has higher throughput than the one in Sandy Bridge. This would account for some of the exceedingly high performance of the desktop Ivy Bridge systems if the algorithms make heavy use of division. The dual socket systems likely need some tuning with regards to their memory systems. The results of the dual socket systems are embarrassing in comparison to their 'lesser' single socket brethren.

    The implicit 2D test is similarly odd. The oddball result is the Core i7 3820 @ 4.2 GHz against the Ivy Bridge based Core i7 3770K @ stock (3.5 GHz). Despite the higher clock speed and extra memory channels, the consumer Sandy Bridge-E system loses! This is with the same number of cores and threads running. Just how much division are these algorithms using? That is the only thing that I can fathom to explain these differences. Multi-socket configurations are similarly nerfed with the implicit 2D test as they are with the explicit 2D test.

    Did the Brownian Motion simulations take advantage of Ivy Bridge's hardware random number generator? Looking at the results, signs are pointing toward 'no'.

    I'm a bit nitpicky about the usage of the word 'element' describing the n-Body simulation with regards to gravity. The usage of element and particle are not technically incorrect but lead the reader to think that these simulations are done with data regarding the microscopic scales, not stellar.

    The Xilisoft Video Converter test results seem to be erroneous. More than doubling the speed by enabling Hyperthreading? How is that even possible? The best case for Hyperthreading is that half of the CPU execution resources are free so that another thread can utilize them and get double the throughput. HT rarely gets near twice as fast, but these results imply five times faster, which is outside the realm of possibility with everything else being equal. Scaling between the Core i7-3960X and the dual E5-2690 HT Off result also looks off, given how the results between other platforms look.
