Is Sandy Bridge-EP an Upgrade Path?

At the beginning of this review, I referred back to Johan’s article on the behind the scenes benefits that Sandy Bridge-EP offers over Westmere-EP, and condensed them into a list for what a non-CS student in a scientific field might have to consider:

- The improved core and µop cache on Sandy Bridge-EP should provide a healthy IPC boost for calculations that can take advantage of them, especially those leaning on advanced trigonometric functions.
- The increase in L3 cache would reduce the number of trips out to main memory for values, although the improved memory bandwidth would also help in this regard.
- More cores are always welcome – Turbo 2.0 would help with pre-release code testing, which often occurs in debug / single thread mode.
- An increase in memory limits would help various simulation scenarios, as well as aid in running VMs of different environments.
- The move up to PCIe 3.0 helps any GPGPU simulation that requires lots of memory transfers back and forth across the bus (matrix solving), as long as the GPU supports PCIe 3.0 (K10, K20X, FirePro, but not Xeon Phi, which uses PCIe 2.0); a rough transfer-time sketch follows this list.
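
As a rough, back-of-the-envelope illustration of that last point, the sketch below compares one-way copy times for a dense double precision matrix over PCIe 2.0 and 3.0 x16 links. The matrix size and the ~8 GB/s and ~16 GB/s bandwidth figures are assumptions for illustration only; sustained throughput in practice is lower.

```cpp
// Rough estimate of PCIe copy time for a dense double precision matrix.
// Bandwidth figures are nominal x16 numbers, assumed for illustration;
// real-world sustained throughput is lower.
#include <cstdio>

int main() {
    const double pcie2_bps = 8.0e9;    // ~8 GB/s nominal, PCIe 2.0 x16 (assumption)
    const double pcie3_bps = 16.0e9;   // ~16 GB/s nominal, PCIe 3.0 x16 (assumption)

    const long long n = 20000;              // hypothetical 20k x 20k matrix
    const double bytes = 8.0 * n * n;       // double precision elements

    printf("Matrix size: %.2f GB\n", bytes / 1e9);
    printf("PCIe 2.0 x16: ~%.2f s per one-way copy\n", bytes / pcie2_bps);
    printf("PCIe 3.0 x16: ~%.2f s per one-way copy\n", bytes / pcie3_bps);
    return 0;
}
```

For a solver that streams matrices across the bus every iteration, halving that copy time is where the PCIe 3.0 advantage shows up.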

Every scenario that an individual faces – in the office, the laboratory, or generic workplace #147 – is going to be different; perhaps only slightly, but different nonetheless. We have to weigh up the pros and cons of the specific workload and make suggestions accordingly.

For the most part, any simulation with large parts that can be computed in parallel should be looking at GPUs, unless the threads are 'dense' (each requiring a lot of registers or memory for its serial calculation) or the code is already optimized for SSE4/AVX.  Double precision can also be a hurdle to GPU computing, but the NVIDIA GTX Titan makes the cost a lot more palatable on research grants.  Many researchers will be dealing with Fortran code tens of thousands of lines long and 20 years old, meaning that porting to GPUs is not a realistic proposition (unless you encourage the research supervisor to apply for a three year grant to convert the code).  In these cases, make a note of how much memory the simulation needs – if it is under ~2.5 MB per thread, then load up on as many cores as you can get, as you will still be inside the L3 cache on the 20 MB L3 processors (20 MB shared across eight cores works out to 2.5 MB per core); a rough estimate is sketched below.  For anything larger, you will be dealing with accesses out to main memory, and unless you are comfortable with NUMA aware code and tools (which your Fortran probably is not geared for), a single fast processor is probably the best bet.  MPI based Fortran is where dual processor systems would be best, or simulations that require more memory than a single processor can have installed.
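
A minimal sketch of that working-set check, using assumed grid dimensions and array counts (the 20 MB / eight core figures match the E5-2690; everything else is illustrative):

```cpp
// Minimal sketch: does a 2D finite difference working set fit in the shared
// 20 MB L3 of an E5-2690? Grid size and array count are assumptions.
#include <cstdio>

int main() {
    const double l3_bytes   = 20.0 * 1024 * 1024;  // 20 MB shared L3
    const int    cores      = 8;                    // one thread per core
    const double per_thread = l3_bytes / cores;     // ~2.5 MB per thread

    const long long nx = 256, ny = 256;  // assumed grid dimensions per thread
    const int arrays = 2;                // e.g. current + previous timestep
    const double working_set = 8.0 * nx * ny * arrays;  // bytes, double precision

    printf("Per-thread share of L3: %.0f KB\n", per_thread / 1024.0);
    printf("Working set per thread: %.0f KB\n", working_set / 1024.0);
    printf("%s\n", working_set < per_thread
                       ? "Likely L3 resident - favour more cores"
                       : "Spills to main memory - favour memory bandwidth or a single fast CPU");
    return 0;
}
```

The same arithmetic applies to any stencil code: count the arrays touched per timestep, multiply by grid points and eight bytes, and compare against each thread's share of L3.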

In terms of Westmere-EP vs. Sandy Bridge-EP for our benchmark suite, the relative numbers are:

Dual E5-2690 vs. Dual X5690
Price +25% (before tax and additional seller markup)

| Benchmark | HT On | HT Off | Recommended Setup |
|---|---|---|---|
| 2D Explicit FD | +12.7% | +7.3% | GPU, or single multicore CPU w/ high speed memory |
| 3D Explicit FD | +7.7% | -10.3% | GPU, or single multicore CPU w/ high speed memory |
| 2D Implicit | +25.6% | +9.9% | Single CPU, high memory bandwidth |
| Brownian Motion, single thread | +2.4% | +2.8% | High single CPU speed |
| Brownian Motion, multi thread | +31.8% | +23.4% | GPU |
| n-Body | +29.0% | +47.7% | GPU |
| WinRar | +27.4% | +3.4% | High memory bandwidth |
| FastStone | +6.5% | +3.2% | High single CPU speed |
| Xilisoft Video | +14.3% | +24.4% | GPU or multi-CPU |
| x264 Pass 1 | -9.0% | +3.4% | Single CPU |
| x264 Pass 2 | +27.0% | +24.3% | Multi-CPU |

While we do not get a price-equivalent speed up across the board, certain scenarios (Xilisoft, x264 Pass 2) benefit greatly from a dual processor Sandy Bridge-EP system over either Westmere-EP or a GPU.  When a GPU is not available, the multithreaded Brownian Motion benchmark in particular scales superbly with the extra cores.  A limiting factor in many of these benchmarks is memory speed – if you do not need a Xeon, the latest Intel/AMD processors can handle 2133+ MHz memory, which provides a tangible boost in finite difference simulation and WinRar; a simple way to gauge what your memory subsystem delivers is sketched below.
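
A minimal STREAM-triad style sketch for measuring sustained memory bandwidth on a given box (the array size, OpenMP loop and compile flags are assumptions for illustration, not the benchmark code used in this review); build with something like `g++ -O2 -fopenmp`:

```cpp
// STREAM-triad style loop to gauge sustained memory bandwidth.
// Array size is an assumption; results are a rough guide only.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1ull << 25;   // 32M doubles per array (~256 MB each)
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double scalar = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long long i = 0; i < (long long)n; ++i)
        c[i] = a[i] + scalar * b[i];           // 2 reads + 1 write per element
    auto t1 = std::chrono::steady_clock::now();

    const double secs   = std::chrono::duration<double>(t1 - t0).count();
    const double gbytes = 3.0 * sizeof(double) * n / 1e9;
    printf("Effective bandwidth: ~%.1f GB/s\n", gbytes / secs);
    return 0;
}
```

Running it on the same CPU at different memory speeds gives a feel for how much headroom a memory bound finite difference code has to gain from 2133+ MHz DIMMs.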

If we come back to the original question, 'Is moving from Westmere-EP to Sandy Bridge-EP a reasonable upgrade?', in the majority of our scenarios it probably is not – either other alternatives perform better (a single CPU, a GPU, or faster memory), or the price difference is not worth the jump.  Remember that most scenarios will have to absorb the whole cost of a new system rather than the cost of an upgrade, and factoring that into the cost/benefit analysis is a major part of the equation.  But none of our scenarios need more than 96 GB of memory, PCIe 3.0, VMs for different environments, or advanced processor instruction sets – features which could be vital to your work.

Ivy Bridge-EP is slated for the end of the year, meaning that those on Westmere-EP may well want to wait and see what Intel brings out.  If you need a DP system now, then Sandy Bridge-EP is an obvious choice if you want to go down the Intel route, though NUMA aware code may benefit more from a quad AMD system.  If we get one in for another comparison point, we will let you know.

A final note of thanks to the Gigabyte server team for loaning us the CPUs and motherboard that made this testing possible.

Comments
  • alpha754293 - Tuesday, March 5, 2013

    Sorry, I'm back. Where was I? oh yes...

    Unless you were purely running single-threaded, single process jobs (or maybe even lightly multithreaded, single process jobs), I would think that saying it favours a single-CPU system might be a little bit misleading.

    Even with single-socket systems, if it's got multiple cores, then you can parallelize amongst those as well.

    Some commercial programs favour 2^n cores as well, which would make quite a difference between having 8 or 16 cores vs. 6 or 12 (because some programs won't even run properly if the count isn't 2^n).

    Also it was interesting to see that you didn't run the implicit 3D grid solver benchmark.

    Actually, MPI might not have as much to do with memory as you might think. Considering that the world's top supercomputers haven't maxed out the memory capacity per socket, I doubt that. It IS, however, much better at the actual parallelization of the task than OpenMP.

    "‘Is moving from Westmere-EP to Sandy Bridge-EP a reasonable upgrade’, in the majority of our scenarios it probably is not"

    It really depends. If you're writing your own code (which is what you're doing), and you have a lot of control over it, then that MIGHT be a true statement. It also depends on the state of your code: if you're almost always in a permanent alpha phase (because you keep adding new capability and modules to it), then chances are you might not even get around to parallelizing it, because you want to make sure that the base solver is robust first before you add the additional complexity of parallelization on top of that.

    But if, say, you're doing research on crash and crash safety, and you're using a commercial code – some of those just favour more cores, period (see Johan's latest benchmark on the Opteron for details).

    And as to whether or not you can run it on a GPU: the problem is that you have to make sure that every system in your working/research group has equally capable GPU hardware; otherwise, those that don't can't even run it, and those with lesser-capable GPUs might not get as much benefit from the GPU as you think, if any.

    (My GTX 660 OC's double precision performance is actually slightly SLOWER than the double precision performance of my 3930K OC'd to 4.5 GHz.)
  • alpha754293 - Tuesday, March 5, 2013

    Also, as far as I know/can remember - not everything can make use of AVX - both in terms of programs and also in terms of fundamental math operations.

    And I would suspect that you might also have slight performance variations if you were to recompile on the Sandy Bridge vs. on the Westmere-EP platforms (rather than sharing the binaries between the two - unless you purposely don't make it target specific).
  • wingless - Monday, March 4, 2013

    Somebody on my folding team is building this setup with dual Titans as a folding/gaming rig. The ultimate in computation, gaming, and space heating!
  • yougotkicked - Monday, March 4, 2013

    I just wanted to say I found this analysis rather interesting. I'm an undergrad CS major, but I work in the IT office for the chemical engineering and materials science dept. at a research university, so this breakdown of the relationship between simulations and hardware was really fascinating for me.

    Just to give some perspective on the pricing of an OEM-built system using E5-26XX parts, one of the research groups I work with recently bought a dual E5-2687W system from HP with 128 GB of RAM, liquid cooling, and a mid-range workstation GPU; the whole system came in at over $10,000. Admittedly this includes a ~$1000 monitor and 4 hard drives, but this is probably at least a 30% margin over the cost of the hardware, so the 10% margin used in the article may be on the conservative side.

    P.S. we didn't suggest that system to the researchers, they bought it on their own.
  • colonelpepper - Monday, March 4, 2013

    HP & Dell systems are much more expensive than the same system built by a "system integrator" <-- I believe that's what they're called

    I've read in other forums that system integrators building you a custom system add about 10% to the price tag.
  • yougotkicked - Monday, March 4, 2013

    That sounds reasonable, my only gripe would be that many researchers would not seek out a system integrator and just turn to a big name like HP. Had they sought the advice of the IT office I would have suggested we build it in-house for no markup at all.
  • Kevin G - Monday, March 4, 2013

    Dell and HP's workstation lines carry a much higher premium than what you'd get DIY. The difference is in their warranties. Since dual socket workstations are effectively low end servers in a tower chassis, they'll offer warranties very similar to what you can get for a 24/7/365 running server. While I'm not a big fan of Dell's hardware, I will say that they do follow through on their warranties. I've seen them get a replacement hard drive to my facility in under 4 hours as that is the level of support I was paying for. It wasn't cheap but it was worth it considering the business need.

    With the scientific slant, such warranties may turn out to be overkill, as may going with an OEM at all. You'd still want the necessary data protections in place, like ECC memory, redundant storage and a good backup policy while the simulation is running. However, what is the worst case that could happen when something does go wrong with the basic protections in place? Generally it is simply running the simulation again. Time is money and there are often deadlines to meet. So if the simulation can't realistically be run again, or it'd cost too much to run again, then going with an OEM that'll help achieve greater uptime is worth it.

    As for the price of some of these components individually, I'm about to drop ~$1000 USD on a 128 GB memory upgrade. OEMs like Dell and HP get such parts far cheaper than DIY users due to bulk purchasing power. Their margin is far higher than 10% in terms of raw hardware costs.
  • IanCutress - Tuesday, March 5, 2013

    The dual E5520 systems from Dell (with 4GB RAM because research department limited us to XP 32-bit) I used for research, with basic storage and a monitor each, ran up to £2k per order back in 2009. After a month of waiting to be delivered (after tons of initial hassle with the department IT guy), it turns out the systems arrived shortly after ordering and our IT guy had decided to hide them in a different building on campus and 'forgot to tell us'. The monitors were in a building the other side of campus. Fun fun fun! Needless to say, we were all rather annoyed. But looking back, I should have just asked for a single powerful Xeon workstation.
  • yougotkicked - Tuesday, March 5, 2013

    Yikes, sounds like your IT guy needs to get his act together. We'd never get away with that kind of stuff here.

    My university has a few super-computing clusters available for researchers running truly large simulations, but because of that many groups choose not to buy systems powerful enough to run their mid-sized simulations and the clusters are usually booked a while in advance. The HP system was purchased primarily because the group was sick of waiting for their simulations to get a turn on the supercomputers.

    If only there were a folding@home style client that researchers could easily program for, we could turn our computer labs into compute clusters at night.
  • colonelpepper - Monday, March 4, 2013

    This would be a very poor time to spend thousands of dollars on high-end 2600 series CPUs!

    The Xeon 2600 series is getting a refresh soon; better to wait and get more CPU for your buck... unless you're dying to spend big $$ now, that is:

    http://2.bp.blogspot.com/-zhrS1C8wbk0/UHj_HxsrMTI/...

    That link is to the largest image I could find of Intel's Xeon roadmap that leaked late last year.
