Test Setup

Alongside the X5690 CPUs we are using for this review, the Gigabyte server team was on hand to offer one of their dual-processor LGA 1366 server motherboards – the GA-7TESM.  The 7TESM was released back in September 2011, featuring support for 55xx/56xx Xeons and up to 18 DIMMs of registered or unbuffered DDR3 memory – for up to 288GB at 1333 MHz with Netlist HyperCloud modules.  Alongside four Intel GbE network ports (82576EB + 2x 82574L) and a management port, we get six SATA 3 Gbps ports from the chipset and eight SAS 6 Gbps ports from an LSI SAS2008 chip (via SFF-8087), both supporting RAID 0/1/5/10.  Onboard video comes from a Matrox G200e, and the system provides a PCIe 2.0 x16 slot, an x8, an x4, and a PCI slot.  Many thanks to Gigabyte for making the review possible!

Many thanks also to...

We must thank the following companies for kindly providing hardware for our test bed:

Thank you to OCZ for providing us with the 1250W Gold Power Supply and SATA SSD.
Thank you to Kingston for providing us with the ECC Memory.

Test Setup
Processor: 2x Intel Xeon X5690 (6 cores / 12 threads each, 3.47 GHz, 3.73 GHz Turbo)
Motherboard: Gigabyte GA-7TESM
Cooling: Intel Thermal Solution STS100C
Power Supply: OCZ ZX Series 1250W Gold
Memory: Kingston 1600 C11 ECC 8x4GB kit
Memory Settings: 1333 C9
Hard Drive: Kingston 120GB HyperX
Optical Drive: LG GH22NS50
Case: Open Test Bed
Operating System: Windows 7 64-bit

As with our last test with the E5-2600 CPUs, we are using Windows 7 64-bit.  The reason behind this is simple – in the research environment I was in, we never updated operating systems beyond security updates.  IT staff required everyone in the building who wanted network access to use an approved OS image, and the only option was Windows XP.  For this review I got in contact with a colleague to see if this is still the case, and it is – Windows XP 32-bit across the whole department at the university.

Power Consumption

Power consumption was tested on the system as a whole with a wall meter connected to the OCZ 1250W power supply, while in a single 7970 GPU configuration.  This power supply is Gold rated, and as I am in the UK on a 230-240 V supply, that works out to ~75% efficiency under 50W loads and 90%+ efficiency at 250W, which is suitable for both idle and multi-GPU loading.  This method of power reading allows us to compare how the UEFI and the board manage power delivery to components under load, and it includes typical PSU losses due to efficiency.  These are the real-world values that consumers may expect from a typical system (minus the monitor) using this motherboard.
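As a quick illustration of how a wall reading relates to what the components actually draw, here is a minimal sketch; the 90% at ~250W point is quoted above, but the full ZX Series efficiency curve is not reproduced here, so the numbers are indicative only.

    #include <cstdio>

    // Rough sketch: a wall (AC) reading includes PSU losses, so the power the
    // components actually draw is roughly the wall figure times the efficiency
    // at that load point. The 90% at ~250W figure is from the text above; the
    // rest of the ZX Series efficiency curve is not reproduced here.
    double estimated_component_draw(double wall_watts, double efficiency) {
        return wall_watts * efficiency;
    }

    int main() {
        // e.g. a 250W wall reading at ~90% efficiency -> roughly 225W of DC draw
        std::printf("~%.0f W drawn by the components\n",
                    estimated_component_draw(250.0, 0.90));
        return 0;
    }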

While this method of power measurement may not be ideal, and you may feel these numbers are not representative due to the high-wattage power supply being used (we use the same PSU to remain consistent over a series of reviews, and some boards on our test bed get tested with three or four high-powered GPUs), the important point to take away is the relationship between the numbers.  These boards are all tested under the same conditions, and thus the differences between them should be easy to spot.

Power Consumption - One 7970 @ 1250W Gold

For the workstation theorist in a research group, power consumption is often the last thing on their mind – as long as the system computes in a decent time, everything is golden.  In a commercial situation, where the code works and throughput is everything, power does matter.  The Sandy Bridge-EP system used 26.3% more power under CPU load than our Westmere-EP system did, in line with the pricing of the CPU itself.

DPC Latency

Deferred Procedure Call (DPC) latency reflects the way Windows handles interrupt servicing.  Rather than servicing every interrupt immediately, the system queues interrupt requests by priority: critical interrupts are handled as soon as possible, while lower-priority requests, such as audio, are pushed further down the queue.  So if the audio device requires data, it has to wait until its request is processed before the buffer is filled.  If the device drivers of higher-priority components in a system are poorly implemented, this can cause delays in request scheduling and processing time, resulting in an empty audio buffer – which leads to the characteristic audible pauses, pops and clicks.  A bigger buffer and correctly implemented system drivers obviously help in this regard.  The DPC Latency Checker measures how much time is spent processing DPCs from driver invocation – the lower the value, the better the audio transfer at smaller buffer sizes.  Results are measured in microseconds and taken as the peak latency while cycling through a series of short HD videos; under 500 microseconds usually gets the green light, but the lower the better.
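To put the 500 microsecond guideline in context, here is a minimal sketch of the buffer arithmetic involved; the 48 kHz sample rate and 512-sample buffer are assumptions for illustration rather than figures from our test.

    #include <cstdio>

    // Sketch: how long an audio buffer of N samples lasts at a given sample
    // rate, versus the peak DPC latency measured by the checker. If peak
    // latency approaches the buffer duration, the buffer can run dry before
    // it is refilled, producing the pops and clicks described above.
    int main() {
        const double sample_rate_hz   = 48000.0; // assumed; not from the article
        const double buffer_samples   = 512.0;   // a typical small buffer size
        const double peak_dpc_latency = 500.0;   // microseconds, the usual "green light" threshold

        const double buffer_duration_us = buffer_samples / sample_rate_hz * 1e6; // ~10667 us
        std::printf("Buffer lasts %.0f us; peak DPC latency %.0f us (headroom %.0f us)\n",
                    buffer_duration_us, peak_dpc_latency,
                    buffer_duration_us - peak_dpc_latency);
        return 0;
    }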

DPC Latency Maximum

For whatever reason, DPC latency on the X5690 system is poor.  This is more indicative of the motherboard than of the CPUs, which should easily handle DPC requests.  It is highly doubtful that time-sensitive audio work would be carried out on a system like this, but almost any non-Xeon product would outperform our setup here.

Comments

  • jamyryals - Monday, March 4, 2013 - link

    Element is an acceptable term in this case. Anyone confusing a finite element with a chemical element would do well to read up on these types of mathematical models anyways.

    Your other points are well made, and highlight the difficulty in creating meaningful benchmarks.
  • Kevin G - Monday, March 4, 2013 - link

    I agree that the usage of the word element is technically correct. The thing that threw me off more was its usage in conjunction with particle. When I read that paragraph I had to do a double take to get the proper context. My issue here is more a small editorial quibble than a technical issue. :)
  • IanCutress - Tuesday, March 5, 2013 - link

    A majority of the results in the graphs (essentially all the overclocked ones) were on systems out of my control - several users from the Overclock.net HWBot team helped on that one and offered me insight into their setups. Unfortunately I do not have access to a vast array of sockets and systems for comparison.

    The implicit calculations have a fair few division operations per loop, as noted in the previous article where I posted the code (http://www.anandtech.com/show/6533/8) - for each timestep there are >2 divisions per node calculation (a rough sketch of this kind of update follows at the end of this comment). Technically the non-CS scientist might not know what is inside the silicon regarding Ivy's better divisor.

    Don't forget the whole point of a review of something like this was to look at the scenario I was in. We went and ordered dual Nehalem systems (E5520s) just because of all the threads. Looking back on it now, I wish we had stuck to single processor systems based on the code we were writing.

    Regarding the built-in Ivy PRNG, as noted in the previous review, the code wasn't hand written for each processor. It was written once and applied over. We didn't get extra time or money to find the best way to simulate something, we just had to simulate.

    Regarding element and particle, I almost use them synonymously in the text. I like to use 'element' to describe the motion of one point in the simulation, but my Chemistry supervisor thought I was being an idiot when we were dealing with chemicals, despite my pleas that element was a CS term. He preferred the term particle as a mid-way point between the two (and also not to confuse the chemistry people reading our papers) and mentally I have equated the two, which is not always the best thing.

    For XVC, I'm not sure why there is such a difference. With HT on, we have 24 threads to do 33 videos, which is one batch of 24 then another of 9 (put your turbos in where appropriate). Without HT, we're slightly faster per core (if we're lucky, or not at all if we're not), but we have batches of 12, 12 and then 9. Again, apply turbos where appropriate. That's just how the program runs - it decides whether it wants to commit one thread per video, or multiple threads per video. If it is encoding more videos than half the available threads, it does one thread per video - if there are enough threads that each video can get two, it applies two. So the set of 9 videos when HT is on probably gets two threads per video, rather than one thread per video for the 9 videos when HT is off.

    Ian
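    As a rough illustration of the ">2 divisions per node calculation" point above, here is a minimal sketch of a Jacobi-style implicit grid update in which each node performs several divisions per timestep. It is not the benchmark code linked above; the grid size, coefficients and timestep are arbitrary.

        #include <cstdio>
        #include <vector>

        // Illustrative only -- not the actual benchmark code. A single
        // Jacobi-style sweep per timestep for a 2D implicit diffusion update,
        // written so that each node calculation performs several divisions.
        int main() {
            const int nx = 128, ny = 128;
            const double dx = 1.0 / nx, dy = 1.0 / ny, alpha = 0.1, dt = 1e-4;
            std::vector<double> u(nx * ny, 0.0), u_new(nx * ny, 0.0);
            u[(ny / 2) * nx + nx / 2] = 1.0;  // a single hot node as an initial condition

            for (int step = 0; step < 100; ++step) {
                for (int j = 1; j < ny - 1; ++j) {
                    for (int i = 1; i < nx - 1; ++i) {
                        const int n = j * nx + i;
                        // more than two divisions per node, per timestep
                        const double sx = (u[n - 1] + u[n + 1]) / (dx * dx);
                        const double sy = (u[n - nx] + u[n + nx]) / (dy * dy);
                        const double diag = 1.0 + 2.0 * alpha * dt / (dx * dx)
                                                + 2.0 * alpha * dt / (dy * dy);
                        u_new[n] = (u[n] + alpha * dt * (sx + sy)) / diag;
                    }
                }
                u.swap(u_new);
            }
            std::printf("centre value after 100 steps: %g\n", u[(ny / 2) * nx + nx / 2]);
            return 0;
        }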
  • Kevin G - Tuesday, March 5, 2013 - link

    The thing with Ivy Bridge's improved division unit is that it can explain some of the speed up. Glancing at the code, those operations don't seem to be common enough that they'd make such a noticeable impact. (The real test would be to compile, disassemble and then count the number of division instructions.) The other thing about Ivy Bridge's divisor is that its performance gains are 'free' in the sense that they don't require rewriting or recompiling code to take advantage of. It is an architectural tweak that benefits existing code.

    Upon release, Nehalem was a very good platform and it is still respectable today. I think the issue is that consumer systems have been catching up. Looking at the charts, the only consumer system that's roughly the same age as the E5520s is an overclocked Phenom II X4, and the dual socket Xeon showed an advantage there. The problem I'm seeing is that the code isn't scaling across multiple sockets and memory controllers very well. Solving that would put performance closer to expectations. If possible, I would suggest enabling memory mirroring across sockets to see if that solves some of the scaling issues. The code wouldn't have to be written to be NUMA aware, but usable memory in the system is halved.

    If the NUMA problem is not practical to solve, then going single socket makes sense. However, I would expand the discussion to include RAS. I would not recommend a single highly overclocked system to run scientific simulations, as the reliability simply isn't there. One way around that is to get two similarly configured systems, run the simulation twice, and compare the results for redundancy. With some of these heavily overclocked systems costing less than half the dual Xeon's price tag and running the code twice as fast, it is worth considering such a mirrored configuration. Other options to consider would be a single 8-core Xeon on socket 2011, or one of the quad-core Xeons on socket 1155, gaining ECC memory support to forgo the second system.

    The XVC results could see some improvements in queuing, but those benefits should be able to carry over to the non-HT results with a software tweak. (Most software like that can accept such tuning parameters, but I'm personally unfamiliar with XVC.) The results are falling outside the realm of reason. It is like saying you are cooling a gas until you realize you're at -20 kelvin. At that point you have to realize something is erroneous. At best HT can double performance, and the results are roughly five times faster. Turbo is a factor, but that would benefit the non-HT results more as utilization is lower (i.e. fewer transistors switching, less heat, more turbo boost).
  • toyotabedzrock - Monday, March 4, 2013 - link

    It looks like Intel forgot about HT on Sandy Bridge.
  • IanCutress - Tuesday, March 5, 2013 - link

    i5-2500K is a 4C/4T processor.

    Ian
  • TeXWiller - Monday, March 4, 2013 - link

    Ian, have you tried playing with the numa options of the boards?
  • IanCutress - Tuesday, March 5, 2013 - link

    NUMA was enabled in the BIOS, I made sure before I tested :) I also looked at various ways to keep the top turbo in force through all loading, but the limited BIOS options relating to clock speed on server boards are not up to scratch compared to consumer products (as you would expect).

    Ian
  • TeXWiller - Tuesday, March 5, 2013 - link

    I was thinking about the improved bandwidth between the processors in the E5 family. Some applications might prefer node-interleaved memory instead.
  • alpha754293 - Monday, March 4, 2013 - link

    re: OpenMP vs. MPI
    Multithreaded code using OpenMP is known to be quite a lot slower than proper MPI code. In the testing that I've done, the difference can be as much as 40%, because the OpenMP code just simply cannot keep the CPU/FPU units occupied long enough. I've never really dug in deep as to WHY that is (I'm sooo NOT a programmer), but as an end user, that's a HUGE difference. (An illustrative sketch of this kind of OpenMP loop follows at the end of this comment.)

    Secondly, it also depends on how you write your MPI code - some of them can be VERY efficient at using multicore/multiprocessors. It depends on the code, the nature and physics of the problem, and a whole bunch of other things. LS-DYNA, for example, scales VERY well with the number of processors and/or cores, and my research is showing about an 11-17% benefit with HTT enabled on a 3930K (I don't have 8-core Xeons to play with). :(

    Conversely, I've also seen some MPI codes that don't really parallelize quite as well. It SAYS that it's MPI, but it looks more like an OpenMP implementation for the parallelization.

    Part of it also depends on how much data dependency there is - does the information of one depend on the results or the information/data of another (either on spatial or temporal terms)?

    Third - I've had many arguments about this. A single-socket, multi-core processor is still a parallel multicore system. Yes, you don't have to deal with NUMA, but unless you have a LOT of traffic going between your two sockets (something which NO ONE has been able to tell me how to measure so far) - chances are that either OpenMP OR MPI can scale to a single multi-core processor, or to multiple multi-core processors. It shouldn't really care (unless you've hard-coded the domain decomposition and the number of "partitions" or "divisions" it makes for the parallelization).

    I think that the statement/comment you wrote about how some of the benchmarks or some types of simulations/processes favour a single-CPU setup isn't QUITE accurate, if only because your single-socket, multi-core CPUs were quite highly overclocked. (I've got my 3930K up to 4.5 GHz, and I just re-enabled C1E/EIST in order to cut my idle power consumption.)

    [brb...to be continued]
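    As a rough illustration of the OpenMP style of threading being compared with MPI above, here is a minimal sketch of a shared-memory parallel loop. It is not taken from any of the codes mentioned in this thread; an MPI version of the same loop would instead decompose the array across ranks and exchange results explicitly.

        #include <cstdio>
        #include <vector>
        #include <omp.h>

        // Minimal sketch of OpenMP-style parallelism: one shared-memory loop
        // split across threads with a pragma. Illustrative only; compile with
        // OpenMP support enabled (e.g. -fopenmp).
        int main() {
            const int n = 1 << 20;
            std::vector<double> a(n, 1.0), b(n, 2.0);
            double sum = 0.0;

            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < n; ++i) {
                a[i] = a[i] * 0.5 + b[i];  // some per-element work
                sum += a[i];
            }
            std::printf("max OpenMP threads: %d, sum = %f\n",
                        omp_get_max_threads(), sum);
            return 0;
        }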
