Intel Xeon E5-2697 v2 and Xeon E5-2687W v2 Review: 12 and 8 Cores

Author: Dr. Ian Cutress

by Ian Cutress on March 17, 2014 11:59 AM EST

Scientific and Synthetic Benchmarks

2D to 3D Rendering – Agisoft PhotoScan v1.0: link

Agisoft PhotoScan creates 3D models from 2D images, a process which is very computationally expensive. The algorithm is split into four distinct phases, and different phases of the model reconstruction require either fast memory, high IPC, more cores, or even OpenCL compute devices to hand. Agisoft supplied us with a special version of the software to script the process, where we take 50 images of a stately home and convert them into a medium quality model. This benchmark typically takes around 15-20 minutes on a high end PC on the CPU alone, with GPUs reducing the time.

Agisoft PhotoScan Benchmark - Total Time

For PhotoScan, the extra cores and MHz from the Xeons mean most in the first stage of the computation. The second stage shows an increase in CPU Mapping Speed, although this is the stage that a GPU can accelerate when one is in use. Stage 3 benefits more from the MHz of the 8-core model, and the final stage is about even.

Console Emulation – Dolphin Benchmark: link

At the start of 2014 I was emailed a link to a new emulation benchmark based on the Dolphin Emulator. The issue with emulators tends to be two-fold: game licensing and the raw CPU power required for the emulation. As a result, many emulators are bound by single thread CPU performance, and general reports tended to suggest that Haswell provided a significant boost to emulator performance. This benchmark runs a Wii program that raytraces a complex 3D scene inside the Dolphin Wii emulator. Performance on this benchmark is a good proxy for the speed of Dolphin CPU emulation, which is an intensive single core task using most aspects of a CPU. Results are given in minutes, where the Wii itself scores 17.53; because the score is a run time, anything below this is faster than an actual Wii at processing Wii code, albeit emulated.
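As a rough illustration of how a result can be read against that reference, a run time can be converted into a speed relative to the Wii by dividing 17.53 by the measured time, so that a ratio greater than 1 means faster than the console. The function name and the 10 minute sample time below are hypothetical and are not taken from the review's results.

```python
WII_MINUTES = 17.53  # run time of a real Wii, as quoted in the text


def speed_relative_to_wii(minutes: float) -> float:
    """Return how many times faster than a real Wii a run of `minutes` is."""
    if minutes <= 0:
        raise ValueError("run time must be positive")
    return WII_MINUTES / minutes


# Hypothetical example time, not a measured result.
ratio = speed_relative_to_wii(10.0)
print(f"{ratio:.2f}x the speed of a Wii")
```

A ratio below 1 would mean the emulated run is slower than the real hardware.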

Dolphin Benchmark

Emulation is a pure single threaded affair, and the IPC improvements of Haswell stand out a lot against the Ivy Bridge-E based Xeons.

Point Calculations – 3D Movement Algorithm Test: link

3DPM is a self-penned benchmark, taking basic 3D movement algorithms used in Brownian Motion simulations and testing them for speed. High floating point performance, MHz and IPC win in the single thread version, whereas the multithread version has to handle the threads and loves more cores.
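To make the single thread versus multithread distinction concrete, here is a minimal sketch of a random-walk particle kernel in Python. It is not the 3DPM source: the function names, the unit step length and the way directions are sampled are all assumptions made for illustration. Each particle's walk is independent of the others, which is why this kind of workload can be divided between cores.

```python
import math
import random


def move_particles(n_particles: int, n_steps: int, seed: int = 0) -> float:
    """Random-walk n_particles for n_steps unit-length steps each.

    Returns the mean distance from the origin after the final step.
    """
    rng = random.Random(seed)
    total_distance = 0.0
    for _ in range(n_particles):
        x = y = z = 0.0
        for _ in range(n_steps):
            # Uniform direction on the unit sphere: uniform height plus uniform angle.
            dz = rng.uniform(-1.0, 1.0)
            phi = rng.uniform(0.0, 2.0 * math.pi)
            r = math.sqrt(1.0 - dz * dz)
            x += r * math.cos(phi)
            y += r * math.sin(phi)
            z += dz
        total_distance += math.sqrt(x * x + y * y + z * z)
    return total_distance / n_particles


def split_particles(n_particles: int, n_workers: int) -> list[int]:
    """Divide particles as evenly as possible across workers."""
    base, extra = divmod(n_particles, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


if __name__ == "__main__":
    print(move_particles(n_particles=1000, n_steps=100))
    print(split_particles(1000, 12))
```

Each entry returned by `split_particles` could be handed to its own worker process, with the per-chunk results combined at the end; the single thread case is simply one call to `move_particles` with every particle.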

3D Particle Movement: Single Threaded

The low core frequency of the 12-core Xeon puts it behind in our FP single threaded benchmark.

3D Particle Movement: MultiThreaded

In our multithreaded scenario, we see a situation similar to PovRay, where cores and frequency take the top spots.

Encryption – TrueCrypt v7.1a: link

TrueCrypt is an off the shelf open source encryption tool for files and folders. For our test we run the benchmark mode using a 1GB buffer and take the mean result from AES encryption.

TrueCrypt 7.1a AES

Synthetic – 7-Zip 9.2: link

As an open source compression tool, 7-Zip is a popular tool for making sets of files easier to handle and transfer. The software offers up its own benchmark, from which we report the result.

7-Zip MIPS

71 Comments

vLsL2VnDmWjoTByaVLxb - Monday, March 17, 2014 - link
> TrueCrypt is an off the shelf open source encoding tool for files and folders.

Encoding?
Brutalizer - Monday, March 17, 2014 - link
I would not say these cpus are for high end market. High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets! These expensive Unix RISC servers or IBM Mainframes, have extremely good RAS. For instance, some Mainframes do every calculation in three cpus, and if one fails it will automatically shut down. Some SPARC cpus can replay instructions if something went wrong. Hotswap cpus, and hotswap RAM. etc etc. These low end Xeon cpus have nothing of that.

PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, cost $35 million. No typo. One single huge 32 socket server, cost $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category.

In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sport up to 262.000 cores and 100s of TB. These are the characteristics of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node. Many small computers in a cluster.

Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/
"The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time."
Kevin G - Tuesday, March 18, 2014 - link
@Brutalizer
And here we go again. ( http://anandtech.com/comments/7757/quad-ivy-brigde... )

“These low end Xeon cpus have nothing of that.”

This is actually accurate as the E5 is Intel’s midrange Xeon series. Intel has the E7 line for those who want more RAS or scalability to 8 sockets. Features like memory hot swap or lock step mirroring can be found in select high end Xeon systems. If you want ultra high end RAS, you can find it if you need it, as well as pay the premium price for it.

“In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sport up to 262.000 cores and 100s of TB. These are the characteristics of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node.”

Incorrect on several points but they’ve already been pointed out to you. The UV2000 is fully cache coherent (with up to 64 TB of memory) with a global address space that operates as one uniform, logical system, so only a single OS/hypervisor is necessary to boot and run it.

Secondly, the price of the UV2000 does not scale linearly. There are NUMALink switches that bridge the coherency domains that have to be purchased to scale to higher node counts. This is expected of how the architecture scales and is similar to other large scale systems from IBM and Oracle.

“Clusters are only used for HPC number crunching.”

Incorrect. Clustering is standard in what you define as SMP applications (big business ERP). It is utilized to increase RAS and prevent downtime. This is standard procedure in this market.

“SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems,”

Why? As long as the underlying architecture is the same, they can run. You may not get the same RAS or scale as high in a single logical system but they’ll work. Performance is where you’d expect it on these boxes: a dual socket HPC system will perform at roughly one quarter the speed of the same chips occupying an 8 socket system.

“as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/”

As pointed out numerous times before, that link you cite is a decade old. SGI has moved into the SMP space with the Altix UV series. Continuing to use this link as relevant is plain disingenuous and deceptive.

As for an example of a big ERP application running on such an architecture, the US Post Office runs Oracle Data Warehousing software on a UV1000. ( https://www.fbo.gov/index?s=opportunity&mode=f... )
Brutalizer - Tuesday, March 18, 2014 - link
Do you really think that UV (which is the successor to Altix) is that different? Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later. Windows will not be superior to Unix after some development. You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?

Altix is only for HPC number crunching, says SGI in my link. Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research.

In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu. Do you really think that is feasible? Does it sound reasonable to you? IBM and Oracle and HP has had great problems connecting 32 sockets to each other, just look at the connections on the last picture at the bottom, do you see all connections? Now imagine half a billion of them in a server!
http://www.theregister.co.uk/2013/08/28/oracle_spa...

But on the other hand, if you keep the number of connections down to islands, and then connect the islands to each other, you don't need half a billion. This solution would be feasible. And then you are not in SMP territory anymore: SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf
"...SMP is based on intra-node communication using memory shared by all cores. A cluster is made up of SMP compute nodes but each node cannot communicate with each other so scaling is limited to a single compute node...."

Don't you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why does the UV line of servers focus on OpenMPI accelerators? OpenMPI is never used in SMP workloads, only in HPC.

Do you have any benchmarks where one 32.768 cpu SGI UV2000 demolishes 50-100 of the largest Oracle SPARC M6-32 in business systems? And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000s of sockets is very very cheap?
Kevin G - Tuesday, March 18, 2014 - link
Wow, I think the script you're copy/pasting from needs better revision.

"Do you really think that UV (which is the successor to Altix) is that different?"

Yes. SGI changed the core architecture to add cache coherent links across the entire system. Clusters tend to have an API on top of a networking software stack to abstract the independent systems so they may act as one. The UV line does not need to do this. For one processor to use memory and perform calculations on data residing off of a CPU on the other end, a memory read operation is all that is needed on the UV. It is really that simple.

"Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later."

The UV can run any OS that runs on modern x86 hardware today. Windows, Linux, Solaris (Unix) and perhaps at some point NonStop (HP's mainframe OS http://h17007.www1.hp.com/us/en/enterprise/servers... ). The x86 platform has plenty of choices to choose from.

"You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?"

What I see SGI offering is another tool alongside IBM and Oracle systems. Also you mention decades of research, then it is also fair to put SGI into that category as that link you love to spam IS A DECADE OLD. Clearly SGI didn't have this technology back in 2004 when that interview was written.

"Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research."

Actually this is a bit incorrect. IBM can scale to 131,072 cores on POWER7 if the coherency requirement is forgiven. Oh, and this system can run either AIX or Linux when maxed out. Source: http://www.theregister.co.uk/Print/2009/11/27/ibm_...

"In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu.
http://www.theregister.co.uk/2013/08/28/oracle_spa..."

Wow, do you not read your own sources? Not only is your math horribly horribly wrong but the correct methodology for calculating the number of links as things scale is in the link you provided. To quote that link: "The Bixby interconnect does not establish everything-to-everything links at a socket level, so as you build progressively larger machines, it can take multiple hops to get from one node to another in the system. (This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips...)"
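For scale, the link count of a hypothetical everything-to-everything topology can be worked out directly. The sketch below assumes exactly one point-to-point link per pair of endpoints, which gives n(n-1)/2 links; the function name is made up for illustration, and the calculation says nothing about how Bixby or NUMAlink 6 are actually wired, since per the quote above they rely on multiple hops rather than direct links between every pair.

```python
def full_mesh_links(n: int) -> int:
    """Number of point-to-point links needed to connect every pair of n endpoints."""
    return n * (n - 1) // 2


for n in (8, 32, 32_768):
    print(n, full_mesh_links(n))
# 8 28
# 32 496
# 32768 536854528
```

The count grows roughly with the square of the endpoint count, which is why large systems avoid direct links between every pair.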

The full implication here is that if the UV 2000 is not a socket machine, then neither is Oracle's soon-to-be-released 96 socket device. The topology to scale is the same in both cases per your very own source.

"SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf"

Fundamentally false. If you were to actually *read* the source material for that quote, it is not describing the UV2000. Rather, it is speaking generically about the differences between a cluster and a large SMP box on page 4. If you go to page 19, it further describes the UV 2000 as a single system image unlike that of a cluster as defined on page 4.

"Don't you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why does the UV line of servers focus on OpenMPI accelerators? OpenMPI is never used in SMP workloads, only in HPC."

All I'd say about a 262,000 core server is that it wouldn't fit into a single box. Then again IBM, Oracle and HP are spreading their large servers across multiple chassis so this doesn't bother me at all. The important part is how all these boxes are connected. SGI uses NUMAlink6 which provides cache coherency and a global address space for a single system image. OpenMPI can be used inside of a cache coherent NUMA system as it provides a means to guarantee memory locality when data is used for execution. It is a means of increasing efficiency for applications that use it. However, OpenMPI libraries do not need to be installed for software to scale across all 256 sockets on the UV2000. It is purely an option for programmers to take advantage of.

"And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000s of sockets is very very cheap?"

First, to maintain coherency, the UV2000 only scales to 256 sockets/64 TB of memory. Second, the cost of a decked out P795 from IBM in terms of processors (8 sockets, 256 cores) and memory (2 TB) but only basic storage to boot the system is only $6.7 million wholesale. Still expensive but far less than what you're quoting. It'll require some math and reading comprehension to get to that figure but here is the source: http://www-01.ibm.com/common/ssi/ShowDoc.wss?docUR...

I couldn't find pricing for the UV2000 as a complete system but purchasing the Intel processors and memory separately to get to a 256 socket/64 TB system would be just under $2 million. Note that that figure is just processor + memory, no blade chassis, racks or interconnect to glue everything together. That would also be several million. So yes, the UV2000 does come out to be cheaper but not drastically. That IBM pricing document does highlight why their high end systems cost so much, mainly capacity on demand. The p795 is getting a mainframe like pricing structure where you purchase the hardware and then you have to activate it at an additional cost. Not so on the UV2000.
psyq321 - Tuesday, March 18, 2014 - link
Xeon 2697 v2 is not a "low end" Xeon.

It is part of "expandable server" platform (EP), being able to scale up to 24 cores.

That is far from "low end", at least in 2014.
alpha754293 - Wednesday, March 19, 2014 - link
"High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets!"

Partially true. The entire cabinet might have that many sockets/processors, but on a per-system, per-"box" level, most max out between two and four. You get a few oddballs here and there that would have a daughter board for a true 8-socket system, but those are EXTREMELY rare in actuality. (Tyan, I think, had one for the AMD Opterons, and they said that less than 5% of the orders were for the full-fledged 8-socket systems).

"PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, cost $35 million. No typo. One single huge 32 socket server, cost $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category."
Again, only partially true. The costs and stuff are correct, but the assumptions that you're writing about are incorrect. SMP is symmetric multiprocessing. BY DEFINITION, that means that it "involves a multiprocessor computer hardware and software architecture where two or more identical processors connect to a single, shared main memory, have full access to all I/O devices, and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes." (source: wiki) That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI).

Furthermore, the old TPC-C that you mention, they do NOT process as one monolithic sequential series of events in parallel (so think of like how PATA works...), but rather more like a JBOD SATA (i.e. the processing of the next transaction does NOT depend on ALL of the current block of transactions to be completed, UNLESS there is an inherent dependency issue, which I don't think would be very common in TPC-C). Like bank accounts, they're all treated as discrete and separate, independent entities, which means you can send all 150,000 accounts against the 32-socket or 64-socket system and it'll just pick up the next account when the current one is done, regardless.

The other failure in your statement or assumption is that's why there's something called HA - high availability. Which means that they can dynamically hotswap an entire node if there's a CPU failure, so that the node can be downed and yanked out for service/repair while another one is hotswapped in. So it will fail over to a spare hotswap node, work on it, and then either fail back over to the newly replaced node or it would rotate the new one into the hotswap failover pool. (There are MANY different ways of doing that and MANY different topologies).

The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)

"In contrast to this, every server larger than 32/64 sockets, is a cluster."
Again, not entirely true. You can actually get 4 socket systems that are comprised of two dual-socket nodes and THAT is enough to meet the requirements of a cluster. Heck, if you pair up two single-socket consumer-grade systems, that TOO is a cluster. That's kinda how Beowulf clusters got started - cuz it was an inexpensive way (compared to the aforementioned RISC UNIX based systems) to gain computing power without having to spend a lot of money.

"These huge clusters are dirt cheap"
Sure...if you consider IBM's $100 million contract award "cheap".

"Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:"
So there's the two problems with this - 1) it's SGI - so of course they're going to promote what they ARE capable of vs. what they don't WANT to be capable of. 2) Given the SGI-biased statements, this, again, isn't EXACTLY ENTIRELY true either.

HPCs CAN run ERP systems.

"HPC vendors are increasingly targeting commercial markets, whereas commercial vendors, such as Oracle, SAP and SAS, are seeing HPC requirements." (Source: http://www.information-age.com/it-management/strat...

But that also depends on the specific implementation of the ERP system given that SAP is NOT the ONLY ERP system that's available out there, but it's probably one of the most popular ones, if not THE most popular one. (There's a whole thing about distributed relational databases so that the database can reside in smaller chunks across multiple nodes, in-memory, which are then accessed via a high speed interconnect like Myrinet or Infiniband or something along those lines.)

Furthermore, the fact that ERP runs across large mainframes (it grows as the needs grow) is an indication of HPC's place in ERP. Alternatively, perhaps rather than using it for the backend, HPC can be used on the front end by supporting many, many, many virtualized front-end clients.

Like I said, most of the numbers that you wrote are true, but the assumptions behind them aren't exactly all entirely true.

See also: http://csserver.evansville.edu/~mr56/Publications/...
Kevin G - Wednesday, March 19, 2014 - link
"That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI)."

Not exactly. IBM's recent boxes don't boot themselves. Each box has a service processor that initializes the main CPUs and determines if there are any additional boxes connected via external GX links. If it finds external boxes, some negotiation is done to join them into one large coherent system before an attempt to load an OS is made. This is all done in hardware/firmware. Adding/removing these boxes can be done but there are rules to follow to prevent data loss.

It'll be interesting to see what IBM does with their next generation of hardware as the GX bus is disappearing.

"The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)"

Actually on some of these larger systems, a single OS can see the entire memory pool and span across all sockets. The SGI UV2000 and SPARC M6 are fully cache coherent across a global memory address space.

As for a screenshot, I didn't find one. I did find a video going over some of the UV 2000 features displaying all of this though. It is only a 64 socket, 512 core, 1024 thread, 2 TB of RAM configuration running a single instance of Linux. :)
https://www.youtube.com/watch?v=YUmBu6A2ykY

IBM's topology is weird in that while a global memory address space is shared across nodes, it is not cache coherent. IBM's POWER7 and their recent BlueGene systems can be configured like this. I wouldn't call these setups clusters as there is no software overhead to read/write to remote memory addresses but it isn't fully SMP either due to multiple coherency domains.
silverblue - Monday, March 17, 2014 - link
The A10-7850K is a 2M/4T CPU.
Ian Cutress - Monday, March 17, 2014 - link
Thanks for the correction, small brain fart on my part when generating the graphs.
