Original Link: http://www.anandtech.com/show/7757/quad-ivy-brigde-ex-60-cores-120-threads
The Intel Xeon E7 v2 Review: Quad Socket, Up to 60 Cores/120 Threads
by Johan De Gelas on February 21, 2014 6:00 AM EST
It is generally accepted as common knowledge that the high-end RISC server vendors—IBM and Oracle—have been bleeding market share in favor of high-end Intel Xeon based servers. Indeed, the RISC market accounts for about 150k units while the x86 market has almost 10 million servers. About 5% of those 10 million units are high-end x86 servers, so the Xeon E7 server volume is probably only 2-4 times the size of the whole RISC market. Still, that tiny amount of RISC servers represents about 50% of the server market revenues.
But the RISC vendors have finally woken up. IBM has several Power7+ based servers that are more or less price competitive with the Xeon E7. Sun/Oracle's server CPUs have been lagging severely in performance. The UltraSPARC T1 and T2, for example, were pretty innovative but only performed well in a very small niche of the market, while offering almost ridiculously low performance in any application (HPC, BI, ERP) that needed decent per-thread performance.
Quite surprisingly, Oracle has been extremely aggressive the past few years. The "S3" core of the octal-core SPARC T4, launched at the end of 2011, was finally a competitive server core. Compared to the quad-issue Westmere core inside the contemporary Xeon E7, it was still a simple core, but gone were the single-issue in-order designs of the T1 and T2 at laughably low clock speeds. Instead, the Sun server chip received a boost to an out-of-order dual-issue design at a pretty decent 3GHz clock. Each core could support eight threads but also execute two threads simultaneously. Last year, the SPARC T5, an improved T4, had twice as many cores at 20% higher clocks.
As usual, the published benchmarks are very vague and are only available for the top models, the TDP is unknown, and the best performing systems come with astronomic price tags ($950,000 for two servers, some networking, and storage... really?). In a nutshell, every effort is made to ensure you cannot compare these with the servers of "Big Blue" or the x86 competition. Even Oracle's "technical deep dive" seems to be written mostly to please the marketing people out there. A question like "Does the SPARC T5 also support both single-threaded and multi-threaded applications?" must sound particularly hilarious to our technically astute readers.
Oracle's nebulous marketing to justify some of the outrageous prices has not changed, but make no mistake: something is brewing among the RISC vendors. SUN/Oracle is no longer the performance weakling in the server market, some IBM Power systems are priced quite reasonably, and the Intel Xeon E7—still based on the outdated Westmere Core—is starting to show its age. Not surprisingly, it's time for a "tick-tock" update of the Xeon E7. The new Xeon E7 48xx v2 is baked in a better process (22nm vs 32nm) and comes with 2012's "Ivy Bridge" core, enhanced for server/IT markets to become "Ivy Bridge EX".
As Ian already discussed, the new Xeon E7 v2 is a 6, 8, 10, 12 or 15-core Ivy Bridge Xeon, similar to the Xeon E5-2600 v2. The big difference of course is that this new Xeon E7 v2 can be plugged into a quad- or native octal-socket server. These processors have three QuickPath Interconnects to be able to communicate over one hop. More sockets are possible with third party "glue logic".
Compared to the old Xeon E7 based on the "Westmere" core, the new Xeon E7 v2 "Ivy Bridge EX" features a vast number of improvements. We will not list all of them, but the following should give you an idea of how much progress has been made since the Westmere core:
- µop cache (less decoding)
- Improved branch prediction
- Deeper and larger OoO buffers
- Turbo Boost 2.0
- AVX instructions
- Divider is twice as fast
- MOVs take no execution slots
- Improved prefetchers
- Improved shift/rotate and split/load
- Better balance between Hyper-Threading and single-threaded performance; buffers are dynamically allocated to threads
- Faster memory controller
Most of the improvements were fine-tuning, but their combined effect should result in a tangible boost in integer performance. For software that uses AVX, the performance boost could be very substantial. Even in software that uses older SSE(2) code, we found that the Sandy Bridge/Ivy Bridge generations were 20% faster, clock for clock, and we should see similar results here.
Just like the Xeon E5-2600 v2, the Ivy Bridge EX cores and 2.5MB L3 cache slices are stacked in columns connected with three fast rings, which connect all cores and all the other units (called agents) on the SoC. These rings also make sure that the L3 slices can act as one unified 37.5MB L3 cache with 450GB/s of bandwidth. The latency to the L3 cache is very low: 15.5ns (at 2.8GHz) versus 20ns for Westmere-EX (Xeon E7-4870 at 2.4GHz). PCIe I/O now happens on the die as well, and each CPU can support 32 PCIe lanes.
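As a quick sanity check on the cache figures above, the short Python sketch below multiplies the slice size by the core count and converts the quoted L3 latencies into clock cycles. The function names are ours and purely illustrative, and the cycle counts assume the latencies were measured at exactly the clock speeds quoted.

```python
def l3_total_mb(cores: int, slice_mb: float = 2.5) -> float:
    """Total L3 capacity when every core brings one slice of slice_mb."""
    return cores * slice_mb


def latency_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds into core clock cycles."""
    return latency_ns * clock_ghz


if __name__ == "__main__":
    print(l3_total_mb(15))                        # 15 slices of 2.5MB
    print(round(latency_cycles(15.5, 2.8), 1))    # Ivy Bridge EX
    print(round(latency_cycles(20.0, 2.4), 1))    # Westmere-EX
```

Expressed in cycles (roughly 43 versus 48), the gap is smaller than the nanosecond figures suggest, because part of the improvement comes from the higher clock.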
Finally, some coherency improvements are also implemented. Modified cache lines are sent straight to the requester, without any write back to the memory agent. Overall, the collective sum of the improvements should prove quite capable.
Previous versions of Intel's flagship Xeon always came with very conservative memory configurations as RAM capacity and reliability was the priority. Typically, these systems came with memory extension buffers for increased capacity, but those memory buffers also increase memory latency. As a result, these quad- and octal-socket monsters had a hard time competing with the best dual-Xeon setups in memory intensive applications.
The new Xeon E7 v2 still has plenty of memory buffers (code named "Jordan Creek"), and it now supports three instead of two DIMMs per channel. The memory riser cards with two buffers now support 12 DIMMs instead of the eight supported on Westmere-EX. Using relatively affordable 32GB DIMMs, this allows you to load a system with up to 3TB of RAM. If you break the bank and use 64GB LRDIMMs, 6TB of RAM is possible.
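The capacity numbers follow directly from the DIMM slot count. Here is a minimal Python sketch of that arithmetic for a quad-socket system, assuming eight DDR3 channels per socket (the channel count used in this article's bandwidth discussion) and three DIMMs per channel. The function name is hypothetical.

```python
def max_ram(sockets: int, channels_per_socket: int,
            dimms_per_channel: int, dimm_gb: int) -> tuple[int, float]:
    """Return (DIMM slots, capacity in TB) for a fully populated system."""
    slots = sockets * channels_per_socket * dimms_per_channel
    return slots, slots * dimm_gb / 1024


if __name__ == "__main__":
    print(max_ram(4, 8, 3, 32))   # 32GB DIMMs
    print(max_ram(4, 8, 3, 64))   # 64GB LRDIMMs
```

With 96 slots, 32GB DIMMs give 3TB and 64GB LRDIMMs give 6TB.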
With the previous platform, having eight memory channels only increased capacity and not bandwidth as they ran in lockstep. Each channel delivers half a cache line, then the Jordan Creek buffer combines those halves and sends off the result to the requesting memory controller. The high speed serial interface or scalable memory interconnect (SMI) channels must run at the same speed as the DDR3 channels. With Westmere-EX, this resulted in an SMI running at a maximum of 1066MHz. With the Xeon E7 v2, we get four SMI interconnects running at speeds up to 1600MHz. In lockstep, the system can survive a dual-device error. As a result, the RAS (Reliability, Availability, Serviceability) is best in lockstep mode.
With the Ivy Bridge EX version of the Xeon E7, the channels can also run independently. This mode is called performance mode and each channel can deliver one cache line. To cope with twice the amount of bandwidth, the SMI interconnect must run twice as fast as the memory channels. In this case, the SMI channel can run at 2667 MT/s while the two channels work at 1333 MT/s. That means in theory, the E7 v2 chip could deliver as much as 85GB/s (1333 * 8 channels * 8 bytes per channel) of bandwidth, which is 2.5x more than what the previous platform delivered. The disadvantage is that only a single device error can be corrected—more speed, less RAS.
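The theoretical bandwidth figure is simple to reproduce. The Python sketch below is a back-of-the-envelope calculation, not a measurement. It uses the 1333 MT/s, eight-channel, 8-bytes-per-transfer numbers quoted above. For the older platform, it treats each lockstepped pair of channels at 1066 MT/s as one effective channel, which is our simplification rather than an Intel figure.

```python
def peak_bandwidth_gbs(mega_transfers: float, channels: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return mega_transfers * channels * bytes_per_transfer / 1000


if __name__ == "__main__":
    # Performance mode: eight independent channels at 1333 MT/s.
    performance_mode = peak_bandwidth_gbs(1333, 8)
    # Lockstep at 1066 MT/s: assume 8 channels behave like 4 effective ones.
    lockstep_old = peak_bandwidth_gbs(1066, 4)
    print(round(performance_mode, 1))
    print(round(lockstep_old, 1))
    print(round(performance_mode / lockstep_old, 1))
```

Under those assumptions the ratio works out to about 2.5, in line with the improvement claimed over the previous platform.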
According to Intel, both latency and bandwidth are improved tremendously compared to the Westmere-EX platform. As a result, the new quad Xeon E7 v2 platform should perform a lot better in memory intensive HPC applications.
By virtue of the impressive 22nm Hi-K metal-gate tri-gate CMOS process with 9 metal layers, Intel has been able to increase the maximum core count by 50% (15 vs 10) and the clockspeed by 17% (2.8GHz vs 2.4GHz) while the TDP has only increased by 19% (155W vs 130W). Intel claims that the actual power usage of the new flagship E7, the 155W E7-4890 v2, is actually lower than that of the previous 130W TDP Xeon E7-4870 at low and medium loads.
At maximum load, Intel claims you get about 50% higher power consumption for twice as much performance. At idle and low loads, it seems that the 155W Xeon 4890 v2 is a lot more efficient. That makes sense considering the improvements in idle/low load power use we saw with Sandy Bridge and then Ivy Bridge over the earlier Nehalem/Clarksfield offerings on desktops and laptops; it's taken some time, but the big servers are finally seeing the same improvements with Ivy Bridge EX.
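To put those ratios in one place, here is a small Python sketch that recomputes them from the numbers quoted above. The helper names are ours and purely illustrative; the "twice the performance for 50% more power" inputs are Intel's claims, not our measurements.

```python
def pct_increase(new: float, old: float) -> float:
    """Percentage increase going from old to new."""
    return (new / old - 1.0) * 100.0


def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance per watt, given performance and power ratios."""
    return perf_ratio / power_ratio


if __name__ == "__main__":
    print(f"cores: +{pct_increase(15, 10):.0f}%")
    print(f"clock: +{pct_increase(2.8, 2.4):.0f}%")
    print(f"TDP:   +{pct_increase(155, 130):.0f}%")
    # Intel's claim at maximum load: 2x performance for 1.5x power.
    print(f"perf/watt: {perf_per_watt_gain(2.0, 1.5):.2f}x")
```

If Intel's claim holds, performance per watt at full load improves by about a third.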
SKUs and Prices
Anno 2014, the only competition for the Xeon E7 v2 are the—ranging from expensive to "Exa" expensive—Oracle Superclusters, the relatively "cheap" but lowly specced IBM Power 710/720 Server Express line, or the powerful but rather expensive IBM Power 760-780 server line. As a result, the prices for Ivy Bridge EX are a lot more "RISCy". Intel feels that you should get 20% to 50% more performance for the same amount of money...
... but feels that a premium price is warranted for the two top models (4890 and 4880 v2) that offer higher performance increases. That leads to some hefty price tags:
One of the most interesting SKUs seems to be the Xeon E7-8857 v2, a native 12-core at a pretty high 3GHz clock, and which only costs 60% of the other top models.
A View of Our Lab
We installed the new Intel "Brickland" server in our newest rack...
...and it was placed on top of its predecessor, the "Boxboro" server.
A look at the back: two 1200W PSUs and a dual-10Gb Ethernet interface. PCIe cards must be mounted horizontally via the riser cards.
Like Boxboro, memory is placed on daughter/riser cards with two memory buffers ("Jordan Creek"). Once we remove the memory daughter cards...
...you can finally see the massive heatsinks on top of our Xeon E7-4890 v2 processors.
Our Benchmark Choices
To make the comparison more interesting, we decided to include both the Quad Xeon "Westmere-EX" as well as the "Nehalem-EX". Remember these heavy duty, high RAS servers continue to be used for much longer in the data center than their dual socket counterparts. Many people considering the newest Xeon E7-4800 v2 probably still own a Xeon X7500.
Of course, the comparison would not be complete without the latest dual Xeon 2600 v2 server and at least one Opteron based server. Due to the large number of platforms and the fact that we developed a brand new HPC test (see further), we quickly ran out of time. These time constraints, and the fact that we have neglected our Linux testing in recent reviews in favor of Windows 2012 and ESXi, led to the decision to limit ourselves to testing on top of Ubuntu Linux 13.10 (kernel 3.11). You'll see our typical ESXi and Windows benchmarks in a later review.
There are some differences in the RAM and SSD configurations. The use of different SSDs was due to time constraints as we wanted to test the servers as much as possible in parallel. The RAM configuration differences are a result of the platforms: for example, the quad Intel CPUs only perform at their best when each CPU gets eight DIMMs. The Opteron and Dual Xeon E5-2680 v2 server perform best with one DIMM per channel (1 DPC).
None of these differences have a tangible influence on the results of our benchmarks, as none of them were bottlenecked by the storage system or the amount of RAM that was used. The minimum amount of 64GB of RAM was more than enough for all benchmarks in this review.
We also did not attempt to do power measurements. We will try to do an apples-to-apples power comparison at a later time.
Intel S4TR1SY3Q "Brickland" IVT-EX 4U-server
The latest and greatest from Intel consists of the following components:
|CPU||4x Xeon E7-4890 v2 (D1 stepping) 2.8GHz, 15 cores, 37.5MB L3, 155W TDP|
|RAM||256GB, 32x8GB Samsung 8GB DDR3 M393B1K70DH0-YK0 at 1333MHz|
|Motherboard||Intel CRB Baseboard "Thunder Ridge"|
The total number of DIMM slots is 96. When using 64GB LRDIMMs, this server can offer up to 6TB of RAM! In some cases, we have tested the E7-4890 v2 at a lower maximum clock in order to do clock-for-clock comparisons with the previous generation, and in a few cases we have also disabled three of the cores in order to simulate the performance of some of the 12-core Ivy Bridge EX parts. For example, an E7-4890 v2 at 2.8GHz with 3 cores disabled (12 cores total) gives you a good idea of how the much less expensive E7-8857 v2 at 3GHz would perform: it should score about 7% higher than the 12-core E7-4890 v2, assuming performance scales with clock speed.
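The 7% estimate is nothing more than the ratio of the two clock speeds. A minimal Python sketch of that extrapolation follows; it assumes perfectly linear scaling with clock speed, which real workloads rarely achieve, and the score of 100 is a placeholder rather than a benchmark result.

```python
def scale_by_clock(score: float, measured_ghz: float, target_ghz: float) -> float:
    """Extrapolate a score to another clock, assuming linear scaling."""
    return score * target_ghz / measured_ghz


if __name__ == "__main__":
    baseline = 100.0  # placeholder score for the 12-core E7-4890 v2 at 2.8GHz
    estimate = scale_by_clock(baseline, 2.8, 3.0)
    print(f"estimated uplift: {(estimate / baseline - 1) * 100:.1f}%")
```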
Intel Quanta QSSC-S4R Benchmark Configuration
The previous quad Xeon E7 server, as reviewed here.
|CPU||4x Xeon X7560 at 2.26GHz or 4x Xeon E7-4870 at 2.4GHz|
|RAM||16x8GB Samsung 8GB DDR3 M393B1K70DH0-YK0 at 1066MHz|
|Motherboard||QCI QSSC-S4R 31S4RMB00B0|
|PSU||4x850W Delta DPS-850FB A S3F E62433-004 850W|
The server can accept up to 64 32GB Load Reduced DIMMs (LR-DIMMs), for a total of 2TB.
Intel's Xeon E5 server R2208GZ4GSSPP (2U Chassis)
This is the server we used in our Xeon "Ivy Bridge EP" review.
|CPU||2x Xeon processor E5-2680 v2 (2.8GHz, 10c, 25MB L3, 115W)|
|RAM||128GB (8 x 16GB) Micron MT36JSF2G72PZ – BDDR3-1866|
|Internal Disks||2 x Intel MLC SSD710 200GB|
|Motherboard||Intel Server Board S2600GZ "Grizzly Pass"|
|BIOS version||SE5C600.86B (August the 6th, 2013)|
|PSU||Intel 750W DPS-750XB A (80+ Platinum)|
The Xeon E5 CPUs have four memory channels per CPU and support up to DDR3-1866, and thus our dual CPU configuration gets eight DIMMs for maximum bandwidth.
Supermicro A+ Opteron server 1022G-URG (1U Chassis)
This Opteron server is not comparable in any way with the featured Intel systems as it is not targeted at the same market and costs a fraction of the other machines. Nevertheless, here's our test configuration.
|CPU||2x Opteron "Abu Dhabi" 6376 at 2.3GHz|
|RAM||64GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0|
|Internal Disks||2 x Intel MLC SSD710 200GB|
|Chipset||AMD Chipset SR5670 + SP5100|
|PSU||SuperMicro PWS-704P-1R 750Watt|
The Opteron server in this review is only here to satisfy curiosity. We want to see how well the Opteron fares in our new Linux benchmarks.
I admit, the following two benchmarks are almost irrelevant for anyone buying a Xeon E7 based machine. But still, we have to quench our curiosity: how much have the new cores been improved? There is a lot that can be said about all the sophisticated "uncore" improvements (cache coherency policies, low latency rings, and so on) that allow this multi-core monster to scale, but at the end of the day, good performance starts with a good core. And since we have listed the many subtle core improvements, we could not resist the opportunity to check how each core compares.
The results aren't totally meaningless either, as the profile of a compression algorithm is somewhat similar to many server workloads: it is hard to extract instruction level parallelism (ILP), and performance is sensitive to memory parallelism and latency. The instruction mix is a bit different, but the overall behavior is comparable. As one more reason to test performance in this manner, the 7-zip source code is available under the GNU LGPL license. That allows us to recompile the source code on every machine with gcc 4.8.1 at the -O2 optimization level.
We've run an additional data point for this particular set of tests. The new Ivy Bridge EX was tested at 2.8GHz and downclocked to 2.4GHz, so that we can do a clock-for-clock comparison with Westmere EX. Since we're only testing single-threaded performance here, other than perhaps slight differences due to having more total L3 cache, it doesn't matter which particular E7 v2 chip we use.
The latest Xeon E7 v2 "Ivy Bridge EX" is capable of extracting 33% more ILP out of the complex compression code than the older Xeon E7 "Westmere-EX" at the same clock speed. That is pretty amazing and shows how all the small micro-architecture improvements have accumulated into a large performance increase. The Opteron core is also better than most people think: at 2.4GHz it would deliver about 2481 MIPS. That is about 80% of Intel's best server core at the moment: not enough, but nothing to be ashamed of.
Also interesting to note is that the Westmere core was indeed a "tick": any performance increase over the Xeon X7560 (Codename "Beckton", 45nm Nehalem core) is simply the result of the higher clockspeed of the 32nm chip.
Let us see how the chips compare in decompression. Decompression is an even lower IPC (Instructions Per Clock) workload, as it is pretty branch intensive and depends on the latencies of the multiply and shift instructions.
Again, we note a 30% improvement in integer performance going from the Xeon E7 "Westmere" (Xeon E7-4870 at 2.4GHz) to the Xeon E7 v2 "Ivy Bridge EX" (Xeon E7-4890 v2 clocked down to 2.4GHz).
To summarize: the new 15-core Xeon E7 v2 is built upon a strong core architecture that has improved significantly compared to the predecessor.
Multi-Threaded Integer Performance
How do the new Xeon E7 v2 chips compare to the existing Xeon servers when it comes to some multi-threaded workloads?
When it comes to raw integer processing, the new Xeon delivers up to 70% better performance than the previous generation and up to 2.3x better performance than the Xeon X7560. With only 12 cores active to simulate performance of the 12-core models, e.g. E7-8857 v2 and E7-4860 v2, we can get a rough idea how the interesting 3GHz 12-core E7-8857 v2 performs, which has the same TDP (130W) as the previous generation. In those circumstances the new Xeon E7 v2 is 50% faster than the previous generation and twice as fast as the Xeon X7560 (and the actual E7-8857 v2 will be clocked slightly higher).
Decompression gives a similar performance landscape, though the E5-2680 now drops below the X7560.
Linux Kernel Compile
A more real-world benchmark to test the integer processing power of our quad Xeon server is a Linux kernel compile. Although very few people compile their own kernel, it gives us a good idea how the CPUs handle a complex build.
To do this, we downloaded the 3.11 kernel from kernel.org. We then compiled the kernel with the "time make -jx" command, where x stands for the maximum number of threads that the platform is capable of running. To make the graph more readable, the wall time in seconds was converted into the number of builds per hour.
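For readers who want to repeat the exercise, the conversion from wall time to builds per hour can be sketched in a few lines of Python. The helper names are hypothetical, the 120-second wall time is a made-up illustration rather than one of our results, and the thread count defaults to whatever the operating system reports.

```python
import os


def make_command(threads=None):
    """Build the 'make -jx' argument list; x defaults to the OS-reported CPU count."""
    if threads is None:
        threads = os.cpu_count() or 1
    return ["make", f"-j{threads}"]


def builds_per_hour(wall_seconds: float) -> float:
    """Convert the wall time of one build into builds per hour."""
    return 3600.0 / wall_seconds


if __name__ == "__main__":
    print(" ".join(make_command(120)))   # quad-socket, 60 cores / 120 threads
    print(builds_per_hour(120.0))        # hypothetical 2-minute build
```

Builds per hour has the advantage that higher is better, which keeps the graph consistent with the other benchmarks.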
The flagship SKU is almost twice as fast as the previous E7 platform and 2.5 times faster than the Xeon X7560.
It is interesting to note that the Xeon E7-8857 v2 (simulated by the E7-4890 v2 with three cores disabled) will be about twice as fast as the Xeon E5-2680 v2 and delivers almost 90% of the performance of the flagship at 60% of the price. This may be a good option to help hard working developers be more productive and happy?
Of course, we will be the first to admit that this is a niche market. Let's take a look at some software this platform is built to handle: SAP ERP.
SAP S&D Benchmark
The SAP SD (Sales and Distribution, 2-Tier Internet Configuration) benchmark is an interesting benchmark as it is a real-world client-server application. It is one of those rare industry benchmarks that actually means something to the real IT professionals. Even better, the SAP ERP software is a prime example of where these Xeon E7 v2 chips will be used. We looked at SAP's benchmark database for these results.
Most of the results below were obtained on Windows 2008/2012 and MS SQL Server (both 64-bit). Every 2-Tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 Enhancement Package 4. These results are not comparable with any benchmark performed before 2009. We analyzed the SAP Benchmark in-depth in one of our earlier articles. The profile of the benchmark has remained the same:
- Very parallel resulting in excellent scaling
- Low to medium IPC, mostly due to "branchy" code
- Somewhat limited by memory bandwidth
- Likes large caches (memory latency)
- Very sensitive to sync ("cache coherency") latency
Let's see how the quad Xeon compares to the previous Intel generation, the cheaper dual socket systems, and the RISC competition.
The new Xeon E7 v2 is no less than 80% faster than its predecessor. The nearest RISC competitor (IBM Power 7 3.55) is a lot more expensive and delivers only 70% of the performance. We have little doubt that the performance/watt ratio of the Xeon E7 v2 is a lot better too.
Intel delivers a serious blow to the RISC competition. For about 11 months, the Oracle SPARC T5-8 delivered the highest SAPS of all octal-socket machines. This insanely expensive machine, which keeps 1024 threads in flight (but executes only 256 of them at any one time), is now beaten by the Fujitsu PRIMEQUEST 2800E. The 240-thread octal Xeon E7-8890 v2 outperforms Oracle's former champion by about 18%. The SPARC comeback is still remarkable, although we are pretty sure that the Fujitsu server will be less expensive. Even better, you do not have to pay the Oracle support costs.
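The thread counts in this comparison are easy to derive from the socket, core, and thread-per-core figures mentioned in this article. The Python sketch below does so, assuming 16 cores per SPARC T5 (twice the octal-core T4), eight threads kept in flight per core, and two executing simultaneously. The function name is ours.

```python
def hardware_threads(sockets: int, cores_per_socket: int,
                     threads_per_core: int) -> int:
    """Total hardware threads in a multi-socket system."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    print(hardware_threads(8, 15, 2))   # octal Xeon E7-8890 v2
    print(hardware_threads(8, 16, 8))   # SPARC T5-8, threads in flight
    print(hardware_threads(8, 16, 2))   # SPARC T5-8, threads executing
```

In other words, the Xeon system wins with fewer threads executing at any one time (240 versus 256), which points to markedly higher per-thread performance.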
Several of our readers have already suggested that we look into OpenFoam. That's easier said than done, as good benchmarking means you have to master the software somewhat. Luckily, my lab was able to work with the professionals of Actiflow. Actiflow specialises in combining aerodynamics and product design. Calculating aerodynamics involves the use of CFD software, and Actiflow uses OpenFoam to accomplish this. To give you an idea what these skilled engineers can do, they worked with Ferrari to improve the underbody airflow of the Ferrari 599 and increase its downforce.
The Ferrari 599: an improved product thanks to OpenFoam.
We were allowed to use one of their test cases as a benchmark, but we are not allowed to discuss the specific solver. All tests were done on OpenFoam 2.2.1 and openmpi-1.6.3.
Many CFD calculations do not scale well on clusters, unless you use InfiniBand. InfiniBand switches are quite expensive and even then there are limits to scaling. We do not have an InfiniBand switch in the lab, unfortunately. Although it's not as low latency as InfiniBand, we do have a good 10G Ethernet infrastructure, which performs rather well.
So we added a fifth configuration to our testing: the quad-node Intel Server System H2200JF. The only CPU that we have eight of right now is the Xeon E5-2650L 1.8GHz. Yes, it is not perfect, but this is the start of our first clustered HPC benchmark. This way we can get an idea of whether or not the Xeon E7 v2 platform can replace a complete quad-node cluster system and at the same time offer much higher RAM capacity.
The results are pretty amazing: the quad Xeon E7-4890 v2 runs circles around our quad-node HPC cluster. Even if we were to outfit it with 50% higher clocked Xeons, the quad Xeon E7 v2 would still be the winner. Of course, there is no denying that our quad-node cluster is a lot cheaper to buy. Even with an InfiniBand switch, an HPC cluster with dual socket servers is a lot cheaper than a quad socket Intel Xeon E7 v2.
However, this bodes well for the soon to be released Xeon E5-46xx v2 parts. QPI links are even lower latency than InfiniBand. But since we do not have a lot of HPC testing experience, we'll leave it up to our readers to discuss this in more detail.
Another interesting detail is that the Xeon E5-2650L at 1.8GHz is about twice as fast as a Xeon L5650. We found AVX code inside OpenFoam 2.2.1, so we assume that this is one of the cases where AVX improves FP performance tremendously. Seasoned OpenFoam users, let us know whether this is an accurate assessment.
It has been more than three years since the previous generation Xeon E7 hit the market. IBM and Oracle have overtaken the old Xeon E7 since then and an update was long overdue. Since then, Intel has launched two new architectures in the dual socket server CPU market: the Intel E5-2600, based on the "Sandy Bridge" architecture, and the Intel E5-2600 v2 ("Ivy Bridge").
The new Xeon core has already shown its worth in the dual socket Xeon E5-2600 v2 based servers. It is interesting to note that both architecture updates, Sandy Bridge and Ivy Bridge, although relatively subtle on their own, have increased the integer performance of each individual core by 30%. The many subtle changes also increase the performance/watt, and the excellent 22nm process technology enables a 50% higher core count. The end result is that the general computing performance has doubled in scalable integer applications (SAP) and tripled in floating point applications. There is more.
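The "doubled" figure for integer workloads is roughly what you get by multiplying the per-core gain by the core count increase. The Python sketch below shows that multiplication. It assumes perfect scaling across cores and ignores clock speed differences, so treat it as an upper-bound illustration rather than a prediction; the function name is ours.

```python
def combined_speedup(*factors: float) -> float:
    """Multiply independent speedup factors into one overall factor."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


if __name__ == "__main__":
    per_core_gain = 1.30    # 30% faster per core, clock for clock
    core_count_gain = 1.50  # 15 cores versus 10
    print(round(combined_speedup(per_core_gain, core_count_gain), 2))
```

The product, about 1.95, lines up with the observed doubling in scalable integer applications.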
We are entering the big data era, and the result is a strong and renewed interest in (almost) real-time data mining. One of the prime examples is SAP with the in memory and compressed database platform SAP HANA. Both Microsoft with SQL Server 2014 and IBM with DB2 10.5—with the so called BLU acceleration—are following suit. Therefore, it is likely that there will be a strong demand for a server platform with massive RAM capacity. The new quad socket Xeon servers can offer up to 3TB of RAM with relatively affordable 32GB DIMM technology and no less than 6TB with the ultra-expensive 64GB LR-DIMMs. That is another reason why the Intel Xeon E7 v2 platform will be more attractive than much more expensive RISC servers that are typically limited to 1-2TB.
Overall, Intel's launch of the tried and proven Ivy Bridge cores looks ready to set a new level of performance expectations. Ivy Bridge EX may seem awfully late compared to the IVB and IVB EP releases, but that's typical of this server segment. The Xeon E7 v2 chips are slated to remain in data centers for the next several years as the most robust—and most expensive—offerings from Intel. If you can use more smaller servers instead of a few large servers, that will certainly be more cost effective, but the types of applications typically run on these servers and the demands of the software can frequently make the hardware costs a secondary consideration.