Single-Threaded Performance

I admit, the following two benchmarks are almost irrelevant for anyone buying a Xeon E7 based machine. But still, we have to quench our curiosity: how much have the new cores been improved? There is a lot that can be said about all the sophisticated "uncore" improvements (cache coherency policies, low latency rings, and so on) that allow this multi-core monster to scale, but at the end of the day, good performance starts with a good core. And since we have listed the many subtle core improvements, we could not resist the opportunity to check how each core compares.

The results aren't totally meaningless either, as the profile of a compression algorithm is somewhat similar to many server workloads: hard to extract instruction level parallelism (ILP) and sensitive to memory parallelism and latency. The instruction mix is a bit different, but it's still somewhat similar to many server workloads. And as one more reason to test performance in this manner, the 7-zip source code is available under the GNU LGPL license. That allows us to recompile the source code on every machine with the -O2 optimization with gcc 4.8.1.

We've run an additional data point for this particular set of tests. The new Ivy Bridge EX was tested at 2.8GHz and downclocked to 2.4GHz, so that we can do a clock-for-clock comparison with Westmere EX. Since we're only testing single-threaded performance here, other than perhaps slight differences due to having more total L3 cache, it doesn't matter which particular E7 v2 chip we use.

LZMA single threaded performance: compression

The latest Xeon E7 v2 "Ivy Bridge EX" is capable of extracting 33% more ILP out of the complex compression code than the older Xeon E7 "Westmere-EX" at the same clock speed. That is pretty amazing and shows how all the small micro-architecture improvements have accumulated into a large performance increase. The Opteron core is also better than most people think: at 2.4GHz it would deliver about 2481 MIPs. That is about 80% of Intel's best server core at the moment—not enough, but nothing to be ashamed about.

Also interesting to note is that the Westmere core was indeed a "tick": any performance increase over the Xeon X7560 (Codename "Beckton", 45nm Nehalem core) is simply the result of the higher clockspeed of the 32nm chip.

Let us see how the chips compare in decompression. Decompression is an even lower IPC (Instructions Per Clock) workload, as it is pretty branch intensive and depends on the latencies of the multiply and shift instructions.

LZMA single threaded performance: decompression

Again, we note a 30% improvement in integer performance going from the Xeon E7 "Westmere" (Xeon E7-4870 at 2.4GHz) to the Xeon E7 v2 "Ivy Bridge EX" (Xeon E7-4890 v2 clocked down to 2.4GHz).

To summarize: the new 15-core Xeon E7 v2 is built upon a strong core architecture that has improved significantly compared to the predecessor.

Our Benchmarks and Configuration Multi-Threaded Integer Performance
Comments Locked

125 Comments

View All Comments

  • Kevin G - Friday, February 21, 2014 - link

    And a quick addition:

    There will indeed be a quick adoption to Haswell-EX not because of AVX2 or DDR4 but rather transactional memory support (TSX). For the large databases and applications these systems are targeted at, TSX should prove to be helpful.
  • TiGr1982 - Friday, February 21, 2014 - link

    I agree, TSX should make a lot of sense for these E7's - they have a huge core count and huge shared memory at the same time.
  • Schmide - Friday, February 21, 2014 - link

    I think your L3 latency numbers are off. I think typical Intel L3 latencies are 30-40 clocks ~3-4ns.
  • Schmide - Friday, February 21, 2014 - link

    Oops my bad i miss used the calculator. Ignore.
  • dylan522p - Friday, February 21, 2014 - link

    No power consumption numbers?
  • JohanAnandtech - Saturday, February 22, 2014 - link

    Coming...we had to run lots of test in parallel, so it was not possible to make sure all systems were similar. Also we should test with workloads that require a lot more memory to get an idea.
  • mslasm - Friday, February 21, 2014 - link

    Note that E7-8857 v2 has 12 cores but no HT, so only has 12 threads as well (see http://ark.intel.com/products/75254/Intel-Xeon-Pro... Thus it is not equivalent to a 3Ghz E7-4860V2, as 4860 has HT for a total of 24 threads

    Also, there must be a typo either in the graph or in the text on the "single thread" integer performance test: "Opteron ... at 2.4GHz would deliver about 2481 MIPs", while - according to the graph - it already delivers 2636 @ 2.3Ghz.
  • JohanAnandtech - Saturday, February 22, 2014 - link

    Good point. There is little gain from HT in OpenFoam, but it will influence the LZMA benchmarks. So the Openfoam findings are still valid, but not the LZMA. The kernel compile is somewhat in between.
  • JohanAnandtech - Saturday, February 22, 2014 - link

    I will rerun the benchmarks without HT to check.
  • mslasm - Saturday, February 22, 2014 - link

    Thanks! I did not mean to imply HT matters "a lot", but it may influence some (and I admit I don't know much about how your benchmarks behave, other than parallel LZMA which I worked a lot with) - so it just does not sound right to outright call it equivalent, and I wish AT only has statements anyone can just trust :)

Log in

Don't have an account? Sign up now