Intel's Benchmarks

Since time constraints meant we were not able to run as many benchmarks ourselves as we would have liked, it's useful to check out Intel's own benchmarks as well. In our experience, Intel's benchmarking has a good track record of producing accurate numbers and documenting configuration details. Of course, you have to read all the benchmark information carefully to make sure you understand just what is being tested.

The OLTP and virtualization benchmarks show that the new Xeon E7 v3 is about 25 to 39% faster than the previous Xeon E7 (v2). In some of those benchmarks, the new Xeon had twice as much memory, but it is safe to say that this will make only a small difference. We think it's reasonable to conclude that the Xeon E7 v3 is 25 to 30% faster, which is also what we found in our integer benchmarks.

The increase in legacy FP applications is much smaller. For example, Cinebench was 14% faster, SPECfp 9%, and our own OpenFOAM about 4% faster. Meanwhile, Linpack benchmarks are of little use to most of the HPC world, so we have more faith in our own benchmarking. Intel's more realistic HPC benchmarking showed at best a 19% increase, which is nothing to write home about.

The exciting part about the new Xeon E7 v3 is that data analytics/mining runs a lot faster on it. The 72% faster SAS analytics number is not entirely accurate, as part of the speedup was due to using P3700 SSDs instead of S3700 SSDs. Still, Intel claims that replacing the E7 v2 with the v3 alone is good for a 55-58% speedup.

The most spectacular benchmark is of course SAP HANA. It is not 6x faster as Intel claims, but rather 3.3x (see our comments about TSX). That is still spectacular and the result of excellent software and hardware engineering.

Final Words: Comparing Xeon E7 v3 vs. v2

For those of us running scale-up, reasonably priced HPC or database applications, it is hard to get excited about the Xeon E7 v3. The performance increases are small but tangible; at the same time, however, the new Xeon E7 costs a bit more. Meanwhile, as far as our (HPC) energy measurements go, there is no tangible increase in performance per watt.

The Xeon E7 in its natural habitat: heavy heatsinks and hot-pluggable memory

However, organizations running SAP HANA will welcome the new Xeon E7 with open arms: they get massive speedups for a budget increase of 0.1% or less. The rest of the data mining community with expensive software will benefit too, as the new Xeon E7 is at least 50% faster in those applications thanks to TSX.
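To give a rough idea of where those TSX gains come from, below is a minimal lock-elision sketch using Intel's RTM intrinsics. To be clear, this is our own illustration rather than code from SAP or SAS; update_shared() and the simple fallback spinlock are invented for the example. The idea is that each thread attempts its update inside a hardware transaction and only falls back to a real lock when a conflict forces an abort:

```c
/*
 * Minimal lock-elision sketch with Intel TSX (RTM intrinsics).
 * Our own illustration, not SAP/SAS code; names are invented.
 * Build on an RTM-capable CPU with: gcc -O2 -mrtm tsx_elision.c -c
 */
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held */
static long shared_counter;        /* stands in for a shared row/bucket */

static void lock_fallback(void)
{
    while (atomic_exchange(&fallback_lock, 1))
        ;                          /* spin until the lock is free */
}

static void unlock_fallback(void)
{
    atomic_store(&fallback_lock, 0);
}

void update_shared(long delta)
{
    unsigned status = _xbegin();   /* start a hardware transaction */
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock puts it in the transaction's read set: if another
         * thread takes the lock later, this transaction aborts automatically. */
        if (atomic_load(&fallback_lock))
            _xabort(0xff);         /* someone holds the real lock: bail out */
        shared_counter += delta;   /* speculative update, tracked by hardware */
        _xend();                   /* commit: no lock was ever taken */
        return;
    }
    /* Abort path: data conflict, capacity overflow, or no TSX support.
     * Fall back to the conventional lock so correctness never depends on TSX. */
    lock_fallback();
    shared_counter += delta;
    unlock_fallback();
}
```

A production implementation would retry the transaction a few times before taking the lock, but the principle stands: many threads can update disjoint data under what looks like a single coarse lock, which is exactly the kind of contention that in-memory databases and analytics engines run into constantly.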

Ultimately we wonder how the rest of us will fare. Will the SAP/SAS speedups also be visible in open source Big Data software such as Hadoop and Elasticsearch? Currently we are still struggling to get the full potential out of the 144 threads. Some of these tests run for a few days only to end with a very vague error message: big data benchmarking is hard.

Comparing Xeon E7 v3 and POWER8

Although the POWER8 is still a power gobbling monster, just like its older brother the POWER7, there is no denying that IBM has made enormous progress. Few people will be surprised that IBM's much more expensive enterprise systems beat Intel-based offerings in some high-end benchmarks like SAP's. But the fact that 24 POWER8 cores in a relatively reasonably priced IBM POWER8 server can beat 36 Intel Haswell cores by a considerable margin is new.

It is also interesting that our own integer benchmarking shows that the POWER8 core is capable of keeping up with Intel's best core at the same clock speed (3.3-3.4 GHz). Well, at least as long as you feed it enough threads in IPC-unfriendly code. But that last sentence is the exact description of many server workloads. It also means that the SAP benchmark is not an exception: the IBM POWER8 is definitely not the best CPU to run Crysis (not enough threads), but it is without a doubt a dangerous competitor for the Xeon E7 when given enough threads to fill up the CPU.

Right now the threat to Intel is not dire: IBM still asks way too much for its best POWER8 systems, and the Xeons have a much better performance-per-watt ratio. But once the OpenPOWER Foundation partners start offering server solutions, there is a good chance that Intel will face some very significant performance-per-dollar competition in the server market.

Comments

  • Brutalizer - Tuesday, May 12, 2015 - link

    Again, HANA is a clustered RAM database. And as I have shown above with the Oracle TimesTen RAM database, these are totally different from a normal database. In-memory databases can never replace a normal database, as IMDBs are optimized for reading data (analysis), not modifying data.

    Regarding the SGI UV300H, it is a 16-socket server, i.e. a scale-up server. It is not a huge scale-out cluster. Therefore the UV300H might be good for business software, but I don't know the performance of SGI's first(?) scale-up server. Anyway, 16-socket servers are different from SGI UV2000 scale-out clusters. And the UV2000 cannot be used for business software, as evidenced by the non-existent SAP benchmarks.
  • ats - Wednesday, May 13, 2015 - link

    No, you haven't shown anything. You quote some random whitepaper on the internet like it is gospel and ignore the fact that in-memory DBs are used daily as the primary database in OLTP, OLAP, BI, etc. workloads.

    And you don't understand that a significant number of the IMDBs are actually designed directly for the OLTP market, which is precisely the DB workload that modifies the most data and is the most complex and demanding with regard to locks and updates.

    There is no architectural difference between the UV300 and the UV2k except a slightly faster interconnect. And just an FYI, the UV300 is something like SGI's 30th scale-up server. After all, they've been making scale-up servers for longer than Sun/Oracle.
  • questionlp - Monday, May 11, 2015 - link

    HP Superdome X is a 16-socket x86 server that will probably end up replacing the Itanium-based Superdome if HP can scale the S/X to 32 sockets.
  • Brutalizer - Monday, May 11, 2015 - link

    HP will face great difficulties if they try to mod and go beyond 8 sockets on the old Superdome. Heck, even 8 sockets have scaling difficulties on x86.
  • Kevin G - Monday, May 11, 2015 - link

    Except that you can buy a 16-socket Superdome X *today*.

    http://h20195.www2.hp.com/V2/getpdf.aspx/4AA5-6149...

    The interconnect they're using for the Superdome X is from the old Poulson Itaniums that use QPI, which can scale to 64 sockets.
  • rbanffy - Wednesday, May 13, 2015 - link

    You talk about "serious business workloads". Of course, there are organizations that use technology that does not scale horizontally, where adding more machines to share the workload does not work because the workload was not designed to be shared. For those, there are solutions that offer progressively less performance per dollar for levels of single-box performance that are unattainable on high-end x86 machines, but that is just because those organizations are limited by the technology they chose.

    There is nothing in SAP (except its design) or (non-rel) databases that precludes horizontal scaling. It's just that the software was designed in an age when horizontal scaling was not in fashion (even though VAXes have been doing clustering since I was a young boy) and now it's too late to rebuild it from scratch.
  • mapesdhs - Friday, May 8, 2015 - link

    Good point, I wonder why they've left it at only 2/core for so long...
  • name99 - Friday, May 8, 2015 - link

    It's not easy to ramp up the number of threads. In particular, POWER8 uses something I've never seen any other CPU do: a second-tier register file (basically an L2 for registers), with the system dynamically moving data between the two register files as appropriate.

    It's also much easier for POWER8 to decode 8 instructions per cycle (and to do the multiple branch predictions per cycle to make that happen). Intel could maybe do that if they reverted to a trace cache, but the target code for this type of CPU is characterized by very large I-footprints and not much tight looping, so trace caches, loop caches, and micro-op caches are not that much help. Intel might have to do something like a dual-ported I-cache, running two fetch streams into two independent sets of 4-wide decoders.
  • xdrol - Saturday, May 9, 2015 - link

    Another register file is just a drop in the ocean. The real problem is the increasing L1/L2/... cache pressure, which can only be mitigated by increasing cache size, which in turn will make your cache accesses slower, even when you use only one of the SMT threads.

    Also, you need to have enough unused execution capacity (pipeline ports) for another hardware thread to be useful; the 2 threads in Haswell can already saturate the 7 execution ports with quite high probability, so the extra thread can only run at the expense of the other, and due to the cache effects it's probably faster to just execute the 2 tasks sequentially (within the same thread). This question could be revisited if the processor had 14 execution ports, 2x issue, 2x cache, 2x everything, so it could run 4T/1C, but then it's not really different from 2 normal-size cores with 4T..
  • iAPX - Friday, May 8, 2015 - link

    It's because this is (mainly) the same architecture that is used in desktops, laptops, and now even mobile!

    With this market share, I wouldn't be surprised if Intel decided to create a new, much more specialized architecture (x86-64 based) for future server chips: dropping AVX for cloud servers, having 4+ threads per core with a simpler decoder and a lot of integer and load/store units!

    That might be complemented by a socketed Xeon Phi for floating-point compute-intensive tasks and workstations, but its status is unclear even though Intel announced it long, long ago! ;)
