The Competitor: IBM's POWER8

As we briefly mentioned in the introduction, among all of the potential competitors for the Xeon E7 line, IBM's OpenPOWER might be the most potent at this time. So how do IBM's offerings compare to Intel's? The IBM POWER8 is a "brainiac" (high-IPC) design that also wants to be a speed demon (high clock speeds).

The POWER8 core can decode, issue, execute, and retire 8 instructions per cycle. That degree of instruction-level parallelism (ILP) cannot be extracted from (most) software. To compensate for the lack of ILP in software, no fewer than 8 hardware threads (SMT8) are active per core. According to IBM,

  • 2 threads deliver about 45% more performance than one
  • 4 threads deliver yet another 30% boost
  • the last 4 threads deliver about 7% more
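
Those per-step gains compound multiplicatively. Here is a minimal Python sketch of that arithmetic, using IBM's quoted percentages as its only inputs (the function itself is purely illustrative):

    # Compound IBM's quoted per-step SMT gains into one overall speedup factor
    def cumulative_smt_speedup(step_gains):
        total = 1.0
        for gain in step_gains:
            total *= 1.0 + gain
        return total

    # SMT2: +45%, SMT4: +30% on top of that, SMT8: +7% on top of that
    print(round(cumulative_smt_speedup([0.45, 0.30, 0.07]), 2))  # -> 2.02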

So in total, 8-way SMT roughly doubles the performance of this massive core. Let us compare the two chips.

Xeon E7 v3 / POWER8 Comparison

Feature                        Intel Haswell-EX Xeon E7              IBM POWER8
Process tech.                  22nm FinFET                           22nm SOI
Max clock                      2.5-3.6 GHz                           3.5-4.35 GHz
Max. core count                18 @ 2.5 GHz                          12 @ 4.2 GHz
Max. thread count              36 (SMT2)                             96 (SMT8)
Max. sustained IPC             6 (4)                                 8
L1-I / L1-D cache              32 KB / 32 KB                         32 KB / 64 KB
L2 cache                       256 KB SRAM per core                  512 KB SRAM per core
L3 cache                       2.5 MB SRAM per core                  8 MB eDRAM per core
L4 cache                       None                                  16 MB eDRAM per MBC (64/128 MB total)
Memory                         1.5 TB per socket (64 GB per DIMM)    1-2 TB per socket (64 GB per DIMM)
Theoretical memory bandwidth   102 GB/s (independent mode)           204 GB/s
PCIe 3.0 lanes                 40                                    32

The POWER8 looks better than the Haswell-EX in almost every spec, but the devil is of course in the details. First of all, Intel's L2 cache runs at the same clock as the core, while IBM's L2 cache runs at a lower clock (2.2 GHz or less, depending on the model). Secondly, the POWER8's L3 eDRAM cache may be much larger, but it is also a bit slower.

But the main disadvantage of the POWER8 is that all this superscalar width and high-clockspeed goodness comes at a price in power. This slide from Tyan at the latest OpenPOWER conference tells us more.

A 12-core POWER8 is "limited" to 3.1 GHz if you want to stay below the 190W TDP mark. Clock speeds higher than 4 GHz are only possible with 8 cores and a 250W TDP. This makes us really curious what kind of power dissipation we may expect from the 4.2 GHz 10-core POWER8 inside the expensive E870 enterprise systems (300W?).

That is not all. Each "Jordan Creek 2" memory buffer on the Intel system is limited to about 9W. IBM uses a similar but more complex "Centaur" memory buffer (including a 16 MB cache) which needs more than twice as much power (16-20W). There are at least four of them per chip, and a high-end chip can have eight. So in total, the Intel CPU plus memory buffers have a 201W TDP (165W CPU + 4x9W Jordan Creek 2), while the IBM platform has at best a 270W TDP (190W CPU + 4x20W MBC).
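
As a quick sanity check on those platform totals, here is a minimal Python sketch using only the figures quoted above (the eight-buffer case is our own worst-case extrapolation, not an official IBM number):

    # Platform TDP = CPU TDP + (number of memory buffers x per-buffer TDP)
    def platform_tdp(cpu_tdp_w, buffer_count, buffer_tdp_w):
        return cpu_tdp_w + buffer_count * buffer_tdp_w

    print(platform_tdp(165, 4, 9))    # Xeon E7 v3 + 4x Jordan Creek 2 -> 201 W
    print(platform_tdp(190, 4, 20))   # POWER8 + 4x Centaur            -> 270 W
    print(platform_tdp(250, 8, 20))   # hypothetical: 8-core >4 GHz POWER8 with 8 Centaurs -> 410 W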

Comments

  • Brutalizer - Tuesday, May 12, 2015 - link

    Again, HANA is a clustered RAM database. And as I have shown above with the Oracle TimesTen RAM database, these are totally different from a normal database. In-memory databases (IMDBs) can never replace a normal database, as IMDBs are optimized for reading data (analysis), not modifying data.

    Regarding the SGI UV300H, it is a 16-socket server, i.e. a scale-up server. It is not a huge scale-out cluster. And therefore the UV300H might be good for business software, but I don't know the performance of SGI's first(?) scale-up server. Anyway, 16-socket servers are different from SGI UV2000 scale-out clusters. And the UV2000 cannot be used for business software, as evidenced by the non-existent SAP benchmarks.
  • ats - Wednesday, May 13, 2015 - link

    No, you haven't shown anything. You quote some random whitepaper on the internet like it is gospel and ignore the fact that in-memory DBs are used daily as the primary database in OLTP, OLAP, BI, etc. workloads.

    And you don't understand that a significant number of the IMDBs are actually designed directly for the OLTP market, which is precisely the DB workload that modifies the most data and is the most complex and demanding with regard to locks and updates.

    There is no architectural difference between the UV300 and the UV2k except a slightly faster interconnect. And just an FYI, the UV300 is something like SGI's 30th scale-up server. After all, they've been making scale-up servers for longer than Sun/Oracle.
  • questionlp - Monday, May 11, 2015 - link

    HP Superdome X is a 16-socket x86 server that will probably end up replacing the Itanium-based Superdome if HP can scale the S/X to 32 sockets.
  • Brutalizer - Monday, May 11, 2015 - link

    HP will face great difficulties if they try to modify the old Superdome to go beyond 8 sockets. Heck, even 8 sockets have scaling difficulties on x86.
  • Kevin G - Monday, May 11, 2015 - link

    Except that you can buy a 16-socket Superdome X *today*.

    http://h20195.www2.hp.com/V2/getpdf.aspx/4AA5-6149...

    The interconnect they're using for the Superdome X is from the old Poulson Itaniums that use QPI which can scale to 64 sockets.
  • rbanffy - Wednesday, May 13, 2015 - link

    You talk "serious business workloads". Of course, there are organizations that use technology that does not scale horizontally, where adding more machines to share the workload does not work because the workload was not designed to be shared. For those, there are solutions that offer progressively less performance per dollar for levels of single-box performance that are unattainable on high-end x86 machines, but that is just because those organizations are limited by the technology they chose.

    There is nothing in SAP (except its design) or (non-rel) databases that precludes horizontal scaling. It's just that the software was designed in an age when horizontal scaling was not in fashion (even though VAXes have been doing clustering since I was a young boy), and now it's too late to rebuild it from scratch.
  • mapesdhs - Friday, May 8, 2015 - link

    Good point, I wonder why they've left it at only 2/core for so long...
  • name99 - Friday, May 8, 2015 - link

    It's not easy to ramp up the number of threads. In particular, POWER8 uses something I've never seen any other CPU do: it has a second-tier register file (basically an L2 for registers) and the system dynamically moves data between the two register files as appropriate.

    It's also much easier for POWER8 to decode 8 instructions per cycle (and to do the multiple branch predictions per cycle to make that happen). Intel could maybe do that if they reverted to a trace cache, but the target code for this type of CPU is characterized by very large I-footprints and not much tight looping, so trace caches, loop caches, and micro-op caches are not that much help. Intel might have to do something like a dual-ported I-cache, and run two fetch streams into two independent sets of 4-wide decoders.
  • xdrol - Saturday, May 9, 2015 - link

    Another register file is just a drop in the ocean. The real problem is the increasing L1/L2/... cache pressure, which can only be mitigated by increasing cache size, which in turn will make your cache accesses slower, even when you use only one of the SMT threads.

    Also, you need to have enough unused execution capacity (pipeline ports) for another hardware thread to be useful; the 2 threads in Haswell can already saturate the 7 execution ports with quite high probability, so the extra thread can only run at the expense of the other, and due to the cache effects, it's probably faster to just get the 2 tasks executed sequentially (within the same thread). This question could be revisited if the processor had 14 execution ports, 2x issue, 2x cache, 2x everything, so it could have 4T/1C, but then it's not really different from 2 normal-size cores with 4T.
  • iAPX - Friday, May 8, 2015 - link

    It's because this is the same architecture (mainly) that is used in desktops, laptops, and now even mobile!

    With this market share, I won't be surprised if Intel decides to create a new architecture (x86-64 based) for future server chips, much more specialized, dropping AVX for cloud servers, and having 4+ threads per core with a simpler decoder and a lot of integer and load/store units!

    That might be complemented by a socketable Xeon Phi for floating-point compute-intensive tasks and workstations, but it's unclear, even though Intel announced it long ago! ;)
