The Intel Xeon E7-8800 v3 Review: The POWER8 Killer?

Author: Johan De Gelas

by Johan De Gelas on May 8, 2015 8:00 AM EST

Xeon E7 v3 System and Memory Architecture

So, the Xeon E5 "Haswell EP" and Xeon E7 "Haswell EX" are the same chip, but the latter has more features enabled and, as a result, finds a home in a different system architecture.

Debuting alongside the Xeon E7 v3 is the new "Jordan Creek 2" buffer chip, which offers support for DDR4 LR-DIMMs or buffered RDIMMs. However, if necessary, it is still possible to use the original "Jordan Creek" buffer chips with DDR3, giving the Xeon E7 v3 the ability to be used with either DDR3 or DDR4. Meanwhile, just like its predecessor, the Jordan Creek 2 buffer can run either in lockstep mode (1:1) or in performance mode (2:1). If you want more details, read our review of the Xeon E7 v2 or Intel's own comparison.

To sum it up, in lockstep mode (1:1):

The Scalable Memory Buffer (SMB) works at the same speed as the RAM, max. 1866 MT/s.
Availability is higher, as the memory subsystem can recover from two sequential RAM failures.
Bandwidth is lower, as the SMB is running at max. 1866 MT/s...
...but so is energy consumption, for the same reason (about 7W instead of 9W).

In performance mode (2:1):

Bandwidth is higher, as the SMB is running at 3200 MT/s (Xeon E7 v2: 2667 MT/s), twice the speed of the memory channels. The SMB combines two memory channels of DDR4-1600.
Energy consumption is higher, as the SMB is running at full speed (9W TDP, 2.5W idle).
The memory subsystem can recover from one device/chip failure, as the data can be reconstructed in the spare chip thanks to the CRC chip.

This is a firmware option, so you choose once whether being able to lose two DRAM chips is worth the bandwidth hit.
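To put rough numbers on the trade-off between the two modes, here is a minimal Python sketch. The transfer rates and power figures are the ones quoted above; the 8 bytes per transfer is an assumption (a 64-bit data path, as on a standard DDR channel), and the names `BufferMode` and `peak_gb_per_s` are purely illustrative, so treat the resulting GB/s values as theoretical ceilings rather than measured results.

```python
from dataclasses import dataclass

# ASSUMPTION: 64-bit (8-byte) data path per transfer, as on a standard DDR
# channel. The article does not state the width of the buffer link.
BYTES_PER_TRANSFER = 8


@dataclass(frozen=True)
class BufferMode:
    name: str
    smb_mts: int       # SMB transfer rate in MT/s (from the article)
    dram_mts: int      # DRAM channel transfer rate in MT/s (from the article)
    power_watts: float # buffer power figure quoted in the article


LOCKSTEP = BufferMode("lockstep (1:1)", smb_mts=1866, dram_mts=1866, power_watts=7.0)
PERFORMANCE = BufferMode("performance (2:1)", smb_mts=3200, dram_mts=1600, power_watts=9.0)


def peak_gb_per_s(mode: BufferMode) -> float:
    """Theoretical peak per buffer: MT/s * bytes per transfer, in GB/s."""
    return mode.smb_mts * BYTES_PER_TRANSFER / 1000


def bandwidth_ratio(fast: BufferMode, slow: BufferMode) -> float:
    """How much faster the SMB runs in one mode than in the other."""
    return fast.smb_mts / slow.smb_mts


for mode in (LOCKSTEP, PERFORMANCE):
    print(f"{mode.name}: {peak_gb_per_s(mode):.1f} GB/s peak, {mode.power_watts} W")
print(f"performance vs lockstep: {bandwidth_ratio(PERFORMANCE, LOCKSTEP):.2f}x")
```

Under that assumption, performance mode offers roughly 1.7 times the peak buffer bandwidth of lockstep mode for about 2W more per buffer.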

Xeon E7 vs E5

The different platform/system architecture is how the Xeon E7 differentiates itself from the Xeon E5, even though both chips have what is essentially the same die. Besides being able to use 4 and 8 socket configurations, the E7 supports much more memory. Each socket connects via Scalable Memory Interconnect 2 (SMI2) to four "Jordan Creek 2" memory buffers.

Jordan Creek 2 memory buffers under the black heatsinks with 6 DIMM slots

Each of these memory buffers supports 6 DIMM slots. Multiply four sockets by four memory buffers and six DIMM slots and you get a total of 96 DIMM slots. With 64 GB LR-DIMMs (see our tests of Samsung/IDT based LRDIMMs here) in those 96 DIMM slots, you get an ultra expensive server with no less than 6 TB RAM. That is why these systems are natural hosts for in-memory databases such as SAP HANA and Microsoft's Hekaton.
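The slot and capacity arithmetic above can be checked with a few lines of Python. This is only a sketch: the constants come from the paragraph above, while the function names are made up for illustration.

```python
# Figures taken from the article for a quad-socket Xeon E7 v3 system.
SOCKETS = 4
BUFFERS_PER_SOCKET = 4   # Jordan Creek 2 buffers per socket
DIMMS_PER_BUFFER = 6     # DIMM slots behind each buffer
DIMM_SIZE_GB = 64        # 64 GB LR-DIMMs
GB_PER_TB = 1024


def dimm_slots(sockets: int = SOCKETS) -> int:
    """Total DIMM slots: sockets x buffers per socket x DIMMs per buffer."""
    return sockets * BUFFERS_PER_SOCKET * DIMMS_PER_BUFFER


def total_capacity_gb(sockets: int = SOCKETS, dimm_size_gb: int = DIMM_SIZE_GB) -> int:
    """Total RAM in GB when every slot holds a DIMM of the given size."""
    return dimm_slots(sockets) * dimm_size_gb


slots = dimm_slots()
capacity_gb = total_capacity_gb()
print(f"{slots} DIMM slots")                                    # 96 DIMM slots
print(f"{capacity_gb} GB = {capacity_gb // GB_PER_TB} TB RAM")  # 6144 GB = 6 TB RAM
```

Passing a different `sockets` value shows how capacity would scale, assuming the per-socket buffer and DIMM layout stays the same.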

There is more, of course. With 48 trillion memory cells, chances are uncomfortably high that one of them will go bad, so you want some excellent reliability features to counter that. Memory mirroring is nothing new, but the Xeon E7 v3 allows you to mirror only the critical part of your memory instead of simply dividing capacity by 2. Also new is "multiple rank sparing", which provides dynamic failover of up to four ranks of memory per memory channel. In other words, not only can the system shrug off a single chip failure, but even a complete DIMM failure won't be enough to take the system down either.
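As a back-of-the-envelope illustration of those numbers, the Python sketch below counts the memory cells in 6 TB and compares usable capacity under full and partial mirroring. The 512 GB mirrored region is a hypothetical example, not a figure from Intel, and the model simply assumes that every mirrored gigabyte costs one extra gigabyte of physical RAM.

```python
GB_PER_TB = 1024


def bit_cells_decimal(capacity_tb: int) -> int:
    """Data bit cells using decimal terabytes (10**12 bytes, 8 bits per byte).

    ECC bits are ignored; with binary terabytes (2**40 bytes) the count
    for 6 TB would be closer to 52.8 trillion.
    """
    return capacity_tb * 10**12 * 8


def usable_full_mirroring(total_gb: int) -> int:
    """Classic mirroring: every byte is stored twice, so capacity is halved."""
    return total_gb // 2


def usable_partial_mirroring(total_gb: int, mirrored_gb: int) -> int:
    """Partial mirroring: only `mirrored_gb` of address space is duplicated.

    ASSUMPTION: the mirrored region consumes exactly twice its size in
    physical RAM and the rest of memory is left unmirrored.
    """
    if mirrored_gb < 0 or 2 * mirrored_gb > total_gb:
        raise ValueError("mirrored region must fit twice into total memory")
    return total_gb - mirrored_gb


total_gb = 6 * GB_PER_TB
print(f"{bit_cells_decimal(6):,} bit cells in 6 TB")
print(f"full mirroring:    {usable_full_mirroring(total_gb)} GB usable")
print(f"partial mirroring: {usable_partial_mirroring(total_gb, 512)} GB usable")
```

Under these assumptions, protecting a hypothetical 512 GB critical region leaves 5632 GB usable, compared with 3072 GB when the whole 6 TB is mirrored.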

146 Comments

Brutalizer - Tuesday, May 12, 2015
Again, HANA is a clustered RAM database. And as I have shown above with the Oracle TimesTen RAM database, these are totally different from a normal database. In-memory databases can never replace a normal database, as IMDBs are optimized for reading data (analysis), not modifying data.

Regarding SGI UV300H, it is a 16 socket server, i.e. a scale-up server. It is not a huge scale-out cluster. And therefore UV300H might be good for business software, but I don't know the performance of SGI's first(?) scale-up server. Anyway, 16 socket servers are different from SGI UV2000 scale-out clusters. And UV2000 cannot be used for business software, as evidenced by the non-existent SAP benchmarks.

ats - Wednesday, May 13, 2015
No, you haven't shown anything. You quote some random whitepaper on the internet like it is gospel and ignore the fact that in memory dbs are used daily as the primary in OLTP, OLAP, BI, etc workloads.

And you don't understand that a significant number of the IMDBs are actually designed directly for the OLTP market which is precisely the DB workload that is modifying the most data and is the most complex and demanding with regard to locks and updates.

There is no architectural difference between the UV300 and the UV2k except a slightly faster interconnect. And just an fyi, UV300 is like SGI's 30th scale-up server. After all, they've been making scale-up servers for longer than Sun/Oracle.

questionlp - Monday, May 11, 2015
HP Superdome X is a 16-socket x86 server that will probably end up replacing the Itanium-based Superdome if HP can scale the S/X to 32 sockets.

Brutalizer - Monday, May 11, 2015
HP will face great difficulties if they try to mod and go beyond 8 sockets on the old Superdome. Heck, even 8 sockets have scaling difficulties on x86.

Kevin G - Monday, May 11, 2015
Except that you can buy a 16 socket Superdome X *today*.

http://h20195.www2.hp.com/V2/getpdf.aspx/4AA5-6149...

The interconnect they're using for the Superdome X is from the old Poulson Itaniums that use QPI which can scale to 64 sockets.

rbanffy - Wednesday, May 13, 2015
You talk "serious business workloads". Of course, there are organizations that use technology that does not scale horizontally, where adding more machines to share the workload does not work because the workload was not designed to be shared. For those, there are solutions that offer progressively less performance per dollar for levels of single-box performance that are unattainable on high-end x86 machines, but that is just because those organizations are limited by the technology they chose.

There is nothing in SAP (except its design) or (non-rel) databases that precludes horizontal scaling. It's just that the software was designed in an age when horizontal scaling was not in fashion (even though VAXes have been doing clustering since I was a young boy) and now it's too late to rebuild it from scratch.

mapesdhs - Friday, May 8, 2015
Good point, I wonder why they've left it at only 2/core for so long...

name99 - Friday, May 8, 2015
It's not easy to ramp up the number of threads. In particular POWER8 uses something I've never seen any other CPU do --- they have a second tier register file (basically an L2 for registers) and the system dynamically moves data between the two register files as appropriate.

It's also much easier for POWER8 to decode 8 instructions per cycle (and to do the multiple branch predictions per cycle to make that happen). Intel could maybe do that if they reverted to a trace cache, but the target codes for this type of CPU are characterized by very large I-footprints and not much tight looping, so trace caches, loop caches, micro-op caches are not that much help. Intel might have to do something like a dual-ported I-cache, and run two fetch streams into two independent sets of 4-wide decoders.

xdrol - Saturday, May 9, 2015
Another register file is just a drop in the ocean. The real problem is the increasing L1/2/.. cache pressure, which can only be mitigated by increasing cache size, which in turn will make your cache access slower, even when you use only one of the SMT threads.

Also, you need to have enough unused execution capacity (pipeline ports) for another hardware thread to be useful; the 2 threads in Haswell can already saturate the 7 execution ports with quite high probability, so the extra thread can only run at the expense of the other, and due to the cache effects, it's probably faster to just get the 2 tasks executed sequentially (within the same thread). This question could be revisited if the processor has 14 execution ports, 2x issue, 2x cache, 2x everything, so it can have 4T/1C, but then it's not really different from 2 normal size cores with 4T..

iAPX - Friday, May 8, 2015
It's because this is the same architecture (mainly) that is used on desktops, laptops, and now even mobile!

With this market share, I wouldn't be surprised if Intel decided to create a new architecture (x86-64 based) for future server chips, much more specialized, dropping AVX for cloud servers, having 4+ threads per core with a simpler decoder and a lot of integer and load/store units!

That might be complemented by a socketable Xeon Phi for floating-point compute intensive tasks and workstations, but that is unclear, even though Intel announced it long ago! ;)
