Assessing IBM's POWER8, Part 1: A Low Level Look at Little Endian

Author: Johan De Gelas

by Johan De Gelas on July 21, 2016 8:45 AM EST

Comparing with Intel's Best

Comparing CPUs in tables is always a risky game: the simple numbers hide a lot of nuances and trade-offs. But if we approach the table with caution, we can still extract quite a bit of information from it.

Feature	IBM POWER8	Intel Broadwell (Xeon E5 v4)	Intel Skylake
L1-I cache (size, associativity)	32 KB, 8-way	32 KB, 8-way	32 KB, 8-way
L1-D cache (size, associativity)	64 KB, 8-way	32 KB, 8-way	32 KB, 8-way
Outstanding L1-cache misses	16	10	10
Fetch Width	8 instructions	16 bytes (roughly 4-5 x86 instructions)	16 bytes (roughly 4-5 x86 instructions)
Decode Width	8	4 µops	5-6* µops (*µop cache hit)
Issue Queue	64+15 branch+8 CR = 87	60 unified	97 unified
Issue Width/Cycle	10	8	8
Instructions in Flight	224 (GCT, SMT-8 mode)	192 (ROB)	224 (ROB)
Architectural regs	32 (ST), 2x32 (SMT-2)	16	16
Rename regs	92 (ST), 2x92 (SMT-2)	168	180
Load Bandwidth (per unit)	4 per cycle, 16B/cycle	2 per cycle, 32B/cycle	2 per cycle, 32B/cycle
Load Queue Size	44 entries	72 entries	72 entries
Store Bandwidth	2 per cycle, 16B/cycle	1 per cycle, 32B/cycle	1 per cycle, 32B/cycle
Store Queue Size	40 entries	42 entries	56 entries
Int. Pipeline Length	18 stages	19 stages (14 stages from µop cache)	19 stages (14 stages from µop cache)
TLB	2048 entries, 4-way	128I + 64D (L1); 1024 entries, 8-way (L2)	128I + 64D (L1); 1536 entries, 8-way (L2)
Page Support	4 KB, 64 KB, 16 MB, 16 GB	4 KB, 2/4 MB, 1 GB	4 KB, 2/4 MB, 1 GB

Both CPUs are very wide, brawny out-of-order (OoO) designs, especially compared to the ARM server SoCs.

Despite the lower decode and issue width, Intel has gone a little further than IBM in optimizing single-threaded performance. Notice that the POWER8 has neither a loop stream detector nor a µop cache to reduce the branch misprediction penalty. Furthermore, the load buffers of the Intel microarchitecture are deeper, and the total number of instructions in flight for one thread is higher. The TLB of the IBM POWER8 has more entries, while Intel favors speedy address translation by offering a small level one TLB backed by an L2 TLB. Such a small TLB is less effective if many threads are working on huge amounts of data, but it favors a single thread that needs fast virtual-to-physical address translation.
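To put the TLB entry counts in perspective, here is a rough sketch of "TLB reach": the amount of memory a TLB can map without a miss, computed as entries times page size. The entry counts and page sizes come from the table above; the function and dictionary names are ours, and the calculation assumes that every entry can hold a page of the given size, which real hardware does not guarantee for every page size.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size_bytes


# Entry counts as listed in the comparison table (names are illustrative).
TLB_ENTRIES = {
    "POWER8 TLB": 2048,
    "Broadwell L1 data TLB": 64,
    "Broadwell L2 TLB": 1024,
    "Skylake L2 TLB": 1536,
}


def reach_table(page_size_bytes: int) -> dict:
    """Reach of each TLB above, assuming a single uniform page size."""
    return {
        name: tlb_reach_bytes(entries, page_size_bytes)
        for name, entries in TLB_ENTRIES.items()
    }


if __name__ == "__main__":
    for name, reach in reach_table(4 * KIB).items():
        print(f"{name}: {reach // KIB} KiB with 4 KB pages")
    # 64 KB pages are listed for POWER8 only.
    power8_64k = tlb_reach_bytes(TLB_ENTRIES["POWER8 TLB"], 64 * KIB)
    print(f"POWER8 TLB: {power8_64k // MIB} MiB with 64 KB pages")
```

Under these assumptions the small L1 data TLB covers only a few hundred KiB with 4 KB pages, which is why it is quick to look up but leans on the L2 TLB once the working set grows.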

On the flip side of the coin, IBM has done its homework to make sure that 2-4 threads can really boost the performance of the chip, while Intel's choices may still lead to relatively small SMT-related performance gains in quite a few applications. For example, on Intel's cores the instruction TLB, µop cache (Decode Stream Buffer) and instruction issue queues are divided in two when two threads are active. This will reduce the hit rate in the µop cache, and the 16-byte fetch looks a little on the small side. Let us see what IBM did to make sure a second thread can result in a more significant performance boost.
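As a back-of-the-envelope illustration of that split, the sketch below halves the Broadwell figures from the table when a second thread becomes active. It assumes a plain even split as described above; the names are ours, and the real partitioning logic in the hardware may be more subtle.

```python
def per_thread_share(total_entries: int, active_threads: int) -> int:
    """Entries available to each thread under an even static split."""
    if active_threads < 1:
        raise ValueError("active_threads must be at least 1")
    return total_entries // active_threads


# Broadwell figures taken from the comparison table.
BROADWELL = {
    "instruction TLB entries": 128,
    "issue queue entries": 60,
}


def smt_view(resources: dict, active_threads: int) -> dict:
    """What a single thread sees of each resource for a given thread count."""
    return {
        name: per_thread_share(total, active_threads)
        for name, total in resources.items()
    }


if __name__ == "__main__":
    for threads in (1, 2):
        print(f"{threads} thread(s) active: {smt_view(BROADWELL, threads)}")
```

With two threads active, each thread is left with 64 instruction TLB entries and 30 issue queue entries under this simple model.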

124 Comments

zodiacfml - Thursday, July 21, 2016 - link
Like a good TV series, I can't wait for the next episode.
aryonoco - Friday, July 22, 2016 - link
OK, this is literally why Anandtech is the best in the tech journalism industry.

There is nowhere else on the net that you can find a head to head comparison between POWER and Xeon, and unless you work in the tech department of a Fortune 500 company, this information has just not been available, until now.

Johan, thank you for your work on this article. I did give you beef in your previous article about using LE Ubuntu but I concede your point. Very happy that you are writing more for Anandtech these days.

Xeons really need some competition. Whether that competition comes from POWER or ARM or Zen, I am happy to see some competition. IBM has big plans for POWER9. Hopefully this is just the start of things to come.
JohanAnandtech - Friday, July 22, 2016 - link
Thanks! It is very exciting to perform benchmarks that nobody has published yet :-).

In hindsight, I have to admit that the first article contained too few benchmarks that really mattered for POWER8. Most of our usual testing and scripting did not work, and so after a lot of tinkering, swearing and sweat I got some benchmarks working on this "exotic to me" platform. The contrast between what one would expect to see on POWER8 and me being proud of being able to somewhat "tame the beast" could not have been greater :-). In other words, there was a learning curve.
tipoo - Friday, July 22, 2016 - link
I found it very interesting as well and would certainly not mind seeing more from this space, like maybe Xeon Phi and SPARC M7
jospoortvliet - Tuesday, July 26, 2016 - link
Amen. But, to not ask too much, just the prospect of part 2 of the Power benchmark is already super exciting. Yes, the Internetz need more of this!
Daniel Egger - Friday, July 22, 2016 - link
Not quite sure what the endianness of a system adds to the competitive factor. Maybe someone could elaborate why it is so important to run a system in LE?
ZeDestructor - Friday, July 22, 2016 - link
Not much, really, with the compilers being good and all that.

Really, it's quite clearly there just for some excellent alliteration.
JohanAnandtech - Friday, July 22, 2016 - link
Basically, LE lowers the barrier to integrating an IBM server into an x86-dominated datacenter.

see https://www.ibm.com/developerworks/community/blogs...

Just a few reasons:

"Numerous clients, software partners, and IBM’s own software developers have told us that porting their software to Power becomes simpler if the Linux environment on Power supports little endian mode, more closely matching the environment provided by Linux on x86. This new level of support will *** lower the barrier to entry for porting Linux on x86 software to Linux on Power **."

"A system accelerator programmer (GPU or FPGA) who needs to share memory with applications running in the system processor must share data in an pre-determined endianness for correct application functionality."
Daniel Egger - Friday, July 22, 2016 - link
While correct in theory, this hasn't been a problem for the last 20 years. People are used to using BE on PPC/POWER; the software, the drivers and the infrastructure are very mature (as a matter of fact, it was my job 15 years ago to make sure they are). PPC/POWER actually has configurable endianness, so if someone wanted to go LE earlier it would have easily been possible, but only a few ever attempted that stunt; so why have the big disruption now?
KAlmquist - Friday, July 22, 2016 - link
I assume that this is about selling POWER boxes to companies that currently run all x86 servers, and have a bunch of custom software that they might be willing to recompile. If the customer has to spend a bunch of time fixing endian dependencies in his software in order to get it to work on POWER, it will probably be less expensive for them to simply stick with x86.
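
For readers wondering what an "endian dependency" of the kind discussed in this thread looks like, here is a minimal sketch using Python's standard `struct` module. The function names are hypothetical and the example is only an illustration, not code from any of the software mentioned above: it shows a 32-bit value written in little-endian byte order being misread when the reader assumes big-endian order, and the portable fix of stating the byte order explicitly.

```python
import struct


def write_u32_le(value: int) -> bytes:
    """Bytes of a 32-bit unsigned integer in little-endian order,
    i.e. the in-memory layout on x86 or on POWER8 in LE mode."""
    return struct.pack("<I", value)


def read_u32_as_big_endian(data: bytes) -> int:
    """What code that assumes big-endian order would read from the same bytes."""
    return struct.unpack(">I", data)[0]


def read_u32_portable(data: bytes) -> int:
    """Decode with an explicit byte order; the result does not depend on the host."""
    return int.from_bytes(data, "little")


if __name__ == "__main__":
    raw = write_u32_le(0x0A0B0C0D)
    print(raw.hex())                          # byte layout as written
    print(hex(read_u32_as_big_endian(raw)))   # bytes reversed: wrong value
    print(hex(read_u32_portable(raw)))        # original value recovered
```

Code that dumps raw structures to disk or to a shared buffer without stating the byte order behaves like the first reader, which is the kind of fix a port between BE and LE systems would otherwise require.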
