Comparing with Intel's Best

Comparing CPUs in tables is always a risky game: those simple numbers hide a lot of nuance and trade-offs. But if we approach it with caution, we can still extract quite a bit of information from such a table.

| Feature | IBM POWER8 | Intel Broadwell (Xeon E5 v4) | Intel Skylake |
|---|---|---|---|
| L1-I cache (associativity) | 32 KB (8-way) | 32 KB (8-way) | 32 KB (8-way) |
| L1-D cache (associativity) | 64 KB (8-way) | 32 KB (8-way) | 32 KB (8-way) |
| Outstanding L1-cache misses | 16 | 10 | 10 |
| Fetch Width | 8 instructions | 16 bytes (+/- 4-5 x86) | 16 bytes (+/- 4-5 x86) |
| Decode Width | 8 | 4 µops | 5-6 µops (on a µop cache hit) |
| Issue Queue | 64 + 15 branch + 8 CR = 87 | 60 unified | 97 unified |
| Issue Width/Cycle | 10 | 8 | 8 |
| Instructions in Flight | 224 (GCT, SMT-8 mode) | 192 (ROB) | 224 (ROB) |
| Architectural registers | 32 (ST), 2x32 (SMT-2) | 16 | 16 |
| Rename registers | 92 (ST), 2x92 (SMT-2) | 168 | 180 |
| Loads per cycle | 4 | 2 | 2 |
| Load bandwidth (per unit) | 16B/cycle | 32B/cycle | 32B/cycle |
| Load queue size | 44 entries | 72 entries | 72 entries |
| Stores per cycle | 2 | 1 | 1 |
| Store bandwidth (per unit) | 16B/cycle | 32B/cycle | 32B/cycle |
| Store queue size | 40 entries | 42 entries | 56 entries |
| Int. Pipeline Length | 18 stages | 19 stages (14 from the µop cache) | 19 stages (14 from the µop cache) |
| TLB | 2048 entries, 4-way | 128I + 64D L1; 1024-entry, 8-way L2 | 128I + 64D L1; 1536-entry, 8-way L2 |
| Page Support | 4 KB, 64 KB, 16 MB, 16 GB | 4 KB, 2/4 MB, 1 GB | 4 KB, 2/4 MB, 1 GB |

Both CPUs are very wide, brawny out-of-order (OoO) designs, especially compared to the ARM server SoCs.

Despite its lower decode and issue width, Intel has gone a bit further than IBM in optimizing single-threaded performance. Notice that the POWER8 has neither a loop stream detector nor a µop cache to reduce the branch misprediction penalty. Furthermore, the load buffers of the Intel microarchitectures are deeper, and the total number of instructions in flight for one thread is higher. The TLB of the IBM POWER8 has more entries, while Intel favors speedy address translation by offering a small level-one TLB backed by an L2 TLB. Such a small L1 TLB is less effective when many threads are working on huge amounts of data, but it favors a single thread that needs fast virtual-to-physical address translation.
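To make the TLB trade-off a bit more concrete, here is a minimal pointer-chasing sketch in C. It is our own illustration, not a benchmark from this review: it touches one cache line per 4 KB page in random order, so once the working set spans more pages than the TLB hierarchy can map, the average access time climbs. The page size, working-set sizes and iteration count are assumptions chosen for illustration, and cache misses contribute to the slowdown as well as TLB misses.

```c
/*
 * tlb_chase.c - illustrative TLB-pressure sketch (not from the article).
 * One pointer slot per 4 KB page, linked into a single random cycle, so
 * every hop of the chase lands on a different page.
 * Build: gcc -O2 tlb_chase.c -o tlb_chase
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE   4096u                        /* assumed base page size   */
#define STRIDE (PAGE / sizeof(size_t))      /* size_t slots per page    */

/* Average nanoseconds per dependent load over a cycle of 'npages' pages. */
static double chase(size_t npages, size_t iters)
{
    size_t *next  = malloc(npages * PAGE);
    size_t *order = malloc(npages * sizeof(size_t));
    if (!next || !order) { perror("malloc"); exit(1); }

    /* Random permutation of the page indices (Fisher-Yates). */
    for (size_t i = 0; i < npages; i++) order[i] = i;
    for (size_t i = npages - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    /* Link the pages into one cycle in that random order. */
    for (size_t i = 0; i < npages; i++)
        next[order[i] * STRIDE] = order[(i + 1) % npages] * STRIDE;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = order[0] * STRIDE;
    for (size_t i = 0; i < iters; i++)      /* dependent loads, 1 per page */
        p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = p; (void)sink;   /* keep the chase alive */
    free(next); free(order);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void)
{
    /* From well inside to well beyond a few thousand TLB entries. */
    const size_t sizes[] = { 64, 512, 4096, 32768 };
    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++)
        printf("%6zu pages (%6zu KB): %6.1f ns/access\n",
               sizes[s], sizes[s] * PAGE / 1024, chase(sizes[s], 20000000));
    return 0;
}
```

On a machine with a larger TLB reach, for example through huge pages on x86 or the 64 KB pages and 2048-entry TLB of the POWER8, the knee in these numbers should move out to considerably larger working sets.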

On the flip side of the coin, IBM has done its homework to make sure that two to four threads can really boost the performance of the chip, while Intel's choices may still lead to relatively small SMT-related performance gains in quite a few applications. For example, the instruction TLB, µop cache (Decode Stream Buffer) and instruction issue queues are divided in two when two threads are active. This reduces the hit rate in the µop cache, and the 16-byte fetch window looks a little bit on the small side. A rough way to check such SMT scaling on your own hardware is sketched below. Let us see what IBM did to make sure a second thread can result in a more significant performance boost.
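To get a feel for how much a second hardware thread buys you on a given core, a quick-and-dirty experiment is to pin two threads to SMT siblings of the same core and compare aggregate throughput against a single-threaded run. The sketch below is our own Linux illustration, not something IBM or Intel ships: the sibling CPU numbers, the iteration count and the integer kernel are all assumptions, and the result depends heavily on the workload (a single dependency chain like this one leaves a wide core underutilized, so the SMT gain will look flattering).

```c
/*
 * smt_scaling.c - rough SMT-scaling sketch (illustration only).
 * Runs the same latency-bound integer loop on one, then two hardware
 * threads of the same core and reports the aggregate throughput.
 * Build: gcc -O2 -pthread smt_scaling.c -o smt_scaling
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 400000000ULL

static void *worker(void *arg)
{
    /* Pin this thread to the requested logical CPU. */
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Dependent multiply-add chain; volatile keeps it from being removed. */
    volatile uint64_t x = 1;
    for (uint64_t i = 0; i < ITERS; i++)
        x = x * 2862933555777941757ULL + 3037000493ULL;
    return NULL;
}

/* Run 'nthreads' workers on the given CPUs; return aggregate iterations/s. */
static double run(int *cpus, int nthreads)
{
    pthread_t tid[2];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)nthreads * ITERS / sec;
}

int main(void)
{
    /* Assumed SMT siblings of core 0; check
     * /sys/devices/system/cpu/cpu0/topology/thread_siblings_list. */
    int siblings[2] = { 0, 1 };

    double one = run(siblings, 1);
    double two = run(siblings, 2);
    printf("1 thread : %.2e iter/s\n", one);
    printf("2 threads: %.2e iter/s (SMT gain %.0f%%)\n",
           two, 100.0 * (two / one - 1.0));
    return 0;
}
```

Repeating the two-thread run with the second thread pinned to a different physical core instead of an SMT sibling gives a useful baseline for what scaling without shared, partitioned core resources looks like.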

Comments

  • JohanAnandtech - Thursday, July 28, 2016 - link

    Send me a mail at johan@anandtech.com
  • abufrejoval - Thursday, August 4, 2016 - link

    Hmm, a bit fuzzy after the first paragraph or so and evidently because I dislike malwaretizement: Such links should be banned!
  • mystic-pokemon - Friday, July 22, 2016 - link

    Hi floobit
    For virtualization: powerVM and out of the box KVM (tested on Fedora 23, Ubuntu 15.04 / 15.10 / 16.04) work quite well. Xen doesn't work well or hasn't been officially tested / released.
  • tipoo - Thursday, July 21, 2016 - link

    Fun! I was always curious about this processor.
  • tipoo - Thursday, July 21, 2016 - link

    Interesting that the L3 eDRAM not only allows them to pack in much more L3 (what was it, 3 SRAM transistors per eDRAM or something?), but it's also low latency which was a cited concern with eDARM by some people. Appears to be an unfounded fear.

    And then on top of that they put another large L4 eDRAM cache on.

    Maybe Intel needs to play with eDRAM more...
  • tipoo - Thursday, July 21, 2016 - link

    Lol, eDRAM, not eDARM
  • Kevin G - Thursday, July 21, 2016 - link

    There was a change in how the L4 cache works from Broadwell to SkyLake on the mobile parts. The implication is that Intel was exploring the idea of a large L4 eDRAM for SkyLake-EP/EX parts. We'll see how that turns out as Intel also has explored using HMC as a cache for high bandwidth applications in Knights Landing. So either way, Intel has this idea on their radar and we'll see how it pans out next year.
  • tsk2k - Thursday, July 21, 2016 - link

    Is it possible to run Windows on one of these?
  • ZeDestructor - Thursday, July 21, 2016 - link

    At the moment, a very solid no.

    That said, if enough partners ask for it and/or if the numbers make sense for Azure, MS will at the very least have a damn good look at porting Windows over.
  • DanNeely - Thursday, July 21, 2016 - link

    It's probably just a case of doing QA and releasing it. They've sold a PPC build in the past; and maintain internal builds for a number of other CPU architectures to avoid accidentally baking x86isms into the core code.
