Assessing Cavium's ThunderX2: The Arm Server Dream Realized At Last

Author: Johan De Gelas

by Johan De Gelas on May 23, 2018 9:00 AM EST


Single-Threaded Integer Performance: SPEC CPU2006

Getting down to measuring actual compute performance, we'll start with the SPEC CPU2006 suite. Astute readers will point out that SPEC CPU2006 is now outdated as SPEC CPU2017 has arrived. But due to the limited testing time and the fact that we could not retest the ThunderX, we decided to stick with CPU2006.

Given that SPEC is almost as much of a compiler benchmark as it is a hardware benchmark, we believe it's important to lay out our testing philosophy here: using specific flags and other compiler settings just to inflate a benchmark's score does not lead to meaningful comparisons. So we want to keep the settings as "real world" as possible, using the following (and we welcome constructive criticism on the matter):

64 bit gcc: most used compiler on Linux, good all round compiler that does not try to "break" benchmarks (libquantum...)
-Ofast: compiler optimization that many developers may use
-fno-strict-aliasing: necessary to compile some of the subtests
base run: every subtest is compiled in the same way.

The first objective is to measure performance in applications where for some reason – as is frequently the case – a "multi-threading unfriendly" task keeps us waiting. Our second objective is to understand how well the ThunderX2 OoO architecture deals with a single thread compared to Intel's Skylake architecture. Keep in mind that this specific Skylake model can boost to 3.8 GHz. The chip will run at 2.8 GHz in almost all situations (28 threads active), and will sustain 3.4 GHz with 14 active threads.

Overall, Cavium positions the ThunderX2 CN9980 ($1795) as being "better than the 6148" ($3072), a CPU that runs at 2.6 GHz (20 threads) and reaches 3.3 GHz without much trouble (up to 16 threads active). As a result, the Intel SKUs will have a sizable 30% clock advantage in many situations (3.3GHz vs 2.5GHz).

Cavium makes up for this clockspeed deficit by offering up to 60% more cores (32 cores) than the Xeon 6148 (20 cores). But we must note that higher core counts will result in diminishing returns in many applications (e.g. Amdahl's law). So if Cavium wants to threaten Intel's dominant position with the ThunderX2, each core needs to at least offer competitive performance on a clock-for-clock basis. Or in this case, the ThunderX2 should deliver at least 66% (2.5 GHz vs 3.8 GHz) of the single-threaded performance of the Skylake. If that is not the case, Cavium must hope that the 4-way SMT bridges the gap.
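To make the arithmetic behind these figures explicit, here is a minimal Python sketch. The clock speeds and core counts are the ones quoted above; the parallel fractions fed into Amdahl's law are hypothetical values chosen purely for illustration, not measurements of any real workload.

```python
# Figures quoted in the text; nothing here is measured.
THUNDERX2_GHZ = 2.5
XEON_6148_TURBO_GHZ = 3.3   # Xeon 6148 with up to 16 threads active
XEON_8176_BOOST_GHZ = 3.8   # single-thread boost of the tested Skylake
THUNDERX2_CORES = 32
XEON_6148_CORES = 20


def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup over a single core for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


clock_advantage = XEON_6148_TURBO_GHZ / THUNDERX2_GHZ - 1.0
core_advantage = THUNDERX2_CORES / XEON_6148_CORES - 1.0
parity_threshold = THUNDERX2_GHZ / XEON_8176_BOOST_GHZ

print(f"Intel clock advantage:      {clock_advantage:.0%}")
print(f"Cavium core-count advantage: {core_advantage:.0%}")
print(f"Clock-for-clock parity at:   {parity_threshold:.0%} of Skylake")

# Hypothetical parallel fractions (assumptions, for illustration only).
for p in (0.50, 0.90, 0.99):
    gain = amdahl_speedup(p, THUNDERX2_CORES) / amdahl_speedup(p, XEON_6148_CORES)
    print(f"parallel fraction {p:.2f}: 32 vs 20 cores gives {gain:.2f}x")
```

Under these assumptions, the gain from 60% more cores stays below 1.6x for every parallel fraction short of 100%, which is the diminishing-returns effect referred to above. Clock speed and per-core throughput differences are deliberately left out of the Amdahl part of the sketch.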

SPEC CPU2006: Single-Threaded
Subtest SPEC CPU2006 Integer	Application Type	Cavium ThunderX 2 GHz gcc 5.2	Cavium ThunderX2 @2.5 GHz gcc 7.2	Xeon 8176 @3.8 GHz gcc 7.2	ThunderX2 vs Xeon 8176
400.perlbench	Spam filter	8.3	20.1	46.4	43%
401.bzip2	Compression	6.5	14	25	56%
403.gcc	Compiling	10.8	26.7	31	86%
429.mcf	Vehicle scheduling	10.2	44.5	40.6	110%
445.gobmk	Game AI	9.2	15.7	27.6	57%
456.hmmer	Protein seq. analyses	4.8	22.2	35.6	62%
458.sjeng	Chess	8.8	15.8	30.8	51%
462.libquantum	Quantum sim	5.8	76.4	86.2	89%
464.h264ref	Video encoding	11.9	26.7	64.5	49%
471.omnetpp	Network sim	7.3	26.4	37.9	70%
473.astar	Pathfinding	7.9	15.6	24.7	63%
483.xalancbmk	XML processing	8.4	27.7	63.7	43%
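
As a rough way to read the last column against the 66% clock-for-clock threshold discussed earlier, the sketch below takes the published "ThunderX2 vs Xeon 8176" percentages as-is and checks which subtests reach the threshold. The geometric mean is our own choice of summary statistic for this illustration, not a figure from the table.

```python
import math

# "ThunderX2 vs Xeon 8176" column, copied from the table above (percent).
TX2_VS_XEON_PCT = {
    "400.perlbench": 43,
    "401.bzip2": 56,
    "403.gcc": 86,
    "429.mcf": 110,
    "445.gobmk": 57,
    "456.hmmer": 62,
    "458.sjeng": 51,
    "462.libquantum": 89,
    "464.h264ref": 49,
    "471.omnetpp": 70,
    "473.astar": 63,
    "483.xalancbmk": 43,
}

# Clock-for-clock parity: 2.5 GHz ThunderX2 vs 3.8 GHz Xeon 8176.
PARITY_THRESHOLD_PCT = 100 * 2.5 / 3.8

at_or_above = sorted(
    name for name, pct in TX2_VS_XEON_PCT.items() if pct >= PARITY_THRESHOLD_PCT
)

# Geometric mean of the listed percentages (an assumption of this sketch).
geomean_pct = math.exp(
    sum(math.log(pct) for pct in TX2_VS_XEON_PCT.values()) / len(TX2_VS_XEON_PCT)
)

print(f"Parity threshold: {PARITY_THRESHOLD_PCT:.1f}%")
print(f"Subtests at or above threshold: {', '.join(at_or_above)}")
print(f"Geometric mean of listed ratios: {geomean_pct:.1f}%")
```

Because the inputs are the rounded percentages from the table, the result is only as precise as that column.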

Without having the opportunity to do any profiling on the ThunderX2, we must humbly admit that we have to speculate a bit based on what we have read so far about these benchmarks. Furthermore, since the ThunderX2 is running ARMv8 (AArch64) code and the Xeon runs x86-64 code, the picture gets even blurrier.

The pointer chasing benchmarks – XML processing (which also needs large OoO buffers) and pathfinding – which typically depend on a large L3-cache to lower the impact of access latency, are the worst performing on the ThunderX2. We can assume that the higher latency of the DRAM system is hurting performance.

The workloads where the impact of branch prediction is higher (at least on x86-64: a higher percentage of branch misses) – gobmk, sjeng, hmmer – are not top performers either on the ThunderX2.

It's also worth noting that perlbench, gobmk, hmmer, and the instruction part of h264ref are all known to benefit from the larger L2-cache (512 KB) of Skylake. We are only giving you a few puzzle pieces, but together they might help to make some educated guesses.

On the positive side, the ThunderX2 performs well on gcc, which runs mostly inside the L1 and L2-cache (thus relying on a low latency L2) and where the performance impact of the branch predictor is minimal. Overall the best subtest for the ThunderX2 is mcf (vehicle scheduling in public mass transportation), which is known to miss the L1 data cache almost completely, relying a lot on the L2-cache, which is pretty fast on the ThunderX2. Mcf also demands quite a bit of memory bandwidth. Libquantum is the subtest with the highest memory bandwidth demand. The fact that Skylake offers rather mediocre single-threaded bandwidth is probably also a reason why the ThunderX2 is so competitive on libquantum and mcf.


97 Comments


Wilco1 - Wednesday, May 23, 2018 - link
You might want to study RISC and CISC first before making any claims. RISC doesn't use more instructions than CISC. Vector instructions are actually quite similar on most ISAs. In fact I would say the Neon ones are more powerful and more general due to being well designed rather than added ad-hoc.
HStewart - Wednesday, May 23, 2018 - link
The following site explains the difference using a simple multiply operation: what a CISC architecture can do in a single instruction, RISC would need multiple instructions for.

http://www.firmcodes.com/difference-risc-sics-arch...

Of course, as time moved on, RISC chips added more complex operations, and CISC also found ways to break more complex CISC instructions into smaller RISC-like microcode, increasing the chip's ability to multitask the pipeline.
Wilco1 - Thursday, May 24, 2018 - link
The example was about load/store architecture, not multiply. In reality almost all instructions use registers (even on CISCs) since memory is too slow, so it's not a good example of what happens in actual code. The number of executed instructions on large applications is actually very close. The key reason is that compilers avoid all the complex instructions on x86 and mostly use register operations, not memory.
Kevin G - Tuesday, May 29, 2018 - link
Raw instruction counts aren't a good metric to determine the difference between RISC and CISC, especially as both have evolved to include various SIMD and transactional extensions.

The big thing for RISC is that it only supports a handful of instruction formats, generally all of the same length (traditionally 4 bytes)*, and has alignment rules in place. x86 on the other hand leverages a series of prefixes to enhance instructions, which permits lengths up to 15 bytes. On the flip side, there are also x86 instructions that consume a single byte. This also means x86 doesn't have the alignment rules that RISC chips generally adhere to.
*ARM does offer some compressed instruction formats in Thumb/Thumb2, but those are also of a fixed length. 16-bit Thumb instructions are half the size of 32-bit ARM instructions and have alignment rules as well.

Modern x86 is radically different internally than its philosophical lineage. x86 instructions are broken down into micro-ops which are RISC-like in nature. These decoded instructions are now being cached to bypass the complex and power hungry decode stages. Compare this to some ARM cores where some instructions do not have to be decoded. While having a simpler decode doesn't directly help with performance, it does impact power consumption.

However, I would differ and say that ARM's FPU and vector history has been rather troubled. Initially ARM didn't specify an FPU but rather a method to add coprocessors. This led to 3rd parties producing ARM cores with incompatible FPUs. It wasn't until recently that ARM themselves put their foot down and mandated NEON as the one to rule them all, especially in 64-bit mode.
peevee - Wednesday, May 23, 2018 - link
The whole RISC vs CISC distinction has been outdated for at least 20 years. Both now include a shi(p)load of instructions, far outnumbering original CISC processors like the 68000 and 8088 (from the epoch of the whole CISC vs RISC discussion), and both have a lot of architectural registers (which on speculative OoO CPUs are not even the same as real register files). ARMv8 for example includes NEON instructions, which is like... "AVX-128" (or SSE3 or smth).

A lot of instructions means that both have to have huge decoders, which limits how small the CPU can be (because any reduction in other hardware decreases performance faster than cost). For 64-bit ARMv8.2 it is very unlikely that an implementation can be made smaller than the A55, and it is a huge core (in transistors) compared to even the Pentium, let alone the 8088.
HStewart - Wednesday, May 23, 2018 - link
I think the big difference is in SIMD technologies: even though ARM has included them, the instructions are not as wide as Intel's or AMD's. The following link appears to have a good comparison of chips' SIMD width. To me it looks like AMD is at AVX level 8/16 instead of 16/32 in current chips, while ARM including Neon is 4 wide, which is actually less than Core 2 SSE instructions from 10 years ago.

https://stackoverflow.com/questions/15655835/flops...

It is also interesting to note the Ryzen stats; I heard that AMD implements AVX 256 by combining two 128s together.

One thing is that both Intel and AMD CPUs have come a long way since 20 years ago. In fact, even today's Atoms can outrun most Core 2 CPUs from 10 years ago - not my Xeon 5160, however.
ZolaIII - Thursday, May 24, 2018 - link
It's 2x128 NEON SIMD per ARM A75 core which goes into your smartphone.
Even with smaller SIMD utilising TBL, QC Centriq is able to beat up a Xeon Gold.
https://blog.cloudflare.com/neon-is-the-new-black/
Wilco1 - Thursday, May 24, 2018 - link
Modern Arm cores have 2-3 128-bit SIMD units, so 16-24 SP FLOPS/cycle. About half of Skylake theoretical flops, and yet they can match or beat Skylake on many HPC codes. Size is not everything...
peevee - Thursday, May 24, 2018 - link
"ARM including Neon is 4 Wide which is actually less than Core 2 SSE instructions from 10 years ago"

How is it less? It is the same 128 bits, 2x64 or 4x32 or 8x16...

And AMD combines 2 AVX-256 operations (not 2 128-bit SSEs) to get AVX-512.
patrickjp93 - Friday, May 25, 2018 - link
AMD does NOT have AVX-512. They combine 2 128s into a 256 on Ryzen, ThreadRipper, and Epyc.
