Assessing Cavium's ThunderX2: The Arm Server Dream Realized At Last

Author: Johan De Gelas

by Johan De Gelas on May 23, 2018 9:00 AM EST

Single-Threaded Integer Performance: SPEC CPU2006

Getting down to measuring actual compute performance, we'll start with the SPEC CPU2006 suite. Astute readers will point out that SPEC CPU2006 is now outdated as SPEC CPU2017 has arrived. But due to the limited testing time and the fact that we could not retest the ThunderX, we decided to stick with CPU2006.

Given that SPEC is almost as much of a compiler benchmark as it is a hardware benchmark, we believe it's important to lay out our testing philosophy here. In our view, using specific flags and other compiler settings just to inflate a benchmark's score does not lead to meaningful comparisons. So we want to keep the settings as "real world" as possible, and we welcome constructive criticism on the matter. We used the following settings:

64 bit gcc: most used compiler on Linux, good all round compiler that does not try to "break" benchmarks (libquantum...)
-Ofast: compiler optimization that many developers may use
-fno-strict-aliasing: necessary to compile some of the subtests
base run: every subtest is compiled in the same way.

The first objective is to measure performance in applications where for some reason – as is frequently the case – a "multi-threading unfriendly" task keeps us waiting. Our second objective is to understand how well the ThunderX2 OoO architecture deals with a single thread compared to Intel's Skylake architecture. Keep in mind that this specific Skylake chip, the Xeon 8176, can boost to 3.8 GHz. It will run at 2.8 GHz in almost all situations (28 threads active), and will sustain 3.4 GHz with 14 active threads.

Overall, Cavium positions the ThunderX2 CN9980 ($1795) as being "better than the 6148" ($3072), a CPU that runs at 2.6 GHz with 20 threads active and reaches 3.3 GHz without much trouble (up to 16 threads active). As a result, the Intel SKUs will have a sizable 30% clock advantage in many situations (3.3 GHz vs 2.5 GHz).

Cavium makes up for this clockspeed deficit by offering up to 60% more cores (32 cores) than the Xeon 6148 (20 cores). But we must note that higher core counts will result in diminishing returns in many applications (see Amdahl's law). So if Cavium wants to threaten Intel's dominant position with the ThunderX2, each core needs to offer at least competitive performance on a clock-for-clock basis. In this case, that means the ThunderX2 should deliver at least 66% (2.5 GHz vs 3.8 GHz) of the single-threaded performance of the Skylake. If that is not the case, Cavium must hope that the 4-way SMT bridges the gap.
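
The arithmetic behind the 66% target, and the diminishing returns from extra cores, can be sketched in a few lines of Python. This is only an idealized model: the clock speeds and core counts come from the text, but the parallel fractions are hypothetical values chosen for illustration, and the model ignores SMT, memory bandwidth, and turbo behavior.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over a single core under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Clock speeds quoted in the text (GHz).
THUNDERX2_CLOCK = 2.5
XEON_8176_BOOST = 3.8

# Share of Skylake's single-threaded performance that equals
# clock-for-clock parity for the ThunderX2.
parity_target = THUNDERX2_CLOCK / XEON_8176_BOOST

# Core counts quoted in the text.
THUNDERX2_CORES = 32
XEON_6148_CORES = 20
core_advantage = THUNDERX2_CORES / XEON_6148_CORES - 1.0

if __name__ == "__main__":
    print(f"parity target: {parity_target:.0%}")
    print(f"core advantage: {core_advantage:.0%}")
    # Hypothetical parallel fractions, not measured values.
    for p in (0.50, 0.90, 0.99):
        gain = amdahl_speedup(p, THUNDERX2_CORES) / amdahl_speedup(p, XEON_6148_CORES)
        print(f"parallel fraction {p:.2f}: 32 vs 20 cores = {gain:.2f}x")
```

Under this model, even a workload that is 99% parallel turns 60% more cores into only about a 1.45x gain, which is why per-core performance still matters.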

SPEC CPU2006: Single-Threaded
Subtest SPEC CPU2006 Integer	Application Type	Cavium ThunderX 2 GHz gcc 5.2	Cavium ThunderX2 @2.5 GHz gcc 7.2	Xeon 8176 @3.8 GHz gcc 7.2	ThunderX2 vs Xeon 8176
400.perlbench	Spam filter	8.3	20.1	46.4	43%
401.bzip2	Compression	6.5	14	25	56%
403.gcc	Compiling	10.8	26.7	31	86%
429.mcf	Vehicle scheduling	10.2	44.5	40.6	110%
445.gobmk	Game AI	9.2	15.7	27.6	57%
456.hmmer	Protein seq. analyses	4.8	22.2	35.6	62%
458.sjeng	Chess	8.8	15.8	30.8	51%
462.libquantum	Quantum sim	5.8	76.4	86.2	89%
464.h264ref	Video encoding	11.9	26.7	64.5	49%
471.omnetpp	Network sim	7.3	26.4	37.9	70%
473.astar	Pathfinding	7.9	15.6	24.7	63%
483.xalancbmk	XML processing	8.4	27.7	63.7	43%
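
The last column is simply the ThunderX2 score divided by the Xeon 8176 score. A minimal Python sketch recomputes that column from the scores in the table. The helper names are hypothetical, and the geometric mean is added here only as one possible summary; it is not a figure from the original results.

```python
from math import prod

# (ThunderX2 @ 2.5 GHz, Xeon 8176 @ 3.8 GHz) scores copied from the table.
SCORES = {
    "400.perlbench": (20.1, 46.4),
    "401.bzip2": (14.0, 25.0),
    "403.gcc": (26.7, 31.0),
    "429.mcf": (44.5, 40.6),
    "445.gobmk": (15.7, 27.6),
    "456.hmmer": (22.2, 35.6),
    "458.sjeng": (15.8, 30.8),
    "462.libquantum": (76.4, 86.2),
    "464.h264ref": (26.7, 64.5),
    "471.omnetpp": (26.4, 37.9),
    "473.astar": (15.6, 24.7),
    "483.xalancbmk": (27.7, 63.7),
}


def relative_performance(scores):
    """Return ThunderX2 performance as a fraction of the Xeon score per subtest."""
    return {name: tx2 / xeon for name, (tx2, xeon) in scores.items()}


def geometric_mean(values):
    """Geometric mean, the usual way to average benchmark ratios."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


if __name__ == "__main__":
    ratios = relative_performance(SCORES)
    for name, ratio in ratios.items():
        print(f"{name:16s} {ratio:.0%}")
    print(f"{'geometric mean':16s} {geometric_mean(ratios.values()):.0%}")
```

Recomputed this way, 464.h264ref comes out at about 41% rather than the 49% printed, so one of the figures in that row may be a typo.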

Without having the opportunity to do any profiling on the ThunderX2, we must humbly admit that we have to speculate a bit based on what we have read so far about these benchmarks. Furthermore, since the ThunderX2 is running ARMv8 (AArch64) code and the Xeon runs x86-64 code, the picture gets even blurrier.

The pointer-chasing benchmarks – XML processing (which also needs large OoO buffers) and pathfinding – typically depend on a large L3-cache to lower the impact of access latency, and they are among the worst performing on the ThunderX2. We can assume that the higher latency of the DRAM system is hurting performance.

The workloads where the impact of branch prediction is higher (at least on x86-64: a higher percentage of branch misses) – gobmk, sjeng, hmmer – are not top performers either on the ThunderX2.

It's also worth noting that perlbench, gobmk, hmmer, and the instruction part of h264ref are all known to benefit from the larger L2-cache (512 KB) of Skylake. We are only giving you a few puzzle pieces, but together they might help to make some educated guesses.

On the positive side, the ThunderX2 performs well on gcc, which runs mostly inside the L1 and L2-cache (thus relying on a low latency L2) and where the performance impact of the branch predictor is minimal. Overall the best subtest for the ThunderX2 is mcf (vehicle scheduling in public mass transportation), which is known to miss the L1 data cache almost completely, relying a lot on the L2-cache, which is pretty fast on the ThunderX2. Mcf also demands quite a bit of memory bandwidth, and libquantum has the highest memory bandwidth demand of all the subtests. The fact that Skylake offers rather mediocre single-threaded bandwidth is probably also a reason why the ThunderX2 is so competitive on libquantum and mcf.

97 Comments

Davenreturns - Wednesday, May 23, 2018 - link
In the spec table for the AMD EPYC 7601 you have max sockets 4 and PCIe 3.0 lanes as 64. I thought the max sockets was 2 and that the total number of PCIe 3.0 lanes was 128 (64 in a dual socket machine).
davegraham - Wednesday, May 23, 2018 - link
max sockets is 2 and PCIe lanes is 128 (64 from each 7601 for a combined total of 128; remember, each 7601 has 128 PCIe lanes by themselves. 64 from each are ganged together for IF in a 2P system).
davegraham - Wednesday, May 23, 2018 - link
*are not *is
Davenreturns - Wednesday, May 23, 2018 - link
But in a single socket motherboard system, the total PCIe lanes available from one EPYC processor is 128 which I think we are both saying is correct.
Davenreturns - Wednesday, May 23, 2018 - link
The reason I think these two corrections are important and should be addressed by the author is the way the players in the market are competing. The table should read 128 PCIe lanes and 2 sockets max for EPYC. One only needs to look at AMD's EPYC One socket page to understand why it is important.

https://www.amd.com/en/products/epyc-7000-series-1...

The page is filled with marketing trying to convince customers that you are actually getting a two socket server in just one socket. And yes 128 PCIe lanes are available to the customer in these one socket products as part of the reasoning.

The max number of sockets is also important. AMD and probably Cavium are both arguing that 90% of the market only needs 1 or 2 sockets. Intel doesn't agree and provides 4 or more socket configurations.

The one socket argument centers around the I/O and memory channels available in the AMD processor. Even though the table just might have typos, reviewers around the web had a hard time believing that a single chip offered 128 lanes of PCIe connectivity and I found a lot of misinformation. It continues today.
DanNeely - Wednesday, May 23, 2018 - link
AFAIK even for intel 1/2 socket machines are around 90% of their sales. They're just selling enough total server chips in total that catering to the sliver of the market that does want 4/8way configurations is still worth their time.
Arnulf - Sunday, May 27, 2018 - link
Profit margins in that market segment are likely to be way higher so it's worth it for Intel as long as there is no competition, forcing prices downwards.
Ryan Smith - Wednesday, May 23, 2018 - link
You are correct. Thanks for pointing that out.
Davenreturns - Wednesday, May 23, 2018 - link
Thanks so much, Ryan.
vanilla_gorilla - Wednesday, May 23, 2018 - link
"This is because the customers who have invested in expensive enterprise software (Oracle, SAP) are less sensitive to cost on the hardware side, so they are much less likely to change to a new hardware platform."

I don't really follow the logic here. Just because you spend a lot more money on software doesn't mean you wouldn't try to save money on hardware. You don't only focus on one related expense because it's larger.
