Comparing With the Other ARMs

We did not have access to any recent Cortex-A57 or X-Gene platform to run the full SPEC CPU2006 suite. But we can still combine our previous findings with those that have been published on the 7-cpu.com. The first X-Gene 1 result is our own measurement, the second one is the best we could find.

SKU Clock Baseline Xeon D Compress Baseline Xeon D Decompress
Atom C2720 2.4 1687 2114
X-Gene 1 (AT bench) 2.4 1580 1864
X-Gene 1 (best) 2.4 1770 1980
Cortex-A57 1.9 1500 2330
ThunderX 2.0 1547 2042
Xeon D1557 1.5-2.1 3079 2320
Xeon E5-2640 v4 2.4-2.6 3755 2943
Xeon E5-2690 v3 2.6-3.5 4599 3811

Let's translate this to percentages, where we compare the Thunder-X performance to the Xeon D and the Cortex-A57, two architectures it must try to beat. The first one is to open a broader market, the second one to justify the development of a homegrown ARMv8 microarchitecture.

SKU Clock Baseline Xeon D Compress Baseline Xeon D Decompress Baseline A57 Compress Baseline A57 Decompress
Atom C2720 2.4 55% 91% 112% 91%
X-Gene (AT bench) 2.4 51% 80% 105% 80%
X-Gene (best) 2.4 57% 85% 118% 85%
Cortex-A57 1.9 49% 100% 100% 100%
ThunderX 2.0 50% 88% 103% 88%
Xeon D1557 2.1 100% 100% 205% 100%
Xeon E5-2640 v4 2.4 122% 127% 250% 126%
Xeon E5-2690 v3 3.5 149% 164% 307% 164%

First of all, these benchmarks should be placed in perspective: they tend to have a different profile than most server applications. For example compression relies a lot on memory latency and TLB efficiency. Decompression relies on integer instructions (shift, multiply). Since this test has unpredictable branches, the ThunderX has an advantage.

The ThunderX at 2 GHz performs more or less like an A57 core at the same speed. Considering that AMD only got eight A57 cores inside a power envelope of 32W using similar process technology, you could imagine that a A57 chip would be able to fit 32 cores at the most in a 120W TDP envelope. So Cavium did quite well fitting about 50% more cores inside the same power envelope using an old 28 nm high-k metal gate process.

Nevertheless, a 120W Xeon E5 offers about 2.5-3 times higher compression performance. The gap is indeed much smaller in decompression, where the wide Broadwell core is only 13% (!) faster than the narrow ThunderX core (compare the Xeon D-1557 with the ThunderX).

Multi-Threaded Integer Performance: SPEC CPU2006 Compression & Decompression
Comments Locked

82 Comments

View All Comments

  • vivs26 - Wednesday, June 15, 2016 - link

    Not necessarily - (read Amdahl's law of diminishing returns). The performance actually depends on the workload. Having a million cores guarantees nothing in terms of performance unless the workload is parallelizable which in the real world is not as much as we think it could be. I'm curious to see how xeon merged with altera programmable fabric performs than ARM on a server.
  • maxxbot - Wednesday, June 22, 2016 - link

    Technically true but every generation that millstone gets a little smaller, the die area and power needed to translate x86 into uops isn't huge and reduces every generation.
  • jardows2 - Wednesday, June 15, 2016 - link

    Interesting. Faster in a few workloads where heavy use of multi-thread is important, but significantly slower in more single thread workloads. For server use, you don't always want parallelized tasks. The results are pretty much across the board for all the processors tested: If the ThunderX was slower, it was slower than all the Intel chips. If it were faster, it was faster than all but the highest end Intel Chips. With the price only being slightly lower than the cheapest Intel chip being sold, I don't think this is going to be a Xeon competitor at all, but will take a few niche applications where it can do better.

    With no significant energy savings, we should be looking forward to the ThunderX2 to see if it will bring this into a better alternative.
  • ddriver - Wednesday, June 15, 2016 - link

    There is hardly a server workload where you don't get better throughput by throwing more cores and servers at it. Servers are NOT about parallelized task, but about concurrent tasks. That's why while desktops are still stuck at 8 cores, server chips come with 20 and more... Server workloads are usually very simple, it is just that there is a lot of them. They are so simple and take so little time it literally makes no sense parallelizing them.
  • jardows2 - Wednesday, June 15, 2016 - link

    In the scenario you described, the single-thread performance takes on even more importance, thus highlighting the advantage the Xeon's currently have in most server configurations.
  • niva - Wednesday, June 15, 2016 - link

    Not if the Xeon doesn't have enough cores to actually process 40+ singlethreaded tasks con-currently.
  • hechacker1 - Wednesday, June 15, 2016 - link

    But kernels and VMWare know how to schedule multiple threads on 1 core if it's not being fully utilized. Single threaded IPC can make up for not having as many cores. See the iPhone SoCs for another example.
  • ddriver - Wednesday, June 15, 2016 - link

    Not if you have thousands of concurrent workloads and only like 8 cores. As fast as each core might be, the overhead from workload context switching will eat it up.
  • willis936 - Thursday, June 16, 2016 - link

    Yeah if each task is not significantly longer than a context switch. Context switches are very fast, especially with processors with many sets of SMT registers per core.
  • ddriver - Thursday, June 16, 2016 - link

    If what you suggest is correct, then intel would not be investing chip TDP in more cores but higher clocks and better single threaded performance. Clearly this is not the case, as they are pushing 20 cores at the fairly modest 2.4 Ghz.

Log in

Don't have an account? Sign up now