The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

Author: Andrei Frumusanu

by Andrei Frumusanu on December 18, 2020 6:00 AM EST


Compiling LLVM, NAMD Performance

As we’re trying to rebuild our server test suite piece by piece – and there’s still a lot of work ahead to get a good, representative “real world” set of workloads – one benchmark that readers have strongly requested is a more realistic compilation suite. With the Chrome and LLVM codebases being the most requested, I landed on LLVM as it’s fairly easy and straightforward to set up.

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/11.x
mkdir ./build
cd ..
mkdir llvm-project-tmpfs
sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs
cp -r llvm-project/* llvm-project-tmpfs
cd ./llvm-project-tmpfs/build
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
  -DCMAKE_BUILD_TYPE=Release ../llvm
time cmake --build .

We’re using the LLVM 11.0.0 release as the build target version, and we’re compiling Clang, libc++, libc++abi, LLDB, Compiler-RT and LLD using GCC 10.2 (self-compiled). To avoid any concerns about I/O we’re building on a ramdisk – on a 4KB page system 5GB should be sufficient, but on the Altra’s 64KB page system the build used up to 9.5GB, including the source directory. We’re measuring only the actual build time and don’t include the configuration phase, as in the real world that usually doesn’t happen repeatedly.
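To see which of the two cases a given machine falls into, the page size and the ramdisk's current usage can be checked from the shell. The following is only a minimal sketch: the helper names `page_size_kb` and `used_mb` are made up for illustration, and the mount path is whatever was passed to `mount` in the build steps. The larger footprint on 64KB pages is presumably because tmpfs allocates whole pages, so many small files each round up to a full page; that explanation is an assumption, not something measured here.

```shell
#!/bin/sh
# Print the kernel page size in KB (4 on most x86 setups, 64 on the Altra system described above).
page_size_kb() {
    echo $(( $(getconf PAGESIZE) / 1024 ))
}

# Print how many MB are currently used on the filesystem holding the given path.
# Uses POSIX `df -Pk` (1KB blocks) and converts to MB.
used_mb() {
    df -Pk "$1" | awk 'NR==2 {print int($3 / 1024)}'
}

# Hypothetical usage: pass the tmpfs mount point as the first argument (defaults to the current directory).
target="${1:-.}"
echo "page size: $(page_size_kb) KB"
echo "used on ${target}: $(used_mb "${target}") MB"
```

Running it with `./llvm-project-tmpfs` as the argument after the build finishes shows how close the build came to the 10G tmpfs limit.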

LLVM Suite Compile Time

The Altra Q80-33 here performs admirably and pretty much matches the AMD EPYC 7742 in both 1S and 2S configurations. Scaling between sockets isn’t perfect because, this being an actual build process, it also includes linking phases which are mostly bound by single-threaded performance.
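One rough way to reason about the imperfect socket scaling is Amdahl's law. The sketch below is illustrative only: the serial fraction s = 0.01 is a hypothetical value chosen for the arithmetic, not something measured in this build, and N counts cores (80 per Altra socket).

```latex
% Amdahl's law: speedup on N cores when a fraction s of the work is serial
S(N) = \frac{1}{s + \frac{1 - s}{N}}

% Gain from one socket (N = 80) to two sockets (N = 160), with hypothetical s = 0.01
\frac{S(160)}{S(80)}
  = \frac{s + \frac{1 - s}{80}}{s + \frac{1 - s}{160}}
  = \frac{0.01 + 0.012375}{0.01 + 0.0061875}
  \approx 1.38
```

Under that assumption, even 1% of serial work such as linking would cap the second socket's benefit at roughly 1.38x instead of 2x.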

Generally, it’s interesting to see that the Altra fares better here than in the SPEC 502.gcc_r MT test – suggesting that real codebases might not be quite as demanding as the 502 reference source files, as they include a more diverse set of smaller files and objects that are compiled concurrently.

NAMD

Another rather popular benchmark tool, one that we’ve actually seen vendors such as AMD use in their marketing materials when showcasing HPC performance for their server chips, is NAMD. Compiling the tool for AArch64 was actually quite an interesting adventure, as there is essentially little to no proper support for it. I’ve used the latest source drop, essentially the 2.15alpha / 3.0alpha tree, and compiled it from scratch on GCC 10.2 using the platform’s respective -march and -mtune targets.
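As a sketch of what per-platform targets can look like with GCC 10, the snippet below maps a platform label to a pair of flags. The labels (`altra`, `epyc`, `xeon`), the function name `gcc_flags_for` and the specific flag values are assumptions for illustration; the article does not list the exact values that were used.

```shell
#!/bin/sh
# Map a platform label to GCC 10 -march/-mtune flags.
# Labels and flag choices are illustrative assumptions, not the article's exact settings.
gcc_flags_for() {
    case "$1" in
        altra) echo "-march=armv8.2-a -mtune=neoverse-n1" ;;   # Ampere Altra (Neoverse N1 cores)
        epyc)  echo "-march=znver2 -mtune=znver2" ;;           # AMD EPYC 7742 (Zen 2)
        xeon)  echo "-march=cascadelake -mtune=cascadelake" ;; # Intel Xeon 8280 (Cascade Lake)
        *)     echo "-march=native -mtune=native" ;;           # fall back to host detection
    esac
}

# Hypothetical usage: pass the platform label as the first argument.
echo "CXXFLAGS=$(gcc_flags_for "${1:-native}")"
```

The fallback to `-march=native` lets GCC pick flags for the build host when the label is not recognized.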

For the Xeon 8280 – I did not use the AVX512 back-end, for practical reasons: the code which introduces an AVX512 algorithm, contributed to NAMD by Intel engineers, is not portable to compilers other than ICC. Beyond this being a code path that has no relation to the “normal” CPU algorithm, the reliance on ICC is something that definitely made me raise my eyebrows. How to balance a benchmark with real-world performance against a fair apples-to-apples comparison is a whole other discussion topic. It’s something to revisit in the future as I invest more time into looking at the code to see if I can port it to GCC or LLVM.

NAMD (Git-2020-12-09) - Apolipoprotein A1

For the single-socket numbers – we’re using the multicore variant of the tool, which has predictable scaling across a single NUMA node. Here, the Ampere Altra Q80-33 performed amazingly well and managed to outperform the AMD EPYC 7742 by 30%, signifying that this is mostly a compute-bound workload that scales well with actual cores.

For the 2S figures, using the multicore binaries results in non-deterministic performance – the Altra here regressed to 2ns/day and the EPYC system also crashed down to 4ns/day. Oddly enough, the Xeon system had absolutely no issue running this properly: it had excellent performance scaling and actually outperformed the MPI version. The 2S EPYC scales well with the MPI version of the benchmark, as expected.

Unfortunately, I wasn’t able to compile an MPI version of NAMD for AArch64, as the codebase kept running into issues and it had no properly maintained build target for this. In general, I felt like I was amongst the first people to ever attempt this, even though there are some resources that attempt to help with it.

I also tried running Blender on the Altra system but that ended up with so many headaches I had to abandon the idea – on CentOS there were only some really old build packages available in the repository. Building Blender from source on AArch64 with all of its dependencies ends up in a plethora of software packages which simply assume you’re running on x86 and rely on basic SSE intrinsics – easy enough to fix that in the makefiles, but then I hit some other compilation errors after which I lost my patience. Fedora Linux seemed to be the only distribution offering an up-to-date build package for Blender – but I stopped short of reinstalling the OS just to benchmark Blender.

So, while AArch64 has made great strides in the past few years – and the software situation might be quite good for server workloads – it’s not all rosy, and we still have a ways to go before it can be considered a first-class citizen in the software ecosystem. Hopefully Apple’s introduction of Apple Silicon Macs will accelerate the Arm software ecosystem.


148 Comments


Josh128 - Friday, December 18, 2020 - link
Did you see the chip package? Its the size of an EPYC package. Im extremely doubtful its only 350mm^2.
mode_13h - Sunday, December 20, 2020 - link
Look at where they show the bottom of the heatsink and it's small contact area. That shows the actual die is much smaller.
Spunjji - Monday, December 21, 2020 - link
Doubt all you want - they have to put the pins for the interfaces somewhere, and that doesn't change much regardless of die size.
Gondalf - Friday, December 18, 2020 - link
Obviousy it is a cpu of niche, not high volume like Intel or AMD. With a so large die we will not see many of these around. As usual only volume matter in Server world
So no worries for X86.
eastcoast_pete - Friday, December 18, 2020 - link
Actually, those are a bigger threat to x86 than ARM chips like the M1 in Personal Computers. Server x86/x64 CPUs are where AMD and Intel make a lot of their money. The key question for this and similar Neoverse chips is software support. If you can run your database or whatever natively on an ARM-native OS like Linux, these are tempting. Now, if MS would release Exchange Server in native for ARM, the threat would be even bigger.
Gondalf - Friday, December 18, 2020 - link
Agreed about software, but i don't see problems for x86 dominance.
Major sin of this design is die size, around 800mm2 looking photos in the article. On 7nm it means a very low cpu output; this issue will become even worse on 5nm.
So it is not a matter how good is a SKU but who have the real volume in server world. In past decades we have seen a lot of better cpus than x86 puppies, but in spite of this they all have lost their way.
The winner scheme is "volume". This is the only parameter that gives the dominance of a solution over another ones, expecially today with several and several millions/year of server SKUs absorbed by the market.
Altra is not born to beat x86, at least not in this crazy, old style, incarnation. They need to follow AMD (and shortly Intel) path instead of they will never be relevant.
Actual and upcoming advanced processes are not done for these massive things.
scineram - Saturday, December 19, 2020 - link
It's less than half that, you absolute retard moron.
Wilco1 - Friday, December 18, 2020 - link
Apple's move to Arm does hit Intel's bottom line by many billions. A large percentage of AWS is already Graviton as more big customers are moving to it (latest is Twitter). Oracle is going to use Ampere Altra, and Microsoft is claimed to develop their own Arm servers.

As Gondalf said, volume matters in the server world, and they are moving to Arm.
Spunjji - Monday, December 21, 2020 - link
I love Gondalf posts. Minimum-effort confirmation bias ramblings.
eastcoast_pete - Friday, December 18, 2020 - link
That was my question also! Who fabs it, and what is their yield. This thing is quite big. Does anyone know if they overprovision cores so they can use those with small, very partial defects? At that size and those numbers of transistors, even a tiny probability of a defect can mean that the great majority of chips ends up in the circular bin (garbage).
