Compiling LLVM, NAMD Performance

As we’re trying to rebuild our server test suite piece by piece – and there’s still a lot of work ahead to get to a good, representative “real world” set of workloads – one highly requested benchmark amongst readers was a more realistic compilation suite. With the Chrome and LLVM codebases being the most requested, I landed on LLVM as it’s fairly easy to set up and straightforward to build.

# Fetch the LLVM monorepo and check out the 11.x release branch
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/11.x
mkdir ./build
cd ..

# Set up a 10GB ramdisk and copy the sources onto it to take storage I/O out of the equation
mkdir llvm-project-tmpfs
sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs
cp -r llvm-project/* llvm-project-tmpfs

# Configure the sub-projects to build, then time only the actual build step
cd ./llvm-project-tmpfs/build
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
  -DCMAKE_BUILD_TYPE=Release ../llvm
time cmake --build .

We’re using the LLVM 11.0.0 release as the build target, compiling Clang, libc++, libc++abi, LLDB, Compiler-RT and LLD with GCC 10.2 (self-compiled). To avoid any concerns about I/O, we’re building on a ramdisk – on a system with 4KB pages, 5GB should be sufficient, but on the Altra’s 64KB-page setup the build used up to 9.5GB, including the source directory. We’re measuring only the actual build time and excluding the configuration phase, since in the real world that doesn’t happen repeatedly.
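
As a quick sanity check – a minimal sketch, reusing the mount point from the commands above – the page size and actual ramdisk consumption can be verified like so:

getconf PAGESIZE                    # 4096 on typical x86 kernels, 65536 on the Altra's 64KB-page kernel
df -h ./llvm-project-tmpfs          # total tmpfs usage (sources plus build artifacts)
du -sh ./llvm-project-tmpfs/build   # build directory alone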

LLVM Suite Compile Time

The Altra Q80-33 performs admirably here and pretty much matches the AMD EPYC 7742 in both 1S and 2S configurations. There isn’t perfect scaling between sockets because, this being an actual build process, it also includes linking phases which are mostly bound by single-threaded performance.

Generally, it’s interesting to see that the Altra fares better here than in the SPEC 502.gcc_r MT test – suggesting that real codebases might not be quite as demanding as the 502 reference source files, and that they comprise a more diverse mix of smaller files and objects that can be compiled concurrently.

NAMD

Another rather popular benchmark tool that we’ve actually seen used by vendors such as AMD in their marketing materials when showcasing HPC performance of their server chips is NAMD. This turned out to be quite an adventure in terms of compiling the tool for AArch64, as there is essentially little to no proper support for it. I’ve used the latest source drop, essentially the 2.15alpha / 3.0alpha tree, and compiled it from scratch with GCC 10.2 using the platforms’ respective -march and -mtune targets.
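
For reference, a rough sketch of what such a from-scratch multicore build looks like – the Charm++ and NAMD arch names and directory layout below are illustrative assumptions rather than the exact configuration used, and the flags are merely examples of the per-platform -march/-mtune targets mentioned above:

# Illustrative GCC 10.2 tuning flags per platform (assumed values)
#   Ampere Altra:     -march=armv8.2-a -mtune=neoverse-n1
#   AMD EPYC 7742:    -march=znver2
#   Intel Xeon 8280:  -march=cascadelake

# Hypothetical AArch64 build: Charm++ shared-memory backend first, then NAMD itself
cd charm && ./build charm++ multicore-linux-arm8 gcc --with-production
cd ../namd && ./config Linux-ARM64-g++ --charm-arch multicore-linux-arm8
cd Linux-ARM64-g++ && make -j$(nproc)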

For the Xeon 8280, I did not use the AVX512 back-end for practical reasons: the code that introduces an AVX512 algorithm, contributed to NAMD by Intel engineers, is not portable to compilers other than ICC. Beyond this being a code path that bears no relation to the “normal” CPU algorithm, the reliance on ICC is something that definitely made me raise my eyebrows. It’s a whole other discussion topic whether a benchmark should showcase the best real-world performance or prioritise a fair apples-to-apples comparison. It’s something to revisit in the future as I invest more time into looking at the code and seeing whether I can port it to GCC or LLVM.

NAMD (Git-2020-12-09) - Apolipoprotein A1

For the single-socket numbers, we’re using the multicore variant of the tool, which has predictable scaling across a single NUMA node. Here, the Ampere Altra Q80-33 performed amazingly well and managed to outperform the AMD EPYC 7742 by 30% – indicating that this is a mostly compute-bound workload that scales well with actual core count.
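
For reference, a multicore run is launched along these lines – a minimal sketch, with the thread count and input path as placeholders; NAMD itself reports days/ns, which is inverted into the ns/day figures shown here:

# Shared-memory (multicore) binary: one process with 80 worker threads pinned to cores
./namd2 +p80 +setcpuaffinity apoa1/apoa1.namd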

For the 2S figures, using the multicore binaries results in non-deterministic performance – the Altra here regressed to 2 ns/day and the EPYC system also dropped to 4 ns/day. Oddly enough, the Xeon system had absolutely no issue running this properly: it showed excellent performance scaling and actually outperforms its MPI version. The 2S EPYC scales well with the MPI version of the benchmark, as expected.
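
The MPI binaries are instead launched through the MPI runtime – again a sketch, with the rank count as a placeholder for the 128 cores of the 2S EPYC system:

# MPI build: one rank per physical core across both sockets
mpirun -np 128 ./namd2 apoa1/apoa1.namd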

Unfortunately, I wasn’t able to compile an MPI version of NAMD for AArch64, as the codebase kept running into issues and has no properly maintained build target for this. In general, I felt like I was amongst the first people to ever attempt this, even though there are some resources out there that try to help with it.

I also tried running Blender on the Altra system, but that ended up with so many headaches that I had to abandon the idea – on CentOS there were only some really old build packages available in the repository. Building Blender from source on AArch64 with all of its dependencies runs into a plethora of software packages which simply assume you’re running on x86 and rely on basic SSE intrinsics – easy enough to fix in the makefiles, but I then hit other compilation errors, after which I lost my patience. Fedora Linux seemed to be the only distribution offering an up-to-date build package for Blender, but I stopped short of reinstalling the OS just to benchmark Blender.

So, while AArch64 has made great strides in the past few years – and the software situation might be quite good for server workloads – it’s not all rosy, and we still have a ways to go before it can be considered a first-class citizen in the software ecosystem. Hopefully Apple’s introduction of Apple Silicon Macs will accelerate the Arm software ecosystem.

Comments

  • Silver5urfer - Friday, December 18, 2020

    25% more cores for Zen2 7742 class. If paired with multi socket and then Milan drop in this is not going to be any major breakthrough.

    "The Arm server dream is no longer a dream, it’s here today, and it’s real." lol so until today all the articles on the ARM are not real I guess.

    Anyways I will wait for market penetration of this with server share and then see how great ARM is and how bad x86 is going to be as from AT's narrative recently.
  • Spunjji - Monday, December 21, 2020

    Are you this mopey every time there's a paradigm-shift in the tech industry? Feel free to keep looking for metrics that "prove" you right, but eventually it's going to be a very hard search.
  • eastcoast_pete - Friday, December 18, 2020

    Thanks Andrei! Maybe I am barking up the wrong tree here, but I find the "baby" server chip in that lineup particularly interesting. Nowhere near as fast as this, of course, but for $ 800, it might make for a nice CPU for a basic server setup; nothing fancy, but low TdP, and would probably get the job done. The question here is how expensive the MB for those would be.
    Lastly, if Ampere sends you one of those $ 800 ones, could/would you test it?
  • Wilco1 - Friday, December 18, 2020

    They will likely sell desktops using these just like the previous generation, but they are not cheap as it is high-end server gear using expensive ECC memory (and lots of it since there are 8 channels). If you don't need the fastest then there is eg. NVIDIA Xavier or LX2160A (16x A72) boards for around $500.
  • Spunjji - Monday, December 21, 2020

    I think those are probably most useful for workloads that are pathologically memory and/or I/O limited - 4TB per socket, save ~$3000 over the faster CPU, benefit from power savings over the life of the server.
  • twtech - Friday, December 18, 2020

    Ironically, AMD's opportunity to win might turn into an ultimate loss - Intel's manufacturing advantage kept x86 relevant, and with access to the x86 instruction set limited by ownership of the IP, AMD lived alongside Intel in that walled garden.

    With the manufacturing advantage gone however, Apple has left the garden, and maybe other personal computers won't be far behind - software compatibility I think is actually less of an issue in the era of SaaS and continuous updates. Ie. you were going to have to download new versions of the software you use as time went on anyway.
  • FunBunny2 - Friday, December 18, 2020

    "you were going to have to download new versions of the software you use as time went on anyway."

    Solar Wind? :)
  • lorribot - Friday, December 18, 2020

    This is all great but when all licencing is per core it limits the usage scenarios or benefits of these developments as they can really only be used with open source type licences.
    For the rest of us on Windows, Oracle, Java, Apple, IBM, etc licencing it doesn't bring anything to the table.
  • The_Assimilator - Friday, December 18, 2020

    Just in time to be obsoleted by Milan.
  • Spunjji - Monday, December 21, 2020

    For a given definition of "obsoleted", where it means "still more than competitive in performance per dollar at a lower price of entry".
