Compiling LLVM, NAMD Performance

As we rebuild our server test suite piece by piece (and there’s still a lot of work ahead to get a good, representative set of “real world” workloads), one benchmark that readers highly desired was a more realistic compilation suite. With the Chrome and LLVM codebases being the most requested, I landed on LLVM as it’s fairly easy and straightforward to set up.

# Fetch the sources and check out the 11.x release branch
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/11.x
mkdir ./build
cd ..
# Copy the tree onto a ramdisk so the build isn't bound by storage I/O
mkdir llvm-project-tmpfs
sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs
cp -r llvm-project/* llvm-project-tmpfs
cd ./llvm-project-tmpfs/build
# Configure (not timed)
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
  -DCMAKE_BUILD_TYPE=Release ../llvm
# Build (timed)
time cmake --build .

We’re using the LLVM 11.0.0 release as the build target version, and we’re compiling Clang, libc++, libc++abi, LLDB, Compiler-RT and LLD using GCC 10.2 (self-compiled). To avoid any concerns about I/O, we’re building on a ramdisk: on a 4KB page system 5GB should be sufficient, but on the Altra’s 64KB page system the build used up to 9.5GB, including the source directory. We’re measuring only the actual build time and exclude the configuration phase, as in the real world that usually doesn’t happen repeatedly.
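As a rough sketch of the ramdisk sizing described above, the snippet below picks a tmpfs size from the kernel page size. The helper name `suggest_tmpfs_size` is hypothetical, and the two sizes are simply the 5GB figure for 4KB pages and the 10G mount size that covered the 9.5GB peak seen with 64KB pages; treating every page size above 4KB as needing the larger value is a conservative assumption, not something that was measured.

```shell
# Hypothetical helper: suggest a tmpfs size for the LLVM build tree.
# Takes a page size in bytes; defaults to the running kernel's page size.
suggest_tmpfs_size() {
  page_size="${1:-$(getconf PAGESIZE)}"
  if [ "$page_size" -le 4096 ]; then
    echo "5G"    # 4KB pages: 5GB should be sufficient
  else
    echo "10G"   # larger pages (assumption): room for the ~9.5GB peak
  fi
}

echo "page size: $(getconf PAGESIZE) bytes, suggested tmpfs size: $(suggest_tmpfs_size)"
```

The result could be passed to the `size=` option of the `mount -t tmpfs` command, and `df -h` on the mount point shows how much of it a build actually consumed.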

LLVM Suite Compile Time

The Altra Q80-33 here performs admirably and pretty much matches the AMD EPYC 7742 in both 1S and 2S configurations. Scaling between sockets isn’t perfect because, this being an actual build process, it also includes linking phases which are mostly bound by single-threaded performance.
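The effect of those serial linking phases on socket scaling can be illustrated with Amdahl's law. This is only a sketch: the function name `amdahl_speedup` is made up for illustration, and the 5% serial share is a hypothetical input, not a value measured on any of the tested systems.

```shell
# amdahl_speedup SERIAL_FRACTION RESOURCE_MULTIPLIER
# Prints the ideal speedup when only the parallel share of the work
# benefits from the extra resources (Amdahl's law).
amdahl_speedup() {
  LC_ALL=C awk -v s="$1" -v n="$2" \
    'BEGIN { printf "%.2f\n", 1 / (s + (1 - s) / n) }'
}

# Hypothetical example: 5% of the 1S build time is serial linking,
# and a second socket doubles the available cores.
amdahl_speedup 0.05 2
```

With these assumed inputs the function prints 1.90, meaning that even a small serial share keeps a 2S system from reaching a full 2x over 1S.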

Generally, it’s interesting to see that the Altra here fares better than in the SPEC 502.gcc_r MT test, suggesting that real codebases might not be quite as demanding as the 502 reference source files, as they include a more diverse set of smaller files and objects that are compiled concurrently.

NAMD

Another rather popular benchmark tool, which we’ve actually seen vendors such as AMD use in their marketing materials when showcasing HPC performance for their server chips, is NAMD. This was actually quite an interesting adventure in terms of compiling the tool for AArch64, as there is essentially little to no proper support for it. I’ve used the latest source drop, essentially the 2.15alpha / 3.0alpha tree, and compiled it from scratch with GCC 10.2 using each platform’s respective -march and -mtune targets.
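To make the per-platform tuning concrete, here is a hedged sketch of how such flags might be selected. The exact values used for the article are not listed, so the mapping below is an assumption based on target names that GCC 10 accepts for these microarchitectures (Neoverse N1, Zen 2, Cascade Lake); the function name `target_flags` and the system labels are likewise hypothetical.

```shell
# target_flags SYSTEM
# Hypothetical mapping from test system to GCC tuning flags; the
# values are assumptions, not the exact flags used for the article.
target_flags() {
  case "$1" in
    altra) echo "-march=armv8.2-a -mtune=neoverse-n1" ;;
    epyc)  echo "-march=znver2 -mtune=znver2" ;;
    xeon)  echo "-march=cascadelake -mtune=cascadelake" ;;
    *)     echo "unknown system: $1" >&2; return 1 ;;
  esac
}

for system in altra epyc xeon; do
  echo "$system: $(target_flags "$system")"
done
```

When building directly on the target machine, `-march=native -mtune=native` is a common alternative that lets GCC detect the host CPU itself.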

For the Xeon 8280, I did not use the AVX512 back-end for practical reasons: the code that introduces an AVX512 algorithm, contributed to NAMD by Intel engineers, is not portable to compilers other than ICC. Beyond this being a code-path that has no relation to the “normal” CPU algorithm, the reliance on ICC is something that definitely made me raise my eyebrows. Balancing a benchmark with real-world performance against a fair apples-to-apples comparison is a whole other discussion topic. It’s something to revisit in the future as I invest more time into looking into the code to see if I can port it to GCC or LLVM.

NAMD (Git-2020-12-09) - Apolipoprotein A1

For the single-socket numbers, we’re using the multicore variant of the tool, which has predictable scaling across a single NUMA node. Here, the Ampere Altra Q80-33 performed amazingly well and managed to outperform the AMD EPYC 7742 by 30%, signifying that this is mostly a compute-bound workload that scales well with actual cores.

For the 2S figures, using the multicore binaries results in non-deterministic performance: the Altra here regressed to 2ns/day and the EPYC system also crashed down to 4ns/day. Oddly enough, the Xeon system had absolutely no issue running this properly, as it showed excellent performance scaling and actually outperformed the MPI version. The 2S EPYC scales well with the MPI version of the benchmark, as expected.

Unfortunately, I wasn’t able to compile an MPI version of NAMD for AArch64, as the codebase kept running into issues and had no properly maintained build target for this. In general, I felt like I was amongst the first people to ever attempt this, even though there are some resources that attempt to help with it.

I also tried running Blender on the Altra system but that ended up with so many headaches I had to abandon the idea – on CentOS there were only some really old build packages available in the repository. Building Blender from source on AArch64 with all of its dependencies ends up in a plethora of software packages which simply assume you’re running on x86 and rely on basic SSE intrinsics – easy enough to fix that in the makefiles, but then I hit some other compilation errors after which I lost my patience. Fedora Linux seemed to be the only distribution offering an up-to-date build package for Blender – but I stopped short of reinstalling the OS just to benchmark Blender.

So, while AArch64 has made great strides in the past few years, and the software situation might be quite good for server workloads, it’s not all rosy and we still have a ways to go before it can be considered a first-class citizen in the software ecosystem. Hopefully Apple’s introduction of Apple Silicon Macs will accelerate the Arm software ecosystem.


148 Comments

  • Wilco1 - Monday, December 21, 2020 - link

    Why would they introduce Graviton if it would run at a loss??? A significant percentage of AWS is already Graviton (probably 20% by now). If anything Graviton increases profitability due to vertical integration and other cost reduction.
  • mode_13h - Monday, December 21, 2020 - link

    First, there's a fundamental disparity between an in-house CPU and a 3rd Party one, where Amazon can cut out some overheads by building their own. So, that already skews the price-comparison.

    The other question is whether Amazon is partially-subsidizing the price of their Graviton2 instances as an incentive to get more people to switch. For a business, the least risky thing is to stay on x86, so Amazon needs to present an immediate and significant cost savings to get people to switch. After they've switched and ARM server cores have had more time to mature, Amazon can charge more and make back a good return on investment.

    I obviously don't know if that's what they're doing, but we don't know that it's not. So, you really can't read much into their current pricing. That's all I'm saying.
  • mode_13h - Sunday, December 20, 2020 - link

    Finally, I guess you missed this part, in the discussion of SPECjbb:

    > One thing that did come to mind immediately when I saw the results was SMT.
    > Due to this being a transactional data-plane resident type of workload,
    > SMT will undoubtedly help a lot in terms of performance,
    > so I tested out the EPYC chip figures with SMT disabled,
    > and indeed max-jOPS went down to 209.5k for the 2S THP enabled results,
    > meaning that SMT accounts for a 29.7% performance benefit in this benchmark.

    ...

    > It’s generally these kinds of workloads that SMT works best on,
    > and that’s why IBM can deploy SMT4 or SMT8 processors,
    > and the type of workloads Marvell’s ThunderX was trying to carve a niche for itself with SMT4.
  • mode_13h - Sunday, December 20, 2020 - link

    As the article mentions, Marvell’s ThunderX did support SMT on ARMv8-A.

    Were SMT's reputation not bruised by all the recent side-channel exploits, perhaps it would be showing up in some of ARM's own cores. Maybe their V-series will get it, since that's a much larger core.
  • Wilco1 - Monday, December 21, 2020 - link

    ThunderX2/X3 and Neoverse E1 have SMT, but neither has been hugely successful. SMT doesn't provide a significant benefit across a wide range of workloads, so adding another core remains simpler and cheaper. And yes, security is another nail in the coffin.
  • EthiaW - Saturday, December 19, 2020 - link

    The performance of Graviton2 meets our expectation for Neoverse N1 (or Cortex A76) better. How can Q80 manage to deliver so much higher IPC with the same architecture? Incredible.
  • Brutalizer - Saturday, December 19, 2020 - link

    One old Oracle SPARC T8 cpu does 153.500 Java max-JOPS SPECjbb2015. And the crit-JOPS value is 90.000. Easily smashing all cpus here.
    https://blogs.oracle.com/bestperf/specjbb2015:-spa...
  • satai - Saturday, December 19, 2020 - link

    Benchmarked by Oracle... Definitely trustworthy.
  • zepi - Saturday, December 19, 2020 - link

    SPECJBB graphs kill me.

    For the love of god, please keep the axis scaling identical!

    Same applies to every single metric always. If you provide separate graphs for different products, please make sure that axis-scaling is the same in all images!
  • Andrei Frumusanu - Sunday, December 20, 2020 - link

    The graphs are generated by the benchmark itself.
