Compiling LLVM, NAMD Performance

As we rebuild our server test suite piece by piece – and there’s still a lot of work ahead to arrive at a good, representative set of “real world” workloads – one of the more highly requested benchmarks amongst readers was a more realistic compilation suite. With the Chrome and LLVM codebases being the most requested, I landed on LLVM as it’s fairly easy to set up and straightforward to reproduce.

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/11.x
mkdir ./build
cd ..
mkdir llvm-project-tmpfs
sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs
cp -r llvm-project/* llvm-project-tmpfs
cd ./llvm-project-tmpfs/build
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
  -DCMAKE_BUILD_TYPE=Release ../llvm
time cmake --build .

We’re using the LLVM 11.0.0 release as the build target version, compiling Clang, libc++, libc++abi, LLDB, Compiler-RT and LLD using GCC 10.2 (self-compiled). To avoid any concerns about I/O, we’re building everything on a ramdisk. We’re measuring the actual build time only and don’t include the configuration phase, as in the real world that step isn’t repeated with every build.
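Since the toolchain here is a self-compiled GCC 10.2, it can help to pin the compiler explicitly at configure time so CMake doesn’t pick up a system default instead. A minimal sketch – the /opt/gcc-10.2 install path is an assumption, adjust for your own toolchain location:

```shell
# Hypothetical install path: point these at the self-built GCC 10.2 binaries.
CC=/opt/gcc-10.2/bin/gcc
CXX=/opt/gcc-10.2/bin/g++

cmake -G Ninja \
  -DCMAKE_C_COMPILER="$CC" \
  -DCMAKE_CXX_COMPILER="$CXX" \
  -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
  -DCMAKE_BUILD_TYPE=Release ../llvm
```

CMake caches the compiler choice in the build directory, so this only needs to be set on the initial configure, not on rebuilds.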

LLVM Suite Compile Time

Starting off with the Xeon 8380, we’re looking at large generational improvements for the new Ice Lake SP chip. A 33-35% improvement in compile time, depending on whether we’re looking at 2S or 1S figures, is enough to reposition Intel’s flagship CPU in the rankings by a notable amount, finally no longer lagging behind the competition as drastically.

It’s definitely not sufficient to compete with AMD and Ampere, however, whose parts remain 25% and 15% ahead of the Xeon 8380 respectively.

The Xeon 6330 falls in line with where we benchmarked it in previous tests, just slightly edging out the Xeon 8280 (the 6258R equivalent), meaning we’re seeing minor ISO-core, ISO-power generational improvements (though, again, I have to mention that the 6330 is half the price of a 6258R).

NAMD (Git-2020-12-09) - Apolipoprotein A1

NAMD is a problem-child benchmark due to its recent addition of an AVX-512 code path: the code was contributed by Intel engineers – which isn’t exactly an issue in my view. The problem is that this is a new algorithm with no relation to the normal code path, which itself remains less hand-optimised for AVX2, and, raising eyebrows further, the new path is solely compatible with Intel’s ICC and no other compiler. That’s one questionable aspect too many for a benchmark: are we benchmarking it as a general HPC-representative workload, or are we benchmarking it solely for the sake of NAMD performance and nothing else?

We understand Intel is putting a lot of focus on these kinds of workloads that are hyper-optimised to run extremely well on Intel hardware only, and it’s a valid optimisation path for many use-cases. I’m just questioning how representative it is of the wider market and its workloads.

In any case, the GCC binaries of the test on the ApoA1 protein showcase a significant performance uplift for the Xeon 8380, with a +35.6% gain. Even on this apples-to-apples code path, it still falls well behind the competition, which scales performance much higher thanks to greater core counts.

Comments (169)

  • Oxford Guy - Wednesday, April 7, 2021 - link

    You're arguing apples (latency) and oranges (capability).

    An Apple II has better latency than an Apple Lisa, even though the latter is vastly more powerful in most respects. The sluggishness of the UI was one of the big problems with that system from a consumer point of view. Many self-described power users equated a snappy interface with capability, so they believed their CLI machines (like the IBM PC) were a lot better.
  • GeoffreyA - Wednesday, April 7, 2021 - link

    "today's software and OSes are absurdly slow, and in many cases desktop applications are slower in user-time than their late 1980s counterparts"

    Oh yes. One builds a computer nowadays and it's fast for a year. But then applications, being updated, grow sluggish over time. And it starts to feel like one's old computer again. So what exactly did we gain, I sometimes wonder. Take a simple suite like LibreOffice, which was never fast to begin with. I feel version 7 opens even slower than 6. Firefox was quite all right, but as of 85 or 86, when they introduced some new security feature, it seems to open a lot slower, at least on my computer. At any rate, I do appreciate all the free software.
  • ricebunny - Wednesday, April 7, 2021 - link

    Well said.
  • Frank_M - Thursday, April 8, 2021 - link

    Intel Fortran is vastly faster than GCC.

    How did ricebunny get a free compiler?
  • mode_13h - Thursday, April 8, 2021 - link

    > It's strange to tell people who use the Intel compiler that it's not used much in the real world, as though that carries some substantive point.

    To use the automotive analogy, it's as if a car is being reviewed using 100-octane fuel, even though most people can only get 93 or 91 octane (and many will just use the cheap 87 octane, anyhow).

    The point of these reviews isn't to milk the most performance from the product that's theoretically possible, but rather to inform readers about how they're likely to experience it. THAT is why it's relevant that almost nobody uses ICC in practice.

    And, in fact, BECAUSE so few people are using ICC, Intel puts a lot of work into GCC and LLVM.
  • GeoffreyA - Thursday, April 8, 2021 - link

    I think that a common compiler like GCC should be used (like Andrei is doing), along with a generic x86-64 -march (in the case of Intel/AMD) and generic -mtune. The idea would be to get the CPUs on as equal a footing as possible, even with code that might not be optimal, and reveal relative rather than absolute performance.
  • Wilco1 - Thursday, April 8, 2021 - link

    Using generic (-march=x86-64) means you are building for ancient SSE2... If you want a common baseline then use something like -march=x86-64-v3. You'll then get people claiming that excluding AVX-512 is unfair, even though there is little difference on most benchmarks except for higher power consumption ( https://www.phoronix.com/scan.php?page=article&... ).
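For reference, the micro-architecture levels being discussed in this thread map to concrete compiler flags. A minimal sketch – the source file name is illustrative, and x86-64-v3 requires GCC 11 or newer:

```shell
# Common-baseline build at the x86-64-v3 level (AVX2/FMA/BMI2 era,
# no AVX-512), with tuning kept generic rather than CPU-specific.
gcc -O2 -march=x86-64-v3 -mtune=generic -c hot_loop.c -o hot_loop.o

# The plain x86-64 baseline, by contrast, only guarantees SSE2.
gcc -O2 -march=x86-64 -mtune=generic -c hot_loop.c -o hot_loop.o
```

The -mtune=generic part is what keeps the scheduling neutral across vendors; -march alone only sets the instruction-set floor.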
  • GeoffreyA - Saturday, April 10, 2021 - link

    I think leaving AVX512 out is a good policy.
  • GeoffreyA - Thursday, April 8, 2021 - link

    If I may offer an analogy, I would say: the benchmark is like an exam in school but here we test time to finish the paper (and with the constraint of complete accuracy). Each pupil should be given the identical paper, and that's it.

    Using optimised binaries for different CPUs is a bit like knowing each child's brain beforehand (one has thicker circuitry in Brodmann area 10, etc.) and giving each a paper with peculiar layout and formatting but the same questions (in essence). Which system is better, who can say, but I'd go with the first.
  • Oxford Guy - Wednesday, April 7, 2021 - link

    Well, whatever tricks were used made Blender faster with the ICC builds I tested — both on AMD's Piledriver and on several Intel releases (Lynnfield and Haswell).
