Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
Posted in: CPUs, AMD, Zen 3, X570, Ryzen 5000, Ryzen 9 5950X, SMT, Multi-Threading
CPU Performance
For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
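As a minimal sketch of how such a percentage can be derived from two raw results, the helper below divides the SMT On result by the SMT Off baseline. The function name, the `higher_is_better` flag, and the sample numbers are illustrative assumptions, not values or tooling from our test suite.

```python
def relative_performance(smt_on: float, smt_off: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT On performance as a percentage of the SMT Off baseline.

    For score-style results (higher is better) the ratio is on/off.
    For time-style results (lower is better) the ratio is inverted,
    so that a faster SMT On run still reports above 100%.
    """
    if smt_on <= 0 or smt_off <= 0:
        raise ValueError("results must be positive")
    ratio = smt_on / smt_off if higher_is_better else smt_off / smt_on
    return 100.0 * ratio


# Hypothetical raw results, chosen only to illustrate the arithmetic.
score_pct = relative_performance(1186.0, 1000.0)                       # a score
time_pct = relative_performance(80.0, 100.0, higher_is_better=False)   # seconds

print(f"score-based test: {score_pct:.1f}%")
print(f"time-based test:  {time_pct:.1f}%")
```

A value above 100% means SMT On was faster, and a value below 100% means it was slower, regardless of whether the underlying benchmark reports a score or a run time.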
Here are the single threaded results.
Single Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
y-Cruncher | 100% | 99.5% |
Dwarf Fortress | 100% | 99.9% |
Dolphin 5.0 | 100% | 99.1% |
CineBench R20 | 100% | 99.7% |
Web Tests | 100% | 99.1% |
GeekBench (4+5) | 100% | 100.8% |
SPEC2006 | 100% | 101.2% |
SPEC2017 | 100% | 99.2% |
Interestingly enough, our single threaded performance was within roughly one percentage point across the stack, with SPEC2006 showing the largest swing at +1.2%. Given that SMT Off mode should arguably give more resources to each thread for consistency, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.
The multithreaded tests are a bit more diverse:
Multi-Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
Agisoft Photoscan | 100% | 98.2% |
3D Particle Movement | 100% | 165.7% |
3DPM with AVX2 | 100% | 177.5% |
y-Cruncher | 100% | 94.5% |
NAMD AVX2 | 100% | 106.6% |
AIBench | 100% | 88.2% |
Blender | 100% | 125.1% |
Corona | 100% | 145.5% |
POV-Ray | 100% | 115.4% |
V-Ray | 100% | 126.0% |
CineBench R20 | 100% | 118.6% |
HandBrake 4K HEVC | 100% | 107.9% |
7-Zip Combined | 100% | 133.9% |
AES Crypto | 100% | 104.9% |
WinRAR | 100% | 111.9% |
GeekBench (4+5) | 100% | 109.3% |
Here we have a number of different factors affecting the results.
We start with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a math synthetic benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so both are some distance from real-world use cases.
Most of the remaining benchmarks gain roughly +5% to +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that stop both threads getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, and so on; each benchmark is likely different.
The outliers are 3DPM, 3DPM with AVX2, and Corona. All three gain 45% or more, with 3DPM at +66% and its AVX2 variant at +78%. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. They are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
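The grouping described above can be sketched in a few lines. The percentages are copied from the multi-threaded table, while the ±2% noise band and the +45% outlier threshold are assumptions chosen here to mirror the text, not thresholds we applied formally.

```python
# SMT On results as a percentage of the SMT Off baseline (from the table above).
results = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}

NOISE_BAND = 2.0         # assumed run-to-run variation, in percentage points
OUTLIER_THRESHOLD = 45.0  # assumed cut-off for an "outlier" gain


def classify(pct: float) -> str:
    """Place one result into a bucket based on its delta from 100%."""
    delta = pct - 100.0
    if delta >= OUTLIER_THRESHOLD:
        return "outlier"
    if delta > NOISE_BAND:
        return "gain"
    if delta < -NOISE_BAND:
        return "regression"
    return "neutral"


# Group test names by bucket, preserving table order within each bucket.
buckets: dict[str, list[str]] = {}
for name, pct in results.items():
    buckets.setdefault(classify(pct), []).append(name)

for label in ("regression", "neutral", "gain", "outlier"):
    print(f"{label:>10}: {', '.join(buckets.get(label, []))}")
```

With these assumed thresholds, Agisoft Photoscan at 98.2% lands in the neutral bucket rather than being counted as a regression, which matches only two tests being treated as statistically worse.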
Overall
In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.
In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.
For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.
Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
Most real-world workloads see a modest uplift, with an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
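For readers who want to reproduce an average from the published numbers, here is a minimal sketch over the 16 rows of the multi-threaded table. It is an assumption that these rows are the right set to average: taken together they come to about +20.6% on an arithmetic mean, so the 22% figure quoted above presumably covers a somewhat different selection of tests. A geometric mean is included because it is a common choice for averaging ratios.

```python
import math

# SMT On percentages from the multi-threaded table, in table order.
smt_on_pct = [
    98.2, 165.7, 177.5, 94.5, 106.6, 88.2, 125.1, 145.5,
    115.4, 126.0, 118.6, 107.9, 133.9, 104.9, 111.9, 109.3,
]

# Arithmetic mean: simple average of the percentages.
arithmetic_mean = sum(smt_on_pct) / len(smt_on_pct)

# Geometric mean: average the logs of the ratios, then convert back.
log_sum = sum(math.log(p / 100.0) for p in smt_on_pct)
geometric_mean = 100.0 * math.exp(log_sum / len(smt_on_pct))

print(f"arithmetic mean uplift: {arithmetic_mean - 100.0:+.1f}%")
print(f"geometric mean uplift:  {geometric_mean - 100.0:+.1f}%")
```

The geometric mean comes out lower than the arithmetic mean here because it is less influenced by the large 3DPM and Corona gains.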
126 Comments
MrSpadge - Friday, December 4, 2020
That doesn't mean SMT is mainly responsible for that. The x86 decoders are a lot more complex. And at the top end you get diminishing performance returns for additional die area.
Wilco1 - Friday, December 4, 2020
I didn't say all the difference comes from SMT, but it can't be the x86 decoders either. A Zen 2 without L2 is ~2.9 times the size of a Neoverse N1 core in 7nm. That's a huge factor. So 2 N1 cores are smaller and significantly faster than 1 SMT2 Zen 2 core. Not exactly an advertisement for SMT, is it?
Dolda2000 - Friday, December 4, 2020
>Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area
To be honest, it wouldn't surprise me one bit if 90% of the area gives 10% of the performance. Wringing out that extra 1% single-threaded performance here or there is the name of the game nowadays.
Also there are many other differences that probably cost a fair bit of silicon, like wider vector units (NEON is still 128-bit, and exceedingly few ARM cores implement SVE yet).
Wilco1 - Saturday, December 5, 2020
It's nowhere near as bad with Arm designs showing large gains every year. Next generation Neoverse has 40+% higher IPC.
Yes Arm designs opt for smaller SIMD units. It's all about getting the most efficient use out of transistors. Having huge 512-bit SIMD units add a lot of area, power and complexity with little performance gain in typical code. That's why 512-bit SVE is used in HPC and nowhere else.
So with Arm you get many different designs that target specific markets. That's more efficient than one big complex design that needs to address every market and isn't optimal in any.
whatthe123 - Saturday, December 5, 2020
The extra 20% of performance is difficult to achieve. You can already see it on zen CPUs, where 16 core designs are dramatically more efficient per core in multithread running at around 3.4ghz, vs 8 core designs running at 4.8ghz. I've always hated these comparisons with ARM for this reason... you need a part with 1:1 watt parity to make a fair comparison, otherwise 80% performance at half the power can also be accomplished even on x86 by just reducing frequency and upping core count.
Wilco1 - Saturday, December 5, 2020
Graviton clocks low to conserve power, and still gets close to Rome. You can easily clock it higher - Ampere Altra does clock the same N1 core 32% higher. So that 20-25% gap is already gone. We also know about the next generation (Neoverse N2 and V1) which have 40+% higher IPC.
Yes adding more cores and clocking a bit lower is more efficient. But that's only feasible when your core is small! Altra Max has 128 cores on a single die, and I don't think we'll see AMD getting anywhere near that in the next few years even with chiplets.
peevee - Monday, December 7, 2020
It is obviously a lot LESS than 5%. Nothing that matters in terms of transistors (caches and vector units) increases. Even doubling of registers would add a few hundreds/thousands of transistors on a chip with tens of billions of transistors, less than 0.000001%.
They can double all scalar units and it still would be below 1% increase.
Kangal - Friday, December 4, 2020
I agree.
Adding SMT/HT requires something like a +10% increase in the Silicon Budget, and a +5% increase in power draw but increases performance by +30%, speaking in general. So it's worth the trade-off for daily tasks, and those on a budget.
What I was curious to see, is if you disabled SMT on the 5950X, which has lots of cores. Leaving each thread with slightly more resources. And use the extra thermals to overclock the processor. How would that affect games?
My hunch?
Thread-happy games like Ashes of Singularity would perform worse, since it is optimised and can take advantage of the SMT. Unoptimized games like Fallout 76 should see an increase in performance. Whereas actually optimised games like Metro Exodus they should be roughly equal between OC versus SMT.
Dolda2000 - Friday, December 4, 2020
>What I was curious to see, is if you disabled SMT on the 5950X, which has lots of cores.
That is exactly what he did in this article, though.
Kangal - Saturday, December 5, 2020
I guess you didn't understand my point.
Think of a modern game which is well optimised, is both GPU intensive and CPU intensive. Such as Far Cry V or Metro Exodus. These games scale well anywhere from 4-physical-core to 8-physical-cores.
So using the 5950X with its 16-physical-cores, you really don't need extra threads. In fact, it's possible to see a performance uplift without SMT, dropping it from 32-shared-threads down to 16-full-threads, as each core gets better utilisation. Now add to that some overclocking (+0.2GHz ?) due to the extra thermal headroom, and you may legitimately get more performance from these titles. Though I suspect they wouldn't see any substantial increases or decreases in frame rates.
In horribly optimised games, like Fallout 76, Mafia 3, or even AC Odyssey, anything could happen (though probably they would see some increases). Whereas we already know that in games that aren't GPU intensive, but CPU intensive (eg practically all RTS games), these were designed to scale up much much better. So even with the full-cores and overclock, we know these games will actually show a decrease in performance from losing those extra threads/SMT.