Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000

Author: Dr. Ian Cutress

by Dr. Ian Cutress on December 3, 2020 10:00 AM EST

CPU Performance

For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
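As a rough illustration of how such a percentage can be derived from two raw results, here is a minimal Python sketch. The function name, the `higher_is_better` flag, and the sample numbers are all hypothetical and are not taken from our test harness; it assumes only that time-based results need inverting so that a higher percentage always means SMT on was faster.

```python
def smt_relative_percent(score_off, score_on, higher_is_better=True):
    """Return SMT-on performance as a percentage of the SMT-off baseline (100%)."""
    if score_off <= 0 or score_on <= 0:
        raise ValueError("scores must be positive")
    if higher_is_better:
        # Throughput-style score: more is better.
        ratio = score_on / score_off
    else:
        # Time-style result: less is better, so invert the ratio.
        ratio = score_off / score_on
    return 100.0 * ratio


# Hypothetical raw results, for illustration only.
print(f"{smt_relative_percent(4000, 5000):.1f}%")                            # a points score
print(f"{smt_relative_percent(120.0, 100.0, higher_is_better=False):.1f}%")  # a run time in seconds
```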

Here are the single threaded results.

Single Threaded Tests AMD Ryzen 9 5950X
AnandTech	SMT Off Baseline	SMT On
y-Cruncher	100%	99.5%
Dwarf Fortress	100%	99.9%
Dolphin 5.0	100%	99.1%
CineBench R20	100%	99.7%
Web Tests	100%	99.1%
GeekBench (4+5)	100%	100.8%
SPEC2006	100%	101.2%
SPEC2017	100%	99.2%

Interestingly enough, our single threaded performance was within roughly a single percentage point across the stack (SPEC2006 being the largest swing at +1.2%). Given that ST mode should arguably give more resources to each thread for consistency, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources, even in SMT mode, is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests AMD Ryzen 9 5950X
AnandTech	SMT Off Baseline	SMT On
Agisoft Photoscan	100%	98.2%
3D Particle Movement	100%	165.7%
3DPM with AVX2	100%	177.5%
y-Cruncher	100%	94.5%
NAMD AVX2	100%	106.6%
AIBench	100%	88.2%
Blender	100%	125.1%
Corona	100%	145.5%
POV-Ray	100%	115.4%
V-Ray	100%	126.0%
CineBench R20	100%	118.6%
HandBrake 4K HEVC	100%	107.9%
7-Zip Combined	100%	133.9%
AES Crypto	100%	104.9%
WinRAR	100%	111.9%
GeekBench (4+5)	100%	109.3%

Here we have a number of different factors affecting the results.

Starting with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound in some parts and compute-bound in others, so the memory bandwidth available per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a synthetic math benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so quite far away from real-world use cases.

Most of the rest of the benchmarks show a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that stop both threads from getting absolutely full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, and so on. Each benchmark is likely different.

The outliers are 3DPM, 3DPM with AVX2, and Corona. All three gain 45% or more, with 3DPM gaining 65.7% and its AVX2 variant 77.5%. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to a greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
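To make the grouping described above easier to follow, here is a small Python sketch that sorts the multi-threaded table into those groups. The percentages are copied from the table; the cut-offs are assumptions for illustration only (a ±2% noise band, "typical" starting at gains that round to +5%, and outliers at +45% as in the text), not thresholds we applied formally.

```python
# SMT-on results from the multi-threaded table above (SMT off = 100%).
SMT_ON_PERCENT = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}

NOISE_BAND = 2.0    # assumption: within +/-2% counts as no real change
TYPICAL_MIN = 4.5   # assumption: gains that round to +5% or more
OUTLIER_MIN = 45.0  # the text calls out gains of 45% or more


def classify(percent):
    """Place one SMT-on percentage into a named group."""
    gain = percent - 100.0
    if gain < -NOISE_BAND:
        return "regression"
    if gain < TYPICAL_MIN:
        return "flat"
    if gain < OUTLIER_MIN:
        return "typical"
    return "outlier"


def group_results(results):
    """Return a dict mapping each group name to the benchmarks in it."""
    groups = {"regression": [], "flat": [], "typical": [], "outlier": []}
    for name, percent in results.items():
        groups[classify(percent)].append(name)
    return groups


for label, names in group_results(SMT_ON_PERCENT).items():
    print(f"{label:>10}: {', '.join(names)}")
```

With these assumed cut-offs, Agisoft Photoscan (98.2%) lands in the "flat" group rather than with the two regressions, which matches how the text treats it.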

Overall

In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
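For software that wants to run one thread per core while SMT stays enabled, the usual approach is to pick a single logical CPU from each physical core. The sketch below shows that selection in Python. It assumes the sibling lists are strings in the Linux CPU-list style (for example `0,16` or `0-1`), such as those Linux exposes under `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names and the sample topology are hypothetical.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0,16" or "0-1" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Return the lowest-numbered logical CPU of each physical core.

    sibling_lists holds one sibling-list string per logical CPU, so each
    physical core appears once for every hardware thread it has.
    """
    cores = {tuple(parse_cpu_list(s)) for s in sibling_lists}
    return sorted(core[0] for core in cores if core)


# Hypothetical 4-core, 8-thread layout where CPU n is paired with CPU n+4.
siblings = ["0,4", "1,5", "2,6", "3,7", "0,4", "1,5", "2,6", "3,7"]
print(one_cpu_per_core(siblings))
```

On Linux, the resulting list could then be passed to something like `os.sched_setaffinity` to keep a worker pool off the sibling threads.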

Most real-world workloads see a small uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
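For readers who want to compute their own summary figure, the sketch below averages the 16 multi-threaded results listed in the table, assuming every row is weighted equally. It shows both an arithmetic and a geometric mean, since the two differ when results are spread this widely. It is not meant to reproduce the 22% figure, because the text does not state which tests or weighting that figure uses.

```python
import math

# SMT-on percentages from the multi-threaded table, in table order (SMT off = 100%).
SMT_ON_PERCENT = [
    98.2, 165.7, 177.5, 94.5, 106.6, 88.2, 125.1, 145.5,
    115.4, 126.0, 118.6, 107.9, 133.9, 104.9, 111.9, 109.3,
]


def arithmetic_mean(values):
    """Plain average: a few large outliers pull it upwards."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Average of ratios: less sensitive to a few large outliers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


am = arithmetic_mean(SMT_ON_PERCENT)
gm = geometric_mean(SMT_ON_PERCENT)
print(f"arithmetic mean uplift: {am - 100:+.1f}%")
print(f"geometric mean uplift:  {gm - 100:+.1f}%")
```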

126 Comments

warpuck - Friday, December 25, 2020 - link
With a R 5 1600 it makes about a 5-6% difference in usable clock speed (200-250 MHz), and also in temperature. With a R 7 3800X it is not as noticeable.
If you reduce the background operations while gaming with either CPU.
I don't know about recent game releases, but older ones only use 2-4 cores (threads), so clocking the R 5 1600 @ 3750 MHz (SMT on) vs 3975 MHz (SMT off) does make a difference on frame rates.
whatthe123 - Saturday, December 5, 2020 - link
it doesn't make much of a difference unless you go way past the TDP and have exotic cooling.

these CPUs are already boosting close to their limits at stock settings to maintain high gaming performance.
29a - Saturday, December 5, 2020 - link
There are a lot of different scenarios that would be interesting to see. I would like to see some testing with a dual core chip 2c/4t.
Netmsm - Thursday, December 3, 2020 - link
good point
Wilco1 - Friday, December 4, 2020 - link
I think that 5% area cost for SMT is marketing. If you only count the logic that is essential for SMT, then it might be 5%. However many resources need to be increased or doubled. Even if that helps single-threaded performance, it still adds a lot of area that you wouldn't need without SMT.

Graviton 2 proves that 2 small non-SMT cores will beat one big SMT core on multithreaded workloads using a fraction of the silicon and power.
peevee - Monday, December 7, 2020 - link
Except they are not faster, but whatever.
RickITA - Thursday, December 3, 2020 - link
Several compute applications do not need hyper-threading. A couple of official references:
1. Wolfram Mathematica: "Mathematica’s Parallel Computing suite does not necessarily benefit from hyper-threading, although certain kernel functionality will take advantage of it when it provides a speedup." [source: https://support.wolfram.com/39353?src=mathematica]. Indeed, Mathematica automatically sets up a number of threads equal to the number of physical cores of the CPU.
2. Intel MKL library. "Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. Intel MKL fits neither of these criteria as the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance when using Intel MKL without HT Technology enabled." [source: https://software.intel.com/content/www/us/en/devel...].

BTW Ian: Wolfram Mathematica has a benchmark mode [source: https://reference.wolfram.com/language/Benchmarkin...], please consider adding it to your test suite. Or something with Matlab.
realbabilu - Thursday, December 3, 2020 - link
Apparently Intel MKL, and Matlab, which uses Intel MKL, only allow AMD CPUs to use the non-AVX2 library. Only Linux with a preloaded fake-CPU library could get around this.
https://www.google.com/amp/s/simon-martin.net/2020...
RickITA - Thursday, December 3, 2020 - link
Not a matlab user, but this is no longer true since version 2020a. Source: https://www.extremetech.com/computing/308501-cripp...
leexgx - Saturday, December 5, 2020 - link
The "if not genuine Intel CPU" check disabled all optimisations (this rubbish has been going on for years; only in 2019 or 2020 did they actually start fixing their code to detect whether AVX is available. Even BTRFS had this problem: it wouldn't use hardware acceleration if it wasn't on an Intel CPU. Again, lazy coding.)
