CPU Performance

For simplicity, we are listing the percentage performance differentials for all of our CPU testing: the number shown is the performance with SMT2 enabled as a percentage of the SMT-disabled baseline. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
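
As a concrete reading of the tables below, an entry of 118.6% means the SMT-enabled run achieved 1.186x the SMT-disabled result. A minimal sketch of that normalization, using made-up scores, and assuming a higher-is-better metric (time-based results would need inverting before normalizing):

```cpp
#include <cstdio>

int main() {
    // Hypothetical raw scores for one benchmark (higher is better).
    double smt_off_score = 1000.0;  // score with SMT disabled (the 100% baseline)
    double smt_on_score  = 1186.0;  // score with SMT enabled
    double relative = 100.0 * smt_on_score / smt_off_score;
    std::printf("SMT Off = 100.0%%, SMT On = %.1f%%\n", relative);  // prints 118.6%
}
```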

Here are the single threaded results.

Single Threaded Tests (AMD Ryzen 9 5950X)
AnandTech            SMT Off (Baseline)    SMT On
y-Cruncher                 100%             99.5%
Dwarf Fortress             100%             99.9%
Dolphin 5.0                100%             99.1%
CineBench R20              100%             99.7%
Web Tests                  100%             99.1%
GeekBench (4+5)            100%            100.8%
SPEC2006                   100%            101.2%
SPEC2017                   100%             99.2%

Interestingly enough, our single-threaded performance was within roughly a single percentage point across the stack (SPEC2006 being the largest difference at +1.2%). Given that SMT-off mode should arguably give more resources to each thread for consistency, the fact that we see essentially no difference means that AMD’s implementation of giving a single thread access to all of a core’s resources, even in SMT mode, is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests (AMD Ryzen 9 5950X)
AnandTech              SMT Off (Baseline)    SMT On
Agisoft Photoscan            100%             98.2%
3D Particle Movement         100%            165.7%
3DPM with AVX2               100%            177.5%
y-Cruncher                   100%             94.5%
NAMD AVX2                    100%            106.6%
AIBench                      100%             88.2%
Blender                      100%            125.1%
Corona                       100%            145.5%
POV-Ray                      100%            115.4%
V-Ray                        100%            126.0%
CineBench R20                100%            118.6%
HandBrake 4K HEVC            100%            107.9%
7-Zip Combined               100%            133.9%
AES Crypto                   100%            104.9%
WinRAR                       100%            111.9%
GeekBench (4+5)              100%            109.3%

Here we have a number of different factors affecting the results.

Starting with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, and the memory bandwidth available per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a synthetic math benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so both are quite far away from real-world use cases.
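
To make the bandwidth argument concrete, here is a minimal, hypothetical sketch (not y-Cruncher or AIBench themselves): a streaming kernel whose run-time is set by DRAM bandwidth rather than by the core. The 16 vs 32 thread counts assume a 16-core/32-thread part and rely on the OS scheduler spreading threads across cores; once bandwidth is saturated at one thread per core, the extra SMT threads mostly add contention.

```cpp
// Streaming "triad"-style kernel: the same total work is split across either
// 16 threads (one per core) or 32 threads (two per core). Arrays are sized to
// be far larger than the L3 cache, so the loop is limited by DRAM bandwidth.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 25;           // 32M doubles = 256 MB per array
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 3.0);

    auto run = [&](unsigned threads) {
        auto worker = [&](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi; ++i)
                a[i] = b[i] + 3.0 * c[i];                  // 2 loads + 1 store per element
        };
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back(worker, n * t / threads, n * (t + 1) / threads);
        for (auto& th : pool) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    std::printf("16 threads (1 per core): %.3f s\n", run(16));
    std::printf("32 threads (2 per core): %.3f s\n", run(32));
}
```

If the two timings come out essentially equal (or the 32-thread run is slower), the workload is bandwidth-limited and SMT has nothing left to exploit, which is the pattern y-Cruncher and AIBench show above.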

Most of the rest of the benchmarks show a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system preventing both threads from getting absolutely full access, whether buffer size, retire rate, op-queue limits, or memory limits; each benchmark likely hits a different one.
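
One way to observe this interleaving (or its limits) directly is to pin two copies of a workload either onto the two hardware threads of a single core or onto two separate cores, and compare wall times; the gap between the two runs shows how much the copies contend inside one core for that particular kernel. A rough Linux-only sketch follows; the CPU numbering is an assumption (one common layout puts core N's SMT sibling at N + core count) and should be checked against /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, and the trivial stand-in loop here will show only a small gap, so substitute the kernel you actually care about.

```cpp
// Pin two copies of a workload to SMT siblings vs. to separate cores and time them.
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void work(int cpu) {
    pin_current_thread(cpu);
    volatile double x = 1.0;                  // volatile keeps the loop from being optimized away
    for (long i = 0; i < 400'000'000L; ++i)
        x = x * 1.0000001 + 1e-9;             // stand-in workload; replace with the kernel under test
}

static double run_pair(int cpu_a, int cpu_b) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(work, cpu_a), b(work, cpu_b);
    a.join();
    b.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    // Assumed mapping for a 16-core part: CPUs 0 and 16 share a core; 0 and 1 do not.
    std::printf("same core (SMT siblings): %.3f s\n", run_pair(0, 16));
    std::printf("different cores:          %.3f s\n", run_pair(0, 1));
}
```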

The outliers are 3DPM/3DPMavx and Corona. These three show gains of 45% or more, with the two 3DPM tests at +65.7% and +77.5%. All of these tests are very light on cache and memory requirements, and put Zen 3’s wider execution port distribution to good use. These benchmarks are compute-heavy as well, so splitting some of that memory access and compute across two threads in the core helps the SMT2 design mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches and less contention, which adds some extra performance.
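
A rough illustration of why this class of workload scales so well (a hypothetical stand-in, not 3DPM itself): a purely register-resident dependency chain leaves most of a core's execution slots idle, so a second SMT thread on the same core can fill them almost for free. Doubling the thread count from 16 to 32 on a 16C/32T part doubles the total work in this sketch; if the wall time grows by much less than 2x, SMT is adding real throughput.

```cpp
// Compute-bound, cache-light kernel: each thread works through its own serial
// floating-point chain with no memory traffic, the kind of code where SMT shines.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const long iters = 200'000'000;

    auto worker = [iters](double seed, double* out) {
        double x = seed;
        for (long i = 0; i < iters; ++i)
            x = x * 1.0000001 + 1e-9;                     // serial FMA chain
        *out = x;                                         // keep the result live
    };

    auto time_threads = [&](unsigned threads) {
        std::vector<double> sinks(threads);
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back(worker, 1.0 + t, &sinks[t]);
        for (auto& th : pool) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    std::printf("16 threads (16x work): %.3f s\n", time_threads(16));
    std::printf("32 threads (32x work): %.3f s\n", time_threads(32));
}
```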

Overall

In an ideal world, both threads on a core would have full access to all of its resources and never block each other; in effect, each thread would look as if it had a whole core to itself. The reverse SMT method, taking one global core and splitting it into virtual cores with no contention, is known as VISC; the company behind it was purchased by Intel a few years ago, but nothing has come of it yet. For now we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single-threaded benchmarks showed no uplift between SMT enabled and disabled, in either our real-world or synthetic workloads. This means that even in SMT-enabled mode, if only one thread is running on a core, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.

Most real-world workloads see a modest uplift, averaging +22%. Rendering and ray tracing can vary depending on the engine, and on how much bandwidth, cache, and core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.

Comments

  • Oxford Guy - Friday, December 4, 2020 - link

    Suggestions:

    Compare with Zen 2 and Zen 1, particularly in games.

    Explain SMT vs. CMT. Also, is SMT + CMT possible?
  • AntonErtl - Sunday, December 6, 2020 - link

    CMT has at least two meanings.

    Sun's UltraSparc T1 has in-order cores that run several threads alternatingly on the functional units. This is probably the closest thing to SMT that makes sense on an in-order core. Combining this with SMT proper makes no sense; if you can execute instructions from different threads in the same cycle, there is no need for an additional mechanism for processing them in alternate cycles. Instruction fetch on some SMT cores processes instructions in alternate cycles, though.

    The AMD Bulldozer and family have pairs of cores that share more than cores in other designs share (but less than with SMT): They share the I-cache, front end and FPU. As a result, running code on both cores of a pair is often not as fast as when running it on two cores of different pairs. You can combine this scheme with SMT, but given that it was not such a shining success, I doubt anybody is going to do it.

    Looking at roughly contemporary CPUs (Athlon X4 845 3.5GHz Excavator and Core i7 6700K 4.2GHz Skylake), when running the same application twice one after the other on the same core/thread vs. running it on two cores of the same pair or two threads of the same core, using two cores was faster by a factor of 1.65 on the Excavator (so IMO calling them cores is justified), and using two threads was faster by a factor of 1.11 on the Skylake. But Skylake was faster by a factor of 1.28 with two threads than Excavator with two cores, and by a factor of 1.9 when running only a single core/thread, so even on multi-threaded workloads a 4c/8t Skylake can beat an 8c Excavator (but AFAIK Excavators were not built in 8c configurations). The benchmark was running LaTeX.
  • Oxford Guy - Sunday, December 6, 2020 - link

    AMD's design was very inefficient in large part because the company didn't invest much into improving it. The decision was made, for instance, to stall high-performance with Piledriver in favor of a very very long wait for Zen. Excavator was made on a low-quality process and was designed to be cheap to make.

    Comparing a 2011/2012 design that was bad when it came out with Skylake is a bit of a stretch, in terms of what the basic architectural philosophy is capable of.

    I couldn't remember that fourth type (the first being standard multi-die CPU multiprocessing) so thanks for mentioning it (Sun's).
  • USGroup1 - Saturday, December 5, 2020 - link

    So yCruncher is far away from real world use cases and 3DPMavx isn't.
  • pc8086 - Sunday, December 6, 2020 - link

    Many congratulations to Dr. Ian Cutress for the excellent analysis carried out.

    If possible, it would be extremely interesting to repeat a similar rigorous analysis (at least on the multi-threaded subsection of chosen benchmarks) on the following platforms:
    - 5900X (Zen 3, but fewer cores for each chiplet, maybe with more thermal headroom)
    - 5800X (Zen 3, only a single computational chiplet, so no inter-CCX latency troubles)
    - 3950X (same cores and configuration, but with Zen 2, to check if the new, beefier core improved SMT support)
    - 2950X (Threadripper 2, same number of cores but Zen+, with 4 memory channels; useful especially for tests such as AIBench, which have gotten worse with SMT)
    - 3960X (Threadripper 3, more cores, but Zen 2 and with 4 memory channels)

    Obviously, it would be interesting to check Intel HyperThreading impact on recent Comet Lake, Tiger Lake and Cascade Lake-X.

    For the time being, Apple has decided not to use any form of SMT on its own CPUs, so it is worth fully understanding the usefulness of SMT technologies for notebooks, high-end PCs and prosumer platforms.

    Thank you very much.
  • eastcoast_pete - Sunday, December 6, 2020 - link

    Thanks Ian! Given some of your comments about memory access limiting performance in some cases, how much additional performance does (or would) a quad-channel memory setup give compared to the dual-channel consumer setups like these (or mine)? Now, I know that servers and actual workstations usually have 4 or more memory channels, and for good reason. So, in the time of 12 and 16 core CPUs, is it time for quad-channel memory access for the rest of us, or would that break the bank?
  • mapesdhs - Thursday, December 10, 2020 - link

    That's a good question. As time moves on and we keep getting more cores, with people doing more things that make use of them (such as gaming and streaming at the same time, with browser tabs open, livechat, perhaps an encode too), perhaps the plethora of cores does indeed need better memory bandwidth and parallelism, but maybe the end user would not yet tolerate the cost.

    Something I noticed about certain dual-socket S2011 mbds on Aliexpress is that they don't have as many memory channels as they claim, which with two CPUs does hurt performance of even consumer grade tasks such as video encoding:

    http://www.sgidepot.co.uk/misc/kllisre_analysis.tx...
  • bez5dva - Monday, December 7, 2020 - link

    Hi Dr. Cutress!

    Thanks for these interesting tests!
    Perhaps SMT is something that could drastically improve performance on more budget CPUs? Your CPU has more than enough shiny cores for these games, but what if you take a Ryzen 3100? I believe the percentages would be different, as they were in my real-world case :)
    Back then I had a 6600K @ 4500 and in some FPS games with huge maps and a lot of players (Heroes and Generals; Planetside 2) I started to get stutters in tight fights, but when I switched to a 6700 @ 4500 that was no longer the case. So I do believe that Hyper-Threading worked in my case, because my CPUs were identical aside from the virtual threads on the latter one.

    It would be super interesting to have this post updated with results from a cheaper sample 😇
  • peevee - Monday, December 7, 2020 - link

    It is clear that the 16-core Ryzen is power, memory, and thermally limited. I bet SMT results on the 8-core Ryzen 7 5800X would be much better for more workloads.
  • naive dev - Tuesday, December 8, 2020 - link

    The slide states that Zen 3 decodes 4 instructions/cycle. Are there two independent decoders which each decode those 4 instructions for a thread? Or is there a single decoder that switches between the program counters of both threads but only decodes instructions from one thread per cycle?
