Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
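As a rough illustration of how such a percentage can be derived from two raw scores, here is a minimal Python sketch. The function name `smt_relative_percent` and the raw numbers are hypothetical and are not taken from the test suite; the only thing borrowed from the text is the convention that SMT-off is the 100% baseline.

```python
def smt_relative_percent(score_off: float, score_on: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT-on performance as a percentage of SMT-off performance."""
    if score_off <= 0 or score_on <= 0:
        raise ValueError("scores must be positive")
    if higher_is_better:
        # Throughput-style score: a larger number with SMT on is a gain.
        return 100.0 * score_on / score_off
    # Run-time-style score: a shorter time with SMT on is a gain.
    return 100.0 * score_off / score_on


# Hypothetical raw numbers, not measurements from the article.
print(round(smt_relative_percent(200.0, 331.4), 1))
print(round(smt_relative_percent(60.0, 50.0, higher_is_better=False), 1))
```

The `higher_is_better` switch is there because a suite of this size mixes scores (higher is better) with run-times (lower is better), and both need to land on the same "percent of baseline" scale.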
Here are the single threaded results.
[Table: Single Threaded Tests, AMD Ryzen 9 5950X]
Interestingly enough our single threaded performance was within a single percentage point across the stack (SPEC being +1.2%). Given that ST mode should arguably give more resources to each thread for consistency, the fact that we see no difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.
The multithreaded tests are a bit more diverse:
|AMD Ryzen 9 5950X|SMT Off|SMT On|
|---|---|---|
|3D Particle Movement|100%|165.7%|
|3DPM with AVX2|100%|177.5%|
|HandBrake 4K HEVC|100%|107.9%|
Here we have a number of different factors affecting the results.
Starting with the two tests that scored statistically worse with SMT2 enabled: yCruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. yCruncher is arguably a math synthetic benchmark, and AIBench is still early-beta AI workloads for Windows, so quite far away from real world use cases.
Most of the rest of the benchmarks show a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that prevent both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, etc.; each benchmark is likely different.
The outliers are 3DPM, 3DPM with AVX2, and Corona. These three are +45% or more, with 3DPM at roughly +66% and higher. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to a greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
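The three groups described above (regressions, the +5% to +35% band, and the 45%+ outliers) can be sketched as a simple bucketing step. This is only an illustration: the function name and the "other" bucket are assumptions, and treating any result below 100% as a regression ignores the statistical-significance check the text implies. The sample values are the three rows shown in the table.

```python
def classify_smt_result(percent_with_smt: float) -> str:
    """Bucket an SMT-on result (SMT-off = 100%) using the bands in the text."""
    uplift = percent_with_smt - 100.0
    if uplift < 0:
        return "regression"   # assumption: no significance test applied here
    if uplift >= 45.0:
        return "outlier"      # the 45%+ group
    if 5.0 <= uplift <= 35.0:
        return "typical"      # the +5% to +35% band
    return "other"            # hypothetical bucket for the gaps between bands


# Rows taken from the multithreaded table above.
results = {
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "HandBrake 4K HEVC": 107.9,
}

for name, pct in results.items():
    print(f"{name}: {pct - 100.0:+.1f}% -> {classify_smt_result(pct)}")
```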
In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.
In our CPU results, the single threaded benchmarks showed no uplift with SMT enabled/disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.
For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.
Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
Most real-world workloads see a small uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
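For readers who want to reproduce this kind of summary figure, the sketch below averages SMT-on ratios in two common ways. The text does not say which averaging method produced the 22% figure, so both the arithmetic and geometric means are shown as assumptions. The input is only the three rows visible in the table above, so the result will not match the suite-wide average.

```python
from statistics import geometric_mean, mean

# SMT-on results as a percentage of SMT-off, from the three table rows above.
table_rows = {
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "HandBrake 4K HEVC": 107.9,
}

# Convert percentages to ratios (1.0 = no change) before averaging.
ratios = [pct / 100.0 for pct in table_rows.values()]

arithmetic_uplift = (mean(ratios) - 1.0) * 100.0
geometric_uplift = (geometric_mean(ratios) - 1.0) * 100.0

print(f"arithmetic mean uplift: {arithmetic_uplift:+.1f}%")
print(f"geometric mean uplift:  {geometric_uplift:+.1f}%")
```

The geometric mean is often preferred for ratios because a single large outlier pulls it up less than it pulls the arithmetic mean.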
MrSpadge - Thursday, December 3, 2020
> We’ve known for many years that having two threads per core is not the same as having two cores
True, and I still read this as an argument against SMT in forums. IMO it should be pointed out clearly that the cost of implementing either also differs drastically: +100% core size for another core and ~5% for SMT.
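The cost argument in this comment can be put into a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration: the area figures (+100% for a second core, ~5% for SMT) are the commenter's, the +22% SMT gain is the article's stated real-world average, and perfect +100% scaling for a second core is an optimistic assumption made here purely for comparison.

```python
def perf_per_area(relative_perf: float, relative_area: float) -> float:
    """Performance per unit of silicon area, both relative to one non-SMT core."""
    return relative_perf / relative_area


# One core with SMT: ~5% more area (commenter's figure), +22% performance
# (the article's stated real-world average).
smt = perf_per_area(relative_perf=1.22, relative_area=1.05)

# Two cores without SMT: +100% area (commenter's figure) and, as an
# optimistic assumption, perfect +100% performance scaling.
second_core = perf_per_area(relative_perf=2.00, relative_area=2.00)

print(f"SMT core, perf per area:  {smt:.3f}")
print(f"Two cores, perf per area: {second_core:.3f}")
```

Under these assumptions SMT comes out ahead per unit of area even though the second core delivers far more absolute performance, which is the distinction the comment is drawing.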
WaltC - Thursday, December 3, 2020
Intel began its HT journey in order to pull more efficiency from each core--basically, as performance was being left on the table. Interestingly enough, after Athlon and A64, AMD roundly criticized Intel because the SMT thread was not done by a "real core"...and then proceeded to drop cores with two integer units--which AMD then labeled as "cores"...;) Intel's HT approach proved superior, obviously. IIRC. It's been awhile so the memories are vague...;)
The only problem with this article is that it tries to make calls about SMT hardware design without really looking hard at the software, and the case for SMT is a case for SMT software. Games will not use more than 4-8 threads simultaneously so of course there is little difference between SMT on and off when running most games on a 5950. You would likely see near the same results on a 5600 in terms of gaming. SMT on or off when running these games leaves most of the CPU's resources untouched.
Programs designed and written to utilize a lot of threads, however, show a robust, healthy scaling with SMT on versus no SMT. So--without a doubt--SMT CPU design is superior to no SMT from the standpoint of the hardware's performance. The outlier is the software--not the hardware. And of course the hardware should never, ever be judged strictly by the software one arbitrarily decides to run on it. We learn a lot more about the limits of the software tested here than we learn about SMT--which is a solid performance design in CPU hardware.
WarlockOfOz - Friday, December 4, 2020
Very valid point about how games won't see a difference between 16 and 32 threads when they only use 6. Do you know if this type of analysis has been done at the lower end of the market?
WaltC - Friday, December 4, 2020
It's been common knowledge established a few years ago when AMD started pushing 8 core (and greater) CPUs that games don't require that many cores and that 6 cores is optimal for gaming right now. And if you do more than game, occasionally, and need more than 6 threads then SMT is there for you. As the new consoles are 8-core CPU designs, over time the number of cores required for optimal game performance will increase.
Flying Aardvark - Friday, December 4, 2020
Consoles are 8-core now, with 2 reserved for the OS. Count on 6-cores being optimal for gaming for quite some time.
Kangal - Friday, December 4, 2020
Those Jaguar cores were more like a 4c/8t processor, to be fair. And they weren't that much better than Intel's Atom cores, a far cry from Intel's Core-i SkyLake architecture. And current gen consoles were very light on the OS, so maybe using 1-full core (or 2-threads-shared) leaving only 3-cores for games, but much better than the 2-core optimised games from the PS3/360 era.
The new gen consoles will be somewhat similar, using only 1-full core (2-threads) reserved for the OS. But this time we have an architecture that's on-par with Intel's Core-i SkyLake, with a modern full 8-core processor (SMT/HT optional). This time leaving a healthy 7-cores that's dedicated to games. Optimisations should come sooner than later, and we'll see the effects on PC ports by 2022. So we should see a widening gap between 4vs6-core, and to a lesser extent 6vs8-core in the future. I wouldn't future-proof my rig by going for a 5700x instead of a 5600x, I would do that for the next round (ie 2022 Zen4).
AntonErtl - Sunday, December 6, 2020
The 8 Jaguar cores are in no way like 4c/8t CPUs; if you use only half of them, you get half the performance (unless your application is memory/L2-bandwidth-limited). Their predecessor Bobcat is about twice as fast as a Bonnell core (Atom proper), and a little slower than Silvermont (the core that replaced Bonnell), about half as fast as Goldmont+ (all at the clock rates at which they were available in fanless mini-ITX boards), one third as fast as a 3.5GHz Excavator core, and one sixth as fast as a 4.2GHz Skylake.
Oxford Guy - Sunday, December 6, 2020
Worse IPC than Bulldozer as far as I know. Certainly worse than Piledriver.
Really sad. The "consoles" should have used something better than Jaguar. It's bad enough that the "consoles" are a parasitic drain on PC gaming in the first place. It's worse when they not only drain life with their superfluous walled gardens but also by foisting such a low-grade CPU onto the art.
Kangal - Thursday, December 24, 2020
The Jaguar cores share a lot of DNA with Bulldozer, but they aren't the same. It's like Intel's Atom chips compared to Intel Core-i chips. With that said, 2015 Puma+ was a slight improvement over 2013 Jaguar, which was a modest improvement over the initial 2011 Bobcat lineup. All this started in 2006 with AMD choosing to evolve their earlier Phenom2 cores which are derivatives of the AMD Athlon-64.
So just by their history, we can see they're in line with Intel's Atom architecture evolution, and basically a direct competitor. Intel had slightly less performance, but had much lower power-draw... making them the obvious winner. Leaving AMD to fill in the budget segments of the market.
As for the core arrangement, they don't have full proper cores as people expect them. Like the Bulldozer architecture, each core had to share resources like the decoder and floating-point unit. So in many instances, one core would have to wait for the other core. This boosts multithreaded performance with simple calculations in orderly patterns. However, with more complex calculations and erratic/dynamic patterns (ie Regular PC use), it causes a hit to the single-thread performance and notable hiccups. So my statement was true. This is more like a 4c/8t chipset, and it is less like a Core-i and much more like an Atom. But don't take my word for it, take Dr Ian Cutress. He said the same thing during the deep dive into the Jaguar microarchitecture, and recently in the Chuwi Aerobox (Xbox One S) article.
Now, there have been huge benefits to the Gaming PC industry, and game ports, due to the PS4/XB1. The first being the x86-64bit direct compatibility. Second was the cross-compatibility thanks to Vulkan and DirectX (moreso with PS4 Pro and XB1X). The third being that it forced game developers to innovate their game engines, so that they're less narrow and more multi-threaded. With PS5/XseX we now see a second huge push with this philosophy, and the improvements of fast single-thread performance and fast-flash storage access. So I think while we have legitimate reasons to groan about the architecture (especially in the PS4) upon release, we do have to recognize the conveniences that they also brought (especially in the XB1X). This is just to show that my stance wasn't about console bashing.
at_clucks - Monday, December 7, 2020
@Kangal, Jaguar APUs in consoles are definitely not "like a 4c/8t processor" because they don't use CMT. They are full 8 cores. Their IPC may be comparable with some newer Atoms although it's hard to benchmark how the later "Evolved Jaguar" cores in the mid generation console refresh compare against the regular Jaguar or Atom.