Conclusions: SMT On

I wasn’t too sure what we were going to see when I started this testing. I know the theory behind implementing SMT, what it means for instruction streams sharing access to core resources, and how cores designed with SMT in mind from the start are built differently from cores that only ever run one thread per core. But theory only gets you so far. Aside from all the forum messages over the years discussing performance gains or losses when a product has SMT enabled, and the few demonstrations of server processors running focused workloads with SMT disabled, it is worth testing real workloads to find out whether there is a difference at all.

Results Overview

In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.

In single-threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance when SMT was enabled – all of our workloads were within 1% either side.

In multi-threaded workloads, we saw an average uplift in performance of +22% when SMT was enabled. Most of our tests showed gains of +5% to +35%. A couple of workloads scored worse, mostly due to resource contention from having so many threads in play – the limit here is memory bandwidth per thread. One workload scored +60%: a computational workload with little-to-no memory requirements. That workload scored even better in AVX2 mode, showing that there is still some bottleneck that gets alleviated with fewer instructions.
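By way of illustration (this is a rough sketch, not the scripts used for this article, and the benchmark names and scores below are hypothetical), an "average uplift" figure like this is commonly computed as the geometric mean of the SMT-on/SMT-off score ratios:

```python
# Illustrative sketch: average SMT uplift as the geometric mean of per-test
# score ratios. Benchmark names and scores are hypothetical, not measured data.
from statistics import geometric_mean

scores_smt_off = {"render": 100.0, "encode": 250.0, "compile": 80.0}
scores_smt_on  = {"render": 121.0, "encode": 262.0, "compile": 104.0}

ratios = [scores_smt_on[t] / scores_smt_off[t] for t in scores_smt_off]
for test, ratio in zip(scores_smt_off, ratios):
    print(f"{test}: {(ratio - 1) * 100:+.1f}%")

print(f"Average uplift (geomean): {(geometric_mean(ratios) - 1) * 100:+.1f}%")
```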

On gaming, overall there was no difference between SMT On and SMT Off, although some games may show differences in CPU-limited scenarios. Deus Ex was down almost 10% when CPU limited, while Borderlands 3 was up almost 10%. As we moved to a more GPU-limited scenario, those discrepancies were neutralized, with a few games still gaining single-digit percentage point improvements with SMT enabled.

For power and performance, we tested two examples where running two threads per core gave either no improvement (Agisoft) or a significant improvement (3DPMavx). In both cases, SMT Off mode (1 thread/core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core gave a big performance increase, the power in that mode was also lower, and there was a significant +91% performance-per-watt improvement from enabling SMT.
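As a quick reference for how that figure is derived, performance per watt is simply the benchmark score divided by the power drawn while producing it; the sketch below uses hypothetical score and power values (not our measured data) to show how a higher score combined with slightly lower power compounds into a large perf-per-watt gain:

```python
# Illustrative sketch: performance per watt for SMT Off vs SMT On.
# Score and power values are hypothetical stand-ins, not measured results.
def perf_per_watt(score: float, package_watts: float) -> float:
    return score / package_watts

smt_off = perf_per_watt(score=1000.0, package_watts=130.0)  # 1 thread/core
smt_on  = perf_per_watt(score=1780.0, package_watts=121.0)  # 2 threads/core

print(f"SMT Off: {smt_off:.2f} pts/W, SMT On: {smt_on:.2f} pts/W")
print(f"Improvement: {(smt_on / smt_off - 1) * 100:+.0f}%")
```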

What Does This Mean?

I mentioned at the beginning of the article that SMT performance gains can be seen from two different viewpoints.

The first is that if SMT enables more performance, then it’s an easy switch to use, and some users consider that if you can get perfect scaling, then SMT is an effective design.

The second is that if SMT enables too much performance, then it’s indicative of a bad core design: if you can get perfect scaling with SMT2, then perhaps something is wrong with the design of the core and the bottleneck is quite bad.

Having poor SMT scaling doesn’t always mean that the SMT is badly implemented – it can also imply that the core design is very good. And if an effective SMT design can be interpreted as a poor core design, then it’s quite easy to see that vendors can’t have it both ways. Every core design has deficiencies (that much is true), and both Intel and AMD will tell their users that SMT enables the system to pick up extra bits of performance where workloads can take advantage of it, and that for real-world use cases there are very few downsides.

We’ve known for many years that having two threads per core is not the same as having two cores – in a worst case scenario there is some performance regression as more threads fight for cache space, but those use cases seem to be highly specialized for HPC and supercomputer-like tasks. SMT in the real world fills in gaps where they are available, and this occurs mostly in heavily multi-threaded applications with no cache contention. In the best case, SMT offers a sizeable performance-per-watt increase. But on average, there are small (+22% on MT) gains to be had, and gaming performance isn’t disturbed, so it is worth keeping enabled on Zen 3.

 
Comments

  • Bomiman - Saturday, December 5, 2020 - link

    That common knowledge is a few years old now. It was once common knowledge that games only used one thread.

    Consoles now have 3 times as many threads as before, and that's in a situation where 4T CPUs are barely usable and 4C/8T CPUs are obsolete.
  • MrPotatoeHead - Tuesday, December 15, 2020 - link

    Xbox360 came out in 2005. 3C/6T. Even the PS3 had a 1C/2T PowerPC PPE and 6 SPEs, so a total of 8T. PS4/XO is 8C/8T. Though I guess we could still blame the lack of CPU utilization this last generation on the consoles using pretty weak cores from the get-go. IIRC an 8-core Jaguar was on par with an Intel i3 at the time of those console releases.

    Though, the only other option AMD had was Piledriver. Piledriver was still a poor performer and a power hog, and it would likely only have been worth it over 8 Jaguar cores if they had gone with a 3 or 4 module chip.

    It is nice that this generation MS and Sony both went all out on the CPU. Just too bad they aren't Zen 3 based. :(
  • Dolda2000 - Friday, December 4, 2020 - link

    It should be kept in mind that, at the time AMD criticized Intel for that, AMD had actual dual-cores (A64x2) while Intel still had single-cores with HT, which makes the criticism rather fair.
  • Xajel - Sunday, December 6, 2020 - link

    "Intel's HT approach proved superior".

    Intel's approach wasn't that much superior. In fact, in the early days of Intel's HTT processors, many applications, even ones that were supposed to be optimised for a multi-core code path, were getting lower scores with HTT enabled than with it disabled.

    The main culprit was that applications were designed to put each thread on a real core, not two threads on a single core, so the threads ended up fighting for resources that weren't there.

    Intel knew this and worked hard with developers to make them aware of the difference and apply the change to their code paths. It actually took some time until multi-core applications were SMT-aware and had a code path for it.

    In AMD's case, AMD couldn't work with developers as hard as Intel did to get a new code path just for AMD CPUs. Not to mention that Intel was playing dirty, starting with their famous compiler, which was, and still is, used by most developers to compile applications: it optimises the code for Intel's CPUs, with an optimised code path for every CPU and CPU feature Intel has, but when the application detects a non-Intel CPU, including AMD's, it selects the slowest code path and won't test for the feature and choose a code path accordingly.

    This also applied to AMD's CPUs. Sure, the CPUs lacked FPU performance and were not competitive enough even when the software was optimised, but the whole optimisation situation made AMD's CPUs look less efficient than they were. The idea should have worked better than Intel's, because there is actual real hardware there (at least for integer), but developers didn't put in the extra work, and the Intel compiler also played a major role for smaller developers.

    TL;DR: the main issues were the Intel compiler and a lack of developer interest, and the actual cores were also not that much stronger than Intel's on the IPC side. AMD's idea should have worked, but things weren't on their side.

    And by the time AMD came out with their design, they were already late: applications were already optimised for Intel HTT, which had become very good as almost all applications became SMT-aware. AMD acknowledged this and knew they had to build on what developers already had; they also worked hard on their SMT implementation, to the point that their SMT is now touted as better than Intel's own implementation (HTT).
  • Keljian - Sunday, January 10, 2021 - link

    Urm no, Intel's compiler isn't used often these days unless you're doing really heavy maths. Microsoft's compiler is used much more often, though Clang is taking off.
  • pogsnet - Tuesday, December 29, 2020 - link

    During the P4 era, HT gave no difference in performance compared to AMD64, but on Core 2 Duo it did show better performance. Probably because we had only 2-4 cores, which was not enough for our multitasking needs. Now we have 4-32 cores, and much more powerful and efficient cores, so SMT may not be that significant any more, which is why on most tests it shows no big performance lift.
  • willis936 - Thursday, December 3, 2020 - link

    5%? I think more than 5% is needed for a whole second set of registers plus the logic needed to properly handle context switching. Everything in between the cache and pipeline needs to be doubled.
  • tygrus - Thursday, December 3, 2020 - link

    Register renaming means they already have more physical registers than the logical registers exposed to the programmer, so a full second set doesn't need to be added. Say you have 16 logical registers exposed to the coder per thread and 128 rename registers in hardware; with SMT at 2 threads/core, each thread still sees the same 16 logical registers, but has 64 rename registers to work with instead of 128.
    Compare mixing workloads, e.g. 8 int/branch-heavy threads with 8 FP-heavy threads on an 8-core chip, or OS background tasks like indexing/search/antivirus.
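A toy model of the arithmetic described in the comment above (the 16/128 figures are the comment's example numbers; the even static split between threads is an assumption for illustration, as real cores often share the rename pool more dynamically):

```python
# Toy model of architectural vs. physical (rename) registers under SMT.
# The even static split per thread is an assumption for illustration only.
ARCH_REGS_PER_THREAD = 16   # logical registers the ISA exposes per thread
PHYSICAL_REGS = 128         # rename registers in hardware (example figure)

for threads_per_core in (1, 2):
    rename_per_thread = PHYSICAL_REGS // threads_per_core
    print(f"{threads_per_core} thread(s)/core: "
          f"{ARCH_REGS_PER_THREAD} architectural registers visible per thread, "
          f"~{rename_per_thread} rename registers available per thread")
```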
  • MrSpadge - Thursday, December 3, 2020 - link

    The 5% is from Intel for the original Pentium 4. At some point in the last 10 years I think I read a comparable number, probably here on AT, regarding a more modern chip.
  • Wilco1 - Friday, December 4, 2020 - link

    There is little accurate info about it, but the fact is that x86 cores are many times larger than Arm cores with similar performance, so it must be a lot more than 5%. Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area (and half the power).
