Conclusions: SMT On

I wasn’t too sure what we were going to see when I started this testing. I know the theory behind implementing SMT, what it means for instruction streams having access to core resources, and how cores designed with SMT in mind from the start are built differently from cores that run one thread per core. But theory only gets you so far. Beyond all the forum messages over the years debating performance gains and losses when a product has SMT enabled, and the few demonstrations of server processors running focused workloads with SMT disabled, it is worth testing real workloads to find out whether there is a difference at all.

Results Overview

In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.

In single-threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance with SMT enabled – all of our workloads were within 1% either way.

In multi-threaded workloads, we saw an average performance uplift of +22% with SMT enabled. Most of our tests gained between +5% and +35%. A couple of workloads scored worse, mostly due to resource contention with so many threads in play – the limit here is memory bandwidth per thread. One workload gained +60%: a computational workload with little-to-no memory requirements. This workload scored even better in AVX2 mode, showing that there is still some bottleneck that is alleviated with fewer instructions.
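To make the arithmetic behind a headline figure like "+22% on average" concrete, here is a minimal sketch using hypothetical benchmark scores (these are illustrative numbers, not the article's raw data). It computes the per-test uplift from enabling SMT and an overall average via the geometric mean of the score ratios, which keeps one outlier from dominating:

```python
# Illustrative only: hypothetical scores, not the article's measurements.
from math import prod

# Score pairs: (SMT off, SMT on), higher is better.
scores = {
    "render":  (100.0, 130.0),  # +30%
    "encode":  (100.0, 112.0),  # +12%
    "compute": (100.0, 160.0),  # +60%, little memory pressure
    "ai":      (100.0, 95.0),   # -5%, memory-bandwidth contention
}

def uplift(off: float, on: float) -> float:
    """Percentage gain from enabling SMT for one benchmark."""
    return (on / off - 1.0) * 100.0

per_test = {name: uplift(off, on) for name, (off, on) in scores.items()}

# Geometric mean of the SMT-on/SMT-off ratios, expressed as a percentage gain.
ratios = [on / off for off, on in scores.values()]
geo_mean_gain = (prod(ratios) ** (1.0 / len(ratios)) - 1.0) * 100.0
```

With these made-up scores the geometric-mean uplift lands near +22%, mirroring the shape of the article's result: mostly moderate gains, one large outlier, and one regression.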

In gaming, there was overall no difference between SMT On and SMT Off, though some games can show differences in CPU-limited scenarios. Deus Ex was down almost 10% when CPU limited, while Borderlands 3 was up almost 10%. As we moved to a more GPU-limited scenario, those discrepancies were neutralized, with a few games still gaining single-digit percentage improvements with SMT enabled.

For power and performance, we tested two examples where running two threads per core saw either no improvement (Agisoft) or a significant improvement (3DPMavx). In both cases, SMT Off mode (one thread per core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core gave a big performance increase, the power in that mode was also lower, and there was a significant +91% performance-per-watt improvement from enabling SMT.
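A perf-per-watt gain that large follows naturally when performance rises while power falls, since the two ratios compound. A short sketch of the arithmetic, using hypothetical ratios (the article's underlying score and wattage figures are not reproduced here):

```python
# Illustrative only: hypothetical ratios, not the article's measurements.
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative perf/watt change from enabling SMT, in percent.

    perf_ratio:  SMT-on score / SMT-off score
    power_ratio: SMT-on average power / SMT-off average power
    """
    return (perf_ratio / power_ratio - 1.0) * 100.0

# Example: +78% performance at 7% lower power compounds to roughly +91%.
gain = perf_per_watt_gain(1.78, 0.93)
```

The point is that even a modest power reduction amplifies a performance gain when expressed per watt, which is why the 3DPMavx result looks so dramatic.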

What Does This Mean?

I mentioned at the beginning of the article that SMT performance gains can be seen from two different viewpoints.

The first is that if SMT enables more performance, then it’s an easy switch to flip, and some users consider that if you can get perfect scaling, then SMT is an effective design.

The second is that if SMT enables too much extra performance, then it’s indicative of a bad core design. If you can get perfect scaling with SMT2, then perhaps something is wrong with the design of the core, and the underlying bottleneck is quite bad.

Having poor SMT scaling doesn’t always mean that the SMT is badly implemented – it can also imply that the core design is very good. If an effective SMT design can be interpreted as a poor core design, then it’s easy to see that vendors can’t have it both ways. Every core design has deficiencies (that much is true), and both Intel and AMD will tell their users that SMT enables the system to pick up extra bits of performance where workloads can take advantage of it, and that for real-world use cases there are very few downsides.

We’ve known for many years that having two threads per core is not the same as having two cores – in a worst-case scenario there is some performance regression as more threads fight for cache space, but those use cases seem to be highly specialized for HPC and supercomputer-like tasks. SMT in the real world fills in the gaps where gaps are available, and this occurs mostly in heavily multi-threaded applications with little cache contention. In the best case, SMT offers a sizeable performance-per-watt increase. On average there are modest gains to be had (+22% in MT workloads), and gaming performance isn’t disturbed, so it is worth keeping SMT enabled on Zen 3.

Comments

  • Oxford Guy - Friday, December 4, 2020 - link

    Suggestions:

    Compare with Zen 2 and Zen 1, particularly in games.

    Explain SMT vs. CMT. Also, is SMT + CMT possible?
  • AntonErtl - Sunday, December 6, 2020 - link

    CMT has at least two meanings.

    Sun's UltraSparc T1 has in-order cores that run several threads alternatingly on the functional units. This is probably the closest thing to SMT that makes sense on an in-order core. Combining this with SMT proper makes no sense; if you can execute instructions from different threads in the same cycle, there is no need for an additional mechanism for processing them in alternate cycles. Instruction fetch on some SMT cores processes instructions in alternate cycles, though.

    The AMD Bulldozer and family have pairs of cores that share more than cores in other designs share (but less than with SMT): They share the I-cache, front end and FPU. As a result, running code on both cores of a pair is often not as fast as when running it on two cores of different pairs. You can combine this scheme with SMT, but given that it was not such a shining success, I doubt anybody is going to do it.

    Looking at roughly contemporary CPUs (Athlon X4 845 3.5GHz Excavator and Core i7 6700K 4.2GHz Skylake), when running the same application twice one after the other on the same core/thread vs. running it on two cores of the same pair or two threads of the same core, using two cores was faster by a factor 1.65 on the Excavator (so IMO calling them cores is justified), and using two threads was faster by a factor 1.11 on the Skylake. But Skylake was faster by a factor 1.28 with two threads than Excavator with two cores, and by a factor 1.9 when running only a single core/thread, so even on multi-threaded workloads a 4c/8t Skylake can beat an 8c Excavator (but AFAIK Excavators were not built in 8c configurations). The benchmark was running LaTeX.
  • Oxford Guy - Sunday, December 6, 2020 - link

    AMD's design was very inefficient in large part because the company didn't invest much into improving it. The decision was made, for instance, to stall high-performance with Piledriver in favor of a very very long wait for Zen. Excavator was made on a low-quality process and was designed to be cheap to make.

    Comparing a 2011/2012 design that was bad when it came out with Skylake is a bit of a stretch, in terms of what the basic architectural philosophy is capable of.

    I couldn't remember that fourth type (the first being standard multi-die CPU multiprocessing) so thanks for mentioning it (Sun's).
  • USGroup1 - Saturday, December 5, 2020 - link

    So yCruncher is far away from real world use cases and 3DPMavx isn't.
  • pc8086 - Sunday, December 6, 2020 - link

    Many congratulations to Dr. Ian Cutress for the excellent analysis carried out.

    If possible, it would be extremely interesting to repeat a similar rigorous analysis (at least on the multi-threaded subsection of chosen benchmarks) on the following platforms:
    - 5900X (Zen 3, but fewer cores for each chiplet, maybe with more thermal headroom)
    - 5800X (Zen 3, only a single computational chiplet, so no inter-CCX latency troubles)
    - 3950X (same cores and configuration, but with Zen 2, to check if the new, beefier core improved SMT support)
    - 2950X (Threadripper 2, same number of cores but Zen+, with 4 memory channels; useful especially for tests such as AIBench, which have gotten worse with SMT)
    - 3960X (Threadripper 3, more cores, but Zen 2 and with 4 memory channels)

    Obviously, it would be interesting to check Intel HyperThreading impact on recent Comet Lake, Tiger Lake and Cascade Lake-X.

    For the time being, Apple has decided not to use any form of SMT on its own CPUs, so it is useful to fully understand the usefulness of SMT technologies for notebooks, high-end PCs and prosumer platforms.

    Thank you very much.
  • eastcoast_pete - Sunday, December 6, 2020 - link

    Thanks Ian! Given some of your comments about memory access limiting performance in some cases, how much additional performance would a quad-channel memory setup give compared to the dual-channel consumer setups (like these, or mine)? Now, I know that servers and actual workstations usually have 4 or more memory channels, and for good reason. So, in the time of 12 and 16 core CPUs, is it time for quad-channel memory access for the rest of us, or would that break the bank?
  • mapesdhs - Thursday, December 10, 2020 - link

    That's a good question. As time moves on and we keep getting more cores, with people doing more things that make use of them (such as gaming and streaming at the same time, with browser/tabs open, livechat, perhaps an encode too), perhaps indeed the plethora of cores does need better memory bandwidth and parallelism, but maybe the end user would not yet tolerate the cost.

    Something I noticed about certain dual-socket S2011 mbds on Aliexpress is that they don't have as many memory channels as they claim, which with two CPUs does hurt performance of even consumer grade tasks such as video encoding:

    http://www.sgidepot.co.uk/misc/kllisre_analysis.tx...
  • bez5dva - Monday, December 7, 2020 - link

    Hi Dr. Cutress!

    Thanks for these interesting tests!
    Perhaps SMT is something that could drastically improve more budget CPUs' performance? Your CPU has more than enough shiny cores for these games, but what if you take a Ryzen 3100? I believe the percentage would be different, as it was in my real-world case :)
    Back then I had a 6600K@4500 and in some FPS games with huge maps and a lot of players (Heroes and Generals; Planetside 2) I started to get stutters in tight fights, but when I switched to a 6700@4500 it wasn't the case anymore. So I do believe that Hyperthreading helped in my case, cuz my CPUs were identical aside from the virtual threads on the latter.

    Would be super interesting to have this post updated with results from a cheaper sample 😇
  • peevee - Monday, December 7, 2020 - link

    It is clear that 16-core Ryzen is power, memory and thermally limited. I bet SMT results on 8-core Ryzen 7 5800x would be much better for more loads.
  • naive dev - Tuesday, December 8, 2020 - link

    The slide states that Zen 3 decodes 4 instructions/cycle. Are there two independent decoders which each decode those 4 instruction for a thread? Or is there a single decoder that switches between the program counters of both threads but only decodes instructions of one thread per cycle?
