Conclusions: SMT On

I wasn’t too sure what we were going to see when I started this testing. I know the theory behind implementing SMT, what it means for instruction streams having access to core resources, and how cores designed with SMT in mind from the start are built differently from cores that run just one thread. But theory only gets you so far. Beyond all the forum messages over the years discussing performance gains/losses when a product has SMT enabled, and the few demonstrations of server processors running focused workloads with SMT disabled, it is worth testing real workloads to find out if there is a difference at all.

Results Overview

In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.
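As an aside for anyone replicating this at home: one quick way to sanity-check the SMT state on a Linux box is the kernel's SMT control interface. The sketch below is an illustration assuming a modern kernel that exposes /sys/devices/system/cpu/smt; it is not a description of the test setup used in this article.

```python
# Minimal sketch: checking the SMT state on a Linux test system.
# Assumes a modern kernel exposing /sys/devices/system/cpu/smt; this is
# an illustration, not the setup used in the article's testing.
import os
from pathlib import Path

def smt_state() -> str:
    """Return the kernel's SMT state, e.g. 'on', 'off', or 'notsupported'."""
    ctrl = Path("/sys/devices/system/cpu/smt/control")
    return ctrl.read_text().strip() if ctrl.exists() else "unknown"

def logical_and_physical():
    """Count logical CPUs and unique physical cores via sysfs topology."""
    logical = os.cpu_count() or 0
    cores = set()
    for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        core = cpu / "topology" / "core_id"
        pkg = cpu / "topology" / "physical_package_id"
        if core.exists() and pkg.exists():
            cores.add((pkg.read_text().strip(), core.read_text().strip()))
    return logical, len(cores)

if __name__ == "__main__":
    logical, physical = logical_and_physical()
    print(f"SMT control: {smt_state()}")
    print(f"{logical} logical CPUs across {physical} physical cores")
```

With SMT enabled on a 16-core part you would expect 32 logical CPUs across 16 physical cores; with it disabled, 16 and 16.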

In single threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance when SMT was enabled – all of our workloads were within 1% either side.

In multi-threaded workloads, we saw an average uplift in performance of +22% when SMT was enabled. Most of our tests scored a +5% to +35% gain in performance. A couple of workloads scored worse, mostly due to resource contention from having so many threads in play – the limit here is memory bandwidth per thread. One workload scored +60%: a computational workload with little-to-no memory requirements. This workload scored even better in AVX2 mode, showing that there is still some bottleneck that gets alleviated with fewer instructions.
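For reference, the scaling numbers quoted here are simple ratios of SMT On to SMT Off scores. A minimal sketch of that arithmetic, using hypothetical scores rather than our actual benchmark data:

```python
# Minimal sketch of the scaling arithmetic used above; the scores here
# are hypothetical placeholders, not the article's benchmark data.

def smt_uplift(score_smt_off: float, score_smt_on: float) -> float:
    """Percentage gain from enabling SMT (positive = SMT On is faster)."""
    return (score_smt_on / score_smt_off - 1.0) * 100.0

# Hypothetical multi-threaded scores (higher is better)
results = {
    "render":   (1000.0, 1320.0),  # ample idle execution slots -> big gain
    "compress": (1000.0, 1050.0),  # closer to memory-bandwidth bound
    "simulate": (1000.0,  960.0),  # threads contending for cache -> regression
}

for name, (off_score, on_score) in results.items():
    print(f"{name:>8}: {smt_uplift(off_score, on_score):+.1f}%")
```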

On gaming, overall there was no difference between SMT On and SMT Off, although some games may show differences in CPU-limited scenarios. Deus Ex was down almost 10% when CPU limited, whereas Borderlands 3 was up almost 10%. As we moved to a more GPU-limited scenario, those discrepancies were neutralized, with a few games still gaining single-digit percentage points with SMT enabled.

For power and performance, we tested two examples where running two threads per core either saw no improvement (Agisoft) or a significant improvement (3DPMavx). In both cases, SMT Off mode (one thread per core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core gave a big performance increase, the power in that mode was also lower, resulting in a significant +91% performance per watt improvement from enabling SMT.
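To make the performance-per-watt arithmetic explicit, here is a minimal sketch; the numbers are hypothetical placeholders, not our measured Agisoft/3DPMavx data:

```python
# Minimal sketch of the performance-per-watt comparison; all numbers are
# hypothetical placeholders, not the article's measured data.

def perf_per_watt(score: float, watts: float) -> float:
    return score / watts

# Hypothetical: SMT On scores ~1.85x higher while drawing slightly less power
ppw_off = perf_per_watt(score=1000.0, watts=140.0)  # 1 thread/core
ppw_on  = perf_per_watt(score=1850.0, watts=136.0)  # 2 threads/core

gain = (ppw_on / ppw_off - 1.0) * 100.0
print(f"Perf-per-watt change from enabling SMT: {gain:+.0f}%")  # ~ +90%
```

Because the ratio compounds a higher score with lower power draw, the perf-per-watt gain can be much larger than the raw performance gain alone.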

What Does This Mean?

I mentioned at the beginning of the article that SMT performance gains can be seen from two different viewpoints.

The first is that if SMT enables more performance, then it’s an easy switch to use, and some users consider that if you can get perfect scaling, then SMT is an effective design.

The second is that if SMT enables too much performance, then it’s indicative of a bad core design. If you can get perfect scaling with SMT2, then perhaps something is wrong with the design of the core and the bottleneck is quite bad.

Having poor SMT scaling doesn’t always mean that SMT is badly implemented – it can also imply that the core design is very good. If an effective SMT design can be interpreted as a poor core design, then it’s quite easy to see that vendors can’t have it both ways. Every core design has deficiencies (that much is true), and both Intel and AMD will tell their users that SMT enables the system to pick up extra bits of performance where workloads can take advantage of it, and that for real-world use cases, there are very few downsides.

We’ve known for many years that having two threads per core is not the same as having two cores – in a worst-case scenario, there is some performance regression as more threads fight for cache space, but those use cases seem to be highly specialized for HPC and supercomputer-like tasks. SMT in the real world fills in the gaps where they are available, and this occurs mostly in heavily multi-threaded applications with no cache contention. In the best case, SMT offers a sizeable performance per watt increase. But on average, there are small (+22% on MT) gains to be had, and gaming performance isn’t disturbed, so it is worth keeping enabled on Zen 3.

 
Comments

  • MrSpadge - Friday, December 4, 2020 - link

    That doesn't mean SMT is mainly responsible for that. The x86 decoders are a lot more complex. And at the top end you get diminishing performance returns for additional die area.
  • Wilco1 - Friday, December 4, 2020 - link

    I didn't say all the difference comes from SMT, but it can't be the x86 decoders either. A Zen 2 without L2 is ~2.9 times the size of a Neoverse N1 core in 7nm. That's a huge factor. So 2 N1 cores are smaller and significantly faster than 1 SMT2 Zen 2 core. Not exactly an advertisement for SMT, is it?
  • Dolda2000 - Friday, December 4, 2020 - link

    >Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area
    To be honest, it wouldn't surprise me one bit if 90% of the area gives 10% of the performance. Wringing out that extra 1% single-threaded performance here or there is the name of the game nowadays.

    Also there are many other differences that probably cost a fair bit of silicon, like wider vector units (NEON is still 128-bit, and exceedingly few ARM cores implement SVE yet).
  • Wilco1 - Saturday, December 5, 2020 - link

    It's nowhere near as bad with Arm designs showing large gains every year. Next generation Neoverse has 40+% higher IPC.

    Yes, Arm designs opt for smaller SIMD units. It's all about getting the most efficient use out of transistors. Having huge 512-bit SIMD units adds a lot of area, power and complexity with little performance gain in typical code. That's why 512-bit SVE is used in HPC and nowhere else.

    So with Arm you get many different designs that target specific markets. That's more efficient than one big complex design that needs to address every market and isn't optimal in any.
  • whatthe123 - Saturday, December 5, 2020 - link

    The extra 20% of performance is difficult to achieve. You can already see it on Zen CPUs, where 16-core designs are dramatically more efficient per core in multithread running at around 3.4 GHz, vs 8-core designs running at 4.8 GHz. I've always hated these comparisons with ARM for this reason... you need a part with 1:1 watt parity to make a fair comparison, otherwise 80% performance at half the power can also be accomplished even on x86 by just reducing frequency and upping core count.
  • Wilco1 - Saturday, December 5, 2020 - link

    Graviton clocks low to conserve power, and still gets close to Rome. You can easily clock it higher - Ampere Altra does clock the same N1 core 32% higher. So that 20-25% gap is already gone. We also know about the next generation (Neoverse N2 and V1) which have 40+% higher IPC.

    Yes adding more cores and clocking a bit lower is more efficient. But that's only feasible when your core is small! Altra Max has 128 cores on a single die, and I don't think we'll see AMD getting anywhere near that in the next few years even with chiplets.
  • peevee - Monday, December 7, 2020 - link

    It is obviously a lot LESS than 5%. Nothing that matters in terms of transistors (caches and vector units) increases. Even doubling of registers would add a few hundreds/thousands of transistors on a chip with tens of billions of transistors, less than 0.000001%.

    They can double all scalar units and it still would be below 1% increase.
  • Kangal - Friday, December 4, 2020 - link

    I agree.
    Adding SMT/HT requires something like a +10% increase in the silicon budget and a +5% increase in power draw, but increases performance by +30%, generally speaking. So it's worth the trade-off for daily tasks, and those on a budget.

    What I was curious to see is: if you disabled SMT on the 5950X, which has lots of cores, leaving each thread with slightly more resources, and used the extra thermal headroom to overclock the processor, how would that affect games?

    My hunch?
    Thread-happy games like Ashes of Singularity would perform worse, since it is optimised and can take advantage of the SMT. Unoptimized games like Fallout 76 should see an increase in performance. Whereas actually optimised games like Metro Exodus should be roughly equal between OC and SMT.
  • Dolda2000 - Friday, December 4, 2020 - link

    >What I was curious to see, is if you disabled SMT on the 5950X, which has lots of cores.
    That is exactly what he did in this article, though.
  • Kangal - Saturday, December 5, 2020 - link

    I guess you didn't understand my point.
    Think of a modern game which is well optimised and is both GPU intensive and CPU intensive, such as Far Cry V or Metro Exodus. These games scale well anywhere from 4 physical cores to 8 physical cores.

    So using the 5950X with its 16-physical-cores, you really don't need extra threads. In fact, it's possible to see a performance uplift without SMT, dropping it from 32-shared-threads down to 16-full-threads, as each core gets better utilisation. Now add to that some overclocking (+0.2GHz ?) due to the extra thermal headroom, and you may legitimately get more performance from these titles. Though I suspect they wouldn't see any substantial increases or decreases in frame rates.

    In horribly optimised games, like Fallout 76, Mafia 3, or even AC Odyssey, anything could happen (though probably they would see some increases). Whereas we already know that in games that aren't GPU intensive, but CPU intensive (eg practically all RTS games), these were designed to scale up much much better. So even with the full-cores and overclock, we know these games will actually show a decrease in performance from losing those extra threads/SMT.
