Simultaneous MultiThreading (SMT)

Zen will be AMD’s first foray into a true simultaneous multithreading structure, and certain parts of the core will act differently depending on their implementation. There are many ways to manage threads, particularly to avoid stalls where one thread blocks another and the system ends up hanging or crashing. The drivers that communicate with the OS also have to distinguish between placing a thread on an idle core and placing it on a core that is already occupied. To achieve maximum throughput, four threads should be across two cores, but for efficiency where speed isn’t a factor, power gating or clock gating half the cores in a CCX is perhaps a good idea.

There are a number of ways that AMD will deal with thread management. The basic way is time slicing, which gives each thread an equal share of the pie. This is not always the best policy, especially when you have one performance-dominant thread, one thread that creates a lot of stalls, or a thread where latency is vital. In some methodologies the importance of a thread can be tagged or determined, and this is what we get here, though some of the structures in the core have to revert to a basic model.
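As a rough illustration of the difference between equal time slicing and a share that follows thread importance, here is a minimal Python sketch. It is not AMD's arbitration logic: the function name weighted_slices, the credit scheme, and the weights are all assumptions made up for the example.

```python
def weighted_slices(weights, cycles):
    """Hand out one issue slot per cycle, in proportion to each thread's weight.

    `weights` maps a hypothetical thread name to a hypothetical importance tag.
    Equal weights reduce to plain time slicing (strict alternation).
    """
    total = sum(weights.values())
    credit = {thread: 0 for thread in weights}
    schedule = []
    for _ in range(cycles):
        # Every thread earns credit each cycle according to its weight.
        for thread, weight in weights.items():
            credit[thread] += weight
        # The thread with the most credit wins the slot (ties: first listed).
        pick = max(credit, key=credit.get)
        credit[pick] -= total
        schedule.append(pick)
    return schedule


# Equal share of the pie: the two threads simply alternate.
print(weighted_slices({"T0": 1, "T1": 1}, 4))   # ['T0', 'T1', 'T0', 'T1']

# One dominant thread: T0 receives three slots for every one given to T1.
print(weighted_slices({"T0": 3, "T1": 1}, 8).count("T0"))   # 6
```

With equal weights the sketch behaves like basic time slicing, which is the fallback model the text describes for structures that cannot take priority into account.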

AMD performs internal analysis on each thread's data stream to see which thread has algorithmic priority. This means that certain threads will receive more resources, or that a branch miss is prioritized to avoid long stall delays. The elements in blue (Branch Prediction, INT/FP Rename) operate on this methodology.

A thread can also be tagged with higher priority. This is important for latency-sensitive operations, such as touch-screen input or other immediate user input. The Translation Lookaside Buffers work in this way, to prioritize looking for recent virtual memory address translations. The Load Queue is similarly enabled this way, as low latency workloads typically require data as soon as possible.

Certain parts of the core are statically partitioned, giving each thread an equal share. This is implemented mostly for anything that is typically processed in-order, such as anything coming out of the micro-op queue, the retire queue and the store queue.

The rest of the core is competitive, meaning that if a thread demands more resources it will try to get there first if there is space to do so each cycle.
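To make the contrast between static partitioning and competitive sharing concrete, here is a toy Python model. It is a simplification under stated assumptions, not a description of Zen's hardware: the function names, the 8-entry capacity, and the one-entry-per-turn handout are all invented for illustration.

```python
def static_partition(capacity, demand):
    """Each thread owns a fixed, equal slice of the structure, used or not."""
    share = capacity // len(demand)
    return {thread: min(wanted, share) for thread, wanted in demand.items()}


def competitive(capacity, demand):
    """Free entries go to whichever threads still want one, while space lasts."""
    granted = {thread: 0 for thread in demand}
    free = capacity
    while free > 0:
        wanting = [t for t in demand if granted[t] < demand[t]]
        if not wanting:
            break  # nobody needs more, leftover entries stay empty
        for thread in wanting:
            if free == 0:
                break
            granted[thread] += 1
            free -= 1
    return granted


# Hypothetical 8-entry structure: T0 is hungry, T1 only needs two entries.
demand = {"T0": 7, "T1": 2}
print(static_partition(8, demand))   # {'T0': 4, 'T1': 2}
print(competitive(8, demand))        # {'T0': 6, 'T1': 2}
```

In the static case two entries sit idle because T1 cannot use its full half, while in the competitive case the hungry thread picks them up. The trade-off is that a statically partitioned structure gives each thread a predictable share regardless of what the other thread is doing.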

New Instructions

AMD has a couple of tricks up its sleeve for Zen. Along with including the standard ISA, there are a few new custom instructions that are AMD only.

Some of the new instructions are linked with ones that Intel already uses, such as RDSEED for random number generation, or SHA1/SHA256 for cryptography. The two new additions are CLZERO and PTE Coalescing.

The first, CLZERO, is designed to clear a cache line and is aimed more at the data center and HPC crowds. This allows a thread to clear a poisoned cache line atomically (in one cycle) in preparation for zeroed data structures. It also allows a level of repeatability when the cache line is filled with expected data. CLZERO support will be determined by a CPUID bit.
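For readers who want to check for CLZERO in software, a minimal Python sketch follows. The article only says that support is signalled by a CPUID bit; the specific location used here (bit 0 of EBX from extended leaf 0x80000008) and the Linux flag name clzero are our understanding rather than something stated in AMD's presentation, so treat both as assumptions. The helper names are hypothetical.

```python
def clzero_from_ebx(ebx):
    """Decode CLZERO support from a raw EBX value.

    Assumption: `ebx` comes from CPUID leaf 0x80000008 and CLZERO is bit 0.
    """
    return bool(ebx & 0x1)


def clzero_from_cpuinfo(text):
    """Look for a 'clzero' flag in /proc/cpuinfo-style text (Linux only)."""
    for line in text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return "clzero" in value.split()
    return False


# Made-up inputs for illustration, not captured from real hardware.
print(clzero_from_ebx(0x1))                                  # True
print(clzero_from_cpuinfo("flags\t\t: fpu sse2 clzero"))     # True
print(clzero_from_cpuinfo("flags\t\t: fpu sse2"))            # False
```

On a Linux system the second helper could be fed the contents of /proc/cpuinfo; on other platforms the raw register value would have to come from a CPUID utility.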

PTE (Page Table Entry) Coalescing is the ability to combine the entries for small 4K pages into a single 32K entry, and is a software transparent implementation. This is useful for reducing the number of entries in the TLBs and the queues, but it requires the data used within the branch predictor to meet certain criteria.
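The arithmetic behind this is simple: a 32K entry covers 32K / 4K = 8 base pages, so one TLB entry can stand in for eight. The Python sketch below shows what a coalescing check could look like. AMD has not spelled out the criteria, so the rules used here (eight pages, 32K-aligned, contiguous in both virtual and physical memory, identical flags) are assumptions for illustration, as are the names can_coalesce, PAGE and GROUP.

```python
PAGE = 4 * 1024               # 4K base page size
GROUP = (32 * 1024) // PAGE   # 8 base pages fit in one 32K entry


def can_coalesce(ptes):
    """Check whether a run of 4K mappings could share one 32K entry.

    `ptes` is a list of (virtual_addr, physical_addr, flags) tuples.
    The criteria below are assumed, not taken from AMD documentation.
    """
    if len(ptes) != GROUP:
        return False
    v0, p0, f0 = ptes[0]
    # The run must start on a 32K boundary in both address spaces.
    if v0 % (GROUP * PAGE) or p0 % (GROUP * PAGE):
        return False
    # Every page must follow its predecessor directly and share the flags.
    return all(
        v == v0 + i * PAGE and p == p0 + i * PAGE and f == f0
        for i, (v, p, f) in enumerate(ptes)
    )


# Eight contiguous, aligned pages with matching flags: a candidate.
good = [(0x10000 + i * PAGE, 0x40000 + i * PAGE, "rw") for i in range(GROUP)]
print(can_coalesce(good))   # True

# Break physical contiguity for one page: no longer a candidate.
bad = list(good)
bad[3] = (bad[3][0], 0x90000, "rw")
print(can_coalesce(bad))    # False
```

Because the feature is software transparent, the operating system never has to run a check like this itself; the sketch only illustrates the kind of test the hardware would need to apply.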


106 Comments


  • Krysto - Wednesday, August 24, 2016

    I think PCs in general run better on four cores than on two, even if most apps themselves can't take advantage of them, although I think in the next 5 years most new games will take advantage of 8 threads. But otherwise, it's just good for multitasking.
  • tarqsharq - Wednesday, August 24, 2016

    I had an argument with one fellow on the internet regarding i7 being plenty for whatever I was doing in terms of core count. But streaming a show on one monitor while playing Overwatch was hitting 70%+ CPU usage, with all logical cores being 60-70% utilized consistently, with spikes up to 90%+.

    That was on my i7-4770K to be specific, running 1080P on a 144hz monitor for Overwatch, and Crunchyroll for 1080P anime stream on the second monitor.

    So some games combined with slight multitasking are already taxing the 4C/8T environment.
  • galta - Wednesday, August 24, 2016

    And how much multitasking are we really using? If I had to guess, I would say not much, on average.
    You might have some folks here and there using it, but regular users need something between two and four cores, just as you said.
    You have the OS, the software you're using, be it a game or not, plus everything that's running behind the scenes, including Windows inefficiencies, and that's it. But for some weird guy that spends his day on 7zip, more than 4 cores brings no extra power.
    This is the reason why, no matter how excited we might get with 10 cores (I would love one, even if for bragging rights only), our i5s are enough for what we do.
    Maybe in 5 years from now games will be multithreaded, but I'm not holding my breath: something similar was said 5 years ago, and here we are.
    At the end of the day, we still need improvement in per core performance.
  • looncraz - Wednesday, August 24, 2016

    Browsers are becoming better and better at using more cores... and we're all running tens of processes in the background, some of which fire interrupts on a CPU. More cores allows for more going on at the same time without interruptions. You can actually feel this moving to an eight-core FX-8350 from a quad core i5... those eight cores provide a somewhat smoother multi-tasking environment, despite each core being slower and the overall performance being lower.

    Humans are simply sensitive to changes in timing - more cores and more threads reduces the variability in timing, which improves perceived performance.
  • galta - Thursday, August 25, 2016

    Hum....
    I don't know many people who share your opinion about FX-8350 vs i5.
    Anyway, we have been multitasking for a while, at least to some extent: OS, Word, anti-virus, browser. The question is: for this light multitasking, are we better off with several cores with poor performance/core, or with fewer cores but with great performance/core?
    Reviews and actual people generally prefer the latter.
    As for browsers, great news that they are improving, but download/upload speed is by far the most important factor in user experience.
  • Alexvrb - Sunday, August 28, 2016

    Download speed is fine for web browsing if you've got something faster than DSL. How much data exactly do you think you're consuming while browsing the web? Outside of streaming videos you won't use up a ton of bandwidth.
  • Cooe - Thursday, May 6, 2021

    I know this is ANCIENT, but how the hell did you not realize that multi-core optimization was so bad only because nobody could afford CPUs with more than 4 cores pre-Zen??? Modern games run freaking TERRIBLE now on 4c/4t i5's.
  • Notmyusualid - Wednesday, August 24, 2016

    No, nope, nej, and nein.

    I see (FEEL) tangible improvements in my computing ever since I dropped 2 cores for 4.

    And it looks like others below agree....
  • galta - Thursday, August 25, 2016

    I believe you do, for the sweet spot is now around 4 cores, as I said before.
    The question is: do you believe that your experience will improve significantly if you move to 6 or 8 cores?
    Probably not, unless you spend your day zipping files or rendering images.
  • Alexvrb - Sunday, August 28, 2016

    They said the same thing about quad cores, and dual cores before that. AMD has to get on top of the curve, not behind it. They'll offer quad cores for more mainstream systems, and 8 for performance rigs. More for servers, and potentially less for low-power and/or low-cost.
