SMT Dead?

Simultaneous Multi Threading has been receiving quite a bit of criticism over the past months. Rumours about the demise of Hyper Threading were started, and Fred Weber of AMD even called it "a misuse of resources".

SMT stopped being considered "cool" because of the very mediocre performance increase that the Pentium 4 gained from Hyper Threading. In fact, we are still encountering applications where Hyper Threading decreases performance.

Anand reported in his "AMD's Athlon 64 X2 4800+ & 4200+ Dual Core Performance Preview":
"The other thing we continue to see is that dual core with Hyper Threading in these multitasking environments is very much the double-edged sword. There are some situations where having both Hyper Threading and dual core gives Intel a huge performance boost, but there are others where the exact opposite is true. As it currently stands, we're not sure how much of a future Hyper Threading will have in future Intel architectures - but it's definitely not a sure win."
One of the upcoming AnandTech projects, a Database server comparison on SUSE Linux 9 SP1 (Kernel 2.6.x), is showing similar results - Hyper Threading decreases database read performance by 1% to 6% in many cases.

Why Hyper Threading fails to impress...

The current form of SMT [1] in the Pentium 4 is quite mediocre, but SMT is not going to disappear. The Netburst architecture is simply not well suited for SMT, and Intel implemented Hyper Threading with the goal of minimizing the die area cost. Only a few small structures were replicated - the die area cost was less than 5% of the total die area of the Pentium 4 (Northwood).

The whole idea behind SMT is to execute two (or more) threads at the same time, on the same processor. Normally, a CPU will execute one thread, switch context (save the contents of the registers and CPU state in the cache), and then load the registers for another thread and execute it. With SMT, the main objective is that a second thread uses the execution units that the first thread cannot use at the moment, and vice versa. This implies a wide issue superscalar CPU; in other words, a CPU that is capable of executing many instructions in parallel.
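
To make the "fill the unused execution slots" idea concrete, here is a minimal toy model in Python. It is only a sketch: the issue width of 4 and the per-cycle counts of ready instructions are made-up numbers for illustration, not measurements of any real CPU, and the model ignores caches, queues and dependencies between cycles.

```python
# Toy model: each cycle the core has WIDTH issue slots.
# A thread can only fill as many slots as it has ready instructions.

def issued_single(ready, width):
    """Instructions issued per cycle when one thread runs alone."""
    return [min(r, width) for r in ready]

def issued_smt(ready_a, ready_b, width):
    """Thread A issues first; thread B fills whatever slots are left."""
    result = []
    for a, b in zip(ready_a, ready_b):
        from_a = min(a, width)
        from_b = min(b, width - from_a)
        result.append((from_a, from_b))
    return result

WIDTH = 4                        # hypothetical issue width
thread_a = [1, 0, 3, 0, 2, 1]    # hypothetical ready instructions per cycle
thread_b = [2, 3, 0, 1, 4, 0]

slots = WIDTH * len(thread_a)
alone = sum(issued_single(thread_a, WIDTH))
together = sum(a + b for a, b in issued_smt(thread_a, thread_b, WIDTH))

print(f"thread A alone: {alone}/{slots} slots used")
print(f"A and B with SMT: {together}/{slots} slots used")
```

In this toy run, thread A alone leaves most slots empty, and the second thread picks up many of them. The narrower the core, the fewer leftover slots there are for the second thread to use.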

And the Pentium 4 is hardly a wide issue superscalar CPU. It has only 4 execution ports: one load, one store and two for executing either FP or integer instructions. In the best case, you are using the double-pumped ALUs attached to these two ports, and you can achieve a burst of 6 instructions in one clock cycle: 4 additions on the 2 double-pumped ALUs, a load and a store. But the chances that you find 4 independent additions are relatively small.

The trace cache is only capable of delivering 6 micro-ops every two cycles. Those 6 micro-ops are on average about 4 x86 instructions. So, in reality, the Pentium 4 will rarely be able to sustain more than 2 x86 instructions per clock cycle. That is fine for a single threaded CPU. We measured with Intel's Vtune that, for example, an FP intensive program such as Povray runs at an IPC of 0.8-0.9, while database applications (integer intensive) run at an IPC ranging from 0.3 to 0.5. So, an IPC of 2 is more than enough...for a single threaded CPU, that is!
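
The arithmetic behind that ceiling of 2 can be written out as a short Python sketch. All figures come from the paragraph above; the "4 x86 instructions per 6 micro-ops" ratio is the rough average quoted there, so the results are estimates rather than exact limits.

```python
# Trace cache delivery rate, using the figures quoted in the text.
UOPS_PER_FETCH = 6      # micro-ops delivered by the trace cache...
CYCLES_PER_FETCH = 2    # ...every two clock cycles
X86_PER_FETCH = 4       # rough average: 6 micro-ops ~ 4 x86 instructions

uops_per_cycle = UOPS_PER_FETCH / CYCLES_PER_FETCH   # micro-ops per clock
x86_per_cycle = X86_PER_FETCH / CYCLES_PER_FETCH     # approximate IPC ceiling

# Measured IPC ranges reported in the text (Intel Vtune).
measured_ipc = {
    "Povray (FP intensive)": (0.8, 0.9),
    "Database (integer intensive)": (0.3, 0.5),
}

print(f"micro-ops per cycle: {uops_per_cycle}")
print(f"approximate x86 IPC ceiling: {x86_per_cycle}")
for name, (low, high) in measured_ipc.items():
    share_low = low / x86_per_cycle
    share_high = high / x86_per_cycle
    print(f"{name}: {share_low:.0%} to {share_high:.0%} of the ceiling")
```

A single thread uses less than half of what the front end can deliver, which is why a front end limit of about 2 IPC is not a problem until two threads have to share it.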

When Intel's engineers designed Hyper Threading on the Pentium 4, they made sure that one stalled logical processor would not make the other logical processor stop too. Without such a safeguard, cache misses and the handling of branch mispredictions might cause the first logical CPU to fill up the buffering queues so that the second logical CPU has no room to run.

Therefore, some buffers and queues are effectively cut in half when you run two threads. Below, you can see how some buffers are shared dynamically between two threads and some are simply cut in two.

With HT enabled, each thread can only have 63 µOPs in flight in the reorder buffers instead of 126. That makes it harder to find independent instructions. So, the average IPC of each of the two threads might be lower than when running one thread alone. Each thread can only have 24 loads and 16 stores in flight with HT enabled. With HT disabled, those numbers are doubled. Even worse is that the tiny trace cache and L1 data cache of the P4 are shared between the two threads, even though this happens dynamically (one thread can have more entries than the other). It means that the average hit rate of the L1 cache is lower for each thread. Remember that the trace cache is about as big as an 8-16 KB L1 instruction cache, and the data cache is 16 KB large (8 KB on Northwood and Willamette).
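
The static split described above can be sketched in a few lines of Python. The totals are the "HT disabled" figures from the text (126 µOPs, and double the 24 loads and 16 stores quoted for HT enabled); the `Buffers` class and the `per_thread` helper are hypothetical names used only for this illustration, and the dynamically shared caches are not modelled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffers:
    rob_uops: int   # micro-ops in flight in the reorder buffers
    loads: int      # loads in flight
    stores: int     # stores in flight

def per_thread(total: Buffers, threads: int) -> Buffers:
    """Statically partition every buffer evenly between the threads."""
    return Buffers(
        rob_uops=total.rob_uops // threads,
        loads=total.loads // threads,
        stores=total.stores // threads,
    )

HT_OFF = Buffers(rob_uops=126, loads=48, stores=32)  # one thread owns it all
HT_ON = per_thread(HT_OFF, 2)                        # each logical CPU's share

print("HT disabled:", HT_OFF)
print("HT enabled, per thread:", HT_ON)
```

The point of the hard split is isolation: a stalled thread can never occupy more than its half, but a thread running well can never borrow the other half either.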

28 Comments

  • Houdani - Wednesday, May 18, 2005 - link

    'Splain to me what you believe are the alleged "false assumptions."

    The only outright assumption I observed was located in the comments section. Specifically number two.
  • Ahkorishaan - Wednesday, May 18, 2005 - link

    Intel is by no means panicking, they're riding out a storm, and things will be dicey starting about 2/3 through 2006. AMD has the advantage now, but I honestly don't know if they can hold up against the R&D budget Intel has at its fingertips.

    When P-m features get integrated into Intel's lineups AMD will be faced with the hotter, hungrier chip, and though they have more experience with the on-die Memory controller, and a nice head of steam, that might not be enough.

    I'm a fan of AMD and I applaud their foresight, but they need to keep on the ball if they expect to stay ahead for another year.
  • allanw - Wednesday, May 18, 2005 - link

    All this talk of databases and no mention of PostgreSQL? Cmon..
  • flatblastard - Wednesday, May 18, 2005 - link

    Oh great....more fuel for the "Intel panics" thread fire.
  • Rand - Wednesday, May 18, 2005 - link

    I haven't finished the article yet, but would you care to clarify your objections Questar?

    At least through the third page I haven't come across any assumptions or even real solid opinions he's put forth as yet.
    Thus far it's merely a technically oriented analysis of their respective offerings, nothing that I've read is particularly new or debatable/controversial.

  • Rapsven - Wednesday, May 18, 2005 - link

    Holy ****, Questar. That's all I'm going to say for you.

    Very informative. Though a lot of the more technical parts of the article flew right by me.
  • Questar - Wednesday, May 18, 2005 - link

    Wow, another AMD fanboy opinion piece based upon false assumptions. Go Anandtech!
  • sprockkets - Wednesday, May 18, 2005 - link

    not this time...

    nice pic on the last page, but I have no idea of the scale
