SMT Dead?

Simultaneous Multi-Threading (SMT) has been receiving quite a bit of criticism over the past months. Rumours about the demise of Hyper Threading have been circulating, and Fred Weber of AMD even called it "a misuse of resources".

SMT is no longer considered "cool" because of the very mediocre performance increase that the Pentium 4 gained from Hyper Threading. In fact, we are still encountering applications where Hyper Threading decreases performance.

Anand reported in his "AMD's Athlon 64 X2 4800+ & 4200+ Dual Core Performance Preview":
"The other thing we continue to see is that dual core with Hyper Threading in these multitasking environments is very much the double-edged sword. There are some situations where having both Hyper Threading and dual core gives Intel a huge performance boost, but there are others where the exact opposite is true. As it currently stands, we're not sure how much of a future Hyper Threading will have in future Intel architectures - but it's definitely not a sure win."
One of the upcoming AnandTech projects, a database server comparison on SUSE Linux 9 SP1 (kernel 2.6.x), is showing similar results: Hyper Threading decreases database read performance by 1% to 6% in many cases.

Why Hyper Threading fails to impress...

The current form of SMT [1] in the Pentium 4 is quite mediocre, but SMT is not going to disappear. The NetBurst architecture is simply not well suited to SMT, and Intel implemented Hyper Threading with the goal of minimizing the die area cost. Only a few small structures were replicated; the die area cost was less than 5% of the total die area of the Pentium 4 (Northwood).

The whole idea behind SMT is to execute two (or more) threads at the same time on the same processor. Normally, a CPU executes one thread, switches context (saving the contents of the registers and the CPU state in the cache), and then loads the registers for another thread and executes that one. The main objective of SMT is to let a second thread use the execution units that the first thread cannot fill at any given moment, and vice versa. This implies a wide-issue superscalar CPU; in other words, a CPU that is capable of executing many instructions in parallel.
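
To make this concrete, here is a toy pthreads sketch (our illustration, not from the original article): one integer-heavy thread and one FP-heavy thread. These are exactly the complementary workloads that SMT is meant to overlap on a single core, since each thread leaves idle the execution units that the other needs.

    #include <pthread.h>
    #include <stdio.h>

    /* Integer-heavy worker: keeps the ALUs busy, barely touches the FPU. */
    static void *int_work(void *arg)
    {
        volatile unsigned x = 1;
        for (long i = 0; i < 100000000L; i++)
            x = x * 33 + 7;           /* integer multiply-add chain */
        return NULL;
    }

    /* FP-heavy worker: keeps the FPU busy, barely touches the ALUs. */
    static void *fp_work(void *arg)
    {
        volatile double y = 1.0;
        for (long i = 0; i < 100000000L; i++)
            y = y * 1.000001 + 0.5;   /* FP multiply-add chain */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* On an SMT CPU, these two threads can run on one physical core,
           interleaving their instructions in the execution units. */
        pthread_create(&t1, NULL, int_work, NULL);
        pthread_create(&t2, NULL, fp_work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("done");
        return 0;
    }

Timing the two threads running together versus back-to-back gives a rough feel for how much overlap an SMT implementation actually delivers.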

And the Pentium 4 is hardly a wide-issue superscalar CPU. It has only 4 execution ports: one load, one store, and two that execute either FP or integer instructions. In the best case, you are using the double-pumped ALUs attached to those two ports, and you can achieve a burst of 6 instructions in one clock cycle: 4 additions on the 2 double-pumped ALUs, plus a load and a store. But the chances of finding 4 independent additions are relatively small.
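
A toy C fragment (ours, purely illustrative) shows why: the first function is a chain of dependent additions that must execute one after the other, while the second performs four additions with no dependences between them - the only kind of pattern that could fill both double-pumped ALUs in a single cycle.

    /* Dependent chain: every add consumes the previous result, so the
       double-pumped ALUs can never work on these in parallel. */
    int dependent_chain(int a, int b, int c, int d, int e)
    {
        a = a + b;
        a = a + c;
        a = a + d;
        a = a + e;    /* 4 adds, strictly serialized */
        return a;
    }

    /* Independent adds: nothing links the four operations, so in theory
       all of them could issue in the same cycle on the two ALUs. */
    void independent_adds(int *a, int *b, int *c, int *d)
    {
        *a += 1;
        *b += 1;
        *c += 1;
        *d += 1;      /* 4 adds with no data dependences */
    }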

The trace cache is only capable of delivering 6 micro-ops every two cycles, and those 6 micro-ops correspond on average to about 4 x86 instructions. So, in reality, the Pentium 4 will rarely be able to sustain more than 2 x86 instructions per clock cycle (4 instructions every 2 cycles). That is fine for a single-threaded CPU. We measured with Intel's VTune that, for example, an FP-intensive program such as POV-Ray runs at an IPC of 0.8-0.9, while database applications (integer-intensive) run at an IPC ranging from 0.3 to 0.5. So, a sustainable IPC of 2 is more than enough... for a single-threaded CPU, that is!

When Intel's engineers designed Hyper Threading for the Pentium 4, they made sure that one stalled logical processor could not bring the other logical processor to a halt as well. Otherwise, cache misses or the handling of branch mispredictions might cause the first logical CPU to fill up the buffering queues, leaving the second logical CPU no room to run.

Therefore, some buffers and queues are effectively cut in half when you run two threads: some structures are shared dynamically between the two threads, while others are simply split into two fixed halves, as the numbers below show.

With HT enabled, each thread can have only 63 µops in flight in the reorder buffer instead of 126, which makes it harder to find independent instructions. So, the combined IPC of two threads might be lower than the IPC of a single thread running alone. Likewise, only 24 loads and 16 stores can be in flight per thread with HT enabled; with HT disabled, those numbers double. Even worse, the tiny trace cache and L1 data cache of the Pentium 4 are shared between the two logical CPUs, although this sharing happens dynamically (one thread can occupy more entries than the other). As a result, the average hit rate of the L1 caches is lower. Remember that the trace cache holds about as much as an 8-16 KB L1 instruction cache, and the L1 data cache is only 16 KB (8 KB on Northwood and Willamette).
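
Incidentally, whether a CPU exposes Hyper Threading is something software can check for itself. A small sketch (our own, using the modern GCC cpuid.h helper) that reads the relevant CPUID bits:

    #include <stdio.h>
    #include <cpuid.h>   /* GCC helper for the CPUID instruction */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: feature flags in EDX, logical CPU count in EBX[23:16]. */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 1 not supported\n");
            return 1;
        }

        int htt          = (edx >> 28) & 1;     /* HTT flag: EDX bit 28 */
        int logical_cpus = (ebx >> 16) & 0xff;  /* logical CPUs per package */

        printf("HTT capable: %s\n", htt ? "yes" : "no");
        printf("Logical processors per physical package: %d\n", logical_cpus);
        return 0;
    }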

Comments

  • nserra - Thursday, May 19, 2005 - link

    The previous post was for the biased person who wrote this article, Johan De Gelas.

    ^
    |Just kidding ;)
  • nserra - Thursday, May 19, 2005 - link

    "AMDs current dual core architecture is vastly superior to Intels"

    This is wrong!!! You said yourself that Intel's "new" processor was more of a "special" packaging job than a true dual core processor, so what you should say is:
    "AMD's current dual core architecture is amazing; let's wait and see what Intel will do at a later time"

    TDP relates to power consumption the way a 500W rating relates to a PSU's actual draw: just because you have a 500W PSU doesn't mean it draws 500W of power.

    The new Venice core has more transistors than the previous core, and not just because of SSE3; there are new power states that can be enabled to further lower power consumption. I doubt that putting a Turion on a regular board will enable those new power states.
  • Viditor - Thursday, May 19, 2005 - link

    G'day Jarred!

    "the Pentium M 2.0 GHz chips manage to run at 22W"

    To be specific, they have a TDP of 22 watts, which isn't really the same thing...

    "as I understand it even under maximum load the Pentium M stays under 22W, right?"

    Not at all...in fact it can be significantly higher than that. Intel's TDP measures an average usage under load rather than peak, while AMD's measures absolute theoretical peak under the worst conditions. This is why the TDP is quite meaningless...

    I guess my point is that I am of the opinion that the Turion might actually run at significantly lower power usages. As absolutely nobody (that I am aware of) has tested beyond the system level (i.e. the chip itself), I can't be sure...but judging by the actual specs of the chips themselves (not the TDP, but the electrical specifications) it appears that the PM may indeed be higher.

    I know I've asked before, but with power usage and heat becoming more and more important, couldn't you guys develop a test of the actual real-world usage of the chips themselves?
    I think it might be quite illuminating...

    Cheers!
  • 4lpha0ne - Thursday, May 19, 2005 - link

    @Questar:
    Criticizing Intel and saying good things about AMD and IBM means that Johan is an AMD fanboy? I think not. You'll see that opinions about Whitefield, Merom, Yonah & Co., once they hit the public, will be better than opinions about Smithfield are now. That's simply the result of the amount of effort put into the designs. A dedicated dual core design is not the same as an on-die dual Xeon system.

    @photoguy99:
    I'd say Johan can make this conclusion because he has the knowledge to do so. I'd come to the same conclusion, since the Windows scheduler (at least in XP) is not really core-aware. It just sees the logical or physical CPUs, and if one becomes free, it sends the next thread to it. This causes thread-hopping (as can be seen in the Tech Report dual core reviews, thanks to the task manager screenshots). In such cases, it matters somewhat whether the last-used data is in the other L2 cache and can be quickly transferred to the current L2 cache. And it matters for multithreaded applications that work on the same set of data.
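
    For what it's worth, a minimal Win32 sketch (hypothetical, untested) of pinning a thread to one logical CPU so the scheduler cannot hop it around:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            /* Pin the current thread to logical CPU 0; the XP scheduler
               can then no longer bounce it between (logical) CPUs. */
            DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), 1);
            if (prev == 0) {
                fprintf(stderr, "SetThreadAffinityMask failed: %lu\n",
                        (unsigned long)GetLastError());
                return 1;
            }
            /* ... run the cache-sensitive workload here ... */
            return 0;
        }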

    @mazuz:
    I'd suggest looking at benchmarks of a 275 vs. a dual 248 with one dual channel memory bank, and benchmarks of a dual Xeon with FSB800 against a similarly configured (cache, FSB, memory, HT) Smithfield. That's the difference caused by the SRQ connection.

    @Ahkorishaan:
    The mentioned upcoming Intel cores will indeed be nice. But some people here and on many other forums sound like the dual core K8 was AMD's last CPU and the K8 their last core ever. :) However, have a look at AMD's patent portfolio and you'll see that this is not the case. As Fred Weber said, AMD is also still looking at power consumption. That may be the reason why we might see a future CPU with more cores, but less FPU power per core (due to shared FPUs).

    AMD is also working on using things like clock gating and throttling (used by the P-M) to further reduce power consumption. Currently, they have only implemented some standard features to keep power consumption down, such as different transistor designs (especially slower transistors in less critical places), microarchitectural changes (a better HALT mode), the C3 state, and PowerNow!/C'n'Q.

    Matthias
  • JarredWalton - Wednesday, May 18, 2005 - link

    Viditor, I think the point is that the Pentium M 2.0 GHz chips manage to run at 22W - still less than 1/3 of what the Winchester and Venice cores put out, I think. What exactly did they do to get that low? Well, there's gating technology for sure - i.e. power down unused portions of the chip - but as I understand it even under maximum load the Pentium M stays under 22W, right?

    Maybe Johan has more specifics, but I don't. I just know that the design's power usage is very impressive, and I was surprised that some of the same tech wasn't used in Prescott.
  • Viditor - Wednesday, May 18, 2005 - link

    Your usual excellent work Johan, thanks.
    A couple of nits to pick...

    "Intel will use its P-m “know-how” to keep the power dissipation so low"

    If you could qualify exactly what "know-how" you mean, that would be appreciated. IMHO, a major reason that the PM is able to stay so much cooler than the Netburst chips (and on par with the Athlons) is that it doesn't have nearly as many features... Is there a reason you see the PM translating well into full-blown server and desktop chips?

    "Intel can leverage their experience with the power saving features of the P-m to design quad core CPUs with remarkably low TDP"

    Arrrrrggghhh! This is a pet peeve of mine. TDP IS NOT POWER USAGE!!! Sorry, I know you know this, but most don't and it's been quite frustrating.
    For those who don't know, TDP is an arbitrary design spec for OEMs to use with the CPU...
    AMD's TDP is so much higher than Intel's relative to actual power usage because AMD is much more cautious in its design spec, not because its chips use that much power.


    As to Questar's comments, IMHO the fact that the worst thing he can say is a short unsubstantiated rant speaks volumes to the credibility of the article.
    Thanks again Johan!
  • phaxmohdem - Wednesday, May 18, 2005 - link

    You are all fools. IDT's WinChip X2 dual core solution will blow all of this crapola out of the water.
  • mazuz - Wednesday, May 18, 2005 - link

    "AMDs current dual core architecture is vastly superior to Intels"

    This seems like a pretty strong statement, considering there doesn't seem to be any known real-world advantage to this architecture.
  • photoguy99 - Wednesday, May 18, 2005 - link

    Johan, isn't this statement a little unfounded:

    "we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers"

    How can you be pretty sure when you've cited none? I know you said you'll do more testing - but *after* that testing is done seems like the time to be "pretty sure" that it's a real-world benefit.

    You've written a good article, and it was informative. I just prefer conservative research conclusions.

  • bob661 - Wednesday, May 18, 2005 - link

    #7
    Who cares which company is ahead or behind? I sure as hell don't. Give me good bang for the buck. That's all I want.
