What about Hyper-Threading and IMC?

Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi-Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (see our elaborate discussion here), but that the engineers were given the task of developing a CPU with a great performance/watt ratio that could be used for the server, desktop and mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, these performance benefits will only be seen in server applications. SMT also makes the hotspots even hotter, so SMT didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy.

As far as including an Integrated Memory Controller (IMC), we were also told that the transistors which could have been spent on the IMC were better spent on the 4 MB shared cache. This is of course highly debatable, but it is a fact that cache consumes less power. The standard party line from Intel is that keeping the memory controller on the chipset allows them to support additional memory types without having to re-spin the CPU core. That is certainly true, and with the desktop/mobile sectors using standard DDR2 while servers are set to move to FB-DIMM designs, the added flexibility isn't terrible. Techniques such as memory disambiguation and improved prefetch logic can also help to eliminate any advantage an IMC might offer. Would an IMC improve Core's performance? Almost certainly, but Intel will for the time being pursue other options.

Conclusion 1 : AMD K8 versus Intel P8

The Intel Core architecture is clearly the heir and descendant of the hugely successful P6 architecture. However, it has state of the art technology on board such as micro-op/macro-op fusion, memory disambiguation and massive SIMD/FP power.

Compared to the excellent AMD K8/Hammer architecture, the Core CPU is simply a wider, more efficient and more out of order CPU. When I suggested to Jack Doweck that the massive execution resources may not be fully used until SMT is applied, he disagreed completely. Memory disambiguation should push the current limits of ILP in integer loads a lot higher, and the massive bandwidth that the L1 and L2 can deliver should help Core to come close to the execution utilization percentages of the current P-M. 33% more execution potential could thus come very close to 33% more performance, clock-for-clock.
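
For readers who want a feel for what memory disambiguation buys, here is a minimal sketch in Python of a saturating-counter predictor that decides whether a load may be hoisted above older stores whose addresses are not yet known. The class name, table size, counter width and the `simulate` helper are assumptions chosen for illustration; this is a toy model of the general technique, not Intel's actual implementation.

```python
class LoadDependencyPredictor:
    """Toy predictor: may this load bypass older stores with unknown addresses?"""

    def __init__(self, entries=256, max_count=15):
        self.entries = entries          # table size (assumed value)
        self.max_count = max_count      # saturation point (assumed value)
        self.counters = [0] * entries   # start conservative: no bypassing

    def _index(self, load_ip):
        # Hypothetical hash: low bits of the load's instruction pointer.
        return load_ip % self.entries

    def may_bypass(self, load_ip):
        # Only a saturated counter lets the load run ahead of older stores.
        return self.counters[self._index(load_ip)] == self.max_count

    def update(self, load_ip, aliased):
        i = self._index(load_ip)
        if aliased:
            self.counters[i] = 0        # conflict with a store: lose all trust
        else:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)


def simulate(trace, predictor):
    """trace: iterable of (load_ip, aliased) pairs; returns outcome counts."""
    stats = {"hoisted": 0, "waited": 0, "mispredicted": 0}
    for load_ip, aliased in trace:
        if predictor.may_bypass(load_ip):
            # The load ran early; it was wrong if it really aliased a store.
            stats["mispredicted" if aliased else "hoisted"] += 1
        else:
            stats["waited"] += 1        # load held back until stores resolve
        predictor.update(load_ip, aliased)
    return stats
```

A load that never conflicts with a store is held back only until its counter saturates and is hoisted from then on, which is where the extra integer ILP would come from. A single conflict costs one misprediction (a re-execution in real hardware) and resets the counter, so the load becomes conservative again.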

So is it game over for AMD? Well, if you read the previous pages, it is pretty clear that there are some obvious improvements that should happen in AMD's next generation. However, there is no reason at all to assume that the current K8 architecture is at the end of its life. One obvious upgrade possibility is to enhance the SSE/SIMD power by increasing the width of each unit or by simply implementing more of them in the out of order FP pipeline.

To sustain the extra (SIMD) FP power, AMD should definitely improve the bandwidth of the two caches further. The K7 had a pretty slow L2 cache, for example, and the K8 doubled the amount of bandwidth that the L2 could deliver. It's not unreasonable to think a 256-bit wide cache bus could be added to a near-future AMD design.

Finally, there is also a lot of headroom for increasing integer performance. The fact that Loads can hardly be reordered has been a known weak point since the early K7 days. In fact, we know that engineers at AMD were well aware of it then, and it is surprising that AMD didn't really fix this in the K8 architecture. Allowing a much more flexible reordering of Loads - even without memory disambiguation - would give a very healthy boost to IPC (5% and more). It is one of the main reasons why the P-M can beat the Athlon 64 clock-for-clock in certain applications.

Those are just a few examples that are well known. It is very likely that there are numerous other possible improvements that could take the K8 architecture much further.

Looking at the server version of Core ("Woodcrest") and considering that it is very hard to find a lot of ILP in server applications, the only weakness of Core is that there is no multi-threading within each core. This small disadvantage is a logical result of the design goal of Core, an architecture which is an all-around compromise for the server, desktop and mobile markets. The lack of Hyper-Threading in Xeon Core products might give Sun and IBM a window of opportunity in heavily threaded server application benchmarks, but since Tigerton (65 nm, two Woodcrests in one package, 4 cores) will come quickly, the disadvantage of not being able to extract more TLP might never be seen. Our astute readers will have understood by now that it is pretty hard to find a weakness in the new Core architecture.

Conclusion 2 : The free lunch is back!

It is ironic that just a year ago, Intel and others were downplaying the importance of increasing IPC and extracting more ILP. Multi-core was the future; single-threaded performance was a minor consideration. The result was that the reputed Dr. Dobb's Journal ran the headline "The Free Lunch Is Over" [1], claiming that only larger caches would increase IPC a little bit and that the days when developers could count on the ever-increasing clockspeeds and IPC efficiency of newer CPUs to run code faster were numbered. Some analysts went even further and felt that CPU packages with many relatively simple, small in-order CPUs were the future.

At AnandTech, we were pretty skeptical about the "threading is our only savior" future, as Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games. The fat, wide OoO core running at high clockspeeds was buried a little too soon. Yes, Intel's Core does not use the aggressive domino and LVS circuit-design strategy that NetBurst designs used to achieve stunning clockspeeds. At the same time, it is a fat, massive reordering CPU which gives a free lunch to developers who don't want to spend too much time on debugging heavily threaded applications. Multi-core is here to stay, but getting better performance is once again the shared responsibility of both the developer and the CPU designer. Yes, dual-core is nice, but single-threaded performance is still important!

I would like to express my thanks to the following people who helped to make this article possible:
Jack Doweck, "Foo", "Redpriest", Jarred and Anand

References

[1] The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, by Herb Sutter, Dr. Dobb's Journal


Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies

87 Comments

  • Betwon - Wednesday, May 3, 2006 - link

    Without branch prediction, K8 will become very very poor. Too terrible!

    The prediction is much better than the forever penalty.

    The penalty of misprediction is just the penalty of doing nothing (not predicting).

    The penalty is fairly high. If you are against prediction, you will find that the penalty hits the K8 every 3 instructions on average. A K8@1.8G (without branch predictor) would fail to beat the old Pentium3@1G (with branch predictor).

    This is the drawback of a lack of prediction, whether for branches or memory accesses: it cannot speed anything up, but often slows things down.
    Without branch prediction, K8 will be down!
  • Betwon - Wednesday, May 3, 2006 - link

    It is very interesting that the P4's load/store/memory reordering method is very different from Core's.

    The P4 always assumes that every load-op will hit and find its data in the store buffer or the L1 data cache.
    Before a load-op is executed, it has to obtain its load address and all prior store addresses and compare them. If the load address is equal to a prior store address, the load-op assumes that the store data is in the store buffer and that the data is ready and valid, then starts to execute speculatively.
    If no equal address is found, the load-op assumes that the load data is in the L1 data cache and that the data is ready and valid, then starts to execute speculatively.

    If the speculation fails or a miss happens, the speculative load-op and the related speculative micro-ops have to be re-executed -- this is called 'replay'.

    The load-op can be executed speculatively after it knows its load address and has compared it with all prior store addresses.
    The load-op cannot be executed speculatively before it knows its load address and has compared it with all prior store addresses.

    The load-op speculates on whether the load data is ready and valid, but does not speculate on whether there is a true dependency on a prior store.

    But Core can speculate on whether there is a true dependency on a prior store. Core has a smart predictor which can predict the store-to-load dependency precisely, before the load-op address is compared with the prior store addresses.
  • Betwon - Wednesday, May 3, 2006 - link

    If you really want to know what Intel's load reordering and memory disambiguation are, I can tell you the facts:

    http://www.stanford.edu/~merez/papers/LoadSched_IS...
    Speculation Techniques for Improving Load Related Instruction Scheduling 1999
    Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- From Intel's Haifa, they designed the Load/Store Unit of Core.

    I had said that anandtech should study many things about CPU. Of course, I should study more things about CPU.
  • Betwon - Wednesday, May 3, 2006 - link

    sub ebp,ebp
    mov ecx, 1000000000

    B1:
    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub ecx,1
    sub edi,1
    cmp ebp,ebp
    je B1

    B2:

    If the asm code takes 6000000000 cycles --> up to five x86 instructions at a time.
    It is so easy to verify.

    we can not call K5 -- 4 decoders, because it is too immature.
  • emboss - Monday, May 1, 2006 - link

    I'm not even sure the Core architecture has 4 decoders. There's lots of references in the Intel Optimisation manual to say that there's still only three (two simple + one complex):

    "On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result the front end can process up to three packed SSE instructions every cycle." (page 1-32)

    "Improvement in decoder and micro-op fusion allows the front end to see most instructions as single µop instructions. This increases the throughput of the three decoders in the front end." (page 1-31)

    While it certainly wouldn't be the first time Intel manuals have been wrong, they're usually reasonably accurate.

    Also, the optimisation manual implies that the front end/decoder does the fusion (for example, see the second quote above).
  • JarredWalton - Monday, May 1, 2006 - link

    Not sure if you're referring to Core Solo/Duo manuals or to Core "Conroe/Merom" manuals. The article is covering the *next* Core architecture, so I wouldn't be at all surprised if Core Duo only has 3 decoders while Conroe bumps that to 4.
  • emboss - Monday, May 1, 2006 - link

    Oops, yes, my mistake. I was referring to Solo/Duo. Damn those marketers :)

    This still leaves me puzzled over the unexpected SSE performance on Solo/Duo. Thinking about it a bit more, the performance would have been 4x "expected" (single uop SSE with two FADD units vs double uop SSE with only one FADD unit), whereas I was only getting a bit less than double. Gnah, back to empirical optimisation.
  • Furen - Monday, May 1, 2006 - link

    Yes, Yonah only has 3 decoders (and the same port arrangement as Dothan, too).
  • Loki726 - Monday, May 1, 2006 - link

    Great job Johan!

    It's articles like this that keep anandtech head and shoulders above everyone else. Instead of just running the latest and greatest core you get through the same old benchmarks and throwing some pretty comparison graphs at the reader, you actually take the time to figure out what parts of the architecture contribute to the performance you see in benchmarks. Keep it up!

    On a small side note, on your first figure of intel's core architecture on page 4, I think the cache size should be 4096kb. 4gb seems rather large...
  • Goi - Monday, May 1, 2006 - link

    Nice read. Did you get all your information solely from Jack Doweck, or are there papers outlining the Core architecture? I've read those for the Pentium-M and Netburst architectures (as well as several other architectures) but I haven't seen one for Core yet.
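
The arithmetic behind the throughput test in Betwon's assembly listing above can be sketched in a few lines of Python. The loop body holds six blocks of five instructions each and the counter in ecx starts at 1000000000, so the body runs one billion times; if that took 6000000000 cycles, the CPU would be sustaining five x86 instructions per cycle. The function name and parameters are hypothetical, chosen only for this illustration, and the cycle count is the commenter's threshold, not a measured result.

```python
def instructions_per_cycle(blocks, instrs_per_block, iterations, cycles):
    """Average x86 instructions completed per clock cycle for an unrolled loop."""
    total_instructions = blocks * instrs_per_block * iterations
    return total_instructions / cycles


# Figures taken from the listing: 6 blocks x 5 instructions, 1e9 iterations,
# and the 6e9-cycle threshold named in the comment.
ipc = instructions_per_cycle(
    blocks=6,
    instrs_per_block=5,
    iterations=1_000_000_000,
    cycles=6_000_000_000,
)
print(ipc)
```

Any measured cycle count above 6000000000 would put the sustained rate below five instructions per cycle.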
