What about Hyper-Threading and IMC?

Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi-Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (see our earlier discussion), but that the engineers were given the task of developing a CPU with a great performance-per-watt ratio that could be used for the server, desktop and mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, those benefits are mostly limited to server applications. SMT also makes the hotspots even hotter, so it didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy.

As for including an Integrated Memory Controller (IMC), we were also told that the transistors which could have been spent on an IMC were better spent on the 4 MB shared cache. This is of course highly debatable, but it is a fact that cache consumes less power than a memory controller. The standard party line from Intel is that keeping the memory controller on the chipset allows them to support new memory types without having to re-spin the CPU core. That is certainly true, and with the desktop/mobile sectors using standard DDR2 while servers are set to move to FB-DIMM designs, the added flexibility has real value. Techniques such as memory disambiguation and improved prefetch logic can also hide much of the latency advantage an IMC would offer. Would an IMC improve Core's performance? Almost certainly, but Intel will for the time being pursue other options.

Conclusion 1: AMD K8 versus Intel P8

The Intel Core architecture is clearly the direct descendant of the hugely successful P6 architecture. However, it also brings state-of-the-art technology on board, such as micro-op/macro-op fusion, memory disambiguation and massive SIMD/FP power.

Compared to the excellent AMD K8/Hammer architecture, the Core CPU is simply a wider, more efficient and more aggressively out-of-order CPU. When I suggested to Jack Doweck that the massive execution resources might not be fully used until SMT is applied, he disagreed completely. Memory disambiguation should push the current limits of ILP in integer workloads a lot higher, and the massive bandwidth that the L1 and L2 caches can deliver should help Core come close to the execution utilization percentages of the current P-M. 33% more execution potential could thus come very close to 33% more performance, clock for clock.
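
To make the disambiguation point concrete, consider a minimal C sketch (the function and variable names are ours, purely illustrative) of the kind of loop where it pays off:

    /* The compiler cannot prove that 'a' and 'b' never overlap, so the
     * load of b[i] formally depends on the store to a[i] that precedes
     * it in program order. A conservative core holds the load back until
     * the store address is resolved; Core's disambiguation predictor
     * guesses that the two streams are independent, issues the load
     * early, and recovers only in the rare case of a mispredict. */
    void update(int *a, int *b, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + 1;   /* store to a[i] */
            b[i] = b[i] * 2;   /* load of b[i] can be hoisted above the
                                  store whenever a and b don't alias */
        }
    }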

So is it game over for AMD? Well, if you read the previous pages, it is pretty clear that there are some obvious improvements that should happen in AMD's next generation. However, there is no reason at all to assume that the current K8 architecture is at the end of its life. One obvious upgrade possibility is to enhance the SSE/SIMD power by widening each unit or by simply implementing more of them in the out-of-order FP pipeline.

To sustain the extra (SIMD) FP power, AMD should definitely improve the bandwidth of the two caches further. The K7 had a pretty slow L2 cache, for example, and the K8 doubled the bandwidth that the L2 could deliver. It's not unreasonable to think a 256-bit wide cache bus could be added to a near-future AMD design.
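
As a rough, back-of-the-envelope illustration (the 2.4 GHz clock is an arbitrary assumption, not a product specification), doubling the cache bus width from 128 to 256 bits doubles the per-cycle bandwidth:

\[ \frac{128\,\text{bits}}{8} \times 2.4\,\text{GHz} = 16\,\tfrac{\text{B}}{\text{cycle}} \times 2.4 \times 10^{9}\,\tfrac{\text{cycles}}{\text{s}} \approx 38.4\,\text{GB/s} \]

\[ \frac{256\,\text{bits}}{8} \times 2.4\,\text{GHz} = 32\,\tfrac{\text{B}}{\text{cycle}} \times 2.4 \times 10^{9}\,\tfrac{\text{cycles}}{\text{s}} \approx 76.8\,\text{GB/s} \]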

Finally, there is also a lot of headroom for increasing integer performance. The fact that loads can hardly be reordered has been a known weak point since the early K7 days. In fact, we know that engineers at AMD were well aware of it then, and it is surprising that AMD didn't really fix this in the K8 architecture. Allowing a much more flexible reordering of loads, even without memory disambiguation, would give a very healthy boost to IPC (5% and more). It is one of the main reasons why the P-M can beat the Athlon 64 clock-for-clock in certain applications.
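
A short sketch of the kind of code that suffers under strict load ordering (the names and the divide are hypothetical; the store-then-load pattern is what matters):

    /* The store's address depends on a long-latency computation (the
     * divide; assume den != 0), while the following load is independent
     * of it. A scheduler that lets loads pass earlier stores can issue
     * the load of q[0] immediately; one that keeps loads behind
     * unresolved stores, as the K8 largely does, stalls the load until
     * the store address is known, even though no conflict exists here. */
    int read_after_store(int *p, int *q, int num, int den, int val)
    {
        p[num / den] = val;   /* store address waits on the divide    */
        return q[0];          /* independent load, needlessly delayed */
    }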

Those are just a few examples that are well known. It is very likely that there are numerous other possible improvements that could take the K8 architecture much further.

Looking at the server version of Core ("Woodcrest"), and considering that it is very hard to find a lot of ILP in server applications, the only weakness of Core is that there is no multi-threading within each core. This small disadvantage is a logical result of the design goal of Core, an architecture which is an all-round compromise for the server, desktop and mobile markets. The lack of Hyper-Threading in Xeon Core products might give Sun and IBM a window of opportunity in heavily threaded server benchmarks, but since Tigerton (65 nm, two Woodcrests in one package, 4 cores) will come quickly, the disadvantage of not being able to extract more TLP might never be seen. Our astute readers will have understood by now that it is pretty hard to find a weakness in the new Core architecture.

Conclusion 2: The free lunch is back!

It is ironic that just a year ago, Intel and others were downplaying the importance of increasing IPC and extracting more ILP. Multi-core was the future, and single-threaded performance was a minor consideration. The result was that the reputed Dr. Dobb's Journal ran the headline "The free lunch is over" [1], claiming that only larger caches would increase IPC a little bit, and that the days when developers could count on the ever-increasing clock speeds and IPC efficiency of newer CPUs to run their code faster were numbered. Some analysts went even further and felt that packages with many relatively simple, small, in-order CPUs were the future.

At AnandTech, we were pretty skeptical about the "threading is our only savior" future, and Tim Sweeney, the lead developer of the Unreal 3 engine, explained to us the challenges of multi-threaded development of the next generation of games. The fat, wide, out-of-order core running at high clock speeds was buried a little too soon. Yes, Intel's Core does not use the aggressive domino and LVS circuit-design techniques that NetBurst designs used to achieve stunning clock speeds. At the same time, it is a fat, massively reordering CPU which gives a free lunch to developers who don't want to spend too much time on debugging heavily threaded applications. Multi-core is here to stay, but getting better performance is once again the shared responsibility of both the developer and the CPU designer. Yes, dual-core is nice, but single-threaded performance is still important!

I would like to express my thanks to the following people who helped to make this article possible:
Jack Doweck, "Foo", "Redpriest", Jarred and Anand

References

[1] Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Dr. Dobb's Journal, March 2005.


[2] David Kanter, "Intel's Next Generation Microarchitecture Unveiled", Real World Technologies.

Comments

  • Betwon - Wednesday, May 3, 2006

    If you really want to know what Intel's load reordering and memory disambiguation are, I can tell you the facts:

    http://www.stanford.edu/~merez/papers/LoadSched_IS...
    "Speculation Techniques for Improving Load Related Instruction Scheduling" (1999)
    Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- from Intel Haifa; they designed the Load/Store unit of Core.

    I have said that AnandTech should study many things about CPUs. Of course, I should study more things about CPUs myself.
  • Betwon - Tuesday, May 2, 2006

    P6: sub [mem],eax decodes to three micro-ops
    Core Duo: sub [mem],eax decodes to two micro-ops
    K8: sub [mem],eax decodes to one macro-op

    P6: sub eax,[mem] decodes to two micro-ops
    Core Duo: sub eax,[mem] decodes to one micro-op
    K8: sub eax,[mem] decodes to one macro-op

    Intel's micro-fusion is different from the K7/K8's macro-ops.

    P4 has 2x2 int ALUs and 2 AGUs.
    K7/K8 has 3 int ALUs and 3 AGUs.
    But Core Duo has only 2 int ALUs and 2 AGUs.

    The integer performance:
    Core Duo > K7/K8

    Why?
    Because Core Duo's dependency chain along the critical path is the shortest.
    Most integer program code can be thought of as a tall, thin tree of dependency chains (the longest dependency chain is called the critical path).
    The critical path determines the performance: the length of the dependency chain of the critical path is the number of cycles needed to complete it.
    Core Duo (2 ALU/2 AGU) spends fewer cycles on it than K7/K8 (3 ALU/3 AGU), because more INT function units cannot accelerate truly dependent operations.

    The P-M/Core Duo's special ability to cope with chains of truly dependent operations is the real reason for its INT performance advantage, and it is what distinguishes it from the old P6 (such as the Pentium III).

    Most FP program code, on the other hand, can be thought of as a thicket of many short dependency chains. There, the best ILP can be extracted: the more FADD/FMUL function units, the more performance.

    Doubling the FP function units, or doubling the speed of half-speed FP units, is a good idea for most FP programs.
    But doubling the INT function units does not always enhance INT performance so greatly.
    That is why Conroe has only three ALUs and two AGUs, not 4 ALUs and 4 AGUs, while K7/K8 has three ALUs and three AGUs.
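
    A minimal C sketch of this critical-path point (purely illustrative):

        /* Two ways to sum a[0..7]: the serial loop is a chain of eight
         * dependent adds, so it takes about eight add latencies no matter
         * whether the core has 2 or 3 ALUs. The tree version has a
         * critical path of only three adds, because the partial sums are
         * independent and extra ALUs really can work on them in parallel. */
        int sum_serial(const int *a)
        {
            int s = 0;
            for (int i = 0; i < 8; i++)
                s += a[i];                 /* each add waits on the previous */
            return s;
        }

        int sum_tree(const int *a)
        {
            int s01 = a[0] + a[1], s23 = a[2] + a[3];   /* 4 independent adds */
            int s45 = a[4] + a[5], s67 = a[6] + a[7];
            int s03 = s01 + s23, s47 = s45 + s67;       /* 2 independent adds */
            return s03 + s47;                           /* final add */
        }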
  • Betwon - Tuesday, May 2, 2006

    Sorry, I am not from an English-speaking nation, so I can only state my ideas with many spelling and syntax errors.
    (By "functions" above I meant "function units".)

    I want to explain why Conroe, with only 3 ALUs/2 AGUs, has the superior INT performance: Core has a superb ability to process chains of truly dependent operations (much, much better than the K8 with 3 ALUs/3 AGUs).
    Even Core Duo, with 2 ALUs/2 AGUs, outperforms the K8 (3 ALUs/3 AGUs) on INT code.

    The key ideas are the length of the dependency chain of the critical path, and the truly dependent operations.
  • Starglider - Monday, May 1, 2006

    A truly excellent article. I just have a couple of questions:

    quote:

    The second and more important advantage is the on die memory controller, which lowers the latency to the memory considerably. However, the lower clockspeeds of the Core CPUs (relative to NetBurst) and the faster FSB also lower latency significantly. With the numbers available to us now, we have reason to believe that the Athlon 64 X2's latency advantage will shrink to only 15 to 20%.


    It's clear that increasing FSB speed can reduce memory latency. However, I'm not clear on why a lower CPU core speed would reduce absolute latency - sure, it will reduce the number of CPU cycles that occur while waiting for memory, but how can it reduce the absolute delay?

    You don't seem to have included inter-instruction latency in your comparison tables. I know this data can be hard to get hold of, but it's critical to the performance of highly serial code (e.g. the pi calculation benchmarks that seem to be so popular at the moment). Is there any chance it could be included?

    Finally, I'm wondering if Intel will revisit some of the P4's clock speed enhancing tricks later on. Things like LVS and double-pumped ALUs would only have slowed down an already complex development process that Intel desperately needed to complete quickly. But if AMD does come out with a new architecture that matches or exceeds Conroe on IPC, Intel might be able to respond quite quickly by bringing back some of their already well understood clock speed tricks to accelerate Conroe.
  • Makaveli - Monday, May 1, 2006

    You need to get away from thinking of increased clock speed for extra performance; the future is multi-core CPUs and parallelism.
  • Starglider - Tuesday, May 2, 2006

    I write heavily multithreaded applications for a living, but sometimes there is just no substitute for fast serial execution; a lot of things just can't be parallelised. Serial execution speed is effectively IPC * clock rate, so yes, increasing clock rate is still very helpful as long as IPC doesn't suffer.
  • saratoga - Monday, May 1, 2006

    ^^ Did you read the post you replied to? His point is valid.

    Lower clock speed is not going to improve memory latency. It may mean that latency is less painful, but if you took two Core chips, one at 2GHz and the other at 3GHz, the absolute latency would be roughly if not exactly the same for each, though each ns of latency is 50% more dear on the 3GHz chip.
  • Spoonbender - Tuesday, May 2, 2006

    It would be more accurate to say that each cycle of latency is more dear on the 2GHz chip, wouldn't it? ;)
    What matters is how *long* the latency is, in ns. A ns of latency is a ns, and it forces the CPU to wait exactly one ns, no matter its clock speed... :)

    So yeah, definitely a valid point, and I wondered about that in the article as well.
  • saratoga - Wednesday, May 3, 2006

    quote:

    It would be more accurate to say that each cycle of latency is more dear on the 2GHz chip, wouldn't it? ;)


    No. It's 50% worse for the 3GHz chip (since the clock speed is 50% higher).

    quote:

    What matters is how *long* the latency is, in ns. A ns of latency is a ns, and it forces the CPU to wait exactly one ns, no matter its clock speed... :)


    And one ns is how many clock cycles on a 2GHz chip? And how many on a 3GHz chip? Think this through . . .
  • Spoonbender - Wednesday, May 3, 2006

    quote:

    No. It's 50% worse for the 3GHz chip (since the clock speed is 50% higher).

    No, it isn't. They both waste *exactly* one ns of execution per, well, ns of latency. How many cycles they can cram into that ns is irrelevant.

    quote:

    And one ns is how many clock cycles on a 2GHz chip? And how many on a 3GHz chip? Think this through . . .

    Yeah, of course, with 1 ns of latency a 3GHz chip will waste more clock cycles than a 2GHz chip. That's obvious. But they will both lose exactly 1 ns worth of execution, and that's what matters, not the number of clock cycles.
    If they both perform equally well despite the clock speed difference, then adding 1 ns of latency to both will have exactly the same impact on both. Yes, the 3GHz chip will lose more clock cycles, but the 2GHz chip will (if we stick with the assumption of similar performance) have a higher IPC, and so waste the same amount of actual work.

    If you like, look at the Athlon 64 and the P4.
    If both CPUs waste one clock cycle, the A64 takes the bigger hit, because of its higher IPC.
    If both CPUs waste one ns, it doesn't change anything. True, the P4 loses the bigger number of cycles, but as I said before, the A64 loses more *per* cycle. The net result is that they both lose, wait for it, *one* nanosecond's worth of execution.
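
    To put that in numbers (a hypothetical 1 ns stall, plus the assumption above that both chips deliver equal performance):

    \[ \text{cycles lost} = t_{\text{stall}} \times f: \quad 1\,\text{ns} \times 2\,\text{GHz} = 2\ \text{cycles}, \qquad 1\,\text{ns} \times 3\,\text{GHz} = 3\ \text{cycles} \]

    \[ \text{work lost} = t_{\text{stall}} \times (\text{IPC} \times f), \quad \text{where equal performance means}\ \text{IPC}_{2\,\text{GHz}} \times 2\,\text{GHz} = \text{IPC}_{3\,\text{GHz}} \times 3\,\text{GHz} \]

    So the instructions' worth of work lost to the stall is identical on both chips.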
