What about Hyper-Threading and IMC?

Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (See our elaborate discussion here), but that the engineers were given the task to develop a CPU with a great performance ratio that could be used for the Server, Desktop and Mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, these performance benefits will only be seen in server applications. SMT also makes the hotspots even hotter, so SMT didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy.

As far as including an Integrated Memory Controller (IMC), we were also told that the transistors which could have been spent in the IMC were better spent in the 4 MB shared cache. This is of course highly debatable, but it is a fact cache consumes less power. The standard party line from Intel is that keeping the memory controller on the chipset allows them to support additional memory types without having to re-spin the CPU core. That is certainly true, and with the desktop/mobile sectors using standard DDR2 while servers are set to move to FB-DIMM designs, the added flexibility isn't terrible. Techniques such as memory disambiguation and improved prefetch logic can also help to eliminate any advantage an IMC might offer. Would an IMC improve Core's performance? Almost certainly, but Intel will for the time being pursue other options.

Conclusion 1 : AMD K8 versus Intel P8

The Intel Core architecture is clearly the heir and descendant of the hugely successful P6 architecture. However, it has state of the art technology on board such as micro-op/macro-op fusion, memory disambiguation and massive SIMD/FP power.

Compared to the excellent AMD K8/Hammer architecture, the Core CPU is simply a wider, more efficient and more out of order CPU. When I suggested to Jack Doweck that the massive execution resources may not be fully used until SMT is applied, he disagreed completely. Memory disambiguation should push the current limits of ILP in integer loads a lot higher, and the massive bandwidth that the L1 and L2 can deliver should help Core to come close to the execution utilization percentages of the current P-M. 33% more execution potential could thus come very close to 33% more performance, clock-for-clock.

So is it game over for AMD? Well, if you read the previous pages, it is pretty clear that there are some obvious improvements that should happen in AMD's next generation. However, there is no reason at all to assume that the current K8 architecture is at the end of its life. One obvious upgrade possibility is to enhance the SSE/SIMD power by increasing the wideness of each unit or by simply implementing more of them in the out of order FP pipeline.

To sustain the extra (SIMD) FP power, AMD should definitely improve the bandwidth of the two caches further. The K7 had a pretty slow L2-cache, and the K8 doubled the amount of bandwidth that the L2 could deliver for example. It's not unreasonable to think a 256-bit wide cache bus could be added to a near-future AMD design.

Finally, there is also a lot of headroom for increasing integer performance. The fact that Loads can hardly be reordered has been a known weak point since the early K7 days. In fact, we know that engineers at AMD were well aware of it then, and it is surprising that AMD didn't really fix this in the K8 architecture. Allowing a much more flexible reordering of Loads - even without memory disambiguation - would give a very healthy boost to IPC (5% and more). It is one of the main reasons why the P-M can beat the Athlon 64 clock-for-clock in certain applications.

Those are just a few examples that are well known. It is very likely that there are numerous other possible improvements that could take the K8 architecture much further.

Looking at the server version of Core ("Woodcrest") and considering that it is very hard to find a lot of ILP in server applications, the only weakness of Core is that there is no multi-threading in each Core. This small disadvantage is a logical result of the design goal of Core, an architecture which is an all-around compromise for the server, desktop and mobile markets. The lack of Hyper-Threading in Xeon Core products might give Sun and IBM a window of opportunity in the heavy thread server application benchmarks, but since Tigerton (65 nm, two Woodcrests in one package, 4 cores) will come quickly, the disadvantage of not being able to extract more TLP might never be seen. Our astute readers will have understood by now that it is pretty hard to find a weakness in the new Core architecture.

Conclusion 2 : The free lunch is back!

It is ironic that just a year ago, Intel and others were downplaying the importance of increasing IPC and extracting more ILP. Multi-core was the future, single thread performance was a minor consideration. The result was that the reputed Dr. Dobbs journal headlined : "the free lunch is over" [1] claiming that only larger caches would increase IPC a little bit and that the days that developers could count on the ever increasing clockspeeds and IPC efficiency of newer CPU to run code faster were numbered. Some analysts went even further and felt that CPU packages with many relatively simple, small in-order CPUs were the future.

At AnandTech, we were pretty skeptical about the "threading is our only savior" future, as Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games. The fat, wide OoO core running at high clockspeeds was buried a little too soon. Yes, Intel's Core does not use the aggressive domino and LVS circuit-design strategy that NetBurst designs used to achieve stunning clockspeeds. At the same time, it is a fat, massive reordering CPU which gives free lunch to developers who don't want to spend too much time on debugging heavily threaded applications. Multi-core is here to stay, but getting better performance is once again the shared responsibility of both the developer and the CPU designer. Yes, dual-core is nice, but single threaded performance is still important!

I would like to express my thanks to the following people who helped to make this article possible:
Jack Doweck, "Foo", "Redpriest", Jarred and Anand

References

[1] The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software By Herb Sutter


Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies
Faster Load Times
POST A COMMENT

85 Comments

View All Comments

  • JarredWalton - Monday, May 01, 2006 - link

    1.26 was Tualatin as well, but that's beside the point. Basically, clock speed at launch vs. final clock speed of the architecture was a disappointment for Intel. They were hoping for 6+ GHz at launch, and even thinking 10 GHz might be possible. Reply
  • KayKay - Monday, May 01, 2006 - link

    Someone should hire you to write textbooks because this was explained extremely well and in simple terms. Good Job Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    Thanks! :-)

    Very happy to read that.
    Reply
  • BitByBit - Monday, May 01, 2006 - link

    Fantastic article.
    In retrospect, it is easy to conclude that this is the route Intel should have chosen for P6's retirement.

    Core looks to be a very strong, all-round performer, unlike Netburst.
    We can only hope that AMD has an answer in the works, as K8 will have a hard time competing with this monster.
    It is unreasonable in my mind to expect a 4-5(?) year-old architecture to be able to compete with Intel's latest. AMD with K8 has had a long reign as the performance king, but is now facing something entirely different. Perhaps K8L will be able to offer serious competition.
    It will, however, take more than a doubling of the FP units (if rumour is correct) to achieve this. The cumulative effect of Conroe's architectural features (memory disambiguation, macro-ops fusion etc...) mean that Core's efficiency has far exceeded K8's, not to mention the impact of its vastly superior cache system - its 8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1.

    It may not be until K10 is released that AMD takes back the performance crown.



    Reply
  • Larso - Monday, May 01, 2006 - link

    As the K8 is about 5 years old, and the current incarnations doesn't really include that many modifications, I wonder what AMD's engineers have been doing all these years. The K8 is not that different from the K7 even.

    Whats coming up? The AM2 version is basically the same beast with a new memory controller. The K8L, well since they didn't name it K9, I suppose its just small upgrades to the same design.

    I really like to think AMD has something coming we don't know about. Or rather, they ought to have something coming... Any rumors?
    Reply
  • Reynod - Monday, May 01, 2006 - link

    I can't help but think (and pray) that Larso's comment has some validity here.
    Why would AMD sit back and do nothing for so long? Would they have not been tinkering with various prototypes over the last couple of years? Are we in for a surprise?? Anand, you and the review team touched on several improvements they could make, care to outline these in some detail in a future article? Someone needs to give AMD some free advice ... heh heh
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    Keep in mind that AMD doesn't have Intel's resources. Until recently, they still lost money every quarter. So they might not have been able to work on a successor to K8 until recently. (I remeber reading an interview with some AMD boss, saying that the K8 was literally a last-ditch effort to survive. If that failed, there wouldn't be an AMD, so they threw everything they had at it)

    So "Why would AMD sit back and do nothing for so long?" Because they had a good project, and didn't have the resources to make a new one?
    Of course, it probably isn't that bad, just tossing out an alternative scenario. ;)

    However, they have hinted that they were working on specific architectures for the notebook and server markets. (Unlike Intel who are moving back to a single unified architecture).

    And despite its age, the K8 is still a pretty nice architecture, and it wouldn't be a huge undertaking to improve on it to get something quite a bit more efficient. Intel had to develop a new architecture because NetBurst just wouldn't cut it. AMD can probably afford to expand on K8 a bit longer, and even with K9/K10, I wouldn't expect a vastly different architecture.
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    "Because they had a good project" <- Was supposed to be product, not project... :) Reply
  • psychobriggsy - Monday, May 01, 2006 - link

    AMD said recently that they have three times the engineers on their books as they did when they designed K8.

    However I suspect they're working on K10/KX, although maybe some of them worked on K8L.

    Clearly it seems that some in-core work could translate into reasonable performance gains for the current K8 design. A 4-way L1 cache instead of 2-way for example, and a greater L2 to L1 bandwidth. Certainly a mechanism to reorder instructions so that loads can be performed earlier seems to be necessary. 2MB L2 per core could also help, and the 65nm die pictures that AMD showed recently did seem to show far denser cache. K8L is rumoured to include more FP resources, but I don't know about any of the other stuff - but AMD will be talking more about K8L (and beyond?) in June apparently.
    Reply
  • spinportal - Monday, May 01, 2006 - link

    Definitely a great treasure of an article to find on a monday morning detailing the Core architecture that the world is drooling for in June. I wonder what kind of simulation micro-arch software Intel and AMD use, as I remember the days doing my Masters, playing with Intel's Hypercube simulator (grav sim, fish & shark AI sim) and writing my own macro level visual-cpu execution simulator. Reply

Log in

Don't have an account? Sign up now