What about Hyper-Threading and IMC?

Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (See our elaborate discussion here), but that the engineers were given the task to develop a CPU with a great performance ratio that could be used for the Server, Desktop and Mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, these performance benefits will only be seen in server applications. SMT also makes the hotspots even hotter, so SMT didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy.

As far as including an Integrated Memory Controller (IMC), we were also told that the transistors which could have been spent in the IMC were better spent in the 4 MB shared cache. This is of course highly debatable, but it is a fact cache consumes less power. The standard party line from Intel is that keeping the memory controller on the chipset allows them to support additional memory types without having to re-spin the CPU core. That is certainly true, and with the desktop/mobile sectors using standard DDR2 while servers are set to move to FB-DIMM designs, the added flexibility isn't terrible. Techniques such as memory disambiguation and improved prefetch logic can also help to eliminate any advantage an IMC might offer. Would an IMC improve Core's performance? Almost certainly, but Intel will for the time being pursue other options.

Conclusion 1 : AMD K8 versus Intel P8

The Intel Core architecture is clearly the heir and descendant of the hugely successful P6 architecture. However, it has state of the art technology on board such as micro-op/macro-op fusion, memory disambiguation and massive SIMD/FP power.

Compared to the excellent AMD K8/Hammer architecture, the Core CPU is simply a wider, more efficient and more out of order CPU. When I suggested to Jack Doweck that the massive execution resources may not be fully used until SMT is applied, he disagreed completely. Memory disambiguation should push the current limits of ILP in integer loads a lot higher, and the massive bandwidth that the L1 and L2 can deliver should help Core to come close to the execution utilization percentages of the current P-M. 33% more execution potential could thus come very close to 33% more performance, clock-for-clock.

So is it game over for AMD? Well, if you read the previous pages, it is pretty clear that there are some obvious improvements that should happen in AMD's next generation. However, there is no reason at all to assume that the current K8 architecture is at the end of its life. One obvious upgrade possibility is to enhance the SSE/SIMD power by increasing the wideness of each unit or by simply implementing more of them in the out of order FP pipeline.

To sustain the extra (SIMD) FP power, AMD should definitely improve the bandwidth of the two caches further. The K7 had a pretty slow L2-cache, and the K8 doubled the amount of bandwidth that the L2 could deliver for example. It's not unreasonable to think a 256-bit wide cache bus could be added to a near-future AMD design.

Finally, there is also a lot of headroom for increasing integer performance. The fact that Loads can hardly be reordered has been a known weak point since the early K7 days. In fact, we know that engineers at AMD were well aware of it then, and it is surprising that AMD didn't really fix this in the K8 architecture. Allowing a much more flexible reordering of Loads - even without memory disambiguation - would give a very healthy boost to IPC (5% and more). It is one of the main reasons why the P-M can beat the Athlon 64 clock-for-clock in certain applications.

Those are just a few examples that are well known. It is very likely that there are numerous other possible improvements that could take the K8 architecture much further.

Looking at the server version of Core ("Woodcrest") and considering that it is very hard to find a lot of ILP in server applications, the only weakness of Core is that there is no multi-threading in each Core. This small disadvantage is a logical result of the design goal of Core, an architecture which is an all-around compromise for the server, desktop and mobile markets. The lack of Hyper-Threading in Xeon Core products might give Sun and IBM a window of opportunity in the heavy thread server application benchmarks, but since Tigerton (65 nm, two Woodcrests in one package, 4 cores) will come quickly, the disadvantage of not being able to extract more TLP might never be seen. Our astute readers will have understood by now that it is pretty hard to find a weakness in the new Core architecture.

Conclusion 2 : The free lunch is back!

It is ironic that just a year ago, Intel and others were downplaying the importance of increasing IPC and extracting more ILP. Multi-core was the future, single thread performance was a minor consideration. The result was that the reputed Dr. Dobbs journal headlined : "the free lunch is over" [1] claiming that only larger caches would increase IPC a little bit and that the days that developers could count on the ever increasing clockspeeds and IPC efficiency of newer CPU to run code faster were numbered. Some analysts went even further and felt that CPU packages with many relatively simple, small in-order CPUs were the future.

At AnandTech, we were pretty skeptical about the "threading is our only savior" future, as Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games. The fat, wide OoO core running at high clockspeeds was buried a little too soon. Yes, Intel's Core does not use the aggressive domino and LVS circuit-design strategy that NetBurst designs used to achieve stunning clockspeeds. At the same time, it is a fat, massive reordering CPU which gives free lunch to developers who don't want to spend too much time on debugging heavily threaded applications. Multi-core is here to stay, but getting better performance is once again the shared responsibility of both the developer and the CPU designer. Yes, dual-core is nice, but single threaded performance is still important!

I would like to express my thanks to the following people who helped to make this article possible:
Jack Doweck, "Foo", "Redpriest", Jarred and Anand

References

[1] The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software By Herb Sutter


Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies
Faster Load Times
POST A COMMENT

85 Comments

View All Comments

  • GeeZee - Friday, May 05, 2006 - link

    Even with all the new technologies put into the new "Core" architecture, I think Intel will have a very tough time putting the nails in the coffin of the Athlon/Opteron.

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.

    As for the future....AMD has a tremendous amount of companies that are working with them to produce the next gen chips. IBM, Sony, Transmeta, Nvidia, Cray. Pretty much all the Mobo/Chipset manufacturers are much more frendly with AMD than intel.

    I wouldn't count out AMD untill their next gen CPU's flop....and I don't think it will. Imagine AMD with access to the code morphing software & Transmeta's vliw chip as a co processor & Via's encryption core & HT 3.0. All working flawlessly due to the new memory modes introduced on AM2. Add onto that Transmeta's manufacturing patents would cut power by 50%.

    Via gets Royalties on each chip, Transmeta gets access to AMD core technolgies. Everyone wins.

    AMD really surprised Intel with the Athlon. And I think they have somthing up their sleeve after the AM2.
    Reply
  • IntelUser2000 - Friday, May 05, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Beat it?? Blow it away?? Have you seen the benchmarks of quad cores to know the reality?? Its the other way around. But when comparing against "Core" Duo that's different... Otherwise you are saying nonsense.
    Reply
  • GeeZee - Sunday, May 07, 2006 - link

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.



    LOL. Anyone with ANY common sense should realize that the guy doesn't know what he is talking about. He claims Yonah uses 50W!!! Who's a fanboy here...

    And let me explain those clovertown scores.

    #1. Possibly not a good benchmark for looking at average performance:
    Take a look at Cinebench scores. You'll see that Pentium Extreme Edition 840 will outperform Pentium D 840 by over 15%!!! Now where do you see benchmark scores which shows the Pentium EE's outperforming Pentium D's by 15%?? That's right, MOST OF THE TIMES, IT DOESN'T!!! Pentium D's can outperform Pentium EE's lots of times.

    #2. The author's mind-boggling flawed logic on Clovertown's score:

    He claims that the reason Clovertown scales only 4.85 by using 8 cores is because its bandwidth starved. http://www.digitalvideoediting.com/articles/viewar...">http://www.digitalvideoediting.com/articles/viewar...

    Ah what do you see?? Opteron only scales 4.85x too!!!

    So what's the opinion on the blog?? HE'S A BLINDED FANBOY!!

    Stop posting in forums and use your useless brain on something else.

    Why people make up these stupid blogs though?? They are afraid to admit that Intel can actually do make something GOOD.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Pffft. Where do you see that?? Care to reveal those benchmarks?? Still in denial after looking at what Core Duo can do??
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    There are 3 main things people argue about when doubting Conroe.

    1. IDF system's scores are wrong because Intel could have modified the benchmarks.
    2. The K7/K8 decoders can all do complex instruction decoding which is better than Core
    3. The apps that doesn't fit in 4MB cache will perform slow.




    My response:
    1. ANANDTECH has shown that AFTER using THEIR OWN Quake 4 benchmark, the discrepancy between Conroe and OC'ed FX-60 INCREASED, indicating Intel's benchmarks are RATHER conservative.
    2. First, the two decoders(K7 and Core) can't be compared directly. While it was TRUE that K7 had superior decoder capability compared to P6, its different with Core, because more of the instructions that used to go to the complex decoder on the P6 now goes to the simple decoders in Core.
    3. The doubled AND lowered latency L2 cache on the Northwood gave 6-11%(Avg. 8.5%) gain in games. Doubled L2 cache on Barton gave 4-8%(6%) increase. Difference between Athlon 64 3000+(2.0GHz 512KB L2 single channel S754) and 3200+(1MB cache version) is 2.2-8%(5.1%).

    Caches doesn't do much. People seem to be somehow expecting 20% difference on the cache alone.
    Reply
  • Accord99 - Monday, May 08, 2006 - link

    Those scores beat a 4 single-core or a 2 dual-core Opteron system. Reply
  • clairvoyant129 - Sunday, May 07, 2006 - link

    How ironic you post that website in response to the above user (also calling him a fanboy) when it's a known fact that the author of the site manipulates information to favor AMD. Why don't you think a little next time? Reply
  • theteamaqua - Friday, May 05, 2006 - link

    im glad that intel is back on track, if they keep falling behind AMD, AMD is gonna jack up the price, intel jsut slash its cpu as much as 50%, the Pentium D 950, my mobo wont support conroe so ill jsut have to get the 960 when conroe launches,

    but what interest me most is the quad-coare thats coming Q1 next year, hopefully the performance can be as close to 200% of a dual-core counter-part running at the same speed
    Reply
  • thestain - Friday, May 05, 2006 - link

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.

    But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?

    Mike
    Reply

Log in

Don't have an account? Sign up now