Intel Core versus AMD's K8 architecture

by Johan De Gelas on 5/1/2006 4:00 AM EST
POST A COMMENT

85 Comments

Back to Article

  • GeeZee - Friday, May 05, 2006 - link

    Even with all the new technologies put into the new "Core" architecture, I think Intel will have a very tough time putting the nails in the coffin of the Athlon/Opteron.

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.

    As for the future....AMD has a tremendous amount of companies that are working with them to produce the next gen chips. IBM, Sony, Transmeta, Nvidia, Cray. Pretty much all the Mobo/Chipset manufacturers are much more frendly with AMD than intel.

    I wouldn't count out AMD untill their next gen CPU's flop....and I don't think it will. Imagine AMD with access to the code morphing software & Transmeta's vliw chip as a co processor & Via's encryption core & HT 3.0. All working flawlessly due to the new memory modes introduced on AM2. Add onto that Transmeta's manufacturing patents would cut power by 50%.

    Via gets Royalties on each chip, Transmeta gets access to AMD core technolgies. Everyone wins.

    AMD really surprised Intel with the Athlon. And I think they have somthing up their sleeve after the AM2.
    Reply
  • IntelUser2000 - Friday, May 05, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Beat it?? Blow it away?? Have you seen the benchmarks of quad cores to know the reality?? Its the other way around. But when comparing against "Core" Duo that's different... Otherwise you are saying nonsense.
    Reply
  • GeeZee - Sunday, May 07, 2006 - link

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.



    LOL. Anyone with ANY common sense should realize that the guy doesn't know what he is talking about. He claims Yonah uses 50W!!! Who's a fanboy here...

    And let me explain those clovertown scores.

    #1. Possibly not a good benchmark for looking at average performance:
    Take a look at Cinebench scores. You'll see that Pentium Extreme Edition 840 will outperform Pentium D 840 by over 15%!!! Now where do you see benchmark scores which shows the Pentium EE's outperforming Pentium D's by 15%?? That's right, MOST OF THE TIMES, IT DOESN'T!!! Pentium D's can outperform Pentium EE's lots of times.

    #2. The author's mind-boggling flawed logic on Clovertown's score:

    He claims that the reason Clovertown scales only 4.85 by using 8 cores is because its bandwidth starved. http://www.digitalvideoediting.com/articles/viewar...">http://www.digitalvideoediting.com/articles/viewar...

    Ah what do you see?? Opteron only scales 4.85x too!!!

    So what's the opinion on the blog?? HE'S A BLINDED FANBOY!!

    Stop posting in forums and use your useless brain on something else.

    Why people make up these stupid blogs though?? They are afraid to admit that Intel can actually do make something GOOD.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Pffft. Where do you see that?? Care to reveal those benchmarks?? Still in denial after looking at what Core Duo can do??
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    There are 3 main things people argue about when doubting Conroe.

    1. IDF system's scores are wrong because Intel could have modified the benchmarks.
    2. The K7/K8 decoders can all do complex instruction decoding which is better than Core
    3. The apps that doesn't fit in 4MB cache will perform slow.




    My response:
    1. ANANDTECH has shown that AFTER using THEIR OWN Quake 4 benchmark, the discrepancy between Conroe and OC'ed FX-60 INCREASED, indicating Intel's benchmarks are RATHER conservative.
    2. First, the two decoders(K7 and Core) can't be compared directly. While it was TRUE that K7 had superior decoder capability compared to P6, its different with Core, because more of the instructions that used to go to the complex decoder on the P6 now goes to the simple decoders in Core.
    3. The doubled AND lowered latency L2 cache on the Northwood gave 6-11%(Avg. 8.5%) gain in games. Doubled L2 cache on Barton gave 4-8%(6%) increase. Difference between Athlon 64 3000+(2.0GHz 512KB L2 single channel S754) and 3200+(1MB cache version) is 2.2-8%(5.1%).

    Caches doesn't do much. People seem to be somehow expecting 20% difference on the cache alone.
    Reply
  • Accord99 - Monday, May 08, 2006 - link

    Those scores beat a 4 single-core or a 2 dual-core Opteron system. Reply
  • clairvoyant129 - Sunday, May 07, 2006 - link

    How ironic you post that website in response to the above user (also calling him a fanboy) when it's a known fact that the author of the site manipulates information to favor AMD. Why don't you think a little next time? Reply
  • theteamaqua - Friday, May 05, 2006 - link

    im glad that intel is back on track, if they keep falling behind AMD, AMD is gonna jack up the price, intel jsut slash its cpu as much as 50%, the Pentium D 950, my mobo wont support conroe so ill jsut have to get the 960 when conroe launches,

    but what interest me most is the quad-coare thats coming Q1 next year, hopefully the performance can be as close to 200% of a dual-core counter-part running at the same speed
    Reply
  • thestain - Friday, May 05, 2006 - link

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.

    But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?

    Mike
    Reply
  • IntelUser2000 - Friday, May 05, 2006 - link

    quote:

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.

    But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?

    Mike


    LOL. You crack me up. Go see how much doubling of L2 caches help to increase performance. I guess the last 5 years of Netburst screwed people's mental abilities. Sure the caches will help Conroe, but if the CPU doesn't really need the extra cache, then it will be a waste. Kinda like how doubling L2 caches on Pentium D doesn't help a lot. Kinda like how doubling L2 caches on Athlon 64's don't help either. It's why Semprons excel.

    About decoders. Guess you are still in the old ages where one of the reasons K7 was better than P6 was because it has the ability to decode complex instructions in all decoders. If you read about Conroe, more of the instructions that USED to go to the complex decoders can now go to the simple decoders.

    quote:

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.


    And which benchmark would that be.

    Guess there is gonna be a lot of AMD fanboys that are gonna cry when Conroe is shown.
    Reply
  • stopkidding - Tuesday, May 02, 2006 - link

    Did anyone notice that this comment thread is virtually free of the usual "Intel-this, AMD-that" comments that usually are seen on this site. The "fanbois" have nothing to bitch about as their little brains can't comprehend whats written in this article! :-) Reply
  • Reynod - Wednesday, May 03, 2006 - link

    Which is a sigh of relief I must say. I can swallow hard facts and interpret code ... my 4400+ looks like going in as my new server box ... and my next gaming box looks like being an OC'd Conroe. I just won't buy an Intel Mobo ... heh heh. Reply
  • mino - Tuesday, May 02, 2006 - link

    IMHO not, the article is extremely well written AND there are NO benchmarks => Intelman is happy from the text; AMD-man is hoping the real numbere won't be so bad...

    On topic, article is written in a very good style for general public.

    On thing I'am afraid of is the moment code is optimized for Core, any other irchitecture would take a performance hit. K8 the smallest one, PM/K7 the small one, P4/P6 the big one and all older plus C7 e pretty huge hits.

    That bothers me.

    Except that, AMD will live for a long time (Opteron alone would survive them for 5+ yrs.) and X2's will be finally cheaper. What else to pray for :)

    Best regards.
    Reply
  • mino - Tuesday, May 02, 2006 - link

    addennum:
    "the article is extremely well written FOR GENERAL AT AUDIENCE"

    Otherwise job well done Johan.
    Reply
  • nullpointerus - Tuesday, May 02, 2006 - link

    Right, but then why do they respond to the other articles? Reply
  • JustAnAverageGuy - Tuesday, May 02, 2006 - link

    Another top notch article, as always, Johan

    - JaAG
    Reply
  • dguy6789 - Tuesday, May 02, 2006 - link

    Thank you for writing this article. You have cleared up a large quantity of questions that I had in relation to the Core architecture. Reply
  • Betwon - Tuesday, May 02, 2006 - link

    sub eax,[edi+ebx+79]

    There are 3 registers used: eax, edi, ebx

    For Core duo, it decodes to one fusion-micro-op.
    In the reservation station (RS), only one entry is needed to be allocated. There are three registers spaces in one RS entry at least. And the results of address(edi+ebx+79) can be w rited back into the same position of one register in this RS entry.(A replace method)

    For K7/K8, it decodes to one macro-op?
    In the reservation station (RS), only one entry is needed to be allocated? There are three registers spaces in one RS entry?

    It can take one entry in ROB.

    But I don't believe AMD. It may take two entrys in RS, because there are only two registers spaces in one RS entry of K8. K8 hasn't three registers spaces in one RS entry.

    K8's RS is up to 8X3 macro-op, but not means that one macro-op can always take one entry in the RS.
    I say that I don't believe AMD.
    Of course, Other people have on need to believe me too.
    Reply
  • Betwon - Wednesday, May 03, 2006 - link

    If you really want to know what is the Intel's load reordering and memory misambiguation, I can tell you the facts:

    http://www.stanford.edu/~merez/papers/LoadSched_IS...">http://www.stanford.edu/~merez/papers/LoadSched_IS...
    Speculation Techniques for Improving Load Related Instruction Scheduling 1999
    Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- From Intel's Haifa, they designed the Load/Store Unit of Core.

    I had said that anandtech should study many things about CPU. Of course, I should study more things about CPU.
    Reply
  • Betwon - Tuesday, May 02, 2006 - link

    P6: sub [mem],eax decodes to three micro-ops
    Core duo: sub [mem],eax decodes to two micro-ops
    K8: sub [mem],eax decodes to one macro-op

    P6: sub eax,[mem] decodes to two micro-ops
    Core duo: sub eax,[mem] decodes to one micro-op
    K8: sub eax,[mem] decodes to one macro-op

    Intel's micro-fusion is different with the K7/K8's macro-op.

    P4 has 2X2 int ALU and 2 AGU.
    K7/K8 has 3 int ALUand 3 AGU.
    But Core duo has only 2 int ALU and 2 AGU.

    The integer performance:
    Core duo>K7/K8

    Why?
    Because Core duo's length of the depenency chain of the critical path is the shorest.
    The most integer program asm codes can be thought as a high and thin tree of the depenency chain, (the longest depenency chain is called the critical path)
    The critical path determines the performance. The length of the depenency chain of the critical path is the cycles needed to complete this critical path.
    Core duo(2ALU/2AGU) spends less cycles than K7/K8(3ALU/3AGU) -- Because more INT funtions can not accelerate the true dependency atoms-operations.

    The P-M/Core duo's special ability of anti true dependency atoms-operations is the real reason of it's INT outperformance, which is different with old P6(such as Pentium 3).

    The most FP program asm codes can be thought as a boskage (there are many short depenency chains). The best ILP can be performed--The more FP FADD/FMUL funtions, the more performence.

    Double the FP funtions or double the speed of half-speed FP functions is a good idea for the most FP programs.
    But double INT funtions do not always enhence the INT preformence so greatly.
    Conroe only has three ALUs and two AGUs, but not 4 ALU and 4AGU.
    K7/K8 has three ALUs and three AGUs.
    Reply
  • Betwon - Tuesday, May 02, 2006 - link

    Sorry, I'm not one from a English-language nations. I just tell my idea with many spelling or syntax errors.
    funtions -- funtion units

    I want to tell why Conroe with only 3ALU/2AGU has the INT outperformance. Core has the superexcellent ability to process the true dependency chains.(much much better than K8 3ALU/3AGU).
    Even Core duo with 2ALU/2AGU has the INT outperformance(much better than K8 3ALU/3AGU).

    The length of the depenency chain of the critical path
    The true dependency atom-operations
    Reply
  • Starglider - Monday, May 01, 2006 - link

    A truly excellent article. I just have a couple of questions;

    quote:

    The second and more important advantage is the on die memory controller, which lowers the latency to the memory considerably. However, the lower clockspeeds of the Core CPUs (relative to NetBurst) and the faster FSB also lower latency significantly. With the numbers available to us now, we have reason to believe that the Athlon 64 X2's latency advantage will shrink to only 15 to 20%.


    It's clear that increasing FSB speed can reduce memory latency. However I'm not clear why lower CPU core speed will reduce absolute latency - sure it will reduce the number of CPU cycles that occur while waiting for memory, but how can it reduce the absolute delay?

    You don't seem to have included inter-instruction latency in your comparison tables. I know this data can be hard to get hold of, but it's critical to the performance of highly serial code (e.g. the pi calculation benchmarks that seem to be so popular at the moment). Is there any chance it could be included?

    Finally I'm wondering if Intel will revist some of the P4 clock speed enhancing tricks later on. Things like LVS and double pumped AUs would only have slowed down an already complex development process that Intel desperately needed to be completed quickly. But if AMD do come out with a new architecture that matches or exceeds Conroe on IPC, Intel might be able to respond quite quickly by bringing back some of their already well understood clock speed tricks to accelerate Conroe.
    Reply
  • Makaveli - Monday, May 01, 2006 - link

    You need to get away from thinking of increased clockspeed for extra performance, the future is multicore Cpu's and parellism. Reply
  • Starglider - Tuesday, May 02, 2006 - link

    I write heavily multithreaded applications for a living, but sometimes there is just no substitute for fast serial execution; a lot of things just can't be parallelised. Serial execution speed is effectively IPC * clock rate, so yes increasing clock rate is still very helpful as long as IPC doesn't suffer. Reply
  • saratoga - Monday, May 01, 2006 - link

    ^^ Did you read the post you replied to? His point is valid.

    Lower clock speed is not going to improve memory latency. It may mean that latency is less painful, but if you took two core chips, one at 2GHz and the other at 3GHz, the absolute latency is roughly if not exactly the same for each. Though the cost of each ns of latency is 50% more dear on the 3GHz chip.
    Reply
  • Spoonbender - Tuesday, May 02, 2006 - link

    It would be more accurate to say that each cycle of latency is more dear on the 2ghz chip, wouldn't it? ;)
    What matters is how *long* the latency is, in ns. A ns latency is a ns, and it forces the cpu to wait exactly one ns, no matter its clock speed... :)

    So yeah, definitely a valid point, and I wondered about that in the article as well.
    Reply
  • saratoga - Wednesday, May 03, 2006 - link

    quote:

    It would be more accurate to say that each cycle of latency is more dear on the 2ghz chip, wouldn't it? ;)


    No. Its 50% worse for the 3GHz chip (since the clock speed is 50% higer).

    quote:

    What matters is how *long* the latency is, in ns. A ns latency is a ns, and it forces the cpu to wait exactly one ns, no matter its clock speed... :)


    And one ns is how many clock cycles on a 2GHz chip? And how many of a 3GHz chip? Think this through . . .
    Reply
  • Spoonbender - Wednesday, May 03, 2006 - link

    quote:

    No. Its 50% worse for the 3GHz chip (since the clock speed is 50% higer).

    No it isn't. They both waste *exactly* one ns of execution per, well, ns of latency. How many cycles they can cram into that ns is irrelevant.

    [quote]
    And one ns is how many clock cycles on a 2GHz chip? And how many of a 3GHz chip? Think this through . .
    [/quote]
    Yeah, of course, with 1 ns latency, a 3ghz chip will waste more clock cycles than a 2ghz chip, yes. That's obvious. But they will both lose exactly 1 ns worth of execution. That's what matters. Not the number of clock cycles.
    If they both perform equally well, despite the clock speed difference, then adding 1 ns latency to both will have exactly the same impact on both. Yes, the 3ghz chip will lose more clock cycles, but the 2ghz will (if we stick with the assumption of similar performance), have a higher IPC, and so waste the same amount of actual work.

    If you like, look at Athlon 64 and P4.
    If both cpu's waste one clock cycle, then the A64 takes the biggest impact, because of its higher IPC.
    If both cpu's waste one ns, it doesn't change anything. True, the P4 loses the biggest amount of cycles, but as I said before, the A64 loses more *per* cycle. The net result is that they both lose, wait for it, *one* nanosecond's worth of execution.
    Reply
  • BigT383 - Monday, May 01, 2006 - link

    I loved this article. It's due to articles like these that I've been reading Anandtech since before the days of the K6-2. Reply
  • PandaBear - Monday, May 01, 2006 - link

    Of course Core should be better than K8, it better be.

    The only thing I am concerned about the Core architecture is with all these additional stuff, it will probably cost a lot to make, not just the CPU, but the MB, chipset, will also be expensive with the additional high speed circuitry. That means it will probably cost more.

    K8 has been 5 years old and it is not bad standing against the latest and greatest. If AMD have something in the pipeline that will be the next monster CPU, it will be great. What I am concern about AMD is whether they can keep their yield up and have enough $ left behind to design K9 and beyond. Don't just sit there and lose the momentum they gain.
    Reply
  • saratoga - Monday, May 01, 2006 - link

    Core is a pretty conservative design with a pretty small die for a new core. It should be very economical to produce. Probably more so then the chips its replaceing. Reply
  • IntelUser2000 - Monday, May 01, 2006 - link

    quote:

    Of course Core should be better than K8, it better be.

    The only thing I am concerned about the Core architecture is with all these additional stuff, it will probably cost a lot to make, not just the CPU, but the MB, chipset, will also be expensive with the additional high speed circuitry. That means it will probably cost more.


    Not really. Not many expected that Intel will do more than increasing clock speeds and cache sizes since that's what they have been doing that since Pentium II.

    http://www.reghardware.co.uk/2006/04/05/intel_conr...">http://www.reghardware.co.uk/2006/04/05/intel_conr...

    The ASP went down. $530 for the fastest mainstream Conroe is rather good.
    Reply
  • zsdersw - Monday, May 01, 2006 - link

    The pricing put out by Intel suggests that Core will be priced very aggressively. I can't see the 975 chipset costing significantly more than it does now when Core is released.

    The fact that Core is going to be built on Intel's 65nm process means that the "additional stuff" you refer to will cost less than it would if built on the 90nm process. And the die size probably grew a little, but not enough to offset the cost gains from the 65nm process.
    Reply
  • xtremejack - Monday, May 01, 2006 - link

    K8 is only 3 years old. Didn't AMD celebrate their 3rd anniversary of Opteron a few days ago. Reply
  • Griswold - Thursday, May 04, 2006 - link

    Its been sold for 3 years, but clearly the design is "a few days" older than that. Reply
  • evident - Monday, May 01, 2006 - link

    as a junior computer engineer at villanova university, i found this article to be really informative and an awesome read. it's really cool to see the differences between these CPU architectures and shows that they are actually teaching me something useful! Reply
  • PeteRoy - Monday, May 01, 2006 - link

    How can you say Netburst wasn't a huge success?

    I think Netburst was a success when it was launched and it should have died sooner, but it was good for it time and now it will be replaced.
    Reply
  • JarredWalton - Monday, May 01, 2006 - link

    NetBurst started at 1.5 GHz basically and topped out at 3.8 GHz. Compared to previous architectures, that's pretty tame. P6 went from 150 MHz to 1.26 GHz (and beyond if you want to count P-M). Success monetarily vs. success as an overall design are two different things, and clearly NetBurst ran into trouble. Where are the 5 GHz+ Tejas chips? Waiting somewhere beyond the thermal even horizon.... :) Reply
  • Missing Ghost - Monday, May 01, 2006 - link

    hum, no. There is a 1.4gHz P6, you forgot Tualatin. Reply
  • JarredWalton - Monday, May 01, 2006 - link

    1.26 was Tualatin as well, but that's beside the point. Basically, clock speed at launch vs. final clock speed of the architecture was a disappointment for Intel. They were hoping for 6+ GHz at launch, and even thinking 10 GHz might be possible. Reply
  • KayKay - Monday, May 01, 2006 - link

    Someone should hire you to write textbooks because this was explained extremely well and in simple terms. Good Job Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    Thanks! :-)

    Very happy to read that.
    Reply
  • BitByBit - Monday, May 01, 2006 - link

    Fantastic article.
    In retrospect, it is easy to conclude that this is the route Intel should have chosen for P6's retirement.

    Core looks to be a very strong, all-round performer, unlike Netburst.
    We can only hope that AMD has an answer in the works, as K8 will have a hard time competing with this monster.
    It is unreasonable in my mind to expect a 4-5(?) year-old architecture to be able to compete with Intel's latest. AMD with K8 has had a long reign as the performance king, but is now facing something entirely different. Perhaps K8L will be able to offer serious competition.
    It will, however, take more than a doubling of the FP units (if rumour is correct) to achieve this. The cumulative effect of Conroe's architectural features (memory disambiguation, macro-ops fusion etc...) mean that Core's efficiency has far exceeded K8's, not to mention the impact of its vastly superior cache system - its 8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1.

    It may not be until K10 is released that AMD takes back the performance crown.



    Reply
  • Larso - Monday, May 01, 2006 - link

    As the K8 is about 5 years old, and the current incarnations doesn't really include that many modifications, I wonder what AMD's engineers have been doing all these years. The K8 is not that different from the K7 even.

    Whats coming up? The AM2 version is basically the same beast with a new memory controller. The K8L, well since they didn't name it K9, I suppose its just small upgrades to the same design.

    I really like to think AMD has something coming we don't know about. Or rather, they ought to have something coming... Any rumors?
    Reply
  • Reynod - Monday, May 01, 2006 - link

    I can't help but think (and pray) that Larso's comment has some validity here.
    Why would AMD sit back and do nothing for so long? Would they have not been tinkering with various prototypes over the last couple of years? Are we in for a surprise?? Anand, you and the review team touched on several improvements they could make, care to outline these in some detail in a future article? Someone needs to give AMD some free advice ... heh heh
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    Keep in mind that AMD doesn't have Intel's resources. Until recently, they still lost money every quarter. So they might not have been able to work on a successor to K8 until recently. (I remeber reading an interview with some AMD boss, saying that the K8 was literally a last-ditch effort to survive. If that failed, there wouldn't be an AMD, so they threw everything they had at it)

    So "Why would AMD sit back and do nothing for so long?" Because they had a good project, and didn't have the resources to make a new one?
    Of course, it probably isn't that bad, just tossing out an alternative scenario. ;)

    However, they have hinted that they were working on specific architectures for the notebook and server markets. (Unlike Intel who are moving back to a single unified architecture).

    And despite its age, the K8 is still a pretty nice architecture, and it wouldn't be a huge undertaking to improve on it to get something quite a bit more efficient. Intel had to develop a new architecture because NetBurst just wouldn't cut it. AMD can probably afford to expand on K8 a bit longer, and even with K9/K10, I wouldn't expect a vastly different architecture.
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    "Because they had a good project" <- Was supposed to be product, not project... :) Reply
  • psychobriggsy - Monday, May 01, 2006 - link

    AMD said recently that they have three times the engineers on their books as they did when they designed K8.

    However I suspect they're working on K10/KX, although maybe some of them worked on K8L.

    Clearly it seems that some in-core work could translate into reasonable performance gains for the current K8 design. A 4-way L1 cache instead of 2-way for example, and a greater L2 to L1 bandwidth. Certainly a mechanism to reorder instructions so that loads can be performed earlier seems to be necessary. 2MB L2 per core could also help, and the 65nm die pictures that AMD showed recently did seem to show far denser cache. K8L is rumoured to include more FP resources, but I don't know about any of the other stuff - but AMD will be talking more about K8L (and beyond?) in June apparently.
    Reply
  • spinportal - Monday, May 01, 2006 - link

    Definitely a great treasure of an article to find on a monday morning detailing the Core architecture that the world is drooling for in June. I wonder what kind of simulation micro-arch software Intel and AMD use, as I remember the days doing my Masters, playing with Intel's Hypercube simulator (grav sim, fish & shark AI sim) and writing my own macro level visual-cpu execution simulator. Reply
  • JarredWalton - Monday, May 01, 2006 - link

    I won't say it's a quick fix, but just as Core is a derivative of P6, AMD could potentially just get some better OoO capabilities into K8 and get some serious performance improvements. Their current inability to move loads forward much (if at all) makes them even more dependent on RAM latency. You could even say that they *needed* the IMC to improve performance, but still L2 latency is far better than RAM latency, and cutting down L2 latency hit from 12 cycles to say 6 cycles (if you can do the load 6 cycles early) would have to improve performance. Loads happen "all the time" in ASM, so optimizing their performance can pay huge dividends.

    I'm going to catch flak for this, but basically Intel has more elegant designs than AMD in several areas. It comes from throwing billions of dollars at the problems. Better L1 cache? Yup - 8-way vs. 2-way is pretty substantial; 256-bit vs. 128-bit is also substantial. Better specialization of hardware? Yeah, I have to give them that as well: rather than just using three ALUs, they often take the path of having a few faster ALUs to handle the common cases.

    Really, the only reason AMD was able to catch (and exceed) Intel performance was because Intel got hung up on clock speeds. They basically let marketing dictate chip design to engineering - which is never good, IMO, at least not in the long run. Even NetBurst still has some very interesting design features (double-pumped ALUs, specialization of functional units, trace cache), and if nothing else it served as a good lesson on how far you can push clock speeds and pipeline lengths before you start encountering some serious problems. I would have loved to see Northwood tweaked for 90nm and 65nm, personally - 31+8 pipeline stages was just hubris, but 20+8 with some other tweaks could have been interesting.

    Here's hoping AMD can make some real improvements to their chips sooner rather than later. Intel Core is looking very strong right now, and I would rather have close competition than a 20% margin of victory like we've been seeing lately. (First with AMD K8 beating NetBurst, and now it looks like Conroe is going to turn the tables.)
    Reply
  • Regs - Wednesday, June 07, 2006 - link

    Okcourse you will catch flack. It's an opinion.

    My opinion is Intel is more innovative or even compromising
    while AMD is more intuitive.
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    "8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1."
    It should? I'd like to see some sources on that. From what I've seen, the 64KB cache still has an advantage there, with a hitrate not much below that of a 64KB/8-way.

    Also, I disagree that Intel's CPU's are generally more elegant. First, their L1 cache isn't neccesarily "better" (see above). Of course, the 256 vs 128 bit bandwidth is a big factor, however.

    Specialization in hardware? Is that elegant? I'd say there's a certain elegance in making a general solution as well, as opposed to specializing everything to the point where you're screwed if the code you have to execute isn't 100% optimal.

    And I definitely think AMD's distributed reservation stations are more elegant than the central one used by Intel. Same goes with the usual HyperTransport vs FSB story.
    There are a few other really elegant features of the K8 that I haven't seen duplicated in Core.

    So overall, I don't see the big deal with "elegance". Both architectures have plenty of elegant features. However, the K8 is definitely aging, and will have problems keeping up with Conroe.

    But then again, the K8 die is tiny in comparison. They've got plenty of space for improvements.

    Really looking forward to
    1) Being able to get a Merom-powered laptop, and
    2) Seeing what AMD comes up with next year.
    Reply
  • IntelUser2000 - Monday, May 01, 2006 - link

    quote:

    But then again, the K8 die is tiny in comparison. They've got plenty of space for improvements.


    Tiny?? 199mm2 for 2x1MB cache K8 at 90nm is tiny?? Ok there. Conroe with 4MB cache is around 140mm2 die size.

    http://www.aceshardware.com/forums/read_post.jsp?i...">http://www.aceshardware.com/forums/read_post.jsp?i...

    Even compairing against Intel's SRAM size for 90nm and 65nm comparison, at 90nm, Conroe would be 250mm2. In Prescott, 1MB L2 cache takes 16-17mm2. 250mm2-32mm2=218mm2.

    And Intel didn't to shrinks that are relative to SRAM sizes. In comparison for Cedarmill and Prescott 2M, which are same cores essentially.

    Prescott 2M: 135mm2
    Cedarmill: 81mm2

    Only difference being process size, the comparison is 0.6.

    Conroe at 90nm would have been 233mm2, which is compact as X2 per core.
    Reply
  • Spoonbender - Tuesday, May 02, 2006 - link

    Okay, I guess I should have said that at the same process size, it is tiny. I meant that when AMD gets around to migrating to 65nk as well, they'll have a smaller core (assuming no big changes to the chip), which gives them plenty (some?) of room for improvement.

    [quote]
    Conroe at 90nm would have been 233mm2, which is compact as X2 per core.
    [/quote]
    But in absolute terms, still bigger than an Athlon X2. Which means AMD has some space for improvement. That was my only point. I guess I should have been more clear. ;)
    Reply
  • coldpower27 - Wednesday, May 03, 2006 - link

    Yes, if using the 0.6 Factor for Brisbane the 65nm Athlon 64x2 it will be around 132mm2 assuming no changes over the 220mm2 Windsor DDR2 Athlon64x2. 199mm2 is only for Toledo which is reaching end of life and can no longer be used as a comparison. And it's irrelevent to compare to Conroe on what it would be on 90nm as it never was built on 90nm technology to begin with.

    Conroe is looking to be ~14x mm2 with the x=0-9. Yes if you can compare them at the same process nodes considering the Conroe will only be competing with the 65nm Dual Core Athlon 64x2 in the second half of it's lifetime.
    Reply
  • JumpingJack - Tuesday, May 02, 2006 - link

    Nice analysis, the current AMD X2 dual cores are about 1.5 to 2.0 X the size of Intel dual cores (on 65 nm), this is where 65 nm adds such a benefit. Conroe will come in around 140 mm^2 as you said. Yohna at 2 meg shared is 90 mm^2, less than 1/2 the X2.

    Right now, cost wise in Si realestate AMD is more expensive.
    Reply
  • BitByBit - Monday, May 01, 2006 - link

    quote:

    From what I've seen, the 64KB cache still has an advantage there, with a hitrate not much below that of a 64KB/8-way.


    Here is a good article on processor cache:
    http://en.wikipedia.org/wiki/CPU_cache">http://en.wikipedia.org/wiki/CPU_cache

    If you scroll down to the miss-rate vs. cache size graph, you can see that an 8-way 64Kb cache has a miss-rate less than one-tenth the miss-rate of a 2-way 64Kb cache.

    An 8-way 64Kb * 2 L1 would probably be too difficult to implement, given the time it would take to search. However, according to the relationships shown by that graph, increasing the Athlon's L1 associativity to 4-ways could yield a nice boost in hitrate and consequently performance.


    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    And of course, we all know wikipedia is the ultimate source of all truth and knowledge... ;)

    Keep in mind that this graph only shows the Spec2000 benchmark (and only the integer section, at that). That's far from being representative of all code.

    According to http://www.amazon.com/gp/product/1558605967/002-28...">http://www.amazon.com/gp/product/155860...-2818646... which, in my experience is pretty damn good, the missrates are as follows *in general*:

    32KB, 8-way: 0.037
    64KB, 2-way: 0.031
    64KB 8-way: 0.029

    But yeah, of course improving the Athlon's cache would help. But it's not the first place I'd look to optimize. For one thing, making it more complex would, as seen above, not yield a significantly lower hitrate, but it would slow the cache down, either forcing them to increase its latency, or limiting the frequency potential of the cpu as a whole. The cache bandwidth might be a better candidate for improvement. Or some of the actual cpu logic. Or the L2 cache size. I think the L1 cache is pretty healthy on the K8 already.
    Reply
  • Betwon - Tuesday, May 02, 2006 - link

    The data can not be used for Core -- Because it did not use the smart prefetcher.

    The Advanced smart prefetchers of Core's L1D have decreased the miss-rates very much. In fact, The data cache of Core --much more efficiency than K8's.
    Compared with Core's smart cahce, K8's 64KB L1D is like an idiot .
    Reply
  • Spoonbender - Tuesday, May 02, 2006 - link

    How do you know? As the article said, the prefetching *might* in some cases decrease performance, even if it'll usually be an advantage. But I don't really think you have enough information to make a valid comparison. My point was simply that generally speaking, a 64KB, 2-way associative cache will have better hit rates than a 32KB 8-way associative. Of course, having fancy prefetching is always a good thing, but its effect *is* limited. If it was a huge improvement, people would have done that 8 years ago, instead of just messing with cache size and associativity. Reply
  • Betwon - Tuesday, May 02, 2006 - link

    Your information is too old and should be updated now.
    Prefetcher give much improvement in reducing the miss-rate.
    About 30-90% miss rate reduced.
    The good prefetcher tech is one of the most important performance factors.

    http://www.hpcaconf.org/hpca11/slides/hpca_inst_sl...">http://www.hpcaconf.org/hpca11/slides/hpca_inst_sl...
    Reply
  • Betwon - Tuesday, May 02, 2006 - link

    Who is James E. Smith? I think that you should know him.

    Data Cache Prefetching Using a Global History Buffer -- the prefetcher bring the great performance improvement! From 20-110%
    abstract
    http://ieeexplore.ieee.org/search/freesrchabstract...">http://ieeexplore.ieee.org/search/frees...+buffer%...
    Of course, you can download the full-text pdf file, if you have a IEEE member account. I can download and view it, but can not release it.
    slides ppt
    http://www.ece.gatech.edu/~leehs/ECE7102/slides/ka...">http://www.ece.gatech.edu/~leehs/ECE7102/slides/ka...
    Reply
  • Sunrise089 - Monday, May 01, 2006 - link

    flak flak flak

    Seriously - props to the author on a good article, but if I had one comment it would be that there are length issues to trying to provide the ammount of background needed for this sort of article. I think it's best to either just draw the comparisons between the two chips, or do a full-length many thousands of words write-up on the technical importance of the various topics. I read the article, and while writen well and informative in it's conclusions, I cannot say all the background was enough to make me really understand the concepts better. For example I already knew what out-of-order execution was, but only being able to read a few hundred words more on it didn't allow me to really learn enough to understand all of the reasons why the K8 had a disadvantage in that area, and if all you wanted was for me to understand that it did indeed have the disadvanatge, you could have just said so.
    Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    It is indeed an issue I struggled with. Writing full length articles on these subjects doesn't sound like a good idea for me: I personally do not like lengthy articles either. So I tried to keep a balance between being technical and keeping it understandable.

    Anyway, Just ask about the points where you were lost. Especially on the OoO matters: it is much more interesting than "AMD has a disadvantage". Basically, reordering happens between the decoding and the execute phase.

    Pushing loads forward helps in two ways:
    1.Whenever a load fails to get it's data from the L1-cache, the CPU has to find other instructions to execute. As loads are very common, it is easier to fill the gaps than when you can not move loads before other loads.

    2. If a load gets pushed forward and a L1-cache miss for that load occurs, it isn't that bad. This is very simplified, but assume the load has been pushed 5 cycles forward, and your L2-cache latency is 10, you only have to wait 5 cycles instead of 10.
    Reply
  • Furen - Monday, May 01, 2006 - link

    I'll be the grammar nazi today, lol.

    Last page, paragraph 5: "[...] increasing the <b>wideness</b> of each unit [...]"
    Width, perhaps? "Wideness" refers to either quality or state (neither of which is discrete) while "width" also also applies to measurable fact (128-bits wide, for example). You can talk about the wideness of the units, for example, but you cannot talk about increasing their wideness...

    Great article, by the way, it's been long since I've read such an enjoyable article.
    Reply
  • emboss - Monday, May 01, 2006 - link

    Just a quick note ... on page 4 you have the table with the execution unit details. There's a couple things incorrect (IMO) in the numbers.

    First, you list the number of double precision FLOPs per cycle. Double precision can be done with SSE, so in the K8 you can do 2 DP ADDs and 2 DP MULs every two cycles (due to the 64-bit wide datapaths), a total of 2 DP FLOPs per cycle.

    Core can do two SSE operations per cycle (the two symmetric units), giving it a total of 4 DP FLOPs per cycle. The third SSE unit does not handle FP ops, but instead handles shuffles and the like.

    Obviously, double both of these numbers if you want a "peak" single precision FLOPs per cycle.

    If instead you were meaning about extended precision (64 bit precision, 80 bit floats) x87 operations, it's exactly the same concept as above since Core has apparently has combined SSE/x87 units (and a fully pipelined FMUL, unlike the P4). This gives both the K8 and Core 2 EP FLOPs per cycle.

    Finally, you have the number of SSE units for the K7 wrong. The K7, like the K8, has two SSE units (FADD and FMUL), and the same 64 bit datapath as the K8. Of course, the K7 cannot handle SSE2, so must use x87 instructions for double precision (ie: two DP FLOPs per cycle).


    Apart from that, very nice article! I've been trying to optimise SSE code for the Core processor and have had to do things by trial and error thanks to the complete void of any decent documentation from Intel. One thing in particular was that I was finding "odd" performance properties with SSE that pointed towards it having two FMUL units. Being symmetric units explains a lot!
    Reply
  • JarredWalton - Monday, May 01, 2006 - link

    See above note regarding Core Duo versus Core "Conroe". (Nice naming scheme, Intel. *grumble*) I will let Johan take care of the rest of your comment as appropriate. (His knowledge of the low level details of all of the microarchitectures discussed here definitely surpasses mine!)

    Unfortunately, it's not particularly surprising to find out that optimal code for Core Duo may need to be slightly tweaked in order to extract the most performance from Conroe. Still, they ought to be similar enough that you own by optimizing for Core Duo. The flipside is the optimal code for Conroe could very likely run worse on Core Duo and other processors. Such is the price of progress, I guess.
    Reply
  • prx99 - Monday, May 01, 2006 - link

    Core is not the first x86 having 4 decoders. That was AMDs K5.

    I remember a statement from AMD that in some design they considered adding one more decoder. It turned out to actually slow down the design because the amount of clock speed lost was not compensated for by the smaller amount of performance gained.

    In my interpretation the fusion is done past the initial decoding, so there is not way more that 4 x86 ops can be decoded in a clock cycle (I'm referring to the "4+1" figure). The profit from fusion is not in the decoding stage but in the out of order engine.

    At AMDs, the "1 branch per cycle" rule is limited to branches seen by the predictor. A branch which is generally not taken is invisible to the prediction engine and therefore free.

    The original P4 indeed had a L1 latency of 2. The major P4 redesign in Prescott however increased it to 3.

    Load/store reordering is already done by the P4, but the penalty from a misprediction is fairly high. This is the drawback of any kind of prediction, whether branches or memory access: It speeds up things when being correct, but slows them down quite a bit more when not. This was the general picture seen in the P4: many applications were sped up by some amount, but some suffered greatly because they systematically fooled the P4's engines.

    Gruss, Andreas
    Reply
  • Betwon - Wednesday, May 03, 2006 - link

    Without branch prediction, K8 will become very very poor. Too terrible!

    The prediction is much better than the forever penalty.

    The penalty of disprediction is just the penalty of doing nothing.(don't predict)

    The penalty is fairly high. If you are against the prediction, you will find that the penalty will happen in K8 every 3 instructions averagely. K8@1.8G(without branch predictor ) will fail to win the old Pentium3@1G(with branch predictor ).

    This is the drawback of lack of prediction, whether branches or memory access: It can not speeds up anything, but often slows down.
    Without branch prediction, K8 will be down!
    Reply
  • Betwon - Wednesday, May 03, 2006 - link

    It is very interesting that the P4's Load/store/Memory reordering method, which is very different with Core's.

    For P4, it always assumes that all load-ops can hit and find the load data from the store buffer or L1 data cache.
    Before one load-op is executed, it has to obtain the load address and all prior-store address and compare with them. If it is found that the load address is equal to one prior-store address, the load-op will assume that the store data is in the store buffer and the data has been ready and vaild, then start to execute speculatively.
    If the address-euqal is not found, the load-op will assume that the load data is in L1 data cache, and the data is ready and vaild, then start to execute speculatively.

    If the speculation fail or the miss happen, the speculative load-op and the relative speculative micro-ops have to be reexecuted -- it is called as 'replay'.

    The load-op can be executed speculatively, after it knew it's load address and compared the load address with the all prior-store address.
    The load-op can not be executed speculatively before it knew it's load address and compared the load address with the all prior-store address.

    The load-op speculates whether the load data is ready and vaild, but not speculate whether there is the true dependency with prior-store.

    But Core can speculate whether there is the true dependency with prior-store. Core has the smart predictor which can predict the store-to-load dependency precisely, before the load-op address is compared with the prior-store address.
    Reply
  • Betwon - Wednesday, May 03, 2006 - link

    If you really want to know what is the Intel's load reordering and memory misambiguation, I can tell you the facts:

    http://www.stanford.edu/~merez/papers/LoadSched_IS...">http://www.stanford.edu/~merez/papers/LoadSched_IS...
    Speculation Techniques for Improving Load Related Instruction Scheduling 1999
    Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- From Intel's Haifa, they designed the Load/Store Unit of Core.

    I had said that anandtech should study many things about CPU. Of course, I should study more things about CPU.
    Reply
  • Betwon - Wednesday, May 03, 2006 - link

    sub ebp,ebp
    mov ecx, 1000000000

    B1:
    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov eax,[ebx]
    sub esi,1
    sub edi,1
    cmp ecx,ebp
    je B2

    mov edx,[ebx]
    sub ecx,1
    sub edi,1
    cmp ebp,ebp
    je B1

    B2:

    If the asm codes take 6000000000 cycles --> up to five x86 instructions at a time.
    It is so easy to verify.

    we can not call K5 -- 4 decoders, because it is too immature.
    Reply
  • emboss - Monday, May 01, 2006 - link

    I'm not even sure the Core architecture has 4 decoders. There's lots of references in the Intel Optimisation manual to say that there's still only three (two simple + one complex):

    "On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result the front end can process up to three packed SSE instructions every cycle." (page 1-32)

    "Improvement in decoder and micro-op fusion allows the front end to see most instructions as single µop instructions. This increases the throughput of the three decoders in the front end." (page 1-31)

    While it certainly wouldn't be the first time Intel manuals have been wrong, they're usually reasonably accurate.

    Also from the optimisation manual, it implies that the front end/decoder doing the fusion (for example, see the second quote above).
    Reply
  • JarredWalton - Monday, May 01, 2006 - link

    Not sure if you're referring to Core Solo/Duo manuals or to Core "Conroe/Merom" manuals. The article is covering the *next* Core architecture, so I wouldn't be at all surprised if Core Duo only has 3 decoders while Conroe bumps that to 4. Reply
  • emboss - Monday, May 01, 2006 - link

    Oops, yes, my mistake. I was referring to Solo/Duo. Damn those marketers :)

    This still leaves me puzzled over the unexpected SSE performance on Solo/Duo. Thinking about it a bit more, the performance would have been 4x "expected" (single uop SSE with two FADD units vs double uop SSE with only one FADD unit), whereas I was only getting a bit less than double. Gnah, back to emperical optimisation.
    Reply
  • Furen - Monday, May 01, 2006 - link

    Yes, Yonah only has 3 decoders (and the same port arrangement as Dothan, too). Reply
  • Loki726 - Monday, May 01, 2006 - link

    Great job Johan!

    Its articles like this that keep anandtech head and shoulders above everyone else. Instead of just running the latest and greatest core you get through the same old benchmarks and throwing some pretty comparison graphs at the reader, you actually take the time to figure out what parts of the architecture contribute to the performance you see in benchmarks. Keep it up!

    On a small side note, on your first figure of intel's core architecture on page 4, I think the cache size should be 4096kb. 4gb seems rather large...
    Reply
  • Goi - Monday, May 01, 2006 - link

    Nice read. Did you get all your information solely from Jack Doweck, or are there papers outlining the Core architecture. I've read those for the Pentium-M and Netburst architecture(as well as several other architectures) but I haven't seen one of the Core yet. Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    Thanks.

    The current Core Papers are pretty poor IMHO.

    Best sources of info:
    - Jack Doweck
    - IDF's presentations
    - David Kanter's article at RWT (going to add that link in the references)

    Reply
  • Orbs - Monday, May 01, 2006 - link

    I never studied chip design but even with my limited assembly knowledge I was able to follow this article. Very technical, very informative, and helps explain some of the insane benchmark results from Conroe's preview earlier this year.

    It will be interesting to see how Conroe/Merom perform when finalized, how they compare to final AM2 CPUs (especially higher clocked AM2 CPUs) and what AMD counters with in '07.
    Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    "I never studied chip design but even with my limited assembly knowledge I was able to follow this article"

    Very happy to read that.

    At this point there is not enough info on what AMD has in store, so I left that out. However, better load reordering (besides the announced extra FP power) seems a minimum for the next AMD micro-arch.



    Reply
  • at80eighty - Monday, May 01, 2006 - link

    Been a while since we've seen you around Johan - nice read btw Reply
  • Avalon - Monday, May 01, 2006 - link

    Wow, very nice read. Thanks a ton! Reply
  • owned66 - Monday, February 25, 2013 - link

    soo many AMD fan boys back then
    here is 2013 they dont exist anymore
    Reply

Log in

Don't have an account? Sign up now