Introduction

Wide Dynamic Execution, Advanced Digital Media Boost, Smart Memory Access and Advanced Smart Cache; those are the technologies that according to the marketing people at Intel enable Intel to build the high performance, low energy CPUs using the new Core architecture.

Of course, as an AnandTech Reader, you couldn't care less about which Hyper Super Advanced Label the marketing folks glue on their CPUs. "Extend the digital lifestyle by combining robust performance with low power consumption" could have been another marketing claim for the new Core architecture, but VIA already cornered that sentence for its C7 CPUs. The marketing slogans for Intel's Core and VIA's C7 are almost the same; the architectures are however vastly different.

No, let us find out what is really behind all this marketing hyper-talk, and preferably compare it with the AMD "K8" (Athlon 64, Opteron) architecture of Intel's NetBurst and Pentium M processors. That is what this article is all about. We talked to Jack Doweck, the engineer who designed the completely new Memory Reorder Buffer and Memory disambiguation system. Jack Doweck is one of the Intel Israel Development Center (IDC) architects.

The Intel "P8"

Intel marketing states that Core is a blend of P-M techniques and NetBurst architecture. However, Core is clearly a descendant of the Pentium Pro, or the P6 architecture. It is very hard to find anything "Pentium 4" or "NetBurst" in the Core architecture. While talking to Jack Doweck, it became clear that only the prefetching was inspired by experiences with the Pentium 4. Everything else is an evolution of "Yonah" (Core Duo), which was itself an improvement of Dothan and Banias. Those CPUs inherited the bus of the Pentium 4, but are still clearly children of the hugely successful P6 architecture. In a sense, you could call Core the "P8" architecture, with Banias/Dothan being based on the "P7" architecture. (Note that the architecture of Banias/Dothan was never given an official name, so we will refer to it as "P-M" for simplicity's sake.)

Of course this doesn't mean that Intel's engineers just bolted a few functional units and a few decoders on Yonah and called it a day. Jack told us that Woodcrest/Conroe/Merom are indeed based on Yonah, but that almost 80% of both the architecture and circuit design had to be redone.

CPU architecture in a nutshell

For those of you who are not so familiar with CPUs, we'll start with a crash course in CPU architectures. To understand CPU design, you must first look at the instructions that are sent to the CPU, and thus we start with the software.

Typical x86 software code consists of about 50% stores and loads, and there are about twice as many loads as there are stores. Of the remainder, about 15 to 20% of the instructions are branches (If, Then, Else), and the rest are mostly "ADD" (addition) and "MUL" (multiply) instructions. Only a very small percentage of code consists of more exotic instructions such as DIV (divisions), SQRT (square root), or other higher order math (e.g. trigonometric functions).

All these instructions are processed in a typical "Von Neuman" pipeline: Fetch, Decode, Operand Fetch, Execute, Retire.

Instructions are fetched based on the instruction pointer register, and initially they are nothing but long bit patterns to the CPU. It's only after the CPU starts decoding the bits that the instructions "start to make sense" to the CPU. Addresses and opcodes are decoded out of the instructions, and the addresses are used for the next step: the operand fetch. As you don't want the CPU to perform calculations with the addresses but rather on the content of these addresses - the "operands" - the CPU has to fetch the right data out of the data cache. Once these operands are put in the registers, the ALU is steered by the "opcode" (which has been decoded) to perform the right calculation on the operands in the registers.

The results are written to the architecture register file, the registers which can be used by the compiler. The results must also be written to the caches and the main memory, so that these are also up to date. That is the final phase, the retire phase. That is the basically how processing works in all CPUs.

The main challenge for the CPU designer today is the average memory latency the CPU sees. A Pentium 4 3.6 GHz with DDR-400 runs no less than 18 times faster than the base clock of the RAM (200 MHz). Every cycle the memory is being accessed, a minimum of 18 cycles pass on the CPU. At the same time, it takes several cycles to even send a request, and it takes a few cycles to send a request back. (We discussed this in the past in our overview of memory technology article.) The result is that wait times of 200 to 300 cycles are not uncommon on the Pentium 4. The goal of CPU cache is to avoid accessing RAM, but even if the CPU only has to go to system memory 4% of the time, that 4% of the time can lower performance significantly.

Memory Subsystem
POST A COMMENT

85 Comments

View All Comments

  • GeeZee - Friday, May 05, 2006 - link

    Even with all the new technologies put into the new "Core" architecture, I think Intel will have a very tough time putting the nails in the coffin of the Athlon/Opteron.

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.

    As for the future....AMD has a tremendous amount of companies that are working with them to produce the next gen chips. IBM, Sony, Transmeta, Nvidia, Cray. Pretty much all the Mobo/Chipset manufacturers are much more frendly with AMD than intel.

    I wouldn't count out AMD untill their next gen CPU's flop....and I don't think it will. Imagine AMD with access to the code morphing software & Transmeta's vliw chip as a co processor & Via's encryption core & HT 3.0. All working flawlessly due to the new memory modes introduced on AM2. Add onto that Transmeta's manufacturing patents would cut power by 50%.

    Via gets Royalties on each chip, Transmeta gets access to AMD core technolgies. Everyone wins.

    AMD really surprised Intel with the Athlon. And I think they have somthing up their sleeve after the AM2.
    Reply
  • IntelUser2000 - Friday, May 05, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Beat it?? Blow it away?? Have you seen the benchmarks of quad cores to know the reality?? Its the other way around. But when comparing against "Core" Duo that's different... Otherwise you are saying nonsense.
    Reply
  • GeeZee - Sunday, May 07, 2006 - link

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.



    LOL. Anyone with ANY common sense should realize that the guy doesn't know what he is talking about. He claims Yonah uses 50W!!! Who's a fanboy here...

    And let me explain those clovertown scores.

    #1. Possibly not a good benchmark for looking at average performance:
    Take a look at Cinebench scores. You'll see that Pentium Extreme Edition 840 will outperform Pentium D 840 by over 15%!!! Now where do you see benchmark scores which shows the Pentium EE's outperforming Pentium D's by 15%?? That's right, MOST OF THE TIMES, IT DOESN'T!!! Pentium D's can outperform Pentium EE's lots of times.

    #2. The author's mind-boggling flawed logic on Clovertown's score:

    He claims that the reason Clovertown scales only 4.85 by using 8 cores is because its bandwidth starved. http://www.digitalvideoediting.com/articles/viewar...">http://www.digitalvideoediting.com/articles/viewar...

    Ah what do you see?? Opteron only scales 4.85x too!!!

    So what's the opinion on the blog?? HE'S A BLINDED FANBOY!!

    Stop posting in forums and use your useless brain on something else.

    Why people make up these stupid blogs though?? They are afraid to admit that Intel can actually do make something GOOD.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Pffft. Where do you see that?? Care to reveal those benchmarks?? Still in denial after looking at what Core Duo can do??
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    There are 3 main things people argue about when doubting Conroe.

    1. IDF system's scores are wrong because Intel could have modified the benchmarks.
    2. The K7/K8 decoders can all do complex instruction decoding which is better than Core
    3. The apps that doesn't fit in 4MB cache will perform slow.




    My response:
    1. ANANDTECH has shown that AFTER using THEIR OWN Quake 4 benchmark, the discrepancy between Conroe and OC'ed FX-60 INCREASED, indicating Intel's benchmarks are RATHER conservative.
    2. First, the two decoders(K7 and Core) can't be compared directly. While it was TRUE that K7 had superior decoder capability compared to P6, its different with Core, because more of the instructions that used to go to the complex decoder on the P6 now goes to the simple decoders in Core.
    3. The doubled AND lowered latency L2 cache on the Northwood gave 6-11%(Avg. 8.5%) gain in games. Doubled L2 cache on Barton gave 4-8%(6%) increase. Difference between Athlon 64 3000+(2.0GHz 512KB L2 single channel S754) and 3200+(1MB cache version) is 2.2-8%(5.1%).

    Caches doesn't do much. People seem to be somehow expecting 20% difference on the cache alone.
    Reply
  • Accord99 - Monday, May 08, 2006 - link

    Those scores beat a 4 single-core or a 2 dual-core Opteron system. Reply
  • clairvoyant129 - Sunday, May 07, 2006 - link

    How ironic you post that website in response to the above user (also calling him a fanboy) when it's a known fact that the author of the site manipulates information to favor AMD. Why don't you think a little next time? Reply
  • theteamaqua - Friday, May 05, 2006 - link

    im glad that intel is back on track, if they keep falling behind AMD, AMD is gonna jack up the price, intel jsut slash its cpu as much as 50%, the Pentium D 950, my mobo wont support conroe so ill jsut have to get the 960 when conroe launches,

    but what interest me most is the quad-coare thats coming Q1 next year, hopefully the performance can be as close to 200% of a dual-core counter-part running at the same speed
    Reply
  • thestain - Friday, May 05, 2006 - link

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.

    But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?

    Mike
    Reply

Log in

Don't have an account? Sign up now