Motherboards Memory Storage Cases/Cooling/PSUs IT Computing Displays Mobile Mac CPUs & Chipsets Video Digital Cameras Linux Gadgets Systems Trade Shows Guides Home Increase Font Size Decrease Font Size Change Page Size
Intel Core versus AMD's K8 architecture
Intel Core versus AMD's K8 architecture
Date: May 1st, 2006
Topic: CPU & Chipset
Manufacturer: Intel
Author: Johan De Gelas
Buy the Gigabyte GA-G41M-ES2L Motherboard
Blank
 Newegg $64.99
 TigerDirect $59.99
 Buy.com $60.23
 
 

Introduction

Wide Dynamic Execution, Advanced Digital Media Boost, Smart Memory Access and Advanced Smart Cache; those are the technologies that according to the marketing people at Intel enable Intel to build the high performance, low energy CPUs using the new Core architecture.

Of course, as an AnandTech Reader, you couldn't care less about which Hyper Super Advanced Label the marketing folks glue on their CPUs. "Extend the digital lifestyle by combining robust performance with low power consumption" could have been another marketing claim for the new Core architecture, but VIA already cornered that sentence for its C7 CPUs. The marketing slogans for Intel's Core and VIA's C7 are almost the same; the architectures are however vastly different.

No, let us find out what is really behind all this marketing hyper-talk, and preferably compare it with the AMD "K8" (Athlon 64, Opteron) architecture of Intel's NetBurst and Pentium M processors. That is what this article is all about. We talked to Jack Doweck, the engineer who designed the completely new Memory Reorder Buffer and Memory disambiguation system. Jack Doweck is one of the Intel Israel Development Center (IDC) architects.

The Intel "P8"

Intel marketing states that Core is a blend of P-M techniques and NetBurst architecture. However, Core is clearly a descendant of the Pentium Pro, or the P6 architecture. It is very hard to find anything "Pentium 4" or "NetBurst" in the Core architecture. While talking to Jack Doweck, it became clear that only the prefetching was inspired by experiences with the Pentium 4. Everything else is an evolution of "Yonah" (Core Duo), which was itself an improvement of Dothan and Banias. Those CPUs inherited the bus of the Pentium 4, but are still clearly children of the hugely successful P6 architecture. In a sense, you could call Core the "P8" architecture, with Banias/Dothan being based on the "P7" architecture. (Note that the architecture of Banias/Dothan was never given an official name, so we will refer to it as "P-M" for simplicity's sake.)

Of course this doesn't mean that Intel's engineers just bolted a few functional units and a few decoders on Yonah and called it a day. Jack told us that Woodcrest/Conroe/Merom are indeed based on Yonah, but that almost 80% of both the architecture and circuit design had to be redone.

CPU architecture in a nutshell

For those of you who are not so familiar with CPUs, we'll start with a crash course in CPU architectures. To understand CPU design, you must first look at the instructions that are sent to the CPU, and thus we start with the software.

Typical x86 software code consists of about 50% stores and loads, and there are about twice as many loads as there are stores. Of the remainder, about 15 to 20% of the instructions are branches (If, Then, Else), and the rest are mostly "ADD" (addition) and "MUL" (multiply) instructions. Only a very small percentage of code consists of more exotic instructions such as DIV (divisions), SQRT (square root), or other higher order math (e.g. trigonometric functions).

All these instructions are processed in a typical "Von Neuman" pipeline: Fetch, Decode, Operand Fetch, Execute, Retire.

Instructions are fetched based on the instruction pointer register, and initially they are nothing but long bit patterns to the CPU. It's only after the CPU starts decoding the bits that the instructions "start to make sense" to the CPU. Addresses and opcodes are decoded out of the instructions, and the addresses are used for the next step: the operand fetch. As you don't want the CPU to perform calculations with the addresses but rather on the content of these addresses - the "operands" - the CPU has to fetch the right data out of the data cache. Once these operands are put in the registers, the ALU is steered by the "opcode" (which has been decoded) to perform the right calculation on the operands in the registers.

The results are written to the architecture register file, the registers which can be used by the compiler. The results must also be written to the caches and the main memory, so that these are also up to date. That is the final phase, the retire phase. That is the basically how processing works in all CPUs.

The main challenge for the CPU designer today is the average memory latency the CPU sees. A Pentium 4 3.6 GHz with DDR-400 runs no less than 18 times faster than the base clock of the RAM (200 MHz). Every cycle the memory is being accessed, a minimum of 18 cycles pass on the CPU. At the same time, it takes several cycles to even send a request, and it takes a few cycles to send a request back. (We discussed this in the past in our overview of memory technology article.) The result is that wait times of 200 to 300 cycles are not uncommon on the Pentium 4. The goal of CPU cache is to avoid accessing RAM, but even if the CPU only has to go to system memory 4% of the time, that 4% of the time can lower performance significantly.

Memory Subsystem   Next Page

 
  Index

Tools Share
Find lowest prices Find the lowest prices
Digg   del.icio.us   E-mail  
Print This Article Print this article  

84 Comments - Last by Regs, 1343 days ago
Username:
Password:
by Avalon, 1380 days ago
Wow, very nice read. Thanks a ton!

Reply
agreed by at80eighty, 1380 days ago
Been a while since we've seen you around Johan - nice read btw

Reply
Very technical, great article! What about AMD's Future :)? by Orbs, 1380 days ago
I never studied chip design but even with my limited assembly knowledge I was able to follow this article. Very technical, very informative, and helps explain some of the insane benchmark results from Conroe's preview earlier this year.

It will be interesting to see how Conroe/Merom perform when finalized, how they compare to final AM2 CPUs (especially higher clocked AM2 CPUs) and what AMD counters with in '07.

Reply
RE: Very technical, great article! What about AMD's Future :)? by JohanAnandtech, 1380 days ago
"I never studied chip design but even with my limited assembly knowledge I was able to follow this article"

Very happy to read that.

At this point there is not enough info on what AMD has in store, so I left that out. However, better load reordering (besides the announced extra FP power) seems a minimum for the next AMD micro-arch.





Reply
Core paper? by Goi, 1380 days ago
Nice read. Did you get all your information solely from Jack Doweck, or are there papers outlining the Core architecture. I've read those for the Pentium-M and Netburst architecture(as well as several other architectures) but I haven't seen one of the Core yet.

Reply
RE: Core paper? by JohanAnandtech, 1380 days ago
Thanks.

The current Core Papers are pretty poor IMHO.

Best sources of info:
- Jack Doweck
- IDF's presentations
- David Kanter's article at RWT (going to add that link in the references)



Reply
Awesome atricle by Loki726, 1380 days ago
Great job Johan!

Its articles like this that keep anandtech head and shoulders above everyone else. Instead of just running the latest and greatest core you get through the same old benchmarks and throwing some pretty comparison graphs at the reader, you actually take the time to figure out what parts of the architecture contribute to the performance you see in benchmarks. Keep it up!

On a small side note, on your first figure of intel's core architecture on page 4, I think the cache size should be 4096kb. 4gb seems rather large...

Reply
some nitpicking by prx99, 1380 days ago
Core is not the first x86 having 4 decoders. That was AMDs K5.

I remember a statement from AMD that in some design they considered adding one more decoder. It turned out to actually slow down the design because the amount of clock speed lost was not compensated for by the smaller amount of performance gained.

In my interpretation the fusion is done past the initial decoding, so there is not way more that 4 x86 ops can be decoded in a clock cycle (I'm referring to the "4+1" figure). The profit from fusion is not in the decoding stage but in the out of order engine.

At AMDs, the "1 branch per cycle" rule is limited to branches seen by the predictor. A branch which is generally not taken is invisible to the prediction engine and therefore free.

The original P4 indeed had a L1 latency of 2. The major P4 redesign in Prescott however increased it to 3.

Load/store reordering is already done by the P4, but the penalty from a misprediction is fairly high. This is the drawback of any kind of prediction, whether branches or memory access: It speeds up things when being correct, but slows them down quite a bit more when not. This was the general picture seen in the P4: many applications were sped up by some amount, but some suffered greatly because they systematically fooled the P4's engines.

Gruss, Andreas


Reply
RE: some nitpicking by emboss, 1380 days ago
I'm not even sure the Core architecture has 4 decoders. There's lots of references in the Intel Optimisation manual to say that there's still only three (two simple + one complex):

"On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result the front end can process up to three packed SSE instructions every cycle." (page 1-32)

"Improvement in decoder and micro-op fusion allows the front end to see most instructions as single µop instructions. This increases the throughput of the three decoders in the front end." (page 1-31)

While it certainly wouldn't be the first time Intel manuals have been wrong, they're usually reasonably accurate.

Also from the optimisation manual, it implies that the front end/decoder doing the fusion (for example, see the second quote above).

Reply
RE: some nitpicking by JarredWalton, 1380 days ago
Not sure if you're referring to Core Solo/Duo manuals or to Core "Conroe/Merom" manuals. The article is covering the *next* Core architecture, so I wouldn't be at all surprised if Core Duo only has 3 decoders while Conroe bumps that to 4.

Reply
Comments Page 1 of 9

Unlicensed Software at Your Last Company
Anonymously Report Unlicensed Software with Our Form Now. Get Up to $1 Million.
We Buy Laptop and PC Memory! Sell to Us!
Min of 25 pieces required. Call us today at 239.354.1230.
Special Offer from The Economist
Get 12 issues of The Economist for $12. US subscribers only.
Free Forrester Risk Management Report
Demystifying Enterprise Risk Management. Download Free With Registration.
Download Microsoft Visual Studio ® Team System
Streamline Dev processes, Reduce time to market. Try Microsoft Visual Studio Team System, FREE!




Latest news by
DailyTech

 February 9, 2010

Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank
Blank

 February 8, 2010

Blank


more CPU & Chipset Discussions



pipeboost
Copyright © 1997-2010 AnandTech, Inc. All rights reserved. Terms, Conditions and Privacy Information.
Click Here for Advertising Information