Intel Core versus AMD's K8 architecture

Name: Intel Core versus AMD's K8 architecture
Item: Intel Core versus AMD's K8 architecture
Author: Johan De Gelas

by Johan De Gelas on May 1, 2006 4:00 AM EST

Posted in
CPUs

87 Comments | Add A Comment

87 Comments

Smarter Decoding

Similar to the K8 architecture, Core pre-decodes instructions that are fetched. Pre-decode information includes instruction length and decode boundaries.

A first for the x86 world, the Core architecture is equipped with four x86 decoders, 3 simple decoders and 1 complex decoder. The task of the decoders - for all current x86 CPUs - is not only to decipher the incoming instruction (opcode, addresses), but also to translate the 1 to 15 byte variable length x86 instructions into - easier to schedule and execute - fixed length RISC-like instructions (called micro-ops).

The most common x86 instructions are translated into a single micro-op by the 3 simple decoders. The complex decoder is responsible for the instructions that produce up to 4 micro-ops. The really long and complex x86 instructions are handled by a microcode sequencer. This way of handling the complex most CISC-y instructions has been adopted by all modern x86 CPU designs, including the P6, Athlon (XP and 64), and Pentium 4.

There is still more to the Core decoders. The first clever technique is macro-op fusion. It makes it possible for two relatively common x86 instructions to be fused into a single instruction. For example, the x86 compare instruction (CMP) is fused with a jump (JNE TARG). These instructions are typically the assembler result of a compiled if-then-else statement.

The result is that on average in a typical x86 program, for every 10 instruction, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.

The second clever technique already exists in the current P-M CPUs. There are a few x86 instructions which are pretty complex to perform, but which are at the same time a very typical and common x86 instruction. We are talking for example about mathematical operations where an address is referenced instead of a register. One common example is ADD [mem], EAX . This means add the content of register EAX to the content of a certain memory location (i.e. store the result back at the memory address). Store instructions which get broken down into store address and store data are another example.

In earlier designs such as the P6 (Pentium Pro, PII, PIII) architecture, these instruction would have been broken up into two or even three micro-ops. Remember that the whole philosophy behind all modern x86 CPUs, since the P6, is to decode x86 instructions into RISC-y micro-ops which are then fed to a fast RISC backend; the backend then schedules, issues, executes and retires the instructions in a smooth RISC way.

There is no way you could feed such an instruction (ADD [mem], EAX) to RISC execution units. It violates every RISC rule. RISC designs all load their data into the registers and then perform the necessary calculation on the registers.

So ADD [mem], EAX is broken down into:

Load the contents of [mem] into a register (MOV EBX, [mem])
An ALU operation, ADD the two registers together (ADD EBX, EAX)
Store the result back to memory (MOV [mem], EBX)

Since Banias, the ALU and the Load operation are kept together in one micro-op. This is called micro-op fusion. This is no small feat: in older designs keeping the load and ALU operation together would result in pipeline stages that take much longer and thus lower the maximum clock frequency. (In CPU designs, the maximum clock speed is essentially determined by the slowest possible pipeline stage execution time.) Only by using bigger, smarter circuitry that can do a lot in parallel is micro-op fusion possible without lowering the clock speed significantly.

The pre-decode stage recognizes the macro-ops (or x86) instructions that should be kept together. In the decoding phase, ADD [mem], EAX results in one micro-op. Again, this means that the CPU can stuff more instructions in the same OoO buffers, increasing efficiency and improving performance.

Core versus Hammer: Decoding

All very nice, but let us take a look at what really matters: How do the 3 simple + 1 complex decoders of Core compare to the 3 complex decoders of AMD's K8 architecture?

The original Athlon ("K7") has two way of decoding, Vector and Direct Path. The Vector Path decoding results in more than two RISC-like instructions (called "macro-ops" by AMD), the Direct Path in one, sometimes two macro-ops. Each of the decoders in K7 can handle both Vector Path and Direct Path decoding, but from a performance standpoint Direct Path is preferred since it results in fewer macro-ops. If you're wondering why were discussing K7 all of a sudden, just as Core is largely based off the P6 architecture, K8 is largely based off the K7 architecture.

The 3 complex decoders are powerful and can decode most x86 instructions, with few instructions requiring the Vector Path. The only downside of the K7 decoders is that some FP instructions and SSE instructions have to pass through the Vector Path. K8 has even stronger complex decoders and almost all FP and SSE instructions are also now decoded through the Direct Path decoders. This is possible as fetching and decoding takes more stages than it did in the K7; the K8 architecture is clearly more powerful when it comes to SIMD.

Obviously, Intel's Macro-op ( x86 instruction ) fusion does not exist in AMD's K8. However, micro-op fusion is available in another form. If we compare Intel's and AMD's macro-ops and micro-ops, it is easy to get confused. Take a look at the table below which explains the differences.

Micro-op fusion does exist in the Athlon. An ADD [mem], EAX is kept together in one macro-op as it travels through the pipeline. Therefore it will take only one place in the OoO buffers. However, the load and execute SSE/SSE2 operations can be fused on Core, while this is not the case on K8: packed SSE operations result in two macro-ops.

So how do Intel's Core and AMD's Hammer compare when it comes to decoding? It is hard to say at the moment without access to Intel's optimization manuals. However, we can get a pretty good idea. In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3.

The situation where AMD's 3 complex decoders can outperform Core's 1 complex + 3 simple decoders is much less likely to happen. It would happen when 3 instructions would be fetched that would have to be handled by the complex decoder of the Core CPU, but which are not too complex that the Microcode Sequencer must kick in. Since the most used x86 instructions all map to one Intel micro-op, this is pretty unlikely.

Memory Subsystem Out of Order Execution

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

87 Comments

View All Comments

Betwon - Wednesday, May 3, 2006 - link
Without branch prediction, K8 will become very very poor. Too terrible!

The prediction is much better than the forever penalty.

The penalty of disprediction is just the penalty of doing nothing.(don't predict)

The penalty is fairly high. If you are against the prediction, you will find that the penalty will happen in K8 every 3 instructions averagely. K8@1.8G(without branch predictor ) will fail to win the old Pentium3@1G(with branch predictor ).

This is the drawback of lack of prediction, whether branches or memory access: It can not speeds up anything, but often slows down.
Without branch prediction, K8 will be down!
Betwon - Wednesday, May 3, 2006 - link
It is very interesting that the P4's Load/store/Memory reordering method, which is very different with Core's.

For P4, it always assumes that all load-ops can hit and find the load data from the store buffer or L1 data cache.
Before one load-op is executed, it has to obtain the load address and all prior-store address and compare with them. If it is found that the load address is equal to one prior-store address, the load-op will assume that the store data is in the store buffer and the data has been ready and vaild, then start to execute speculatively.
If the address-euqal is not found, the load-op will assume that the load data is in L1 data cache, and the data is ready and vaild, then start to execute speculatively.

If the speculation fail or the miss happen, the speculative load-op and the relative speculative micro-ops have to be reexecuted -- it is called as 'replay'.

The load-op can be executed speculatively, after it knew it's load address and compared the load address with the all prior-store address.
The load-op can not be executed speculatively before it knew it's load address and compared the load address with the all prior-store address.

The load-op speculates whether the load data is ready and vaild, but not speculate whether there is the true dependency with prior-store.

But Core can speculate whether there is the true dependency with prior-store. Core has the smart predictor which can predict the store-to-load dependency precisely, before the load-op address is compared with the prior-store address.
Betwon - Wednesday, May 3, 2006 - link
If you really want to know what is the Intel's load reordering and memory misambiguation, I can tell you the facts:

http://www.stanford.edu/~merez/papers/LoadSched_IS...">http://www.stanford.edu/~merez/papers/LoadSched_IS...
Speculation Techniques for Improving Load Related Instruction Scheduling 1999
Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- From Intel's Haifa, they designed the Load/Store Unit of Core.

I had said that anandtech should study many things about CPU. Of course, I should study more things about CPU.
Betwon - Wednesday, May 3, 2006 - link
sub ebp,ebp
mov ecx, 1000000000

B1:
mov eax,[ebx]
sub esi,1
sub edi,1
cmp ecx,ebp
je B2

mov edx,[ebx]
sub esi,1
sub edi,1
cmp ecx,ebp
je B2

mov eax,[ebx]
sub esi,1
sub edi,1
cmp ecx,ebp
je B2

mov edx,[ebx]
sub esi,1
sub edi,1
cmp ecx,ebp
je B2

mov eax,[ebx]
sub esi,1
sub edi,1
cmp ecx,ebp
je B2

mov edx,[ebx]
sub ecx,1
sub edi,1
cmp ebp,ebp
je B1

B2:

If the asm codes take 6000000000 cycles --> up to five x86 instructions at a time.
It is so easy to verify.

we can not call K5 -- 4 decoders, because it is too immature.
emboss - Monday, May 1, 2006 - link
I'm not even sure the Core architecture has 4 decoders. There's lots of references in the Intel Optimisation manual to say that there's still only three (two simple + one complex):

"On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result the front end can process up to three packed SSE instructions every cycle." (page 1-32)

"Improvement in decoder and micro-op fusion allows the front end to see most instructions as single µop instructions. This increases the throughput of the three decoders in the front end." (page 1-31)

While it certainly wouldn't be the first time Intel manuals have been wrong, they're usually reasonably accurate.

Also from the optimisation manual, it implies that the front end/decoder doing the fusion (for example, see the second quote above).
JarredWalton - Monday, May 1, 2006 - link
Not sure if you're referring to Core Solo/Duo manuals or to Core "Conroe/Merom" manuals. The article is covering the *next* Core architecture, so I wouldn't be at all surprised if Core Duo only has 3 decoders while Conroe bumps that to 4.
emboss - Monday, May 1, 2006 - link
Oops, yes, my mistake. I was referring to Solo/Duo. Damn those marketers :)

This still leaves me puzzled over the unexpected SSE performance on Solo/Duo. Thinking about it a bit more, the performance would have been 4x "expected" (single uop SSE with two FADD units vs double uop SSE with only one FADD unit), whereas I was only getting a bit less than double. Gnah, back to emperical optimisation.
Furen - Monday, May 1, 2006 - link
Yes, Yonah only has 3 decoders (and the same port arrangement as Dothan, too).
Loki726 - Monday, May 1, 2006 - link
Great job Johan!

Its articles like this that keep anandtech head and shoulders above everyone else. Instead of just running the latest and greatest core you get through the same old benchmarks and throwing some pretty comparison graphs at the reader, you actually take the time to figure out what parts of the architecture contribute to the performance you see in benchmarks. Keep it up!

On a small side note, on your first figure of intel's core architecture on page 4, I think the cache size should be 4096kb. 4gb seems rather large...
Goi - Monday, May 1, 2006 - link
Nice read. Did you get all your information solely from Jack Doweck, or are there papers outlining the Core architecture. I've read those for the Pentium-M and Netburst architecture(as well as several other architectures) but I haven't seen one of the Core yet.

Intel Core versus AMD's K8 architecture

Smarter Decoding

Core versus Hammer: Decoding

Post Your Comment

87 Comments

View All Comments

Betwon - Wednesday, May 3, 2006 - link

Betwon - Wednesday, May 3, 2006 - link

Betwon - Wednesday, May 3, 2006 - link

Betwon - Wednesday, May 3, 2006 - link

emboss - Monday, May 1, 2006 - link

JarredWalton - Monday, May 1, 2006 - link

emboss - Monday, May 1, 2006 - link

Furen - Monday, May 1, 2006 - link

Loki726 - Monday, May 1, 2006 - link

Goi - Monday, May 1, 2006 - link

Log in

Don't have an account? Sign up now