Smarter Decoding

Similar to the K8 architecture, Core pre-decodes instructions that are fetched. Pre-decode information includes instruction length and decode boundaries.

A first for the x86 world, the Core architecture is equipped with four x86 decoders, 3 simple decoders and 1 complex decoder. The task of the decoders - for all current x86 CPUs - is not only to decipher the incoming instruction (opcode, addresses), but also to translate the 1 to 15 byte variable length x86 instructions into - easier to schedule and execute - fixed length RISC-like instructions (called micro-ops).

The most common x86 instructions are translated into a single micro-op by the 3 simple decoders. The complex decoder is responsible for the instructions that produce up to 4 micro-ops. The really long and complex x86 instructions are handled by a microcode sequencer. This way of handling the complex most CISC-y instructions has been adopted by all modern x86 CPU designs, including the P6, Athlon (XP and 64), and Pentium 4.

There is still more to the Core decoders. The first clever technique is macro-op fusion. It makes it possible for two relatively common x86 instructions to be fused into a single instruction. For example, the x86 compare instruction (CMP) is fused with a jump (JNE TARG). These instructions are typically the assembler result of a compiled if-then-else statement.


The result is that on average in a typical x86 program, for every 10 instruction, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.

The second clever technique already exists in the current P-M CPUs. There are a few x86 instructions which are pretty complex to perform, but which are at the same time a very typical and common x86 instruction. We are talking for example about mathematical operations where an address is referenced instead of a register. One common example is ADD [mem], EAX . This means add the content of register EAX to the content of a certain memory location (i.e. store the result back at the memory address). Store instructions which get broken down into store address and store data are another example.

In earlier designs such as the P6 (Pentium Pro, PII, PIII) architecture, these instruction would have been broken up into two or even three micro-ops. Remember that the whole philosophy behind all modern x86 CPUs, since the P6, is to decode x86 instructions into RISC-y micro-ops which are then fed to a fast RISC backend; the backend then schedules, issues, executes and retires the instructions in a smooth RISC way.

There is no way you could feed such an instruction (ADD [mem], EAX) to RISC execution units. It violates every RISC rule. RISC designs all load their data into the registers and then perform the necessary calculation on the registers.

So ADD [mem], EAX is broken down into:
Load the contents of [mem] into a register (MOV EBX, [mem])
An ALU operation, ADD the two registers together (ADD EBX, EAX)
Store the result back to memory (MOV [mem], EBX)
Since Banias, the ALU and the Load operation are kept together in one micro-op. This is called micro-op fusion. This is no small feat: in older designs keeping the load and ALU operation together would result in pipeline stages that take much longer and thus lower the maximum clock frequency. (In CPU designs, the maximum clock speed is essentially determined by the slowest possible pipeline stage execution time.) Only by using bigger, smarter circuitry that can do a lot in parallel is micro-op fusion possible without lowering the clock speed significantly.

The pre-decode stage recognizes the macro-ops (or x86) instructions that should be kept together. In the decoding phase, ADD [mem], EAX results in one micro-op. Again, this means that the CPU can stuff more instructions in the same OoO buffers, increasing efficiency and improving performance.

Core versus Hammer: Decoding

All very nice, but let us take a look at what really matters: How do the 3 simple + 1 complex decoders of Core compare to the 3 complex decoders of AMD's K8 architecture?

The original Athlon ("K7") has two way of decoding, Vector and Direct Path. The Vector Path decoding results in more than two RISC-like instructions (called "macro-ops" by AMD), the Direct Path in one, sometimes two macro-ops. Each of the decoders in K7 can handle both Vector Path and Direct Path decoding, but from a performance standpoint Direct Path is preferred since it results in fewer macro-ops. If you're wondering why were discussing K7 all of a sudden, just as Core is largely based off the P6 architecture, K8 is largely based off the K7 architecture.

The 3 complex decoders are powerful and can decode most x86 instructions, with few instructions requiring the Vector Path. The only downside of the K7 decoders is that some FP instructions and SSE instructions have to pass through the Vector Path. K8 has even stronger complex decoders and almost all FP and SSE instructions are also now decoded through the Direct Path decoders. This is possible as fetching and decoding takes more stages than it did in the K7; the K8 architecture is clearly more powerful when it comes to SIMD.

Obviously, Intel's Macro-op ( x86 instruction ) fusion does not exist in AMD's K8. However, micro-op fusion is available in another form. If we compare Intel's and AMD's macro-ops and micro-ops, it is easy to get confused. Take a look at the table below which explains the differences.


Micro-op fusion does exist in the Athlon. An ADD [mem], EAX is kept together in one macro-op as it travels through the pipeline. Therefore it will take only one place in the OoO buffers. However, the load and execute SSE/SSE2 operations can be fused on Core, while this is not the case on K8: packed SSE operations result in two macro-ops.

So how do Intel's Core and AMD's Hammer compare when it comes to decoding? It is hard to say at the moment without access to Intel's optimization manuals. However, we can get a pretty good idea. In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3.

The situation where AMD's 3 complex decoders can outperform Core's 1 complex + 3 simple decoders is much less likely to happen. It would happen when 3 instructions would be fetched that would have to be handled by the complex decoder of the Core CPU, but which are not too complex that the Microcode Sequencer must kick in. Since the most used x86 instructions all map to one Intel micro-op, this is pretty unlikely.

Memory Subsystem Out of Order Execution
POST A COMMENT

85 Comments

View All Comments

  • GeeZee - Friday, May 05, 2006 - link

    Even with all the new technologies put into the new "Core" architecture, I think Intel will have a very tough time putting the nails in the coffin of the Athlon/Opteron.

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.

    As for the future....AMD has a tremendous amount of companies that are working with them to produce the next gen chips. IBM, Sony, Transmeta, Nvidia, Cray. Pretty much all the Mobo/Chipset manufacturers are much more frendly with AMD than intel.

    I wouldn't count out AMD untill their next gen CPU's flop....and I don't think it will. Imagine AMD with access to the code morphing software & Transmeta's vliw chip as a co processor & Via's encryption core & HT 3.0. All working flawlessly due to the new memory modes introduced on AM2. Add onto that Transmeta's manufacturing patents would cut power by 50%.

    Via gets Royalties on each chip, Transmeta gets access to AMD core technolgies. Everyone wins.

    AMD really surprised Intel with the Athlon. And I think they have somthing up their sleeve after the AM2.
    Reply
  • IntelUser2000 - Friday, May 05, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Beat it?? Blow it away?? Have you seen the benchmarks of quad cores to know the reality?? Its the other way around. But when comparing against "Core" Duo that's different... Otherwise you are saying nonsense.
    Reply
  • GeeZee - Sunday, May 07, 2006 - link

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...">http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Mabye you should look at some facts with thoes blinded fanboy eyes.



    LOL. Anyone with ANY common sense should realize that the guy doesn't know what he is talking about. He claims Yonah uses 50W!!! Who's a fanboy here...

    And let me explain those clovertown scores.

    #1. Possibly not a good benchmark for looking at average performance:
    Take a look at Cinebench scores. You'll see that Pentium Extreme Edition 840 will outperform Pentium D 840 by over 15%!!! Now where do you see benchmark scores which shows the Pentium EE's outperforming Pentium D's by 15%?? That's right, MOST OF THE TIMES, IT DOESN'T!!! Pentium D's can outperform Pentium EE's lots of times.

    #2. The author's mind-boggling flawed logic on Clovertown's score:

    He claims that the reason Clovertown scales only 4.85 by using 8 cores is because its bandwidth starved. http://www.digitalvideoediting.com/articles/viewar...">http://www.digitalvideoediting.com/articles/viewar...

    Ah what do you see?? Opteron only scales 4.85x too!!!

    So what's the opinion on the blog?? HE'S A BLINDED FANBOY!!

    Stop posting in forums and use your useless brain on something else.

    Why people make up these stupid blogs though?? They are afraid to admit that Intel can actually do make something GOOD.
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    quote:

    In performance tests(Not benchmarks that fit under 4mb) the Athlon was very competitive with the new core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Pffft. Where do you see that?? Care to reveal those benchmarks?? Still in denial after looking at what Core Duo can do??
    Reply
  • IntelUser2000 - Tuesday, May 09, 2006 - link

    There are 3 main things people argue about when doubting Conroe.

    1. IDF system's scores are wrong because Intel could have modified the benchmarks.
    2. The K7/K8 decoders can all do complex instruction decoding which is better than Core
    3. The apps that doesn't fit in 4MB cache will perform slow.




    My response:
    1. ANANDTECH has shown that AFTER using THEIR OWN Quake 4 benchmark, the discrepancy between Conroe and OC'ed FX-60 INCREASED, indicating Intel's benchmarks are RATHER conservative.
    2. First, the two decoders(K7 and Core) can't be compared directly. While it was TRUE that K7 had superior decoder capability compared to P6, its different with Core, because more of the instructions that used to go to the complex decoder on the P6 now goes to the simple decoders in Core.
    3. The doubled AND lowered latency L2 cache on the Northwood gave 6-11%(Avg. 8.5%) gain in games. Doubled L2 cache on Barton gave 4-8%(6%) increase. Difference between Athlon 64 3000+(2.0GHz 512KB L2 single channel S754) and 3200+(1MB cache version) is 2.2-8%(5.1%).

    Caches doesn't do much. People seem to be somehow expecting 20% difference on the cache alone.
    Reply
  • Accord99 - Monday, May 08, 2006 - link

    Those scores beat a 4 single-core or a 2 dual-core Opteron system. Reply
  • clairvoyant129 - Sunday, May 07, 2006 - link

    How ironic you post that website in response to the above user (also calling him a fanboy) when it's a known fact that the author of the site manipulates information to favor AMD. Why don't you think a little next time? Reply
  • theteamaqua - Friday, May 05, 2006 - link

    im glad that intel is back on track, if they keep falling behind AMD, AMD is gonna jack up the price, intel jsut slash its cpu as much as 50%, the Pentium D 950, my mobo wont support conroe so ill jsut have to get the 960 when conroe launches,

    but what interest me most is the quad-coare thats coming Q1 next year, hopefully the performance can be as close to 200% of a dual-core counter-part running at the same speed
    Reply
  • thestain - Friday, May 05, 2006 - link

    Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.

    But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?

    Mike
    Reply

Log in

Don't have an account? Sign up now