Out of Order execution: AMD versus Intel

To make this article more accessible and to make the differences between the AMD K8 and the new Intel Core architecture more clear, I tried to make both CPU diagrams in the same style. Here's the Core architecture overview:


And here's the K8 architecture:


There are a few obvious differences: Core has bigger OoO buffers: the 96 entry ROB buffer is - also thanks to Macro-op fusion - quite a bit bigger than the 72 Entry Macro-op buffer of K8. The P6 architecture could order only 40 instructions, this was doubled to 80 in the P-M architecture (Banias, Dothan, Yonah), and now it's increased even further to 96 for the Core architecture. We've created a table which compares the most important architectural details of several current CPU families:

Click to enlarge

The Core architecture uses a central reservation station, while the Athlon uses distributed schedulers. The advantage of a central reservation station is that utilization is better, however distributed schedulers allow more entries. NetBurst also uses distributed schedulers.

Using a central reservation station is another clear example of how Core is in fact the "P8", the second big improvement of the P6 architecture. Just like the P6 architecture, it uses a Reservation Station (RS) and allocates a specific execution unit to execute the micro-op. After execution, the micro-op results are stored in the ROB entry for that micro-op. This aspect of the Core design is clearly taken from the Yonah, Dothan and P6 architectures.

The biggest differences are not immediately visible on the diagrams above. Previous Intel architectures can only perform one branch prediction every two cycles, but Core can sustain one branch per cycle. The Athlon 64 can also perform one branch prediction per cycle.

Another impressive area is Core's SSE multimedia power. Three very powerful 128-bit SSE/SSE2/SSE3 units are available, and two of them are symmetric. Core will outperform the Athlon 64 vastly when it comes to 128-bit SSE2/3 processing.

On K8, 128-bit SSE instructions are decoded into two separate 64-bit instructions. Each Athlon 64 SSE unit can only do one 64-bit instruction at a time, so the Core architecture has essentially at least 2 times the processing power here. With 64-bit FP, Core can do 4 Double Precision FP calculations per cycle, while the Athlon 64 can do 3.

When it comes to integer execution resources, the Core architecture is an improvement over the Pentium 4 and Dothan CPUs, and is at the same level (if we only look at the number of execution units) as the Athlon 64. The Athlon 64 seems to have a small advantage when it comes to calculating addresses: it has 3 AGU compared to Core's 2. This could give the Athlon 64 an advantage in some less common integer workloads such as decrypting algorithms. The deeper, more flexible (Memory disambiguation, see further) out of order buffers and bigger, faster L2-cache of the Core should negate this small advantage in most integer workloads.

Decoding Instructions Faster Load Times
POST A COMMENT

85 Comments

View All Comments

  • JarredWalton - Monday, May 01, 2006 - link

    1.26 was Tualatin as well, but that's beside the point. Basically, clock speed at launch vs. final clock speed of the architecture was a disappointment for Intel. They were hoping for 6+ GHz at launch, and even thinking 10 GHz might be possible. Reply
  • KayKay - Monday, May 01, 2006 - link

    Someone should hire you to write textbooks because this was explained extremely well and in simple terms. Good Job Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    Thanks! :-)

    Very happy to read that.
    Reply
  • BitByBit - Monday, May 01, 2006 - link

    Fantastic article.
    In retrospect, it is easy to conclude that this is the route Intel should have chosen for P6's retirement.

    Core looks to be a very strong, all-round performer, unlike Netburst.
    We can only hope that AMD has an answer in the works, as K8 will have a hard time competing with this monster.
    It is unreasonable in my mind to expect a 4-5(?) year-old architecture to be able to compete with Intel's latest. AMD with K8 has had a long reign as the performance king, but is now facing something entirely different. Perhaps K8L will be able to offer serious competition.
    It will, however, take more than a doubling of the FP units (if rumour is correct) to achieve this. The cumulative effect of Conroe's architectural features (memory disambiguation, macro-ops fusion etc...) mean that Core's efficiency has far exceeded K8's, not to mention the impact of its vastly superior cache system - its 8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1.

    It may not be until K10 is released that AMD takes back the performance crown.



    Reply
  • Larso - Monday, May 01, 2006 - link

    As the K8 is about 5 years old, and the current incarnations doesn't really include that many modifications, I wonder what AMD's engineers have been doing all these years. The K8 is not that different from the K7 even.

    Whats coming up? The AM2 version is basically the same beast with a new memory controller. The K8L, well since they didn't name it K9, I suppose its just small upgrades to the same design.

    I really like to think AMD has something coming we don't know about. Or rather, they ought to have something coming... Any rumors?
    Reply
  • Reynod - Monday, May 01, 2006 - link

    I can't help but think (and pray) that Larso's comment has some validity here.
    Why would AMD sit back and do nothing for so long? Would they have not been tinkering with various prototypes over the last couple of years? Are we in for a surprise?? Anand, you and the review team touched on several improvements they could make, care to outline these in some detail in a future article? Someone needs to give AMD some free advice ... heh heh
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    Keep in mind that AMD doesn't have Intel's resources. Until recently, they still lost money every quarter. So they might not have been able to work on a successor to K8 until recently. (I remeber reading an interview with some AMD boss, saying that the K8 was literally a last-ditch effort to survive. If that failed, there wouldn't be an AMD, so they threw everything they had at it)

    So "Why would AMD sit back and do nothing for so long?" Because they had a good project, and didn't have the resources to make a new one?
    Of course, it probably isn't that bad, just tossing out an alternative scenario. ;)

    However, they have hinted that they were working on specific architectures for the notebook and server markets. (Unlike Intel who are moving back to a single unified architecture).

    And despite its age, the K8 is still a pretty nice architecture, and it wouldn't be a huge undertaking to improve on it to get something quite a bit more efficient. Intel had to develop a new architecture because NetBurst just wouldn't cut it. AMD can probably afford to expand on K8 a bit longer, and even with K9/K10, I wouldn't expect a vastly different architecture.
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    "Because they had a good project" <- Was supposed to be product, not project... :) Reply
  • psychobriggsy - Monday, May 01, 2006 - link

    AMD said recently that they have three times the engineers on their books as they did when they designed K8.

    However I suspect they're working on K10/KX, although maybe some of them worked on K8L.

    Clearly it seems that some in-core work could translate into reasonable performance gains for the current K8 design. A 4-way L1 cache instead of 2-way for example, and a greater L2 to L1 bandwidth. Certainly a mechanism to reorder instructions so that loads can be performed earlier seems to be necessary. 2MB L2 per core could also help, and the 65nm die pictures that AMD showed recently did seem to show far denser cache. K8L is rumoured to include more FP resources, but I don't know about any of the other stuff - but AMD will be talking more about K8L (and beyond?) in June apparently.
    Reply
  • spinportal - Monday, May 01, 2006 - link

    Definitely a great treasure of an article to find on a monday morning detailing the Core architecture that the world is drooling for in June. I wonder what kind of simulation micro-arch software Intel and AMD use, as I remember the days doing my Masters, playing with Intel's Hypercube simulator (grav sim, fish & shark AI sim) and writing my own macro level visual-cpu execution simulator. Reply

Log in

Don't have an account? Sign up now