Out of Order Loads done right

Since the Pentium Pro, x86 CPUs have been capable of issuing and executing instructions out of order. However, on average one third of the instructions in those reorder buffers could not be reordered easily: we are talking about loads. Moving loads forward can give a very big boost to performance. Instead of loading a piece of data when you need it, it is much more useful to start the load as early as you can. That way, L1 and even L2-cache latencies can be much more easily hidden.

This is pretty easy to understand. Imagine having an ALU operation that needs a certain piece of data but that the data is not available in the L1-cache. If the load has been executed many cycles before the ALU operation needs that piece of data, the L2-cache latency is going to have a reduced impact. Of course, you don't want to load a value which is being or will be written to by a previous - following the program/thread order - store. That would mean you are loading an old value, not the up to date one. Check out the picture below.


Load 2 cannot be moved forward, since it has to wait until the first Store is done. Only after Store 1 is done will variable Y have its correct value. However there is no reason why Load 4 cannot move forward. It doesn't have to wait for Store 3 and store 1 to finish. By moving Load 4 forward, you give the load unit more time to get the right operand, as we assume that after load 4 a calculation with operand Y will happen.

Currently, CPUs will generally delay load 4 when a store is in flight (active). The problem is the address to which the stores will write has yet to be calculated. To be more precise, the memory addresses are still unknown during reordering and scheduling. When a Load micro-op enters the ROB, the memory addresses of previous stores (from the program order) are not known until they pass the AGU (Address Generation Units).

However, the risk that a load will load a value out of an address that is being written to by a store that has yet to be finished is pretty small (1-2%). That is why Jack Doweck of the Core development team decided to allow Loads to go ahead of previous stores, assuming that the load will not be loading information that will be updated by that preceding store. To avoid that the assumption was wrong, a predictor is used to help. The dynamic alias predictor tries to predict whether or not a previous store will write to the same address as the address from which the load - that you want to execute earlier, thus out of order - will load its data.

Based on Jack Doweck's comments and a study of Intel's previous P6 and P-M architectures I drew up the scheme below. Be warned that this is not the official Intel diagram.


The predictor gives the ReOrder Buffer (ROB) the permission to move a load ahead of a store or not. After the Load has been moved ahead and executed, the conflict logic scans the store buffer located in the Memory reOrder Buffer (MOB) to see if any of the stores which were located before the load (following program order) have written to the address of the out of order load. If so, the load must be redone, and the misprediction penalty is about 20 lost cycles. (Note that the branch misprediction penalty is also about 20 cycles). Worst case, the new dynamic alias predictor may slightly reduce performance, but realistically it's four steps forward, one step back, resulting in a net performance boost.

Determining whether a load and a store share the same address is called memory disambiguation. Allowing loads to move ahead of stores gives a big performance boost. In some snippets of benchmarking code, Intel saw up to a 40% performance boost, solely the result of the more flexible way Loads get reordered. It is pretty clear that we won't see this in most real applications, but it is nevertheless impressive and it should show tangible (10-20%) performance boosts together with the fast L2 and L1 cache.

Let us not forget that loads are probably the most important instructions of all. Not only are loads about one third of the micro-ops that are in flight in a x86 CPU, but they can also cause costly stalls when a load needs to go to the L2 cache (or worse, system memory). So how does this super flexible reordering of loads compare with other architectures?


The P6 and P-M could already reorder Loads pretty good. They could move one Load before other Loads, as well as before Stores which have no unknown addresses or addresses which do not reference the same address as the load. In contrast, the Athlon 64 can only move loads before independent ALU operations (ADD etc.). Loads cannot be moved ahead much at all to minimize the effect of a cache miss, and other loads cannot be used to keep the CPU busy if a load has to wait for a store to finish. This means that the Athlon 64 processor is severely limited when it comes to reorder code.

This is probably one of the most important reasons why the Athlon 64 does not outperform the P-M in gaming and integer workloads despite having a lower latency memory system and more integer execution sources. Integer workloads tend to jump around in memory, and have many unknown addresses which must be calculated first. It is less important for FP intensive loads, which is also one of the reasons why the Athlon 64 had no problem with Dothan in this kind of workload. FP workloads access the memory in a much more regular fashion.

Once Loads and Stores are in the queues of Load/Store units, the Athlon's L/S unit allows Loads to bypass Stores, except of course when the load would bypass a store to the same address. Unfortunately, by then the Loads are already out of the ICU and cannot be used to fill the holes that dependencies and cache misses make. You could say that the Athlon (64) has some Load/Store reordering but it's much later in the pipeline and is less flexible than the P6, P-M, and Core architectures.

Out of Order Execution Concluding Thoughts
POST A COMMENT

85 Comments

View All Comments

  • JarredWalton - Monday, May 01, 2006 - link

    1.26 was Tualatin as well, but that's beside the point. Basically, clock speed at launch vs. final clock speed of the architecture was a disappointment for Intel. They were hoping for 6+ GHz at launch, and even thinking 10 GHz might be possible. Reply
  • KayKay - Monday, May 01, 2006 - link

    Someone should hire you to write textbooks because this was explained extremely well and in simple terms. Good Job Reply
  • JohanAnandtech - Monday, May 01, 2006 - link

    Thanks! :-)

    Very happy to read that.
    Reply
  • BitByBit - Monday, May 01, 2006 - link

    Fantastic article.
    In retrospect, it is easy to conclude that this is the route Intel should have chosen for P6's retirement.

    Core looks to be a very strong, all-round performer, unlike Netburst.
    We can only hope that AMD has an answer in the works, as K8 will have a hard time competing with this monster.
    It is unreasonable in my mind to expect a 4-5(?) year-old architecture to be able to compete with Intel's latest. AMD with K8 has had a long reign as the performance king, but is now facing something entirely different. Perhaps K8L will be able to offer serious competition.
    It will, however, take more than a doubling of the FP units (if rumour is correct) to achieve this. The cumulative effect of Conroe's architectural features (memory disambiguation, macro-ops fusion etc...) mean that Core's efficiency has far exceeded K8's, not to mention the impact of its vastly superior cache system - its 8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1.

    It may not be until K10 is released that AMD takes back the performance crown.



    Reply
  • Larso - Monday, May 01, 2006 - link

    As the K8 is about 5 years old, and the current incarnations doesn't really include that many modifications, I wonder what AMD's engineers have been doing all these years. The K8 is not that different from the K7 even.

    Whats coming up? The AM2 version is basically the same beast with a new memory controller. The K8L, well since they didn't name it K9, I suppose its just small upgrades to the same design.

    I really like to think AMD has something coming we don't know about. Or rather, they ought to have something coming... Any rumors?
    Reply
  • Reynod - Monday, May 01, 2006 - link

    I can't help but think (and pray) that Larso's comment has some validity here.
    Why would AMD sit back and do nothing for so long? Would they have not been tinkering with various prototypes over the last couple of years? Are we in for a surprise?? Anand, you and the review team touched on several improvements they could make, care to outline these in some detail in a future article? Someone needs to give AMD some free advice ... heh heh
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    Keep in mind that AMD doesn't have Intel's resources. Until recently, they still lost money every quarter. So they might not have been able to work on a successor to K8 until recently. (I remeber reading an interview with some AMD boss, saying that the K8 was literally a last-ditch effort to survive. If that failed, there wouldn't be an AMD, so they threw everything they had at it)

    So "Why would AMD sit back and do nothing for so long?" Because they had a good project, and didn't have the resources to make a new one?
    Of course, it probably isn't that bad, just tossing out an alternative scenario. ;)

    However, they have hinted that they were working on specific architectures for the notebook and server markets. (Unlike Intel who are moving back to a single unified architecture).

    And despite its age, the K8 is still a pretty nice architecture, and it wouldn't be a huge undertaking to improve on it to get something quite a bit more efficient. Intel had to develop a new architecture because NetBurst just wouldn't cut it. AMD can probably afford to expand on K8 a bit longer, and even with K9/K10, I wouldn't expect a vastly different architecture.
    Reply
  • Spoonbender - Monday, May 01, 2006 - link

    "Because they had a good project" <- Was supposed to be product, not project... :) Reply
  • psychobriggsy - Monday, May 01, 2006 - link

    AMD said recently that they have three times the engineers on their books as they did when they designed K8.

    However I suspect they're working on K10/KX, although maybe some of them worked on K8L.

    Clearly it seems that some in-core work could translate into reasonable performance gains for the current K8 design. A 4-way L1 cache instead of 2-way for example, and a greater L2 to L1 bandwidth. Certainly a mechanism to reorder instructions so that loads can be performed earlier seems to be necessary. 2MB L2 per core could also help, and the 65nm die pictures that AMD showed recently did seem to show far denser cache. K8L is rumoured to include more FP resources, but I don't know about any of the other stuff - but AMD will be talking more about K8L (and beyond?) in June apparently.
    Reply
  • spinportal - Monday, May 01, 2006 - link

    Definitely a great treasure of an article to find on a monday morning detailing the Core architecture that the world is drooling for in June. I wonder what kind of simulation micro-arch software Intel and AMD use, as I remember the days doing my Masters, playing with Intel's Hypercube simulator (grav sim, fish & shark AI sim) and writing my own macro level visual-cpu execution simulator. Reply

Log in

Don't have an account? Sign up now