Out of Order Loads done right

Since the Pentium Pro, x86 CPUs have been capable of issuing and executing instructions out of order. However, on average about one third of the instructions in the reorder buffer could not be reordered easily: the loads. Moving loads forward can give a very big boost to performance. Instead of loading a piece of data at the moment you need it, it is much more useful to start the load as early as you can. That way, L1 and even L2-cache latencies can be hidden much more easily.

This is pretty easy to understand. Imagine an ALU operation that needs a certain piece of data, but the data is not available in the L1-cache. If the load has been executed many cycles before the ALU operation needs that piece of data, the L2-cache latency is going to have a reduced impact. Of course, you don't want to load a value which is being written, or will be written, by an earlier store (earlier in program/thread order). That would mean you are loading an old value, not the up-to-date one. Check out the picture below.


Load 2 cannot be moved forward, since it has to wait until Store 1 is done. Only after Store 1 is done will variable Y have its correct value. However, there is no reason why Load 4 cannot move forward. It doesn't have to wait for Store 1 and Store 3 to finish. By moving Load 4 forward, you give the load unit more time to get the right operand, as we assume that after Load 4 a calculation with operand Y will happen.
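The scenario can also be written down as a minimal Python sketch. The symbolic addresses (`Y`, `W`, `X`) and the helper `can_hoist` are assumptions chosen for illustration and are not taken from the original picture. The sketch also assumes that every store address is already known, which is exactly what real hardware cannot count on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str  # "store" or "load"
    addr: str  # symbolic address (hypothetical variable name)


# Hypothetical program-order sequence matching the description in the text.
program = [
    MemOp("store", "Y"),  # Store 1
    MemOp("load", "Y"),   # Load 2: reads what Store 1 writes
    MemOp("store", "W"),  # Store 3
    MemOp("load", "X"),   # Load 4: touches an address no earlier store writes
]


def can_hoist(ops, index):
    """Return True if the load at `index` aliases no earlier store in `ops`."""
    op = ops[index]
    if op.kind != "load":
        raise ValueError("only loads can be hoisted")
    return all(
        not (prev.kind == "store" and prev.addr == op.addr)
        for prev in ops[:index]
    )


if __name__ == "__main__":
    print("Load 2 may move ahead:", can_hoist(program, 1))
    print("Load 4 may move ahead:", can_hoist(program, 3))
```

With perfect knowledge of the addresses the decision is trivial; the difficulty is making the same decision before the store addresses have been computed.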

Currently, CPUs will generally delay Load 4 when a store is in flight (active). The problem is that the addresses to which the stores will write have yet to be calculated. To be more precise, the memory addresses are still unknown during reordering and scheduling. When a Load micro-op enters the ROB, the memory addresses of the stores that precede it in program order are not known until those stores pass the AGUs (Address Generation Units).

However, the risk that a load will read from an address that is about to be written by a store that has yet to finish is pretty small (1-2%). That is why Jack Doweck of the Core development team decided to allow Loads to go ahead of previous stores, assuming that the load will not be loading information that will be updated by that preceding store. To guard against the cases where that assumption is wrong, a predictor is used. The dynamic alias predictor tries to predict whether or not a previous store will write to the same address as the one from which the load (the load that you want to execute earlier, thus out of order) will read its data.

Based on Jack Doweck's comments and a study of Intel's previous P6 and P-M architectures I drew up the scheme below. Be warned that this is not the official Intel diagram.


The predictor tells the ReOrder Buffer (ROB) whether or not it may move a load ahead of a store. After the Load has been moved ahead and executed, the conflict logic scans the store buffer located in the Memory reOrder Buffer (MOB) to see if any of the stores which preceded the load in program order have written to the address of the out of order load. If so, the load must be redone, and the misprediction penalty is about 20 lost cycles. (Note that the branch misprediction penalty is also about 20 cycles.) Worst case, the new dynamic alias predictor may slightly reduce performance, but realistically it's four steps forward, one step back, resulting in a net performance boost.
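The interplay between predictor and conflict logic can be sketched in Python. This is a toy model under stated assumptions, not Intel's actual design: the table of saturating counters indexed by the load's instruction address, the table size, the saturation threshold, and the names `AliasPredictor`, `conflicts` and `run_load` are all hypothetical.

```python
class AliasPredictor:
    """Toy alias predictor: one saturating counter per table entry (assumed design)."""

    def __init__(self, table_size=256, saturate=3):
        self.table_size = table_size
        self.saturate = saturate
        self.counters = [0] * table_size

    def _index(self, load_pc):
        # Assumption: a simple modulo hash of the load's instruction address.
        return load_pc % self.table_size

    def may_hoist(self, load_pc):
        # Permission is granted only once the counter has saturated.
        return self.counters[self._index(load_pc)] >= self.saturate

    def update(self, load_pc, aliased):
        i = self._index(load_pc)
        if aliased:
            self.counters[i] = 0  # a conflict resets the confidence
        else:
            self.counters[i] = min(self.saturate, self.counters[i] + 1)


def conflicts(load_addr, older_store_addrs):
    """Conflict logic: did any older store write to the load's address?"""
    return load_addr in older_store_addrs


def run_load(predictor, load_pc, load_addr, older_store_addrs):
    """Return 'hoisted', 'replayed' (mispredicted, load redone) or 'waited'."""
    aliased = conflicts(load_addr, older_store_addrs)
    if predictor.may_hoist(load_pc):
        outcome = "replayed" if aliased else "hoisted"
    else:
        outcome = "waited"
    predictor.update(load_pc, aliased)
    return outcome


if __name__ == "__main__":
    p = AliasPredictor(table_size=16, saturate=2)
    for stores in ({200}, {200}, {200}, {100}, {200}):
        print(run_load(p, 0x40, 100, stores))
```

In this model a load has to prove itself harmless a few times before it is allowed to run ahead, and a single conflict sends it back to waiting. That is one plausible way to keep the expensive replays rare.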

Determining whether a load and a store share the same address is called memory disambiguation. Allowing loads to move ahead of stores gives a big performance boost. In some snippets of benchmarking code, Intel saw up to a 40% performance boost, solely the result of the more flexible way Loads get reordered. It is pretty clear that we won't see this in most real applications, but it is nevertheless impressive and it should show tangible (10-20%) performance boosts together with the fast L2 and L1 cache.
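A rough back-of-the-envelope check shows why the gamble pays off, using the figures quoted above (a 1-2% chance of a conflict and a replay penalty of about 20 cycles). The number of cycles saved per successfully hoisted load is a hypothetical input, not a measured value, and the model ignores loads that the predictor declines to hoist.

```python
def expected_gain_per_load(alias_rate, cycles_saved, replay_penalty=20.0):
    """Average cycles gained per hoisted load in this simplified model."""
    return (1.0 - alias_rate) * cycles_saved - alias_rate * replay_penalty


def break_even_cycles(alias_rate, replay_penalty=20.0):
    """Cycles a successful hoist must save, on average, for hoisting to pay off."""
    return alias_rate * replay_penalty / (1.0 - alias_rate)


if __name__ == "__main__":
    for rate in (0.01, 0.02):
        print(f"alias rate {rate:.0%}: "
              f"break-even at {break_even_cycles(rate):.2f} cycles saved per load")
    # Hypothetical example: each successful hoist hides 3 cycles of latency.
    print("expected gain:", expected_gain_per_load(0.02, 3.0), "cycles per load")
```

Under these assumptions, hoisting breaks even once a successful early load hides well under one cycle of latency on average, so even modest latency hiding comes out ahead.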

Let us not forget that loads are probably the most important instructions of all. Not only are loads about one third of the micro-ops that are in flight in an x86 CPU, but they can also cause costly stalls when a load needs to go to the L2 cache (or worse, system memory). So how does this super flexible reordering of loads compare with other architectures?


The P6 and P-M could already reorder Loads pretty well. They could move a Load ahead of other Loads, as well as ahead of Stores whose addresses are known and do not match the address of the load. In contrast, the Athlon 64 can only move loads ahead of independent ALU operations (ADD etc.). Loads cannot be moved ahead far enough to minimize the effect of a cache miss, and other loads cannot be used to keep the CPU busy if a load has to wait for a store to finish. This means that the Athlon 64 processor is severely limited when it comes to reordering code.
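The difference between the conservative P6/P-M rule and the speculative approach of Core can be sketched as two small Python predicates. This is an illustrative simplification: the function names, the use of `None` for a store address that has not been calculated yet, and the single `predicted_no_alias` flag are assumptions, not a description of the real scheduler logic.

```python
def p6_style_can_pass(load_addr, older_store_addrs):
    """Conservative rule: every older store address must be known and different."""
    return all(a is not None and a != load_addr for a in older_store_addrs)


def core_style_can_pass(load_addr, older_store_addrs, predicted_no_alias):
    """Speculative rule: unknown store addresses are tolerated if the
    predictor expects no alias; known matching addresses still block."""
    for a in older_store_addrs:
        if a is None:
            if not predicted_no_alias:
                return False
        elif a == load_addr:
            return False
    return True


if __name__ == "__main__":
    stores = [0x20, None]  # the second store's address is not yet calculated
    print("P6-style:  ", p6_style_can_pass(0x10, stores))
    print("Core-style:", core_style_can_pass(0x10, stores, predicted_no_alias=True))
```

A single store with an unknown address is enough to hold the load back under the conservative rule, while the speculative rule lets it proceed and relies on the later conflict check to catch mistakes.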

This is probably one of the most important reasons why the Athlon 64 does not outperform the P-M in gaming and integer workloads despite having a lower latency memory system and more integer execution resources. Integer workloads tend to jump around in memory, and have many unknown addresses which must be calculated first. Load reordering is less important for FP-intensive workloads, which is also one of the reasons why the Athlon 64 had no problem with Dothan in this kind of workload. FP workloads access the memory in a much more regular fashion.

Once Loads and Stores are in the queues of Load/Store units, the Athlon's L/S unit allows Loads to bypass Stores, except of course when the load would bypass a store to the same address. Unfortunately, by then the Loads are already out of the ICU and cannot be used to fill the holes that dependencies and cache misses make. You could say that the Athlon (64) has some Load/Store reordering but it's much later in the pipeline and is less flexible than the P6, P-M, and Core architectures.

85 Comments

  • GeeZee - Friday, May 05, 2006

    Even with all the new technologies put into the new "Core" architecture, I think Intel will have a very tough time putting the nails in the coffin of the Athlon/Opteron.

    In performance tests (not benchmarks that fit under 4MB) the Athlon was very competitive with the new Core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.

    As for the future... AMD has a tremendous number of companies that are working with them to produce the next gen chips. IBM, Sony, Transmeta, Nvidia, Cray. Pretty much all the Mobo/Chipset manufacturers are much more friendly with AMD than Intel.

    I wouldn't count out AMD until their next gen CPUs flop... and I don't think they will. Imagine AMD with access to the code morphing software & Transmeta's VLIW chip as a co-processor & Via's encryption core & HT 3.0. All working flawlessly due to the new memory modes introduced on AM2. Add onto that Transmeta's manufacturing patents would cut power by 50%.

    Via gets royalties on each chip, Transmeta gets access to AMD core technologies. Everyone wins.

    AMD really surprised Intel with the Athlon. And I think they have something up their sleeve after the AM2.
  • IntelUser2000 - Friday, May 05, 2006

    quote:

    In performance tests (not benchmarks that fit under 4MB) the Athlon was very competitive with the new Core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Beat it?? Blow it away?? Have you seen the benchmarks of quad cores to know the reality?? It's the other way around. But when comparing against "Core" Duo that's different... Otherwise you are talking nonsense.
  • GeeZee - Sunday, May 07, 2006

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Maybe you should look at some facts with those blinded fanboy eyes.
  • IntelUser2000 - Tuesday, May 09, 2006

    quote:

    Really......
    http://sharikou.blogspot.com/2006/04/clovertown-sc...
    Maybe you should look at some facts with those blinded fanboy eyes.


    LOL. Anyone with ANY common sense should realize that the guy doesn't know what he is talking about. He claims Yonah uses 50W!!! Who's a fanboy here...

    And let me explain those Clovertown scores.

    #1. Possibly not a good benchmark for looking at average performance:
    Take a look at the Cinebench scores. You'll see that the Pentium Extreme Edition 840 outperforms the Pentium D 840 by over 15%!!! Now where do you see benchmark scores which show the Pentium EEs outperforming Pentium Ds by 15%?? That's right, MOST OF THE TIME, THEY DON'T!!! Pentium Ds can outperform Pentium EEs lots of times.

    #2. The author's mind-bogglingly flawed logic on Clovertown's score:

    He claims that the reason Clovertown scales only 4.85x when using 8 cores is that it's bandwidth starved. http://www.digitalvideoediting.com/articles/viewar...

    Ah, what do you see?? Opteron only scales 4.85x too!!!

    So what's the opinion on the blog?? HE'S A BLINDED FANBOY!!

    Stop posting in forums and use your useless brain on something else.

    Why do people make up these stupid blogs though?? They are afraid to admit that Intel can actually make something GOOD.
  • IntelUser2000 - Tuesday, May 09, 2006

    quote:

    In performance tests (not benchmarks that fit under 4MB) the Athlon was very competitive with the new Core architecture, and beat it on many tests. On top of that A-64 and Opteron still blow it away when using 4 or more cores.


    Pffft. Where do you see that?? Care to reveal those benchmarks?? Still in denial after looking at what Core Duo can do??
  • IntelUser2000 - Tuesday, May 09, 2006

    There are 3 main things people argue about when doubting Conroe.

    1. The IDF system's scores are wrong because Intel could have modified the benchmarks.
    2. The K7/K8 decoders can all do complex instruction decoding, which is better than Core.
    3. Apps that don't fit in the 4MB cache will perform slowly.

    My response:
    1. ANANDTECH has shown that AFTER using THEIR OWN Quake 4 benchmark, the discrepancy between Conroe and the OC'ed FX-60 INCREASED, indicating Intel's benchmarks are RATHER conservative.
    2. First, the two decoders (K7 and Core) can't be compared directly. While it was TRUE that K7 had superior decoder capability compared to P6, it's different with Core, because more of the instructions that used to go to the complex decoder on the P6 now go to the simple decoders in Core.
    3. The doubled AND lower latency L2 cache on the Northwood gave a 6-11% (avg. 8.5%) gain in games. The doubled L2 cache on Barton gave a 4-8% (avg. 6%) increase. The difference between the Athlon 64 3000+ (2.0GHz, 512KB L2, single channel S754) and the 3200+ (1MB cache version) is 2.2-8% (avg. 5.1%).

    Caches don't do much. People seem to be somehow expecting a 20% difference from the cache alone.
  • Accord99 - Monday, May 08, 2006

    Those scores beat a 4 single-core or a 2 dual-core Opteron system.
  • clairvoyant129 - Sunday, May 07, 2006

    How ironic you post that website in response to the above user (also calling him a fanboy) when it's a known fact that the author of the site manipulates information to favor AMD. Why don't you think a little next time?
  • theteamaqua - Friday, May 05, 2006

    I'm glad that Intel is back on track. If they keep falling behind AMD, AMD is gonna jack up the price. Intel just slashed its CPU prices by as much as 50% (the Pentium D 950). My mobo won't support Conroe, so I'll just have to get the 960 when Conroe launches.

    But what interests me most is the quad-core that's coming Q1 next year. Hopefully the performance can be as close to 200% of a dual-core counterpart running at the same speed.
  • thestain - Friday, May 05, 2006

    A larger cache and an extra decoder were bound to help Conroe in the small and simple testing done by most benchmarks.

    But what about applications that are a bit larger than Conroe's cache size, or those that are complex, so that the simple decoders cannot be used much and the single complex decoder on the Conroe ends up in short supply?

    Mike
