Sideband Stack Optimizer

Intel's very first Pentium M introduced a feature Intel called its dedicated stack manager. As its name implies, the dedicated stack manager was used to handle all x86 stack operations (i.e. push, pop, call, return). The purpose of the stack manager was to keep those stack operations, which are frequently used with function calls in code, separate from the rest of the x86 instruction stream sent to the CPU. The dedicated stack manager would handle decode and "execution" of these operations so that they wouldn't clog up the processor's decoders and execution units later in the pipeline. Intel essentially "widened" the core by offloading some operations to separate hardware.

With Barcelona, AMD is introducing a similar technology it is calling a Sideband Stack Optimizer. Stack instructions no longer go through the 3-way decoder and stack operations no longer go through the integer execution units, effectively widening Barcelona at minimal cost. The Sideband Stack Optimizer, like Intel's dedicated stack manager, features its own adder that handles all stack operations. It's a small tweak that can help overall performance, and it's simply one that made sense for AMD to implement.

Faster Loads

When looking at the performance of the Athlon 64 and Intel's Core 2 processors, it's easy to understand why Intel has a strong performance advantage in applications that make heavy use of SSE. But what about applications like gaming and business apps that should greatly benefit from AMD's on-die memory controller? Is the Core 2's larger L2 cache and aggressive prefetchers all that it needs to overcome AMD's on-die memory controller?

One major aspect of Intel's Core micro-architecture advantage is its ability to allow load instructions to bypass previous load and store instructions. On average, about 1/3 of all instructions in a program end up being loads, thus if you can improve load performance you can generally impact overall application performance pretty significantly. With Intel's Core micro-architecture, it's possible for loads to be re-ordered to ensure that instructions dependent on those loads get the data they need without waiting for costly memory accesses.

Core also allowed for loads to be moved ahead of stores, which was previously not allowed due to the possibility that an earlier store could invalidate the data that was just loaded. Intel figured that the possibility of a store writing over a load ends up being very small, on the order of 1 - 2%, therefore with a reasonably accurate predictor you could correctly guess when re-ordering a load ahead of a store was possible. Intel's Core 2 based processors feature prediction logic to guess whether a store and a load share the same memory address; if the predictor determines that they won't, then it allows the load to be re-ordered ahead of the store. In the small chance that the predictor is incorrect however, the load has to be redone at the cost of a pipeline flush (similar to what happens if the processor mispredicts a branch).

AMD's K8 architecture had no equivalent scheme for allowing the out of order execution of loads ahead of other loads and stores, so even without an on-die memory controller Intel was able to execute some memory operations faster than AMD. Barcelona fixes this problem through an almost identical scheme to what Intel implemented in its Core 2 processors.

Barcelona can now re-order loads ahead of other loads, just like Core 2 can. It can also execute loads ahead of other stores, assuming that the processor knows that the two don't share the same memory address. While Intel uses a predictor to determine whether or not the store aliases with the load, AMD takes a more conservative approach. Barcelona waits until the store address is calculated before determining whether or not the load can be processed ahead of it. By doing it this way, Barcelona is never wrong and there's no chance of a mispredict penalty. AMD's designers looked at using a predictor like Intel did but found that it offered no performance improvement on its architecture. AMD can generate up to three store addresses per clock as it has three AGUs (Address Generation Units) compared to Intel's one for stores, so it would make sense that AMD has a bit more execution power to calculate a store address before moving a load ahead of it.

The out of order load execution improvements to Barcelona should prove to be even more effective than they were in Core 2 given that AMD previously couldn't do any reordering of loads before the Int/FP schedulers whereas Core Duo could do a limited amount of re-ordering.

Core Tune-up Even More Tweaks
POST A COMMENT

83 Comments

View All Comments

  • chucky2 - Friday, March 02, 2007 - link

    Can you post the link that originates at AMD's own website then that says specifically that AM2+ CPU's are guaranteed to work - understandably maybe not supporting every new feature - in current AM2 boards?

    Not a news post from DailyTech, The Inquirer, Toms, whatever...one that's on AMD's site itself.

    And No, AMD could make AM2+ completely incompatible with current AM2 boards and they probably wouldn't see much drop if at all from the large OEM's. The large OEM's would just ensure that when the AM2+ CPU's came in, AM2+ motherboards would likewise come in.

    Believe me, I want to see the link...because I'm desperately awaiting 690G or MCP68, whichever comes first (which is probably MCP68 at the pace AMD is moving on 690G).

    Chuck
    Reply
  • yacoub - Thursday, March 01, 2007 - link

    quote:

    In order to keep die sizes manageable, AMD constructed its quad-core Barcelona out of four cores each with a 128KB L1 and 512KB L2,


    You say 128kb L1 per core but the diagram image just beneath that text shows a 64bit L1 cache. Please confirm which it is.

    Thanks.

    Awesome article, btw. Seems like quite a significant group of changes to the CPU. Looking forward to seeing how it stacks up against the best Quad Core2 Intel can offer. =)
    Reply
  • yacoub - Thursday, March 01, 2007 - link

    also, please forgive my hasty typing - I wrote "128kb" and "64bit" - I meant "128KB" and "64KB" Reply
  • JarredWalton - Thursday, March 01, 2007 - link

    L1 is 128K total - 64K data and 64K instruction. Reply
  • Beenthere - Thursday, March 01, 2007 - link

    AMD doesn't do knee-jerk reactions like Intel because AMD has superior products. AMD continues to take market share from Intel in every segment and Barcelona will continue that trend. Barcelona looks to be every bit as superior to Intel's hacked/patched/glued together chips as Opteron was when introduced. Intel's chips depend on huge cache size for their performance and that crutch won't work after the intro of Barcelona.

    For those without a clue, AMD didn't start design of Barcelona last week or last year. It's been in the development pipeline for many years and thr performance will demonstrate exactly why AMD's long term platform stability is the right choice for most enterprise buyers. Intel is gonna feel the pain again.
    Reply
  • Roy2001 - Thursday, March 01, 2007 - link

    Facts please, no BS. Reply
  • zsdersw - Thursday, March 01, 2007 - link

    Idiocy incarnate. Reply
  • Regs - Thursday, March 01, 2007 - link

    AMD, like Intel, start numerious projects. Just not all of them get to this finish line. Actually a lot of them don't even reach the end of the planning phase before being scratched.

    As for Intel and their large caches...well I'd say it's amazing how half their die (if not more) is used for cache and still had enough space for all the core logic that's kicking the crap out of the K8 now.

    Common sense!
    Reply
  • erwos - Thursday, March 01, 2007 - link

    Looks like some good improvements coming down the pipe. The cache size issue makes me nervous, though - 512kb per core is starting to look a little antiquated, and there's no information about the bandwidth to the L3 cache (which, presumably, is slower than L2). Reply
  • SmokeRngs - Thursday, March 01, 2007 - link

    In the past, AMD did not need the large cache sizes that Intel did for their processors. This was very obvious in regards to the Netburst architecture. However, while Core2 is much better than Netburst there are still disadvantages for Intel.

    I'll explain a little background as far as I understand it. In the K7 and Netburst days, Intel had to have the cache to make up for their long pipeline. Branch mispredictions are going to happen and the penalty on the long pipeline of the Netburst processors hurt their IPC badly. The shorter pipeline on the K7 did not have the same performance penalty due to the shorter pipeline. With K8, the on die memory controller also negated the need for large L2 caches due to the reduced latency when accessing main memory. This has been one of the major performance aspects for the K8 architecture.

    The Core2 architecture obviously does not have the on die memory controller so the need for larger caches is still present and Intel sees improvement due to the larger caches. Barcelona still has the on die memory controller and the previous efficiency is still there and still negates the need for large caches. This is just the difference between architectures. While having a larger cache on the K8 did improve performance some in some usage scenarios, it wasn't on the same scale as the improvements Intel received with a larger cache.

    AMD can't compete with Intel in regards to cache size. However, other architecture differences make up for the lack of large amounts of cache. Barcelona having a smaller cache does not seem to be a big problem. If it was a big problem, AMD probably would have gone with a larger cache to get the extra performance. Bigger does not always mean better or at least enough better to warrant the extra.

    Smaller cache will mean fewer transistors which should mean better yields, lower power consumption and cheaper to produce.
    Reply

Log in

Don't have an account? Sign up now