Smarter Prefetching and Caching

Making sure that the right instructions and data are ready for use in the caches in the "3 GHz and beyond" era is one of the most important tasks of the architectural engineer. This helps to ensure performance increases as clockspeeds get pushed higher; otherwise, higher CPU clockspeeds will simply result in the processor spending more time waiting for data. This technique of priming the caches is knows as Prefetching; however, the current hardware prefetching algorithms don't always lead to success. There are quite a few cases where they can actually lower performance, especially in bandwidth sensitive applications.

The Core architecture prefetching is without any doubt superior to what can be found in the Athlon 64 and Pentium 4. There are no less than three prefetchers - two data, one instruction - in each core, plus two prefetchers for the L2-cache. With eight prefetchers active in one dual-core Core CPU, all those prefetchers could easily get in the way of the "demand" bandwidth - the bandwidth which is needed by the load operations of the running program. In order to avoid this bottleneck, the prefetch monitor of the Core CPUs always give priority to the demand bandwidth; the prefetchers will never steal too much bandwidth away from the running program.

There is more. The data prefetch needs to perform tag lookups (tags = index of cache) in the caches frequently. To avoid this resulting in higher latency for the "normal" (caused by the program running) tag lookups, the data prefetch uses the store port for the tag lookup. If you remember, loads happen about twice as often as stores. This means the store port is used only half as much as the load port and it make sense to use that port for tag lookup by the prefetchers. Note also that stores are not critical for system performance in most cases -- once the data is "written" the processor can go on about its business. The cache/memory subsystem is in charge of replicating the data down to main memory, and as long as this happens eventually, everything works fine.

The cache system of the Core CPU is also very impressive. A massive 4 MB L2-cache is shared between the two cores and is accessed in only 12 to 14 cycles. Each core also has a 3 cycle 32 KB Instruction cache and a 32 KB data cache at its disposal. Note that the "Trace Cache" of NetBurst has been left behind with the return to shorter pipelines; the NetBurst Trace Cache basically functions as an instruction cache for pre-decoded instructions, and while this was apparently helpful for the long pipeline of NetBurst, Intel has apparently determined that a traditional L1 caching scheme makes more sense for Core.

Cache Architecture Overview

Click to enlarge

Just a quick look at the numbers in the above table make it clear that the memory subsystem of the Core architecture is impressive. It has twice as much L2 cache as current dual-core CPUs (the same amount as Presler), but the cache is still accessible with low latency. The shared L2 cache also allows one core to use more than 2MB of cache if necessary. Both L1 and L2 cache are accessed via a 256-bit wide bus, allowing the caches to deliver massive bandwidth to the core.

Core versus Hammer: Memory subsystem

Core's most important competitor, the Hammer ("K8") architecture, has two small but noteworthy advantages. The first is its bigger 2 x 64 KB L1 cache. This is only a small advantage as an 8-way 32 KB cache will have a hit rate close to that of a 2-way 64 KB cache.

The second and more important advantage is the on die memory controller, which lowers the latency to the memory considerably. However, the lower clockspeeds of the Core CPUs (relative to NetBurst) and the faster FSB also lower latency significantly. With the numbers available to us now, we have reason to believe that the Athlon 64 X2's latency advantage will shrink to only 15 to 20%. For comparison, the memory subsystem of the Pentium 4 was almost twice as slow as the Athlon 64 (80-90 ns versus 45-50 ns).

However, those two small advantages are likely negated by all the other memory subsystem metrics. The Core CPUs have much bigger caches and much smarter prefetching than the competition. The Core architecture's L1 cache delivers about twice as much bandwidth (Measured by ScienceMark), while it's L2-cache is about 2.5 times faster than the Athlon 64/Opteron one.

Index Decoding Instructions
Comments Locked

87 Comments

View All Comments

  • Betwon - Tuesday, May 2, 2006 - link

    The data can not be used for Core -- Because it did not use the smart prefetcher.

    The Advanced smart prefetchers of Core's L1D have decreased the miss-rates very much. In fact, The data cache of Core --much more efficiency than K8's.
    Compared with Core's smart cahce, K8's 64KB L1D is like an idiot .
  • Spoonbender - Tuesday, May 2, 2006 - link

    How do you know? As the article said, the prefetching *might* in some cases decrease performance, even if it'll usually be an advantage. But I don't really think you have enough information to make a valid comparison. My point was simply that generally speaking, a 64KB, 2-way associative cache will have better hit rates than a 32KB 8-way associative. Of course, having fancy prefetching is always a good thing, but its effect *is* limited. If it was a huge improvement, people would have done that 8 years ago, instead of just messing with cache size and associativity.
  • Betwon - Tuesday, May 2, 2006 - link

    Your information is too old and should be updated now.
    Prefetcher give much improvement in reducing the miss-rate.
    About 30-90% miss rate reduced.
    The good prefetcher tech is one of the most important performance factors.

    http://www.hpcaconf.org/hpca11/slides/hpca_inst_sl...">http://www.hpcaconf.org/hpca11/slides/hpca_inst_sl...
  • Betwon - Tuesday, May 2, 2006 - link

    Who is James E. Smith? I think that you should know him.

    Data Cache Prefetching Using a Global History Buffer -- the prefetcher bring the great performance improvement! From 20-110%
    abstract
    http://ieeexplore.ieee.org/search/freesrchabstract...">http://ieeexplore.ieee.org/search/frees...+buffer%...
    Of course, you can download the full-text pdf file, if you have a IEEE member account. I can download and view it, but can not release it.
    slides ppt
    http://www.ece.gatech.edu/~leehs/ECE7102/slides/ka...">http://www.ece.gatech.edu/~leehs/ECE7102/slides/ka...
  • Sunrise089 - Monday, May 1, 2006 - link

    flak flak flak

    Seriously - props to the author on a good article, but if I had one comment it would be that there are length issues to trying to provide the ammount of background needed for this sort of article. I think it's best to either just draw the comparisons between the two chips, or do a full-length many thousands of words write-up on the technical importance of the various topics. I read the article, and while writen well and informative in it's conclusions, I cannot say all the background was enough to make me really understand the concepts better. For example I already knew what out-of-order execution was, but only being able to read a few hundred words more on it didn't allow me to really learn enough to understand all of the reasons why the K8 had a disadvantage in that area, and if all you wanted was for me to understand that it did indeed have the disadvanatge, you could have just said so.
  • JohanAnandtech - Monday, May 1, 2006 - link

    It is indeed an issue I struggled with. Writing full length articles on these subjects doesn't sound like a good idea for me: I personally do not like lengthy articles either. So I tried to keep a balance between being technical and keeping it understandable.

    Anyway, Just ask about the points where you were lost. Especially on the OoO matters: it is much more interesting than "AMD has a disadvantage". Basically, reordering happens between the decoding and the execute phase.

    Pushing loads forward helps in two ways:
    1.Whenever a load fails to get it's data from the L1-cache, the CPU has to find other instructions to execute. As loads are very common, it is easier to fill the gaps than when you can not move loads before other loads.

    2. If a load gets pushed forward and a L1-cache miss for that load occurs, it isn't that bad. This is very simplified, but assume the load has been pushed 5 cycles forward, and your L2-cache latency is 10, you only have to wait 5 cycles instead of 10.
  • Furen - Monday, May 1, 2006 - link

    I'll be the grammar nazi today, lol.

    Last page, paragraph 5: "[...] increasing the <b>wideness</b> of each unit [...]"
    Width, perhaps? "Wideness" refers to either quality or state (neither of which is discrete) while "width" also also applies to measurable fact (128-bits wide, for example). You can talk about the wideness of the units, for example, but you cannot talk about increasing their wideness...

    Great article, by the way, it's been long since I've read such an enjoyable article.
  • emboss - Monday, May 1, 2006 - link

    Just a quick note ... on page 4 you have the table with the execution unit details. There's a couple things incorrect (IMO) in the numbers.

    First, you list the number of double precision FLOPs per cycle. Double precision can be done with SSE, so in the K8 you can do 2 DP ADDs and 2 DP MULs every two cycles (due to the 64-bit wide datapaths), a total of 2 DP FLOPs per cycle.

    Core can do two SSE operations per cycle (the two symmetric units), giving it a total of 4 DP FLOPs per cycle. The third SSE unit does not handle FP ops, but instead handles shuffles and the like.

    Obviously, double both of these numbers if you want a "peak" single precision FLOPs per cycle.

    If instead you were meaning about extended precision (64 bit precision, 80 bit floats) x87 operations, it's exactly the same concept as above since Core has apparently has combined SSE/x87 units (and a fully pipelined FMUL, unlike the P4). This gives both the K8 and Core 2 EP FLOPs per cycle.

    Finally, you have the number of SSE units for the K7 wrong. The K7, like the K8, has two SSE units (FADD and FMUL), and the same 64 bit datapath as the K8. Of course, the K7 cannot handle SSE2, so must use x87 instructions for double precision (ie: two DP FLOPs per cycle).


    Apart from that, very nice article! I've been trying to optimise SSE code for the Core processor and have had to do things by trial and error thanks to the complete void of any decent documentation from Intel. One thing in particular was that I was finding "odd" performance properties with SSE that pointed towards it having two FMUL units. Being symmetric units explains a lot!
  • JarredWalton - Monday, May 1, 2006 - link

    See above note regarding Core Duo versus Core "Conroe". (Nice naming scheme, Intel. *grumble*) I will let Johan take care of the rest of your comment as appropriate. (His knowledge of the low level details of all of the microarchitectures discussed here definitely surpasses mine!)

    Unfortunately, it's not particularly surprising to find out that optimal code for Core Duo may need to be slightly tweaked in order to extract the most performance from Conroe. Still, they ought to be similar enough that you own by optimizing for Core Duo. The flipside is the optimal code for Conroe could very likely run worse on Core Duo and other processors. Such is the price of progress, I guess.
  • prx99 - Monday, May 1, 2006 - link

    Core is not the first x86 having 4 decoders. That was AMDs K5.

    I remember a statement from AMD that in some design they considered adding one more decoder. It turned out to actually slow down the design because the amount of clock speed lost was not compensated for by the smaller amount of performance gained.

    In my interpretation the fusion is done past the initial decoding, so there is not way more that 4 x86 ops can be decoded in a clock cycle (I'm referring to the "4+1" figure). The profit from fusion is not in the decoding stage but in the out of order engine.

    At AMDs, the "1 branch per cycle" rule is limited to branches seen by the predictor. A branch which is generally not taken is invisible to the prediction engine and therefore free.

    The original P4 indeed had a L1 latency of 2. The major P4 redesign in Prescott however increased it to 3.

    Load/store reordering is already done by the P4, but the penalty from a misprediction is fairly high. This is the drawback of any kind of prediction, whether branches or memory access: It speeds up things when being correct, but slows them down quite a bit more when not. This was the general picture seen in the P4: many applications were sped up by some amount, but some suffered greatly because they systematically fooled the P4's engines.

    Gruss, Andreas

Log in

Don't have an account? Sign up now