Smarter Prefetching and Caching

Making sure that the right instructions and data are ready for use in the caches in the "3 GHz and beyond" era is one of the most important tasks of the architectural engineer. This helps to ensure performance increases as clockspeeds get pushed higher; otherwise, higher CPU clockspeeds will simply result in the processor spending more time waiting for data. This technique of priming the caches is known as prefetching; however, the current hardware prefetching algorithms don't always lead to success. There are quite a few cases where they can actually lower performance, especially in bandwidth-sensitive applications.
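
Hardware prefetchers do this transparently by recognizing access patterns, but the idea is easy to illustrate in software. Below is a minimal C sketch using the x86 _mm_prefetch intrinsic to request data a fixed distance ahead of the loop's own "demand" loads; the function name and the prefetch distance are illustrative choices, not tuned values.

    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
    #include <stddef.h>

    #define PREFETCH_DISTANCE 16  /* illustrative, not tuned */

    float sum_with_prefetch(const float *data, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            /* Hint the cache hierarchy to start fetching future data
               while the current iteration's demand load executes. */
            if (i + PREFETCH_DISTANCE < n)
                _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                             _MM_HINT_T0);
            sum += data[i];  /* the demand load */
        }
        return sum;
    }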

The Core architecture prefetching is without any doubt superior to what can be found in the Athlon 64 and Pentium 4. There are no less than three prefetchers - two data, one instruction - in each core, plus two prefetchers for the L2 cache. With eight prefetchers active in one dual-core Core CPU, all those prefetchers could easily get in the way of the "demand" bandwidth - the bandwidth which is needed by the load operations of the running program. In order to avoid this bottleneck, the prefetch monitor of the Core CPUs always gives priority to the demand bandwidth; the prefetchers will never steal too much bandwidth away from the running program.

There is more. The data prefetcher needs to perform tag lookups (the tags identify which memory lines the cache currently holds) in the caches frequently. To avoid this resulting in higher latency for the "normal" tag lookups (those caused by the running program), the data prefetcher uses the store port for its tag lookups. If you remember, loads happen about twice as often as stores. This means the store port is used only half as much as the load port, so it makes sense to use that port for tag lookups by the prefetchers. Note also that stores are not critical for system performance in most cases -- once the data is "written" the processor can go on about its business. The cache/memory subsystem is in charge of propagating the data down to main memory, and as long as this happens eventually, everything works fine.
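
The 2:1 load-to-store ratio is easy to see in typical compiled code. A minimal illustrative example:

    #include <stddef.h>

    /* Each iteration issues two loads (a[i] and b[i]) but only one
       store (dst[i]), so the load ports see roughly twice the
       traffic of the store port. */
    void vec_add(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }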

The cache system of the Core CPU is also very impressive. A massive 4 MB L2 cache is shared between the two cores and is accessed in only 12 to 14 cycles. Each core also has a 32 KB instruction cache and a 32 KB data cache at its disposal, each accessible in 3 cycles. Note that the "Trace Cache" of NetBurst has been left behind with the return to shorter pipelines; the NetBurst Trace Cache basically functions as an instruction cache for pre-decoded instructions, and while this was apparently helpful for the long pipeline of NetBurst, Intel has evidently determined that a traditional L1 caching scheme makes more sense for Core.
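
Latency figures like these are typically obtained with a pointer chase: a chain of serially dependent loads whose average step time exposes the load-to-use latency of whichever cache level the chain fits in. A minimal sketch of the core loop (building the cyclic chain and the timing code are omitted for brevity):

    #include <stddef.h>

    /* Walk a pre-built cyclic pointer chain. Every load depends on the
       previous one, so elapsed time / steps approximates the latency of
       the cache level the chain fits in (32 KB -> L1, 4 MB -> L2). */
    void *chase(void **start, size_t steps)
    {
        void **p = start;
        while (steps--)
            p = (void **)*p;  /* serially dependent load */
        return p;             /* returning p defeats dead-code elimination */
    }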

Cache Architecture Overview

Just a quick look at the numbers in the above table makes it clear that the memory subsystem of the Core architecture is impressive. It has twice as much L2 cache as current dual-core CPUs (the same amount as Presler), but the cache is still accessible with low latency. The shared L2 cache also allows one core to use more than 2 MB of cache if necessary. Both the L1 and L2 caches are accessed via a 256-bit wide bus, allowing the caches to deliver massive bandwidth to the core.
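
To put that bus width in perspective: 256 bits is 32 bytes per cycle, so peak cache bandwidth scales directly with clock speed. A trivial calculation (the 2.4 GHz clock is an illustrative figure, not a specific SKU):

    #include <stdio.h>

    /* Peak cache bandwidth = port width (bytes/cycle) x core clock. */
    int main(void)
    {
        const double bytes_per_cycle = 256.0 / 8.0;  /* 256-bit bus */
        const double clock_ghz = 2.4;                /* illustrative */
        printf("peak: %.1f GB/s\n", bytes_per_cycle * clock_ghz);
        return 0;  /* 32 B/cycle x 2.4 GHz = 76.8 GB/s */
    }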

Core versus Hammer: Memory subsystem

Core's most important competitor, the Hammer ("K8") architecture, has two small but noteworthy advantages. The first is its bigger 2 x 64 KB L1 cache. This is only a small advantage, as an 8-way set-associative 32 KB cache will have a hit rate close to that of a 2-way set-associative 64 KB cache: the higher associativity largely compensates for the smaller size.

The second and more important advantage is the on-die memory controller, which lowers the latency to the memory considerably. However, the lower clockspeeds of the Core CPUs (relative to NetBurst) and the faster FSB also lower latency significantly. With the numbers available to us now, we have reason to believe that the Athlon 64 X2's latency advantage will shrink to only 15 to 20%. For comparison, the memory subsystem of the Pentium 4 was almost twice as slow as the Athlon 64 (80-90 ns versus 45-50 ns).

However, those two small advantages are likely negated by all the other memory subsystem metrics. The Core CPUs have much bigger caches and much smarter prefetching than the competition. The Core architecture's L1 cache delivers about twice as much bandwidth (as measured by ScienceMark), while its L2 cache is about 2.5 times faster than that of the Athlon 64/Opteron.

Comments

  • Betwon - Wednesday, May 3, 2006 - link

    If you really want to know what Intel's load reordering and memory disambiguation are, I can tell you the facts:

    http://www.stanford.edu/~merez/papers/LoadSched_IS...
    Speculation Techniques for Improving Load Related Instruction Scheduling, 1999
    Adi Yoaz, Mattan Erez, Ronny Ronen, and Stephan Jourdan -- from Intel's Haifa team; they designed the load/store unit of Core.

    I have said that AnandTech should study many things about CPUs. Of course, I should study more things about CPUs myself.
  • Betwon - Tuesday, May 2, 2006 - link

    P6: sub [mem],eax decodes to three micro-ops
    Core Duo: sub [mem],eax decodes to two micro-ops
    K8: sub [mem],eax decodes to one macro-op

    P6: sub eax,[mem] decodes to two micro-ops
    Core Duo: sub eax,[mem] decodes to one micro-op
    K8: sub eax,[mem] decodes to one macro-op

    Intel's micro-fusion is different from the K7/K8's macro-ops.

    P4 has 2x2 (double-pumped) int ALUs and 2 AGUs.
    K7/K8 has 3 int ALUs and 3 AGUs.
    But Core Duo has only 2 int ALUs and 2 AGUs.

    The integer performance:
    Core Duo > K7/K8

    Why?
    Because Core Duo's critical path is the shortest. Most integer code can be thought of as a tall, thin tree of dependency chains, where the longest dependency chain is called the critical path. The critical path determines the performance: its length is the number of cycles needed to complete it. Core Duo (2 ALU/2 AGU) spends fewer cycles on it than K7/K8 (3 ALU/3 AGU), because extra integer function units cannot accelerate truly dependent operations (see the sketch at the end of this post).

    The Pentium M/Core Duo's special ability to handle chains of truly dependent operations is the real reason for its integer performance lead, and it is what distinguishes it from the old P6 (such as the Pentium III).

    Most FP code, by contrast, can be thought of as a thicket of many short dependency chains, so a great deal of ILP can be extracted: the more FP FADD/FMUL function units, the more performance.

    Doubling the FP function units (or doubling the speed of half-speed FP units) is a good idea for most FP programs, but doubling the integer function units does not always enhance integer performance that much. This is why Conroe has only three ALUs and two AGUs rather than 4 ALUs and 4 AGUs, while K7/K8 has three ALUs and three AGUs.
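
    A minimal C sketch of the contrast (the four-accumulator split is an illustrative choice):

        #include <stddef.h>

        /* Latency-bound: every add depends on the previous one, so the
           critical path alone sets the cycle count -- extra ALUs sit idle. */
        long serial_sum(const long *v, size_t n)
        {
            long acc = 0;
            for (size_t i = 0; i < n; i++)
                acc += v[i];
            return acc;
        }

        /* Throughput-bound: four independent accumulators break the chain,
           so extra function units can work in parallel (typical of FP code). */
        long parallel_sum(const long *v, size_t n)
        {
            long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
            for (size_t i = 0; i + 3 < n; i += 4) {
                a0 += v[i];     a1 += v[i + 1];
                a2 += v[i + 2]; a3 += v[i + 3];
            }
            return a0 + a1 + a2 + a3;  /* tail elements omitted for brevity */
        }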
  • Betwon - Tuesday, May 2, 2006 - link

    Sorry, English is not my native language, so my posts contain many spelling and syntax errors. Where I wrote "functions" I meant "function units".

    I want to explain why Conroe, with only 3 ALUs/2 AGUs, has the better integer performance: Core has a superb ability to process true dependency chains (much better than K8 with its 3 ALUs/3 AGUs). Even Core Duo with 2 ALUs/2 AGUs outperforms on integer code (much better than K8 with 3 ALUs/3 AGUs).

    The key ideas: the length of the critical path's dependency chain, and truly dependent operations.
  • Starglider - Monday, May 1, 2006 - link

    A truly excellent article. I just have a couple of questions:

    quote:

    The second and more important advantage is the on-die memory controller, which lowers the latency to the memory considerably. However, the lower clockspeeds of the Core CPUs (relative to NetBurst) and the faster FSB also lower latency significantly. With the numbers available to us now, we have reason to believe that the Athlon 64 X2's latency advantage will shrink to only 15 to 20%.


    It's clear that increasing FSB speed can reduce memory latency. However, I'm not clear on why a lower CPU core speed would reduce absolute latency - sure, it will reduce the number of CPU cycles that occur while waiting for memory, but how can it reduce the absolute delay?

    You don't seem to have included inter-instruction latency in your comparison tables. I know this data can be hard to get hold of, but it's critical to the performance of highly serial code (e.g. the pi calculation benchmarks that seem to be so popular at the moment). Is there any chance it could be included?

    Finally, I'm wondering if Intel will revisit some of the P4 clock speed enhancing tricks later on. Things like LVS and double-pumped ALUs would only have slowed down an already complex development process that Intel desperately needed to be completed quickly. But if AMD do come out with a new architecture that matches or exceeds Conroe on IPC, Intel might be able to respond quite quickly by bringing back some of their already well understood clock speed tricks to accelerate Conroe.
  • Makaveli - Monday, May 1, 2006 - link

    You need to get away from thinking of increased clockspeed for extra performance; the future is multi-core CPUs and parallelism.
  • Starglider - Tuesday, May 2, 2006 - link

    I write heavily multithreaded applications for a living, but sometimes there is just no substitute for fast serial execution; a lot of things just can't be parallelised. Serial execution speed is effectively IPC * clock rate, so yes increasing clock rate is still very helpful as long as IPC doesn't suffer.
  • saratoga - Monday, May 1, 2006 - link

    ^^ Did you read the post you replied to? His point is valid.

    Lower clock speed is not going to improve memory latency. It may mean that the latency is less painful, but if you took two Core chips, one at 2GHz and the other at 3GHz, the absolute latency would be roughly if not exactly the same for each, though the cost of each ns of latency is 50% more dear on the 3GHz chip.
  • Spoonbender - Tuesday, May 2, 2006 - link

    It would be more accurate to say that each cycle of latency is more dear on the 2GHz chip, wouldn't it? ;)
    What matters is how *long* the latency is, in ns. A ns of latency is a ns, and it forces the CPU to wait exactly one ns, no matter its clock speed... :)

    So yeah, definitely a valid point, and I wondered about that in the article as well.
  • saratoga - Wednesday, May 3, 2006 - link

    quote:

    It would be more accurate to say that each cycle of latency is more dear on the 2ghz chip, wouldn't it? ;)


    No. It's 50% worse for the 3GHz chip (since the clock speed is 50% higher).

    quote:

    What matters is how *long* the latency is, in ns. A ns latency is a ns, and it forces the cpu to wait exactly one ns, no matter its clock speed... :)


    And one ns is how many clock cycles on a 2GHz chip? And how many of a 3GHz chip? Think this through . . .
  • Spoonbender - Wednesday, May 3, 2006 - link

    quote:

    No. It's 50% worse for the 3GHz chip (since the clock speed is 50% higher).

    No, it isn't. They both waste *exactly* one ns of execution per, well, ns of latency. How many cycles they can cram into that ns is irrelevant.

    quote:

    And one ns is how many clock cycles on a 2GHz chip? And how many of a 3GHz chip? Think this through . . .

    Yeah, of course, with 1 ns of latency a 3GHz chip will waste more clock cycles than a 2GHz chip. That's obvious. But they will both lose exactly 1 ns worth of execution, and that's what matters - not the number of clock cycles.
    If they both perform equally well despite the clock speed difference, then adding 1 ns of latency to both will have exactly the same impact on both. Yes, the 3GHz chip will lose more clock cycles, but the 2GHz chip will (if we stick with the assumption of similar performance) have a higher IPC, and so waste the same amount of actual work.

    If you like, look at the Athlon 64 and the P4.
    If both CPUs waste one clock cycle, then the A64 takes the bigger hit, because of its higher IPC.
    If both CPUs waste one ns, it doesn't change anything. True, the P4 loses the most cycles, but as I said before, the A64 loses more *per* cycle. The net result is that they both lose, wait for it, *one* nanosecond's worth of execution (the quick calculation below makes this concrete).
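
    A minimal C version of that arithmetic (the 50 ns figure is illustrative):

        #include <stdio.h>

        /* The same 50 ns stall costs a different number of cycles at each
           clock speed, but an identical amount of wall-clock time. */
        int main(void)
        {
            const double latency_ns = 50.0;
            const double clocks_ghz[] = { 2.0, 3.0 };
            for (int i = 0; i < 2; i++)
                printf("%.0f GHz: 50 ns stall = %.0f cycles lost\n",
                       clocks_ghz[i], latency_ns * clocks_ghz[i]);
            return 0;  /* 100 cycles at 2 GHz, 150 at 3 GHz -- same 50 ns */
        }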
