The Pentium 4's Cache

The second way Intel combats the penalties of mis-predicted branches in the Pentium 4's 20-stage pipeline is with what it calls the Execution Trace Cache. We mentioned the Trace Cache when we first had a look at the architecture, but we're finally able to provide you with some more information regarding the L1 cache and how Intel is positioning it.

First of all, let's take a quick look at what the Execution Trace Cache does:

The decoder of any x86 CPU (the logic that takes the fetched instructions and decodes them into a form understandable by the execution units) has one of the highest gate counts of any piece of logic on the chip. This translates into quite a bit of time being spent in the decoding stage when preparing to process an instruction, either for the first time or after a branch mis-prediction.

The Execution Trace Cache acts as a middle-man between the decoding stage and the first stage of execution after decoding has completed. The trace cache essentially caches decoded micro-ops (the instructions after they have been fetched and decoded, thus ready for execution) so that instead of going through the fetching and decoding process all over again when an instruction it has already decoded comes up for execution, the Pentium 4 can just go straight to the trace cache, retrieve the decoded micro-ops and begin execution.
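To make the idea concrete, here is a minimal sketch of a cache of decoded micro-ops. It is a toy model, not Intel's design: the class name `ToyTraceCache`, the address-keyed lookup, the placeholder decoder and the least-recently-used eviction are all assumptions made for illustration. It shows only the basic payoff, which is that an instruction decoded once does not have to be decoded again while it remains cached.

```python
from collections import OrderedDict


class ToyTraceCache:
    """Toy model of a decoded micro-op cache (hypothetical, not Intel's design)."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of cached entries
        self.entries = OrderedDict()  # address -> list of decoded micro-ops
        self.decode_count = 0         # how many times the slow decoder ran

    def _decode(self, instruction):
        # Stand-in for the expensive x86 decode step.
        self.decode_count += 1
        return ["uop:" + instruction]

    def fetch(self, address, instruction):
        # Hit: return the cached micro-ops and skip the decoder entirely.
        if address in self.entries:
            self.entries.move_to_end(address)
            return self.entries[address]
        # Miss: decode once, then cache the result for later reuse.
        uops = self._decode(instruction)
        self.entries[address] = uops
        if len(self.entries) > self.capacity:
            # Assumed policy: evict the least recently used entry.
            self.entries.popitem(last=False)
        return uops


# A three-instruction loop body executed twice.
cache = ToyTraceCache(capacity=4)
program = [(0x10, "add"), (0x14, "cmp"), (0x18, "jne")]
for _ in range(2):
    for address, instruction in program:
        cache.fetch(address, instruction)

decode_count = cache.decode_count  # decoder runs on the first pass only
```

In this sketch the second pass through the loop is served entirely from the cache, so the decoder runs three times rather than six.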

This helps to hide the penalties associated with a mis-predicted branch later on in the Pentium 4's 20-stage pipeline. Another benefit of the trace cache is that it caches the micro-ops in the predicted path of execution, meaning that if the Pentium 4 fetches 3 instructions from the trace cache, they are already presented in their order of execution. This does add the potential for the cached micro-ops to follow an incorrectly predicted path of execution; however, Intel is confident that these penalties will be minimized because of the prediction algorithms used by the Pentium 4.

Intel is abandoning the common method of defining cache size, at least for the Execution Trace Cache. Instead, they are stating that the trace cache can cache approximately 12K micro-ops. Since we don't have any other architectures quite like this, we can't really offer a comparison for that number. In addition to the L1 Execution Trace Cache, the Pentium 4 features an 8KB L1 Data Cache. If you're big on processor specs, you'll realize that this is smaller than the Pentium III's current 16KB L1 Data Cache. According to Intel, this size sacrifice was made in order to achieve a better price/performance ratio for the Pentium 4, weighing the cost of the additional die size/transistors against the performance an additional 8KB would offer.

The Pentium 4 will also feature a 256KB L2 cache running at the processor's core clock speed. This L2 cache will feature a much higher bandwidth than the current 256KB L2 on the Pentium III, partly because the Pentium 4 will be running at a higher clock speed, but also because data is transferred on every clock as opposed to every other clock with the Pentium III's cache.

In terms of the bandwidth available to and from the L2 cache, a hypothetical Pentium III clocked at 1.5GHz would have 24GB/s of available bandwidth to and from the L2 cache, while a Pentium 4 clocked at the same speed would have 48GB/s of available bandwidth because it is able to transfer data on every clock.
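The arithmetic behind those two figures can be sketched as follows, assuming the 256-bit datapath to the L2 cache for both chips and counting 1GB as 10^9 bytes. The function name `l2_bandwidth_gb_s` and its parameters are ours, chosen for illustration.

```python
def l2_bandwidth_gb_s(bus_width_bits, clock_ghz, clocks_per_transfer):
    """Peak L2 bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    # Transfers per nanosecond; bytes per nanosecond equals GB/s.
    transfers_per_ns = clock_ghz / clocks_per_transfer
    return bytes_per_transfer * transfers_per_ns


# Hypothetical 1.5GHz Pentium III: one transfer every other clock.
p3 = l2_bandwidth_gb_s(256, 1.5, clocks_per_transfer=2)

# Pentium 4 at 1.5GHz: one transfer on every clock.
p4 = l2_bandwidth_gb_s(256, 1.5, clocks_per_transfer=1)

print(p3, p4)  # 24.0 48.0
```

With the same bus width and clock speed, halving the number of clocks per transfer is what doubles the peak figure.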

This is one area where the Athlon (Thunderbird) has a disadvantage, because the chip features a 64-bit path to its L2 cache, whereas the Pentium III and Pentium 4 each feature a 256-bit datapath to their L2 caches.

Just as with the Pentium III, all of the Pentium 4's L1 cache (including the Execution Trace Cache) will be duplicated in its L2 cache.
