Cachemem – Latency Comparison

We have said countless times that the Pentium 4’s architecture is centered on low latency operation, and it is about time we investigated that statement a little more closely.  Hopefully this will give us some hints on unlocking the performance of the Pentium 4.

Cachemem - Latency Comparison
Time in Cycles (lower is better)

Data Size    Pentium 4 1.7GHz    Athlon 1.33GHz
4KB          1                   1
8KB          2                   4
16KB         2                   4
32KB         19                  4
64KB         19                  4
128KB        19                  20
256KB        25                  20
512KB        281                 202
1MB          289                 202
2MB          292                 229
4MB          292                 238
8MB          295                 241
16MB         297                 244
32MB         297                 245

We have provided both a graph and a table.  They represent the same data: the graph is easier to read at a glance, while the table gives us the numbers we need for the analysis.  The latency data is taken with respect to 4KB accesses.

The first thing to notice is that the Pentium 4’s L1 data cache (8KB) can indeed be accessed in half as many cycles as the Athlon’s L1 data cache (64KB): 2 cycles versus 4.  At the same time we see that once the size of the test data grows well beyond 8KB, the Pentium 4’s latency increases dramatically, from 2 cycles to 19 cycles at the 32KB data point.  In contrast, the Athlon’s larger L1 data cache allows it to maintain a fairly low latency up to 64KB.  At the 128KB data point the Athlon’s latency jumps to 20 cycles and stays there up to 256KB.
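
The jumps described above can be picked out of the table mechanically.  The following is a minimal sketch rather than anything Cachemem itself does: the function name `latency_jumps` and the 3x threshold are assumptions chosen for illustration, and the latency values are copied from the table.

```python
# Latency in cycles, copied from the table above.
SIZES_KB = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
PENTIUM4 = [1, 2, 2, 19, 19, 19, 25, 281, 289, 292, 292, 295, 297, 297]
ATHLON = [1, 4, 4, 4, 4, 20, 20, 202, 202, 229, 238, 241, 244, 245]

def latency_jumps(sizes_kb, cycles, min_ratio=3.0):
    """Return (size_kb, before, after) wherever latency grows by min_ratio or more.

    min_ratio is an arbitrary illustrative threshold, not a measured quantity.
    """
    jumps = []
    for i in range(1, len(cycles)):
        if cycles[i] >= min_ratio * cycles[i - 1]:
            jumps.append((sizes_kb[i], cycles[i - 1], cycles[i]))
    return jumps

print(latency_jumps(SIZES_KB, PENTIUM4))  # [(32, 2, 19), (512, 25, 281)]
print(latency_jumps(SIZES_KB, ATHLON))    # [(8, 1, 4), (128, 4, 20), (512, 20, 202)]
```

With this threshold the Pentium 4 shows a step at 32KB and another at 512KB, while the Athlon shows steps at 8KB, 128KB and 512KB.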

Remember that the Athlon has an exclusive L2 cache subsystem, meaning that its 256KB of L2 is all free for data, while the Pentium 4’s L2 cache also contains a duplicate of the data in its L1 data cache.  Although the L1 is only 8KB in size, this duplication in L2 gives the Pentium 4 a higher latency at 256KB than the Athlon, but only by 5 cycles (25 versus 20).
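
As a rough illustration of the exclusive versus duplicated arrangement, the sketch below counts how much unique data the L1 data cache and L2 can hold together.  It is a simple capacity model only: it ignores associativity and anything else sharing the L2, the function name is hypothetical, and the 256KB figure for the Pentium 4’s L2 is an assumption implied by the discussion rather than stated in the table.

```python
def unique_capacity_kb(l1_kb, l2_kb, exclusive):
    """Unique data (in KB) that the L1 data cache and L2 can hold together."""
    if exclusive:
        # A line lives in L1 or in L2, never in both, so the capacities add up.
        return l1_kb + l2_kb
    # L2 also keeps a copy of whatever is in L1, so L2 bounds the total.
    return l2_kb

# Cache sizes as discussed in the text; the Pentium 4 L2 size is an assumption.
athlon = unique_capacity_kb(64, 256, exclusive=True)
pentium4 = unique_capacity_kb(8, 256, exclusive=False)

print(athlon)    # 320
print(pentium4)  # 256
```

Under this model a 256KB data set sits comfortably inside the Athlon’s combined caches, while on the Pentium 4 it occupies every line of the L2.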

At the 512KB data point both processors have run out of precious cache to depend on, and both fall victim to the perils of their memory buses.  Fortunately for the Athlon, its transition is to DDR SDRAM, which adds 182 cycles of latency (from 20 to 202).  The Pentium 4, however, is penalized by RDRAM’s higher latency and suffers an additional 256 cycles of memory latency (from 25 to 281) simply by moving to 512KB.  Imagine what kind of performance hit is incurred if an application has a footprint just slightly larger than the Pentium 4’s L2 cache.  The performance hit would come from a latency penalty of roughly 256 cycles.
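
The penalty of falling out of cache is just the difference between the 256KB and 512KB rows of the table.  A minimal sketch of that arithmetic follows; the helper name `memory_penalty` is made up for this example, and the values are copied from the table.

```python
# Latency in cycles, copied from the 256KB and 512KB rows of the table.
CYCLES = {
    "Pentium 4": {"256KB": 25, "512KB": 281},
    "Athlon": {"256KB": 20, "512KB": 202},
}

def memory_penalty(cpu):
    """Return (extra cycles, slowdown factor) when moving from 256KB to 512KB."""
    cached = CYCLES[cpu]["256KB"]
    uncached = CYCLES[cpu]["512KB"]
    return uncached - cached, uncached / cached

for cpu in CYCLES:
    extra, factor = memory_penalty(cpu)
    print(f"{cpu}: +{extra} cycles, {factor:.1f}x slower")
# Pentium 4: +256 cycles, 11.2x slower
# Athlon: +182 cycles, 10.1x slower
```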

Remember when we said that there is a good possibility that the 0.13-micron Pentium 4 (codename: Northwood) would have a 512KB L2 cache?  Here is strong evidence showing why a larger L2 cache would help.  By lessening the effects of RDRAM’s relatively high latency, a larger L2 cache could keep the Pentium 4’s performance quite strong in applications that depend more on memory latency than on memory bandwidth.  As we have seen from other recent investigations, these applications are more commonplace today, so a larger L2 cache would definitely help the Pentium 4.  Provided that the 0.13-micron Pentium 4 has a 512KB L2 cache, the memory latency penalty wouldn’t be incurred at the 512KB mark, allowing more of today’s applications to hold their data within the processor’s cache.  A larger L2 cache would help the Athlon as well, but not as much, since it is already paired with a fairly low latency memory subsystem.
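
To see why a higher L2 hit rate matters so much when memory is slow, here is a rough sketch of the usual weighted-average latency calculation.  The two latencies come from the Pentium 4 column of the table; the hit fractions are purely hypothetical and are not measurements of any real application or of Northwood.

```python
def average_latency(l2_hit_fraction, l2_cycles, memory_cycles):
    """Average cycles per access when L1 misses are served by L2 or by memory."""
    miss_fraction = 1.0 - l2_hit_fraction
    return l2_hit_fraction * l2_cycles + miss_fraction * memory_cycles

L2_CYCLES = 19       # Pentium 4 latency at 32KB to 128KB in the table
MEMORY_CYCLES = 281  # Pentium 4 latency at 512KB in the table

# Hypothetical L2 hit fractions, e.g. before and after doubling the cache.
for hit_fraction in (0.90, 0.95):
    cycles = average_latency(hit_fraction, L2_CYCLES, MEMORY_CYCLES)
    print(hit_fraction, round(cycles, 1))
```

Because every miss costs hundreds of cycles, even a few percentage points of extra hits move the average noticeably in this model.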

Speaking of memory subsystems, do take note of the Athlon’s advantage of a little over 50 cycles towards the latter half of the graph (52 to 54 cycles from 4MB up).  But as you are about to see, latency is only half of the picture.
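
The size of the Athlon’s lead once both chips are running out of main memory can be read straight off the table.  The sketch below only subtracts the two columns; the names are illustrative.

```python
# (Pentium 4 cycles, Athlon cycles) for the memory-bound rows of the table.
MEMORY_ROWS = {
    "512KB": (281, 202),
    "1MB": (289, 202),
    "2MB": (292, 229),
    "4MB": (292, 238),
    "8MB": (295, 241),
    "16MB": (297, 244),
    "32MB": (297, 245),
}

# Athlon advantage in cycles at each data size.
advantage = {size: p4 - athlon for size, (p4, athlon) in MEMORY_ROWS.items()}

for size, cycles in advantage.items():
    print(size, cycles)
# 512KB 79, 1MB 87, 2MB 63, 4MB 54, 8MB 54, 16MB 53, 32MB 52
```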
