Cache and Memory Controller Comparison

Now that you know what parts to compare, let's drill a little deeper. Since cache is a major element separating Phenom II from its predecessor, let's start there.

Phenom II, like its predecessor, maintains a 3-cycle, 64KB L1 cache. With Nehalem, Intel had to move to a 4-cycle L1, so Phenom II retains the hit rate and performance benefits of a larger, faster L1. L2 cache latency is where Phenom II and Intel's architectures really differ.

Phenom II, like the original, has a 512KB L2 cache per core, but it's a high-latency cache at 15 cycles. Compared to the Athlon X2's 20-cycle L2, Phenom II looks pretty good, but now look at Penryn. Penryn's 15-cycle L2 is the same speed as Phenom II's, but it's 2 - 6x larger. Core i7 trumps them all with a very fast 11-cycle L2, although it achieves this by having the smallest L2 cache per core of the bunch - only 256KB in size.

AMD asserts that Phenom II's L3 cache is now two cycles faster than Phenom's L3. At 3x the size but with improved access time, Phenom II's L3 is closer to where it should have been in the first place. Everest measures Phenom II's L3 as having a 55-cycle latency, while Core i7's L3 comes in at 35 cycles. Sandra puts Core i7 and the original Phenom at 55 cycles, but Phenom II at 71 cycles. I checked with Intel and AMD, and it appears neither application is reporting the correct L3 access latency for either processor. Intel confirmed Core i7's L3 at 42 cycles; I'm still waiting to hear back from AMD, but I suspect its L3 will come in at around 50 cycles.

Processor                            L1 Latency   L2 Latency   L3 Latency
AMD Phenom II X4 920 (2.80GHz)       3 cycles     15 cycles    AMD won't tell me
AMD Phenom @ 2.8GHz                  3 cycles     15 cycles    AMD won't tell me
Athlon X2 5400 (2.80GHz)             3 cycles     20 cycles    -
Intel Core 2 Quad QX9770 (3.2GHz)    3 cycles     15 cycles    -
Intel Core 2 Quad Q9400 (2.66GHz)    3 cycles     15 cycles    -
Intel Core i7-965 (3.2GHz)           4 cycles     11 cycles    42 cycles
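Cycle counts alone don't tell the whole story, since the chips in the table run at different clocks. As a quick back-of-the-envelope sketch (the clocks and latencies come from the table above; the conversion itself is mine), you can put L2 latency in common wall-clock units:

```python
# Convert L2 latency from core cycles to nanoseconds: at f GHz,
# one cycle takes 1/f ns, so a k-cycle access costs k/f ns.
chips = {
    # name: (clock in GHz, L2 latency in cycles) -- from the table above
    "Phenom II X4 920": (2.80, 15),
    "Core 2 QX9770":    (3.20, 15),
    "Core i7-965":      (3.20, 11),
}

for name, (ghz, cycles) in chips.items():
    print(f"{name}: {cycles} cycles @ {ghz}GHz = {cycles / ghz:.2f} ns")
```

Core i7's clock advantage compounds its cycle-count advantage: 11 cycles at 3.2GHz is roughly 3.4 ns, versus about 5.4 ns for Phenom II's 15 cycles at 2.8GHz.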

Main memory access time is more telling. A trip down memory lane will cost you 107 ns on an original Phenom processor, 100 ns on an Athlon X2, and now only 95 ns on a Phenom II. The 11% improvement in memory access performance is due to improvements AMD made when it redesigned the memory controller to include support for DDR3.

L2: It’s the New L1

I think I finally get it. When Nehalem launched I spoke with lead architect Ronak Singal at great length about its L2 cache being too small. I even made this graph to illustrate my point:



With only 256KB per core, Core i7’s L2 cache was a large step back. Ronak argued that its 11-cycle load latency was more important than size. But it took Phenom II for me to understand why.

The original Phenom suffered because not only did it have very little L2 cache per core (512KB compared to as much as 6MB with Penryn), but it also had a very small L3 cache. Four cores sharing a 2MB L3 cache just wasn’t enough. The problem is AMD was die constrained; Phenom needed more L3 cache but AMD needed to keep the die size manageable to avoid bankruptcy. Architecturally, Phenom was ahead of its time.

If we were to live in the dual-core era forever, Intel had the right design: two cores could easily sit behind one large shared L2 cache. Move to four cores, though, and the shared L2 design stops making sense. In some situations you'll have cores operating on independent threads with no data shared between them, and for these scenarios each core will need its own L2 cache. In other scenarios you'll have multiple cores working on the same data, in which case you'll need a large cache shared by all cores. Again, Phenom was the right quad-core design; it just didn't have enough cache (not to mention its other shortcomings).

In a way, Intel recognized that Conroe and Penryn were designed to win the dual-core race - over the life of both CPUs less than 5% of its desktop shipments were quad-core chips. Intel’s last tick and tock dominated the dual-core market. Nehalem and Westmere on the other hand are more interested in winning the multi-core races.

Phenom II addresses the cache deficiency. With a 6MB L3 cache, it has nearly the same size L3 as Core i7. The L2 caches remain larger at 512KB per core, but I suspect that's because AMD didn't have the time/resources to redesign its cores for Phenom II. It takes 15 cycles to access AMD's 512KB L2; that's the same amount of time it takes to access Penryn's 2x6MB L2. I'll gladly wait 15 cycles if I get the hit rate of a 6MB cache, but not for a 512KB cache. AMD will pursue a faster L2 as well; that will most likely come in 2011 with Bulldozer (the Orochi and Llano CPUs).

With a very large L3 cache, it no longer makes sense to have a large L2. Instead the L2 needs to be as fast as possible, acting as a spillover from L1. Look at what happened to L1 cache sizes as CPUs got wider and faster: the L1 grew from 1KB to 8KB, then 16KB, and eventually up to the 32KB and 64KB of today's designs. L1 sizes haven't increased beyond that point; instead we saw L2 caches grow and grow. Eventually they too hit a stopping point: for AMD that was Phenom, and for Intel that was Core i7.

With the number of cores growing, we need a large cache shared between all of the cores. Imagine a 12-core processor; would it have a massive 36MB shared L2 cache? Definitely not. It'd be too slow for starters, and the penalty for missing in L1 would be tremendous. Remember the point of the memory hierarchy: to hide latency between the software and the processor. A pyramid doesn't work if the base fattens out too quickly. In the future, as we move to four, eight and more cores, L2 caches will have to be motherly figures to each core's L1, feeding them individually, rather than a mess hall feeding everyone. That role will fall to the L3 cache. Carrying that further, we may even see future CPUs with more cores add a fourth level of cache.

With the role of the L2 cache redefined from service-all to service-one, it makes sense for it to be small and fast. The original Phenom had the right idea; it just needed a larger L3. Core i7 perfected that idea, and Phenom II took a step towards it. Cache sizes must continue to grow, but as they do, the number of levels of cache must increase as well, to avoid a single, large penalty being paid as you go from one level of cache to the next.
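This tradeoff can be made concrete with the standard average memory access time (AMAT) recurrence, where each level's latency is paid only by the accesses that reach it. A minimal sketch; the hit rates and latencies below are invented for illustration, not measurements from any of these CPUs:

```python
def amat(levels, mem_ns):
    """Average memory access time for a cache hierarchy.

    levels: list of (latency_ns, hit_rate) ordered from L1 outward.
    mem_ns: main memory latency, paid by accesses that miss every level.
    """
    total, reach = 0.0, 1.0          # 'reach' = fraction of accesses arriving at this level
    for latency, hit_rate in levels:
        total += reach * latency     # everyone who reaches a level pays its latency
        reach *= (1.0 - hit_rate)    # only its misses continue outward
    return total + reach * mem_ns

# Two hypothetical quad-core designs (all numbers made up for illustration):
huge_shared_l2 = amat([(1.0, 0.90), (10.0, 0.97)], 100.0)               # one big, slow shared L2
fast_l2_plus_l3 = amat([(1.0, 0.90), (3.5, 0.80), (13.0, 0.92)], 100.0) # small fast L2, big L3

print(f"huge shared L2: {huge_shared_l2:.2f} ns")   # 2.30 ns
print(f"fast L2 + L3:   {fast_l2_plus_l3:.2f} ns")  # 1.77 ns
```

Even though the small L2 misses far more often, the cheap spillover into a big L3 keeps the average lower than one monolithic, slow L2 - which is exactly the argument for the L2 becoming a fast private buffer rather than the capacity tier.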


93 Comments


  • Proteusza - Thursday, January 8, 2009 - link

    No, I said I hoped it could at least compete with a Core 2 Duo.

If it's too much to hope that a two-year-younger, 758 million transistor CPU could compete clock for clock with a first-gen Core 2 Duo, then AMD has truly fallen to new lows. It has more transistors than i7, and yet it can't compete with a Core 2 Duo, let alone i7. What happened to the sheer brilliance of the A64 days? It could beat the pants off any Pentium 4. Now the best AMD can do is barely acceptable performance at a higher clockspeed than Intel needs, all the while using a larger die than Intel's.

This keeps them in the game, but it means I won't bother buying one. Why should I?
  • coldpower27 - Thursday, January 8, 2009 - link

Those days are over; their success was also contingent on Intel stumbling a bit, and it did that with the P4. With Intel firing on all cylinders, "acceptable" is just where AMD is supposed to be.
  • Denithor - Thursday, January 8, 2009 - link

It wasn't so much of a stumble, more like a face-plant into a cactus. Wearing shorts and a t-shirt.

Intel fell flat with Netburst and refused to give up on it for far too long (Willamette -> Northwood -> Prescott -> Cedar Mill). I mean, the early days of P4 were horrible - it was outperformed by lower-clocked P3 chips until its clockspeed finally climbed high enough to overcome the architectural deficit.

    Into this mix AMD tossed a grenade, the A64 - followed by the X2 on the same architecture. With its IMC and superior architecture there was no way Netburst could compete. Unfortunately, AMD hasn't really done anything since then to follow through. And even today's PII isn't going to change things dramatically for them, they're still playing second fiddle to Intel's products (which means they're forced into following Intel's lead in the pricing game).
  • JKflipflop98 - Thursday, January 8, 2009 - link

    Damn it feels good to be a gangsta ;)
  • Kob - Thursday, January 8, 2009 - link

Thanks for the meaningful comparison across such a wide range of processors. However, I wonder why the benchmarks are so heavily tilted toward the graphics/gaming world. I think many in the SOHO world would benefit from test results in other common applications/fields such as VS compilation, AutoCAD manipulation, encryption, simple database indexing, and even a chess game.
  • ThePooBurner - Thursday, January 8, 2009 - link

In the article you compare this to the 4800 series of GPUs. I actually see this as the 3800 series. It works out perfectly. The 2900 came along way late and didn't deliver, used too much power, didn't overclock well, and was just all around a loser of a card. Then the 3800 came along. Basically the same thing, but with a die shrink that allowed it to outstretch its predecessor, just barely. It was the first card where they got the mix right. After that came the 4800 with a big boost and even more competition. This is what I now see happening with the CPU line. The Phenom 1 was the 2900, and the Phenom II is the 3800. Getting the mix right and getting ready for the next big swing. But, as you point out, Intel isn't likely to sit back, and we can all agree that they are a much different competitor than NVIDIA is.
  • Denithor - Thursday, January 8, 2009 - link

    ...and just like the 3800 series, it falls just short of the target.

Remember? The 3870 couldn't quite catch the 8800GT and the 3850 couldn't quite match the 9600GT. While they weren't bad cards, they unfortunately also didn't give AMD the muscle to set pricing where they wanted it; instead they had to put pricing in line with how nVidia priced their offerings.

    Same is happening here, with AMD pricing their chips in line with Intel's Q9400/Q9300 processors. And they may have to drop those prices if Intel cuts the Q9550/Q9400 down another peg.
  • Griswold - Friday, January 9, 2009 - link

Rubbish theory. First of all, these cards were actually available, whereas the 8800GT was in extremely short supply and thus much more expensive for many weeks, even into 2008, because it literally made everything else nvidia had to offer obsolete. I couldn't get one and settled for a 3870 for that reason.

Secondly, the 9600GT? Do you realize how much later that card came to the game than the 3850? It hit the market near the end of February. That's almost three months after the launch of the 38xx parts.

    The whole comparison is silly.
  • ThePooBurner - Friday, January 9, 2009 - link

The 3800 line wasn't ever meant to beat the 8800 line. It just wasn't in the cards. Its purpose was to get the reins back under control. Cut the power and get back to a decent power/performance ratio, as well as deliver power equal to the previous generation in a smaller package to help improve margins. It was a stage setter. From the first time I read about it I knew that it was just a setup for something more, something "bigger and better" that was going to come next. And then the 4800 came along and delivered the goods. I get this same feeling reading about the Phenom II. It's setting the stage. Getting about the same power (a small bump, just like the 3870 over the 2900) in a smaller package, a better power/performance ratio, etc. This is simply a stage setting for the next big thing. The next CPU from AMD after this one is going to deliver. I'm sure of it.
  • Kougar - Thursday, January 8, 2009 - link

If you tried Everest and Sandra, what about CPU-Z's cache latency tool? It's not part of the CPU-Z package anymore, but they still offer it. Link: http://www.cpuid.com/download/latency.zip

I thought this tool was very accurate, or is this not the case? It even detected the disabled L3 cache on a Northwood that turned out to be a rebadged Gallatin CPU.
