Cache and Memory Controller Comparison

Now that you know what parts to compare, let's drill a little deeper. Since cache is a major element separating Phenom II from its predecessor, let's start there.

Phenom II, like its predecessor, maintains a 3-cycle 64KB L1 cache. With Nehalem, Intel had to move to a 4-cycle cache, so Phenom II retains the hit rate and performance benefits of a larger, faster L1. The L2 cache latency is where Phenom II and Intel’s architectures really differ.

Phenom II, like the original, has a 512KB L2 cache per core, but the cache is a high latency 15 cycle cache. Compared to the Athlon X2’s 20 cycle L2, Phenom II looks pretty good, but now look at Penryn. Penryn’s 15 cycle L2 is the same speed as Phenom’s L2, but it’s 2-6x larger. Core i7 trumps them all with a very fast 11 cycle L2, although it achieves this by having the smallest L2 cache per core out of the bunch - only 256KB in size.

AMD asserts that Phenom II’s L3 cache is now 2-cycles faster than Phenom’s L3. At 3x the size but with improved access time, Phenom II’s L3 is closer to where it should have been in the first place. Everest measures Phenom II’s L3 as having a 55-cycle latency, while Core i7 has a 35 cycle L3. Sandra puts Core i7 and the original Phenom at 55 cycles, but Phenom II at 71 cycles. I checked with Intel and AMD, and it appears neither application is reporting the correct L3 access latencies for either processor. Intel confirmed Core i7’s L3 as a 42 cycle L3 and I’m still waiting to hear back from AMD on the time to access its cache, but I suspect it will be around 50 cycles.

Processor | L1 Latency | L2 Latency | L3 Latency
AMD Phenom II X4 920 (2.80GHz) | 3 cycles | 15 cycles | AMD won't tell me
AMD Phenom @ 2.8GHz | 3 cycles | 15 cycles | AMD won't tell me
Athlon X2 5400 (2.80GHz) | 3 cycles | 20 cycles | -
Intel Core 2 Quad QX9770 (3.2GHz) | 3 cycles | 15 cycles | -
Intel Core 2 Quad Q9400 (2.66GHz) | 3 cycles | 15 cycles | -
Intel Core i7-965 (3.2GHz) | 4 cycles | 11 cycles | 42 cycles
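
Cycle counts only tell part of the story when the chips run at different clocks, so it can help to convert them to time. What follows is a minimal sketch using the L2 cycle counts and clock speeds listed above; the function name is made up for illustration, and it assumes each L2 runs at the listed core clock.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


# (L2 latency in cycles, core clock in GHz), taken from the table above.
# Assumption: the L2 is clocked at the core frequency shown.
l2_latency = {
    "Phenom II X4 920": (15, 2.8),
    "Athlon X2 5400": (20, 2.8),
    "Core 2 Quad QX9770": (15, 3.2),
    "Core i7-965": (11, 3.2),
}

for name, (cycles, ghz) in l2_latency.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles, ghz):.2f} ns")
```

The L3 figures are left out on purpose: an L3 does not necessarily run at the core clock, so the same conversion may not apply to it.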

Main memory access time is more telling. A trip down memory lane will cost you 107 ns on an original Phenom processor, 100 ns on an Athlon X2, and now only 95 ns on a Phenom II. The 11% improvement in memory access performance over the original Phenom is due to improvements AMD made when it redesigned the memory controller to include support for DDR3.
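
The arithmetic behind those figures is easy to check. Here is a small sketch, with helper names invented for illustration, that computes the relative improvement from the access times quoted above and expresses Phenom II's trip to memory in core cycles. It assumes the 2.8GHz clock of the chips tested.

```python
def percent_improvement(old_ns, new_ns):
    """Relative latency reduction, as a percentage of the old value."""
    return (old_ns - new_ns) / old_ns * 100


def ns_to_cycles(latency_ns, clock_ghz):
    """Number of core clock cycles that elapse during latency_ns."""
    return latency_ns * clock_ghz


# Main memory access times quoted in the text, in nanoseconds.
PHENOM_NS, ATHLON_X2_NS, PHENOM_II_NS = 107, 100, 95

print(f"vs. Phenom:    {percent_improvement(PHENOM_NS, PHENOM_II_NS):.1f}% lower latency")
print(f"vs. Athlon X2: {percent_improvement(ATHLON_X2_NS, PHENOM_II_NS):.1f}% lower latency")

# Assumption: a 2.8GHz core clock, as on the Phenom II X4 920.
print(f"Phenom II memory trip: about {ns_to_cycles(PHENOM_II_NS, 2.8):.0f} cycles")
```

Measured against the original Phenom, the drop from 107 ns to 95 ns is where the 11% figure comes from. Put in cycles, a trip to memory costs hundreds of cycles against 15 for an L2 hit, which is why the cache hierarchy matters so much.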

L2: It’s the New L1

I think I finally get it. When Nehalem launched, I spoke with lead architect Ronak Singhal at great length about its L2 cache being too small. I even made this graph to illustrate my point:

With only 256KB per core, Core i7’s L2 cache was a large step back. Ronak argued that its 11-cycle load latency was more important than size. But it took Phenom II for me to understand why.

The original Phenom suffered because not only did it have very little L2 cache per core (512KB compared to as much as 6MB with Penryn), but it also had a very small L3 cache. Four cores sharing a 2MB L3 cache just wasn’t enough. The problem is AMD was die constrained; Phenom needed more L3 cache but AMD needed to keep the die size manageable to avoid bankruptcy. Architecturally, Phenom was ahead of its time.

If we were to live in the dual-core era forever, Intel had the right design - two cores could easily sit behind one large shared L2 cache. Move to four cores and the shared L2 design stops making sense. In some situations you’ll have cores operating on independent threads with no spatial locality, and for these scenarios each core will need its own L2 cache. In other scenarios you’ll have multiple cores working on the same data, in which case you’ll need a large cache shared by all cores. Again, Phenom was the right quad-core design; it just didn’t have enough cache (not to mention its other shortcomings).

In a way, Intel recognized that Conroe and Penryn were designed to win the dual-core race - over the life of both CPUs, less than 5% of Intel’s desktop shipments were quad-core chips. Intel’s last tick and tock dominated the dual-core market. Nehalem and Westmere, on the other hand, are more interested in winning the multi-core races.

Phenom II addresses the cache deficiency. With a 6MB L3 cache, it has nearly the same size L3 as Core i7. Its L2 caches remain larger than Core i7’s at 512KB per core, but I suspect that’s because AMD didn’t have the time/resources to redesign its cores for Phenom II. It takes 15 cycles to access AMD’s 512KB L2; that’s the same amount of time it takes to access Penryn’s 2x6MB L2. I’ll gladly wait 15 cycles if I have the hit rate of a 6MB cache, but not for a 512KB cache. AMD too will pursue a faster L2; that will most likely come in 2011 with Bulldozer (Orochi and Llano CPUs).

With a very large L3 cache, it no longer makes sense to have a large L2. Instead the L2 needs to be as fast as possible, acting as spillover from L1. Look at what happened to L1 cache sizes as CPUs got wider and faster. The L1 cache grew from 1KB to 8KB to 16KB, and eventually up to 32KB and 64KB in today’s designs. However, L1 sizes haven’t increased beyond that point; instead we saw L2 caches grow and grow. Eventually they too hit a stopping point; for AMD that was Phenom, and for Intel that was Core i7.
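
One way to put numbers on the size versus speed argument is the textbook average memory access time (AMAT) model. The sketch below is not how Intel or AMD size their caches; it is a back-of-the-envelope illustration. The 11 and 15 cycle L2 latencies and the 42 cycle L3 latency come from the figures earlier, the 266 cycle memory trip is 95 ns at 2.8GHz, and the miss rates are hypothetical placeholders, not measurements.

```python
def amat(l1_lat, l1_miss, l2_lat, l2_miss, next_level_penalty):
    """Average access time in cycles for L1 + L2 in front of a next level
    (L3 or memory) that costs next_level_penalty cycles per L2 miss."""
    return l1_lat + l1_miss * (l2_lat + l2_miss * next_level_penalty)


def break_even_miss_reduction(extra_l2_cycles, next_level_penalty):
    """How much a bigger, slower L2 must cut its miss rate (as a fraction)
    just to cancel out the extra latency it adds to every L2 access."""
    return extra_l2_cycles / next_level_penalty


MEMORY_CYCLES = 266   # about 95 ns at 2.8GHz
L3_CYCLES = 42        # Core i7's L3 latency as confirmed by Intel
L3_MISS = 0.20        # HYPOTHETICAL L3 miss rate, for illustration only

extra = 15 - 11       # slower L2 minus faster L2, in cycles

# Without an L3, every L2 miss goes straight to memory.
penalty_without_l3 = MEMORY_CYCLES
# With an L3, an L2 miss pays the L3 latency plus the occasional memory trip.
penalty_with_l3 = L3_CYCLES + L3_MISS * MEMORY_CYCLES

for label, penalty in [("no L3", penalty_without_l3), ("with L3", penalty_with_l3)]:
    needed = break_even_miss_reduction(extra, penalty) * 100
    print(f"{label}: slower L2 must cut misses by {needed:.1f} points to break even")

# Sanity check with HYPOTHETICAL miss rates: at the break-even point the
# two designs cost the same number of cycles on average.
fast = amat(3, 0.05, 11, 0.40, penalty_with_l3)
slow = amat(3, 0.05, 15, 0.40 - extra / penalty_with_l3, penalty_with_l3)
print(f"fast L2: {fast:.3f} cycles, slow L2 at break-even: {slow:.3f} cycles")
```

Under these assumptions, four extra L2 cycles are paid back by a miss-rate reduction of roughly 1.5 percentage points when misses go to memory, but it takes roughly 4.2 points once an L3 sits behind the L2. The cheaper an L2 miss becomes, the harder it is for a larger, slower L2 to justify itself.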

With the number of cores growing, we need a large cache shared between all of the cores. Imagine a 12-core processor; would it have a massive 36MB shared L2 cache? Definitely not. It’d be too slow for starters, and the penalty for not finding something in L1 would be tremendous. Remember the point of the memory hierarchy: to hide latency between the software and the processor. A pyramid doesn’t work if the base fattens out too quickly. In the future, as we move to four, eight and more cores, L2 caches will have to be motherly figures to a core’s L1, feeding them individually, rather than a mess hall to feed everyone. That role will fall to the L3 cache. Carrying that further, we may even see future CPUs with more cores add a fourth level of cache.

With the role of the L2 cache redefined from service-all to service-one, it makes sense for it to be small and fast. The original Phenom had the right idea; it just needed a larger L3. Core i7 perfected that idea, and Phenom II took a step towards that. Cache sizes must continue to grow, but as they do, the number of levels of cache must increase as well to avoid a single, large penalty being paid as you go from one level of cache to the next.


  • poohbear - Thursday, January 08, 2009

    this is fantastic news! and just when i was about to upgrade from my ancient s939 system to a C2D system, seems i might be sticking to AMD after all! thanks for review!
  • PrezWeezy - Thursday, January 08, 2009

    For less than $20 more the i7 920 looks like it wins in every single test by a fair margin, doesn't seem like this is really all that competitive, considering the i7 is still in the "high price" phase. I can't believe it wont drop to the $275 mark rather soon which would put the XII 940 back to the same position the original Phenom was, too little too late.
  • Roland00 - Thursday, January 08, 2009

    More expensive Motherboard+More expensive Ram makes i7 about 400 dollars more in cost
  • strikeback03 - Friday, January 09, 2009

    How you figure? By the chart on page 4, it is less than $200. Even if you go for one of the $300 motherboards, you won't see a $400 difference.

    When I built my current system, E6600/P965/2GB DDR2 cost me over $600, and that was considered a decent mid-range system. As my primary use of computing power is Photoshop, I would definitely go for i7 even if cheaper motherboards do not become available.
  • Roland00 - Friday, January 09, 2009

    It isn't quite 400 but here

    motherboard p45 vs x58 most x58 are 300 vs 100-120 for p45,
    Ram, 6 gb of ddr3 is about 200, vs 50 or 60 for 6gb of ddr2.
    For a nonstock cpu cooler you are talking 60 to 70 bucks with the i7 for it is a new socket and their is very few products for it. You can get a good cpu cooler for intel quad for 30 to 40 dollars.

    Savings about 200+140+30=370

    If you get things on sale you might be able to find 6gb ram for 150, cpu for 230, and you may be able to get the motherboard cheaper if you get one of the basic versions but you are still talking about 300 more.
    I am not saying i7 isn't worth the extra money, it is still new tech but it does show beneficial gains (on encoding, minimum frame rate on games, and overclocking) but right now the motherboards and ram is expensive.
  • BSMonitor - Friday, January 09, 2009

    You are rounding up on i7 side and rounding down on the core 2 side. It is not $50 for 6GB of DDR2. It is not only $300 for x58. I have seen them for $200. I have seen 6GB DDR3 kits for $140 too.

    It's like we are talking about bare entry into Core 2 and Phenom II, but enthusiest for i7. Why does one need 6GB to entry into i7? 3GB would be reasonable and ~$70-80.

    Phenom II is cute yes, but nothing to jump on.
  • Roland00 - Friday, January 09, 2009

    I am rounding up on the I7 side for like most people I buy things with tax (for my state charges tax on internet transactions.) In addition many people buy their equipment in stores such as fry's, microcenter, comp usa, etc.

    370 times 8.25% tax rate (my area's sales tax) is...400 dollars and 52 cents


    And no I am not overpricing the ram or similar equipment. Go to Fry's, Microcenter, or some other store and you will see the prices I listed or much higher.


    Regardless you seem to be missing the point, the original poster I was responding to was saying i7 was only 25 dollars higher, and I said that was wrong for you have to figure in the platform costs.
  • PrezWeezy - Friday, January 09, 2009

    You were right, I had forgotten about the new socket and DD3. Even so, using the parts Anandtech used, the i7 is about $187 more expensive than the PII (pun totally intended). The C2D with a Q9400 though is only $44 cheaper than the i7. Almost all of that has to do with the motherboards used here, and I'm sure you could find a combo of motherboard/CPU that would bring the price closer but that's besides the point.
  • calyth - Thursday, January 08, 2009

    "In theory, the AMD design made sense. If you were running a single threaded application, the core that your thread was active on would run at full speed, while the remaining three cores would run at a much lower speed. AMD included this functionality under the Cool 'n' Quiet umbrella. In practice however, Phenom's Cool 'n' Quiet was quite flawed. Vista has a nasty habit of bouncing threads around from one core to the next, which could result in the following phenomenon (no pun intended): when running a single-threaded application, the thread would run on a single core which would tell Vista that it needed to run at full speed. Vista would then move the thread to the next core, which was running at half-speed; now the thread is running on a core that's half the speed as the original core it started out on."

    Anand, read that sentence again.

    The problem isn't AMD designing a chip with broken CnQ. The problem is that Microsoft, after so many years, still can't write a scheduler. The problem persists on XP too. The thread that handles the mouse would rev up, causing the chip to switch p-state. Switching p-states takes time, and because of exclusive caching on AMD chips, when the scheduler puts the same thread on different cores, it causes the L1 & L2 to be ineffective.

    I have trouble in WinXP with CnQ on if I move my mouse, but not surprisingly, the same Phenom chip works like a chap in Linux. Because the scheduler isn't an idiot, and 1GHz is more than enough to handle mouse input.

    AMD erred in fixing a software problem in hardware. Independent p-states saved some power if only a single thread needed the speed.
  • Zak - Thursday, January 08, 2009

    Well, I hope AMD won't lose the momentum, because right now there isn't that much to celebrate: they've barely caught up with Intel's 2 years old CPU line:(

