Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Futureby Anand Lal Shimpi on November 19, 2008 8:00 PM EST
- Posted in
Another Part? Oh there will be more
In an unexpected turn of events I found myself deep in conversation with many Intel engineers as well as Pat Gelsinger himself about the design choices made in Nehalem. At the same time, Intel just released its 2009 roadmap which outlined some of the lesser known details of the mainstream LGA-1156 Nehalem derivatives.
I hadn’t planned on my next Nehalem update being about caches and mainstream parts, but here we go. For further reading I'd suggest our first two Nehalem articles and the original Nehalem architecture piece.
Nehalem’s Cache: More Controversial Than You’d Think
I spoke with Ronak Singhal, Chief Architect on Nehalem, at Intel’s Core i7 launch event last week in San Francisco and I said to him: “I think you got the cache sizes wrong on Nehalem”. I must be losing my shyness.
He thought I was talking about the L3 cache and asked if I meant it needed to be bigger, and I clarified that I was talking about the anemic 256KB L2 per core.
We haven’t seen a high end Intel processor with only 256KB L2 per core since Willamette, the first Pentium 4. Since then Intel has been on a steady ramp upwards as far as cache sizes go. I made a graph of L2 cache size per core of all of the major high end Intel cores for the past decade:
For the most part we’ve got a linear trend, there are a few outliers but you can see that earlier in 2008 you’d expect Intel CPUs to have around 2 - 3MB of L2 cache per core. Now look at the lower right of the chart, see the little orange outlier? Yeah, that’s the Core i7 with its 256KB L2 cache per core, it’s like 2002 - 2007 never happened.
If we look at total on-chip cache size however (L2 + L3), the situation is very different:
Now we’ve got an exponential growth of cache size, not linear, and all of the sudden the Core i7 conforms to societal norms. To understand why, we have to look at what happened around 2005 - 2006: Intel started shipping dual-core CPUs. As core count went up, so did the total amount of cache per chip. Dual core CPUs quickly started shipping with 2MB and 4MB of cache per chip and the outgoing 45nm quad-core Penryns had 12MB of L2 cache on a single package.
The move to multi-core chip designs meant that the focus was no longer on feeding the individual core, but making sure all of the cores on the chip were taken care of. It’s all so very socialist (oh no! ;) ).
Nehalem was designed to be a quad-core product, but also one that’s able to scale up to 8 cores and down to 2 cores. Intel believes in this multi-core future so designing for dual-core didn’t make sense as eventually dual-core will go away in desktops, a future that’s still a few years away but a course we’re on nonetheless.
AMD's shift to an all quad-core client roadmap
Intel is pushing the shift to quad-core, much like AMD is. By 2010 all of AMD’s mainstream and enthusiast CPUs will be quad-core with the ultra low end being dual-core, a trend that will continue into 2011. The shift to quad-core makes sense, unfortunately today very few consumer applications benefit from four cores. I hate to keep re-using this same table but it most definitely applies here:
Back when AMD introduced its triple-core Phenom parts I put together a little table illustrating the speedup you get from one, two and four cores in SYSMark 2007:
|SYSMark 2007 Overall
|Intel Celeron 420 (1 core, 512KB, 1.6GHz)
|Intel Celeron E1200 (2 cores, 512KB, 1.6GHz)
|% Increase from 1 to 2 cores
|Intel Core 2 Duo E6750 (2 cores, 4MB, 2.66GHz)
|Intel Core 2 Quad Q6700 (4 cores, 8MB, 2.66GHz)
|% Increase from 2 to 4 cores
Not only are four cores unnecessary for most consumers today, but optimizing a design for four cores by opting for very small, low latency L2 caches and a large, higher latency L3 cache for the chip isn’t going to yield the best desktop performance.
A Nehalem optimized for two cores would have a large L2 cache similar to what we saw happening on the first graph, but one optimized for four or more cores would look like what the Core i7 ended up being. What’s impressive is that Intel, in optimizing for a quad-core design, was still able to ensure that performance either didn’t change at all or improved in applications that aren’t well threaded.
Apparently the L2 cache size was and still is a controversial issue within Intel, many engineers still feel like it is too small for current workloads. The problem with making it larger is not just one of die size, but also one of latency. Intel managed to get Nehalem’s L2 cache down to 10 cycles, the next bump in L2 size would add another 1 - 2 cycles to its latency. At 512KB per core, 20% longer to access the cache was simply unacceptable to the designers.
In fact, going forward there’s no guarantee that the L2 caches will see growth in size, but the focus instead may be on making the L3 cache faster. Right now the 8MB L3 cache takes around 41 cycles to access, but there’s clearly room for improvement - getting a 30 cycle L3 should be within the realm of possibility. I pushed Ronak for more details on how Intel would achieve a lower latency L3, but the best I got was “microarchitectural tweaks”.
As I mentioned before, Ronak wanted the L3 to be bigger on Nehalem; at 8MB that’s only 2MB per core and merely sufficient in his eyes. There are two 32nm products due out in the next 2 years, I suspect that at least one of them will have an even larger L3 to continue the exponential trend I showed in the second chart above.
Could the L2 be larger? Sure. But Ronak and his team ultimately felt that the tradeoff between size/latency was necessary for what Intel’s targets were with Nehalem. And given its 0 - 60% performance increase, clock for clock, over Penryn - I can’t really argue.