Intel Core 2 Extreme QX9650 - Penryn Ticks Ahead
by Anand Lal Shimpi on October 29, 2007 12:13 AM EST
We've never seen Intel with such a strong roadmap before; the company is truly firing on all cylinders and executing with amazing precision. Server, mobile, desktop and even new areas like ultra mobility and graphics all have absolutely wonderful roadmaps to look forward to. The biggest complaint we've had about Intel these days is that they kind of botched the X38 launch. Think back to a couple of years ago: what was our chief complaint then? Probably leaving us with power-hungry, underperforming processors for about 5 years. Today we're looking at a damn good Intel.
Recap: What's a Penryn?
Core is the architecture that shook an industry, and today Intel is officially releasing its first update to it. Prior to Intel's Core architecture, there wasn't much to get excited about when it came to Intel on the desktop.
At the same time, with Penryn Intel is very much the victim of its own success. How do you follow up such a tremendous splash with anything but equal greatness? AMD is close but still has yet to produce a response to Core 2, much less Penryn, and thus Intel's biggest competition today is itself.
In January 2007 Intel first showed off its 45nm high-K + metal gate transistors, a dramatic departure from Intel's current 65nm transistors not only in size and switching speed but also in actual composition. If you remember back to the days of the original P6 processors, with a smaller transistor we saw tremendous improvements in die size, power consumption and performance. These days, such dramatic improvements are much harder to come by given that we're already dealing with such small transistor feature sizes. Gone are the days of the free lunch with each die shrink.
We've gone over the technical details of Intel's high-K + metal gate enhancements, but the end result is that at the same clock speed you can expect dramatic reductions in power. Alternatively, at the same power levels, you can achieve much higher switching rates and thus higher clock speeds.
The core architecture of Penryn remains unchanged from Conroe; with the smaller transistors Intel is able to fit in a few new features and more cache on the chip while still maintaining a smaller die size. Where each dual-core Conroe die measured 143 mm^2, Penryn is merely 107 mm^2 despite having 50% more cache (6MB vs. 4MB). Obviously the quad-core chips double the overall area, but you get the point.
Intel also uses a lot more of these new 45nm transistors than before; while a dual-core Conroe was made up of 291 million transistors, the comparable Penryn weighs in at 410 million (582M vs. 820M for quad-core variants). You're getting 40% more transistors and 50% more cache in a 25% smaller package; the smaller die is obviously what matters most to Intel, as it helps reduce costs and drive profits up. So while it may seem generous, the move is purely self-motivated on Intel's part.
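If you want to check the percentages yourself, the arithmetic is quick. The sketch below only uses the dual-core figures quoted above; the variable names and the density calculation are our own additions for illustration, not anything Intel publishes in this form.

```python
# Dual-core figures quoted in the article (die area in mm^2,
# transistors in millions, L2 cache in MB).
conroe = {"die_mm2": 143, "transistors_m": 291, "l2_mb": 4}
penryn = {"die_mm2": 107, "transistors_m": 410, "l2_mb": 6}


def pct_change(old, new):
    """Percentage change from old to new (negative means smaller)."""
    return (new - old) / old * 100


die_change = pct_change(conroe["die_mm2"], penryn["die_mm2"])
transistor_change = pct_change(conroe["transistors_m"], penryn["transistors_m"])
cache_change = pct_change(conroe["l2_mb"], penryn["l2_mb"])

# Illustrative extra: average transistor density in millions per mm^2.
conroe_density = conroe["transistors_m"] / conroe["die_mm2"]
penryn_density = penryn["transistors_m"] / penryn["die_mm2"]

print(f"die area:    {die_change:+.1f}%")
print(f"transistors: {transistor_change:+.1f}%")
print(f"L2 cache:    {cache_change:+.1f}%")
print(f"density:     {conroe_density:.2f} -> {penryn_density:.2f} M/mm^2")
```

The rounded results line up with the 25%, 40% and 50% figures in the text, and the average density comes out a bit under twice Conroe's.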
The larger cache is a bit different than what we've seen in Conroe. While Conroe's cache is a 4MB 16-way set-associative L2, the 6MB Penryn cache is 24-way set-associative, designed to improve hit rates and keep latency manageable in an already large cache. Intel hasn't revealed whether Penryn's prefetchers have been adjusted to help populate its larger cache any better. As we saw in our original Penryn preview, Penryn's cache performance remains unchanged; latencies in our final stepping are identical to Conroe.
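One way to see why a 6MB cache pairs naturally with 24 ways is to count sets. The sketch below assumes 64-byte cache lines, which is our assumption and not a figure from Intel's Penryn disclosures discussed here; `num_sets` is a hypothetical helper written for this example.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, not stated in the article


def num_sets(size_mb, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache of the given size."""
    size_bytes = size_mb * 1024 * 1024
    return size_bytes // (ways * line_bytes)


conroe_sets = num_sets(4, 16)  # 4MB, 16-way
penryn_sets = num_sets(6, 24)  # 6MB, 24-way

print(conroe_sets, penryn_sets)
```

Under that assumption both caches have the same number of sets, so the extra 2MB shows up as eight more ways per set rather than as more sets. That would keep the set index the same width as Conroe's, which may be part of how latency stays in check.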
The cache enhancements are by far the biggest consumer of those extra transistors in Penryn, but believe it or not, they aren't responsible for the biggest performance boost. Intel has been fairly steady in adding new instructions to the x86 ISA and Penryn continues the trend with the addition of SSE4. Penryn gets 47 new instructions that make up the first implementation of SSE4; more will come with Nehalem at the end of 2008. We'll talk about SSE4 performance later on in this article, but here are the instructions you get with Penryn:
Penryn also implements a new divider that impacts both integer and floating point divides using a radix-16 algorithm. The algorithm computes more bits of the result of a divide each pass (four bits per iteration vs. two bits in Conroe), decreasing divide latency.
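To illustrate why more bits per pass means lower latency, here is a simplified software model of a digit-by-digit divider. It is only a sketch: real hardware dividers use far more sophisticated digit selection, and nothing here reflects Intel's actual circuit. The function name `divide_radix` and the 32-bit operand width are assumptions made for the example.

```python
def divide_radix(dividend, divisor, bits_per_step, width=32):
    """Unsigned division that retires bits_per_step quotient bits per pass.

    Returns (quotient, remainder, iterations). A toy model, not a
    description of any real hardware divider.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    iterations = 0

    # Walk the dividend from the most significant group of bits down.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & mask)
        # Pick the quotient digit by repeated subtraction; it always
        # stays below 2**bits_per_step because remainder < divisor
        # held before the shift.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1
        quotient = (quotient << bits_per_step) | digit
        iterations += 1

    return quotient, remainder, iterations


# Two bits per pass (radix-4) vs. four bits per pass (radix-16).
print(divide_radix(1000, 7, bits_per_step=2))
print(divide_radix(1000, 7, bits_per_step=4))
```

Both calls produce the same quotient and remainder, but the four-bit version needs half as many passes over a 32-bit operand (8 instead of 16). In hardware each pass costs clock cycles, so halving the pass count is where the latency reduction comes from.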
The faster divider is a very specific enhancement that should manifest itself as a performance boost in 3D and imaging applications.
Penryn's Super Shuffle Engine should also improve SSE2, SSE3 and SSE4 applications that use a lot of shuffle operations. Cache performance is also improved slightly for misaligned stores, which should improve performance, once again, in 3D and imaging applications. Finally, there are some power enhancements made to Penryn, but these are mobile-specific and thus don't apply to any of the desktop variants.