Original Link: http://www.anandtech.com/show/2199

In our meeting today, Stephen L. Smith (Intel's Vice President and Director of the Digital Enterprise Group) and Pat Gelsinger revealed some very interesting details about their next-generation Penryn family of processors. These processors will be launched in the second half of this year. Details of Nehalem, which will see the light of day in the second half of 2008, were also provided.

Both new processors are based upon the current Intel Core micro architecture and use Intel's newest 45nm Hi-k process technology with its hafnium-based Hi-k + metal gate transistor design. We discussed Intel's 45nm process technology in more detail previously. According to Intel, more than fifteen 45nm Hi-k product designs are in various stages of development, and Intel will have two 45nm manufacturing fabs in production by the end of the year, with a total of four in production by the second half of 2008.

Penryn's Enhanced Core Architecture

The quad core version of Penryn contains 820 million transistors (Kentsfield has 582 million) in two very small dies of 107 mm² each. That makes each die 25 percent smaller than the dies in Intel's current 65nm quad core (143 mm²).
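As a quick sanity check on the die-size claim, the following Python sketch recomputes the shrink from the figures quoted above. It assumes the 143 mm² figure is per die (Kentsfield, like quad core Penryn, is built from two dual core dies) and that the transistors are split evenly between the two dies; the function names are ours.

```python
# Figures quoted in the article.
PENRYN_DIE_MM2 = 107.0
KENTSFIELD_DIE_MM2 = 143.0          # assumed to be a per-die figure
PENRYN_QUAD_TRANSISTORS = 820e6
KENTSFIELD_QUAD_TRANSISTORS = 582e6


def area_reduction(new_mm2, old_mm2):
    """Fractional reduction in die area going from old to new."""
    return 1.0 - new_mm2 / old_mm2


def density_mtr_per_mm2(transistors, die_mm2, dies=2):
    """Millions of transistors per mm^2, assuming an even split across dies."""
    return transistors / 1e6 / (die_mm2 * dies)


shrink = area_reduction(PENRYN_DIE_MM2, KENTSFIELD_DIE_MM2)
penryn_density = density_mtr_per_mm2(PENRYN_QUAD_TRANSISTORS, PENRYN_DIE_MM2)
kentsfield_density = density_mtr_per_mm2(KENTSFIELD_QUAD_TRANSISTORS, KENTSFIELD_DIE_MM2)

print(f"Die area reduction: {shrink:.1%}")
print(f"Density gain: {penryn_density / kentsfield_density:.2f}x")
```

By these numbers each die shrinks by about 25 percent while transistor density rises by nearly 1.9x.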

The new Penryn CPU also has yet another addition to the x86 ISA: Intel Streaming SIMD Extensions 4 (SSE4) instructions. It has also been confirmed that Penryn will deliver higher IPC and higher clock speeds. Intel wouldn't say more than "more than 3 GHz", but considering that the FSB is bumped up to 1600 MHz, 3.2 GHz is likely. However, several Intel people confirmed that if necessary ("depending on what the competition does"), the 45nm CPUs can go quite a bit higher (3.6 GHz is probably a safe estimate, considering how far current Core 2 CPUs are able to overclock).
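The 3.2 GHz guess follows from how the Core 2 bus is clocked. The sketch below assumes the quoted FSB figure is the quad-pumped transfer rate (as on current Core 2 parts), so the underlying bus clock is a quarter of it, and that the core clock is that bus clock times a multiplier. The multipliers of 8 and 9 are our assumption; Intel has not confirmed them.

```python
FSB_PUMP_FACTOR = 4  # the Core 2 FSB transfers data four times per bus clock


def core_clock_mhz(fsb_rating_mhz, multiplier):
    """Core clock from the marketed FSB rating and a CPU multiplier."""
    bus_clock_mhz = fsb_rating_mhz / FSB_PUMP_FACTOR
    return bus_clock_mhz * multiplier


# Assumed multipliers, not confirmed by Intel.
for multiplier in (8, 9):
    print(f"1600 MHz FSB x{multiplier}: {core_clock_mhz(1600, multiplier):.0f} MHz")
```

An 8x multiplier on a 400 MHz bus clock gives 3.2 GHz, and a 9x multiplier would give the 3.6 GHz mentioned above.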

With regard to power, Intel will be introducing what it calls "Deep Power Down Technology", a new lower power state known as C6. The C6 state reduces core voltage to the minimum the process technology allows, shuts down the core clock, and turns off all of the caches. It is the lowest power state that can be attained and will be introduced on Mobile Penryn family processors.

Penryn family processors are supposed to be socket-compatible, meaning that on the desktop we will see them introduced as LGA-775 CPUs. We'd expect that Intel's new lineup of chipsets will be required, but we are not sure if the new chipsets will support the 1600MHz FSB out of the box or if a refresh will be required.

Penryn-based processors also have a much better divider unit, roughly doubling the divider speed using a faster divide technique called Radix 16. Also, the shuffle engine has been improved. Intel's "Super Shuffle Engine" is a 128-bit, single-pass shuffle unit that can perform full-width shuffles in a single cycle, improving performance for SSE2, SSE3 and SSE4 instructions that have shuffle-like operations such as pack, unpack and wider packed shifts.
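To see why a higher radix speeds up division, consider that a radix-r divider produces log2(r) quotient bits per step. The Python sketch below is plain digit-by-digit long division, not the actual hardware algorithm (which Intel did not describe), and it assumes the existing Core 2 divider is a radix-4 design. It only illustrates the step count.

```python
def radix_divide(dividend, divisor, radix, width=64):
    """Unsigned long division that retires one radix-`radix` digit per step.

    `radix` must be a power of two and `dividend` must fit in `width` bits.
    Returns (quotient, remainder, steps).
    """
    bits_per_step = radix.bit_length() - 1      # log2(radix)
    steps = -(-width // bits_per_step)          # ceiling division
    quotient, remainder = 0, 0
    for step in range(steps):
        shift = (steps - 1 - step) * bits_per_step
        next_digit = (dividend >> shift) & (radix - 1)
        remainder = remainder * radix + next_digit
        q_digit = remainder // divisor          # always in range 0..radix-1
        remainder -= q_digit * divisor
        quotient = quotient * radix + q_digit
    return quotient, remainder, steps


for radix in (4, 16):
    q, r, steps = radix_divide(1_000_003, 97, radix)
    print(f"radix-{radix}: quotient={q} remainder={r} steps={steps}")
```

For a 64-bit operand the radix-4 version needs 32 steps and the radix-16 version 16, which matches the "roughly doubling" claim if each step takes a similar amount of time.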

The last improvement is the "Split Load Cache Enhancement", which reduces the penalty for loads of data that is not aligned to cacheline boundaries and therefore straddles two cache lines. Such split loads seem to occur in some SSE-intensive imaging applications.
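A split load is easy to characterize: it is a load whose first and last bytes fall in different cache lines. The Python sketch below assumes 64-byte cache lines (as on current Core 2 parts) and 16-byte SSE loads; the helper names are ours.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def is_split_load(address, size, line=CACHE_LINE_BYTES):
    """True if a load of `size` bytes at `address` spans two cache lines."""
    return address // line != (address + size - 1) // line


def split_fraction(size, line=CACHE_LINE_BYTES):
    """Fraction of possible start offsets within a line that cause a split."""
    splits = sum(is_split_load(offset, size, line) for offset in range(line))
    return splits / line


print(is_split_load(48, 16))   # last byte is offset 63, same line
print(is_split_load(49, 16))   # last byte is offset 64, next line
print(f"{split_fraction(16):.1%} of byte offsets split a 16-byte load")
```

With arbitrary byte alignment, 15 of the 64 possible start offsets split a 16-byte load, so code that walks unaligned image rows can hit this case often.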

The quad core desktop and quad core Xeon products will keep today's TDP ratings of 120W, 80W and 50W (LV). The dual core products will carry TDPs of 40W/65W and 80W.

Better Virtualization

Hardware support for virtualization in Intel's current Core architecture is lackluster, to say the least. To understand why, you need to understand what happens in a "pure" software-based virtualization solution such as VMware ESX 2.5.3 running on older Intel CPUs.

A technique called "ring deprivileging" is used because the guest OS cannot be allowed to run in ring 0, the most privileged ring, where it normally runs; the Virtual Machine Manager (VMM) or hypervisor now runs there. That means that every time a guest application asks for the help of the guest OS, which needs to run instructions that are only available in ring 0, the VMM must intercept that "SYSENTER" and emulate the normal execution. This is quite costly in performance terms.
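The intercept-and-emulate pattern can be sketched in a few lines of Python. This is an idealized toy, not how VMware ESX is implemented: the two-instruction "privileged" set, the ring numbers and all class names are hypothetical, and real x86 is messier than this model suggests.

```python
class PrivilegeFault(Exception):
    """Raised when a privileged instruction runs outside ring 0."""


# Hypothetical privileged instruction set, for illustration only.
PRIVILEGED = {"cli", "sti"}


def execute(instr, ring):
    """Run one instruction; privileged ones fault unless in ring 0."""
    if instr in PRIVILEGED and ring != 0:
        raise PrivilegeFault(instr)


class ToyVMM:
    def __init__(self):
        self.virtual_interrupts_enabled = True  # the guest's view of its CPU
        self.traps = 0

    def emulate(self, instr):
        """Apply the instruction's effect to the guest's virtual state."""
        if instr == "cli":
            self.virtual_interrupts_enabled = False
        elif instr == "sti":
            self.virtual_interrupts_enabled = True

    def run_guest(self, instructions, guest_ring=1):
        """Run a deprivileged guest, trapping and emulating on each fault."""
        for instr in instructions:
            try:
                execute(instr, guest_ring)
            except PrivilegeFault:
                self.traps += 1
                self.emulate(instr)


vmm = ToyVMM()
vmm.run_guest(["mov", "cli", "add", "sti", "cli"])
print(vmm.traps, vmm.virtual_interrupts_enabled)
```

Every privileged instruction costs a round trip through the VMM, which is where the performance penalty comes from.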

Hardware-assisted virtualization does not have that problem: both the OS and the VMM have their own ring 0. Despite this, Intel's hardware-assisted solutions have not delivered any speed boost so far. Intel did not discuss it in detail, but Penryn speeds up virtual machine transition (entry/exit) times by 25% to 75%, and this requires no virtual machine software changes. This might be similar to AMD's nested page technology, although we don't have any clear details at present.

Last but not least, the dual core Penryn processors get a 6 MB shared cache and the quad versions get 12 MB cache. Both new designs will also come with a "higher degree of associativity". Considering the current designs are 16-way set associative, most likely the newer chips will feature a 24-way set associative L2 cache.
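The 24-way guess comes from simple cache geometry: the number of sets is the capacity divided by (ways x line size), and designers generally want that to be a power of two. The Python sketch below assumes 64-byte lines and a 4 MB, 16-way L2 on the current dual core parts; the helper names are ours.

```python
MIB = 1024 * 1024
LINE_BYTES = 64  # assumed cache line size


def num_sets(capacity_bytes, ways, line=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * line)


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


configs = [
    ("current 4 MB, 16-way", 4 * MIB, 16),
    ("6 MB, 16-way", 6 * MIB, 16),
    ("6 MB, 24-way", 6 * MIB, 24),
]
for label, capacity, ways in configs:
    sets = num_sets(capacity, ways)
    print(f"{label}: {sets} sets, power of two: {is_power_of_two(sets)}")
```

Going to 24 ways keeps the set count at 4096, the same as the assumed current design, whereas a 6 MB cache with 16 ways would need 6144 sets, which is awkward to index.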

Intel EDAT: the End of the Multi-core Clock Speed Disadvantage?

Intel also talked about its "Enhanced Dynamic Acceleration Technology" which is effectively integrated overclocking based on load. If you are running a single threaded application (or a multi-threaded application that's predominantly using a single thread), Intel's EDAT can power down the second core and increase the frequency of the working core to maintain the same thermal envelope at all times.

Intel's EDAT could spell the end of the clock speed differential between single and multi-core processors. With all cores running workloads, the multi-core system would be clocked lower, but when some cores are idle the chip could potentially run at the same speed as a single core solution would. Single core designs have pretty much disappeared from roadmaps already, but considering there are still applications that are single threaded in nature and benefit more from clock speed improvements, future processors will offer both options in a single package.
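How much extra clock speed could a powered-down core buy? Intel gave no numbers, so the Python sketch below is only a back-of-the-envelope model under loud assumptions: dynamic power scales as frequency times voltage squared, voltage scales linearly with frequency (so power goes as the cube of frequency), an idle core draws nothing, and the 2.4 GHz base clock is a made-up example.

```python
def boosted_frequency(base_ghz, total_cores, active_cores, exponent=3.0):
    """Per-core clock that keeps total power constant when cores are idled.

    Assumes power per core is proportional to frequency ** exponent and that
    idle cores consume no power. Both are simplifications.
    """
    budget_ratio = total_cores / active_cores
    return base_ghz * budget_ratio ** (1.0 / exponent)


base = 2.4  # hypothetical all-cores-active clock in GHz
print(f"Both cores active: {boosted_frequency(base, 2, 2):.2f} GHz")
print(f"One core active:   {boosted_frequency(base, 2, 1):.2f} GHz")
```

Under these assumptions a dual core chip running a single thread could raise that core's clock by roughly a quarter within the same thermal envelope; real parts would also be limited by maximum voltage and hot spots, so treat this as an upper bound on the idea rather than a prediction.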


Intel hasn't revealed too much about the performance of Penryn but Pat did leave us with a few comments. We don't know anything more about the test conditions than what we are presenting, and we didn't do the measurements ourselves, so take it for what it's worth.

Comparing a 3.2GHz Penryn (1.6GHz FSB) to a 3.0GHz Conroe (1.33GHz FSB), Intel has measured a more than 20% increase in gaming performance (with no code changes). For video encoding applications, if SSE4 is utilized, the same Penryn vs. Conroe comparison can offer more than a 40% increase in performance.

Finally, Intel mentioned that in the server space, the fastest quad core Penryn available (>3GHz) vs. a 2.67GHz quad core Xeon resulted in a greater than 45% increase in performance in "bandwidth and FP intensive applications". It's incredibly vague (and oddly similar to AMD's claims of Barcelona vs. Xeon performance), but Pat mentioned that STREAM and certain benchmarks in SpecFP could be considered to be "bandwidth and FP intensive".

Again, we are just reporting what Intel told us. It will be a while before we can actually verify any of these claims or put them in the right context. Given the various enhancements that we've reported on, however, it's only reasonable to expect Penryn to be faster than Conroe, clock-for-clock. Whether that's 10% faster, 20% faster, or something else will be made clear in the future.
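For a rough idea of what Intel's desktop numbers imply clock-for-clock, the Python sketch below divides the quoted speedups by the clock speed ratio of the two chips. It treats Intel's "more than" figures as exact and lumps the faster FSB and larger cache into the per-clock gain, so it is an estimate under those assumptions, not a measurement.

```python
def per_clock_gain(speedup, new_ghz, old_ghz):
    """Performance gain left over after removing the clock speed increase."""
    return speedup / (new_ghz / old_ghz) - 1.0


PENRYN_GHZ = 3.2
CONROE_GHZ = 3.0

# Speedups as quoted by Intel: 1.20x in games, 1.40x in SSE4 video encoding.
for label, speedup in (("gaming", 1.20), ("SSE4 video encoding", 1.40)):
    gain = per_clock_gain(speedup, PENRYN_GHZ, CONROE_GHZ)
    print(f"{label}: {gain:.1%} faster per clock")
```

The clock speed difference alone accounts for under 7 percent, leaving roughly 12 percent per clock in games and about 31 percent in SSE4-optimized encoding if Intel's figures hold up.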

Nehalem Micro Architecture: Intel Embraces the IMC and IGP

Surprisingly, Intel gave away quite a few details about Nehalem. Although Nehalem is still based on the 4-issue Core architecture, it takes "multithreading" to a whole new level. First of all, Nehalem can contain up to eight cores per die. Combined with 2-way Simultaneous Multi-Threading (SMT or Hyper-Threading), you'll have the ability to execute up to 16 threads on one chip!

Nehalem will also use a multi-level shared cache. Pat Gelsinger indicated that only the highest level of cache would be shared, meaning that Nehalem could very well have a similar cache hierarchy to AMD's Barcelona (independent L1/L2 caches per core, but a shared L3 cache). The power of each core is "dynamically managed", which might indicate that Nehalem goes one step further than AMD's Barcelona core: it could have independent power planes.

Nehalem will no longer use an FSB but a serial point-to-point interconnect. Even more revolutionary is the fact that Nehalem will have an integrated memory controller (IMC) and that the number of serial interconnects is variable (Intel's version of "HyperTransport"). Another potentially groundbreaking move is that some Nehalem CPUs will have a GPU integrated (Intel's version of "Fusion"). With an integrated memory controller, new interconnect, and potentially integrated graphics, Nehalem will obviously require a new socket.

Intel would not give any more detail, but it is clear that the GPU will not be high-end (that would require too much power); more likely it will be a kind of midrange (or even low-end depending on your perspective) solution. Intel would not confirm this, but it seems pretty clear to us that Xeon DP and desktop products will probably have an IMC that supports DDR3. Xeon MPs will probably have an IMC that supports registered FB-DIMMs with DDR3. Nehalem should first be available in the second half of 2008 as Intel talked about "production ramping in 2008, with full production in 2009".

Final Words

Continuing the trend started by the "new Intel", today we were given a ridiculous amount of information about Intel's coming microprocessor architectures. Obviously part of today's announcements was intended to pre-empt any excitement about AMD's Barcelona architecture, but Intel is doing the right thing. It's sharing a very forward-looking roadmap with the public early on in order to rebuild trust and confidence, especially after what happened with NetBurst.

This is the exact approach we would like AMD to embrace as well: keep the public informed, especially if you have exciting things to talk about. We understand the desire to try and protect trade secrets, but with the complexity of modern processors and the amount of information that gets shared between partners, not to mention "corporate espionage", it seems likely that Intel and AMD already more or less know what their competitors are planning. Changing course late in product development is nearly impossible, and especially when current products begin to lag behind the competition we would expect companies to try to garner interest by talking about the future.

Looking to the future, one thing that is clear is that multi-core solutions are truly becoming the norm. We still haven't managed to realize the potential of even dual core solutions with many applications (particularly games), and with quad core and octal core processors in the pipeline the need for software to become more multithreading-friendly is reaching a critical stage. The good news is that we're beginning to hear a lot more about multithreaded software engine designs, and we are even beginning to see some of the fruits of these labors.

Naturally, without any actual hardware to test, it is impossible to say right now whether Intel or AMD will come out on top with their next-generation processors. Regardless of which company "wins", the best news is that it appears the cutthroat competition will continue for the time being. All you have to do is look at current processor prices to appreciate the importance of competition, and we're certainly looking forward to the day when quad core and higher systems fall into midrange and lower price brackets!
