Original Link: http://www.anandtech.com/show/2658
The Dark Knight: Intel's Core i7
by Anand Lal Shimpi & Gary Key on November 3, 2008 12:00 AM EST
Nuh - hay - lem
At least that's how Intel PR pronounces it.
I've been racking my brain for the past month on how best to review this thing, what angle to take, it's tough. You see, with Conroe the approach was simple: the Pentium 4 was terrible, AMD proudly wore its crown and Intel came in and turned everyone's world upside down. With Nehalem, the world is fine, it doesn't need fixing. AMD's pricing is quite competitive, Intel's performance is solid, power consumption isn't getting out of control...things are nice.
But we've got that pesky tick-tock cadence and things have to change for the sake of change (or more accurately, technological advancement, I swear I'm not getting cynical in my old age):
2008, that's us, that's Nehalem.
Could Nehalem ever be good enough? It's the first tock after Conroe, that's like going on stage after the late Richard Pryor, it's not an enviable position to be in. Inevitably Nehalem won't have the same impact that Conroe did, but what could Intel possibly bring to the table that it hasn't already?
Let's go ahead and get started, this is going to be interesting...
Nehalem's Architecture - A Recap
I spent 15 pages and thousands of words explaining Intel's Nehalem architecture in detail already, but what I'm going to try and do now is summarize that in a page. If you want greater detail please consult the original article, but here are the cliff's notes.
Nehalem, as I've mentioned countless times before, is a "tock" processor in Intel's tick-tock cadence. That means it's a new microarchitecture but based on an existing manufacturing process, in this case 45nm.
A quad-core Nehalem is made up of 731M transistors, down from 820M in Yorkfield, the current quad-core Core 2s based on the Penryn microarchitecture. The die size has gone up however, from 214 mm^2 to 263 mm^2. That's fewer transistors, but less densely packed ones. Part of this is due to a reduction in cache size and part of it is due to a fundamental rearchitecting of the microprocessor.
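To put a number on "less densely packed", here is a quick sketch that divides the quoted transistor counts by the quoted die areas. Nothing here is measured independently; the inputs are the figures from the paragraph above, and the dictionary keys are just labels chosen for this example.

```python
# Transistor counts (millions) and die areas (mm^2) as quoted in the text.
chips = {
    "Nehalem (quad-core)": (731, 263),
    "Yorkfield (quad-core)": (820, 214),
}

def density(transistors_m: float, area_mm2: float) -> float:
    """Millions of transistors per square millimetre."""
    return transistors_m / area_mm2

densities = {name: density(t, a) for name, (t, a) in chips.items()}

for name, d in densities.items():
    print(f"{name}: {d:.2f}M transistors/mm^2")

# Ratio of the two densities (Nehalem relative to Yorkfield).
relative = densities["Nehalem (quad-core)"] / densities["Yorkfield (quad-core)"]
print(f"Nehalem packs about {(1 - relative) * 100:.0f}% fewer transistors per mm^2")
```

On these figures Nehalem comes out at a bit under 2.8M transistors per mm^2 against a bit over 3.8M for Yorkfield, which fits the smaller share of die area spent on dense cache.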
Nehalem is Intel's first "native" quad-core design, meaning that all four cores are a part of one large, monolithic die. Each core has its own L1 and L2 caches, and all four sit behind a large 8MB L3 cache. The L1 cache remains unchanged from Penryn (the current 45nm Core 2 architecture), although it is slower at 4 cycles vs. 3. The L2 cache gets a little faster but also gets a lot smaller at 256KB per core, whereas the lowest end Penryns split 3MB of L2 among two cores. The L3 cache is a new addition and serves as a common pool that all four cores can access, which will really help in cache intensive multithreaded applications (such as those you'd encounter in a server). Nehalem also gets a three-channel, on-die DDR3 memory controller, if you haven't heard by now.
At the core level, everything gets deeper in Nehalem. The CPU is just as wide as before and the pipeline stages haven't changed, but the reservation station, load and store buffers and OoO scheduling window all got bigger. Peak execution power hasn't gone up, but Nehalem should be much more efficient at using its resources than any Core microarchitecture before it.
Once again to address the server space Nehalem increases the size of its TLBs and adds a new 2nd level unified TLB. Branch prediction is also improved, but primarily for database applications.
Hyper Threading is back in its typical 2-way fashion, so a single quad-core Nehalem can work on 8 threads at once. Here we have yet another example of Nehalem making more efficient use of the execution resources rather than simply throwing more transistors at the problem. With Penryn Intel hit nearly 1 billion transistors for a desktop quad-core chip, clearly Nehalem was an attempt to both address the server market and make more efficient use of those transistors before the next big jump and crossing the billion transistor mark.
Multiple Clock Domains
Functionally there are some basic differences between Nehalem and previous Intel architectures. The Front Side Bus is gone and replaced with Intel's Quick Path Interconnect, similar to AMD's Hyper Transport. The QPI implementation on the first Nehalem is a 25.6GB/s interface which matches up perfectly to the 25.6GB/s of memory bandwidth Nehalem has.
The CPU operates on a multiplier of the QPI source clock, which in this case is 133MHz. The top bin Nehalem runs at 3.2GHz or 133MHz x 24. The L3 cache and memory controller operate on a separate clock frequency called the un-core clock. This frequency is currently 20x the BCLK, or 2.66GHz.
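The clock relationships above are simple multiplications, sketched below. One assumption is labeled in the code: the "133MHz" source clock is treated as a nominal 133.33MHz, which is what makes 24x land on 3.2GHz.

```python
# Assumption: the "133MHz" base clock (BCLK) is nominally 133.33MHz.
BCLK_MHZ = 133.33

def clock_ghz(multiplier: int, bclk_mhz: float = BCLK_MHZ) -> float:
    """Clock in GHz produced by an integer multiplier on the base clock."""
    return multiplier * bclk_mhz / 1000

core_965 = clock_ghz(24)  # core multiplier quoted in the text
uncore = clock_ghz(20)    # un-core multiplier quoted in the text

print(f"Core i7-965 core clock: {core_965:.2f}GHz")
print(f"Un-core clock:          {uncore:.2f}GHz")
```

The un-core value prints as 2.67GHz because of rounding; the 2.66GHz quoted in the text is the same 2.666...GHz figure truncated.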
This is all very similar to AMD's Phenom, but where the two differ is in how they handle power management. While AMD will allow individual cores to request different clock speeds, Nehalem attempts to run all of its cores at the same frequency; if one core is idle then it's simply power gated and the core is effectively turned off. I explain this in greater detail here but the end result is that we don't have the strange performance issues that sometimes appear with AMD's Cool'n'Quiet enabled. While we have to turn off CnQ to get repeatable results in some of our benchmarks (in some cases we'll see a 50% performance hit with CnQ enabled), Intel's EIST seems to be fine when turned on and does not concern us.
Looking at Nehalem's microarchitecture one thing becomes very clear: this is a CPU designed to address Intel's shortcomings in the server space. There's nothing inherently wrong about that, but it's a different approach than what Intel did with Conroe. With Conroe, Intel took a mobile architecture and, working from the philosophy that what was good for mobile in terms of power efficiency and performance per watt would also be good for the desktop, created its current microarchitecture.
This was in stark contrast to how microprocessor development used to work; chips would be designed for the server/workstation/high end desktop market and trickle down to mainstream users and the mobile space. But Conroe changed all of that, it's a good part of why Intel's Core 2 architecture makes such a great desktop and mobile processor.
Power obviously also matters in servers, but not to the same extent as in notebooks. Needless to say, Conroe did well in the server market, but it lacked some key features, and that allowed AMD to hang onto market share.
Nehalem started out as an architecture that addressed these enterprise shortcomings head on. The on-die memory controller, Hyper Threading, larger TLBs, improved virtualization performance, restructured cache hierarchy, the new 2nd level branch predictor, all of these features will be very important to making Intel more competitive in the enterprise space, but at what cost to desktop power consumption and performance?
Intel promises better energy efficiency for the desktop, we'll be the judge of that...
I'm stating the concern up front because when I approached today's Nehalem review that's what I had in mind. Everyone has high expectations for Nehalem, but it hasn't been that long since Intel dropped Prescott on us - what I want to find out is whether Intel has stayed true to its mission on keeping power in check or if we've simply regressed with Nehalem.
The only hope I had for Nehalem was that it was the first high performance desktop core that implemented Intel's new 2:1 performance:power ratio rule. Also used by the Atom's design team, every feature that made its way into Nehalem had to increase performance by 2% for every 1% increase in power consumption otherwise it wasn't allowed in the design. In the past Intel used a general 1:1 ratio between power and performance, but with Nehalem the standards were much higher. We'll find out if Intel was all talk in a moment, but let's take a look at Nehalem's biggest weakness first.
With a new microarchitecture comes a new naming system and while it makes sense for Intel to ditch the Duo/Quad suffixes that's about the only sensible thing that we get with Nehalem's marketing. The new name has already been announced, Nehalem is officially known as the Intel Core i7 processor. Model numbers are back of course and the three chips that Intel is announcing today are the 965, 940 and 920. The specs break down like this:
|Processor||Clock Speed||QPI Speed (GT/sec)||L3 Cache||Memory Speed Support||TDP||Unlocked?||Price|
|Intel Core i7-965 Extreme Edition||3.20GHz||6.4||8MB||DDR3-1066||130W||Yes||$999|
|Intel Core i7-940||2.93GHz||4.8||8MB||DDR3-1066||130W||No||$562|
|Intel Core i7-920||2.66GHz||4.8||8MB||DDR3-1066||130W||No||$284|
Obviously there's no changing Intel's naming system now, but I'd just like to voice my disapproval with regards to the naming system. It just doesn't sound very good.
These chips aren't launching today, Intel is simply letting us talk about them today. You can expect an official launch with availability by the end of the month.
By moving the memory controller on-die Intel dramatically increased the pincount of its processor. While AMD's Phenom featured a 940-pin pinout, Intel's previous Core 2 processors only had 775 contact pads on their underside. With three 64-bit DDR3 channels however, Intel's Core i7 ballooned to 1366 pads, making the chip and socket both physically larger:
The downside to integrating a memory controller is that if there are any changes in memory technology or in the number of memory channels, you need a new socket. Sometime in 2009 Intel will introduce a cheaper Nehalem derivative with only a 2-channel memory controller, most likely to compete in the < $200 CPU price points. These CPUs will use an LGA-1156 socket, but future 8-core versions of Nehalem will use LGA-1366 like the CPUs we're reviewing here today.
The larger socket also requires a bigger heatsink, here's a look at the new Intel reference cooler:
From left to right: 45nm Core 2 Duo cooler, 45nm Core 2 Quad cooler, 45nm Core i7 Cooler
Nehalem's Weakness: Cache
Intel opted for a very Opteron-like cache hierarchy with Nehalem, each core gets a small L2 cache and they all sit behind one large, shared L3 cache. This sort of a setup benefits large codebase applications that are also well threaded, for example the type of things you'd encounter in a database server. The problem is that the CPU launching today, the Core i7, is designed to be used in a desktop.
Let's look at a quick comparison between Nehalem and Penryn's cache setups:
|Cache / Memory Level||Intel Nehalem||Intel Penryn|
|L1 Size / L1 Latency||64KB / 4 cycles||64KB / 3 cycles|
|L2 Size / L2 Latency||256KB / 11 cycles||6MB* / 15 cycles|
|L3 Size / L3 Latency||8MB / 39 cycles||N/A|
|Main Memory Latency (DDR3-1600 CAS7)||107 cycles (33.4 ns)||160 cycles (50.3 ns)|
*Note 6MB per 2 cores
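For a feel of what those cycle counts mean in wall-clock terms, here is a small conversion sketch. It assumes the latencies are counted in core cycles at 3.2GHz; the table does not say so explicitly, but 107 cycles at 3.2GHz matches the 33.4 ns it lists for Nehalem's main memory.

```python
CORE_GHZ = 3.2  # assumption: latencies are counted in core cycles at 3.2GHz

def cycles_to_ns(cycles: float, core_ghz: float = CORE_GHZ) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / core_ghz

# Nehalem cycle counts from the table above.
nehalem_latency_cycles = {"L1": 4, "L2": 11, "L3": 39, "Main memory": 107}
nehalem_latency_ns = {
    level: cycles_to_ns(cycles) for level, cycles in nehalem_latency_cycles.items()
}

for level, ns in nehalem_latency_ns.items():
    print(f"{level:12s} {nehalem_latency_cycles[level]:4d} cycles = {ns:5.1f} ns")
```

The same conversion puts Penryn's 160 cycles at 50.0 ns, slightly under the 50.3 ns in the table, so the cycle counts there are presumably rounded.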
Nehalem's L2 cache does get a bit faster, but the speed doesn't make up for the lack of size. I suspect that Intel will address the L2 size issue with the 32nm shrink, but until then most applications will have to deal with a significantly reduced L2 cache size per core. The performance impact is mitigated by two things: 1) the fast L3 cache, and 2) the very fast on die memory controller. Fortunately for Nehalem, most applications can't fit entirely within cache and thus even the large 6MB and 12MB L2 caches of its predecessors can't completely contain everything, thus giving Nehalem's L3 cache and memory controller time to level the playing field.
The end result, as you'll soon see, is that in some cases Nehalem's architecture manages to take two steps forward, and two steps back, resulting in zero net improvement over Penryn. The perfect example is 3D gaming as you can see below:
|Game||Intel Nehalem (3.2GHz)||Intel Penryn (3.2GHz)|
|Age of Conan||123 fps||107.9 fps|
|Race Driver GRID||102.9 fps||103 fps|
|Crysis||40.5 fps||41.7 fps|
|Far Cry 2||115.1 fps||102.6 fps|
|Fallout 3||83.2 fps||77.2 fps|
Age of Conan and Fallout 3 show significant improvements in performance when not GPU bound, while Crysis and Race Driver GRID offer absolutely no benefit to Nehalem. It's almost Prescott-like in that Intel put in a lot of architectural innovation into a design that can, at times, offer no performance improvement over its predecessor. Where Nehalem fails to be like Prescott is in that it can offer tremendous performance increases and it's on the very opposite end of the power efficiency spectrum, but we'll get to that in a moment.
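The size of each swing is easier to see as a percentage. The sketch below computes Nehalem's change relative to Penryn for every game in the table above; the frame rates are copied from that table and nothing else is assumed.

```python
# Frame rates from the table above: (Nehalem 3.2GHz, Penryn 3.2GHz).
games = {
    "Age of Conan": (123.0, 107.9),
    "Race Driver GRID": (102.9, 103.0),
    "Crysis": (40.5, 41.7),
    "Far Cry 2": (115.1, 102.6),
    "Fallout 3": (83.2, 77.2),
}

def pct_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100

deltas = {game: pct_change(nehalem, penryn) for game, (nehalem, penryn) in games.items()}

# Largest gain first.
for game, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{game:18s} {delta:+.1f}%")
```

On these numbers Age of Conan gains about 14%, Far Cry 2 about 12% and Fallout 3 about 8%, while GRID is flat and Crysis slips by roughly 3%.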
Understanding Nehalem's Memory Architecture
Nehalem does spice things up a bit in the memory department, not only does it have an integrated memory controller (a first for an x86 Intel CPU) but the memory controller in question has an unusual three-channel configuration. All other AMD and Intel systems use dual channel DDR2 or DDR3 memory controllers; with each channel being 64-bits wide, you have to install memory in pairs for peak performance.
With a three-channel DDR3 memory controller, Nehalem requires the use of three DDR3 modules to achieve peak bandwidth - which also means that the memory manufacturers are going to be selling special 3-channel DDR3 kits made specifically for Nehalem. Motherboard makers will be doing one of two things to implement Nehalem's three-channel memory interface on boards; you'll either see boards with four DIMM slots or boards with six:
Four DDR3 slots, three DDR3 channels
In the four-slot configuration the first three slots correspond to the first three channels, the fourth slot is simply sharing one of the memory channels. The downside to this approach is that your memory bandwidth drops to single-channel performance as you start filling up your memory. For example, if you have 4 x 1GB sticks, the first 3GB of memory will be interleaved between the three memory channels and you'll get 25.6GB/s of bandwidth to data stored in the first 3GB. The final 1GB however won't be interleaved and you'll only get 8.5GB/s of bandwidth to it. Despite the unbalanced nature of memory bandwidth in this case, your aggregate bandwidth is still greater in this configuration than a dual-channel setup.
Six DDR3 slots, two slots per DDR3 channel
The more common arrangement will be six DIMM slots where each DDR3 channel is connected to a pair of DIMM slots. In this configuration as long as you install DIMMs in triplicate you'll always get the full 25.6GB/s of memory bandwidth.
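The 25.6GB/s and 8.5GB/s figures fall straight out of the channel width and transfer rate. Here is a minimal sketch of that arithmetic, assuming DDR3-1066 runs at a nominal 1066.67 million transfers per second and counting 1GB as 10^9 bytes.

```python
BYTES_PER_TRANSFER = 8  # each DDR3 channel is 64 bits wide
DDR3_1066_MT_S = 1066.67  # assumption: nominal DDR3-1066 transfer rate

def peak_bandwidth_gbs(transfers_mt_s: float, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s (1GB = 10^9 bytes)."""
    return transfers_mt_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9

single = peak_bandwidth_gbs(DDR3_1066_MT_S, 1)  # the non-interleaved 4th DIMM
dual = peak_bandwidth_gbs(DDR3_1066_MT_S, 2)    # a conventional dual-channel setup
triple = peak_bandwidth_gbs(DDR3_1066_MT_S, 3)  # Nehalem with DIMMs in threes

for label, value in (("1 channel", single), ("2 channels", dual), ("3 channels", triple)):
    print(f"{label}: {value:.1f}GB/s")
```

These are theoretical peaks only; the measured Everest numbers later in the article are well below them.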
That discussion is entirely theoretical however, the real question is: does Nehalem's triple-channel memory controller actually matter or would two channels suffice? I suspect that Hyper Threading simply improved Nehalem's efficiency not necessarily its need for more data. The three-channel memory controller is probably far more important for servers and will be especially useful in the upcoming 8-core version of Nehalem due out sometime next year. To find out we simply benchmarked Nehalem in a handful of applications with a 4GB/dual channel configuration and a 6GB/triple-channel configuration. Note that none of these tests actually used more than 4GB of memory so the size difference doesn't matter, we kept memory timings the same between all tests.
|Benchmark||Dual Channel DDR3-1066 (9-9-9-20)||Triple Channel DDR3-1066 (9-9-9-20)|
|Memory Tests - Everest v1547|
|Read Bandwidth||12859 MB/s||13423 MB/s|
|Write Bandwidth||12410 MB/s||12401 MB/s|
|Copy Bandwidth||16474 MB/s||18074 MB/s|
|Latency||37.2 ns||44.2 ns|
|Cinebench R10 (Multi-threaded test)||18499||18458|
|x264 HD Encoding Test (First Pass / Second Pass)||83.8 fps / 30.3 fps||85.3 fps / 30.3 fps|
|WinRAR 3.80 - 602MB Folder||118 seconds||117 seconds|
|Vantage - Memories||6753||6712|
|Vantage - TV and Movies||5601||5637|
|Vantage - Gaming||10202||9849|
|Vantage - Music||5378||4593|
|Vantage - Communications||6671||6422|
|Vantage - Productivity||7589||7676|
|WinRAR (Built in Benchmark)||3283||3306|
|Nero Recode - Office Space - 7.55GB||131 seconds||130 seconds|
|SuperPI - 32M (mins:seconds)||11:55||11:52|
|Far Cry 2 - Ranch Medium (1680 x 1050)||62.1 fps||62.4 fps|
|Age of Conan - 1680 x 1050||51.5 fps||51.1 fps|
|Company of Heroes - 1680 x 1050||136.6 fps||133.6 fps|
At DDR3-1066 speeds we found no real performance difference between the Core i7-965 running in two channel vs. three channel mode; the added bandwidth is simply not useful for most desktop applications. For some reason we were able to get better latency scores on the dual-channel configuration, but there's a good chance that may be due to the early nature of BIOSes on these boards. In benchmarks where the latency difference was noticeable we saw the dual-channel configuration pull ahead slightly, then in other tests where the added bandwidth helped we saw the triple-channel configuration do better. Honestly, it's mostly a wash between the two.
Our recommendation would be to stick with three channels, but if you have existing memory and can't populate the third channel yet it's not a huge deal, really, two is fine here for the time being.
What about the Impact of DDR3 Speeds?
Intel only officially supports up to DDR3-1066 on Nehalem, but hitting 1333MHz and 1600MHz isn't a problem thanks to DDR3 being a mature technology that's been in use for a couple of years now.
|Benchmark||Triple Channel DDR3-1066 (9-9-9-20)||Triple Channel DDR3-1333 (9-9-9-20)||Triple Channel DDR3-1600 (9-9-9-24)|
|Memory Tests - Everest v1547|
|Read Bandwidth||13423 MB/s||14127 MB/s||17374 MB/s|
|Write Bandwidth||12401 MB/s||12404 MB/s||14169 MB/s|
|Copy Bandwidth||18074 MB/s||16953 MB/s||19447 MB/s|
|Latency||44.2 ns||38.8 ns||33.5 ns|
|x264 HD Encoding Test (First Pass / Second Pass)||85.3 fps / 30.3 fps||86.4 fps / 30.6 fps||88.1 fps / 30.7 fps|
|WinRAR 3.80 - 602MB Folder||117 seconds||111 seconds||106 seconds|
|Vantage - Memories||6712||6809||6886|
|Vantage - TV and Movies||5637||5716||5716|
|Vantage - Gaming||9849||10570||11013|
|Vantage - Music||4593||4798||4896|
|Vantage - Communications||6422||6486||6630|
|Vantage - Productivity||7676||7803||7819|
|WinRAR (Built in Benchmark)||3306||3520||3707|
|Nero Recode - Office Space - 7.55GB||130 seconds||127 seconds||126 seconds|
|SuperPI - 32M (mins:seconds)||11:52||11:36||11:25|
|Far Cry 2 - Ranch Medium (1680 x 1050)||62.4 fps||62.5 fps||62.7 fps|
|Age of Conan - 1680 x 1050||51.1 fps||51.1 fps||51.1 fps|
|Company of Heroes - 1680 x 1050||133.6 fps||135.8 fps||136.8 fps|
The real world performance benefit from going from DDR3-1066 to DDR3-1600, despite having to loosen memory timings slightly, is around 3%. The raw increase in memory bandwidth amounts to about 30% and in a completely memory bandwidth bound test like the WinRAR benchmark you're looking at a 12% boost in performance, but that's going to be very rare in most real world scenarios. The reduction in latency is particularly impressive when you jump up to DDR3-1600: it only takes 33.5ns to access main memory.
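One way to sanity check a "typical" gain is to compute the DDR3-1066 to DDR3-1600 speedup for every application test in the table and take the median. The sketch below does that; the choice of median is an assumption made for this example, not necessarily how the figure in the text was derived. Timed tests are inverted so that a shorter run counts as a gain.

```python
from statistics import median

# (test, DDR3-1066 result, DDR3-1600 result, higher_is_better), from the table above.
results = [
    ("x264 first pass (fps)", 85.3, 88.1, True),
    ("x264 second pass (fps)", 30.3, 30.7, True),
    ("WinRAR 602MB folder (s)", 117, 106, False),
    ("Vantage Memories", 6712, 6886, True),
    ("Vantage TV and Movies", 5637, 5716, True),
    ("Vantage Gaming", 9849, 11013, True),
    ("Vantage Music", 4593, 4896, True),
    ("Vantage Communications", 6422, 6630, True),
    ("Vantage Productivity", 7676, 7819, True),
    ("WinRAR built-in", 3306, 3707, True),
    ("Nero Recode (s)", 130, 126, False),
    ("SuperPI 32M (s)", 11 * 60 + 52, 11 * 60 + 25, False),
    ("Far Cry 2 (fps)", 62.4, 62.7, True),
    ("Age of Conan (fps)", 51.1, 51.1, True),
    ("Company of Heroes (fps)", 133.6, 136.8, True),
]

def speedup_pct(base: float, new: float, higher_is_better: bool) -> float:
    """Percentage speedup of `new` over `base`; times are inverted."""
    ratio = new / base if higher_is_better else base / new
    return (ratio - 1) * 100

gains = {name: speedup_pct(base, new, hib) for name, base, new, hib in results}
typical_gain = median(gains.values())

# Raw read bandwidth from the Everest rows, for comparison.
read_bw_gain = speedup_pct(13423, 17374, True)

print(f"Median application gain: {typical_gain:.1f}%")
print(f"WinRAR built-in gain:    {gains['WinRAR built-in']:.1f}%")
print(f"Read bandwidth gain:     {read_bw_gain:.1f}%")
```

The median lands a little above 3%, the WinRAR built-in benchmark at about 12%, and read bandwidth at about 29%, all in line with the figures quoted above.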
If you do want the absolute best performance out of your Nehalem system you're going to want a three-channel DDR3-1600+ kit, but you'll only be giving up a couple of percent if you opt for the entry level 1066MHz modules at like timings. Although not shown, in this article anyway, reducing the memory timings to 7-7-7-20 at DDR3-1066 will close the slight performance gap quickly in most instances.
Intel's Warning on Memory Voltage
One of the most interesting changes for us with the release of the i7/X58 platform is the advances that have been made with DDR3. DDR3 had an auspicious introduction over a year and a half ago when the P35 chipset debuted. Intel then introduced the X38 chipset with a focus on DDR3 support although DDR2 continued to perform better on the platform. It was not until the Intel X48 and NVIDIA 790i chipset releases earlier this year that users recognized DDR3 could become a performance factor on the desktop.
However, in order to glean the absolute best performance from these chipsets, the user needed DDR3 that was capable of running higher than DDR3-1800 speeds. The ICs from Micron at the time required a healthy 1.9V or higher to reach those speeds and the coveted 2000MHz mark. Samsung introduced a new family of ICs last spring that were capable of running up to 2200MHz or higher on +2.0V. While typical desktop applications or games did not take advantage of these speeds and resulting memory bandwidth, they did make for top results in the synthetic benchmarks.
Pricing was another problem that prevented the growth of DDR3 into the main stream market. Not only was DDR3 expensive, the market was flooded with DDR2 memory that performed equally well on the desktop at over half the price. As with most new technologies, it is a chicken and egg scenario when it comes to mass market product acceptance.
Intel had originally planned on X38/X48 being DDR3 only, but the market was not ready for it. We still feel that way to some degree but Intel believes this is the time for DDR3 to become their memory technology of choice for the next few years. As such, the introduction of i7/X58 brings with it a requirement for DDR3 memory. This requirement comes with a couple of caveats, the primary one being that Intel is highly recommending, more like suggesting a visit from the Grim Reaper is coming soon, that memory voltage does not exceed 1.65V on a long term basis or your new i7 might not work one day.
The majority of current DDR3-1066/1333 modules adhere to the base 1.5V JEDEC spec along with not needing more than 1.65V when overclocking, although overclocks amount to a couple hundred MHz increase at best with these products. The higher end DDR3 that has been on the market since last winter typically requires 1.8V or so to run above DDR3-1600. In fact, most of the current DDR3-1800+ memory usually requires 1.9V or higher. In some cases, depending on the SPD, it has difficulty even booting at 1.5V.
By coincidence or not, newer DDR3 ICs coming to market now from Qimonda, Samsung, and Elpida are able to operate from DDR3-1066 up to DDR3-1800 on 1.5V to 1.65V depending on timings and module size. In fact, we have experience with the new Samsung and Qimonda ICs (both 3GB and 6GB kits) operating at DDR3-1866 (9-9-8-20) up to DDR3-2000 (10-9-9-24) on 1.65V~1.75V with the ASUS Rampage II Extreme board. The good news is that these modules are starting to show up at the e-tailers with price points below previous DDR3 products.
This last week has been a busy one in the labs as we have started to receive a variety of memory modules from Kingston, OCZ, Patriot, GSkill, and Corsair for our upcoming DDR3 Shootout and Memory Guide for i7. The products range from the $109 3GB DDR3-1333 (9-9-9-24) kit from GSkill to the Corsair/OCZ 6GB DDR3-1600 (9-9-9-24) kits, and finally our DDR3-2000 (9-9-9-24) 1.65V kit from Kingston.
Our initial opinion at this time is that dual or tri-channel DDR3-1333 running at 8-8-8-20 timings will satisfy about 80% of the users in the market. In fact, DDR3-1066 at 7-7-7-18 might be the better solution for most applications right now considering the latency improvements over CAS8 or CAS9 DDR3-1333. Of course, running DDR3-1333 at CAS7 would be ideal from a price and performance viewpoint.
For the more performance oriented crowd, we have found the sweet spot for performance and keeping money in your wallet to be tri-channel DDR3-1600 running at 8-8-8-20, something most of the new DDR3-1600 6GB kits will do easily on 1.6V or less. Of course, the benchmarking enthusiast will still want DDR3-1866 or higher on this platform, something that is attainable now with voltages in the 1.65V~1.75V range depending on final speeds, board design, and loads, as all three i7 processors are memory multiplier unlocked.
Getting back to that 1.65V warning, Intel is quite serious about this voltage level and is ensuring the board manufacturers remind the users in a variety of ways ranging from statements in the user manuals to various BIOS warnings when changing VDimm above 1.65V. We have been running exhaustive tests at various voltages and firmly believe that if VCore, QPI/IMC Voltage, and VDimm are properly aligned, that running VDimm up to 1.80V should be acceptable with proper cooling and non 24/7 operation. Of course that is not a promise, but we will have additional results shortly.
In the meantime, Intel also recommends not taking QPI/IMC (uncore/VTT) voltages above 1.3V. In fact, we think this setting is just as dangerous as or more so than high VDimm to the processor’s long term health. However, this setting is also one that greatly improves memory clocking and bclk levels along with a proper dose of IOH voltage. Just how far you can take QPI/IMC (VTT) voltage is something we are working on (1.475V is working well for us), just be aware that it is a delicate balance between this setting and VDimm to get the most out of your memory. In most of our tests at this point on the 920, we usually bump QPI/IMC (VTT) voltage up to get additional memory/core clocks while maintaining the memory voltage around 1.65V.
Thread It Like It's Hot
Hyper Threading was a great technology, simply first introduced on the wrong processor. The execution units of any modern day microprocessor are power hungry and consume a lot of die space, the last thing you want is to have them be idle with nothing to do. So you implement various tricks to keep them fed and working as often as possible. You increase cache sizes to make sure they never have to wait on main memory, you integrate a memory controller to ensure that trips to main memory are as speedy as possible, you prefetch data that you think you'll need in the future, you predict branches, etc...
Enabling simultaneous multi-threaded (SMT) execution is one of the most power efficient uses of a microprocessor's transistor budget, as it requires a very minimal increase in die size but can easily double the utilization of a CPU's execution units. SMT, or as Intel calls it, Hyper Threading does this by simply dispatching two threads of instructions to an individual processor core at the same time without increasing the available execution resources. Parallelism is paramount to extracting peak performance out of any out of order core, double the number of instructions being looked at to extract parallelism from and you increase your likelihood of getting work done without waiting on other instructions to retire or data to come back from memory.
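The utilization argument can be illustrated with a deliberately crude toy model. Everything in it is an assumption for illustration: the four issue slots per cycle, the per-cycle counts of ready instructions (invented, not measured), and the simplification that instructions which miss their slot do not carry over to the next cycle.

```python
ISSUE_WIDTH = 4  # assumption: execution slots available per cycle

# Hypothetical number of ready instructions per cycle for two threads.
# These values are invented for illustration, not measured on any CPU.
thread_a = [2, 0, 3, 1, 0, 4]
thread_b = [1, 2, 0, 3, 2, 0]

def utilization(ready_per_cycle, width=ISSUE_WIDTH):
    """Fraction of issue slots filled, given ready instructions each cycle."""
    issued = sum(min(ready, width) for ready in ready_per_cycle)
    return issued / (width * len(ready_per_cycle))

# One thread alone leaves slots empty whenever it stalls.
single_thread = utilization(thread_a)

# With SMT, both threads compete for the same slots in every cycle.
smt = utilization([a + b for a, b in zip(thread_a, thread_b)])

print(f"Single thread: {single_thread:.0%} of slots used")
print(f"Two threads:   {smt:.0%} of slots used")
```

The execution resources are identical in both cases; the second thread simply fills slots that the first one would have left idle, which is the whole appeal of SMT.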
In the Pentium 4 days enabling Hyper Threading required less than a 5% increase in die size but resulted in anywhere from a 0 - 35% increase in performance. On the desktop we rarely saw a big boost in performance except in multitasking scenarios, but these days multithreaded software is far more common than it was six years ago when Hyper Threading first made its debut.
This table shows what needed to be added, partitioned, shared or unchanged to enable Hyper Threading on Intel's Core microarchitecture
When the Pentium 4 made its debut however all we really had to worry about was die size, power consumption had yet to become a big issue (which the P4 promptly changed). These days power efficiency, die size and performance all go hand in hand and thus the benefits of Hyper Threading must also be looked at from the power perspective.
I took a small sampling of benchmarks ranging from things like POV-Ray which scales very well with more threads to iTunes, an application that couldn't care less if you had more than two cores. What we're looking at here are the performance and power impact due to Hyper Threading:
|Intel Core i7-965 (Nehalem 3.2GHz)||POV-Ray 3.7 Beta 29||POV-Ray System Power||Cinebench R10 1CPU||Cinebench System Power||Race Driver GRID||GRID System Power|
|HT Disabled||3239 PPS||207W||4671 CBMarks||161.8W||103 fps||300.7W|
|HT Enabled||4202 PPS||233.7W||4452 CBMarks||159.5W||102.9 fps||302W|
Looking at POV-Ray we see a 30% increase in performance for a 13% increase in total system power consumption, which comfortably clears Intel's 2:1 rule for performance improvement vs. increase in power consumption. The single threaded Cinebench test shows a slight decrease in both performance and power consumption (negligible) and the same can be said for Race Driver GRID.
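Here is a short sketch of the 2:1 check using the POV-Ray row from the table above. Keep in mind that the wattages are total system power, so the ratio for the CPU alone would look different; the 2.0 threshold is simply the rule as described earlier in the article.

```python
def pct_increase(before: float, after: float) -> float:
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100

# POV-Ray 3.7 Beta 29 on the Core i7-965, from the table above.
perf_gain = pct_increase(3239, 4202)     # PPS, HT disabled -> HT enabled
power_gain = pct_increase(207.0, 233.7)  # total system watts, same runs

ratio = perf_gain / power_gain

print(f"Performance: +{perf_gain:.1f}%")
print(f"Power:       +{power_gain:.1f}%")
print(f"Ratio:       {ratio:.2f}:1 (rule asks for at least 2:1)")
```

The ratio works out to roughly 2.3:1, so Hyper Threading clears the bar in this test even when measured at the wall.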
When Hyper Threading improves performance, it does so at a reasonable increase in power consumption. When performance isn't impacted, neither is power consumption. This time around Hyper Threading has no drawbacks, while before the only way to get it was with a processor that was too hot and barely competitive, today Intel offers it on an architecture that we actually like. Hyper Threading is actually the first indication of Nehalem's true strength, not performance, but rather power efficiency...
Is Nehalem Efficient?
At this year's IDF in San Francisco, Intel revealed a little discussed but extremely important aspect of Nehalem's circuit design:
The Nehalem design is Intel's first microprocessor in the past two decades to feature absolutely no domino logic, it's a fully static CMOS design. I've explained the differences between dynamic domino and static CMOS design in the past, but simply put: domino logic is used as a clock speed play. It's incredibly useful in implementing very high speed circuit paths on a chip and hit its all time peak in Intel's usage in the Pentium 4 days. The downside to using such high speed logic is that it requires a lot of power, but in microprocessor design there are always tradeoffs to be made.
There are many other energy efficiency plays within Nehalem
In Nehalem, Intel took the new architecture as an opportunity to revamp its design, went in and removed all remaining domino logic - but without impacting the peak clock speed of the architecture. The tradeoff here is one of die size, by using more parallel logic Intel was able to convert some serial, high speed paths, into larger, slower circuits that removed the need for domino logic. Details are unfortunately light and a bit beyond the scope of this review, but the move to an all static CMOS design is bound to reduce power consumption. Do you smell a comparison coming?
Both Nehalem and Penryn are built on the same 45nm process, available at the same clock speeds and capable of running the very same applications. In theory, Nehalem should be more power efficient, at the same clock speed, across the board thanks to its static CMOS design. To find out I measured average power consumption over the duration of a handful of benchmarks I used in this review.
|Performance||POV-Ray 3.7||Cinebench XCPU||x264 HD||Crysis|
|Intel Core 2 Quad Q9450 (Penryn - 2.66GHz)||2238 PPS||11502 CBMarks||61.5 fps||34.0 fps|
|Intel Core i7-920 (Nehalem - 2.66GHz)||3528 PPS||16211 CBMarks||74.8 fps||33.2 fps|
|Nehalem Performance Advantage||57.6%||40.9%||21.6%||-2%|
I picked these four benchmarks because they show us the range of Nehalem's performance, going from no performance improvement all the way up to a gain of nearly 60%. Now let's look at the power consumption in each of these four benchmarks:
|Power Consumption||POV-Ray 3.7||Cinebench XCPU||x264 HD||Crysis|
|Intel Core 2 Quad Q9450 (Penryn - 2.66GHz)||168.1W||175.2W||167.5W||220.8W|
|Intel Core i7-920 (Nehalem - 2.66GHz)||202.2W||208.6W||176.6W||230.8W|
|Nehalem Power Disadvantage||+34.1W||+33.4W||+9.1W||+10W|
If you actually go through and do the math you'll find that Nehalem, despite using more power, is more efficient than Penryn. Measured against Penryn, performance per watt is around 31% better in POV-Ray, 18% better in Cinebench and 15% better in the x264 HD test. Crysis, the only benchmark where Nehalem actually falls behind, does require more power and thus Nehalem loses the efficiency battle there.
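For anyone who wants to go through that math, the sketch below divides each score by the average system power from the two tables above. The percentage depends on which chip you treat as the baseline, so it reports both: Nehalem's gain relative to Penryn, and the same gap expressed as a share of Nehalem's own figure.

```python
# (Penryn score, Penryn watts), (Nehalem score, Nehalem watts) at 2.66GHz.
benchmarks = {
    "POV-Ray 3.7": ((2238, 168.1), (3528, 202.2)),
    "Cinebench XCPU": ((11502, 175.2), (16211, 208.6)),
    "x264 HD": ((61.5, 167.5), (74.8, 176.6)),
    "Crysis": ((34.0, 220.8), (33.2, 230.8)),
}

def perf_per_watt(score: float, watts: float) -> float:
    return score / watts

gain_vs_penryn = {}   # (Nehalem - Penryn) / Penryn, in percent
gap_vs_nehalem = {}   # (Nehalem - Penryn) / Nehalem, in percent

for name, (penryn, nehalem) in benchmarks.items():
    base = perf_per_watt(*penryn)
    new = perf_per_watt(*nehalem)
    gain_vs_penryn[name] = (new - base) / base * 100
    gap_vs_nehalem[name] = (new - base) / new * 100
    print(f"{name:15s} {gain_vs_penryn[name]:+6.1f}% vs Penryn, "
          f"{gap_vs_nehalem[name]:+6.1f}% of Nehalem's figure")
```

With Penryn as the baseline the gains are roughly 31%, 18% and 15% for POV-Ray, Cinebench and x264; expressed against Nehalem's own figure the same gaps read as roughly 24%, 15.5% and 13%. Crysis is negative either way.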
It seems as if Nehalem is even more polarizing than I had thought. Despite the move to a fully static CMOS design, the changes aren't enough to make up for the scenario where Nehalem can't offer more performance; power consumption still goes up, albeit not terribly.
It's also worth noting that the power comparison really depends on the CPU used, here we've got the same comparison but with the Core i7-965 vs. the Core 2 Extreme QX9770, both clocked at 3.2GHz:
|Performance||POV-Ray 3.7||Cinebench R10 - XCPU||x264 HD||Crysis|
|Intel Core 2 Extreme QX9770 (Penryn - 3.2GHz)||2641 PPS||14065 CBMarks||73.2 fps||41.7 fps|
|Intel Core i7-965 (Nehalem - 3.2GHz)||4202 PPS||18810 CBMarks||85.8 fps||40.5 fps|
|Power Consumption||POV-Ray 3.7||Cinebench R10 - XCPU||x264 HD||Crysis|
|Intel Core 2 Extreme QX9770 (Penryn - 3.2GHz)||230.7W||227.6W||230.3W||293.6W|
|Intel Core i7-965 (Nehalem - 3.2GHz)||233.7W||230.7W||196.2W||248.5W|
It's tough to draw any conclusions based on two CPUs, but it is possible that at higher clock speeds Nehalem's efficiency advantage kicks in. The QX9770 has always been a bit high on the power consumption side, whereas the i7-965, even in situations where it is slower than the QX9770, offers better power efficiency here.
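The "slower but still more efficient" point can be checked directly from the 3.2GHz tables. The sketch below is illustrative only; the variable and function names are not from the original testing, and the numbers are copied from the tables above.

```python
# (score, average system watts) pairs from the 3.2GHz tables above.
qx9770 = {"POV-Ray 3.7": (2641, 230.7), "Cinebench R10": (14065, 227.6),
          "x264 HD": (73.2, 230.3), "Crysis": (41.7, 293.6)}
i7_965 = {"POV-Ray 3.7": (4202, 233.7), "Cinebench R10": (18810, 230.7),
          "x264 HD": (85.8, 196.2), "Crysis": (40.5, 248.5)}


def summarize(test):
    """Compare the i7-965 against the QX9770 for one benchmark."""
    p_score, p_watts = qx9770[test]
    n_score, n_watts = i7_965[test]
    return {
        # Positive means Nehalem is faster.
        "perf_delta_pct": (n_score / p_score - 1) * 100,
        # Negative means Nehalem draws less power.
        "power_delta_w": n_watts - p_watts,
        # Above 1 means Nehalem does more work per watt.
        "perf_per_watt_ratio": (n_score / n_watts) / (p_score / p_watts),
    }


for test in qx9770:
    s = summarize(test)
    print(f"{test:14s} perf {s['perf_delta_pct']:+.1f}%  "
          f"power {s['power_delta_w']:+.1f}W  "
          f"perf/W x{s['perf_per_watt_ratio']:.2f}")
```

In Crysis the i7-965 is a few percent slower, yet it draws about 45W less, so its performance per watt still comes out ahead of the QX9770.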
Turbo Mode: Gimmicky or Useful?
I refuse to make any further references to the Turbo buttons on PCs from the 80s and early 90s in this section :)
Intel's Core 2 processors have historically been quite overclockable, however most users don't overclock and thus they get no benefit from the added headroom in Intel's chips. Enthusiasts obviously benefit and get the performance of the best CPUs at much lower price points thanks to overclocking, but the rest of the world has all of this untapped power sitting under their heatsinks.
Varying clock speed according to system demands and temperature is nothing new, but it's predominantly done in the downward direction. During idle periods CPU clock speeds are dropped, and the same happens when temperature limits are reached, but why not boost clock speed when conditions are ideal?
This is exactly what Intel's Turbo mode does. Originally introduced on mobile Penryn, Turbo mode simply increases the operating frequency of the processor if conditions are cool enough for the CPU to run at the higher frequency. On mobile Penryn we only saw a frequency jump if one core was idle, but with Nehalem's Turbo mode all four cores can overclock themselves if temperatures are cool enough.
Each Nehalem can run its four cores at up to 133MHz higher than the stock frequency (e.g. 3.33GHz in the case of the 3.2GHz 965 model), or if only one core is active then it can run at up to 266MHz higher than stock (3.46GHz up from 3.2GHz).
I measured the impact of Nehalem's Turbo mode on the top bin Core i7-965, which runs at 3.2GHz by default but can ratchet up to 3.33GHz or 3.46GHz depending on whether the workload is single or multi-threaded:
|Performance||POV-Ray 3.7||3dsmax 9 SPECapc CPU Rendering Composite||x264 HD Benchmark (Pass 1 / Pass 2)||iTunes WAV to MP3 Convert||iTunes WAV to AAC Convert (Single Threaded)|
|Intel Core i7-965 (3.2GHz, Turbo OFF)||4017 PPS||17.1||82.7 fps / 30.4 fps||27.1 seconds||34.1 seconds|
|Intel Core i7-965 (3.2GHz, Turbo ON)||4202 PPS||17.6||85.8 fps / 31.6 fps||26.4 seconds||32.8 seconds|
At best we should see a 4ish % increase in performance and the fact that POV-Ray shows us something greater than that tells us that Turbo mode works (and we're within the 1 - 2% margin of error of the test). Surprisingly enough, all of the multi-threaded tests had no problems using Turbo mode to their benefit giving us a 3 - 4% increase in performance thanks to the corresponding increase in clock speed. The AAC iTunes test is important as it is single-threaded, but despite the larger increase in clock speed performance didn't seem to improve any more.
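To put the expected and measured gains side by side, here is a small Python sketch. It assumes each Turbo step is the 133MHz bin described earlier and uses the Turbo OFF / Turbo ON numbers from the table above; the names are illustrative.

```python
STOCK_MHZ = 3200       # Core i7-965 default clock
TURBO_STEP_MHZ = 133   # one speed bin, as described in the text


def turbo_mhz(bins):
    """Clock speed with the given number of Turbo bins applied."""
    return STOCK_MHZ + bins * TURBO_STEP_MHZ


def clock_gain_pct(bins):
    """Best-case speedup if performance scaled perfectly with clock."""
    return (turbo_mhz(bins) / STOCK_MHZ - 1) * 100


# test name -> (turbo off, turbo on, higher_is_better)
measurements = {
    "POV-Ray 3.7 (PPS)":    (4017, 4202, True),
    "3dsmax 9 composite":   (17.1, 17.6, True),
    "x264 HD pass 1 (fps)": (82.7, 85.8, True),
    "x264 HD pass 2 (fps)": (30.4, 31.6, True),
    "iTunes MP3 (s)":       (27.1, 26.4, False),
    "iTunes AAC (s)":       (34.1, 32.8, False),
}


def speedup_pct(off, on, higher_is_better):
    """Measured speedup; timed tests are inverted so positive means faster."""
    ratio = on / off if higher_is_better else off / on
    return (ratio - 1) * 100


print(f"one bin:  {turbo_mhz(1)}MHz, {clock_gain_pct(1):.1f}% ceiling")
print(f"two bins: {turbo_mhz(2)}MHz, {clock_gain_pct(2):.1f}% ceiling")
for name, (off, on, higher) in measurements.items():
    print(f"{name:22s} {speedup_pct(off, on, higher):+.1f}%")
```

The single-threaded AAC test has a theoretical ceiling of about 8% with two bins, but the measured gain lands close to the one-bin figure.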
Our Turbo testbed
These tests were conducted on an open-air testbench with an aftermarket cooler by Thermalright. We wondered what would happen if we used a retail Intel HSF and stuck the Core i7 in a system with a Radeon HD 4870 and a 1200W PSU. The CPU actually ran a lot warmer and Turbo Mode never engaged, pretty much as expected.
With Nehalem it may be worth investing in one of these oversized heatsinks, even if you're not overclocking, you'll get a couple of extra percent in the performance department if you can keep the cores cool.
The Chipset - Meet Intel's X58
Nehalem moves the North Bridge and memory controller on-die, but just like in the AMD world there's still a need for an off-die chipset, in this case it's Intel's brand new X58.
The Intel X58 chipset is a two chip solution although later next year Intel will introduce a single chip solution alongside the mainstream version of Nehalem (which will use a different socket). Traditionally Intel referred to its North Bridge as the MCH, shorthand for Memory Controller Hub; that definition no longer applies to Nehalem so X58 is called an I/O Hub (IOH).
The X58 IOH attaches to the same ICH10 (I/O Controller Hub) that is used in Intel's 4-series chipsets.
The biggest feature of X58 is that with proper "certification" by NVIDIA, motherboard makers can include support for the right BIOS flags to allow NVIDIA's drivers to enable SLI on the platform, meaning the X58 will be the first Intel chipset to support both CrossFire and SLI multi-GPU solutions without the use of any NVIDIA silicon. There's a per-motherboard fee from NVIDIA for each certified X58 board sold and thus not all boards will be certified; the most prominent uncertified board is Intel's own X58 board. Luckily we also had access to ASUS' P6T Deluxe which is certified, giving us the ability to look at CrossFire and SLI scaling on X58 vs. other platforms.
X58 Multi-GPU Scaling
While we had some hope earlier in the year of unifying our SLI and CrossFire testbed under Skulltrail, we had to scrap that project due to numerous difficulties in testing. Today, we have another ray of hope. Having a single platform that will allow us to run both SLI and CrossFire would give us better ability to compare multiGPU scaling as there would be fewer variables to consider.
So we did a few tests today to see how the new X58 platform handles multiGPU scaling. We have compared CrossFire on X48 and SLI on 790i to multiGPU scaling on X58 in Oblivion and Quake Wars to get a taste of what we should expect. And the results are a bit of a mixed bag.
With Enemy Territory: Quake Wars, we see very consistent performance. Our single card numbers were very close to the numbers we saw on other platforms, and the general degree of scaling was maintained. Both NVIDIA and AMD hardware scaled slightly better on X58 here than on the hardware we had been using. This is a good sign that the potential for accurate comparison and good quality multiGPU testing might be possible on X58 going forward.
But then there's the Oblivion test.
Under Oblivion, NVIDIA hardware scaled less on X58, and our AMD tests were ... let's say inconclusive.
We have often had trouble with AMD drivers, especially when looking at CrossFire performance. The method that AMD uses to maintain and test their drivers necessitates eliminating some games from testing for extended periods of time. This can sometimes result in games that used to work well with AMD hardware or scale well with CrossFire no longer performing up to par or no longer scaling as well as they should.
The consistent fix, unfortunately, has been for review sites to randomly stumble upon these problems. We usually see resolutions very quickly to issues like this, but that doesn't change the fact that it shouldn't happen in the first place.
In any event, the past couple of weeks have been more frustrating than usual when testing AMD hardware. We've switched drivers four different times in testing, and we still have issues. Yes, three of these four drivers have been hotfix beta drivers, but for people with Far Cry 2 the hotfix is all they've got, and it is still broken even after three tries.
We certainly know that NVIDIA doesn't have it all right when it comes to drivers. But we really feel like AMD's monthly driver release schedule wastes resources on unnecessary work that could be better used to benefit their customers. If we are going to have hotfix drivers come out anyway, we might as well make sure that every full release incorporates all the fixes in every hot fix and doesn't break anything the last driver fixed.
The point of all this is, our money is on the lack of scaling under Oblivion being due to some aspect of the beta driver we are using rather than to X58 itself.
As for the NVIDIA results, we're a little more worried about those. It could be that we are also seeing a driver issue here, but it just could be that Oblivion does something that doesn't scale well with SLI on X58. We were really surprised to see this as we expected comparable scaling. As the drivers mature, we'll definitely test the issue of multiGPU scaling on X58 further.
Our First X58 Motherboard Preview: The ASUS Rampage II Extreme
We utilized the ASUS Rampage II Extreme motherboard for our overclocking and memory tests in today’s article. We will take a detailed look at this board and others from MSI, Gigabyte, ASUS (again), and Intel in a few days. Boards from these particular manufacturers will be available shortly and our review samples now feature retail production kits, not engineering or early production samples.
All of the boards have performed very well so far, but we have been on the BIOS of the day merry-go-round for the last week or so. However, it appears the current BIOS releases are finally to the point of being acceptable for public release, not exactly perfect yet, but a multitude of problems have been addressed over the past few weeks.
That said, this particular board is designed for a very niche market and will see limited production numbers. The mainstream enthusiast board from ASUS will be the P6T Deluxe board, a board that we actually prefer in most cases. This ROG board will be ASUS’s primary weapon in the ultra high-end market against some stiff competition from the Gigabyte EX58 Extreme board. Pricing is not set yet, but we expect it to be around $400. ASUS includes an extensive accessory kit that features their external LCD poster.
As expected with a Republic of Gamers motherboard, the BIOS options are extensive and well laid out. Of note, the QPI/DRAM Core Voltage is not the DRAM voltage setting. We think it should actually be called QPI/IMC or just Uncore voltage. In fact, as we discussed earlier, this voltage setting could potentially be more damaging to the CPU than the 1.65V recommendation on DRAM. Otherwise, the BIOS is straightforward and allows for a myriad of tuning options. We were able to easily get our i7 965 samples up to 4.2GHz on air (not the retail cooler), water, and our CoolIT Systems Freezone Elite. Our i7 920 samples reached about 3.8GHz, although we think there is a possibility for stable 4GHz operation with them.
The Big Picture
This board compares well to the previous Maximus Formula II and Rampage Extreme boards. We have the return of an eight-layer board dressed out in our favorite black, silver, and Ferrari red primary color scheme. The memory and peripheral slots return in a blue and white motif with the first PCI Express x1 slot that usually houses the SupremeFX X-FI audio card sporting black.
Due to the new LGA 1366 (Socket B) design that is larger than the current LGA 775 along with six DIMM slots, the area around the CPU is crowded, resulting in a creative layout design that manages to squeeze all the options in a slightly extended ATX format. However, the layout just does not look as clean as previous ROG offerings to us although it is still aesthetically pleasing. ASUS throws in eight fan headers that can be controlled and monitored in the BIOS or via a Windows utility program.
Around the Board
Six DDR3 DIMM slots are included for tri-channel goodness. Performance and compatibility continue to be better when utilizing the blue slots. The memory sub-system receives a three-phase power delivery system.
The TweakIT toggle and power/reset switches carry over from the Rampage Extreme board. This system lets you overclock on the fly from within Windows or even during applications when the CPU is loaded. Eight different solder points and pin-outs allow multimeter readings of DIMM, ICH, ICH PCIe, IOH, QPI, CPU PLL, and Core voltages.
The ICH10R Southbridge is utilized and provides the six SATA ports (dark blue) along with RAID 0,1,5,10. ASUS reverted to the JMicron JMB363 for an extra SATA port (black), an eSATA port on the IO panel and IDE duties. The iROG chipset returns and offers the same features as before, on-board LED control, time keep function, BIOS flashback, additional voltage controls, and a temperature based protection scheme if you enable it.
ASUS includes two PCI Express 2.0 x1 slots, three x16 PCIe 2.0 slots (dual x16 or tri x16/x8/x8), and a lonely PCI slot. Tri-Crossfire and SLI support is included, we just need better drivers from AMD/NVIDIA to recognize the graphics potential of this platform. If you utilize double slot GPU cards, the second PCIe x1 slot and the PCI slot will be physically unavailable with a CF or SLI setup.
The black PCIe x1 slot doubles as the HD Audio slot that features the ADI SoundMAX 2000B chipset with support for Creative X-FI 4.0 routines via a software implementation. This is the last hurrah for the ADI chipset as they have exited the on-board audio business but will continue to provide support into the near future.
Below the ROG silkscreen is the VTT CPU Power Card. The second heatsink is for the X58 chipset and works quite well in early testing. However, if you are running a CF or SLI setup and need the first PCIe x1 slot for audio or other purposes, you are out of luck as the last set of fins on the heatsink blocks full-length cards. We hope that ASUS will address this before commencing retail production.
The IO panel is standard and almost legacy free. The PS/2 keyboard port is a nod to the overclocking crowd as is the clear CMOS switch. Six USB 2.0 ports are available along with six more via headers on the motherboard. An IEEE 1394a port courtesy of the fast VIA VT8308P chipset and the eSATA port via a JMicron 363 are included along with dual RJ-45 ports sporting the Marvell 88E8056-NNC1 controller chips that offer teaming capability.
CPU Real Estate
The CPU socket area is crowded but manageable for most cooling setups. ASUS utilizes their “16-phase” power delivery system along with a 3-phase system for the Northbridge. The EPU2 design allows switching between four or sixteen phases to save energy although we think anyone with this board is probably not concerned with it. The board utilizes a combination of Fujitsu ML and Solid Aluminum capacitors.
That concludes our quick overview of the ASUS Rampage II Extreme board. We will be back shortly with full reviews of several X58 boards.
|CPU:||AMD Phenom 9950 (2.6GHz); Intel Core i7-965 (3.2GHz); Intel Core i7-940 (2.93GHz); Intel Core i7-920 (2.66GHz); Intel Core 2 Extreme QX9770 (3.2GHz/1600MHz); Intel Core 2 Quad Q9650 (3.00GHz/1333MHz); Intel Core 2 Quad Q9450 (2.66GHz/1333MHz)|
|Motherboard:||ASUS P6T Deluxe (Intel X58); Intel DX58SO (Intel X58); Intel DX48BT2 (Intel X48); Gigabyte GA-MA790FX (AMD 790FX)|
|Chipset:||Intel X48|
|Chipset Drivers:||Intel 188.8.131.520 (Intel); AMD Catalyst 8.9|
|Hard Disk:||Intel X25-M SSD (80GB)|
|Memory:||G.Skill DDR2-800 2 x 2GB (4-4-4-12); Qimonda DDR3-1066 4 x 1GB (7-7-7-20)|
|Video Card:||eVGA GeForce GTX 280|
|Video Drivers:||NVIDIA ForceWare 180.43 (Vista64); NVIDIA ForceWare 178.24 (Vista32)|
|Desktop Resolution:||1920 x 1200|
|OS:||Windows Vista Ultimate 32-bit (for SYSMark); Windows Vista Ultimate 64-bit|
Overclocking: The Initial Results
Our first overclock test with the top Core i7 processor, the 965XE, suggests an average high-end air or water-cooled, 24/7 stable overclock of about 4.2GHz (roughly the same as the latest Yorkfield-based, quad-core processors available right now). Our VCore setting had to increase to 1.5V for stable operation, a significant jump from the 1.4125V required at 4GHz on this board. We set QPI/DRAM to 1.325V and VDimm to 1.65V. Our top clock with this particular CPU on our Freezone Elite cooler is 4.5GHz with 1.55V.
Our second overclock test utilized the 920 processor that clocks in at a stock 2.66GHz. The multiplier is locked on this CPU and the 940 model so overclocking is done via Bclk. We were able to reach a 24/7 stable 3.8GHz overclock on 1.5V with memory at DDR3-1520 (7-7-7-20) on 1.675V (BIOS 0503 raised our voltage requirements). We think 4GHz is possible on this board with additional tuning and a BIOS update. However, Bclk is limited to around 200~220 on the current i7 series, so additional headroom is probably limited on this CPU. Even so, performance was excellent during overclocking and a 3.6GHz overclock was possible with 1.425V. This CPU reminds us of the Q6600 at launch, an excellent overclocker that continues to be a bargain.
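As a rough illustration of why the Bclk ceiling matters on a multiplier-locked chip, the sketch below works through the arithmetic. It rests on assumptions not stated in the text: a stock Bclk of about 133.33MHz and a fixed 20x multiplier for the i7-920 (2.66GHz / 133MHz), ignoring any extra Turbo bin. The memory ratio is simply inferred from the DDR3-1520 figure quoted above.

```python
STOCK_BCLK_MHZ = 133.33   # assumption: stock base clock of the i7-920
I7_920_MULTIPLIER = 20    # assumption: 2.66GHz / 133MHz, Turbo ignored
BCLK_CEILING_MHZ = (200, 220)  # approximate limit range quoted in the text


def core_mhz(bclk_mhz, multiplier=I7_920_MULTIPLIER):
    """Core clock is simply base clock times the CPU multiplier."""
    return bclk_mhz * multiplier


def bclk_for(target_mhz, multiplier=I7_920_MULTIPLIER):
    """Base clock needed to hit a target core clock at a fixed multiplier."""
    return target_mhz / multiplier


for target in (3600, 3800, 4000):
    bclk = bclk_for(target)
    headroom = BCLK_CEILING_MHZ[0] - bclk
    print(f"{target}MHz needs Bclk {bclk:.0f}MHz "
          f"({headroom:.0f}MHz below the low end of the ceiling)")

# Inferred memory ratio for the 3.8GHz result: DDR3-1520 on a 190MHz Bclk.
memory_ratio = 1520 / bclk_for(3800)
print(f"implied memory ratio at 3.8GHz: {memory_ratio:.0f}x Bclk")
```

Under these assumptions 4GHz already requires a 200MHz Bclk, which is right at the low end of the quoted limit and leaves little room beyond it.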
Although the new architecture allows for near-independent tuning of the processor cores and the underlying memory subsystem, mastering your system will still require a fair amount of patience and consideration. There are numerous tuning tradeoffs with this platform that we are still working through at this time. One such tradeoff is determining whether to use high CPU multipliers and standard memory ratios when overclocking versus a lower CPU multiplier and high Bclk (bus speed) combination.
In early testing, we have found advantages to both methods depending on the application. It appears right now that a combination of higher Bclk with a lower CPU multiplier will provide a slightly better performing platform, if you can properly balance the memory timings, voltages, and speed. We will help direct your efforts in this process by providing a complete overclocking guide for Core i7 shortly.
General Application Performance
SYSMark 2007 is an application benchmark suite that plays back real world usage scenarios in four categories (E-Learning, Video Creation, Productivity and 3D), using the following applications:
Adobe After Effects 7
Adobe Illustrator CS2
Adobe Photoshop CS2
AutoDesk 3ds Max 8
Macromedia Flash 8
Microsoft Excel 2003
Microsoft Outlook 2003
Microsoft PowerPoint 2003
Microsoft Word 2003
Microsoft Project 2003
Microsoft Windows Media Encoder 9 series
Sony Vegas 7
Performance is measured in each individual category and then an overall score is reported.
We don't see a huge performance increase thanks to Nehalem, we're looking at an average boost of 7 - 12% at the same clock speed as Penryn. The Core i7-965 managed a 7% performance advantage over the QX9770, while the i7-920 pulled 12% on the identically clocked Q9450.
The biggest performance boost is naturally in the 3D suite, the rest of the applications are showing 5 - 10% performance boosts at the same clock.
3D Rendering Performance
Our first 3D rendering test is POV-Ray 3.7 beta 29 and its SMP benchmark, the performance is measured in ray traced pixels per second:
As we've already seen, Nehalem's multi-threaded 3D rendering performance is absolutely insane - the $284 Core i7-920 is faster than the $1400 Core 2 Extreme QX9770.
Next up we have Cinebench with both its single and multi-threaded rendering tests:
Nehalem's single-threaded performance is still improved over Penryn, here we're seeing a 13.7% increase in performance at 3.2GHz.
Toss more threads at the i7 and the performance boost jumps to 34%, once again the i7-920 is faster than the QX9770.
Our final 3D rendering benchmark is the SPECapc 3dsmax 8 CPU rendering test run on 3dsmax 9:
That's another 30%+ advantage for Nehalem. If you do a lot of 3D rendering on your system, Intel is going to give you $1400 worth of performance for $284. Merry Christmas.
Video and Media Encoding Performance
Nehalem is happiest in 3D rendering applications, but video encoding is a close second place. Our first video encoding test is Tech ARP's x264 HD benchmark, which does a test encode of a 720p source file using the x264 codec. We're reporting results from the 0.59.819 version of x264.
Here Nehalem can "only" deliver a 17% performance advantage over Penryn, but once again the $284 i7-920 is faster than the $1400 QX9770.
Our DivX test is the same DivX / XMpeg 5.03 test we've run for the past couple of years now, the 1080p source file is encoded using the unconstrained DivX profile, quality/performance is set balanced at 5 and enhanced multithreading is enabled:
You know the drill, Intel introduces a new architecture and obsoletes its old one - the i7-920 is looking mighty tempting...
The WME test gives us a little dose of reality, the i7-920 is only as fast as the QX9770 here:
Although this isn't a video encoding test I wanted to include a lighter workload to illustrate a situation where Nehalem is no faster than its predecessor. The iTunes benchmark is a simple WAV to MP3 encode and Nehalem isn't able to dominate here:
Penryn pulls slightly ahead at the top end but at 2.66GHz it falls slightly behind. In both cases the point is that you'll be faster with a QX9770 than an i7-920, and you'll be no faster with a Nehalem vs. Penryn.
We have already teased you with the results of our gaming benchmarks, but sometimes charts are easier to look at than tables. I should stress that to get any of these modern games to be even remotely CPU bound I had to drop resolution and image quality, which is fine for this as we're trying to evaluate whether or not Nehalem is architecturally faster. In the real world however, you'll not see any performance difference in any of these titles with Nehalem over Penryn.
We start off with our Age of Conan benchmark. This is a Fraps test: we take our character and swim him to shore and back, measuring performance during the process.
Note that to get this game to be at all CPU bound we had to drop to medium quality and run at 1280 x 1024:
This is the only game where we see Nehalem boast such a tremendous performance advantage. I suspect that it's extremely sensitive to memory latency for some reason, resulting in the i7-920 being faster than the QX9770.
Our GRID benchmark is a Fraps test that measures frame rate at the very beginning of a race with our car starting at the back and then crashing into a wall:
While racing games are usually great physics tests, GRID just wasn't CPU bound enough to show serious differences between the CPUs. Nehalem and Penryn are basically no different here.
For Crysis we ran the built in CPU2 benchmark on version 1.21 of the game:
Like GRID, Nehalem offers nothing over Penryn in the performance department.
Far Cry 2's built in benchmark tool using the Ranch small test brings us our next set of numbers:
Here we have another game that favors the Core i7's architecture, but as we just saw it's not an across the board sort of win as some games miss Penryn's larger L2 cache.
Our last benchmark is a walk towards Megaton in Fallout 3:
The edge goes to Intel's Core i7 once again.
Overall in gaming tests the situations where Nehalem was faster than Penryn outnumbered those where it wasn't, but upgrading to Nehalem for faster gaming performance doesn't make sense. We were entirely too GPU bound in all of these titles, if you want Nehalem it should be because of its performance elsewhere.
Expecting a sequel to be a reincarnation of the original is just setting yourself up for disappointment. A good sequel will be able to stand on its own, independent of whatever may have come before it. Nehalem is Intel's Dark Knight, it lacks the reinvention that made Conroe so incredible, but it continues what was started in 2006.
The Core i7's general purpose performance is solid, you're looking at a 5 - 10% increase in general application performance at the same clock speeds as Penryn. Where Nehalem really succeeds however is in anything involving video encoding or 3D rendering, the performance gains there are easily in the 20 - 40% range. Part of the performance boost here is due to Hyper Threading, but the on-die memory controller and architectural tweaks are just as responsible for driving Intel's performance through the roof.
The iTunes results do paint a downside to Nehalem, there are going to be some situations where Intel's new architecture doesn't offer a performance advantage over its predecessor. If you're not doing a lot of 3D rendering or video encoding work and you already have a Core 2 Quad, the upgrade to Nehalem won't be worth it. If you're still stuck on a Pentium 4 or something similarly slow by today's standards, a jump to Nehalem would be warranted.
Gaming performance is actually better than expected for Nehalem, there were enough cases where the new architecture pulled ahead despite its very small L2 cache that I wouldn't mind recommending it for gamers. In most GPU limited situations however you won't see any performance improvement, at least with today's GPUs, over Penryn.
While posting some very impressive performance gains, Nehalem is nearly as much about efficiency. Hyper Threading alone delivers a 0 - 30% increase in performance at a 0 - 15% increase in power consumption; the problem is that Nehalem's efficiency is only as good as its performance and in those areas where Nehalem can't outperform Penryn, its power efficiency suffers.
I can't help but wonder if what we saw with the QX9770 is indicative of a larger Nehalem advantage, if Penryn's power consumption truly does increase dramatically as clock speed goes up, while Nehalem is able to reel it back in. If that is indeed the case, then Nehalem is even more important for the future of the Core microarchitecture than I originally thought. You could consider it the reverse-Prescott in that case, if its design choices are meant to keep power consumption under control as clock speed ramps up.
It seems odd debating over the usefulness of a processor that can easily offer a 20 - 40% increase in performance, the issue is that the advantages are very specific in their nature. While Conroe reset the entire board, Nehalem is very targeted in where it improves performance the most. That is one benefit of the tick-tock model however, if Intel was too aggressive (or conservative?) with this design then it only needs to last two years before it's replaced with something else. I am guessing that once Intel moves to 32nm however, L2 cache sizes will increase once more and perhaps bring greater performance to all applications.
Quite possibly the biggest threat to Nehalem is that, even at the low end, $284 is a good amount for a microprocessor these days. You can now purchase anything in AMD's product line for less than $180 and the cost of entry to a Q9550 is going to be lower, at least at the start, than a Core i7 product. There's no denying that the Core i7 is the fastest thing to close out 2008, but you may find that it's not the most efficient use of money. The first X58 motherboards aren't going to be cheap and you're stuck using more expensive DDR3 memory. If you're running applications where Nehalem shines (e.g. video encoding, 3D rendering) then the ticket price is likely worth it, if you're not then the ~10% general performance improvement won't make financial sense.
It also remains to be seen what will happen to the Nehalem market once Intel introduces the LGA-1156 version next year for lower price points. By introducing a $284 part this early Intel appears to be courting the Q6600/Q9450/Q9550 buyers to the LGA-1366 platform, which would mean that the two-channel Nehalems are strictly value parts and perhaps there won't be much fragmentation in the market as a result.
Intel has two thirds of the perfect trifecta here. Nehalem brings the ability to work on more threads at a time, redefining video encoding and 3D rendering performance, and its SSDs shook the storage world. That just leaves Larrabee...