Original Link: http://www.anandtech.com/show/2663

Nehalem: The Unwritten Chapters

by Anand Lal Shimpi on November 7, 2008 12:00 AM EST


Despite our being extremely well prepared, with Nehalem, motherboards, coolers and memory in hand well before launch, the run-up to the NDA lift of Intel's Core i7 processors was stressful. There was so much to test: multi-GPU compatibility with X58, memory controller performance, general application performance, overclocking, Hyper Threading and more.

We're all still hard at work on sorting out the details. Gary is working on an X58 motherboard roundup and has been testing 12GB memory configurations for the past several days (as well as working with board vendors to improve performance/compatibility with 12GB, but I'll let him tell you about that), Derek is working on multi-GPU performance and Kris has been working on an overclocking guide. What have I been up to? Well, I've been trying to answer a few lingering questions about Nehalem.

What I've got today are the first results of the questions I've been asking. I've spent the past week looking at power efficiency and memory latency, and talking to some of Intel's finest on the phone about Nehalem. Now I'm back to report, so gather 'round for Nehalem: The Unwritten Chapters.

The Uncore

I got a little more detail from Intel on the uncore clock. Just like Phenom, Intel’s Core i7 is divided into an area called the “core” and an area called the “uncore”. The core contains the individual processor cores and their L1/L2 caches, while the uncore houses the memory controller and the shared L3 cache. In our review I mentioned that the uncore runs at 2.66GHz, which is true, but only for the Core i7-965. The Core i7-940 and 920 both run the uncore at 2.13GHz.

The uncore clock is defined by Intel just like the core clock is - Intel sets it based on yield and performance targets. As I mentioned in the launch review, the uncore clock runs at a simple multiplier of the bclk (133MHz): 20x for the i7-965 and 16x for the i7-940/920. The uncore also runs at its own voltage (1.20V) and that voltage doesn't scale up/down.
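If you'd like to check that arithmetic, here is a minimal Python sketch. The function and constant names are purely illustrative, and it uses the rounded 133MHz bclk quoted above rather than any value read from real hardware.

```python
# Base clock (bclk) in MHz, using the rounded figure quoted in the text.
BCLK_MHZ = 133

# Uncore multipliers as stated in the text.
UNCORE_MULTIPLIER = {"i7-965": 20, "i7-940": 16, "i7-920": 16}


def uncore_clock_ghz(multiplier: int, bclk_mhz: float = BCLK_MHZ) -> float:
    """Return the uncore clock in GHz for a given bclk multiplier."""
    return multiplier * bclk_mhz / 1000


for cpu, mult in UNCORE_MULTIPLIER.items():
    print(f"{cpu}: {mult} x {BCLK_MHZ}MHz = {uncore_clock_ghz(mult):.2f}GHz")
```

The 20x and 16x multipliers work out to the 2.66GHz and 2.13GHz uncore clocks mentioned earlier.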

On Intel’s own X58 board the uncore clock is configured on the memory settings page and is simply called UCLK.

I took the i7-965, ran it at 2.66GHz to simulate an i7-920, and varied the uncore clock to measure the impact on L3 cache and memory latency:

Core Clock | Uncore Clock | L3 Latency | Main Memory Latency | x264 HD Benchmark | Cinebench XCPU Benchmark
2.66GHz | 2.93GHz | 34 cycles | 143 cycles | 72.8 fps | 13456
2.66GHz | 2.66GHz | 36 cycles | 148 cycles | 73.0 fps | 13429
2.66GHz | 2.13GHz | 41 cycles | 159 cycles | 72.7 fps | 13182

 

At a 2.66GHz uncore clock things seem to hit a sweet spot, although the translation to real-world performance just isn't there. Perhaps in a very memory intensive test we'd see something more pronounced, but even the x264 HD encoding test showed no performance difference between the three uncore clock speeds.
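To put numbers on how little the latency gains translate into application performance, here is a rough Python sketch that compares the slowest and fastest uncore settings from the table above. The dictionary layout and helper name are just for illustration.

```python
# Measurements from the table above, keyed by uncore clock.
results = {
    "2.93GHz": {"L3 latency (cycles)": 34, "Memory latency (cycles)": 143,
                "x264 HD (fps)": 72.8, "Cinebench": 13456},
    "2.66GHz": {"L3 latency (cycles)": 36, "Memory latency (cycles)": 148,
                "x264 HD (fps)": 73.0, "Cinebench": 13429},
    "2.13GHz": {"L3 latency (cycles)": 41, "Memory latency (cycles)": 159,
                "x264 HD (fps)": 72.7, "Cinebench": 13182},
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


slow, fast = results["2.13GHz"], results["2.93GHz"]
for metric in slow:
    change = percent_change(slow[metric], fast[metric])
    print(f"{metric}: {change:+.1f}% going from 2.13GHz to 2.93GHz uncore")
```

Going from the slowest to the fastest uncore setting cuts L3 latency by roughly 17% and memory latency by about 10%, while Cinebench moves by only about 2% and x264 barely changes at all.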

Surprisingly enough, I couldn’t get the i7-965’s uncore to hit 3.2GHz - Vista would bluescreen before I could even get to the desktop (note that the Intel X58 board I was using did not support adjusting the uncore voltage, so it remained at stock). As the table above shows, increases in uncore frequency aren't nearly as useful as increasing the CPU frequency. Intel recognized this performance relationship as well and chose to optimize the uncore for power consumption, not clock speed, which means that the uncore won't be able to clock as high as the core itself. You could always increase the voltage a lot to try and boost uncore speed but right now it's not looking like the tradeoff would be worth it as you'd increase power quite a bit.



The Overclocking Story: Much Ado About Nothing

This one is a complete non-issue, but it's worth explaining. Intel's Turbo mode allows the Core i7 to ratchet up its clock speed by 133MHz or 266MHz depending on how many cores are active and if the CPU is cool enough. Every Core i7 is guaranteed to be able to work at up to 133MHz faster than its native clock speed if more than one core is active, and up to 266MHz faster if only a single core is active, provided the chip is cool enough in both cases.

It turns out that Turbo mode is governed by more than just temperature, however. Both current draw and TDP are monitored to make sure that the CPU isn't exceeding its designed specifications when running in Turbo mode. If either value is exceeded then the chip will automatically reduce its clock multiplier back to its stock setting to avoid damaging the CPU. It's sort of like the overheating protection that Intel has had on its CPUs since the Pentium 4 days; if the chip gets too hot, it underclocks itself until it's cool again.
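The rules described above can be written down as a toy model in Python. To be clear, this is only a sketch of the behavior as described in this article, not Intel's actual control logic; the class, field and function names are hypothetical, and the real chip presumably evaluates these conditions continuously in hardware.

```python
from dataclasses import dataclass

BCLK_MHZ = 133  # base clock as quoted in this article


@dataclass
class ChipState:
    """Hypothetical snapshot of the conditions Turbo mode is said to check."""
    active_cores: int
    cool_enough: bool
    within_current_limit: bool
    within_tdp_limit: bool


def turbo_steps(state: ChipState) -> int:
    """Number of 133MHz steps above stock that this toy model allows."""
    if not (state.cool_enough and state.within_current_limit
            and state.within_tdp_limit):
        return 0  # fall back to the stock multiplier
    # One active core gets two steps (266MHz); otherwise one step (133MHz).
    return 2 if state.active_cores == 1 else 1


def effective_clock_mhz(stock_multiplier: int, state: ChipState) -> int:
    """Core clock in MHz after applying the allowed Turbo steps."""
    return (stock_multiplier + turbo_steps(state)) * BCLK_MHZ


# Example: a 20x part (2.66GHz stock) under three different conditions.
print(effective_clock_mhz(20, ChipState(1, True, True, True)))   # single core
print(effective_clock_mhz(20, ChipState(4, True, True, True)))   # all cores
print(effective_clock_mhz(20, ChipState(4, True, True, False)))  # over TDP
```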

The concern was that because of these TDP and current draw limitations, you would not be able to take lower end Core i7 processors and overclock them. The $999 Core i7-965 Extreme Edition doesn't have this problem as you can manually configure both the max TDP and current draw values, just like you can adjust its clock multiplier.

It turns out that the concerns are unfounded - all X58 motherboards should ship with a BIOS setting that tells the CPU to ignore its TDP/current limits. On the Intel X58 board the setting looks like this:


Enable this feature and overclocking the i7 is completely limitless

On the ASUS P6T Deluxe it looks like this:

To measure its impact I took a Core i7-920, kept the feature disabled, and tried to overclock it until I hit either the TDP or current limit. Turbo mode made this easier as it would still attempt to boost the frequency of the processor by a multiplier step, even when overclocked. Once I hit the TDP limit, Turbo mode wouldn't activate. On my particular chip the limit was 3.7GHz at 1.348V. Running Cinebench at this frequency, the CPU would try to clock up to 3.89GHz but fall right back down thanks to hitting these hardcoded limits.
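For the curious, the 3.89GHz figure falls straight out of the multiplier math. The short Python sketch below assumes the i7-920's stock 20x multiplier (2.66GHz at a 133MHz BCLK) and a single Turbo step to 21x; the variable names are illustrative only.

```python
def core_clock_mhz(bclk_mhz: float, multiplier: int) -> float:
    """Core clock in MHz for a given BCLK and multiplier."""
    return bclk_mhz * multiplier


STOCK_MULTIPLIER = 20  # assumed: Core i7-920 stock multiplier (20 x 133MHz)
OVERCLOCK_MHZ = 3700   # the 3.7GHz limit reported above

# BCLK implied by running 3.7GHz on the stock multiplier.
bclk = OVERCLOCK_MHZ / STOCK_MULTIPLIER

# Clock the chip would reach if Turbo added one multiplier step.
turbo = core_clock_mhz(bclk, STOCK_MULTIPLIER + 1)

print(f"Implied BCLK: {bclk:.0f}MHz")
print(f"Turbo target: {turbo / 1000:.3f}GHz")
```

That works out to a 185MHz BCLK and a Turbo target of 3.885GHz, which rounds to the 3.89GHz mentioned above.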


We're running into that TDP/current wall, otherwise we'd have a 21x multiplier here

I rebooted the system, went back into the BIOS and disabled the protection. I ran the Cinebench test again and, whaddyaknow, Turbo mode was operational again:


Eureka! It works.

With the TDP/current limits ignored, my Core i7-920 could clock higher, just as you'd expect. Now remember that increases in voltage result in tremendous increases in power consumption, so feeding a lot of voltage to these things in an effort to hit higher clock speeds will ruin the power efficiency of your system. But you'll get the performance, and Intel has done nothing to limit overclocking with the Core i7.

If you're curious, I was able to hit 3.3GHz (165MHz BCLK x 20) without so much as increasing the core voltage on my Core i7-920. Nearly 4GHz with a hefty boost in Vcc didn't require much effort, although I would personally opt for a milder voltage overclock.



Intel's Secret: Nehalem Can Be Very Power Efficient

I tried an experiment while I was testing Nehalem: I recorded power consumption while running every single benchmark I ran for the review. I did the same for Intel's Core 2 Extreme QX9770 and compared the two. I published an abridged version of these results in the review, basically showing that the Core i7-965 offered much better power consumption, across the board, than the equivalently clocked QX9770, while the Core i7-920 was outshone by the Q9450, which drew less total system power. Both datapoints were valid but there were too many unanswered questions to draw any serious conclusions at that point. I've met with Intel several times since the review went live, tested and retested processors, and I believe I've come up with an understanding of what's going on from a power standpoint with Nehalem.

All three of Intel's Core i7 CPUs that will be available at launch this month are 130W TDP parts. At 3.2GHz that's expected, but at 2.66GHz that's a bit high compared to Intel's other quad-core 2.66GHz processors on the market. The Core 2 Quad Q9450, for example, has a 95W TDP and runs at 2.66GHz. The lower TDP is made possible by a lower core voltage, which is enabled by the fact that Intel has been building quad-core Penryns for a while and yields are high enough that driving core voltage down is possible. The same will eventually happen to the Core i7, but it's such a new design, such a radical departure from Intel's previous Core based CPUs and so early in the manufacturing process that there simply hasn't been time to get yields high enough to produce sub-100W TDP 2.66GHz parts.


Multiple sample points are necessary for proper analysis...


...plus, lots of Nehalems are more fun

The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V. Power consumption of a CPU at a fixed clock speed is proportional to the square of the voltage, so despite whatever power efficiencies Intel has included in Nehalem they will not outweigh a Penryn running at a lower core voltage. So we'd expect the Core 2 Quad Q9450 to have lower power consumption than the Core i7-920, at least today, until Intel can get a competitively low TDP 920 out on the market. But what about the i7-965?
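To get a feel for how much that voltage gap matters, here is a back-of-the-envelope Python sketch of the square-law relationship. It is a rough illustration only: it compares the two minimum voltages quoted above, assumes everything else (clock speed, switched capacitance, leakage) is equal between the two chips, which it certainly isn't in reality, and uses names I made up for the example.

```python
def relative_dynamic_power(v_new: float, v_ref: float) -> float:
    """Ratio of dynamic power at v_new versus v_ref at a fixed clock (P ~ V^2)."""
    return (v_new / v_ref) ** 2


Q9450_MIN_VOLTAGE = 0.85     # volts, as quoted above
I7_920_MIN_VOLTAGE = 1.137   # volts, as quoted above

ratio = relative_dynamic_power(I7_920_MIN_VOLTAGE, Q9450_MIN_VOLTAGE)
print(f"Square-law power ratio at minimum voltage: {ratio:.2f}x")
```

On voltage alone, the square law puts the higher minimum voltage at roughly 1.8x the dynamic power, which is the kind of gap architectural efficiencies would struggle to close.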

The Core 2 Extreme QX9770 has a 136W TDP, slightly higher than the 130W TDP of the Core i7-965, and both run at the same 3.2GHz frequency. This comparison gave me some very interesting data. Look at the power consumption numbers across all of the benchmarks (note that this is average system power, recorded over the entire benchmark run for each test):

Test | Intel Core 2 Extreme QX9770 (3.2GHz) | Intel Core i7-965 (3.2GHz)
Idle | 138.7W | 105.5W
POV-Ray | 230.7W | 240.4W
Cinebench (1 thread) | 194.3W | 168.3W
Cinebench (max threads) | 227.6W | 230.7W
3dsmax 9 SPECapc CPU test | 220.1W | 209.4W
x264 HD Encode Test | 230.3W | 196.2W
DivX 6.8.3 | 221.7W | 202.1W
Windows Media Encoder | 249W | 201.2W
Age of Conan | 306.2W | 267.3W
Race Driver GRID | 348.8W | 302W
Crysis | 293.6W | 248.5W
FarCry 2 | 324.2W | 271.9W
Fallout 3 | 303.2W | 225W

 

When compared to the QX9770, the Core i7-965 at worst draws about the same amount of power or slightly more, and at best delivers a significant reduction in total system power consumption. There are only two cases where the QX9770 draws less power than the i7-965.

Note that the idle power on the i7-965 is very low. Achieving this requires enabling the QPI power management option in the X58 BIOS, which for whatever reason was disabled by default in our original review.
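If you want to verify the two exceptions yourself, this small Python sketch walks the power table above and lists the tests where the QX9770 system drew less. The data is copied from the table; the variable names are just for illustration.

```python
# Average system power in watts: (QX9770, Core i7-965), from the table above.
power_w = {
    "Idle": (138.7, 105.5),
    "POV-Ray": (230.7, 240.4),
    "Cinebench (1 thread)": (194.3, 168.3),
    "Cinebench (max threads)": (227.6, 230.7),
    "3dsmax 9 SPECapc CPU test": (220.1, 209.4),
    "x264 HD Encode Test": (230.3, 196.2),
    "DivX 6.8.3": (221.7, 202.1),
    "Windows Media Encoder": (249.0, 201.2),
    "Age of Conan": (306.2, 267.3),
    "Race Driver GRID": (348.8, 302.0),
    "Crysis": (293.6, 248.5),
    "FarCry 2": (324.2, 271.9),
    "Fallout 3": (303.2, 225.0),
}

# Tests where the QX9770 system drew less power than the i7-965 system.
qx_lower = [name for name, (qx, i7) in power_w.items() if qx < i7]

for name in qx_lower:
    qx, i7 = power_w[name]
    print(f"{name}: QX9770 draws {i7 - qx:.1f}W less")
```

Both exceptions are heavily threaded tests, and the margins are under 10W of total system power.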

If you want to look at performance, here is the corresponding performance data to that power data:

Test | Intel Core 2 Extreme QX9770 (3.2GHz) | Intel Core i7-965 (3.2GHz)
POV-Ray | 2641 PPS | 4202 PPS
Cinebench (1 thread) | 3937 CBMarks | 4475 CBMarks
Cinebench (max threads) | 14065 CBMarks | 18810 CBMarks
3dsmax 9 SPECapc CPU test | 13.1 | 17.6
x264 HD Encode Test | 73.2 fps | 85.8 fps
DivX 6.8.3 | 42.4 seconds | 32.8 seconds
Windows Media Encoder | 29 seconds | 24 seconds
Age of Conan | 107.9 fps | 123 fps
Race Driver GRID | 103.0 fps | 102.9 fps
Crysis | 41.7 fps | 40.5 fps
FarCry 2 | 102.6 fps | 115.1 fps
Fallout 3 | 77.2 fps | 83.2 fps

 

When the i7-965 significantly outperforms the QX9770, its power consumption is around the same - thus giving us much better performance per watt. When the i7-965 can't really outperform the QX9770, for example in some of the gaming benchmarks, the total system power consumption is much lower.
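Here is one way to turn the two tables into performance per watt, sketched in Python with a pair of examples. Keep in mind that these are total system power numbers, so the result describes the whole platform rather than the CPU alone. For the timed encode I've assumed that average power multiplied by run time is a fair estimate of energy used; the helper names are hypothetical.

```python
def perf_per_watt(score: float, watts: float) -> float:
    """Benchmark score per watt of average system power (higher is better)."""
    return score / watts


def energy_joules(seconds: float, watts: float) -> float:
    """Approximate energy for a timed test: average power x run time."""
    return seconds * watts


# POV-Ray: (score in PPS, average system watts) from the tables above.
pov_ray = {"QX9770": (2641, 230.7), "Core i7-965": (4202, 240.4)}

# DivX 6.8.3: (run time in seconds, average system watts).
divx = {"QX9770": (42.4, 221.7), "Core i7-965": (32.8, 202.1)}

for cpu, (score, watts) in pov_ray.items():
    print(f"POV-Ray, {cpu}: {perf_per_watt(score, watts):.1f} PPS per watt")

for cpu, (seconds, watts) in divx.items():
    print(f"DivX, {cpu}: about {energy_joules(seconds, watts):.0f} joules per encode")
```

In POV-Ray the i7-965 system draws slightly more power but ends up around 50% ahead in performance per watt, and in the DivX encode it both finishes sooner and draws less while doing so.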

I confirmed that I didn't have a particularly low power Core i7-965 by testing multiple chips, and Intel confirmed that my QX9770 fell within the middle of its distribution for power characteristics of all QX9770s. It looks extremely probable that at the same TDP level, Nehalem has the ability to be much more power efficient than even Penryn, all without so much as a die shrink. Remember that both of these CPUs are built on the same 45nm process.



Oooh, Shiny - But Why?

Remember this slide?

How about this one?

I referenced both in the Core i7 review, alluding to the possibility that those fundamental design changes would give the Core i7 much better power efficiency than Core 2. However, in speaking to Intel's Nehalem architects and power engineers I came to the realization that those very design changes wouldn't be solely responsible for the sorts of power efficiency gains I showed on the previous page. If you look at maximum power consumption as a hard limit, for example the 130W TDP, Nehalem's designers have to somehow - without the benefits of a die shrink - improve performance without increasing power.

Since Core i7 is a "tock" processor you just get the new architecture; you don't get the benefits of Moore's law since it's still a 45nm chip. With no help from the manufacturing process, Nehalem's architects must create ways to save power and then spend the power savings on improving performance. Switching to an all static CMOS design and a more power efficient cache are two examples of ways that the Nehalem architects won themselves a bigger power budget, without increasing the total TDP of the chip. The architects then promptly spent their power savings on more performance; since the market has already accepted a 130W TDP part, simply delivering lower power but with no additional performance wouldn't make any sense. It's because of this that we're able to see these 20 - 60% increases in performance without correspondingly large increases in power consumption.

So why then is the Core i7-965 so much more power efficient than the QX9770? The answer actually boils down to the architectural level decisions made in Nehalem. Remember the power gate transistors?

With these transistors Intel can effectively shut off an entire core if it is idle, cutting it off completely from being a power drain. At the same TDP, for applications that don't use all four cores, Intel's Core i7 should draw less power than any Core 2 before it, and we see this in the single-threaded Cinebench test as well as the gaming tests:

Test | Intel Core 2 Extreme QX9770 (3.2GHz) | Intel Core i7-965 (3.2GHz)
Idle | 138.7W | 105.5W
Cinebench (1 thread) | 194.3W | 168.3W
Age of Conan | 306.2W | 267.3W
Race Driver GRID | 348.8W | 302W
Crysis | 293.6W | 248.5W
FarCry 2 | 324.2W | 271.9W
Fallout 3 | 303.2W | 225W

 

The Cinebench test is single threaded so only one core is active at any time and only a few of the gaming tests can keep all four cores busy, thus giving the Core i7 the ability to be far more power efficient than Intel's Core 2 Extreme QX9770.

But what about in the multi-threaded tests (or the gaming tests like FarCry 2 that actually stress all four cores)? Here, at worst, the Core i7 draws about the same amount of power as the Core 2 despite offering much better performance. In these situations we get a combination of things benefitting Nehalem. The memory controller is on-die and built on a 45nm process, instead of 90nm like on the QX9770's X48 chipset, which gives Nehalem an edge. The transistor design decisions, while mostly spent on increasing performance, can have an impact on power consumption here as well. Nehalem also has fewer transistors and a smaller cache, the majority of which runs slower than the cache in Penryn.

The sum of all of this is that at the same TDP value, with fewer than four cores fully active, Intel's Core i7 is capable of drawing a good 10 - 20% less total system power than the previous generation 45nm Core 2. With all cores pegged at 100%, the Core i7 tends to draw the same amount of power or a bit more, but performance is improved significantly in those cases thanks to Hyper Threading.
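As a sanity check on that range, the Python sketch below computes the percentage savings for idle and the lightly threaded tests from the power table earlier on this page. The data is copied from that table, and the helper name is just for illustration.

```python
# Average system power in watts: (QX9770, Core i7-965), for idle and the
# lightly threaded tests in the table earlier on this page.
power_w = {
    "Idle": (138.7, 105.5),
    "Cinebench (1 thread)": (194.3, 168.3),
    "Age of Conan": (306.2, 267.3),
    "Race Driver GRID": (348.8, 302.0),
    "Crysis": (293.6, 248.5),
    "FarCry 2": (324.2, 271.9),
    "Fallout 3": (303.2, 225.0),
}


def percent_saved(qx_watts: float, i7_watts: float) -> float:
    """How much less system power the i7-965 setup drew, in percent."""
    return (qx_watts - i7_watts) / qx_watts * 100


savings = {name: percent_saved(qx, i7) for name, (qx, i7) in power_w.items()}

# Print from smallest to largest saving.
for name, pct in sorted(savings.items(), key=lambda item: item[1]):
    print(f"{name}: {pct:.1f}% less system power")
```

On these numbers the savings land between roughly 13% and 26%, with Fallout 3 and idle at the top end.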

It's interesting but not surprising that the Core i7's power story mimics its performance one: well threaded applications show huge improvements in power efficiency, but the unexpected benefit is that not-so-well-threaded applications can also showcase Core i7's more efficient power usage.



Final Words

The more I think about it, the more I'm confident that the Core i7 continues to fuel Intel's beacon of performance, although admittedly the biggest gains are in well threaded workloads (I will be working on a Hyper Threading/multi-tasking set of tests next). It's not worth the upgrade for most existing Core 2 Quad owners unless you do a lot of video encoding, video editing or 3D rendering, but going forward it looks very likely to continue Intel's performance lead even as AMD brings up its 45nm Phenom processors.

Take power efficiency into account, however, and Nehalem gets more interesting to more people. Right now we're only talking about 130W TDP parts, which means that the power efficiency really only applies to someone looking to replace a QX9770. Going forward, when Intel can deliver 95W, 65W or even lower TDP parts based on Nehalem, there may be a compelling power efficiency story. A 10 - 20% decrease in power consumption, at the same manufacturing process, is nothing to scoff at. Then a year from now we get the same architecture built on 32nm, which will hopefully give us an even further reduction in power consumption. It's weird to say, but Nehalem may end up being an incredibly good architecture for notebooks. Keep that in mind before buying those new MacBooks, guys.

The power efficiency story gets even more exciting when you realize that these gains come with no change in manufacturing process. Pardon the pun, but the next tick is going to be a cool one.

The overclocking story with Core i7 isn't as complex as it sounded at first; fundamentally you can still clock this thing the way you did the Core 2s before it. Turbo mode and the TDP/current limitations do add some complexities, but with the flip of a BIOS switch they go away if you don't wish to bother with them. Change can be scary, but in this case there's no reason to be worried.

The Core i7 appears to be just as smooth an overclocker as the Core 2s before it. Increase the BCLK and off you go, free performance from Intel and its wonderful fabs.

The split between the core and the uncore in terms of clock speed and overclocking potential doesn't appear to be that big of a deal either. The uncore runs slower on the lower end chips, but increasing its clock speed doesn't really do all that much for performance. There's a reason Intel kept the uncore running slower than the core and it doesn't look like there's much real world benefit in pushing it much higher.

With Nehalem Intel implemented a lot of changes simultaneously. We got Hyper Threading, a completely static CMOS design, new power gate transistors, QPI, an integrated memory controller and some other lower level architectural tweaks. It's a lot to digest, but we're getting there. To Intel: deliver us some 95W and 65W TDP Nehalems and you'll win the hearts of the current Q6600/Q9300/Q9450 owners.

And I can't wait to see one of these things in a notebook; mobile Nehalem could be the most exciting Centrino launch since Merom...
