Nehalem: The Unwritten Chapters

Author: Anand Lal Shimpi

by Anand Lal Shimpi on November 7, 2008 12:00 AM EST

Oooh, Shiny - But Why?

Remember this slide?

How about this one?

I referenced both in the Core i7 review, alluding to the possibility that those fundamental design changes would give the Core i7 much better power efficiency than Core 2. However, in speaking to Intel's Nehalem architects and power engineers, I came to realize that those design changes alone wouldn't be responsible for the sorts of power efficiency gains I showed on the previous page. If you treat maximum power consumption as a hard limit, for example the 130W TDP, Nehalem's designers somehow have to improve performance without increasing power, and without the benefits of a die shrink.

Since Core i7 is a "tock" processor, you just get the new architecture; you don't get the benefits of Moore's law since it's still a 45nm chip. With no help from the manufacturing process, Nehalem's architects must find ways to save power and then spend the power savings on improving performance. Switching to an all static CMOS design and a more power efficient cache are two examples of ways that the Nehalem architects won themselves a bigger power budget without increasing the total TDP of the chip. The architects then promptly spent their power savings on more performance; since the market has already accepted a 130W TDP part, simply delivering lower power with no additional performance wouldn't make any sense. It's because of this that we're able to see these 20 - 60% increases in performance without correspondingly large increases in power consumption.

So why then is the Core i7-965 so much more power efficient than the QX9770? The answer actually boils down to the architectural level decisions made in Nehalem. Remember the power gate transistors?

With these transistors Intel can effectively shut off an entire core if it is idle, cutting it off completely from being a power drain. At the same TDP, for applications that don't use all four cores, Intel's Core i7 should draw less power than any Core 2 Duo before it and we see this in the single-threaded Cinebench test as well as the gaming tests:

CPU	Intel Core 2 Extreme QX9770 (3.2GHz)	Intel Core i7-965 (3.2GHz)
Idle	138.7W	105.5W
Cinebench (1 thread)	194.3W	168.3W
Age of Conan	306.2W	267.3W
Race Driver GRID	348.8W	302W
Crysis	293.6W	248.5W
FarCry 2	324.2W	271.9W
Fallout 3	303.2W	225W

The Cinebench test is single-threaded, so only one core is active at any time, and only a few of the gaming tests can keep all four cores busy. That gives the Core i7 the ability to be far more power efficient than Intel's Core 2 Extreme QX9770 in these tests.

But what about in the multi-threaded tests (or the gaming tests like FarCry 2 that actually stress all four cores)? Here, at worst, the Core i7 draws about the same amount of power as the Core 2 despite offering much better performance. In these situations we get a combination of things benefitting Nehalem. The memory controller is on-die and built on a 45nm process, instead of 90nm like on the QX9770's X48 chipset, which gives Nehalem an edge. The transistor design decisions, while mostly spent on increasing performance, can have an impact on power consumption here as well. Nehalem also has fewer transistors and a smaller cache, the majority of which runs slower than the cache in Penryn.

The sum of all of this is that at the same TDP value, with fewer than four cores fully active, Intel's Core i7 is capable of drawing a good 10 - 20% less total system power than the previous generation 45nm Core 2. With all cores pegged at 100%, the Core i7 tends to draw the same amount of power or a bit more, but performance is improved significantly in those cases thanks to Hyper Threading.
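
If you want to check the savings for yourself, the arithmetic is simple. The short Python sketch below takes the total system power numbers from the table above and works out how much less the Core i7-965 system draws in each test, relative to the QX9770 system. The names `measurements` and `percent_saved` are just labels made up for this illustration; the only real data are the wattages from the table.

```python
# Total system power (watts) copied from the table above:
# test name -> (Core 2 Extreme QX9770, Core i7-965)
measurements = {
    "Idle": (138.7, 105.5),
    "Cinebench (1 thread)": (194.3, 168.3),
    "Age of Conan": (306.2, 267.3),
    "Race Driver GRID": (348.8, 302.0),
    "Crysis": (293.6, 248.5),
    "FarCry 2": (324.2, 271.9),
    "Fallout 3": (303.2, 225.0),
}


def percent_saved(core2_watts, i7_watts):
    """Reduction in system power relative to the Core 2 system, in percent."""
    return (core2_watts - i7_watts) / core2_watts * 100.0


# Hypothetical helper structure: savings per test, in percent.
savings = {
    name: percent_saved(core2, i7) for name, (core2, i7) in measurements.items()
}

for name, pct in savings.items():
    print(f"{name:<22} {pct:5.1f}% less system power on Core i7-965")
```

Run against the table, most of the loaded tests land at roughly 13 - 16%, with idle and Fallout 3 coming in higher than that. Keep in mind these are whole-system numbers measured at the wall, so they include the chipset, memory and graphics card, not just the CPU.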

It's interesting but not surprising that the Core i7's power story mimics its performance one: well threaded applications show huge improvements in power efficiency, but the unexpected benefit is that not-so-well-threaded applications can also showcase Core i7's more efficient power usage.

23 Comments

Denithor - Saturday, November 8, 2008 - link
HT works well on i7 because of two things: software is much more multithreaded today and there have been drastic throughput & memory controller improvements in the generations from Netbust to Nehalem.

Multithreaded applications can be accelerated hugely by pulling resources from multiple cores to work on one application (whether physical or virtual cores doesn't matter).

HT on Netbust was like fitting a garden hose onto a fire hydrant. The data just backed up and couldn't feed through the pipe smoothly. On i7 the bandwidth and memory controller have been optimized to improve flow so the cores don't sit idle (HT basically levels the flow of work across the cores so they all stay busy).
TA152H - Saturday, November 8, 2008 - link
Actually, you're probably missing the point that Nehalem is a lot wider than the Pentium 4 was. Consequently, in any given clock cycle you have more execution resources that a single thread probably leaves unused, and that an additional thread could put to work.

Most of the time, the data is read from the L1 cache, or, at worst, the L2 cache, so the memory throughput isn't going to be a huge problem most of the time. But, then again, the i7 has a bigger L1 cache, which probably helps as well. It's very slow though, and it makes you wonder why they shackled this processor with a very slow L1 cache (the same clock as a Pentium 4, but with much lower clock speed design). I mean, it can't clock higher than the Penryn, and the cache isn't any bigger, so does it need to be 33% slower? Power savings are nice, but not for a 33% slower L1 cache.

Also, I'm curious why Intel gave up on the Pentium 4 before the 45 nm production. If you think about it, the drastically lower power use of this manufacturing technology would have yielded enormous improvements in clock speed (since the limitation for it was not based on transistor switching speed, but on the power/heat). I don't think there's any doubt they'd be running over 6 GHz, and with some effective tweaks (and undoing some of the Prescott's damage) it might be an interesting processor. Probably not though, but I'm a little curious how it would pan out.
ltcommanderdata - Saturday, November 8, 2008 - link
Yes, I think HT fits well with Nehalem because of the increased execution resources, 3 ALUs, 2 FPUs, and 3 SSE units compared to 3 ALUs and 2 FPU/SSE units in Netburst. Although I think HT serves a different purpose in each design. Netburst didn't have as much memory bandwidth and its latency was higher so HT served to hide that, while Nehalem has plenty of memory bandwidth and execution resources and HT serves to best take advantage of those resources.

In regards to the high cache latency, I have to agree. I have yet to see an explanation of where the high L1 cache latency comes from. And the L2 cache latency is similarly unimpressive considering Dothan had a 2MB L2 cache per core with a 10 cycle latency while Nehalem's 256KB L2 cache per core has higher latency at 11 cycles. Granted that perhaps having a L3 cache forces limitations on the caches, but I still think the latencies are quite high. No offense to the Oregon team, but the last time they did a microarchitecture refresh in Prescott they increased the P4's L1 cache latency from 2 cycles in Northwood to 4 cycles in Prescott and the L2 latency from 16 cycles to 23 cycles so it's disconcerting that they've increased the L1 cache latency from 3 cycles in Penryn to the same 4 cycles in Nehalem, decreased the L2 cache size from 6MB to 256KB to only gain 4 cycles to 11 cycles, and added a 39 cycle L3 cache. I don't think latencies will improve in Westmere, but hopefully they can double the L2 cache to 512KB without increasing latencies and similarly increase the L3 cache, probably to 12MB, without increasing latencies. And maybe latencies can improve in the next microarchitecture refresh in Sandy Bridge with the return of the Israeli team.

And I also agree that the P4 could probably still have hope with the 45nm process. Even at the 65nm process, Presler still had potential. With the Pentium Extreme Edition 965, Intel had basically caught up with the power consumption of its competitor the FX-60. And things actually improved over time: if you looked at the original Presler B1 stepping, Intel was only able to reach 3GHz in the 930D at a 95W TDP, while by the last D0 stepping released after Conroe, Presler was able to reach 3.6GHz in the 960D under the same 95W TDP. Under the same process, a 20% increase in clock speed for the same power consumption is impressive for any micro-architecture, and especially Netburst.

Clearly, the 65nm process could have brought Netburst's power consumption under control, but by that time development focus had long already shifted to Merom, which is why Presler/Cedar Mill was only a shrink rather than a redesign of Prescott. I guess we'll never know what could have happened if Intel had actually used Presler to correct Prescott's flaws such as reducing cache latency, adding a 2nd instruction decoder to keep the Trace Cache and execution units fed, introducing a native dual core design like Yonah over Dothan, etc. But I think the Merom strategy was in the end better since even with a redesign to improve performance, Netburst would probably always have power consumption on the high end of acceptable, and would have never been fit for mobile usage which is where consumer focus is shifting.
IntelUser2000 - Saturday, November 8, 2008 - link
Don't complain about the lack of single thread increase. Where do you think the majority of the performance increase in Core 2 came from?? It's not a new idea, it just has better memory parallelism (memory disambiguation, excellent prefetchers).

Future IS MULTI-THREAD. Single thread brings minimal performance increase. For gamers who care, GPU does far more than CPU and multi-threading increases things in things that really matter.

Westmere isn't gonna bring large L2 caches, L3 caches will increase but that's because the core count is going to 6 cores. Sandy Bridge will bring per core L2 cache to 512KB, but how much do you think that'll do?? It's at most 5-10%.

The ways to increase x86 CPU performance are decreasing. This is the reason Sandy Bridge will bring an advanced Turbo Mode implementation for single threaded performance.
ltcommanderdata - Sunday, November 9, 2008 - link
I wasn't aware that I was complaining about single-threaded performance in my previous posts.

And another important thing that Sandy Bridge is bringing is AVX. SIMD doesn't benefit all programs, but it does increase performance of optimized applications regardless of whether they are single-threaded or multi-threaded.
SiXiam - Saturday, November 8, 2008 - link
"The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V."

- I just wanted to let everyone know that benchmarkreviews.com got the i7 920 at stock speeds with 1.125volts.

2.66 GHz @ 1.125v 133mhz x20
http://benchmarkreviews.com/index.php?option=com_c...
Denithor - Friday, November 7, 2008 - link
Great article. Very impressive results here, congrats to the i7 design team. Of course, we all said the same thing when C2D was launched, with a much bigger differential in performance/watt versus the "Netbust" architecture.

Have you guys tried F@H SMP client on these i7 chips yet? I'm curious how they stack up against the Q9xx0 series in raw performance. Do the multithreading improvements help put CPU folding any closer to GPU folding or will GPU continue to reign supreme?

Does Intel intend to launch dual-core versions of these processors or will this generation be quad only?

Finally, for myself, I have an e8400 and an e3110 which are more than adequate for my current needs. I doubt I will even bother with one of these new setups, I'll just wait until Westmere and the 32nm improvements (higher clocks, lower power, heat and probably price).
Strid - Friday, November 7, 2008 - link
Yeah, I agree. While they offer solid quad-core performance, and possibly also decent energy efficiency, they're not much use for a guy like me who doesn't use much of that multi-core jazz.
They might not chew up more watts than QX9770, but QX9770 still is a lot more hungry than even the currently quickest 45 nm dual core (E8600). Any news as to a dual-core'd version of Nehalem yet? I'll stick to my Xeon E3110 until then.
tynopik - Friday, November 7, 2008 - link
> (I will be working on a Hyper Threading/multi-tasking set of tests next).

looking forward to it!

(and then the VM tests ;)
cpugeek - Friday, November 7, 2008 - link
I think AnandTech fails to mention QPI vs FSB. QPI is super power hungry and offsets a lot of the power reduction done by Intel. That's why Lynnfield/Clarksfield will be much more power efficient, since they don't use the QPI physical layer to talk with the chipset/Tylersburg.
