Intel's Secret: Nehalem Can Be Very Power Efficient

I tried an experiment while I was testing Nehalem: I recorded power consumption during every single benchmark I ran for the review. I did the same for Intel's Core 2 Extreme QX9770 and compared the two. I published an abridged version of these results in the review, basically showing that the Core i7-965 offered much better power consumption, across the board, than the equivalently clocked QX9770, while the Core i7-920 was outshined by the Q9450, which drew less total system power. Both datapoints were valid, but there were too many unanswered questions to draw any serious conclusions at that point. I have met with Intel several times since the review went live, tested and retested processors, and I believe I now understand what's going on with Nehalem from a power standpoint.

All three of Intel's Core i7 CPUs that will be available at launch this month are 130W TDP parts. At 3.2GHz that's expected, but at 2.66GHz that's a bit high compared to Intel's other quad-core 2.66GHz processors on the market. The Core 2 Quad Q9450, for example, has a 95W TDP and runs at 2.66GHz. The lower TDP is made possible by a lower core voltage, which in turn is possible because Intel has been building quad-core Penryns for a while and yields are high enough that driving core voltage down is feasible. The same will eventually happen to the Core i7, but it's such a new design, such a radical departure from Intel's previous Core-based CPUs, and so early in its manufacturing life that there simply hasn't been time to get yields high enough to produce < 100W TDP 2.66GHz parts.


Multiple sample points are necessary for proper analysis...


...plus, lots of Nehalems are more fun

The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V. Power consumption of a CPU at a fixed clock speed is proportional to the square of the voltage, so whatever power efficiencies Intel has built into Nehalem, they will not outweigh a Penryn running at a lower core voltage. We'd therefore expect the Core 2 Quad Q9450 to have lower power consumption than the Core i7-920, at least today, until Intel can get a competitively low TDP 920 out on the market. But what about the i7-965?
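To put a rough number on that voltage gap, here is a minimal sketch. It assumes the textbook approximation that dynamic power scales with the square of core voltage at a fixed clock, ignores leakage, and treats everything else about the two chips as equal, which is a simplification given that they are different architectures. The function name is hypothetical; the voltages are the minimums quoted above.

```python
def relative_dynamic_power(v_new: float, v_ref: float) -> float:
    """Ratio of dynamic power at v_new versus v_ref at the same clock speed.

    Assumption: power is proportional to V^2; leakage and any difference in
    switched capacitance between the chips are ignored.
    """
    return (v_new / v_ref) ** 2


Q9450_MIN_V = 0.85     # minimum Q9450 voltage quoted in the article
I7_920_MIN_V = 1.137   # approximate minimum i7-920 voltage quoted in the article

ratio = relative_dynamic_power(I7_920_MIN_V, Q9450_MIN_V)
print(f"{I7_920_MIN_V}V vs {Q9450_MIN_V}V: about {ratio:.2f}x the dynamic power")
```

Under those assumptions the voltage difference alone works out to roughly 1.8x, which is why architectural efficiency gains struggle to close the gap at the low end of the voltage range.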

The Core 2 Extreme QX9770 has a 136W TDP, slightly higher than the 130W TDP of the Core i7-965, and both run at the same 3.2GHz frequency. This comparison gave me some very interesting data. Look at the power consumption numbers across all of the benchmarks (note that this is average system power, recorded over the entire benchmark run for each test):

CPU                          Intel Core 2 Extreme QX9770 (3.2GHz)   Intel Core i7-965 (3.2GHz)
Idle                         138.7W                                 105.5W
POV-Ray                      230.7W                                 240.4W
Cinebench (1 thread)         194.3W                                 168.3W
Cinebench (max threads)      227.6W                                 230.7W
3dsmax 9 SPECapc CPU test    220.1W                                 209.4W
x264 HD Encode Test          230.3W                                 196.2W
DivX 6.8.3                   221.7W                                 202.1W
Windows Media Encoder        249W                                   201.2W
Age of Conan                 306.2W                                 267.3W
Race Driver GRID             348.8W                                 302W
Crysis                       293.6W                                 248.5W
FarCry 2                     324.2W                                 271.9W
Fallout 3                    303.2W                                 225W

At worst, the Core i7-965 draws about the same or slightly more power than the QX9770; at best, you see a significant reduction in total system power consumption. There are only two cases where the QX9770 draws less power than the i7-965: POV-Ray and the multi-threaded Cinebench run.
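For anyone who wants to check the deltas themselves, the sketch below tabulates the percentage change in average system power going from the QX9770 to the i7-965, using the figures from the table above. The variable and function names are illustrative only.

```python
# Average system power in watts as (QX9770, Core i7-965), from the table above.
POWER_W = {
    "Idle": (138.7, 105.5),
    "POV-Ray": (230.7, 240.4),
    "Cinebench (1 thread)": (194.3, 168.3),
    "Cinebench (max threads)": (227.6, 230.7),
    "3dsmax 9 SPECapc CPU test": (220.1, 209.4),
    "x264 HD Encode Test": (230.3, 196.2),
    "DivX 6.8.3": (221.7, 202.1),
    "Windows Media Encoder": (249.0, 201.2),
    "Age of Conan": (306.2, 267.3),
    "Race Driver GRID": (348.8, 302.0),
    "Crysis": (293.6, 248.5),
    "FarCry 2": (324.2, 271.9),
    "Fallout 3": (303.2, 225.0),
}


def percent_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


# Tests where the older QX9770 system actually drew less power.
qx_wins = [name for name, (qx, i7) in POWER_W.items() if qx < i7]

for name, (qx, i7) in POWER_W.items():
    print(f"{name:28s} {percent_change(qx, i7):+6.1f}%")
print("QX9770 draws less in:", qx_wins)
```

A negative percentage means the i7-965 system drew less power in that test.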

Note that the idle power on the i7-965 is very low. One thing that must be enabled to achieve this is the QPI power management option in the X58 BIOS, which for whatever reason was disabled by default in our original review.

If you want to look at performance, here is the corresponding performance data to that power data:

CPU                          Intel Core 2 Extreme QX9770 (3.2GHz)   Intel Core i7-965 (3.2GHz)
POV-Ray                      2641 PPS                               4202 PPS
Cinebench (1 thread)         3937 CBMarks                           4475 CBMarks
Cinebench (max threads)      14065 CBMarks                          18810 CBMarks
3dsmax 9 SPECapc CPU test    13.1                                   17.6
x264 HD Encode Test          73.2 fps                               85.8 fps
DivX 6.8.3                   42.4 seconds                           32.8 seconds
Windows Media Encoder        29 seconds                             24 seconds
Age of Conan                 107.9 fps                              123 fps
Race Driver GRID             103.0 fps                              102.9 fps
Crysis                       41.7 fps                               40.5 fps
FarCry 2                     102.6 fps                              115.1 fps
Fallout 3                    77.2 fps                               83.2 fps

When the i7-965 significantly outperforms the QX9770, its power consumption is around the same - thus giving us much better performance per watt. When the i7-965 can't really outperform the QX9770, for example in some of the gaming benchmarks, the total system power consumption is much lower.
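The performance-per-watt claim can be sanity checked with a short sketch that combines a few rows from the two tables. For throughput scores it divides score by average system watts; for the timed encodes it multiplies seconds by watts to estimate energy per job. That second step assumes the reported time covers the same interval the power was averaged over, which the article does not state explicitly, so treat the energy figures as rough. All names are illustrative.

```python
# Each entry is ((QX9770 score, QX9770 watts), (i7-965 score, i7-965 watts)),
# taken from the article's performance and power tables.
THROUGHPUT = {  # higher score is better
    "POV-Ray (PPS)": ((2641, 230.7), (4202, 240.4)),
    "Cinebench max threads (CBMarks)": ((14065, 227.6), (18810, 230.7)),
    "Fallout 3 (fps)": ((77.2, 303.2), (83.2, 225.0)),
}
TIMED = {  # score is seconds, lower is better
    "DivX 6.8.3": ((42.4, 221.7), (32.8, 202.1)),
    "Windows Media Encoder": ((29, 249.0), (24, 201.2)),
}


def perf_per_watt(score: float, watts: float) -> float:
    return score / watts


def energy_joules(seconds: float, watts: float) -> float:
    # Assumption: average power applies over the whole reported run time.
    return seconds * watts


gains = {}
for name, (qx, i7) in THROUGHPUT.items():
    gains[name] = perf_per_watt(*i7) / perf_per_watt(*qx)
    print(f"{name}: i7-965 delivers {gains[name]:.2f}x the performance per watt")

savings = {}
for name, (qx, i7) in TIMED.items():
    savings[name] = 1.0 - energy_joules(*i7) / energy_joules(*qx)
    print(f"{name}: i7-965 uses {savings[name]:.0%} less energy per encode")
```

Keep in mind these are total system power figures, so the ratios describe the whole test platform rather than the CPU in isolation.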

I confirmed that I didn't have a particularly low power Core i7-965 by testing multiple chips, and Intel confirmed that my QX9770 fell within the middle of the power distribution for all QX9770s. It looks extremely probable that at the same TDP level, Nehalem has the ability to be much more power efficient than even Penryn, all without so much as a die shrink. Remember that both of these CPUs are built on the same 45nm process.


23 Comments


  • Denithor - Saturday, November 8, 2008 - link

    HT works well on i7 because of two things: software is much more multithreaded today and there have been drastic throughput & memory controller improvements in the generations from Netbust to Nehalem.

    Multithreaded applications can be accelerated hugely by pulling resources from multiple cores to work on one application (whether physical or virtual cores doesn't matter).

    HT on Netbust was like fitting a garden hose onto a fire hydrant. The data just backed up and couldn't feed through the pipe smoothly. On i7 the bandwidth and memory controller have been optimized to improve flow so the cores don't sit idle (HT basically levels the flow of work across the cores so they all stay busy).
  • TA152H - Saturday, November 8, 2008 - link

    Actually, you're probably missing the point that Nehalem is a lot wider than the Pentium 4 was. Consequently for any given clock cycle, you have more execution resources available for two threads that are probably not used, and could be with an additional thread.

    Most of the time, the data is read from the L1 cache, or, at worst, the L2 cache, so the memory throughput isn't going to be a huge problem most of the time. But, then again, the i7 has a bigger L1 cache, which probably helps as well. It's very slow though, and it makes you wonder why they shackled this processor with a very slow L1 cache (the same clock as a Pentium 4, but with much lower clock speed design). I mean, it can't clock higher than the Penryn, and the cache isn't any bigger, so does it need to be 33% slower? Power savings are nice, but not for a 33% slower L1 cache.

    Also, I'm curious why Intel gave up on the Pentium 4 before the 45 nm production. If you think about it, the drastically lower power use of this manufacturing technology would have yielded enormous improvements in clock speed (since the limitation for it was not based on transistor switching speed, but on the power/heat). I don't think there's any doubt they'd be running over 6 GHz, and with some effective tweaks (and undoing some of Prescott's damage) it might be an interesting processor. Probably not though, but I'm a little curious how it would pan out.
  • ltcommanderdata - Saturday, November 8, 2008 - link

    Yes, I think HT fits well with Nehalem because of the increased execution resources, 3 ALUs, 2 FPUs, and 3 SSE units compared to 3 ALUs and 2 FPU/SSE units in Netburst. Although I think HT serves a different purpose in each design. Netburst didn't have as much memory bandwidth and its latency was higher so HT served to hide that, while Nehalem has plenty of memory bandwidth and execution resources and HT serves to best take advantage of those resources.

    In regards to the high cache latency, I have to agree. I have yet to see an explanation of where the high L1 cache latency comes from. And the L2 cache latency is similarly unimpressive considering Dothan had a 2MB L2 cache per core with a 10 cycle latency while Nehalem's 256KB L2 cache per core has higher latency at 11 cycles. Granted that perhaps having a L3 cache forces limitations on the caches, but I still think the latencies are quite high. No offense to the Oregon team, but the last time they did a microarchitecture refresh in Prescott they increased the P4's L1 cache latency from 2 cycles in Northwood to 4 cycles in Prescott and the L2 latency from 16 cycles to 23 cycles so it's disconcerting that they've increased the L1 cache latency from 3 cycles in Penryn to the same 4 cycles in Nehalem, decreased the L2 cache size from 6MB to 256KB to only gain 4 cycles to 11 cycles, and added a 39 cycle L3 cache. I don't think latencies will improve in Westmere, but hopefully they can double the L2 cache to 512KB without increasing latencies and similarly increase the L3 cache, probably to 12MB, without increasing latencies. And maybe latencies can improve in the next microarchitecture refresh in Sandy Bridge with the return of the Israeli team.

    And I also agree that the P4 could probably still have hope with the 45nm process. Even at the 65nm process, Presler still had potential. With the Pentium Extreme Edition 965, Intel had basically caught up with the power consumption of its competitor the FX-60. And things actually improved over time: if you looked at the original Presler B1 stepping, Intel was only able to reach 3GHz in the 930D at a 95W TDP, while by the last D0 stepping released after Conroe, Presler was able to reach 3.6GHz in the 960D under the same 95W TDP. Under the same process, a 20% increase in clock speed for the same power consumption is impressive for any micro-architecture, and especially Netburst.

    Clearly, the 65nm process could have brought Netburst's power consumption under control, but by that time development focus had long since shifted to Merom, which is why Presler/Cedar Mill was only a shrink rather than a redesign of Prescott. I guess we'll never know what could have happened if Intel had actually used Presler to correct Prescott's flaws such as reducing cache latency, adding a 2nd instruction decoder to keep the Trace Cache and execution units fed, introducing a native dual core design like Yonah over Dothan, etc. But I think the Merom strategy was in the end better since even with a redesign to improve performance, Netburst would probably always have power consumption on the high end of acceptable, and would have never been fit for mobile usage which is where consumer focus is shifting.
  • IntelUser2000 - Saturday, November 8, 2008 - link

    Don't complain about the lack of single thread increase. Where do you think the majority of the performance increase in Core 2 came from?? It's not a new idea, it just has better memory parallelism (memory disambiguation, excellent prefetchers).

    Future IS MULTI-THREAD. Single thread brings minimal performance increase. For gamers who care, GPU does far more than CPU and multi-threading increases things in things that really matter.

    Westmere isn't gonna bring large L2 caches, L3 caches will increase but that's because the core count is going to 6 cores. Sandy Bridge will bring per core L2 cache to 512KB, but how much do you think that'll do?? It's at most 5-10%.

    The ways to increase x86 CPU performance are decreasing. This is the reason Sandy Bridge will bring advanced Turbo Mode implementation for single threaded performance.
  • ltcommanderdata - Sunday, November 9, 2008 - link

    I wasn't aware that I was complaining about single-threaded performance in my previous posts.

    And another important thing that Sandy Bridge is bringing is AVX. SIMD doesn't benefit all programs, but it does increase performance of optimized applications regardless of whether they are single-threaded or multi-threaded.
  • SiXiam - Saturday, November 8, 2008 - link

    "The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V."

    - I just wanted to let everyone know that benchmarkreviews.com got the i7 920 running at stock speeds with 1.125 volts.

    2.66 GHz @ 1.125V, 133MHz x 20
    http://benchmarkreviews.com/index.php?o...Itemid=6...
  • Denithor - Friday, November 7, 2008 - link

    Great article. Very impressive results here, congrats to the i7 design team. Of course, we all said the same thing when C2D was launched, with a much bigger differential in performance/watt versus the "Netbust" architecture.

    Have you guys tried F@H SMP client on these i7 chips yet? I'm curious how they stack up against the Q9xx0 series in raw performance. Do the multithreading improvements help put CPU folding any closer to GPU folding or will GPU continue to reign supreme?

    Does Intel intend to launch dual-core versions of these processors or will this generation be quad only?

    Finally, for myself, I have an e8400 and an e3110 which are more than adequate for my current needs. I doubt I will even bother with one of these new setups, I'll just wait until Westmere and the 32nm improvements (higher clocks, lower power, heat and probably price).
  • Strid - Friday, November 7, 2008 - link

    Yeah, I agree. While they offer solid quad-core performance, and possibly also decent energy efficiency, they're not much use for a guy like me who doesn't use much of that multi-core jazz.
    They might not chew up more watts than QX9770, but QX9770 still is a lot more hungry than even the currently quickest 45 nm dual core (E8600). Any news as to a dual-core'd version of Nehalem yet? I'll stick to my Xeon E3110 until then.
  • tynopik - Friday, November 7, 2008 - link

    > (I will be working on a Hyper Threading/multi-tasking set of tests next).

    looking forward to it!

    (and then the VM tests ;)
  • cpugeek - Friday, November 7, 2008 - link

    I think AnandTech failed to mention QPI vs FSB. QPI is super power hungry and offsets a lot of the power reduction done by Intel. That's why Lynnfield/Clarksfield will be much more power efficient, since they don't use the QPI physical layer to talk to the chipset/Tylersburg.
