Oooh, Shiny - But Why?

Remember this slide?

How about this one?

I referenced both in the Core i7 review, alluding to the possibility that those fundamental design changes would give the Core i7 much better power efficiency than Core 2. However, in speaking to Intel's Nehalem architects and power engineers, I realized that those design changes alone aren't responsible for the sorts of power efficiency gains I showed on the previous page. If you treat maximum power consumption as a hard limit, for example the 130W TDP, Nehalem's designers have to somehow - without the benefits of a die shrink - improve performance without increasing power.

Since Core i7 is a "tock" processor, you get the new architecture but not the process-shrink benefits of Moore's law; it's still a 45nm chip. With no help from the manufacturing process, Nehalem's architects must find ways to save power and then spend those savings on improving performance. Switching to an all-static CMOS design and a more power efficient cache are two examples of ways that the Nehalem architects won themselves a bigger power budget without increasing the total TDP of the chip. The architects then promptly spent their power savings on more performance; since the market has already accepted a 130W TDP part, delivering lower power with no additional performance wouldn't make any sense. That is why we're able to see these 20 - 60% increases in performance without correspondingly large increases in power consumption.

So why then is the Core i7-965 so much more power efficient than the QX9770? The answer actually boils down to the architectural level decisions made in Nehalem. Remember the power gate transistors?

With these transistors Intel can effectively shut off an entire core when it is idle, so that it stops being a power drain. At the same TDP, for applications that don't use all four cores, Intel's Core i7 should draw less power than the quad-core Core 2 before it, and we see this in the single-threaded Cinebench test as well as the gaming tests:

Total System Power   | Intel Core 2 Extreme QX9770 (3.2GHz) | Intel Core i7-965 (3.2GHz)
Idle                 | 138.7W                               | 105.5W
Cinebench (1 thread) | 194.3W                               | 168.3W
Age of Conan         | 306.2W                               | 267.3W
Race Driver GRID     | 348.8W                               | 302W
Crysis               | 293.6W                               | 248.5W
FarCry 2             | 324.2W                               | 271.9W
Fallout 3            | 303.2W                               | 225W

The Cinebench test is single-threaded, so only one core is active at any time, and only a few of the gaming tests can keep all four cores busy. With its idle cores power gated, the Core i7 can be far more power efficient than Intel's Core 2 Extreme QX9770.

But what about in the multi-threaded tests (or the gaming tests like FarCry 2 that actually stress all four cores)? Here, at worst, the Core i7 draws about the same amount of power as the Core 2 despite offering much better performance. In these situations we get a combination of things benefitting Nehalem. The memory controller is on-die and built on a 45nm process, instead of 90nm like on the QX9770's X48 chipset, which gives Nehalem an edge. The transistor design decisions, while mostly spent on increasing performance, can have an impact on power consumption here as well. Nehalem also has fewer transistors and a smaller cache, the majority of which runs slower than the cache in Penryn.

The sum of all of this is that at the same TDP value, with fewer than four cores fully active, Intel's Core i7 is capable of drawing a good 10 - 20% less total system power than the previous generation 45nm Core 2. With all cores pegged at 100%, the Core i7 tends to draw the same amount of power or a bit more, but performance is improved significantly in those cases thanks to Hyper-Threading.
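
To check that figure against the table above, here is a minimal Python sketch that computes the percentage reduction in total system power for each test. The wattages are copied from the table; the variable and function names are purely illustrative.

```python
# Total system power in watts, copied from the table above:
# (Core 2 Extreme QX9770, Core i7-965)
measurements = {
    "Idle": (138.7, 105.5),
    "Cinebench (1 thread)": (194.3, 168.3),
    "Age of Conan": (306.2, 267.3),
    "Race Driver GRID": (348.8, 302.0),
    "Crysis": (293.6, 248.5),
    "FarCry 2": (324.2, 271.9),
    "Fallout 3": (303.2, 225.0),
}


def percent_saved(baseline_watts, new_watts):
    """Return how much less power new_watts is than baseline_watts, in percent."""
    return (baseline_watts - new_watts) / baseline_watts * 100.0


# Percentage saved by the Core i7-965 relative to the QX9770, per test.
savings = {test: percent_saved(qx, i7) for test, (qx, i7) in measurements.items()}

for test, pct in savings.items():
    print(f"{test:22s} {pct:5.1f}% less system power on the Core i7-965")
```

Cinebench and most of the games land between roughly 12% and 17%, while idle and Fallout 3 show larger gaps of about 24% and 26%.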

It's interesting but not surprising that the Core i7's power story mimics its performance one: well threaded applications show huge improvements in power efficiency, but the unexpected benefit is that not-so-well-threaded applications can also showcase Core i7's more efficient power usage.


23 Comments


  • lemonadesoda - Wednesday, November 19, 2008 - link

    Anand. Fantastic article, but:

    1./ You didn't mention whether your tests were on 32bit or 64bit. We know that 32bit Core 2 is more efficient due to microcode fusion, whereas that isn't true for 64bit. On i7, opcode fusion is there on 64bit.

    2./ I think you should execute a CPU HALT to observe deep down idle. This figure, say 110W, should then be SUBTRACTED from all other results. Why? Because this is essentially the mainboard/HDD/system power draw excluding the CPU. I see from your figures that the power used (as a delta from idle) on i7 is actually HIGHER than QX9770. So I actually have a very different view than you. I think x58 is much more efficient, and that internal memory controller is less power than older northbridge. But when the i7 is crunching, it is using more power AT THE CPU than the QX9770.
  • prodystopian - Monday, November 10, 2008 - link

    While this limit is a non-issue for anyone getting a X58 motherboard, what about those looking for the e2xxx of this generation? When looking for a cheap CPU to heavily OC to get an extreme Price/performance, it would be best to pair with a cheap motherboard such as the next P series (not X). I'm assuming we don't know whether this BIOS switch will be on the P series motherboards, but if it is not, that is where the real problem occurs.
  • Live - Sunday, November 9, 2008 - link

    I don't know if this has been answered yet but what are the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so dose it make a difference?
  • Live - Sunday, November 9, 2008 - link

    Live I think you meant to write:

    I don't know if this has been answered yet, but what is the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so does it make a difference?
  • CEO Ballmer - Saturday, November 8, 2008 - link

    Made for Vista!

    http://fakesteveballmer.blogspot.com
  • Rev1 - Saturday, November 8, 2008 - link

    Maybe I'm missing something, but given that the multiplier was not unlocked, how did he get it that high?
  • frazz - Saturday, November 8, 2008 - link

    Surely CPU power at a fixed voltage is proportional to the square of the voltage, not the cube? I thought the formula was this:

    Power dissipation = C.V^2.f where C is the capacitance being switched per clock cycle
  • frazz - Saturday, November 8, 2008 - link

    Sorry I meant CPU power at a fixed FREQUENCY is proportional to the square of the voltage. D'oh.
  • HolyFire - Saturday, November 8, 2008 - link

    I agree. This surely was a misinterpretation of Intel's slide, which actually meant: If the frequency is increased proportionally to the voltage, the power will go like voltage cubed. But for a fixed frequency, power goes like voltage squared.

    In either case, I find that slide a little suspicious, as I have not yet seen any theoretical or experimental result suggesting that frequency should be linearly proportional to voltage.
  • ltcommanderdata - Friday, November 7, 2008 - link

    Great article. It's nice to see someone do a more in depth analysis of Nehalem's characteristics rather than just printing a bunch of benchmarks.

    In regards to your Hyperthreading tests, it might be interesting to isolate the causes of HT performance increases in Nehalem. HT quite often was a hindrance for Netburst and it would be interesting to see whether the cause was primarily HT's implementation in Netburst or just due to the maturity of HT compatible software at the time. It's an odd coincidence that the last processor to carry HT, besides Atom, was the Pentium Extreme Edition 965 while the first desktop processor to reintroduce HT is again numbered 965 as part of the Core i7 family.

    For instance, you could try to compare the speedup that 965EE receives going from 2 to 4 threads against the i7-965 doing the same. It would also be interesting to see if HT's performance delta improves going from Windows XP to Windows Vista, which would imply that Vista's scheduler is smarter about dispatching tasks to logical cores that don't share resources.

    And in regards to mobile Nehalem, I agree that the power consumption improvements could really benefit notebooks, but it's kind of curious that Nehalem won't come to notebooks until Q3 2009. I believe previous Core 2 rollouts for Merom and Penryn were pretty fast, like a quarter spread between the desktop, notebook, and UP/DP server markets, but this looks to be a 3 quarter spread. I wonder what the delay is? With a Q3 2009 mobile Nehalem launch, they might as well just wait a quarter and do a strong roll out of Westmere on mobile first.
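
The voltage scaling debated in the comments by frazz and HolyFire above can be sketched numerically. This assumes the textbook dynamic power relation P = C.V^2.f quoted by frazz and, for the second case, that frequency is raised in proportion to voltage, which is exactly the assumption HolyFire questions. The function name and the normalized values are illustrative, not measurements.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Textbook CMOS dynamic power: P = C * V^2 * f (leakage ignored)."""
    return capacitance * voltage ** 2 * frequency


# Arbitrary normalized baseline; only the ratios matter.
C, V, F = 1.0, 1.0, 1.0
baseline = dynamic_power(C, V, F)

scale = 1.1  # a 10% voltage increase

# Case 1: fixed frequency, so power grows with the square of voltage.
fixed_freq_ratio = dynamic_power(C, V * scale, F) / baseline

# Case 2: frequency raised in proportion to voltage (an assumption),
# so power grows with the cube of voltage.
scaled_freq_ratio = dynamic_power(C, V * scale, F * scale) / baseline

print(f"fixed frequency:  {fixed_freq_ratio:.3f}x power")
print(f"scaled frequency: {scaled_freq_ratio:.3f}x power")
```

Under these assumptions a 10% voltage bump costs about 21% more power at a fixed clock and about 33% more if the clock rises with it.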
