Nehalem: The Unwritten Chapters

by Anand Lal Shimpi on 11/7/2008 12:00 AM EST



  • lemonadesoda - Wednesday, November 19, 2008 - link

    Anand. Fantastic article, but:

    1./ You didn't mention whether your tests were on 32-bit or 64-bit. We know that Core 2 is more efficient in 32-bit mode due to macro-op fusion, whereas that isn't true for 64-bit. On i7, macro-op fusion is there in 64-bit mode.

    2./ I think you should execute a CPU HALT to observe deep-down idle. This figure, say 110W, should then be SUBTRACTED from all other results. Why? Because this is essentially the mainboard/HDD/system power draw excluding the CPU. I see from your figures that the power used (as a delta from idle) on i7 is actually HIGHER than on the QX9770. So I actually have a very different view than you. I think X58 is much more efficient, and that the internal memory controller uses less power than the older northbridge. But when the i7 is crunching, it is using more power AT THE CPU than the QX9770.
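
    The idle-subtraction idea in point 2 can be sketched roughly as below. This is only an illustration: treating the idle reading as the non-CPU baseline is the commenter's simplifying assumption, and every wattage here is a made-up placeholder, not a measurement from the article.

```python
def cpu_delta_watts(load_watts, idle_watts):
    """Return the power drawn above the idle baseline, in watts."""
    if load_watts < idle_watts:
        raise ValueError("load reading is below the idle baseline")
    return load_watts - idle_watts


# Hypothetical wall-socket readings as (idle, load) pairs in watts.
# These are NOT figures from the article.
readings_w = {
    "system_a": (100.0, 230.0),
    "system_b": (120.0, 235.0),
}

deltas_w = {
    name: cpu_delta_watts(load, idle)
    for name, (idle, load) in readings_w.items()
}

for name, delta in deltas_w.items():
    print(f"{name}: {delta:.0f} W above idle")
```

    With these placeholder numbers, system_a draws less at the wall under load yet shows the larger delta from idle, which is the distinction the comment is drawing.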
  • prodystopian - Monday, November 10, 2008 - link

    While this limit is a non-issue for anyone getting an X58 motherboard, what about those looking for the e2xxx of this generation? When looking for a cheap CPU to heavily OC for extreme price/performance, it would be best to pair it with a cheap motherboard such as the next P series (not X). I'm assuming we don't know whether this BIOS switch will be on the P series motherboards, but if it is not, that is where the real problem occurs.
  • Live - Sunday, November 09, 2008 - link

    I don't know if this has been answered yet but what are the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so dose it make a difference?
  • Live - Sunday, November 09, 2008 - link

    Live I think you meant to write:

    I don't know if this has been answered yet, but what is the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so does it make a difference?
  • CEO Ballmer - Saturday, November 08, 2008 - link

    Made for Vista!
  • Rev1 - Saturday, November 08, 2008 - link

    Maybe I'm missing something, but given that the multiplier was not unlocked, how did he get it that high?
  • frazz - Saturday, November 08, 2008 - link

    Surely CPU power at a fixed voltage is proportional to the square of the voltage, not the cube? I thought the formula was this:

    Power dissipation = C.V^2.f where C is the capacitance being switched per clock cycle
  • frazz - Saturday, November 08, 2008 - link

    Sorry, I meant CPU power at a fixed FREQUENCY is proportional to the square of the voltage. D'oh.
  • HolyFire - Saturday, November 08, 2008 - link

    I agree. This surely was a misinterpretation of Intel's slide, which actually meant: If the frequency is increased proportionally to the voltage, the power will go like voltage cubed. But for a fixed frequency, power goes like voltage squared.

    In either case, I find that slide a little suspicious, as I have not yet seen any theoretical or experimental result suggesting that frequency should be linearly proportional to voltage.
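
    The two scaling cases discussed in this sub-thread can be checked numerically with the P = C.V^2.f formula quoted above. This is a minimal sketch: the capacitance, voltage, frequency, and the 20% voltage bump are arbitrary placeholder values, and the second case simply assumes frequency rises in proportion to voltage (the very assumption being questioned here).

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic switching power P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hz


# Placeholder operating point (hypothetical values, not from any datasheet).
C = 1e-9    # capacitance switched per clock cycle, in farads
V0 = 1.0    # baseline voltage, in volts
F0 = 3.0e9  # baseline frequency, in hertz

SCALE = 1.2  # raise the voltage by 20%
base = dynamic_power(C, V0, F0)

# Case 1: frequency held fixed, so power grows with the square of the voltage.
fixed_freq_ratio = dynamic_power(C, V0 * SCALE, F0) / base

# Case 2: frequency assumed to rise in proportion to voltage, so power
# grows with the cube of the voltage.
scaled_freq_ratio = dynamic_power(C, V0 * SCALE, F0 * SCALE) / base

print(f"fixed frequency:  {fixed_freq_ratio:.3f}x")
print(f"scaled frequency: {scaled_freq_ratio:.3f}x")
```

    The ratios come out at roughly 1.2 squared and 1.2 cubed, matching the squared and cubed readings described above.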
  • ltcommanderdata - Friday, November 07, 2008 - link

    Great article. It's nice to see someone do a more in depth analysis of Nehalem's characteristics rather than just printing a bunch of benchmarks.

    In regards to your Hyperthreading tests, it might be interesting to isolate the causes of HT performance increases in Nehalem. HT quite often was a hindrance for Netburst and it would be interesting to see whether the cause was primarily HT's implementation in Netburst or just due to the maturity of HT-compatible software at the time. It's an odd coincidence that the last processor to carry HT, besides Atom, was the Pentium Extreme Edition 965 while the first desktop processor to reintroduce HT is again numbered 965 as part of the Core i7 family.

    For instance, you could try to compare the speedup that 965EE receives going from 2 to 4 threads against the i7-965 doing the same. It would also be interesting to see if HT's performance delta improves going from Windows XP to Windows Vista, which would imply that Vista's scheduler is smarter about dispatching tasks to logical cores that don't share resources.

    And in regards to mobile Nehalem, I agree that the power consumption improvements could really benefit notebooks, but it's kind of curious that Nehalem won't come to notebooks until Q3 2009. I believe previous Core 2 rollouts for Merom and Penryn were pretty fast, like a quarter spread between the desktop, notebook, and UP/DP server markets, but this looks to be a 3 quarter spread. I wonder what the delay is? With a Q3 2009 mobile Nehalem launch, they might as well just wait a quarter and do a strong roll out of Westmere on mobile first.
  • Denithor - Saturday, November 08, 2008 - link

    HT works well on i7 because of two things: software is much more multithreaded today and there have been drastic throughput & memory controller improvements in the generations from Netbust to Nehalem.

    Multithreaded applications can be accelerated hugely by pulling resources from multiple cores to work on one application (whether physical or virtual cores doesn't matter).

    HT on Netbust was like fitting a garden hose onto a fire hydrant. The data just backed up and couldn't feed through the pipe smoothly. On i7 the bandwidth and memory controller have been optimized to improve flow so the cores don't sit idle (HT basically levels the flow of work across the cores so they all stay busy).
  • TA152H - Saturday, November 08, 2008 - link

    Actually, you're probably missing the point that Nehalem is a lot wider than the Pentium 4 was. Consequently, in any given clock cycle, you have more execution resources that a single thread probably leaves unused and that an additional thread could put to work.

    Most of the time, the data is read from the L1 cache or, at worst, the L2 cache, so memory throughput isn't going to be a huge problem. But, then again, the i7 has a bigger L1 cache, which probably helps as well. It's very slow though, and it makes you wonder why they shackled this processor with a very slow L1 cache (the same latency in clocks as a Pentium 4, but in a design with much lower clock speeds). I mean, it can't clock higher than Penryn, and the cache isn't any bigger, so does it need to be 33% slower? Power savings are nice, but not for a 33% slower L1 cache.

    Also, I'm curious why Intel gave up on the Pentium 4 before the 45 nm production. If you think about it, the drastically lower power use of this manufacturing technology would have yielded enormous improvements in clock speed (since the limitation for it was not based on transistor switching speed, but on the power/heat). I don't think there's any doubt they'd be running over 6 GHz, and with some effective tweaks (and undoing some of Prescott's damage) it might be an interesting processor. Probably not though, but I'm a little curious how it would pan out.
  • ltcommanderdata - Saturday, November 08, 2008 - link

    Yes, I think HT fits well with Nehalem because of the increased execution resources: 3 ALUs, 2 FPUs, and 3 SSE units compared to 3 ALUs and 2 FPU/SSE units in Netburst. Although I think HT serves a different purpose in each design. Netburst didn't have as much memory bandwidth and its latency was higher so HT served to hide that, while Nehalem has plenty of memory bandwidth and execution resources and HT serves to best take advantage of those resources.

    In regards to the high cache latency, I have to agree. I have yet to see an explanation of where the high L1 cache latency comes from. And the L2 cache latency is similarly unimpressive considering Dothan had a 2MB L2 cache per core with a 10 cycle latency while Nehalem's 256KB L2 cache per core has higher latency at 11 cycles. Granted, perhaps having an L3 cache forces limitations on the other caches, but I still think the latencies are quite high. No offense to the Oregon team, but the last time they did a microarchitecture refresh in Prescott they increased the P4's L1 cache latency from 2 cycles in Northwood to 4 cycles in Prescott and the L2 latency from 16 cycles to 23 cycles. So it's disconcerting that they've increased the L1 cache latency from 3 cycles in Penryn to the same 4 cycles in Nehalem, decreased the L2 cache size from 6MB to 256KB only to shave 4 cycles off the latency (down to 11 cycles), and added a 39 cycle L3 cache. I don't think latencies will improve in Westmere, but hopefully they can double the L2 cache to 512KB without increasing latencies and similarly increase the L3 cache, probably to 12MB, without increasing latencies. And maybe latencies can improve in the next microarchitecture refresh in Sandy Bridge with the return of the Israeli team.
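
    For a sense of scale, the latency changes quoted in this comment can be turned into relative increases. This is a small sketch that takes the cycle counts exactly as stated above and does not verify them; it compares cycle counts only and ignores clock speed differences between the designs.

```python
def percent_increase(old_cycles, new_cycles):
    """Relative change from old_cycles to new_cycles, in percent."""
    return (new_cycles - old_cycles) / old_cycles * 100.0


# Cycle counts as quoted in the comment above (old, new).
latency_changes = {
    "L1, Northwood -> Prescott": (2, 4),
    "L2, Northwood -> Prescott": (16, 23),
    "L1, Penryn -> Nehalem": (3, 4),
}

for label, (old, new) in latency_changes.items():
    print(f"{label}: {percent_increase(old, new):.1f}% more cycles")
```

    The Penryn-to-Nehalem L1 step works out to about a third more cycles, which is where the "33% slower" figure earlier in the thread comes from.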

    And I also agree that the P4 could probably still have hope with the 45nm process. Even at the 65nm process, Presler still had potential. With the Pentium Extreme Edition 965, Intel had basically caught up with the power consumption of its competitor the FX-60. And things actually improved over time: if you looked at the original Presler B1 stepping, Intel was only able to reach 3GHz in the 930D at a 95W TDP, while by the last D0 stepping released after Conroe, Presler was able to reach 3.6GHz in the 960D under the same 95W TDP. Under the same process, a 20% increase in clock speed for the same power consumption is impressive for any micro-architecture, and especially Netburst.

    Clearly, the 65nm process could have brought Netburst's power consumption under control, but by that time development focus had long since shifted to Merom, which is why Presler/Cedar Mill was only a shrink rather than a redesign of Prescott. I guess we'll never know what could have happened if Intel had actually used Presler to correct Prescott's flaws such as reducing cache latency, adding a 2nd instruction decoder to keep the Trace Cache and execution units fed, introducing a native dual core design like Yonah over Dothan, etc. But I think the Merom strategy was in the end better since even with a redesign to improve performance, Netburst would probably always have had power consumption on the high end of acceptable, and would never have been fit for mobile usage, which is where consumer focus is shifting.
  • IntelUser2000 - Saturday, November 08, 2008 - link

    Don't complain about the lack of single thread increase. Where do you think the majority of the performance increase in Core 2 came from?? It's not a new idea, it just has better memory parallelism (memory disambiguation, excellent prefetchers).

    The future IS MULTI-THREADED. Single thread brings minimal performance increase. For gamers who care, the GPU does far more than the CPU, and multi-threading improves performance in the things that really matter.

    Westmere isn't gonna bring large L2 caches; L3 caches will increase, but that's because the core count is going to 6 cores. Sandy Bridge will bring per core L2 cache to 512KB, but how much do you think that'll do?? It's at most 5-10%.

    The ways to increase x86 CPU performance are decreasing. This is the reason Sandy Bridge will bring an advanced Turbo Mode implementation for single threaded performance.
  • ltcommanderdata - Sunday, November 09, 2008 - link

    I wasn't aware that I was complaining about single-threaded performance in my previous posts.

    And another important thing that Sandy Bridge is bringing is AVX. SIMD doesn't benefit all programs, but it does increase performance of optimized applications regardless of whether they are single-threaded or multi-threaded.
  • SiXiam - Saturday, November 08, 2008 - link

    "The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V."

    - I just wanted to let everyone know that I got the i7 920 running at stock speeds with 1.125 volts.

    2.66 GHz @ 1.125 V, 133 MHz x 20
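
    The "133 MHz x 20" line above is just base clock times multiplier. A minimal sketch of that arithmetic follows; the 133 and 20 come from this comment, while the 160 MHz base clock is a made-up value used only to show how the core clock moves when the multiplier stays fixed.

```python
def core_clock_mhz(base_clock_mhz, multiplier):
    """Core clock is the base clock multiplied by the CPU multiplier."""
    return base_clock_mhz * multiplier


# Values taken from the comment: 133 MHz base clock, x20 multiplier.
stock_mhz = core_clock_mhz(133, 20)

# Hypothetical overclock: multiplier left at 20, base clock raised to 160 MHz.
overclocked_mhz = core_clock_mhz(160, 20)

print(f"stock:       {stock_mhz / 1000:.2f} GHz")
print(f"overclocked: {overclocked_mhz / 1000:.2f} GHz")
```

    This also suggests an answer to the earlier question about a locked multiplier: the core clock can still rise if the base clock is raised.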
  • Denithor - Friday, November 07, 2008 - link

    Great article. Very impressive results here, congrats to the i7 design team. Of course, we all said the same thing when C2D was launched, with a much bigger differential in performance/watt versus the "Netbust" architecture.

    Have you guys tried F@H SMP client on these i7 chips yet? I'm curious how they stack up against the Q9xx0 series in raw performance. Do the multithreading improvements help put CPU folding any closer to GPU folding or will GPU continue to reign supreme?

    Does Intel intend to launch dual-core versions of these processors or will this generation be quad only?

    Finally, for myself, I have an e8400 and an e3110 which are more than adequate for my current needs. I doubt I will even bother with one of these new setups, I'll just wait until Westmere and the 32nm improvements (higher clocks, lower power, heat and probably price).
  • Strid - Friday, November 07, 2008 - link

    Yeah, I agree. While they offer solid quad-core performance, and possibly decent energy efficiency too, they're not much use for a guy like me who doesn't use much of that multi-core jazz.
    They might not chew up more watts than the QX9770, but the QX9770 is still a lot hungrier than even the currently quickest 45 nm dual core (E8600). Any news as to a dual-core'd version of Nehalem yet? I'll stick to my Xeon E3110 until then.
  • tynopik - Friday, November 07, 2008 - link

    > (I will be working on a Hyper Threading/multi-tasking set of tests next).

    looking forward to it!

    (and then the VM tests ;)
  • cpugeek - Friday, November 07, 2008 - link

    I think AnandTech failed to mention QPI vs. FSB. QPI is super power hungry and offsets a lot of the power reduction achieved by Intel. That's why Lynnfield/Clarksfield will be much more power efficient, since they don't use the QPI physical layer to talk to the chipset/Tylersburg.
  • npp - Friday, November 07, 2008 - link

    "Intel has done nothing to limit overclocking with the Core i7" :)
    There was such a huge anti-campaign going on everywhere towards Core i7 overclocking that it seems almost funny to hear that now. I just couldn't imagine how on earth Intel would ditch one of the sweetest things in the geek world just for fun... They weren't so stupid, fortunately.

    I would be very interested in some idle/full load temps, particularly for the junior model, at stock speeds and overclocked to some reasonable 24/7 level. It would be interesting to see how much they differ from the temps we're used to seeing right now with the good old Core 2 Duos/Quads.
  • Mclendo06 - Friday, November 07, 2008 - link

    Does anyone know what the contact pads on the top edges of the processor are for? I've wondered this for a while but a quick Google search only yielded questions. Also, Anand, thanks for the great coverage.
  • Clauzii - Saturday, November 08, 2008 - link

    Good question. Probably used for final testing/burn in.
