Integrated Memory Controller

In Nehalem’s un-core lies a number of DDR3 memory controllers, on-die and off of the motherboard - finally. The first incarnation of Nehalem will ship with a triple-channel DDR3 memory controller, meaning that DDR3 DIMMs will have to be installed in sets of three in order to get peak bandwidth. Memory vendors will begin selling Nehalem memory kits with three DIMMs just for this reason. Future versions of Nehalem will ship with only two active controllers, but at the high end and for the server market we’ll have three.

With three DDR3 memory channels, Nehalem will obviously have tons of memory bandwidth, which will help feed its wider and hungrier cores. A side effect of a tremendous increase in memory bandwidth is that Nehalem’s prefetchers can work much more aggressively.

I haven’t talked about Nehalem’s server focus in a couple of pages so here we go again. With Xeon and some server workloads, Core 2’s prefetchers were a bit too aggressive so for many enterprise applications the prefetchers were actually disabled. This mostly happened with applications that had very high bandwidth utilization, where the prefetchers would kick in and actually rob the system of useful memory bandwidth.

With Nehalem the prefetcher aggressiveness can be throttled back if there’s not enough available bandwidth.


When Intel made the move to an on-die memory controller it needed a high speed interconnect between chips, thus the Quick Path Interconnect (QPI) was born. I’m not sure whether or not QPI or Hyper Transport is a better name for this.

Each QPI link is bi-directional supporting 6.4 GT/s per link. Each link is 2-bytes wide so you get 12.8GB/s of bandwidth per link in each direction, for a total of 25.6GB/s of bandwidth on a single QPI link.

The high end Nehalem processors will have two QPI links while mainstream Nehalem chips will only have one.

The QPI aspect of Nehalem is much like HT with AMD’s processors, now developers need to worry about Intel systems being a NUMA platform. In a multi-socket Nehalem system, each socket will have its own local memory and applications need to ensure that the processor has its data in the memory attached to it rather than memory attached to an adjacent socket.

Here’s one area where AMD having being so much earlier with an IMC and HT really helps Intel. Much of the software work that has been done to take advantage of AMD’s architecture in the server world will now benefit Nehalem.

New Instructions

With Penryn, Intel extended the SSE4 instruction set to SSE4.1 and in Nehalem Intel added a few more instructions which Intel is calling SSE4.2.

The future of Intel’s architectural extensions beyond Nehalem lie in the Advanced Vector Extensions (AVX), which add support for 256-bit vector operations. AVX is an intermediate step between where SSE is today and where Larrabee is going with its instruction set. At some point I suspect we may see some sort of merger between these two ISAs.


Further Power Managed Cache? New Stuff: Power Management


View All Comments

  • retardliquidator - Thursday, August 21, 2008 - link

    ... think again.

    more luck next time before starting the flamebait about not two bytes wide but 20bits.

    effective usable speed is exactly 2bytes, as with 10/8 coding you need 20bits to encode your 16 relevant ones.

    you fail at failing.
  • defter - Friday, August 22, 2008 - link

    Links are 20-bit wide, regardless of encoding or whether 1,2,8,16 or 20 bits are used to tranmist the data.

    I wonder who is flamebaiting here, a previous poster just mentioned the correct link width, he wasn't talking about "usable speed".
  • rbadger - Thursday, August 21, 2008 - link

    "Each QPI link is bi-directional supporting 6.4 GT/s per link. Each link is 2-bytes wide..."

    This is actually incorrect. Each link is 20 bits wide, not 16 (2 bytes). This information is on the slide posted directly below the paragraph.
  • JarredWalton - Thursday, August 21, 2008 - link

    It's 20-bits but using a standard 8/10 encoding mechanism, so of the 20 bits only 16 are used to transmit data and the other four bits are (I believe) for clock signaling and/or error correction. It's the same thing we see with SATA and HyperTransport. Reply
  • ltcommanderdata - Thursday, August 21, 2008 - link

    Since the PCU has a firmware, I wonder if it will be updatable? It would be useful if lessons learn in the power management logic of later steppings and in Westmere can be brought back to all Nehalems through a firmware update for lower power consumption or even better performance with better Turbo mode application. Although a failed or corrupt firmware update on a CPU could be very problematic. Reply
  • wingless - Thursday, August 21, 2008 - link

    I thought about this when I read about it the first time too. Flashing your CPU could kill the power management or the whole CPU in one fell swoop! Reply

Log in

Don't have an account? Sign up now