Nehalem - Everything You Need to Know about Intel's New Architecture

Author: Anand Lal Shimpi

by Anand Lal Shimpi on November 3, 2008 1:00 PM EST

Integrated Memory Controller

In Nehalem’s un-core lie a number of DDR3 memory controllers, on-die and off of the motherboard - finally. The first incarnation of Nehalem will ship with a triple-channel DDR3 memory controller, meaning that DDR3 DIMMs will have to be installed in sets of three in order to get peak bandwidth. Memory vendors will begin selling Nehalem memory kits with three DIMMs just for this reason. Future versions of Nehalem will ship with only two active controllers, but at the high end and for the server market we’ll have three.

With three DDR3 memory channels, Nehalem will obviously have tons of memory bandwidth, which will help feed its wider and hungrier cores. A side effect of a tremendous increase in memory bandwidth is that Nehalem’s prefetchers can work much more aggressively.
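To put a rough number on that bandwidth, here is a minimal sketch of the peak-bandwidth arithmetic. The function name is made up for illustration, and the DDR3-1066 transfer rate is an assumption (the text above does not name a memory speed); the 8-byte width is the standard 64-bit DDR3 channel.

```python
def peak_dram_bandwidth_gbs(channels, transfers_mt_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    channels       -- number of populated memory channels
    transfers_mt_s -- transfer rate per channel in MT/s (assumed, e.g. 1066)
    bus_bytes      -- channel width in bytes (a 64-bit DDR3 channel is 8 bytes)
    """
    return channels * bus_bytes * transfers_mt_s * 1e6 / 1e9


if __name__ == "__main__":
    # Compare a dual-channel and a triple-channel setup at the assumed speed.
    for channels in (2, 3):
        gbs = peak_dram_bandwidth_gbs(channels, 1066)
        print(f"{channels} channels: {gbs:.1f} GB/s peak")
```

These are theoretical peaks; sustained bandwidth in real workloads will be lower.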

I haven’t talked about Nehalem’s server focus in a couple of pages so here we go again. With Xeon and some server workloads, Core 2’s prefetchers were a bit too aggressive so for many enterprise applications the prefetchers were actually disabled. This mostly happened with applications that had very high bandwidth utilization, where the prefetchers would kick in and actually rob the system of useful memory bandwidth.

With Nehalem the prefetcher aggressiveness can be throttled back if there’s not enough available bandwidth.
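The idea of throttling can be illustrated with a toy model. This is not Intel's actual mechanism, which isn't described here; the function name, the linear policy and the `max_degree` parameter are all hypothetical and only show the principle of prefetching less as free bandwidth shrinks.

```python
def prefetch_degree(bandwidth_utilization, max_degree=4):
    """Toy policy: how many cache lines ahead to prefetch.

    bandwidth_utilization -- fraction of memory bandwidth already in use (0.0 to 1.0)
    max_degree            -- hypothetical prefetch depth when the bus is idle

    The depth scales linearly with the remaining headroom, so a saturated
    memory bus gets no speculative traffic at all.
    """
    if not 0.0 <= bandwidth_utilization <= 1.0:
        raise ValueError("bandwidth_utilization must be between 0.0 and 1.0")
    headroom = 1.0 - bandwidth_utilization
    return int(max_degree * headroom)
```

With this policy an idle bus gets the full prefetch depth, while a bandwidth-bound server workload gets little or none, which is the behavior the disabled Core 2 prefetchers could not provide.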

QPI

When Intel made the move to an on-die memory controller it needed a high speed interconnect between chips, thus the Quick Path Interconnect (QPI) was born. I’m not sure whether QPI or Hyper Transport is the better name for this.

Each QPI link is bi-directional, supporting 6.4 GT/s in each direction. Each link is 2 bytes wide, so you get 12.8GB/s of bandwidth per link in each direction, for a total of 25.6GB/s of bandwidth on a single QPI link.
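The QPI figures follow directly from the transfer rate and link width quoted above. A minimal sketch of that arithmetic (the function name is illustrative only):

```python
def qpi_bandwidth_gbs(transfers_gt_s=6.4, link_bytes=2):
    """Return (per-direction, total) bandwidth of one QPI link in GB/s.

    transfers_gt_s -- transfers per second in each direction, in GT/s
    link_bytes     -- payload width of the link in bytes
    """
    per_direction = transfers_gt_s * link_bytes
    total = 2 * per_direction  # the link carries traffic in both directions at once
    return per_direction, total


if __name__ == "__main__":
    one_way, both_ways = qpi_bandwidth_gbs()
    print(f"{one_way:.1f} GB/s per direction, {both_ways:.1f} GB/s total")
```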

The high end Nehalem processors will have two QPI links while mainstream Nehalem chips will only have one.

The QPI aspect of Nehalem is much like HT with AMD’s processors: developers now need to worry about Intel systems being a NUMA platform. In a multi-socket Nehalem system, each socket will have its own local memory, and applications need to ensure that the processor has its data in the memory attached to it rather than in memory attached to an adjacent socket.

Here’s one area where AMD having been so much earlier with an IMC and HT really helps Intel. Much of the software work that has been done to take advantage of AMD’s architecture in the server world will now benefit Nehalem.
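One common way to keep data local on a NUMA system is to pin a process to the cores of a single socket before it allocates its working set. The sketch below assumes Linux, where each node's cores are listed in `/sys/devices/system/node/nodeN/cpulist` and pages are normally placed on the node of the core that first touches them. The helper names are hypothetical, and this is one possible approach rather than anything specific to Nehalem.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node):
    """Read the CPU ids that belong to one NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_node(node):
    """Restrict the calling process to the cores of one NUMA node (Linux only).

    Memory the process touches afterwards is then, under the default
    allocation policy, normally served from that node's local DIMMs.
    """
    os.sched_setaffinity(0, cpus_of_node(node))
```

Calling `pin_to_node(0)` early in a worker process, before it builds its data structures, is the kind of placement work that was already done for multi-socket Opteron systems.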

New Instructions

With Penryn, Intel introduced SSE4.1, and in Nehalem Intel added a few more instructions, which it is calling SSE4.2.

The future of Intel’s architectural extensions beyond Nehalem lies in the Advanced Vector Extensions (AVX), which add support for 256-bit vector operations. AVX is an intermediate step between where SSE is today and where Larrabee is going with its instruction set. At some point I suspect we may see some sort of merger between these two ISAs.

35 Comments


rflcptr - Wednesday, September 24, 2008 - link
The origin of Turbo Mode isn't Penryn, but rather Intel's DAT, first found in the Santa Rosa mobile chipset, with which both Merom and Penryn were compatible (and functioning with the tech activated).
hoohoo - Wednesday, September 10, 2008 - link
Outside of games the only area where raw performance matters to me is running high performance code - for me this is graphics processing and 3D rendering code, and it's just a hobby. I have some dealings with the HPC crowd though.

In the HPC biz memory bandwidth is a big issue. AMD has won hands down on that metric against Intel until perhaps the past six months. Nehalem server chip looks like it will beat Opteron on this metric.

Another important metric for HPC is linear algebra performance. GPUs are very good at linear algebra, but GPUs have strange programming requirements for the people who understand scientific programming - these people want to worry about the science or engineering and not about the specific cache architecture of an Nvidia or ATI GPU.

Just because I could, last winter I wrote some 2D graphics processing routines for an 8800GT+CUDA+AthlonX2-5200: gaussian blur, sharp filter, the like. I achieved on the order of 20x speed improvement on the 8800GT vs the AthlonX2 all on Linux - but it was a moderately brutal programming experience and I doubt your average researcher will do it. And, well, the PCIe bandwidth bottleneck would be a problem for large scale batch processing for such a simple calculation.

I don't know about ATI GPUs yet. I got a 3870 eight months ago and installed the AMD HPC GPU SDK ('nuff acronyms for ya?), but I can't face the pain of using it if after booting all the way into XP I could be fragging away in HL2 or Q4, or conquering the world in Civ2 Gold instead - and nobody really uses Windows servers for HPC clusters anyway. I think about writing Brook+ code on my 3870 sometimes but honestly I don't care that much. It'll be similar performance to the 8800, and it'll be *Windows* code.

If Intel can produce a chip that is somewhere between Larrabee and Nehalem, matching memory bandwidth with an easily programmed but highly parallel chip, then Intel will have an opportunity to define a new sub-market: HPC processors.

It is indicative of the deficiencies of AMD marketing that it has a good GPGPU and the only way to program it is on the one OS that HPC shies away from: Windows. Clusters mostly run on Linux or UNIX.

But, AMD is working on a CPU+GPU product that could compete in that market.

Which of AMD or Intel will realize that there is money to be made with a chip that combines 2 or 4 CPU cores + 4 or 8 GPU style linear algebra cores, all with IEEE double precision ability?

Whither the Cell?

:-)
hooflung - Monday, November 3, 2008 - link
I think you are minimizing what an operating system is. While it is true that Linux, AIX and Solaris account for a large number of HPC and cluster environments that doesn't mean Windows is poor in this regard.

There are solid options for windows HPC where infiniband is very, very solid. Microsoft helped define the spec. However, it isn't done often enough in the public's eye like Linux. Remember, AIX and Windows were the most solid platforms for J2EE for a long, long time.

Also, Windows clusters can be the better TCO solution for people. EVE-Online used Windows 2000 (now 2003 x86 and x64) and wrote their own load balancing software in Stackless (and now have their own async IO Stackless library) which holds 33k at any given time of the day.

You really just have to decide what market you are trying to reach when considering OS choices. They all can provide similar performance.
Pixy - Friday, September 5, 2008 - link
All this sounds nice... but I have a question: when will laptops become fanless? The CPU is fast enough, work on turning down the heat!
Davinchy - Tuesday, August 26, 2008 - link
I thought I read somewhere that if the other processor cores were not working then they shut down and the one that was working got more juice and overclocked. So wouldn't that suggest that for the average consumer this chip will game much faster than a Penryn?

Dunno, maybe I read it wrong.
jediknight - Sunday, August 24, 2008 - link
For desktop builders.
My aging S754 Athlon64 is dying.. so it's time to start thinking of building a new one. My laptop can only hold me out for so long, though..

Will I be able to buy a quad-core Nehalem processor in about the $250-300 range by the end of the year?
UnlimitedInternets36 - Saturday, August 23, 2008 - link
Core i7 wins big time for 3D rendering, modeling, and CAD programs.

Turbo is the best feature. I hope, at least in the Extreme Edition, we can set the Turbo headroom to like 5GHz!!! and have a totally dynamic over-clock scaling FTW!

Zbrush can utilize 256 processors, so I think a 2 socket Core i7 will help me out just fine. Sure, it doesn't automatically boost FPS in games, but that's partially a programming issue as well. Sooner or later the coding will catch up.
munyaka - Friday, August 22, 2008 - link
I have always stuck with AMD but this is the final nail in the coffin.
X1REME - Friday, August 22, 2008 - link
Is there anything you see that we don't? Please explain why.
niva - Friday, August 22, 2008 - link
Well it is another step forward for Intel while AMD is still falling farther and farther behind the times. I want to caution that at this point there is no software actually optimized to run on i7 and any potential new instructions the chips will have. Once that happens and games are patched/recompiled or new games come out to take advantage of the massive CPU/memory bandwidth i7 offers it will be lights out.

Waiting on AMD to come out with the next best thing is becoming really old. I have a Phenom system and I won't need a new one for at least another year or two, but even though I wish AMD would do better, they're just being dominated by Intel right now.
