Understanding Nehalem's Memory Architecture

Nehalem does spice things up a bit in the memory department, not only does it have an integrated memory controller (a first for an x86 Intel CPU) but the memory controller in question has an unusual three-channel configuration. All other AMD and Intel systems use dual channel DDR2 or DDR3 memory controllers; with each channel being 64-bits wide, you have to install memory in pairs for peak performance.

With a three-channel DDR3 memory controller, Nehalem requires the use of three DDR3 modules to achieve peak bandwidth - which also means that the memory manufacturers are going to be selling special 3-channel DDR3 kits made specifically for Nehalem. Motherboard makers will be doing one of two things to implement Nehalem's three-channel memory interface on boards; you'll either see boards with four DIMM slots or boards with six:


Four DDR3 slots, three DDR3 channels

In the four-slot configuration the first three slots correspond to the first three channels, the fourth slot is simply sharing one of the memory channels. The downside to this approach is that your memory bandwidth drops to single-channel performance as you start filling up your memory. For example, if you have 4 x 1GB sticks, the first 3GB of memory will be interleaved between the three memory channels and you'll get 25.6GB/s of bandwidth to data stored in the first 3GB. The final 1GB however won't be interleaved and you'll only get 8.5GB/s of bandwidth to it. Despite the unbalanced nature of memory bandwidth in this case, your aggregate bandwidth is still greater in this configuration than a dual-channel setup.

 


Six DDR3 slots, two slots per DDR3 channel

The more common arrangement will be six DIMM slots where each DDR3 channel is connected to a pair of DIMM slots. In this configuration as long as you install DIMMs in triplicate you'll always get the full 25.6GB/s of memory bandwidth.

That discussion is entirely theoretical however, the real question is: does Nehalem's triple-channel memory controller actually matter or would two channels suffice? I suspect that Hyper Threading simply improved Nehalem's efficiency not necessarily its need for more data. The three-channel memory controller is probably far more important for servers and will be especially useful in the upcoming 8-core version of Nehalem due out sometime next year. To find out we simply benchmarked Nehalem in a handful of applications with a 4GB/dual channel configuration and a 6GB/triple-channel configuration. Note that none of these tests actually used more than 4GB of memory so the size difference doesn't matter, we kept memory timings the same between all tests.

  Dual Channel DDR3-1066 (9-9-9-20) Triple Channel DDR3-1066 (9-9-9-20)
Memory Tests - Everest v1547    
Read Bandwidth 12859 MB/s 13423 MB/s
Write Bandwidth 12410 MB/s 12401 MB/s
Copy Bandwidth 16474 MB/s 18074 MB/s
Latency 37.2 ns 44.2 ns
Cinebench R10 (Multi-threaded test) 18499 18458
x264 HD Encoding Test (First Pass / Second Pass) 83.8 fps / 30.3 fps 85.3 fps / 30.3 fps
WinRAR 3.80 - 602MB Folder 118 seconds 117 seconds
PCMark Vantage 7438 7490
Vantage - Memories 6753 6712
Vantage - TV and Movies 5601 5637
Vantage - Gaming 10202 9849
Vantage - Music 5378 4593
Vantage - Communications 6671 6422
Vantage - Productivity 7589 7676
WinRAR (Built in Benchmark) 3283 3306
Nero Recode - Office Space - 7.55GB 131 seconds 130 seconds
SuperPI - 32M (mins:seconds) 11:55 11:52
Far Cry 2 - Ranch Medium (1680 x 1050) 62.1 fps 62.4 fps
Age of Conan - 1680 x 1050 51.5 fps 51.1 fps
Company of Heroes - 1680 x 1050 136.6 fps 133.6 fps

 

At DDR3-1066 speeds we found no real performance difference between the Core i7-965 running in two channel vs. three channel mode, the added bandwidth is simply not useful for most desktop applications. For some reason we were able to get better latency scores on the dual-channel configuration, but there's a good chance that may be due to the early nature of BIOSes on these boards. In benchmarks were the latency difference was noticeable we saw the dual-channel configuration pull ahead slightly, then in other tests where the added bandwidth helped we saw the triple-channel configuration do better. Honestly, it's mostly a wash between the two.

Our recommendation would be to stick with three channels, but if you have existing memory and can't populate the third channel yet it's not a huge deal, really, two is fine here for the time being.

Nehalem's Weakness: Cache What about the Impact of DDR3 Speeds?
POST A COMMENT

73 Comments

View All Comments

  • npp - Tuesday, November 4, 2008 - link

    Well, the funny thing is THG got it all messed up, again - they posted a large "CRIPPLED OVERCKLOCKING" article yesterday, and today I saw a kind of apology from them - they seem to have overlooked a simple BIOS switch that prevents the load through the CPU from rising above 100A. Having a month to prepare the launch article, they didn't even bother to tweak the BIOS a bit. That's why I'm not taking their articles seriously, not because they are biased towards Intel ot AMD - they are simply not up to the standars (especially those here @anandtech). Reply
  • gvaley - Tuesday, November 4, 2008 - link

    Now give us those 64-bit benchmarks. We already knew that Core i7 will be faster than Core 2, we even knew how much faster.
    Now, it was expected that 64-bit performance will be better on Core i7 that on Core 2. Is that true? Draw a parallel between the following:

    Performance jump from 32- to 64-bit on Core 2
    vs.
    Performance jump from 32- to 64-bit on Core i7
    vs.
    Performance jump from 32- to 64-bit on Phenom
    Reply
  • badboy4dee - Tuesday, November 4, 2008 - link

    and what's those numbers on the charts there? Are they frames per second? high is better then if thats what they are. Charts need more detail or explanation to them dude!

    TSM
    Reply
  • MarchTheMonth - Tuesday, November 4, 2008 - link

    I don't believe I saw this anywhere else, but the spots for the cooler on the Mobo, they the same as like the LGA 775, i.e. can we use (non-Intel) coolers that exist now for the new socket? Reply
  • marc1000 - Tuesday, November 4, 2008 - link

    no, the new socket is different. the holes are 80mm far from each other, on socket 775 it was 72mm away. Reply
  • Agitated - Tuesday, November 4, 2008 - link

    Any info on whether these parts provide an improvement on virtualized workloads or maybe what the various vm companies have planned for optimizing their current software for nehalem? Reply
  • yyrkoon - Tuesday, November 4, 2008 - link

    Either I am not reading things correctly, or the 130W TDP does not look promising for the end user such as myself that requires/wants a low powered high performance CPU.

    The future in my book is using less power, not more, and Intel does not right now seem to be going in this direction. To top things off, the performance increase does not seem to be enough to justify this power increase.

    Being completely off grid(100% solar / wind power), there seem to be very few options . . . I would like to see this change. Right now as it stands, sticking with the older architecture seems to make more sense.
    Reply
  • 3DoubleD - Tuesday, November 4, 2008 - link

    130W TDP isn't much worse for previous generations of quad core processors which were ~100W TDP. Also, TDP isn't a measure of power usage, but of the required thermal dissipation of a system to maintain an operating temperature below an set value (eg. Tjmax). So if Tjmax is lower for i7 processors than it is for past quad cores, it may use the same amount of power, but have a higher TDP requirement. The article indicates that power draw has increased, but usually with a large increase in performance. Page 9 of the article has determined that this chip has a greater performance/watt than its predecessors by a significant margin.

    If you are looking for something that is extremely low power, you shouldn't be looking at a quad core processor. Go buy a laptop (or an EeePC-type laptop with an Atom processor). Intel has kept true to its promise of 2% performance increase for every 1% power increase (eg. a higher performance per watt value).

    Also, you would probably save more power overall if you just hibernate your computer when you aren't using it.
    Reply
  • Comdrpopnfresh - Monday, November 3, 2008 - link

    Do differing cores have access to another's L2? Is it directly, through QPI, or through L3?
    Also, is the L2 inclusive in the L3; does the L3 contain the L2 data?
    Reply
  • xipo - Monday, November 3, 2008 - link

    I know games are not the strong area of nehalem, but there are 2 games i'd like to see tested. Unreal T. 3 and Half Life 2 E2.. just to know how does nehalem handles those 2 engines ;D Reply

Log in

Don't have an account? Sign up now