Understanding Nehalem's Memory Architecture

Nehalem spices things up a bit in the memory department: not only does it have an integrated memory controller (a first for an x86 Intel CPU), but the memory controller in question has an unusual three-channel configuration. All other AMD and Intel systems use dual-channel DDR2 or DDR3 memory controllers; with each channel being 64 bits wide, you have to install memory in pairs for peak performance.
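Each 64-bit channel's peak bandwidth follows directly from its transfer rate, so the dual- vs. triple-channel gap is easy to quantify. A back-of-the-envelope sketch (the function name is just for illustration):

```python
# Peak theoretical DDR bandwidth:
# transfer rate (MT/s) x bus width (bytes) x number of channels.

def peak_bandwidth_gbps(transfers_mts, bus_width_bits, channels):
    """Peak bandwidth in GB/s (decimal)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mts * bytes_per_transfer * channels / 1000

# DDR3-1066 with 64-bit channels:
single = peak_bandwidth_gbps(1066, 64, 1)  # ~8.5 GB/s per channel
dual   = peak_bandwidth_gbps(1066, 64, 2)  # ~17.1 GB/s
triple = peak_bandwidth_gbps(1066, 64, 3)  # ~25.6 GB/s
```

The ~25.6GB/s figure quoted throughout this article is simply three 64-bit DDR3-1066 channels running flat out.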

With a three-channel DDR3 memory controller, Nehalem requires three DDR3 modules to achieve peak bandwidth - which also means that memory manufacturers will be selling special three-channel DDR3 kits made specifically for Nehalem. Motherboard makers will implement Nehalem's three-channel memory interface in one of two ways - you'll see boards with either four DIMM slots or six:


Four DDR3 slots, three DDR3 channels

In the four-slot configuration the first three slots correspond to the three channels, while the fourth slot simply shares one of the memory channels. The downside to this approach is that your memory bandwidth drops toward single-channel performance as you fill up your memory. For example, with 4 x 1GB sticks, the first 3GB of memory will be interleaved across the three memory channels and you'll get 25.6GB/s of bandwidth to data stored in that 3GB. The final 1GB won't be interleaved, however, and you'll only get 8.5GB/s of bandwidth to it. Despite the unbalanced nature of memory bandwidth in this case, your aggregate bandwidth is still greater in this configuration than in a dual-channel setup.
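The uneven bandwidth in the four-slot case can be averaged out with a simple size-weighted calculation. A rough sketch, assuming accesses are spread uniformly across all installed memory (the helper name is hypothetical):

```python
# Size-weighted average bandwidth across unevenly-interleaved regions,
# assuming accesses are distributed uniformly over all installed memory.

def average_bandwidth(region_sizes_gb, region_bandwidths_gbps):
    total = sum(region_sizes_gb)
    return sum(s * b for s, b in
               zip(region_sizes_gb, region_bandwidths_gbps)) / total

# 4 x 1GB on a four-slot board: the first 3GB is interleaved across
# three channels (~25.6 GB/s); the last 1GB sits on one channel (~8.5 GB/s).
avg = average_bandwidth([3, 1], [25.6, 8.5])  # ~21.3 GB/s
```

Even with the single-channel tail, the ~21.3GB/s average still beats a pure dual-channel setup's ~17.1GB/s, which is why the unbalanced four-slot layout remains a net win.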

 


Six DDR3 slots, two slots per DDR3 channel

The more common arrangement will be six DIMM slots, where each DDR3 channel is connected to a pair of DIMM slots. In this configuration, as long as you install DIMMs in triplicate, you'll always get the full 25.6GB/s of memory bandwidth.

That discussion is entirely theoretical, however; the real question is: does Nehalem's triple-channel memory controller actually matter, or would two channels suffice? I suspect that Hyper-Threading simply improved Nehalem's efficiency, not necessarily its need for more data. The three-channel memory controller is probably far more important for servers, and will be especially useful in the upcoming 8-core version of Nehalem due out sometime next year. To find out, we benchmarked Nehalem in a handful of applications in a 4GB/dual-channel configuration and a 6GB/triple-channel configuration. Note that none of these tests actually used more than 4GB of memory, so the size difference doesn't matter; we kept memory timings the same between all tests.

| Benchmark | Dual Channel DDR3-1066 (9-9-9-20) | Triple Channel DDR3-1066 (9-9-9-20) |
|---|---|---|
| Everest v1547 - Read Bandwidth | 12859 MB/s | 13423 MB/s |
| Everest v1547 - Write Bandwidth | 12410 MB/s | 12401 MB/s |
| Everest v1547 - Copy Bandwidth | 16474 MB/s | 18074 MB/s |
| Everest v1547 - Latency | 37.2 ns | 44.2 ns |
| Cinebench R10 (Multi-threaded test) | 18499 | 18458 |
| x264 HD Encoding Test (First Pass / Second Pass) | 83.8 fps / 30.3 fps | 85.3 fps / 30.3 fps |
| WinRAR 3.80 - 602MB Folder | 118 seconds | 117 seconds |
| PCMark Vantage | 7438 | 7490 |
| Vantage - Memories | 6753 | 6712 |
| Vantage - TV and Movies | 5601 | 5637 |
| Vantage - Gaming | 10202 | 9849 |
| Vantage - Music | 5378 | 4593 |
| Vantage - Communications | 6671 | 6422 |
| Vantage - Productivity | 7589 | 7676 |
| WinRAR (Built-in Benchmark) | 3283 | 3306 |
| Nero Recode - Office Space - 7.55GB | 131 seconds | 130 seconds |
| SuperPI - 32M (mins:seconds) | 11:55 | 11:52 |
| Far Cry 2 - Ranch Medium (1680 x 1050) | 62.1 fps | 62.4 fps |
| Age of Conan - 1680 x 1050 | 51.5 fps | 51.1 fps |
| Company of Heroes - 1680 x 1050 | 136.6 fps | 133.6 fps |

 

At DDR3-1066 speeds we found no real performance difference between the Core i7-965 running in dual-channel vs. triple-channel mode; the added bandwidth is simply not useful for most desktop applications. For some reason we got better latency scores with the dual-channel configuration, but there's a good chance that's due to the early nature of the BIOSes on these boards. In benchmarks where the latency difference was noticeable, the dual-channel configuration pulled slightly ahead; in other tests where the added bandwidth helped, the triple-channel configuration did better. Honestly, it's mostly a wash between the two.
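To put "mostly a wash" in numbers, here's a quick sketch computing the relative difference between the dual- and triple-channel columns for a few of the results above (the selection of benchmarks is just illustrative):

```python
# Relative difference between dual- and triple-channel results from the
# table above (positive = triple channel faster).

results = {  # benchmark: (dual, triple)
    "Everest read (MB/s)": (12859, 13423),
    "Cinebench R10":       (18499, 18458),
    "PCMark Vantage":      (7438, 7490),
    "Far Cry 2 (fps)":     (62.1, 62.4),
}

deltas = {name: (t - d) / d * 100 for name, (d, t) in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

Every delta lands within about +/-5%, and some are slightly negative - exactly what you'd expect when desktop workloads aren't bandwidth-bound.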

Our recommendation would be to stick with three channels, but if you have existing memory and can't populate the third channel yet, it's not a huge deal; two channels are fine for the time being.

73 Comments

  • anand4happy - Sunday, February 8, 2009 - link

    saw many things but this is something different

    sd4us.blogspot.com/2009/01/intel-viivintel-975x-express-955x.html
  • nidhoggr - Monday, November 10, 2008 - link

    I can't find that information on the test setup page.
  • nidhoggr - Monday, November 10, 2008 - link

    test not text :)
  • puffpio - Wednesday, November 5, 2008 - link

    would you guys consider rebenchmarking?
    from the x264 changelog since the nehalem specific optimizations:
    "Overall speed improvement with Nehalem vs Penryn at the same clock speed is around 40%."
  • anartik - Wednesday, November 5, 2008 - link

    Good review and better than Tom's overall. However Tom's stumbled on something that changed my mind about gaming with Nehalem. While Anand's testing shows minimal performance gains (and came to the not-good-for-games conclusion), Tom's approached it with 1-4 GPUs in SLI or Crossfire. All I can say is the performance gains with Nvidia cards in SLI were stunning. Maybe the platform favors SLI, or Nvidia had a driver advantage in licensing SLI to Intel. Either way Nehalem and SLI smoked ATI and the current 3.2 extreme quad across the board.
  • dani31 - Wednesday, November 5, 2008 - link

    I know it wouldn't change any conclusion, but since we're discussing bleeding-edge Intel hardware it would have been nice to see the same in the AMD testbed.

    Using a SB600 mobo (instead of the acclaimed SB750) and an old set of drivers makes it look like the AMD numbers were simply pasted from an old article.
  • Casper42 - Tuesday, November 4, 2008 - link

    Something I think you guys missed in your article/conclusion is the fact that we're now able to pair a great CPU with a pretty damn good North/South Bridge AND SLI.

    I found that the 680/780/790 featureset is plainly lacking and that the Intel ICH9R/10R seems to always perform better and has more features. If any doubt, look at Matrix RAID vs nVidia's RAID. Night and day difference, especially with RAID5.

    The problem with the X38/X48 was you got a great board but were effectively locked into ATI for high end Gaming.

    Now we have the best of both worlds. You get ICH10R, a very well performing CPU (even the 920 beats most of the Intel Quad Core lineup) AND you can run 1/2/3 nVidia GPUs on the machine. In my opinion, this is a winning combination.


    The only downside I see is board designs seem to suck more and more.

    With socket 1366 being so massive and 6 DIMM slots on the Enthusiast/Gamer boards, we're seeing not only 6 expansion slots (down from the standard of 7), but in most boards I have seen pics of, the top slot is an x1 so they can wedge it next to the X58 IOH, which means you're left with only 5 slots for other cards. Using 3 dual-slot cards is out of the question without a massive 10-slot case (of which there are only like 3-5 on the market), and even if you can wedge 2 or 3 dual-slot cards into the machine, you have almost zero expansion card slots should you ever need them.

    Then we get to all the cooling crap surrounding the CPU. ALL these designs rely on a traditional top-down cooler, and if you decide to use a highly effective tower cooling solution, all the little heatsink fins on the Northbridge and power regulators around the CPU get very little or no airflow. Now you're in there adding puny little 40/60mm fans that produce more noise than airflow, not to mention that the DIMMs are hardly ever cooled in today's board designs.
    Call me a cooling purist if you will, but I much prefer traditional front to back airflow and all this side intake top exhaust stuff just makes me cringe. I personally run a Tyan Thunder K8WE with 2 Hyper6+ coolers and the procs and RAM are all cooled front to back. Intake and exhaust are 120mm and I have a bit of an air channel in which that airflow never goes near the expansion card slots below, which by the way have a 92mm fan up front pushing air in across the drives and another 92mm fan clipped onto the expansion slots in the back pulling it back out.

    I don't know how to resolve these issues, but I think someone surely needs to, because IMHO it's getting out of control.
  • lemonadesoda - Tuesday, November 4, 2008 - link

    "Looking at POV-Ray we see a 30% increase in performance for a 12% increase in total system power consumption, that more than exceeds Intel's 2:1 rule for performance improvement vs. increase in power consumption."

    You can't use "total system power", but must make the best estimate of CPU power draw. Why? Because imagine if you had a system with 6 sticks of RAM, 4 HDDs, etc. - you would have ever-increasing power figures that would make the ratio of increased power consumption (a/b) smaller and smaller!

    If you take your figures and subtract (a guesstimate of) 100W for non-CPU power draw, then you DON'T get the Intel 2:1 ratio at all!

    The figures need revisiting.
  • AnnonymousCoward - Thursday, November 6, 2008 - link

    Performance vs power appears to linearly increase with HT. Using the 100W figure for non-CPU draw means a 25% power increase, which is close to the 30% performance.

    Unless we're talking about servers, I think looking at power draw per application is silly. Just do idle power, load power, and maybe some kind of flops/watt benchmark just for fun.
  • silversound - Tuesday, November 4, 2008 - link

    great article, tomshardware reviews are always pro Intel and Nvidia, not sure if they get paid $ to support them. anandtech is always neutral, thx!
