A Quick Path to Memory

Our investigation begins with the most visibly changed part of Nehalem's architecture: the memory subsystem. Nehalem implements a very Phenom-like memory hierarchy consisting of small, fast individual L1 and L2 caches for each of its four cores and then a single, larger shared L3 cache feeding the entire chip.

 

Nehalem's L1 cache, despite being seemingly unchanged from Penryn, does grow in latency: it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core, whereas Penryn's L2 was 24x that size, and thus can be accessed in only 11 cycles, down from 15 (Penryn added an additional clock cycle over Conroe to access L2).

| CPU / CPU-Z Latency | L1 Cache | L2 Cache | L3 Cache |
|---|---|---|---|
| Nehalem (2.66GHz) | 4 cycles | 11 cycles | 39 cycles |
| Core 2 Quad Q9450 - Penryn - (2.66GHz) | 3 cycles | 15 cycles | N/A |

 

The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at a much lower clock speed (2.0GHz). If we put these numbers into absolute terms, it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.7 ns with Nehalem's; that's nearly 50% longer on Phenom.
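
The cycle-to-nanosecond conversion is easy to check by hand. Here is a minimal sketch of it; the helper name `cycles_to_ns` is ours, purely for illustration, and the inputs are the cycle counts and clock speeds quoted above.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


# Cycle counts and clock speeds as quoted in the article.
nehalem_l3_ns = cycles_to_ns(39, 2.66)
phenom_l3_ns = cycles_to_ns(43, 2.0)

# How much longer Phenom's L3 takes relative to Nehalem's, in percent.
phenom_penalty_pct = 100 * (phenom_l3_ns / nehalem_l3_ns - 1)

print(f"Nehalem L3: {nehalem_l3_ns:.2f} ns")
print(f"Phenom L3:  {phenom_l3_ns:.2f} ns")
print(f"Phenom takes {phenom_penalty_pct:.0f}% longer")
```

This works out to about 14.66 ns for Nehalem and 21.5 ns for Phenom, a gap of roughly 47%, which is where the "nearly 50%" figure comes from.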

While Intel did a lot of tinkering with Nehalem's caches, the inclusion of a multi-channel on-die DDR3 memory controller was the most apparent change. AMD has been using an integrated memory controller (IMC) since 2003 on its K8 based microprocessors and for years Intel has resisted doing the same, citing complexities in choosing what memory to support among other reasons for why it didn't follow in AMD's footsteps.

With clock speeds increasing and up to 8 cores (including GPUs) making their way into Nehalem based CPUs in the coming year, the time to narrow the memory gap is upon us. You can already tell that Nehalem was designed to mask the distance between the individual CPU cores and main memory with its cache design, and the IMC is a further extension of the philosophy.

The motherboard implementation of our 2.66GHz system needed some work, so our memory bandwidth/latency numbers on it were way off (slower than Core 2). Luckily, we had another platform at our disposal running at 2.93GHz that was working perfectly. We turned to Everest Ultimate 4.50 to give us memory bandwidth and latency numbers from Nehalem.

Note that these figures are from a completely untuned motherboard and are using DDR3-1066 (dual-channel on the Core 2 system and triple-channel on the Nehalem system):

| CPU / Everest Ultimate 4.50 | Memory Read | Memory Write | Memory Copy | Memory Latency |
|---|---|---|---|---|
| Nehalem (2.93GHz) | 13.1 GB/s | 12.7 GB/s | 12.0 GB/s | 46.9 ns |
| Core 2 Extreme QX9650 - Penryn - (3.00GHz) | 7.6 GB/s | 7.1 GB/s | 6.9 GB/s | 66.7 ns |
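
For context, the measured read figures can be set against the theoretical peak of DDR3-1066. The sketch below is our own back-of-the-envelope arithmetic, not something Everest reports. It assumes a 64-bit (8-byte) channel, the nominal 1066 MT/s transfer rate, and 1 GB = 10^9 bytes; `peak_bandwidth_gbs` is a hypothetical helper name.

```python
BYTES_PER_TRANSFER = 8  # assumption: one DDR3 channel is 64 bits wide


def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s, taking 1 GB as 10**9 bytes."""
    bytes_per_s = mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels
    return bytes_per_s / 1e9


dual_peak = peak_bandwidth_gbs(1066, channels=2)    # Core 2 test system
triple_peak = peak_bandwidth_gbs(1066, channels=3)  # Nehalem test system

# Measured read bandwidth from the Everest table, as a fraction of peak.
core2_read_fraction = 7.6 / dual_peak
nehalem_read_fraction = 13.1 / triple_peak

print(f"Dual-channel peak:   {dual_peak:.1f} GB/s")
print(f"Triple-channel peak: {triple_peak:.1f} GB/s")
print(f"Core 2 read:  {core2_read_fraction:.0%} of peak")
print(f"Nehalem read: {nehalem_read_fraction:.0%} of peak")
```

Under those assumptions the peaks come out to roughly 17.1 GB/s and 25.6 GB/s, so both platforms are reading at around half of what the DIMMs could theoretically deliver, which fits with the boards being completely untuned.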

 

Memory accesses on Conroe/Penryn were quick due to Intel's very aggressive prefetchers; memory accesses on Nehalem are just plain fast. Nehalem takes a little over 2/3 the time Penryn does to complete a memory request, and although we didn't have time to run comparable Phenom numbers, I believe Nehalem's DDR3 memory controller is faster than Phenom's DDR2 controller.

Memory bandwidth is obviously greater with three DDR3 channels; Everest measured around a 70% increase in read bandwidth. While we don't have the memory bandwidth figures here, Gary measured a 10% difference in WinRAR performance (a test that's highly influenced by memory bandwidth and latency) between single-channel and triple-channel Nehalem configurations.
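
The relative comparisons in the last two paragraphs fall straight out of the Everest table. A short sketch of that arithmetic follows; the dictionary layout and the `pct_increase` helper are ours for illustration, while the numbers are the ones published above.

```python
# Everest Ultimate 4.50 results quoted in the article (GB/s and ns).
nehalem = {"read": 13.1, "write": 12.7, "copy": 12.0, "latency_ns": 46.9}
penryn = {"read": 7.6, "write": 7.1, "copy": 6.9, "latency_ns": 66.7}


def pct_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old


# Bandwidth gain of Nehalem over Penryn for each Everest test.
bandwidth_gain = {
    test: pct_increase(nehalem[test], penryn[test])
    for test in ("read", "write", "copy")
}

# Fraction of Penryn's memory latency that Nehalem needs.
latency_ratio = nehalem["latency_ns"] / penryn["latency_ns"]

for test, gain in bandwidth_gain.items():
    print(f"{test:>5}: +{gain:.0f}% bandwidth on Nehalem")
print(f"latency: Nehalem takes {latency_ratio:.0%} of Penryn's time")
```

Read bandwidth comes out around 72% higher and latency lands at about 70% of Penryn's, matching the "around 70%" and "a little over 2/3" figures quoted in the text.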

While we didn't really expect Intel to somehow do wrong with Nehalem's memory architecture, it's important to point out that it is very well implemented. Intel managed to change the cache structure and introduce an integrated memory controller while making both significantly faster than what AMD managed despite a four-year head start.

In short: Nehalem can get data out of memory quick like bunnies.


108 Comments


  • weihlmus - Friday, September 05, 2008 - link

    is it just me or does lga1366 bear more than a passing resemblance to amd's logo...

    if only they could have crammed it into 1337 pins - "nehalem - the 1337 chip"
  • Proteusza - Friday, July 11, 2008 - link

    I see that on the Daily Tech main page, the headline for this article now reads:
    "June 5, 2008
    Post date on AnandTech's Nehalem preview, before it was ripped and republished on Tom's Hardware"

    Does anyone know what happened? I can't find the same article on Tomshardware, I presume they took it down.
  • xsilver - Friday, June 20, 2008 - link

    10 pages of comments and not one about the future of overclocking?

    No more FSB = no more overclocking??????
    Enthusiasts might jump ship if overclocking usually brings 20% extra performance, all amd have to do is come within 10% on performance and below on price?
  • Akabeth - Thursday, June 19, 2008 - link

    Jebuzzz

    This pretty much makes it pointless to purchase any high tier mobo and quad core today... It will be eclipsed in 6 months time...

    Some of the numbers here make me wonder, "Are you f*cking kidding me?"
  • JumpingJack - Friday, June 13, 2008 - link

    Anand, you could clear up some confusion if you could specify the version of windows you ran. The screen shots of some of your benchmarks show 64-bit Vista, yet your scores are in line with 32-bit Vista.... it makes a difference.
  • JumpingJack - Friday, June 13, 2008 - link

    Nevermind, page 2 shows 32-bit Vista... makes sense now. You should be careful when posting stock photos, the Cinebench reflects 64-bit.
  • barnierubble - Sunday, June 08, 2008 - link

    It appears to me that the Tick Tock cycle diagram is wrong.

    Conroe shook the world in the second half of 2006, Penryn came in the second half of 2007, and now in 2008 we have Nehalem on the horizon, set for the second half of this year.

    Now that is 2 years between new architectures with the intermediate year bringing a shrink derivative.

    That is not what the diagram shows; bracketing shrink derivative and new architecture in a 2 year cycle is clearly not fitting the reality.

  • mbf - Saturday, June 07, 2008 - link

    Will the new IMC support ECC RAM? And if so, what are the odds the consumer versions will too? I've had a bit of bad luck with memory errors in the past. Since then I swear by ECC memory, even though it costs me a bit of performance. :)
  • Natima - Saturday, June 07, 2008 - link

    I just thought I'd point out that the Bloomfield chip reviewed (to be released in H2 2008) will in fact dominate the gamer/high-end market.
    http://en.wikipedia.org/wiki/Nehalem_(microarchite...

    The smaller sockets will be for what I like to call... "office PC's".
    And the larger socket for high-end servers.

    A majority of custom PC builders will be able to buy & use Nehalems by the end of the year. Hoorah!
  • Natima - Saturday, June 07, 2008 - link

    The article semi-implied that chips for the PC enthusiast would not be out until mid-2009. Just wanted to clarify this for people.
