A Quick Path to Memory

Our investigation begins with the most visibly changed part of Nehalem's architecture: the memory subsystem. Nehalem implements a very Phenom-like memory hierarchy consisting of small, fast individual L1 and L2 caches for each of its four cores and then a single, larger shared L3 cache feeding the entire chip.

 

Nehalem's L1 cache, despite being seemingly unchanged from Penryn, does grow in latency; it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core instead of being 24x the size in Penryn and thus can be accessed in only 11 cycles down from 15 (Penryn added an additional clock cycle over Conroe to access L2).

 CPU / CPU-Z Latency L1 Cache L2 Cache L3 Cache
Nehalem (2.66GHz) 4 cycles 11 cycles 39 cycles
Core 2 Quad Q9450 - Penryn - (2.66GHz) 3 cycles 15 cycles N/A

 

The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at much lower clock speeds (2.0GHz). If we put these numbers into relative terms it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.6 ns with Nehalem's - that's nearly 50% longer in Phenom.

While Intel did a lot of tinkering with Nehalem's caches, the inclusion of a multi-channel on-die DDR3 memory controller was the most apparent change. AMD has been using an integrated memory controller (IMC) since 2003 on its K8 based microprocessors and for years Intel has resisted doing the same, citing complexities in choosing what memory to support among other reasons for why it didn't follow in AMD's footsteps.

With clock speeds increasing and up to 8 cores (including GPUs) making their way into Nehalem based CPUs in the coming year, the time to narrow the memory gap is upon us. You can already tell that Nehalem was designed to mask the distance between the individual CPU cores and main memory with its cache design, and the IMC is a further extension of the philosophy.

The motherboard implementation of our 2.66GHz system needed some work so our memory bandwidth/latency numbers on it were way off (slower than Core 2), luckily we had another platform at our disposal running at 2.93GHz which was working perfectly. We turned to Everest Ultimate 4.50 to give us memory bandwidth and latency numbers from Nehalem.

Note that these figures are from a completely untuned motherboard and are using DDR3-1066 (dual-channel on the Core 2 system and triple-channel on the Nehalem system):

 CPU / Everest Ultimate 4.50 Memory Read Memory Write Memory Copy Memory Latency
Nehalem (2.93GHz) 13.1 GB/s 12.7 GB/s 12.0 GB/s 46.9 ns
Core 2 Extreme QX9650 - Penryn - (3.00GHz) 7.6 GB/s 7.1 GB/s 6.9 GB/s 66.7 ns

 

Memory accesses on Conroe/Penryn were quick due to Intel's very aggressive prefetchers, memory accesses on Nehalem are just plain fast. Nehalem takes a little over 2/3 the time to complete a memory request as Penryn, and although we didn't have time to run comparable Phenom numbers I believe Nehalem's DDR3 memory controller is faster than Phenom's DDR2 controller.

Memory bandwidth is obviously greater with three DDR3 channels, Everest measured around a 70% increase in read bandwidth. While we don't have the memory bandwidth figures here, Gary measured a 10% difference in WinRAR performance (a test that's highly influenced by memory bandwidth and latency) between single-channel and triple-channel Nehalem configurations.

While we didn't really expect Intel to somehow do wrong with Nehalem's memory architecture, it's important to point out that it is very well implemented. Intel managed to change the cache structure and introduce an integrated memory controller while making both significantly faster than what AMD managed despite a four-year headstart.

In short: Nehalem can get data out of memory quick like bunnies.

The Return of Hyper Threading Nehalem's Media Encoding Performance
Comments Locked

108 Comments

View All Comments

  • mkruer - Thursday, June 5, 2008 - link

    Not a problem.

    I tend not to take most things at face value. Looking at the Nehalem, its focus was to increase the multi threaded performance, not the single thread app per say. This would put it more inline with what AMD is offering on per core scalability. The Nehalem will get Intel back into the big iron scalability that it lost to AMD.

    My guess is that the Nehalem will not give users any real advantage playing games or other single threaded apps, unless the game or app supports more then one thread.

    The final question is poised back to AMD. If AMD gets their single threaded IPC and clock speed up, then both platforms should be near identical from a performance standpoint. Then it is just down to price, manufacturing and distribution. I just hope that AMD claims of 15-20% improvement in per core IPC are true. This should make this holiday season much more interesting.
  • Anand Lal Shimpi - Thursday, June 5, 2008 - link

    Nehalem most definitely had a server focus coming up, but I wouldn't underestimate what the IMC will do for CPU-bound gaming performance. Don't forget what the IMC did for the K8 vs. Athlon XP way back when...

    As far as AMD goes, clock speed issues should get resolved with the move to 45nm. The IPC stuff should get taken care of with Bulldozer, the question is when can we expect Bulldozer?
  • JumpingJack - Saturday, June 7, 2008 - link

    Don't count on 45 nm clocking up much higher than 65 nm, maybe another bin or so.... gate leakage and SCE are limiting and the reason for the sideways move from 90 to 65 nm to begin with (traditional gate ox, SiO2, did not scale 90 to 65 nm) ... the next chance for a decent clock bump will come with their inclusion of HKMG. Which from the rumor mill isn't until 1H09.
  • fitten - Friday, June 6, 2008 - link

    AMD hasn't really resolved any clock speed issues from the move from 130nm -> 90nm -> 65nm (look at the top speed 130nm parts compared to the top speed 65nm parts). During some of those transitions, the introductory parts actually were slower clocked than the higher clocked of the previous process and didn't even catch up for some time.
  • bcronce - Thursday, June 5, 2008 - link

    Does anyone know why Intel is claiming NUMA on these? I'm assuming you need a multi-cpu system for such uses, but how is the memory segmented that it's NUMA?
  • bcronce - Thursday, June 5, 2008 - link

    Seems Arstechnica(http://arstechnica.com/articles/paedia/cpu/what-yo...">http://arstechnica.com/articles/paedia/...-you-nee... has info on NUMA.

    Assuming more than 1 node being used, each node connects to the Memmory hub and gets assigned it's own *default* memory bank. A one node computer won't see any diff, but a 2-4 node will get a default memory bank and reduced latencies. A node can interleave the data amoung the 2-4 memory banks, but DDR3 is freak'n fast and probably best just streaming from your own bank to reduce contention amoung the nodes.
  • RobberBaron - Thursday, June 5, 2008 - link

    I think there are going to be other issues revolving around this chip. For example:

    http://www.fudzilla.com/index.php?option=com_conte...">http://www.fudzilla.com/index.php?optio...amp;task...


    Nvidia's Director or PR, Derek Perez, has told Fudzilla that Intel actually won't let Nvidia make its Nforce chipset that will work with Intel's Nehalem generation of processors.

    We confirmed this from Intel’s side, as well as other sources. Intel told us that there won't be an Nvidia's chipset for Nehalem. Nvidia will call this a "dispute between companies that they are trying to solve privately," but we believe it's much more than that.
  • AmberClad - Thursday, June 5, 2008 - link

    That still leaves you with CrossFire and cards with multiple GPUs like the 9800 X2. It's a tiny fraction of the market that actually uses SLI anyway.

    Eh, who knows, maybe Nvidia will finally cave and grant that SLI license, and we'll finally have decent chipsets with SLI.
  • chizow - Thursday, June 5, 2008 - link

    Agreed, as much as I love NV GPUs, I'm tired of having SLI tied to NV's buggy chipsets. Realistically I'd probably just get an Intel chipset with Nehalem even if there was an Nforce SLI variant and just go with the fastest single-GPU processor.
  • Baked - Thursday, June 5, 2008 - link

    Maybe I can finally grab that E8400 when it drops to $50.

Log in

Don't have an account? Sign up now