A Quick Path to Memory

Our investigation begins with the most visibly changed part of Nehalem's architecture: the memory subsystem. Nehalem implements a very Phenom-like memory hierarchy consisting of small, fast individual L1 and L2 caches for each of its four cores and then a single, larger shared L3 cache feeding the entire chip.

 

Nehalem's L1 cache, despite being seemingly unchanged from Penryn, does grow in latency: it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core, where Penryn's was 24x that size, and thus can be accessed in only 11 cycles, down from 15 (Penryn added an additional clock cycle over Conroe to access L2).

CPU / CPU-Z Latency                      L1 Cache   L2 Cache    L3 Cache
Nehalem (2.66GHz)                        4 cycles   11 cycles   39 cycles
Core 2 Quad Q9450 - Penryn - (2.66GHz)   3 cycles   15 cycles   N/A

 

The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at much lower clock speeds (2.0GHz). If we put these numbers into relative terms it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.6 ns with Nehalem's - that's nearly 50% longer in Phenom.
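The cycle-to-nanosecond comparison above is easy to check by hand. Here is a minimal sketch that uses only the cycle counts and clock speeds quoted in this section; the function name `cycles_to_ns` is ours and purely illustrative. With the nominal 2.66GHz figure the Nehalem result lands at roughly 14.66 ns, which the text quotes as 14.6 ns.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


# Figures quoted in the text (CPU-Z latency measurements).
nehalem_l3_ns = cycles_to_ns(39, 2.66)  # Nehalem L3 at 2.66GHz
phenom_l3_ns = cycles_to_ns(43, 2.0)    # Phenom L3 at 2.0GHz

print(f"Nehalem L3: {nehalem_l3_ns:.2f} ns")
print(f"Phenom L3:  {phenom_l3_ns:.2f} ns")
print(f"Phenom takes {phenom_l3_ns / nehalem_l3_ns - 1:.0%} longer")
```

The same helper works for the L2 figures in the table: 11 cycles vs. 15 cycles at the same 2.66GHz clock is a straightforward reduction in absolute time as well.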

While Intel did a lot of tinkering with Nehalem's caches, the inclusion of a multi-channel on-die DDR3 memory controller was the most apparent change. AMD has been using an integrated memory controller (IMC) since 2003 on its K8 based microprocessors and for years Intel has resisted doing the same, citing complexities in choosing what memory to support among other reasons for why it didn't follow in AMD's footsteps.

With clock speeds increasing and up to 8 cores (including GPUs) making their way into Nehalem based CPUs in the coming year, the time to narrow the memory gap is upon us. You can already tell that Nehalem was designed to mask the distance between the individual CPU cores and main memory with its cache design, and the IMC is a further extension of the philosophy.

The motherboard implementation of our 2.66GHz system needed some work, so our memory bandwidth/latency numbers on it were way off (slower than Core 2). Luckily, we had another platform at our disposal running at 2.93GHz that was working perfectly. We turned to Everest Ultimate 4.50 to give us memory bandwidth and latency numbers from Nehalem.

Note that these figures are from a completely untuned motherboard and are using DDR3-1066 (dual-channel on the Core 2 system and triple-channel on the Nehalem system):

CPU / Everest Ultimate 4.50                  Memory Read   Memory Write   Memory Copy   Memory Latency
Nehalem (2.93GHz)                            13.1 GB/s     12.7 GB/s      12.0 GB/s     46.9 ns
Core 2 Extreme QX9650 - Penryn - (3.00GHz)   7.6 GB/s      7.1 GB/s       6.9 GB/s      66.7 ns

 

Memory accesses on Conroe/Penryn were quick due to Intel's very aggressive prefetchers; memory accesses on Nehalem are just plain fast. Nehalem takes a little over 2/3 the time Penryn does to complete a memory request (46.9 ns vs. 66.7 ns), and although we didn't have time to run comparable Phenom numbers, I believe Nehalem's DDR3 memory controller is faster than Phenom's DDR2 controller.
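To put the latency numbers in perspective, it helps to express them as a ratio and as core clock cycles. The sketch below is a rough illustration that uses only the Everest latencies and clock speeds from the table; the helper name `ns_to_cycles` is our own, and the cycle counts are simple arithmetic, not measured values.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to core clock cycles.

    At f GHz the core completes f cycles per nanosecond.
    """
    return latency_ns * clock_ghz


# Everest Ultimate 4.50 memory latency figures from the table.
nehalem_ns, nehalem_ghz = 46.9, 2.93
penryn_ns, penryn_ghz = 66.7, 3.00

# Fraction of Penryn's time that Nehalem needs for a memory request.
latency_ratio = nehalem_ns / penryn_ns

# How many core cycles each CPU spends waiting on main memory.
nehalem_cycles = ns_to_cycles(nehalem_ns, nehalem_ghz)
penryn_cycles = ns_to_cycles(penryn_ns, penryn_ghz)

print(f"Nehalem / Penryn latency: {latency_ratio:.2f}")
print(f"Nehalem: ~{nehalem_cycles:.0f} cycles per memory access")
print(f"Penryn:  ~{penryn_cycles:.0f} cycles per memory access")
```

Even with the integrated controller, a trip to main memory still costs well over a hundred core cycles, which is the gap the cache hierarchy is there to hide.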

Memory bandwidth is obviously greater with three DDR3 channels; Everest measured around a 70% increase in read bandwidth (13.1 GB/s vs. 7.6 GB/s). While we don't have the memory bandwidth figures here, Gary measured a 10% difference in WinRAR performance (a test that's highly influenced by memory bandwidth and latency) between single-channel and triple-channel Nehalem configurations.
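For context, the measured figures can be set against the theoretical peak of the memory configuration. This is a back-of-the-envelope sketch, not part of our testing: it assumes the standard 64-bit (8-byte) DDR3 channel width and decimal gigabytes, neither of which is stated in the text above, and the function name `peak_bandwidth_gbs` is hypothetical.

```python
def peak_bandwidth_gbs(transfers_mts: float, channels: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal).

    transfers_mts: data rate in megatransfers/s (1066 for DDR3-1066).
    bus_bytes: bytes per transfer per channel (assumed 8, i.e. 64-bit).
    """
    return transfers_mts * bus_bytes * channels / 1000


# Everest read figures from the table (GB/s).
nehalem_read, penryn_read = 13.1, 7.6

# Relative gain in measured read bandwidth, Nehalem over Penryn.
read_gain = nehalem_read / penryn_read - 1

# Theoretical ceiling for triple-channel DDR3-1066.
triple_peak = peak_bandwidth_gbs(1066, channels=3)

print(f"Read bandwidth gain: {read_gain:.0%}")
print(f"Triple-channel DDR3-1066 peak: {triple_peak:.1f} GB/s")
print(f"Nehalem read efficiency: {nehalem_read / triple_peak:.0%}")
```

On an untuned board, the measured read number sits well below the theoretical ceiling, so there is room for these results to improve as the platform matures.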

While we didn't really expect Intel to somehow do wrong with Nehalem's memory architecture, it's important to point out that it is very well implemented. Intel managed to change the cache structure and introduce an integrated memory controller while making both significantly faster than what AMD managed despite a four-year head start.

In short: Nehalem can get data out of memory quick like bunnies.


108 Comments


  • ForumMaster - Thursday, June 5, 2008

    if you'd bother to read the article, it states quite clearly there are PCI-E issues which prevent any GPU testing as of now. it says the motherboard makers need another month to iron out the issues.

    what amazed me is how much better the performance is at this point. when nehalem is optimized, wow.
  • JPForums - Thursday, June 5, 2008

    I wouldn't expect much further optimization of the CPU. The only optimizations in code that couldn't have been made for Penryn would be further threading. Motherboard and chipset optimizations could make a difference, but only up to the point where they are mature. After that there will be little to differentiate CPU performance.

    Like with AMD, implementing the on-die memory controller gives Intel a free performance boost. There are no new instructions to implement and the improvement doesn't apply only to rare or obscure scenarios. Getting data into the core faster with less latency simply makes everything faster. It also serves to further minimize performance differences between supporting platforms.

    What impresses me is that Intel got it right on the first try. It doesn't really surprise me as they have far more resources to work with, but it is nonetheless impressive.

    The article mentions that Pat said you can only add a memory controller once. Is this somehow different from any other architecture improvement? You can only add SSE or hyperthreading, or a new divider architecture once. Improvements to SSE2 and beyond or adding more thread support in hyperthreading are no different than putting in a DDR4 (or newer) controller with 4 (or more) channel support. Note: I don't advocate trying to add further thread support in hyperthreading. In fact, one of the few architectural changes that I can think of that can be used more than once is increasing the cache size. Since AMD can't keep up with Intel's cache size due to process inferiority, it seems like an obvious viewpoint for Intel to take.

    I suspect that the real reasons Intel didn't move over sooner were:
    1) They didn't want to be seen as trailing AMD.
    2) More importantly, an on-die memory controller reduces the advantage of larger caches. Alternately, from Intel's perspective, a P4 processor would not have seen nearly as much benefit from an on-die memory controller due to its heavy reliance on large cache sizes. Benchmarks of the P4s showed that raw memory bandwidth was great for the P4s, but they couldn't care less about memory latencies (the largest advantage of an on-die memory controller) as they were hidden by the large cache size. Fast forward to the Core2s of today and you'll see major performance increases from lowering memory latencies on the X38/X48 chipsets. I believe this is true even to the point that the best performance isn't necessarily in line with the highest frequency overclock anymore. Even though Core2 has an even larger cache, it doesn't rely on it as much. Consider how much closer the performances of Intel's budget line (read: low cache) processors are to the mainstream than they were in the P4 era.

    Intel was not going to introduce an on-die memory controller on an architecture that it made little sense to add it to. While it made sense with Core2, it would've taken much longer to get the chips out with one and Intel didn't have the luxury of time at Conroe's launch. Further, Intel would need to give something up to put it in. It is debatable whether they would see the same performance improvements depending on what got left out or changed. In conclusion, Intel added the on-die memory controller when it made the most sense.

    Hyper-transport, on the other hand, was something Intel could've used a long time ago. Though they probably left it out because they weren't getting rid of the front side bus and they didn't want even more communications paths. Quickpath is a welcome improvement, though I'd really like to delve into the details before comparing it to hyper-transport. It'll be a real shame if they use time-division-multiplexing for the switching structure. (Assuming it supports multiple paths of course.)

    The article mentions that the cache latencies and memory latencies are superior to Phenom's. While this is true, I don't really think AMD screwed anything up. Rather, Intel is simply enjoying the benefits of its smaller process technology and newer memory standard. You need look no further than Anandtech to find articles explaining the absolute latency differences between DDR2 and DDR3. Intel's memory latencies may still be a bit lower than AMD's when AMD moves over to the smaller process with a DDR3 controller, but I doubt it'll be earth shattering.

    The good news for AMD is that Intel has essentially told them that they are on the right path with their architecture design. The bad news is that Intel also just told them that it doesn't matter if they are right, Intel is so fast that they can take a major detour and still find their way to the destination before AMD arrives. Hopefully, AMD will pick up speed once they're done paying for the ATI merger.
  • Gary Key - Thursday, June 5, 2008

    I was able to view but not personally benchmark a recently optimized Bloomfield/X58 system this week and it was blindingly fast in several video benchmarks compared to a QX9650. These numbers were before the GPUs became a bottleneck. ;)
  • BansheeX - Thursday, June 5, 2008

    The performance conclusion might be a good example of why a monopoly is neither self-perpetuating nor an inherently bad thing for the consumer. It IS possible for a virtual monopoly like Intel to be making the best product for the consumer. Perhaps the fear itself of losing that position is enough for such companies to not be complacent or attempt to overprice products, as it would open a window for smaller capital to come in and take marketshare. Just keep them away from subsidies and other special privileges, and the market will always work out for the best. You listening, Europe?
  • Chesterh - Saturday, August 9, 2008

    Go back to school and take Economics 101. Monopolies and corporate consolidation are part of the reason our economy is in the crapper right now. If the US government had been actually enforcing the antitrust regulations on the books, we might have done the same thing as Europe and slapped Microsoft on the wrist as well.

    Besides, Intel does not have a 'virtual monopoly', or any kind of monopoly. AMD is not out of the game; they are down in the high end CPU segment, but they are definitely not out. The only reason Intel is still releasing aggressively competitive new products is because it doesn't want to lose its lead over AMD. If there was no AMD, we might not even have multicore procs at this point.
  • SiliconDoc - Monday, July 28, 2008

    Uhhh, are you as enamored with Intel as Anand has been for as many years ?
    Did you say "monopoly not overprice its products" ?
    Ummm... so like after they went insane locking their multipliers because they hate we overclockers ( they can whine about chip remarkers - whatever) - they suddenly... when they could make an INSANE markup...ooooh..... they UNLOCKED their processor they "locked up on us".... made a "cool tard" marketing name and skyrocketed the price...
    Well - you know what you said... like I wish some hot 21yr. old virgin would kiss me that hard.
    With friends like you monopolies can beat all enemies... ( not that there's anything wrong with that).
    \Grrrrr -
    I know, you're just being positive something I'm not very good at.
    I lost my cool when I read " the 1633 pin socket will be a knocked down 1066 version "for the public" or the like....
    You never get the feeling your 3rd phalanx is permanently locked to your talus, both sides ?
    Hmmm.... man we're in trouble.
  • Grantman - Sunday, July 6, 2008

    A monopoly is the worst thing that could happen for consumers in every regard, and the ease of entry for the smaller businesses to enter the processor market and gobble up marketshare is laughable. Firstly, once a monopoly is established they can set the prices at any height they want, and you think smaller players can take advantage of the opportunity and enter the sophisticated CPU market without being crushed by aggressive price cutting? Not only that, but as the monopoly makes its close to 100% market share revenue it will reach a point where its research and development is simply far ahead of any hopeful to enter the market, thus perpetuating its ongoing domination.
  • Justin Case - Sunday, June 8, 2008

    The only reason why Intel came out with these CPUs at all is that there _is_ competition. Without that, Intel would still be run by its marketing department and we'd be paying $1000 for yet another 100 MHz increase on some Pentium 4 derivative.

    The words "a monopoly isn't a bad thing for consumers" sound straight out of the good ol' USSR. You need to study some Economics 101; you clearly don't understand what "barriers to entry" or "antitrust" mean.

  • mikkel - Sunday, August 10, 2008

    Are you honestly suggesting that desktop performance requirements are the only thing driving processing innovation? I'm fairly sure that you'd find a very good number of server vendors whose customers wouldn't be satisfied with P4 derivatives today.

    There are market forces that Intel couldn't hold back even if they wanted to.
  • adiposity - Friday, June 6, 2008

    AMD is not dead yet and is still undercutting Intel at every price point they are able to. Intel will not rest until AMD is dead or completely non-competitive. At that point we may see a return to the arrogant, bloated Intel of old.

    All that said, their engineers are awesome and deserve credit for delivering again and again since Intel decided to compete seriously. They have done a great job and provided superior performance.

    The only question is: will Intel corporate stop funding R&D and just rake in profits once AMD is dead and gone? Unless they get lucky in court in 2010, I think AMD's death is now a foregone conclusion.

    Dan
