Challenging. That is the least you can say about the economic climate for the launch of Intel's newest "Nehalem EP Xeon" platform. However, challenges must be met, and they certainly make things more interesting. The server vendors won't convince a lot of people to buy a new Intel Nehalem (or AMD Shanghai) based server just because "performance is higher". That argument only works in the processing-hungry HPC and rendering worlds, where less time per task translates directly into time and cost savings. Hence, the challenge for AMD and Intel is to convince the rest of the market (95% or so) that the new platforms provide a compelling ROI (Return On Investment).
 
The most productive or most intensively used servers are generally replaced every 3 to 5 years. Based on its own inquiries, Intel estimates that the current installed base consists of 40% dual-core CPU servers and 40% servers with single-core CPUs.
 

That means that Intel's Nehalem platform (and AMD's Shanghai/Opteron 23xx platform) has to convince people to replace their dual-core Opteron, dual-core Xeon 50xx ("Dempsey"), and Xeon "Irwindale" servers. There are two great ways to turn a much more powerful server into a moneymaking and cost-saving machine. One is to use fewer servers in a cluster, which is not applicable to all companies. The other, more popular approach is to consolidate more servers on the same physical machine by using virtualization. The most important arguments for upgrading your servers are therefore performance/watt and support for virtualization.

Intel's newest platform promises better support for virtualization by adding EPT (Extended Page Tables) and lowering world switch times. However, probably the largest bottleneck in the past was the amount of available memory bandwidth. Bandwidth is frequently an overrated performance factor, as few applications outside the HPC world get a boost from, for example, using three instead of two memory channels. That changes dramatically when you are running tens of virtual machines on top of a physical machine: many applications with medium bandwidth demands morph into one big bandwidth-hogging monster. The challenge is thus to provide the fastest possible access to memory, lower energy consumption, and better support for virtualization. On paper, the Nehalem architecture definitely can play all those trump cards. Anand has provided a detailed description of the Nehalem architecture. The most important improvements for business applications are:

  • The integrated memory controller talks to its own local memory or remote memory (NUMA). Memory access takes between 27 and 54 ns (80 to 161 cycles). Compare this to the Xeon 5450 at the same clock speed where memory access via the MC in the chipset can take up to 123 ns! The closest competitor (Opteron "Shanghai") needs between 32 and 71 ns.
  • A native quad-core design with a fast 33-cycle L3 cache makes it easy for the L2 caches to exchange cache coherency information.
  • Fast CPU interconnects make sure that the rest of the snoops happen very fast and do not interfere with other traffic.
  • The memory controller has up to three channels. A dual CPU configuration has access to 35GB/s of memory bandwidth (measured with Stream) if you use DDR3-1333. The latest dual Opteron achieves 19.4GB/s with DDR2-800.
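
As a rough cross-check of the figures in the list above, the following Python sketch derives the clock speed implied by the quoted latencies and compares the measured Stream bandwidth with the theoretical peak of the memory channels. The function names are made up for illustration, and the channel counts are assumptions: three DDR3 channels per Nehalem socket (the stated maximum), and two DDR2 channels per Opteron socket, which the list does not state.

```python
# Back-of-the-envelope check of the latency and bandwidth figures quoted above.
# Nothing here is newly measured; all inputs come from the list or are
# labeled assumptions.


def implied_clock_ghz(latency_ns, latency_cycles):
    """Clock frequency implied by a latency given in both ns and cycles.

    Cycles per nanosecond is numerically equal to GHz.
    """
    return latency_cycles / latency_ns


def peak_bandwidth_gbs(transfers_mt_s, channels, sockets, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s.

    Assumes standard 64-bit (8-byte) DDR channels.
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels * sockets / 1e9


def efficiency(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak that the measurement reached."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    # Nehalem latencies from the list: 27 ns = 80 cycles, 54 ns = 161 cycles.
    for ns, cycles in ((27, 80), (54, 161)):
        print(f"{ns} ns / {cycles} cycles -> {implied_clock_ghz(ns, cycles):.2f} GHz")

    # Assumption: 3 channels of DDR3-1333 per socket, 2 sockets.
    nehalem_peak = peak_bandwidth_gbs(1333, channels=3, sockets=2)
    # Assumption: 2 channels of DDR2-800 per socket, 2 sockets.
    opteron_peak = peak_bandwidth_gbs(800, channels=2, sockets=2)

    print(f"Nehalem: 35.0 of {nehalem_peak:.1f} GB/s "
          f"({efficiency(35.0, nehalem_peak):.0%} of peak)")
    print(f"Opteron: 19.4 of {opteron_peak:.1f} GB/s "
          f"({efficiency(19.4, opteron_peak):.0%} of peak)")
```

Under these assumptions, the quoted latencies imply a clock of just under 3 GHz. The dual Nehalem system reaches roughly 55% of its theoretical peak and the dual Opteron roughly 76%, so the absolute bandwidth advantage comes from the extra channel and faster DDR3 rather than from higher channel efficiency.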

Basically, Nehalem is Intel's version of the improvements found in the AMD Barcelona platform, only better (or at least that's the goal). Let's see what it can do in reality.

44 Comments

  • snakeoil - Monday, March 30, 2009

    oops, it seems that hyperthreading is not scaling very well. too bad for intel
  • eva2000 - Tuesday, March 31, 2009

    Bloody awesome results for the new 55xx series. Can't wait to see some of the larger vBulletin forums online benefiting from these monsters :)
  • ssj4Gogeta - Monday, March 30, 2009

    huh?
  • ltcommanderdata - Monday, March 30, 2009

    I was wondering if you got any feeling whether Hyperthreading scaled better on Nehalem than Netburst? And if so, do you think this is due to improvements made to HT itself in Nehalem, just due to Nehalem's 4+1 instruction decoders and more execution units, or because software is better optimized for multithreading/hyperthreading now? Maybe I'm thinking mostly desktop, but HT had kind of a hit or miss reputation in Netburst, and it'd be interesting to see if it just came before its time.
  • TA152H - Monday, March 30, 2009

    Well, for one, the Nehalem is wider than the Pentium 4, so that's a big issue there. On the negative side (with respect to HT increase, but really a positive) you have better scheduling with Nehalem, in particular, memory disambiguation. The weaker the scheduler, the better the performance increase from HT, in general.

    I'd say it's both. Clearly, the width of Nehalem would help a lot more than the minor tweaks. Also, you have better memory bandwidth, and in particular, a large L1 cache. I have to believe it was fairly difficult for the Pentium 4 to keep feeding two threads with such a small L1 cache, and then you have the additional L2 latency vis-a-vis the Nehalem.

    So, clearly the Nehalem is much better designed for it, and I think it's equally clear software has adjusted to the reality of more computers having multiple processors.

    On top of this, these are server applications they are running, not mainstream desktop apps, which might show a different profile with regards to Hyper-threading improvements.

    It would have to be a combination.
  • JohanAnandtech - Monday, March 30, 2009

    The L1 cache and the way that the Pentium 4 decoded were an important (maybe even the most important) factor in the mediocre SMT performance. Whenever the trace cache missed (and it was quite small, roughly the equivalent of 16 KB), the Pentium 4 had only one real decoder. This means that you have to feed two threads with one decoder. In other words, whenever you get a miss in the trace cache, HT did more harm than good in the Pentium 4. That is clearly not the case in Nehalem, with its excellent decoding capabilities and larger L1.

    And I fully agree with your comments, although I don't think mem disambiguation has a huge impact on the "usefulness" of SMT. After all, there are lots of reasons why the ample execution resources are not fully used: branches, L2-cache misses etc.
  • IntelUser2000 - Tuesday, March 31, 2009

    Not only that, the Pentium 4 had the Replay feature to try to make up for its very long pipeline. When Replay went wrong, it would consume resources and hinder the second thread.

    Core uarch has no such weaknesses.
  • SilentSin - Monday, March 30, 2009

    Wow...that's just ridiculous how much improvement was made, gg Intel. Can't wait to see how the 8-core EX's do, if this launch is any indication that will change the server landscape overnight.

    However, one thing I would like to see compared, or slightly modified, is the power consumption figures. Instead of an average amount of power used at idle or load, how about a total consumption figure over the length of a fixed benchmark (i.e., how much power was used while running SPECint). I think that would be a good metric to illustrate very plainly how much power is saved from the greater performance with a given load. I saw the chart in the power/performance improvement on the Bottom Line page, but it's not quite as digestible or as easy to compare as a straight kW per benchmark figure would be. Perhaps give it the same time range as the slowest competing part completes the benchmark in. This would give you the ability to make a conclusion like "In the same amount of time the Opteron 8384 used to complete this benchmark, the 5570 used x watts less, and spent x seconds in idle". Since servers are rarely at 100% load at all times, it would be nice to see how much faster it is and how much power it is using once it does get something to chew on.

    Anyway, as usual that was an extremely well done write up, covered mostly everything I wanted to see.
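
The fixed-window metric proposed in this comment can be sketched in a few lines of Python. This is only an illustration: the function name `energy_wh` and every wattage and runtime below are hypothetical placeholders, not results from the review. The idea is that each server runs at load power until it finishes the benchmark and then idles for the remainder of the window, which is set to the runtime of the slowest system.

```python
# Sketch of a "total energy over a fixed time window" metric.
# All numbers in the example are hypothetical, not measurements.


def energy_wh(load_watts, idle_watts, busy_seconds, window_seconds):
    """Energy in watt-hours used over a fixed window.

    The server draws load_watts while running the benchmark
    (busy_seconds) and idle_watts for the rest of the window.
    """
    if busy_seconds > window_seconds:
        raise ValueError("benchmark must finish within the window")
    idle_seconds = window_seconds - busy_seconds
    joules = load_watts * busy_seconds + idle_watts * idle_seconds
    return joules / 3600.0  # 1 Wh = 3600 J


if __name__ == "__main__":
    window = 600  # seconds; runtime of the slowest (hypothetical) server

    # Hypothetical slower server: busy for the whole window.
    slow = energy_wh(load_watts=300, idle_watts=150,
                     busy_seconds=600, window_seconds=window)
    # Hypothetical faster server: higher load power, finishes in half the time.
    fast = energy_wh(load_watts=320, idle_watts=140,
                     busy_seconds=300, window_seconds=window)

    print(f"slow server: {slow:.1f} Wh")
    print(f"fast server: {fast:.1f} Wh, idle for {window - 300} s")
```

With these placeholder values the faster server draws more power under load yet uses less energy over the window, which is the effect the metric is meant to expose.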
  • 7Enigma - Wednesday, April 01, 2009

    I think that is a very good method for determining total power consumption. Obviously this doesn't show CPU power consumption, but more importantly the overall consumption for a given unit of work.

    Nice thinking.
  • JohanAnandtech - Wednesday, April 01, 2009

    I am trying hard, but I do not see the difference with our power numbers. This is the average power consumption of one CPU during 10 minutes of DVD-store OLTP activity. As readers have the performance numbers, you can perfectly well calculate performance/watt or per kWh. Per server would be even better (instead of per CPU), but our servers were too different.

    Or am I missing something?
