Nehalem EX Confusion

One of the reasons the Xeon X7560 did not show its full potential at launch was a small firmware error in our Dell R810 testing platform, which caused the memory subsystem to underperform. As a result, some bandwidth-sensitive benchmarks, including many HPC applications, were not performing optimally. Intel claims that a dual-CPU configuration should be able to reach 39GB/s, and a quad-CPU configuration up to 70GB/s. We could not reach those Stream numbers because we test with our somewhat older Stream binary, as described here. Using the same Stream binary as before allows us to compare our findings with all of our previous measurements.
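
For readers unfamiliar with Stream, the Triad kernel at the heart of these numbers is tiny. Below is a minimal, illustrative C sketch of it, not the exact binary we use for testing (our binary fixes array sizes, trial counts, and compiler settings so results stay comparable across reviews; the sizes and timing here are simplified):

/* Minimal sketch of the Stream Triad kernel: a[i] = b[i] + q*c[i].
 * Illustrative only. The real benchmark averages several trials and
 * requires arrays far larger than the last-level cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L                    /* ~160MB per array: well beyond cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    const double q = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    #pragma omp parallel for           /* compile with -fopenmp to use all cores */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];        /* the Triad kernel itself */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Triad moves three 8-byte doubles per element (2 reads, 1 write),
     * which is why it is so sensitive to the memory subsystem. */
    printf("Triad: %.2f GB/s (sanity check: a[1] = %.1f)\n",
           3.0 * 8.0 * (double)N / sec / 1e9, a[1]);

    free(a); free(b); free(c);
    return 0;
}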

We reran our Stream benchmarks on the new QSSC-S4R server system.

[Chart: Stream TRIAD on 64-bit Linux, maximum threads. (*) New measurements.]

The new results tell us that the available memory bandwidth is about 21% higher (29GB/s) than what we previously measured on the Dell R810 (24GB/s). That means many of the benchmarks published at the launch of the Xeon 7500 and run on the Dell R810 were too low, especially the HPC ones. The Xeon X7560 will not beat the quad Opteron 6174 when it comes to raw bandwidth, but it is far from a bandwidth-starved platform.


51 Comments


  • fynamo - Wednesday, August 11, 2010 - link

    WHERE ARE THE POWER CONSUMPTION CHARTS??????

    Awesome article, but complete FAIL because of the lack of power consumption charts. This is only half the picture -- and I dare say it's the less important half.
  • davegraham - Wednesday, August 11, 2010 - link

    +1 on this.
  • JohanAnandtech - Thursday, August 12, 2010 - link

    Agreed. But we only got a comparable system a few days before I was going to post this article, so I kept the power consumption numbers for the next article.
  • watersb - Wednesday, August 11, 2010 - link

    Wow, you IT Guys are a cranky bunch! :-)

    I am impressed with the vApus client-simulation testing, and I'm humbled by the complexity of enterprise-server testing.

    A former sysadmin, I've been an ignorant programmer for lo these past 10 years. Reading all these comments makes me feel like I'm hanging out on the bench in front of the general store.

    Yeah, I'm getting off your lawn now...
  • Scy7ale - Wednesday, August 11, 2010 - link

    Does this also apply to consumer HDDs? If so, is it a bad idea to have an intake fan in front of the drives to cool them, as many consumer/gaming cases have now?
  • JohanAnandtech - Thursday, August 12, 2010 - link

    Cold air comes from the bottom of the server aisle, sometimes as low as 20°C (68F), and gets blown at high speed over the disks. Several studies now show that this is not optimal for an HDD. In your desktop, the temperature of the air blown over the HDD should be higher, as the fans are normally slower. But yes, it is not good to keep your hard disk at temperatures lower than 30°C. Use HDD Sentinel or SpeedFan to check on this; 30-45°C is acceptable. (For Linux users, see the smartctl sketch after this comment thread.)
  • Scy7ale - Monday, August 16, 2010 - link

    Good to know, thanks! I don't think this is widely understood.
  • brenozan - Thursday, August 12, 2010 - link

    http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz
    Like the T1, the T2 supports the Hyper-Privileged execution mode. The SPARC Hypervisor runs in this mode and can partition a T2 system into 64 Logical Domains, and a two-way SMP T2 Plus system into 128 Logical Domains, each of which can run an independent operating system instance.

    Why did SUN not dominate the world in 2007 when it launched the T2? Besides the two 10G Ethernet interfaces built into the processor, they had the most advanced architecture that I know of; see
    http://www.opensparc.net/opensparc-t2/download.htm...
  • don_k - Thursday, August 12, 2010 - link

    "why SUN did not dominate the world in 2007 when it launched the T2?"

    Because it's not actually that good :) My company bought a few T2s, and after about a week of benchmarking and testing it was obvious that they are very, very slow. Sure, you get lots and lots of threads, but each of those threads is oh so very slow. You would not _want_ to run 128 instances of Solaris, one on each thread, because each of those instances would be virtually unusable.

    We used them as web servers... good for that. Or as file servers where you don't need to do any CPU-intensive work.

    The theory is fine and all, but you obviously have never used a T2, or you would not be wondering why it failed.
  • JohanAnandtech - Thursday, August 12, 2010 - link

    "http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz"

    You are multiplying threads by clock speed. IIRC, the T2 is a fine-grained multithreaded CPU where 8 (!!) threads share the two pipelines of *one* core.

    Compare that with the Nehalem core, where 2 threads share 4 "pipelines" (sustained decode/issue/execute/retire) per cycle. So basically, a dual-socket T2 is nothing more than 16 relatively weak cores which can execute 2 instructions per clock cycle at most, or 32 instructions per cycle in total. The only advantage of having 8 threads per core is that (with enough independent software threads) the T2 is able to come relatively close to that kind of throughput.

    A dual six-core Xeon has a maximum throughput of 12 cores x 4 instructions, or 48 instructions per cycle. As the Xeon has only 2 threads per core, it is less likely that the CPU will ever come close to that kind of output (in business apps). On the other hand, it performs excellently when you have some number of dependent threads, or simply not enough threads in parallel. The T2 will only perform well if you have enough independent threads. (The peak-throughput arithmetic is spelled out in the second sketch after this comment thread.)
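
Following up on the disk-temperature exchange above: here is a hedged C sketch of checking a drive's temperature on Linux by parsing smartctl output (from the smartmontools package). The device path /dev/sda and SMART attribute 194 (Temperature_Celsius) are assumptions, not universal: adjust the path for your system, and note that some drives report attribute 190 instead. It needs root; HDD Sentinel and SpeedFan remain the Windows-friendly route.

/* Sketch: report HDD temperature by parsing `smartctl -A` output.
 * Assumes smartmontools is installed and the drive exposes SMART
 * attribute 194 (Temperature_Celsius); run as root. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("smartctl -A /dev/sda", "r");   /* device path is an assumption */
    if (!p) { perror("popen"); return 1; }

    /* SMART attribute rows have 10 columns; the raw value is the last one */
    char line[512], f[10][64];
    while (fgets(line, sizeof line, p)) {
        if (sscanf(line, "%63s %63s %63s %63s %63s %63s %63s %63s %63s %63s",
                   f[0], f[1], f[2], f[3], f[4],
                   f[5], f[6], f[7], f[8], f[9]) == 10 &&
            strcmp(f[1], "Temperature_Celsius") == 0)
            printf("Drive temperature: %s C (30-45C is the band suggested above)\n",
                   f[9]);
    }
    pclose(p);
    return 0;
}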
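
And the peak-throughput arithmetic from the T2 discussion, spelled out in a trivial C sketch. These are theoretical ceilings (cores times sustained issue width), not sustained rates; real workloads land well below them, which is exactly the gap the T2's 8-way fine-grained multithreading tries to fill.

/* Back-of-the-envelope peak instruction throughput, per the comment above.
 * "Width" = maximum sustained instructions per clock per core. */
#include <stdio.h>

int main(void)
{
    int t2_cores = 2 * 8, t2_width = 2;   /* dual-socket T2: 16 cores, 2-wide  */
    int xe_cores = 2 * 6, xe_width = 4;   /* dual six-core Xeon: 12 cores, 4-wide */

    printf("T2 (2S) peak:   %d instructions/cycle\n", t2_cores * t2_width); /* 32 */
    printf("Xeon (2S) peak: %d instructions/cycle\n", xe_cores * xe_width); /* 48 */
    return 0;
}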
