A "beta BIOS update" broke compatibility with ESX, so we had to postpone our virtualization testing on our quad-CPU AMD 8384 system.
 
So we started an in-depth comparison of the 45 nm Opterons, Xeons and Core i7 CPUs. One of our benchmarks, the famous LINPACK (you can read all about it here), painted a pretty interesting performance picture. We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR-3 on the Core i7 platform. That should not be a huge problem, as we tested with only one CPU. We normally need about 4 GB for each quad-core CPU to reach the best performance.
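
The memory figure above follows from the size of the dense coefficient matrix. Here is a minimal sketch, assuming the matrix is stored as N x N double-precision (8-byte) values and ignoring the benchmark's smaller work buffers; the function name is illustrative and not part of LINPACK.

```python
def linpack_matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Bytes needed to hold a dense n x n matrix of double-precision values."""
    return n * n * bytes_per_element


n = 18000
size = linpack_matrix_bytes(n)
# Print the footprint in both decimal gigabytes and binary gibibytes.
print(f"N={n}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

Depending on whether you count in decimal or binary units, this lands a little above or below the quoted 2.5 GB, which is why 3 GB of RAM is just enough for N = 18000.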
 
We also used the 9.1 version of Intel's LINPACK, as we wanted the same binary on both platforms. As we have shown before, this version of LINPACK performs best on both AMD and Intel platforms when the matrix size is low. Unfortunately, the current 10.1 version does not work on AMD CPUs.
 
We don't pretend that the comparison is completely fair: the Nehalem platform uses unbuffered RAM, which has slightly lower latency and higher bandwidth than what the Xeon "Nehalem" will get. But we had to satisfy our curiosity: how does the new "Shanghai" core compare to "Nehalem"?
 
[Chart: LINPACK]

Quite interesting, don't you think? Hyperthreading (SMT) gives the Nehalem core a significant advantage in most multi-threaded applications, but not in LINPACK: it slows the CPU down by 10%. Might we have found the first multi-threaded application that is slowed down by Hyperthreading on Nehalem? That should not spoil the fun for Intel though, as many other HPC benchmarks show a larger gap. AMD has the advantage of being first to market: Nehalem-based Xeons are still a few months away.
 
Also, the impact of the memory subsystem is limited, as a 50% increase in memory speed results in a meager 6% performance increase. The Math Kernel Libraries are so well optimized that the effect of memory speed is minimized. This is in great contrast to other HPC applications, where the triple-channel DDR-3 memory system of Nehalem really pays off. More later...
 
 
60 Comments

  • Darkness Flame - Friday, November 28, 2008 - link

    Wait a sec, I thought the only Xeon Nehalem cores that are supposed to use Fully Buffered RAM are the 8 core Beckton processors. Weren't the Gainestown processors supposed to use triple channel DDR3, and support 2 socket systems? (Hence the 2 QPI links). I would figure only the Beckton cores would scale to 4 socket systems, as they have 4 QPI links.

    Regardless, though; I would definitely like to see more comparisons between Nehalem and Shanghai; especially in the database benchmarks.

    Also, like duploxxx said, I don't think we'll see a real comparison between the two, in bandwidth at least, until AMD moves to HT3 and DDR3.
  • BlueBlazer - Friday, November 28, 2008 - link

    Can you tell us how many processors or cores are used on the Opteron system?

    Is that "8384" a typo? Or should it be "2384"?
  • JohanAnandtech - Friday, November 28, 2008 - link

    No, that is an 8384 CPU. But we use only one.
  • BlueBlazer - Saturday, November 29, 2008 - link

    Thanks. What speed is the DDR3 on the Core i7 machine?
  • JohanAnandtech - Saturday, November 29, 2008 - link

    1066 MHz DDR-3 7-7-7
  • BlueBlazer - Saturday, November 29, 2008 - link

    Would you retry those tests on DDR3-1333 and DDR3-1600? I'd like to see how memory bandwidth affects these tests.

    Thanks.
  • swhibble - Friday, November 28, 2008 - link

    Shanghai has been out for the best part of... what... 2 weeks now? And all you've managed to come up with is some database testing and a one page comparison to Nehalem.

    COME ON ANAND!! Nehalem got a full review as soon as it came out, why is it taking so long to do a full review of Shanghai?
  • Vinvin - Saturday, November 29, 2008 - link

    I'd like to see comparisons with Dunnington too (6, 12 and 24 cores ...)
  • joshuamora - Saturday, November 29, 2008 - link

    Hi.

    Here you can see some 4 core runs on 2384 with DDR2-800 using only 4 cores within 1 socket.

    For N=18000 I get 35.49 GFLOPs, which is an efficiency of 82.1%. That is a bit low, but much better than the 32 GFLOPs at 75% efficiency reported by Anandtech.
    For a larger N that is a multiple of NB you can achieve better efficiency:
    For N=28224 I get 36.47 GFLOPs, which is an efficiency of 84.4%.

    8-core runs on 2 sockets are within the same levels of efficiency (~84.5%).
    For all these runs I used ACML 4.2 (single threaded), the PGI 7.2-4 compiler and hpmpi2.2.7, binding MPI processes only to cores of the first socket.
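
The efficiency percentages quoted here are consistent with measured GFLOPs divided by theoretical peak. The following is a rough sketch under two assumptions that the comment does not state: that the Opteron 2384 runs at 2.7 GHz and that each core can retire 4 double-precision flops per cycle. The function names are illustrative only. The sketch also checks which matrix sizes are exact multiples of the blocking factor NB = 168.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int = 4) -> float:
    """Theoretical peak in GFLOPs (clock and flops/cycle are assumptions)."""
    return cores * ghz * flops_per_cycle


def efficiency(measured_gflops: float, peak: float) -> float:
    """Fraction of theoretical peak that a run achieved."""
    return measured_gflops / peak


NB = 168  # blocking factor used in the posted runs
peak = peak_gflops(cores=4, ghz=2.7)  # assumed: 4 cores at 2.7 GHz

# (N, measured GFLOPs) pairs taken from the comment above.
for n, gflops in [(18000, 35.49), (28224, 36.47)]:
    is_multiple = n % NB == 0
    print(f"N={n}: multiple of NB={is_multiple}, "
          f"efficiency={efficiency(gflops, peak):.1%}")
```

Under these assumptions N = 18000 is not a multiple of 168 while N = 28224 is, which matches the point made about choosing N as a multiple of NB.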

    I don't see the reason for comparing a 1 socket system against 2 socket system.
    I don't see the reason for using DDR2-533 on Shanghai.

    Bottom line, the AMD runs reported by Anandtech are low in terms of efficiency because they do not use the appropriate library and blocking factor. I do not understand the comparison of these two very different systems offering very different features at very different prices. It would also not make sense to use 2 of the Intel systems to compete against 1 AMD system because of the big difference in pricing, provided they had similar performance.

    Below I provide the logs of the runs.

    Best regards,
    Joshua Mora.

    /opt/Benchmarks/hpl-2.0/bin/AMD_ACML_HPMPI # more 4core.log
    ================================================================================
    HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
    Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================

    An explanation of the input/output parameters follows:
    T/V : Wall time / encoded variant.
    N : The order of the coefficient matrix A.
    NB : The partitioning blocking factor.
    P : The number of process rows.
    Q : The number of process columns.
    Time : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.

    The following parameter values will be used:

    N : 18000 21504 28224
    NB : 168
    PMAP : Row-major process mapping
    P : 2
    Q : 2
    PFACT : Left Crout Right
    NBMIN : 8
    NDIV : 2
    RFACT : Left Crout Right
    BCAST : 1ring
    DEPTH : 1
    SWAP : Mix (threshold = 64)
    L1 : no-transposed form
    U : no-transposed form
    EQUIL : yes
    ALIGN : 8 double precision words

    --------------------------------------------------------------------------------

    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be 1.110223e-16
    - Computational tests pass if scaled residuals are less than 16.0

    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2L8 18000 168 2 2 109.55 3.549e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0038256 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2C8 18000 168 2 2 110.06 3.533e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0045786 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2R8 18000 168 2 2 110.11 3.531e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0045956 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2L8 18000 168 2 2 109.67 3.546e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0049196 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2C8 18000 168 2 2 110.03 3.534e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0044894 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2R8 18000 168 2 2 110.09 3.532e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0043481 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2L8 18000 168 2 2 110.08 3.532e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0042594 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2C8 18000 168 2 2 110.12 3.531e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0043521 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2R8 18000 168 2 2 109.54 3.550e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0045002 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2L8 21504 168 2 2 186.98 3.546e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0038828 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2C8 21504 168 2 2 187.03 3.545e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0047606 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2R8 21504 168 2 2 187.09 3.544e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0037397 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2L8 21504 168 2 2 187.03 3.545e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0038828 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2C8 21504 168 2 2 186.95 3.546e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0047606 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2R8 21504 168 2 2 187.07 3.544e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0036661 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2L8 21504 168 2 2 187.05 3.545e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0038828 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2C8 21504 168 2 2 187.07 3.544e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0037164 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2R8 21504 168 2 2 186.93 3.547e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0036661 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2L8 28224 168 2 2 411.27 3.645e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0032718 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2C8 28224 168 2 2 411.02 3.647e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0032735 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2R8 28224 168 2 2 411.16 3.646e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0031464 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2L8 28224 168 2 2 411.05 3.647e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0032718 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2C8 28224 168 2 2 411.09 3.646e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034905 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2R8 28224 168 2 2 411.06 3.647e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0031464 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2L8 28224 168 2 2 411.06 3.647e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0032718 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2C8 28224 168 2 2 411.16 3.646e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034905 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2R8 28224 168 2 2 410.97 3.647e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0031464 ...... PASSED
    ================================================================================

    Finished 27 tests with the following results:
    27 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------

    End of Tests.
    ================================================================================

    ================================================================================
    HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
    Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================

    An explanation of the input/output parameters follows:
    T/V : Wall time / encoded variant.
    N : The order of the coefficient matrix A.
    NB : The partitioning blocking factor.
    P : The number of process rows.
    Q : The number of process columns.
    Time : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.

    The following parameter values will be used:

    N : 43008
    NB : 168
    PMAP : Row-major process mapping
    P : 2
    Q : 4
    PFACT : Left Crout Right
    NBMIN : 8
    NDIV : 2
    RFACT : Left Crout Right
    BCAST : 1ring
    DEPTH : 1
    SWAP : Mix (threshold = 64)
    L1 : no-transposed form
    U : no-transposed form
    EQUIL : yes
    ALIGN : 8 double precision words

    --------------------------------------------------------------------------------

    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be 1.110223e-16
    - Computational tests pass if scaled residuals are less than 16.0

    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2L8 43008 168 2 4 727.71 7.288e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0029853 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2C8 43008 168 2 4 727.58 7.290e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0029481 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10L2R8 43008 168 2 4 727.62 7.289e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0026779 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2L8 43008 168 2 4 727.19 7.293e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033299 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2C8 43008 168 2 4 727.92 7.286e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0030322 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10C2R8 43008 168 2 4 727.63 7.289e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0030579 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2L8 43008 168 2 4 727.90 7.286e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0030296 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2C8 43008 168 2 4 727.53 7.290e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0031404 ...... PASSED
    ================================================================================
    T/V N NB P Q Time Gflops
    --------------------------------------------------------------------------------
    WR10R2R8 43008 168 2 4 727.24 7.293e+01
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0031886 ...... PASSED
    ================================================================================

    Finished 9 tests with the following results:
    9 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------

    End of Tests.
    ================================================================================
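
As a cross-check on the log above, the Gflops column can be approximately reproduced from N and Time. This is a minimal sketch, assuming the usual LU operation count of 2/3 N^3 + 3/2 N^2 floating-point operations for solving the system; the function name is illustrative. Small differences are expected because the log prints the time rounded to two decimals.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimated GFLOPs for solving an n x n system in the given wall time."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2  # assumed LU operation count
    return flops / seconds / 1e9


# (N, Time) pairs copied from the first row of each matrix size in the log.
for n, seconds in [(18000, 109.55), (21504, 186.98),
                   (28224, 411.27), (43008, 727.71)]:
    print(f"N={n}: {hpl_gflops(n, seconds):.2f} GFLOPs")
```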
  • BlueBlazer - Saturday, November 29, 2008 - link

    How much RAM did you use on those systems?

    The reason for the matrix size is given in the article: "We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR-3 on the Core i7 platform."
