64-bit Linux HPC Performance: LINPACK

There is one kind of code where Core really ate the AMD CPUs for breakfast. It was close to embarrassing: floating-point-intensive code that makes heavy use of vector SIMD, also called packed SSE (and SSE2/SSE3), runs up to twice as fast on a Xeon 5160 (3GHz) as on an Opteron 2222 (3GHz). This is also one of the reasons (though probably not the main one) why AMD was falling a bit behind in the gaming area.

AMD has really gone a long way to improve the performance of 128-bit packed SSE instructions:
  • Instruction fetch has been doubled to 32 bytes
  • 128-bit SSE computations now decode into a single micro-op (two in K8)
  • The load unit can load two 128-bit values from the L1 cache each cycle
  • FP reservation stations still have 36 entries, but they're now 128 bits wide instead of 64 bits
  • All three FPU execution units were widened to 128 bits (64 bits before)
  • The L2 cache has double the bandwidth to cope with this
Together with the excellent memory subsystem, Barcelona should be ready to take on the Intel Core architecture when it comes to pure SIMD/SSE power.
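To put the widening in perspective, here is a rough back-of-the-envelope sketch of theoretical peak double-precision throughput. It assumes that one packed add and one packed multiply can issue per core per cycle (`fp_pipes=2`), and it uses an illustrative 2.0GHz quad-core configuration; neither figure comes from the list above, and the function name is hypothetical.

```python
def peak_gflops(cores, clock_ghz, simd_bits, fp_pipes=2, element_bits=64):
    """Theoretical peak GFLOPS = cores * clock * pipes * SIMD lanes per pipe.

    fp_pipes=2 is an assumption (one add pipe plus one multiply pipe
    issuing every cycle); it is not stated in the article.
    """
    lanes = simd_bits // element_bits  # doubles processed per operation
    return cores * clock_ghz * fp_pipes * lanes


# Illustrative quad-core part at 2.0GHz with 64-bit vs. 128-bit wide FP units.
narrow = peak_gflops(cores=4, clock_ghz=2.0, simd_bits=64)
wide = peak_gflops(cores=4, clock_ghz=2.0, simd_bits=128)
print(f"64-bit units:  {narrow:.1f} GFLOPS peak")
print(f"128-bit units: {wide:.1f} GFLOPS peak")
print(f"ratio: {wide / narrow:.1f}x")
```

Under these assumptions, doubling the execution width doubles the peak. Measured scores land below the peak, since real code cannot keep every pipe busy on every cycle.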

Meet LINPACK, a benchmark application based on the LINPACK TPP code, which has become the industry standard benchmark for HPC. It solves large systems of linear equations by using a high-performance matrix kernel. We used Intel's version of LINPACK, which uses the highly optimized Intel Math Kernel Library. The Intel MKL is quite popular, and in an Intel-dominated world, AMD's CPUs have to be able to run Intel-optimized code well.

We used a workload of square matrices of sizes 5000 to 30000 in steps of 5000, and we ran four threads (dual dual-core) or eight threads (dual quad-core). As the system was equipped with 8GB of RAM, even the largest matrices fit entirely in memory. LINPACK results are expressed in GFLOPS (billions of floating-point operations per second). We'll start with the quad-core scores (one quad or two duals).
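As a quick sanity check on the workload, the sketch below estimates the memory footprint of each matrix and the number of floating-point operations behind a GFLOPS score. It assumes 8-byte double-precision elements and the conventional LINPACK operation count of 2/3·n³ + 2·n²; the function names are illustrative and the timing in the usage comment is a placeholder, not a measured result.

```python
def matrix_bytes(n, element_bytes=8):
    """Bytes needed to store one dense n x n matrix of doubles (assumed)."""
    return n * n * element_bytes


def linpack_flops(n):
    """Conventional LINPACK operation count for solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n, seconds):
    """Convert a solve time in seconds into a GFLOPS score."""
    return linpack_flops(n) / seconds / 1e9


sizes = list(range(5000, 30001, 5000))  # 5000, 10000, ..., 30000
for n in sizes:
    gb = matrix_bytes(n) / 1e9
    print(f"n={n:5d}  matrix={gb:5.2f} GB  work={linpack_flops(n):.3e} flops")

# Usage: gflops(30000, seconds=t) for a measured solve time t.
```

The largest matrix (n = 30000) takes 7.2 billion bytes, which is why it still fits in 8GB of RAM, though without much room to spare.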


Yes, this code is very Intel friendly but it does exist in the real world, and it is remarkably interesting. Look at what Barcelona is doing: it is outperforming a 60% higher clocked Opteron 2224 SE. That means that clock for clock, the third generation Opteron is no less than 142% faster. That is a massive improvement!

Thanks to meticulous tuning for Intel's cores, the Xeon is still winning the benchmark. A 17% higher clocked Xeon 5345 is about 25-26% faster than Barcelona, but the days when this kind of code resulted in embarrassing defeats for AMD are over. We are very curious how a LINPACK compiled with AMD's math kernel libraries and other compilers would do, but the late arrival didn't allow us to do much recompiling.
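The clock-for-clock comparisons above are simple ratio arithmetic, sketched below using only the percentages quoted in the text. The function names are illustrative, and the "implied" absolute lead over the Opteron 2224 SE is derived from the quoted figures rather than read from the graph.

```python
def per_clock_ratio(perf_ratio, clock_ratio):
    """Performance ratio of A over B, normalized by A's clock advantage."""
    return perf_ratio / clock_ratio


def absolute_ratio(per_clock, clock_ratio):
    """Inverse of the above: the raw performance ratio of A over B."""
    return per_clock * clock_ratio


# Xeon 5345 vs. Barcelona: 25-26% faster at a 17% higher clock.
xeon_low = per_clock_ratio(1.25, 1.17)
xeon_high = per_clock_ratio(1.26, 1.17)
print(f"Xeon per-clock advantage: {xeon_low:.3f}x to {xeon_high:.3f}x")

# Barcelona vs. Opteron 2224 SE: 142% faster per clock (2.42x), while the
# Opteron is clocked 60% higher, so Barcelona's clock ratio is 1 / 1.60.
implied = absolute_ratio(2.42, 1 / 1.60)
print(f"Implied absolute Barcelona lead: {implied:.3f}x")
```

Normalized for clock speed, the Xeon's lead shrinks to roughly 7-8%, while the quoted 142% per-clock gain implies that Barcelona is about 51% faster than the Opteron 2224 SE in absolute terms.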

Now let's take a look at the eight thread results. We kept the Xeon 5160 (four threads) in this graph, so you can easily compare the results with the previous graph.


Normally you would expect that this kind of code with huge matrices has to access the memory a lot, but masterly optimization together with hardware prefetching ensures most of the data is already in the cache. The quad-core Xeon wins again, but the victory is a bit smaller: the advantage is 20%-23%. Let us see if Intel can still keep the lead when we look at a benchmark which is very SSE intensive and which is optimized for Intel CPUs, but this time it's developed by a third party.

46 Comments

  • kalyanakrishna - Tuesday, September 11, 2007

    I don't deny people use MKL... I don't agree that anyone targeting performance on AMD Opteron will use MKL. No one running HPL/Linpack for a Top 500 submission would use MKL on Opteron. No one who wishes to test his Opteron for performance would use MKL to do so. No one wishing to have the fastest possible results from his Opteron will do so.

    Even ISVs now provide code that is optimized for Xeon and Opteron separately.
  • JohanAnandtech - Tuesday, September 11, 2007

    Ok, point taken. Give us some time, and we'll follow up with new compilations of Linpack.
  • kalyanakrishna - Wednesday, September 12, 2007

    Thank you. Appreciate the effort.
  • leexgx - Monday, September 10, 2007

    And how often do you read AnandTech's previews and reviews?

    Unlike when Intel's Core 2 came out, when all the hype was real. Too bad for AMD this time.

    This CPU is going to be good; the problem is whether it will be able to compete with Intel's new CPU when it comes out.

    I'm still using an AMD system, if you're wondering, and so are all the rest of my PCs apart from my server, as I just threw in an old P4 mobo just for in-house file sharing (all second-hand parts apart from the HDDs).
  • phaxmohdem - Monday, September 10, 2007

    I wonder if it would be feasible for AMD to take the Intel approach, and slap two of their new native quad cores together and release an octal-core CPU in the near future. Or would they remain the multi-core purists they have become... Similarly, I wonder if two 65nm Barcelona cores could even fit under that heat spreader... or come in under an acceptable thermal envelope.
  • Accord99 - Monday, September 10, 2007

    It won't fit on Socket F:

    http://www.madboxpc.com/news/am2/AMD_barcelona.jpg
  • fic2 - Monday, September 10, 2007

    Page 8, 3DS Max 9 last paragraph:
    "Dual 3GHz Opteron 2222 is capable of generating about 29 frames per hour", but then
    "potential 3GHz Barcelona will be able to spit out ~35 frames per second". I think that is supposed to be ~35 frames per hour. Otherwise that is an extremely impressive speedup!
  • JohanAnandtech - Monday, September 10, 2007

    No, it is "per second". We used an octal-core 2THz Barcelona there.


    ... Thanks, fixed that one :-)
  • phaxmohdem - Monday, September 10, 2007

    Got SuperPi times for that beast? ;)
  • Roy2001 - Monday, September 10, 2007

    Kentsfield has 2*143mm^2 dies. Barcelona is 280+ mm^2. Penryn would be even smaller, 2*100 mm^2. So unless AMD can increase the frequency to 3.0+GHz soon and price its new quad-core processors higher than Intel's, AMD will still be in the red unless it outsources Athlon 64 to TSMC.
