64-bit Linux HPC Performance: LINPACK

There is one kind of code where Core really ate the AMD CPUs for breakfast, and it was close to embarrassing: floating point intensive code that makes heavy use of vector SIMD, also called packed SSE (and SSE2/SSE3), runs up to two times as fast on a 3GHz Xeon 5160 as on a 3GHz Opteron 2222. This is also one of the reasons (though probably not the main one) why AMD was falling a bit behind in the gaming area.

AMD has gone to great lengths to improve the performance of 128-bit packed SSE instructions:
  • Instruction fetch has been doubled to 32 bytes
  • 128-bit SSE computations now decode into a single micro-op (two in K8)
  • The load unit can load two 128-bit numbers from the L1 cache each cycle
  • FP reservation stations are still 36 entries, but they are now 128 bits wide instead of 64 bits
  • All three FPU execution units have been widened to 128 bits (they were 64 bits wide before)
  • The L2 cache has double the bandwidth to cope with this
Together with the excellent memory subsystem, Barcelona should be ready to take on the Intel Core architecture when it comes to pure SIMD/SSE power.
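
To make the 128-bit point concrete, below is a minimal sketch (our own illustration, not code from any benchmark) of the kind of packed SSE2 loop that benefits from these changes: every _mm_mul_pd and _mm_add_pd works on two doubles at once, and where K8 had to crack each of those instructions into two 64-bit micro-ops, Barcelona, like Core, executes them as a single 128-bit operation.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* y[i] += a * x[i], two doubles per iteration via 128-bit packed SSE2.
     * On K8 each packed instruction was split into two 64-bit micro-ops;
     * on Barcelona (and Core) it executes as one 128-bit operation. */
    void daxpy_sse2(size_t n, double a, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(a);            /* broadcast a into both lanes */
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            __m128d vx = _mm_loadu_pd(x + i);   /* load two doubles at once */
            __m128d vy = _mm_loadu_pd(y + i);
            vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
            _mm_storeu_pd(y + i, vy);
        }
        for (; i < n; i++)                      /* scalar tail */
            y[i] += a * x[i];
    }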

Meet LINPACK, a benchmark based on the LINPACK TPP code, which has become the industry standard for HPC. It solves large systems of linear equations using a high performance matrix kernel. We used Intel's version of LINPACK, which relies on the highly optimized Intel Math Kernel Library. The Intel MKL is quite popular, and in an Intel-dominated world AMD's CPUs have to be able to run Intel-optimized code well.
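
Under the hood the benchmark boils down to a dense LU solve of Ax = b. Below is a rough sketch of what that amounts to in C against MKL or any other LAPACK implementation (our own illustration, not Intel's actual harness; the dgesv_ symbol assumes the usual Fortran calling convention on 64-bit Linux, and you would link against MKL or e.g. -llapack):

    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran-style LAPACK prototype; MKL and any other LAPACK export this
     * symbol on Linux as dgesv_. Arguments are passed by reference,
     * matrices are column-major. */
    extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                       int *ipiv, double *b, const int *ldb, int *info);

    int main(void)
    {
        int n = 5000, nrhs = 1, info = 0;       /* smallest matrix size we used */
        double *a = malloc((size_t)n * n * sizeof *a);  /* n x n matrix */
        double *b = malloc((size_t)n * sizeof *b);      /* right-hand side */
        int *ipiv = malloc((size_t)n * sizeof *ipiv);   /* pivot indices */

        for (size_t i = 0; i < (size_t)n * n; i++)
            a[i] = rand() / (double)RAND_MAX;   /* random, almost surely non-singular */
        for (int i = 0; i < n; i++)
            b[i] = 1.0;

        /* LU factorization with partial pivoting + solve: the heart of LINPACK */
        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        printf("dgesv returned info = %d\n", info);

        free(a); free(b); free(ipiv);
        return info;
    }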

We used a workload of square matrices of sizes 5000 to 30000 in steps of 5000, and we ran four threads (dual dual-core) or eight threads (dual quad-core). As the system was equipped with 8GB of RAM, even the largest matrices ran entirely in memory. LINPACK performance is expressed in GFLOPS (billions of floating point operations per second). We'll start with the quad-core scores (one quad-core or two dual-core CPUs).
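
For reference, the GFLOPS number follows from the standard LINPACK operation count for an LU solve, roughly (2/3)n^3 + 2n^2 floating point operations for an n x n matrix, divided by the wall-clock time. A quick sketch of that arithmetic (the 500-second timing below is a made-up value purely for illustration):

    #include <stdio.h>

    /* Standard LINPACK flop count for solving an n x n system via LU:
     * roughly (2/3)n^3 + 2n^2 operations. */
    static double linpack_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flops / seconds / 1e9;
    }

    int main(void)
    {
        /* Hypothetical example: a 30000 x 30000 solve needs ~1.8e13 flops,
         * so finishing it in 500 s would correspond to roughly 36 GFLOPS. */
        printf("%.1f GFLOPS\n", linpack_gflops(30000.0, 500.0));
        return 0;
    }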


Yes, this code is very Intel friendly, but it does exist in the real world, and it is remarkably interesting. Look at what Barcelona is doing: it is outperforming an Opteron 2224 SE that is clocked 60% higher. That means that, clock for clock, the third generation Opteron is no less than 142% faster. That is a massive improvement!
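
The clock-for-clock comparison is straightforward arithmetic: divide each score by its clock speed and take the ratio. A small sketch with hypothetical GFLOPS values (the real numbers are in the graph above) chosen to reproduce the 142% figure:

    #include <stdio.h>

    /* Per-clock comparison: how much faster is chip A per GHz than chip B?
     * Returns the advantage as a percentage. */
    static double per_clock_advantage(double gflops_a, double ghz_a,
                                      double gflops_b, double ghz_b)
    {
        return ((gflops_a / ghz_a) / (gflops_b / ghz_b) - 1.0) * 100.0;
    }

    int main(void)
    {
        /* Hypothetical scores for illustration only: a 2.0 GHz chip scoring
         * 15.1 GFLOPS versus a 3.2 GHz chip scoring 10.0 GFLOPS is about
         * 51% faster outright and about 142% faster per clock. */
        printf("%.0f%% faster per clock\n",
               per_clock_advantage(15.1, 2.0, 10.0, 3.2));
        return 0;
    }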

Thanks to meticulous tuning for Intel's cores, the Xeon still wins the benchmark. A 17% higher clocked Xeon 5345 is about 25-26% faster than Barcelona, but the days when this kind of code resulted in embarrassing defeats for AMD are over. We are very curious how LINPACK compiled against AMD's own math libraries and other compilers would do, but Barcelona's late arrival didn't allow us to do much recompiling.

Now let's take a look at the eight-thread results. We kept the Xeon 5160 (four threads) in this graph so you can easily compare the results with the previous graph.


Normally you would expect this kind of code, with its huge matrices, to access memory a lot, but masterful optimization combined with hardware prefetching ensures most of the data is already in the cache. The quad-core Xeon wins again, but the victory is a bit smaller: the advantage is 20-23%. Let us see if Intel can keep the lead in a benchmark which is also very SSE intensive and optimized for Intel CPUs, but this time developed by a third party.
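
The reason is that a well-tuned LINPACK spends almost all of its time in a blocked matrix-multiply kernel: the matrix is processed in tiles small enough to stay cache-resident, so every value fetched from memory is reused many times and the hardware prefetchers only need to stream in the next tile. A minimal sketch of that blocking idea (nowhere near MKL's real kernel, which adds SSE, register tiling and prefetch hints):

    #include <stddef.h>

    #define BLOCK 64  /* tile edge chosen so the working tiles stay cache-resident */

    /* Naive blocked matrix multiply, C += A * B, all n x n and row-major.
     * Working on BLOCK x BLOCK tiles keeps the operands in cache, which is
     * why LINPACK ends up compute-bound rather than memory-bound. */
    void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }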

Comments

  • Phynaz - Monday, September 10, 2007 - link

    Isn't this intentionally crippling the system?
  • JohanAnandtech - Monday, September 10, 2007 - link

    No. Just check what Intel and other companies do when they submit SPECjbb scores, for example. With HW prefetch on, you get about 10% lower scores.
  • nj2112 - Tuesday, September 11, 2007 - link

    Was HW prefetching off for all tests?
  • lplatypus - Monday, September 10, 2007 - link

    I thought that 2x00 series CPUs only supported one coherent hypertransport link, so would this mean that the "Dual Link" feature involving two HT links would require 8300 series CPUs?
  • mino - Tuesday, September 11, 2007 - link

    Well, maybe they changed that and all links are active (to enable setups like this) and the CPU just refuses to communicate more than one coherent hop away..
  • MDme - Monday, September 10, 2007 - link

    Let the games begin!
  • Viditor - Thursday, September 13, 2007 - link

    Are you going to be re-doing the review with the shipping version (stepping BA) anytime soon?
    I'm most curious to see if the claims of a 5%+ improvement are true...
  • MDme - Monday, September 10, 2007 - link

    I think Barcelona will be a success in the server world. Its performance is around 20% faster than equivalently clocked Xeons, with the exception of certain programs like Fritz and the Intel LINPACK library where it is around 5-10% slower. But since it scales better than the Xeon chips, it should negate that and increase its lead as cores/sockets increase. Add to that its power efficiency tweaks and aggressive pricing, and AMD will be able to hold off Intel in the server world.....maybe.

    With 2.5GHz Barceys coming up, that would be equivalent to around 3GHz+ Xeons. So AMD was right that they need to get to 2.6GHz.... AMD needs to ramp up clock speeds to get the highest-end performance crown, but for now, their offering provides a nice balance of performance and power efficiency for the price.

    Now time for the Phenom to get its act together.
  • TA152H - Monday, September 10, 2007 - link

    The article should have mentioned the performance penalty Intel chips suffer with regards to FB-DIMMs. While it's true they should be benchmarked in servers with this memory, it's also widely rumored that they are going to be offering choices in the near future. This memory has a really big impact on a lot of benchmarks, so when looking towards the future, or the desktop, it's important to keep in mind the possibility of Intel using different memory. I don't think even Intel is stubborn enough to stick with this seriously slow and power hungry memory. Maybe as a choice it's fine, but it must be clear to them that offering something else as well as FB-DIMMs is very desirable in the server space. Then again, look at how long they stuck with Rambus.
