AMD

Despite the enthusiastic and hard working marketing people, AMD's presence at CeBIT was somewhat a disappointment. Other than some vague benchmarks, there was no demo system of AMD's "Native quad core", and no real hard benchmarks or clock speeds. With a bit of help from a few enlightened people, however, we were able to dig up some new info.


It was quite interesting to hear AMD's representatives use the term "K10" again. The most famous benchmark so far is AMD's claim that the K10 is about 42% faster in floating point than the current top chip, the Xeon x5355.


So we decided to delve a little deeper. Let us first look at the dual socket systems, as we try to find more precise SPEC (base) numbers:

AMD vs. Intel Quad Core Performance Overview
Base SPEC Xeon 5160 (3 GHz) Opteron 2220 (2.8 GHz) Xeon vs. Opteron
CPU Int2006 17.5 12.2 43%
CPU Fp2006 17.1 13.1 31%
CPU Int2006 (rate) 53.2 46.1 15%
CPU Fp2006 (rates) 44.1 45.6 -3%
CPU Fp2000 (rates) 81.6 85.4 -4%

First of all, it should be noted that Spec FP2000 rate and Spec FP2006 rate are already running better on the dual core Opteron than on the dual core Xeon "Woodcrest". SpecFP rate is nothing more than several SpecFP benchmarks running completely separate from each other, and it is well known that SpecFP is a bandwidth intensive benchmark. So running several of those benchmarks will only increase the bandwidth needed. In a two socket machine, the Opterons have roughly twice as much bandwidth as with one socket, so SpecFP rate is basically the ideal benchmark to show the benefits of AMD's NUMA platform.

If we compare a dual quad core Xeon (x5355) with a quad socket dual core Opteron, the bandwidth of the AMD platform doubles while the bandwidth of the Intel system stays the same. As we use eight cores in total instead of four, the bandwidth demands of the SpecFP benchmark also double.

AMD vs. Intel Octal Core Performance Overview
Xeon x5355 (2.66 GHz) Opteron 8220 (2.8 GHz) Xeon vs. Opteron K10
CPU Int2006 16 10.5 52% N/a
CPU Fp2006 16.1 12.1 33% N/a
CPU Int2006 (rate) 79.6 86.2 -8% N/a
CPU Fp2006 (rates) 58.9 82.5 -29% N/a
CPU Fp2000 (rates) 103 157 -34% +/- 146

The result is that the dual Xeon x5355 (eight cores) is heavily bottlenecked by a lack of bandwidth and hardly faster than the dual Opteron 2220SE (four cores) in CPU FP2000 rates. If we take a look at the best dual core Xeon 5150 (2.66 GHz) score, it gets a score of 78.2. That means that the quad Xeon 2.66 GHz is only about 32% faster than its dual core brother at the same clock speed, another clear indication that the dual Xeon x5355 scores are seriously limited by memory bandwidth. It is no surprise that the quad socket Opteron 8220 is about 34% faster than the Xeon x5355 (and we are ignoring the probably inflated result of 184 you can get with the Sun Studio Compiler).

This puts AMD's claim that the best "K10" (most likely at 2.3 GHz) will be 42% faster than the Xeon x5355 in Spec FP rate in the right perspective. We reported in our Barcelona architecture article that the AMD K10's Northbridge is set up to handle higher bandwidth than the current AMD chips. As has been shown numerous times, the current Athlon 64 X2/Opteron architecture is not able to use the extra bandwidth that DDR2 gives.

So most of the 42% advantage is probably due to K10's better Northbridge and better use of DDR2. Ron Myers of AMD claimed that the difference is now already greater than 42%. Combine this with the fact that the K10 is running at only 2.3 GHz, and we can conclude that the memory subsystem (Load/store unit, L1, L2, Northbridge) of the K10 is simply (vastly) superior compared to the Athlon 64s and to the quad Xeon. This confirms our and Intel's assumption that the K10 will probably make the largest impact as a very potent HPC chip. The hardware virtualization features in AMD's K10 are quite impressive, but we'll discuss them later.

There is more: AMD emphasizes the memory subsystem and SSE on the slide below, presenting the Intel Clovertown as a severely bottlenecked CPU.


So how much of this marketing slide is true? It is very likely that the instruction fetch bandwidth of the AMD K10 is probably twice as high as Intel's Core architecture. Pre-decoding bandwidth (and thus the complete chain of fetching and decoding) on Core is still limited to 16 bytes of code per clock cycle, while it is claimed to be 32 bytes for the whole pre-decoding, fetching and decoding pipeline for AMD's K10. It should be remarked however that only in very CPU intensive code will the 16 byte per cycle bandwidth really be a bottleneck.

However, stating that the data cache bandwidth is twice as high as Intel's Core is ignoring a few things. Eric Bron, probably one of the most knowledgeable developers when it comes to SSE, stated: "Intel Core can sustain one 128-bit load and one 128-bit store per cycle (I've measured actual timings very near this theoretical peak), so Core can copy 128 bits per cycle. Barcelona (K10) can only copy 64 bits per cycle from the above store bandwidth limitation." So the twice as much "load bandwidth" is only a small part of the story:
  • Intel Core can do a 128-bit Load and 128-bit Store in one cycle if possible
  • AMD's K10 can either do two 128-bit loads, or two 64-bit stores, or one 128-bit Load and one 64-bit Store
Depending on the situation, AMD's K10 can do twice as much, about equal or about 33% less work in one cycle. So you cannot conclude that the AMD K10 has twice as much SSE bandwidth as the Intel quad core Xeon. It will only be faster if loads happen twice as often (or more) as Stores. In most "harder to vectorize" FP code, this is the case, so here the K10 chip will probably win by a small margin (as the percentage of SSE code is low). An example of this is the SpecFP benchmark. In some "easy to vectorize" SSE code this is not the case, and in that case the K10 will probably not be beaten per clock cycle, but the clock speed disadvantage might give the Xeon the edge.

A very enthusiastic Ron Myers emphasized that the AMD "Barcelona" launch is just as important as the Opteron launch back in 2003. And frankly, we believe him. The fast interconnects on die and "native quad core" might not make a big difference in typical desktop applications, but it does make a difference in server applications. The improved fetching and decoding pipeline and the much improved OoO (Out of Order) execution should offer better integer performance. At the same time, it's starting to be clear that it is more likely that AMD's K10 will outperform by a tangible margin the current Xeon "Clovertown" in HPC applications than server applications, thanks to vastly improved SSE loading and execution, together with a much better memory system. Again, that is probably the reason why Intel is introducing a 1600 MHz DIB more quickly than planned.

Intel Storage
Comments Locked

13 Comments

View All Comments

  • BikeDude - Tuesday, April 10, 2007 - link

    I've tried to use SuperMicro's management software in the past and quite frankly it is pure and utter c---.

    Last week I tried HP's management software, and could install the entire OS from a remote location. I could map a .iso image to the blade in question and it booted right up.

    OTOH, as I recall, SuperMicro's remote desktop solution is based on VNC. Where HP lets you remotely access the console from before POST is even run, SuperMicro forces you to first install the OS.

    (We have lots of Tyan and SuperMicro servers, but of course we might've missed something fundamental along the way -- but... HP has a very nice package once the hardware has been hooked up to the power outlet and your network switch)
  • Xenoterranos - Friday, March 30, 2007 - link

    "Enermax showed how much hardware its Galaxy 1000W can power. According to Enermax, the PSU delivered 933 W to 24 80GB hard disks, four Opteron 8212 CPUs, four 3Ware 9650 drive controllers, a GeForce 7600GX and 8GB of RAM (16 x 512MB)."

    ...or 2 GeForce 8800's.
  • yyrkoon - Friday, March 30, 2007 - link

    quote:

    The only thing missing is an internal SAS controller; the internal disk bays only support SATA. A positive is the fact that two USB ports are available on the front of the server.


    So now we 'need' usb ports on the front of a rack mount server ?! I'd rather have onboard SAS to be honest ;)

    Interresting toys, no doubt.
  • JohanAnandtech - Friday, March 30, 2007 - link

    Well, I find sometimes very handy for installing quickly a driver or a small testprogram etc. Or in some cases to add a USB CDROM, or to make a quick backup on a USB harddisk.

    Do you feel that USB has no use on the front of a server?
  • neogodless - Thursday, March 29, 2007 - link

    I've heard of Home Theatre PC (HTPC) but HPC... is...?
  • laok - Thursday, March 29, 2007 - link

    High Performance Computing
  • Desslok - Thursday, March 29, 2007 - link

    Tyan should have taken the silica bag off that system before showing it off.
  • AnandThenMan - Thursday, March 29, 2007 - link

    Hellz no! The silica bag and the inanimate carbon rod are the main reasons people flock to these shows.
  • ravedave - Thursday, March 29, 2007 - link

    So the Barcelona is no longer known as the K8L and is now again being called the K10? When is the launch for this part? Last I saw it was Q1, which is almost over...

  • JohanAnandtech - Thursday, March 29, 2007 - link

    AFAIK, the K8L name was never used by AMD: it was invented on the Internet. THe K10 will be launched mid 2007 (that is all AMD says), probably the Summer of 2007

Log in

Don't have an account? Sign up now