Introduction

With this week's introduction of the x52 line of Opteron processors, AMD is giving us a little look into the future of their Athlon 64 line. As mentioned in our article on Monday, the new 2.6GHz speed grade is also introducing the new E4 stepping, which adds SSE3 support. The new Opteron also received a face lift in that it is fabbed on a 90nm process, runs coherent HT links at 1GHz, and comes in a shiny new organic package rather than the older ceramic.

The goal of this article is to take a quick look at what SSE3 brings to the table for Opteron and the future revision E Athlon 64 cores. Desktop parts do not enable coherent HT links at all, so the 1GHz support won't matter there. The newer A64 parts are also already built on a 90nm process in organic packages. Other than the usual small tweaks that we see between steppings, the only thing that will be new across the board for K8 processors is SSE3.

What exactly is SSE3? Intel introduced SSE3 as Prescott New Instructions last year. These instructions are generally additions to the SIMD (single instruction multiple data) capabilities of the processor. SIMD processing is based on the idea that processors must sometimes take large amounts of data and perform similar operations across the entire set. This lends itself well to things like audio and video processing, where large amounts of data flow through the processor and undergo roughly the same operations in preparation for display. The same philosophy suits graphics: modern graphics cores incorporate many SIMD processing units in order to churn through vector and pixel data as fast as possible. SIMD processing has also largely overshadowed the use of the x87 floating point unit on x86 processors. Because of this, it is advantageous for AMD to support Intel's SIMD extensions as quickly as possible.
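To make the SIMD idea concrete, here is a minimal sketch in plain C. The `vec4` type and the `vec4_mul` and `apply_gain` helpers are hypothetical names of our own: `vec4` stands in for a 128-bit SSE register holding four floats, and `vec4_mul` models a packed multiply in scalar code. Real SIMD code would use compiler intrinsics or assembly instead.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector type standing in for a 128-bit SSE register. */
typedef struct { float lane[4]; } vec4;

/* Models one SIMD multiply: the same operation applied to all four lanes. */
static vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Scale n audio samples by gain, four at a time, with a scalar tail. */
static void apply_gain(float *samples, size_t n, float gain)
{
    const vec4 g = { { gain, gain, gain, gain } };
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        vec4 v = { { samples[i], samples[i + 1],
                     samples[i + 2], samples[i + 3] } };
        v = vec4_mul(v, g);
        for (int k = 0; k < 4; ++k)
            samples[i + k] = v.lane[k];
    }

    /* Leftover samples that do not fill a whole vector. */
    for (; i < n; ++i)
        samples[i] *= gain;
}
```

On real hardware, each pass through the main loop would be a single load, one packed multiply and a single store, which is where the throughput advantage over one-sample-at-a-time code comes from.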

With SSE3, Intel added 10 new instructions targeted at SIMD as well as 3 other instructions that don't touch the SSE registers (fisttp, monitor, mwait). Here's a brief list of SSE3 instructions and what they are for:
x87 floating point to integer conversion (fisttp)
Complex arithmetic (addsubps, addsubpd, movsldup, movshdup, movddup)
Video encoding (lddqu)
Graphics (haddps, hsubps, haddpd, hsubpd)
Thread synchronization (monitor, mwait)
The float to integer conversion is rather obvious in function, but some of the other instructions are a little mysterious. The complex math instructions speed up arithmetic on complex numbers. The hadd and hsub instructions are horizontal additions and horizontal subtractions. These allow faster processing of data stored "horizontally" in (for example) vertex arrays. Here is a 4-element array of vertex structures:
x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4
SSE and SSE2 are organized such that performance is better when processing vertical data, or structures that contain arrays; for example, a vertex structure with 4-element arrays for each component:
x1 x2 x3 x4
y1 y2 y3 y4
z1 z2 z3 z4
w1 w2 w3 w4
Generally, the preferred organizational method for vertices is the former. Under SSE2, the compiler (or very unfortunate programmer) would have to reorganize the data during processing.
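As a rough illustration of why horizontal adds help with the first layout, the sketch below models haddps in scalar C and uses it to dot four x, y, z, w vertices against one 4-component vector without reorganizing the data. The `vec4` type and the `hadd_ps`, `mul_ps` and `dot4` functions are our own hypothetical stand-ins, not a real API. In real code, the horizontal add is exposed as the `_mm_hadd_ps` intrinsic in `<pmmintrin.h>`.

```c
/* Hypothetical stand-in for a 128-bit XMM register holding four floats. */
typedef struct { float lane[4]; } vec4;

/* Scalar model of haddps: sums adjacent pairs within a, then within b. */
static vec4 hadd_ps(vec4 a, vec4 b)
{
    vec4 r = { { a.lane[0] + a.lane[1], a.lane[2] + a.lane[3],
                 b.lane[0] + b.lane[1], b.lane[2] + b.lane[3] } };
    return r;
}

/* Lane-wise multiply, modeling what mulps does in SSE. */
static vec4 mul_ps(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Dot each of four x,y,z,w vertices with plane.
 * Lane i of the result is the dot product for verts[i]. */
static vec4 dot4(const vec4 verts[4], vec4 plane)
{
    vec4 m0 = mul_ps(verts[0], plane);
    vec4 m1 = mul_ps(verts[1], plane);
    vec4 m2 = mul_ps(verts[2], plane);
    vec4 m3 = mul_ps(verts[3], plane);

    /* Two rounds of horizontal adds collapse each product to one sum. */
    return hadd_ps(hadd_ps(m0, m1), hadd_ps(m2, m3));
}
```

Three horizontal adds turn four per-vertex products into four finished dot products, with each vertex staying in its original x, y, z, w order throughout.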

The lddqu instruction is designed to reduce the impact of unaligned 128-bit memory accesses, which happen quite often in video processing. To do this, lddqu loads 256 bits of data starting at a 16-byte boundary and then takes care of extracting the correct 16 bytes (as requested) from the 32-byte block. Under SSE2, 64-bit loads are executed and then the data is recombined.
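The load-and-extract behavior described here can be modeled in a few lines of scalar C. This is only a sketch of the idea, not of how any processor implements it. The `lddqu_model` function and its parameters are hypothetical, and the model assumes that `buf` starts on a 16-byte boundary and that at least 32 bytes are readable from the rounded-down position. In real code, the instruction is exposed as the `_mm_lddqu_si128` intrinsic in `<pmmintrin.h>`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of an lddqu-style load of the 16 bytes at buf[index].
 * Assumes buf is 16-byte aligned and 32 bytes are readable from base. */
static void lddqu_model(const uint8_t *buf, size_t index, uint8_t out[16])
{
    size_t base = index & ~(size_t)15;   /* round down to a 16-byte boundary */
    size_t offset = index - base;        /* 0..15: where the wanted bytes start */
    uint8_t block[32];

    memcpy(block, buf + base, 32);       /* one aligned 256-bit fetch */
    memcpy(out, block + offset, 16);     /* extract the requested 16 bytes */
}
```

For example, a request for the 16 bytes starting at byte 21 fetches bytes 16 through 47 and returns bytes 21 through 36.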

In order to test these features as implemented by AMD, we tested an Opteron 250 against an Opteron 252. We were able to use CrystalCPUID to set the multiplier of the Opteron 252 (through PowerNow!) to 12 in order to match the 2.4GHz of the Opteron 250. This way, we'll have a direct comparison of the two cores at the same clock speed.

We ran both processors in HP's wx9300 workstation. We used a single CPU configuration and 4x 512MB of RAM at 3:3:3:8. Windows XP SP2 was used in our tests. In an MP environment (with more memory bandwidth), the Opteron has a greater potential for improvement with SSE3. Unfortunately, we were unable to perform a direct comparison of the older and newer cores under a DP configuration. Attempting to use PowerNow! to adjust the multiplier with more than one processor installed resulted in a BSOD (machine check exception).

SSE3 Performance Analysis

48 Comments

  • DerekWilson - Monday, February 21, 2005

    #46 and #47

    Aren't monitor and mwait like hardware semaphores/mutexes?
  • PrinceGaz - Saturday, February 19, 2005

    #46- MONITOR tells the processor to detect changes in memory locations (typically in cache), and MWAIT puts a thread into low-power "sleep" until those memory changes are detected.

    MONITOR and MWAIT are meaningless to processors that are only running a single thread, therefore the AMD SSE3 capable processors will just treat them as no operation (the processor has to be aware of their existence so it can skip past them correctly, and not potentially crash like #44 asked).

    Instructions designed to put one thread to sleep so that the processor can use its full resources on the other thread are only relevant when a single processor is running two or more threads simultaneously. Single-threaded processors will always dedicate full resources to the thread they are running and not put them to sleep as that would be pointless. The O/S still handles thread-switching normally, regardless of how many threads the processor is running.
  • pxc - Friday, February 18, 2005

    #45, that would be disastrous (NOP for a wait condition would mean a thread wouldn't wake up). :P MONITOR and MWAIT are useful for HyperThreading, but not exclusive to HT. Any application with multiple threads can benefit from their use. AMD's internal implementations will of course be different, but the instructions will behave the same way.
  • PrinceGaz - Friday, February 18, 2005

    #44- they'll almost certainly just treat monitor and mwait as a NOP (no operation instruction)
  • quanta - Friday, February 18, 2005

    Since SSE3-based AMD64 CPUs don't have hyperthreading, will applications using monitor and mwait crash the Opterons?
  • Viditor - Friday, February 18, 2005

    Derek - By running at 75% quality, aren't you minimizing the effects of lddqu, as this is mainly of use for motion estimation (which is greatly reduced at lower quality settings...)?

    (Thanks to Mike S for pointing this out to me...)
  • Viditor - Friday, February 18, 2005

    PrinceGaz - "It definitely looks like these new E stepping chips run hotter"

    Unfortunately, you can't really tell with AMD chips...
    All we really know is that under absolutely NO circumstances will it run higher than 92.6W...
    I do wish TDP was a standard across all companies, but I guess that would be impractical.
    Sadly, this is a measurement that none of the review sites ever make...
  • Viditor - Friday, February 18, 2005

    Icehawk - "I've worked for several large corporations (Fortune 500) and none of them have AMD servers anywhere..."

    40% of the Fortune 500 companies are now using Opteron servers. A large percentage of the most powerful new supercomputers are Opterons.
    I am sure that while you worked for those companies they did not have Opterons, as this is only a recent development (over the last year).

    "most vendors only offer Intel boxes"

    This too has changed over the last year. As of now, the only major vendor to be Intel only is Dell. In fact, Sun has cancelled their Xeon line in favour of Opterons...

    http://www.theregister.co.uk/2005/02/10/sun_kills_...

  • PrinceGaz - Friday, February 18, 2005

    #39- although the 2.4GHz x50 D4 Opteron also has the same 85.3W TDP as the 2.2GHz part, the 2.6GHz x52 has a TDP of 92.6W which is higher than any other Opteron including the 130nm parts.

    It definitely looks like these new E stepping chips run hotter, but we need power consumption and temperature tests to say for sure.
  • Brunnis - Friday, February 18, 2005

    #38

    Well, AMD said that power requirements would drop for chips at the same frequency, if I remember correctly. The TDP doesn't say anything about processors currently available. For all we know the 85W figure could be for a future 4GHz Opteron. I'm exaggerating, but you get my point. :)
