Introduction

With this week's introduction of the x52 line of Opteron processors, AMD is giving us a little look into the future of their Athlon 64 line. As mentioned in our article on Monday, the new 2.6GHz speed grade is also introducing the new E4 stepping, which adds SSE3 support. The new Opteron also received a face lift in that it is fabbed on a 90nm process, runs coherent HT links at 1GHz, and comes in a shiny new organic package rather than the older ceramic.

The goal of this article is to take a quick look at what SSE3 brings to the table for Opteron and the future revision E Athlon 64 cores. As desktop parts do not enable coherent HT links at all, the 1GHz support won't matter there. Also, the newer A64 parts are already built at 90nm on organic packages. Other than the usual small tweaks we see between steppings, the only thing that will be new across the board for K8 processors is SSE3.

What exactly is SSE3? Intel introduced SSE3 as Prescott New Instructions last year. These instructions are generally additions to the SIMD (single instruction multiple data) capabilities of the processor. SIMD processing is based on the idea that sometimes processors must take large amounts of data and perform similar operations across the entire set. This lends itself well to things like audio and video processing. In these areas of computing, large amounts of data flow through the processor, undergoing roughly the same operations, in preparation for display. The philosophy behind SIMD lends itself well to graphics as well. Modern graphics cores incorporate many SIMD processing units in order to churn through vector and pixel data as fast as possible. SIMD processing has also largely overshadowed the use of the x87 floating point unit on x86 processors. Because of this, it is advantageous for AMD to support the extensions to SIMD Intel makes as quickly as possible.

With SSE3, Intel added 10 new instructions targeted at SIMD as well as 3 other instructions that don't touch the SSE registers (fisttp, monitor, mwait). Here's a brief list of SSE3 instructions and what they are for:
x87 floating point to integer conversion (fisttp)
Complex arithmetic (addsubps, addsubpd, movsldup, movshdup, movddup)
Video encoding (lddqu)
Graphics (haddps, hsubps, haddpd, hsubpd)
Thread synchronization (monitor, mwait)
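To make the complex arithmetic group in the list above a little more concrete, here is a minimal sketch in Python that models movsldup, movshdup, and addsubps on plain 4-element lists and uses them to multiply two pairs of complex numbers. The helper names (`mulps`, `swap_pairs`, `complex_mul2`) are ours and purely illustrative, and the pair swap stands in for a shuffle instruction; this is a scalar model of the instruction semantics, not code taken from any SSE3-optimized application.

```python
def movsldup(src):
    """Model of movsldup: duplicate the even-indexed (low) elements."""
    return [src[0], src[0], src[2], src[2]]


def movshdup(src):
    """Model of movshdup: duplicate the odd-indexed (high) elements."""
    return [src[1], src[1], src[3], src[3]]


def addsubps(dst, src):
    """Model of addsubps: subtract in even lanes, add in odd lanes."""
    return [dst[0] - src[0], dst[1] + src[1], dst[2] - src[2], dst[3] + src[3]]


def mulps(a, b):
    """Element-wise multiply, as a plain SSE mulps would do."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Hypothetical stand-in for a shuffle that swaps real/imaginary lanes."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul2(a, b):
    """Multiply two complex pairs stored as [re0, im0, re1, im1]."""
    real_parts = mulps(movsldup(a), b)              # [ar*br, ar*bi, ...]
    imag_parts = mulps(movshdup(a), swap_pairs(b))  # [ai*bi, ai*br, ...]
    return addsubps(real_parts, imag_parts)         # [ar*br-ai*bi, ar*bi+ai*br, ...]


# (1+2i)*(3+4i) and (0+1i)*(0+1i), packed two to a register
product = complex_mul2([1.0, 2.0, 0.0, 1.0], [3.0, 4.0, 0.0, 1.0])
print(product)
```

The point of the sketch is that the mixed subtract/add pattern of a complex multiply maps onto a single addsubps, where SSE2 code would need extra sign flips or shuffles.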
The float to integer conversion is rather obvious in function, but some of the other instructions are a little mysterious. The complex arithmetic instructions speed up math on complex numbers (values with real and imaginary parts). The hadd and hsub instructions perform horizontal additions and subtractions, operating on adjacent elements within a register rather than on matching elements of two registers. These allow faster processing of data stored "horizontally" in (for example) vertex arrays. Here is a 4-element array of vertex structures:
x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4
SSE and SSE2 are organized such that performance is better when processing vertical data, or structures that contain arrays; for example, a vertex structure with 4-element arrays for each component:
x1 x2 x3 x4
y1 y2 y3 y4
z1 z2 z3 z4
w1 w2 w3 w4
Generally, the preferred organizational method for vertices is the former. Under SSE2, the compiler (or very unfortunate programmer) would have to reorganize the data during processing.
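As a rough illustration of why horizontal adds help with the array-of-structures layout, the following Python sketch models haddps on 4-element lists and uses it to dot four (x, y, z, w) vertices against one 4-component vector without first transposing the data. The function names `mulps` and `dot4_aos` and the sample values are assumptions made for the example; the real instructions operate on 128-bit registers, not Python lists.

```python
def haddps(dst, src):
    """Model of haddps: sum adjacent pairs within each source register."""
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]


def mulps(a, b):
    """Element-wise multiply, as a plain SSE mulps would do."""
    return [x * y for x, y in zip(a, b)]


def dot4_aos(vertices, vec):
    """Dot four (x, y, z, w) vertices with vec, keeping the horizontal layout."""
    p = [mulps(v, vec) for v in vertices]
    # Two levels of horizontal adds leave one full sum per vertex.
    return haddps(haddps(p[0], p[1]), haddps(p[2], p[3]))


vertices = [
    [1.0, 2.0, 3.0, 1.0],
    [4.0, 5.0, 6.0, 1.0],
    [7.0, 8.0, 9.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
vec = [1.0, 0.0, 0.0, 2.0]
print(dot4_aos(vertices, vec))
```

Three haddps operations reduce four product registers to one register of four dot products, which is the step that would otherwise require shuffling the data into the vertical layout.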

The lddqu instruction is designed to reduce the impact of unaligned 128-bit memory accesses, which happen quite often in video processing. It loads 256 bits of data starting at a 16-byte-aligned address and then extracts the requested 16 bytes from that 32-byte block. Under SSE2, 64-bit loads are executed and then the data is recombined.
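The load-and-extract behavior described above can be sketched in a few lines of Python. This is only a conceptual model of the description in the text, operating on a `bytes` object that stands in for memory; the function name `lddqu_model` is hypothetical, and actual hardware may implement the load differently.

```python
def lddqu_model(memory: bytes, addr: int) -> bytes:
    """Return 16 bytes at addr via one aligned 32-byte block load."""
    base = addr & ~15                # round down to a 16-byte boundary
    block = memory[base:base + 32]   # the aligned 32-byte block
    offset = addr - base             # where the requested data starts
    return block[offset:offset + 16]


memory = bytes(range(64))            # each byte's value equals its address
unaligned = lddqu_model(memory, 5)
print(list(unaligned))
```

Because the 32-byte block always covers the requested 16 bytes, a load that straddles a 16-byte boundary is served by the same path as an aligned one.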

In order to test these features as implemented by AMD, we compared an Opteron 250 against an Opteron 252. We were able to use CrystalCPUID to set the multiplier of the Opteron 252 (through PowerNow!) to 12 in order to match the 2.4GHz of the Opteron 250. This way, we'll have a direct clock-for-clock comparison of the two cores.

We ran both processors in HP's wx9300 workstation. We used a single CPU configuration and 4x 512MB of RAM at 3:3:3:8. Windows XP SP2 was used in our tests. In an MP environment (with more memory bandwidth), the Opteron has a greater potential for improvement with SSE3. Unfortunately, we were unable to perform a direct comparison of the older and newer cores under a DP configuration. Attempting to use PowerNow! to adjust the multiplier with more than one processor installed resulted in a BSOD (machine check exception).

SSE3 Performance Analysis

  • Icehawk - Thursday, February 17, 2005 - link

    I've worked for several large corporations (Fortune 500) and none of them have AMD servers anywhere... it is unfortunate but it is like Macs in a DTP house - the old guard swears by it so nothing is going to change. AMD still is seen as inferior compared to Intel even years after the successes of Athlon by many.

    Plus most vendors only offer Intel boxes and large corporates like as small a vendor pool as possible (leverage) and as uniform an IT infrastructure as possible (ie, Intel shop).

    At least that is my perspective on it.

    I would have liked to see a wider array of benchmarks, these were slim pickins - but thanks for the quick review!
  • pxc - Thursday, February 17, 2005 - link

    #26, that wouldn't change anything. Look at the XviD S939/S940 FX-53 (2.4GHz) benchmarks here: http://www.hexus.net/content/reviews/review.php?dX...

    I don't believe XviD has any SSE3 enhancements. XviD is just an unlicensed MPEG-4 hack anyways, so it doesn't matter.
  • Umbra55 - Thursday, February 17, 2005 - link

    Derek,

    Why did you use DivX and not Xvid?
    It is well known that DivX has been “enhanced” by Intel (read: crippled for AMD).
    I would like to see the latest Opterons compared to the latest Xeons under Linux.
    Two reasons: Linux applications have not been “enhanced” by Intel and nowadays more servers use Linux than Windows.
    Umbra.
  • pxc - Thursday, February 17, 2005 - link

    #22, from the DivX 5.2 release notes:

    DivX 5.2 now includes:
    ...
    Encoder: Intel SSE3 (Prescott) Optimizations
    The DivX 5.2 encoder features optimizations for Intel Prescott CPU's, improving performance by up to 15%.
    ...
  • mlittl3 - Thursday, February 17, 2005 - link

    #7, bigpow

    In addition to #14, Derek Wilson (the author of the article in case you didn't notice), stating that Anandtech uses Opterons in their servers, maybe you should pop over to www.top500.org and read through the top 500 supercomputer list. Some 30% of the computers use Opterons. I know you said you are from "one of the largest tech companies" but sounds like you guys aren't doing your homework. Who do you work for? Intel?

    Also, for all of you guys who are asking about better gaming performance and overclocking, OPTERONS ARE SERVER AND WORKSTATION PROCESSORS!!!!! You guys have got to get some perspective. The PC world does not revolve around the number of frames per second you can get out of HL2 or Doom3. Servers are built for stability and usually come with 2d only built on 8MB video cards in 1U designs, etc. etc. Workstations usually use Quadros and FireGLs which are for designing 3d apps, running CAD software, etc.

    Besides Opterons are meant to work with registered memory (some are getting around this). This is not the stuff for gamers and overclockers and regular desktop use. Let's get real. Anandtech will overclock and benchmark games until the cows come home when the Rev E. Athlon 64's and Athlon 64 FX's come out.

    Everyone agreed.
  • pxc - Thursday, February 17, 2005 - link

    Intel 3.4F results:
    SSE2
    Math Solving fps: 591.7
    Prerendering fps: 3554.9
    Overall fps: 21.26

    SSE3
    Math Solving fps: 601.5
    Prerendering fps: 3558.0
    Overall fps: 21.35


    I used the same default settings as Derek used. The Renderer set up does not have a SSE2 setting (only FPU+MMX, 3DNow+MMX, SSE+MMX and SSE3+MMX), but the model set up does have SSE2 and SSE3 options. I also tested 2 render threads, but the math solving and prerendering results seem to report only the first thread (overall fps are correct):

    SSE2, 2 rendering threads
    Math Solving fps: 509.8
    Prerendering fps: 3428.8
    Overall fps: 35.57

    SSE3, 2 rendering threads
    Math Solving fps: 516.2
    Prerendering fps: 3424.6
    Overall fps: 35.68
  • PetNorth - Thursday, February 17, 2005 - link

    DivX 5.2.1 hasn't SSE3 support at all. 2-3% gain will be for some memory system improvement or for another reason.
  • ceefka - Thursday, February 17, 2005 - link

    Hey Intel, can we have SSE4 now?

    Ok, it will improve some benchies. I hope you can find gains for Opteron on SSE3 in your next articles on this one. Otherwise I agree with #9

    #7 That's not funny. That's ignorant.
  • mickyb - Thursday, February 17, 2005 - link

    I like the direct comparison by adjusting the clock, but I would have also included the 2.6 GHz benchmarks as well. I guess you are saving that for a bigger article.

    I thought there were a couple of games that took advantage of SSE3. Do HL-2 or D3 do anything?

    Also, I would like to have seen the temperature when you underclocked it to see if there was any improvement or loss. I thought the E stepping had a better process to reduce leakage. I am also curious if SSE3 added anything significant in the way of load or temp. I would think that SSE3 would be negligible.
  • LoneWolf15 - Thursday, February 17, 2005 - link

    #16's comments are the ones I would have made if they weren't posted already. I'd like to know if the Opteron has the new memory controller that the Venice-core Athlon 64 is supposed to have, and what effects that has on performance.
