AMD K8 E4 Stepping: SSE3 Performance

by Derek Wilson on 2/17/2005 12:05 AM EST

48 Comments


  • DerekWilson - Monday, February 21, 2005

    #46 and #47

    Aren't MONITOR and MWAIT like hardware semaphores/mutexes?
  • PrinceGaz - Saturday, February 19, 2005

    #46- MONITOR tells the processor to detect changes in memory locations (typically in cache), and MWAIT puts a thread into low-power "sleep" until those memory changes are detected.

    MONITOR and MWAIT are meaningless to processors that only run a single thread, so the AMD SSE3-capable processors will just treat them as no-ops (the processor has to be aware of their existence so it can skip past them correctly and not potentially crash, as #44 asked).

    Instructions designed to put one thread to sleep so that the processor can use its full resources on the other thread are only relevant when a single processor is running two or more threads simultaneously. Single-threaded processors will always dedicate full resources to the thread they are running and not put them to sleep as that would be pointless. The O/S still handles thread-switching normally, regardless of how many threads the processor is running.
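The disagreement in this thread over whether MWAIT could safely act as a NOP hinges on how the waiting code is written. MWAIT is allowed to wake up before the monitored location is written (an interrupt also ends the wait, for example), so the usual idiom re-checks the condition in a loop. The Python sketch below is only a software analogy of that idiom, not the instructions themselves: `wait_until_set`, `run`, `low_power_wait` and `nop_wait` are hypothetical names, and a one-element list stands in for the monitored memory. It shows that with such a loop a no-op MWAIT degrades into a busy spin rather than a hang; it makes no claim about how any particular CPU implements MONITOR/MWAIT.

```python
import threading
import time


def wait_until_set(word, hint):
    """Wait until word[0] becomes non-zero, calling `hint` between checks.

    `word` stands in for the memory location armed with MONITOR and `hint`
    stands in for MWAIT. Because the loop re-checks the condition after every
    wake-up, the result does not depend on how long (or whether) `hint` sleeps.
    """
    checks = 0
    while word[0] == 0:
        hint()
        checks += 1
    return checks


def run(hint, delay=0.01):
    """Start a writer thread that stores to the word after `delay` seconds."""
    word = [0]

    def writer():
        time.sleep(delay)
        word[0] = 1  # the store that should wake the waiter

    t = threading.Thread(target=writer)
    t.start()
    checks = wait_until_set(word, hint)
    t.join()
    return word[0], checks


def low_power_wait():
    time.sleep(0.001)  # "sleep" briefly; may return before the store happens


def nop_wait():
    pass  # MWAIT treated as a no-op: the loop simply spins


print(run(low_power_wait))  # (1, checks)
print(run(nop_wait))        # (1, checks), typically with far more checks
```

In both cases the waiter sees the store and returns; the difference is only how much work it burns while waiting.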
  • pxc - Friday, February 18, 2005

    #45, that would be disastrous (a NOP for a wait condition would mean a thread wouldn't wake up). :P MONITOR and MWAIT are useful for HyperThreading, but not exclusive to HT. Any application with multiple threads can benefit from their use. AMD's internal implementation will of course be different, but the instructions will behave the same way.
  • PrinceGaz - Friday, February 18, 2005

    #44- they'll almost certainly just treat monitor and mwait as a NOP (no operation instruction)
  • quanta - Friday, February 18, 2005

    Since SSE3-based AMD64 CPUs don't have HyperThreading, will applications using MONITOR and MWAIT crash the Opterons?
  • Viditor - Friday, February 18, 2005

    Derek - By running at 75% quality, aren't you minimizing the effects of lddqu, as this is mainly of use for motion estimation (which is greatly reduced at lower quality settings...)?

    (Thanks to Mike S for pointing this out to me...)
  • Viditor - Friday, February 18, 2005

    PrinceGaz - "It definitely looks like these new E stepping chips run hotter"

    Unfortunately, you can't really tell with AMD chips...
    All we really know is that under absolutely NO circumstances will it run higher than 92.6W...
    I do wish TDP was a standard across all companies, but I guess that would be impractical.
    Sadly, this is a measurement that none of the review sites ever make...
  • Viditor - Friday, February 18, 2005

    Icehawk - "I've worked for several large corporations (Fortune 500) and none of them have AMD servers anywhere..."

    40% of the Fortune 500 companies are now using Opteron servers. A large percentage of the most powerful new supercomputers are Opterons.
    I am sure that while you worked for those companies they did not have Opterons, as this is only a recent development (over the last year).

    "most vendors only offer Intel boxes"

    This too has changed over the last year. As of now, the only major vendor to be Intel only is Dell. In fact, Sun has cancelled their Xeon line in favour of Opterons...

    http://www.theregister.co.uk/2005/02/10/sun_kills_...

  • PrinceGaz - Friday, February 18, 2005

    #39- although the 2.4GHz x50 D4 Opteron also has the same 85.3W TDP as the 2.2GHz part, the 2.6GHz x52 has a TDP of 92.6W which is higher than any other Opteron including the 130nm parts.

    It definitely looks like these new E stepping chips run hotter, but we need power consumption and temperature tests to say for sure.
  • Brunnis - Friday, February 18, 2005

    #38

    Well, AMD said that power requirements would drop for chips at the same frequency, if I remember correctly. The TDP doesn't say anything about processors currently available. For all we know the 85W figure could be for a future 4GHz Opteron. I'm exaggerating, but you get my point. :)
  • aznskickass - Friday, February 18, 2005

    I am actually not very surprised, as you would expect SSE3 to have a much bigger impact on the P4 due to its much longer pipeline and weaker FPU.

    #30, wow, wattage has jumped up a lot with strained silicon (no wonder Prescotts are having trouble, esp. since they don't have SOI)...

    While not yet in Prescott territory, AMD has to keep the wattages in check; I think a ~3GHz chip might be tipping the scales at almost 100W, which *is* Prescott territory!
  • saratoga - Thursday, February 17, 2005

    "#29, XviD is an *UNLICENSED MPEG-4 HACK*. That's just a fact. DivX is a MPEG-4 licensee, XviD is not. "

    This is pretty silly. How could a piece of code get an MPEG4 license? Obviously it can't, which is why neither Xvid nor Divx code is licensed. Only a (compiled) product can be licensed to use MPEG4.

    Anyone selling an MPEG4 product is welcome to use Xvid and it's perfectly legal, but they must pay a license fee for each product sold, as Divx does when you buy their product. It's the same situation as LAME: when you use it without having paid for a license, you're violating some patents. But you're free to license it, and then you're in the clear legally.

    Also, the irony of calling something a hack and mentioning Divx is simply breathtaking.
  • tygrus - Thursday, February 17, 2005

    "This seems to indicate that the K8 architecture is simply resilient when it comes to unaligned 128bit loads. In the case of Intel's NetBurst, the lddqu instruction may have more impact."
    If you have SSE3-enabled Intel CPUs, then test your hypothesis instead of guessing. It would be interesting to see the absolute and percentage increases in performance for the same tests using equivalent Intel chips. From what I remember, SSE3 gave Intel little performance increase for previously SSE2-optimised code. There may have been a few artificial test cases that showed large benefits, i.e. deliberately unoptimised SSE2 code versus optimised SSE3 code.

    "As the Intel compiler is designed to optimize for Intel processors, we haven't had a viable source for high quality SSE3 compilation." You may be surprised by the performance of so-called 'Intel optimised' code on AMD systems. I say this particularly because of the old case of PIII- and early-P4-optimised code showing better scores on AMD Athlons at the time.

    It would also be interesting to see the performance difference in those benchmarks on the Opteron 252 with SSE3 turned off.

    As always, we will have to wait for further optimisations and validations before we can make a better comparison. One way to investigate the features and implementation is to use hand-coded SSE2/3 code for an inner loop and compare performance and behaviour under different conditions. At the moment it's like we have only seen one side of a six-sided die.

    The other thing would be to compare the power consumption of the two Opteron steppings (either measured at the power point and extrapolated, or measured as power delivered to the mainboard/CPU).

    I see that "23 - Posted on Feb 17, 2005 at 10:37 AM by pxc" has added some useful information using an Intel 3.4GHz P4 F. A 2.4GHz Opteron could be considered a competitor to a 3.6GHz Intel P4. Others have already made similar comments or offered a different view of the benchmarks given.
  • ChronoReverse - Thursday, February 17, 2005

    #31

    Unlicensed MPEG4 implementation, yes. Hack? Hardly.

    XviD at least implements features as per specifications. DivX tends to add in their own "features" that aren't exactly in spec (although some are understandable given the limitations of the AVI container)

    Choose your words carefully.

    #32

    I'm also inclined to believe that it's simply because DivX's SSE3 implementation doesn't do much yet.
  • Jigga - Thursday, February 17, 2005

    Sorry I'm a bit of a n00b when it comes to Divx encoding tests but are you sure the SSE3 codepath was enabled on the Opteron? I'm curious if some apps simply test for core/stepping rather than actual SSE3 ability--maybe DivX wasn't even using the right code??
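Jigga's question comes down to how a program decides to use SSE3. The robust approach is to test the feature flag that CPUID reports rather than a vendor string, family or stepping. CPUID leaf 1 reports SSE3 in ECX bit 0 and MONITOR/MWAIT in ECX bit 3, with SSE and SSE2 in EDX bits 25 and 26. The Python sketch below only decodes register values: `features` and `pick_codepath` are hypothetical helpers, the sample register values are made up, and a real program would read ECX/EDX from the CPUID instruction itself (for example through GCC's `__get_cpuid` in `<cpuid.h>`). It says nothing about what DivX actually checks.

```python
# CPUID leaf 1 (EAX=1) feature bits, as documented by Intel and AMD.
CPUID1_ECX = {"SSE3": 0, "MONITOR": 3}
CPUID1_EDX = {"SSE": 25, "SSE2": 26}


def features(ecx, edx):
    """Decode CPUID leaf-1 ECX/EDX values into a set of feature names."""
    found = {name for name, bit in CPUID1_ECX.items() if (ecx >> bit) & 1}
    found |= {name for name, bit in CPUID1_EDX.items() if (edx >> bit) & 1}
    return found


def pick_codepath(ecx, edx):
    """Choose a code path from feature flags only, never vendor/family/stepping."""
    f = features(ecx, edx)
    if "SSE3" in f:
        return "sse3"
    if "SSE2" in f:
        return "sse2"
    return "x87"


# Hypothetical register values: SSE, SSE2 and SSE3 set, MONITOR clear.
edx = (1 << 25) | (1 << 26)
print(sorted(features(0x1, edx)))  # ['SSE', 'SSE2', 'SSE3']
print(pick_codepath(0x1, edx))     # sse3
print(pick_codepath(0x0, edx))     # sse2
```

A check written this way picks the SSE3 path on any CPU that advertises SSE3, whoever made it.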
  • Brunnis - Thursday, February 17, 2005

    #25, pxc, wrote:

    "DivX 5.2 now includes:
    ...
    Encoder: Intel SSE3 (Prescott) Optimizations
    The DivX 5.2 encoder features optimizations for Intel Prescott CPU's, improving performance by up to 15%."

    Is it even remotely possible that DivX skips using SSE3 on the Opteron because it's currently only "meant" to run on the Prescott? I realise that SSE3 should work if the program is correctly written, but one never knows...
  • PetNorth - Thursday, February 17, 2005

    #25

    OK, I looked here http://www.divx.com/divx/divxpro/versions/ and it doesn't mention it, so I thought it hasn't.

    Anyway, it seems either DivX's SSE3 implementation isn't very good or SSE3 is simply useless (I think it's the former, or that DivX really has no SSE3 at all, because with TMPGEnc Xpress, for example, there is a good improvement).

    I say this because we can see here http://www.tomshardware.com/cpu/20041115/pentium4_... that with AG Knot and DivX 5.2 there is no performance difference at all between a P4C (SSE2) and a P4E (SSE3) at the same clock speed.
  • pxc - Thursday, February 17, 2005

    #29, XviD is an *UNLICENSED MPEG-4 HACK*. That's just a fact. DivX is a MPEG-4 licensee, XviD is not.
  • PrinceGaz - Thursday, February 17, 2005

    #20- the E4 Opterons have a higher TDP, almost as high as the 130nm CG revision.

    2.2 GHz (x48) Opteron TDP:

    CG - 89W (130nm, SOI)
    D4 - 67W (90nm, SOI)
    E4 - 85.3W (90nm, SOI, strained-silicon)

    It's to be expected that the E revision chips will run hotter than the D revision because strained-silicon increases power consumption (but allows for higher speeds). So long as you have good cooling, the E revision chips should be great overclockers.

    It would be nice for comparisons of temperature and system power consumption to be taken of a D4 x48, and E4 clocked at 2.2GHz (there are no D4 x50 parts, they were all CG revision).
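To put the quoted 2.2 GHz figures in proportion, the arithmetic below computes the relative change in rated TDP between revisions. These are TDP ratings as quoted in the post, i.e. design limits rather than measured power draw, so they do not settle how much more power an E4 part actually consumes; `TDP` and `pct_change` are illustrative names.

```python
# 2.2 GHz (x48) Opteron TDP figures quoted in the post, in watts.
TDP = {"CG": 89.0, "D4": 67.0, "E4": 85.3}


def pct_change(old, new):
    """Percent change going from old to new."""
    return (new - old) / old * 100.0


e4_vs_d4 = pct_change(TDP["D4"], TDP["E4"])
e4_vs_cg = pct_change(TDP["CG"], TDP["E4"])
print(f"E4 vs D4: {e4_vs_d4:+.1f}%")  # E4 vs D4: +27.3%
print(f"E4 vs CG: {e4_vs_cg:+.1f}%")  # E4 vs CG: -4.2%
```

So the rated E4 TDP is about 27% above D4 at the same clock, yet still about 4% below the 130nm CG part.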
  • ChronoReverse - Thursday, February 17, 2005

    @27

    XviD certainly doesn't have any SSE3 enhancements, but I do believe that calling it an "MPEG-4 hack", when DivX has far more hacks and implements fewer features of MPEG-4 ASP, is hardly fair at all.


    And I'd also like to see the difference between the new and old Opterons using only SSE2 so that we can see the difference not due to SSE3.
  • Icehawk - Thursday, February 17, 2005

    I've worked for several large corporations (Fortune 500) and none of them have AMD servers anywhere... It is unfortunate, but it is like Macs in a DTP house: the old guard swears by them, so nothing is going to change. Many still see AMD as inferior to Intel, even years after the Athlon's success.

    Plus most vendors only offer Intel boxes, and large corporations like as small a vendor pool as possible (leverage) and as uniform an IT infrastructure as possible (i.e., an Intel shop).

    At least that is my perspective on it.

    I would have liked to see a wider array of benchmarks, these were slim pickins - but thanks for the quick review!
  • pxc - Thursday, February 17, 2005

    #26, that wouldn't change anything. Look at the XviD S939/S940 FX-53 (2.4GHz) benchmarks here: http://www.hexus.net/content/reviews/review.php?dX...

    I don't believe XviD has any SSE3 enhancements. XviD is just an unlicensed MPEG-4 hack anyways, so it doesn't matter.
  • Umbra55 - Thursday, February 17, 2005

    Derek,

    Why did you use DivX and not Xvid?
    It is well known that DivX has been “enhanced” by Intel (read: crippled for AMD).
    I would like to see the latest Opterons compared to the latest Xeons under Linux.
    Two reasons: Linux applications have not been “enhanced” by Intel, and nowadays more servers use Linux than Windows.
    Umbra.
  • pxc - Thursday, February 17, 2005

    #22, from the DivX 5.2 release notes:

    DivX 5.2 now includes:
    ...
    Encoder: Intel SSE3 (Prescott) Optimizations
    The DivX 5.2 encoder features optimizations for Intel Prescott CPU's, improving performance by up to 15%.
    ...
  • mlittl3 - Thursday, February 17, 2005

    #7, bigpow

    In addition to #14 (Derek Wilson, the author of the article in case you didn't notice) stating that AnandTech uses Opterons in its servers, maybe you should pop over to www.top500.org and read through the top 500 supercomputer list. Some 30% of the computers use Opterons. I know you said you are from "one of the largest tech companies", but it sounds like you guys aren't doing your homework. Who do you work for? Intel?

    Also, for all of you guys who are asking about better gaming performance and overclocking, OPTERONS ARE SERVER AND WORKSTATION PROCESSORS!!!!! You guys have got to get some perspective. The PC world does not revolve around the number of frames per second you can get out of HL2 or Doom3. Servers are built for stability and usually come with 2D-only, built-in 8MB video in 1U designs, etc. Workstations usually use Quadros and FireGLs, which are for designing 3D apps, running CAD software, etc.

    Besides, Opterons are meant to work with registered memory (some are getting around this). This is not the stuff for gamers, overclockers and regular desktop use. Let's get real. AnandTech will overclock and benchmark games until the cows come home when the Rev E Athlon 64s and Athlon 64 FXs come out.

    Everyone agreed.
  • pxc - Thursday, February 17, 2005

    Intel 3.4F results:
    SSE2
    Math Solving fps: 591.7
    Prerendering fps: 3554.9
    Overall fps: 21.26

    SSE3
    Math Solving fps: 601.5
    Prerendering fps: 3558.0
    Overall fps: 21.35


    I used the same default settings as Derek used. The Renderer setup does not have an SSE2 setting (only FPU+MMX, 3DNow+MMX, SSE+MMX and SSE3+MMX), but the model setup does have SSE2 and SSE3 options. I also tested 2 render threads, but the math solving and prerendering results seem to report only the first thread (overall fps are correct):

    SSE2, 2 rendering threads
    Math Solving fps: 509.8
    Prerendering fps: 3428.8
    Overall fps: 35.57

    SSE3, 2 rendering threads
    Math Solving fps: 516.2
    Prerendering fps: 3424.6
    Overall fps: 35.68
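The results above are easier to compare as relative changes. The sketch below only computes the percent difference between the SSE2 and SSE3 runs; `RESULTS`, `METRICS` and `sse3_gain` are illustrative names. Per the note above, the two-thread math solving and prerendering figures cover only the first thread, so only the overall fps are fully comparable there.

```python
# P4 3.4F results from the post above, in fps:
# (math solving, prerendering, overall)
RESULTS = {
    "1 thread": {"SSE2": (591.7, 3554.9, 21.26), "SSE3": (601.5, 3558.0, 21.35)},
    "2 threads": {"SSE2": (509.8, 3428.8, 35.57), "SSE3": (516.2, 3424.6, 35.68)},
}
METRICS = ("math solving", "prerendering", "overall")


def sse3_gain(run):
    """Percent change per metric when switching from the SSE2 to the SSE3 path."""
    return [(new - old) / old * 100.0 for old, new in zip(run["SSE2"], run["SSE3"])]


for name, run in RESULTS.items():
    gains = ", ".join(f"{m} {g:+.2f}%" for m, g in zip(METRICS, sse3_gain(run)))
    print(f"{name}: {gains}")
```

For these numbers it reports about +1.66% math solving and +0.42% overall with one thread, and +1.26% and +0.31% with two threads, with prerendering essentially flat.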
  • PetNorth - Thursday, February 17, 2005

    DivX 5.2.1 has no SSE3 support at all. The 2-3% gain must come from some memory subsystem improvement or some other reason.
  • ceefka - Thursday, February 17, 2005

    Hey Intel, can we have SSE4 now?

    Ok, it will improve some benchies. I hope you can find gains for Opteron on SSE3 in your next articles on this one. Otherwise I agree with #9

    #7 That's not funny. That's ignorant.
  • mickyb - Thursday, February 17, 2005

    I like the direct comparison by adjusting the clock, but I would have also included the 2.6 GHz benchmarks as well. I guess you are saving that for a bigger article.

    I thought there were a couple of games that took advantage of SSE3. Does HL-2 or D3 do anything with it?

    Also, I would like to have seen the temperature when you underclocked it, to see if there was any improvement or loss. I thought the E stepping had a better process to reduce leakage. I am also curious whether SSE3 added anything significant in the way of load or temperature. I would think the effect of SSE3 would be negligible.
  • LoneWolf15 - Thursday, February 17, 2005

    #16's comments are the ones I would have made if they weren't posted already. I'd like to know if the Opteron has the new memory controller that the Venice-core Athlon 64 is supposed to have, and what effect that has on performance.
  • Beenthere - Thursday, February 17, 2005

    The only reason Intel created SSE3 was to have bogus benchmarks to fool naive consumers. There is no significant performance advantage in any application. When you're Intel and you can provide incentives for benchmarks to be written to your liking to show a fantasy performance advantage, and your product line is obsolete and your market share is dropping, you do whatever you can to deceive consumers and hacks. AMD included SSE3 so Intel couldn't use the bogus benchmarks for misleading marketing purposes.

    This is no different than when MICROSUCKS paid to have benchmarks run that showed Win2000 to be faster than NT4 when in fact it is NOT in actual practice.

    SOD, DD

    Time for PC users to become a little more knowledgeable on the scams being used by dishonest companies to hawk inferior products.
  • Carfax - Thursday, February 17, 2005

    Hey Derek. Could you test SSE2 performance as well?

    As it has been mentioned, the E stepping was rumored to possess a better SSE2 implementation.
  • iwodo - Thursday, February 17, 2005

    I always thought the E stepping was going to bring many new things to the table.

    Improved memory controller, which is supposed to be faster and have better compatibility.

    Improved SSE2 core - More performance.

    Better Cache Latency

    SIO - Lower TDP...........

    Where are all these in the review? Or are they just rumors, or are they not available on Opteron?
  • DerekWilson - Thursday, February 17, 2005

    Oh, but back on topic... I've had a lot of emails about AMD simply mapping SSE3 functionality to SSE2 (or even x87) hardware. This would be a very bad idea for AMD and doesn't look like what they are doing.

    If we had seen AMD implement the entire SSE3 instruction set as essentially macros for SSE2, we would likely have seen a performance drop. There's not an easy way to just map some of the instructions, as optimal performance would require a recompile. We actually saw a performance gain in our synthetic benchmark that used some of the floating point instructions.
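To see why a macro-style mapping would cost something, take HADDPS, one of the SSE3 floating-point instructions: given a = [a0, a1, a2, a3] and b = [b0, b1, b2, b3] it produces [a0+a1, a2+a3, b0+b1, b2+b3]. Without SSE3, one common way to get the same result is two SHUFPS shuffles (an SSE instruction) plus one ADDPS, and because SHUFPS overwrites its first operand, real code typically needs an extra register copy as well. The Python sketch below models lane semantics only; `shufps`, `addps`, `haddps` and `haddps_emulated` are hypothetical model functions, not intrinsics, and it makes no claim about how AMD's hardware actually executes HADDPS.

```python
def shufps(a, b, imm):
    """Lane model of SSE SHUFPS: lanes 0-1 picked from a, lanes 2-3 from b."""
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]


def addps(a, b):
    """Lane model of SSE ADDPS: element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def haddps(a, b):
    """Lane model of SSE3 HADDPS: adjacent pairs of a, then adjacent pairs of b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def haddps_emulated(a, b):
    """Same result without SSE3: two shuffles and one add."""
    evens = shufps(a, b, 0x88)  # [a0, a2, b0, b2]
    odds = shufps(a, b, 0xDD)   # [a1, a3, b1, b3]
    return addps(evens, odds)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(haddps(a, b))           # [3.0, 7.0, 30.0, 70.0]
print(haddps_emulated(a, b))  # [3.0, 7.0, 30.0, 70.0]
```

In this model the SSE3 form is one operation and the fallback is three (plus a copy), which is the kind of overhead a macro mapping would add; actual instruction costs vary by implementation.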

    It is possible that some instructions could be treated this way. For example, there's no reason code that uses a standard method to load 16 bytes (which may or may not be unaligned) should behave any differently from code that uses lddqu.
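Intel documents LDDQU as returning the same 16 bytes as MOVDQU but possibly performing better when the source crosses a cache-line boundary, so any benefit depends on how often that happens. The Python sketch below counts, for every byte offset within one line, whether a 16-byte load starting there is split across two lines, assuming a 64-byte line; `crosses_line` and `split_fraction` are illustrative helpers and the line size is an assumed parameter, not a measured value.

```python
LINE = 64   # assumed cache-line size in bytes
WIDTH = 16  # a 128-bit load, as done by MOVDQU or LDDQU


def crosses_line(addr, width=WIDTH, line=LINE):
    """True if a `width`-byte load starting at byte address `addr` spans two lines."""
    return addr // line != (addr + width - 1) // line


def split_fraction(width=WIDTH, line=LINE):
    """Share of start offsets within one line whose load is split across lines."""
    return sum(crosses_line(off, width, line) for off in range(line)) / line


print([off for off in range(LINE) if crosses_line(off)][:3])  # [49, 50, 51]
print(split_fraction())  # 0.234375, i.e. 15 of 64 offsets
```

Under that assumption, byte-granular access such as a motion-estimation search would split roughly a quarter of its 16-byte loads, while aligned or mostly aligned data would rarely see any difference.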
  • DerekWilson - Thursday, February 17, 2005

    No one uses Opteron?

    http://www.anandtech.com/IT/showdoc.aspx?i=2173

    Also, if you need 4P or more, there's no reason to limit yourself by going with Intel's FSB implementation -- it really hurts the performance of the system:

    http://www.anandtech.com/IT/showdoc.aspx?i=1982
  • xsilver - Thursday, February 17, 2005

    old habit ?
    It's called perception lag -- when perception (of Intel being good) needs to catch up to reality... oh, and also blame it on companies like Dell, etc.
  • Brunnis - Thursday, February 17, 2005

    bigpow: But then again, I wouldn't go with Opteron too.

    Why not? Opteron is better than Xeon in many areas.

    A large reason many companies use little besides Intel products is probably old habit. That's just stupid, in my opinion, but everyone's different...
  • sandorski - Thursday, February 17, 2005

    Bigpow: Opteron has gone from 0 to 10% market share in the server space. So it's not surprising that neither you nor anyone you know has them, but they are being used, and last I heard they were still gaining market share.
  • Samadhi - Thursday, February 17, 2005

    It has been written in a number of places that as well as adding SSE3 units the SSE2 units were to be improved in the latest chip revision.

    Any chance we could get some SSE2 vs SSE2 results for the two processors tested in this article?
  • SkAiN - Thursday, February 17, 2005

    Sorry for the blank post.

    When I first began reading this article, I became excited, looking forward to seeing the benchmarks this "upgrade" was supposed to bring, especially in the area of encoding.

    Then I saw the benchmarks.

    Seriously, it looks as if AMD is getting the short end of the stick when it comes to the cross-licensing deal with Intel. Intel gets an awesome new architecture, A64s get Intel's bogus hype...
  • SkAiN - Thursday, February 17, 2005

  • bigpow - Thursday, February 17, 2005

    Funny.
    I work at one of the largest high-tech companies today and I can't find any of these Opteron servers. My friends also notice the same trend.
    Large corporations are sticking with Intel, enough said.

    Nice step forward for AMD, but still far from catching Intel.

    For my PC, I use AMD AthlonXP (soon-to-be A64). I wouldn't go with Intel for my use. But then again, I wouldn't go with Opteron too.

    Who's buying this Opteron again?
  • DerekWilson - Thursday, February 17, 2005

    Unfortunately, the platforms I have available to test the Opteron on (nforce 3 pro and nforce pro 2200) only offer overclocking in the form of nTune. And these platforms do not like being pushed out of spec.

    We also have many more tests to run on these processors and platforms and don't wish to see an unfortunate lab accident consume our samples before we squeeze out all the data we are looking for.

    If we finish all our planned tests with Opteron 252, we may look into overclocking. But that will sit on the back burner for some time either way.
  • dannybin1742 - Thursday, February 17, 2005

    isn't this rev supposed to use strained silicon too?
  • ozzimark - Thursday, February 17, 2005

    i know these are opterons, but are we going to get an overclocking article on the new core soon?
  • skiboysteve - Thursday, February 17, 2005

    It's funny how Intel comes out with SSE, SSE2, SSE3... to compensate for weak x87 FP and a long pipe, but because of marketing AMD has to adopt these instructions as well on a very resilient CPU that doesn't have such pickiness about code... so slap on an SSE2 sticker and the performance is no better.

    you could almost blame the kick ass FP performance?

    I'm not trying to be biased, but I mean, look at the numbers, it's the truth. It takes a lot of work to make a long pipe work great in all areas.
  • Fricardo - Thursday, February 17, 2005

    Do you guys have any word on when the revision E stepping comes out for the Athlon 64's? I wonder how long of a gap AMD wants to leave before releasing their desktop parts.
  • jimmy43 - Thursday, February 17, 2005

    In any case, AMD is slowly catching up to Intel in the media encoding segment... Hey, more features, I'm not complaining!
