Final Words

Finding good SSE3 benchmarks wasn't as easy as we would have liked. Other encoding suites react the same way that DivX and AutoGK do. This seems to indicate that the K8 architecture is simply resilient when it comes to unaligned 128bit loads. In the case of Intel's NetBurst, the lddqu instruction may have more impact.
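For readers who have not met lddqu, the following is a minimal sketch of what an unaligned 128-bit load looks like from C. It is not taken from DivX or AutoGK. The helper name `load16_unaligned` is our own invention, and the `memcpy` fallback is an assumption added so the snippet still builds when SSE3 is not enabled at compile time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef __SSE3__
#include <pmmintrin.h>  /* SSE3 intrinsics, including _mm_lddqu_si128 */
#endif

/* Hypothetical helper: read 16 bytes from a source that may not be
 * 16-byte aligned and write them to out. */
static void load16_unaligned(const uint8_t *src, uint8_t *out)
{
    assert(src != NULL && out != NULL);
#ifdef __SSE3__
    /* lddqu: 128-bit integer load with no alignment requirement. */
    __m128i v = _mm_lddqu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#else
    /* Portable fallback when the compiler is not targeting SSE3. */
    memcpy(out, src, 16);
#endif
}
```

An encoder might use a load like this when a motion search reads a block that starts at an arbitrary byte offset in a frame. Whether it helps depends on how the processor already handles unaligned loads, which is the open question above.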

As far as physics and graphics go, the added instructions show potential in our synthetic test. For DCC, CAD, scientific, and other workstation software, the E4 stepping could offer a bit of a performance boost.
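As an illustration of the kind of vector math involved, here is a hedged sketch of a four-element dot product using the SSE3 horizontal add. We are assuming horizontal adds are representative of what such code would use; this is not the code from our synthetic test. The function name `dot4` and the scalar fallback are ours.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __SSE3__
#include <pmmintrin.h>  /* SSE3 intrinsics, including _mm_hadd_ps */
#endif

/* Hypothetical helper: dot product of two 4-element float vectors. */
static float dot4(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
#ifdef __SSE3__
    /* Multiply lane by lane: (p0, p1, p2, p3). */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* First horizontal add: (p0+p1, p2+p3, p0+p1, p2+p3). */
    __m128 sum = _mm_hadd_ps(prod, prod);
    /* Second horizontal add: every lane holds p0+p1+p2+p3. */
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
#else
    /* Scalar fallback when the compiler is not targeting SSE3. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

Without SSE3, the same reduction needs shuffles between the adds, so this is one place where the new instructions can simplify code even when the speedup is modest.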

In the consumer space, Athlon 64 may not see as much benefit from SSE3, especially since our encoding tests turned up so little performance impact. SSE3 can be used in games, but the impact of this will likely be minimal. As most games will likely remain graphics limited, improvements will have a hard time shining through. Of course, for those who like to use lower cost Athlon 64 processors in cheaper workstations, there could be some advantage.

When we take a look at the Opteron 252 in a workstation environment, we will be able to get a better view of what the total package has to offer. As our workstation tests will be run in a dual-processor (DP) environment, we'll be able to see how the higher bandwidth helps the Opteron shine.

We would like to have tested more applications in this report on SSE3 performance under the new AMD core. Of interest to us are LINPACK, FLOPS, STREAM, and various other tests that would require us to recompile them with proper SSE3 support. As the Intel compiler is designed to optimize for Intel processors, we haven't had a viable source for high quality SSE3 compilation. Hand optimizing these benchmarks for SSE3 on Opteron would take a little more time than this short investigation will allow. We may look into using GCC for this purpose in future tests. As for real world tests using SSE3, we haven't been able to find many suitable candidates beyond video encoders.

It is also likely that current SSE3 optimized code paths will not show their strengths on Opteron/Athlon until the processors have been in developers' hands for a while. The Intel compiler is also head and shoulders above any compiler resource AMD has to offer. But since SSE3 offers more choices for optimization and code simplification, compilers may have an easier time generating efficient code. Hand optimized code is still important for tight loops in critical sections of performance oriented code. In this case, more powerful and simpler options implemented in hardware will help programmers better optimize their own code.
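One reason an existing SSE3 code path might not run on a new AMD part is dispatch logic keyed to the vendor string instead of the feature flag. We have no evidence that any application we tested does this. The sketch below is purely hypothetical, and every name in it is our own. The one hard fact it relies on is that CPUID leaf 1 reports SSE3 in bit 0 of ECX on both Intel and AMD processors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum codepath { PATH_SSE2, PATH_SSE3 };

/* CPUID leaf 1 reports SSE3 support in ECX bit 0. */
static int has_sse3(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & 1u) != 0;
}

/* Fragile (hypothetical): only trusts SSE3 on Intel parts, so an
 * SSE3-capable AMD processor falls back to the SSE2 path. */
static enum codepath pick_by_vendor(const char *vendor, uint32_t leaf1_ecx)
{
    assert(vendor != NULL);
    if (strcmp(vendor, "GenuineIntel") == 0 && has_sse3(leaf1_ecx))
        return PATH_SSE3;
    return PATH_SSE2;
}

/* Robust (hypothetical): decides on the feature bit alone. */
static enum codepath pick_by_feature(uint32_t leaf1_ecx)
{
    return has_sse3(leaf1_ecx) ? PATH_SSE3 : PATH_SSE2;
}
```

The functions take the register value as a parameter so the logic can be checked on any machine; a real program would fill it in from the CPUID instruction. With the vendor string "AuthenticAMD" and the SSE3 bit set, the first function picks the SSE2 path and the second picks SSE3.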

48 Comments

  • aznskickass - Friday, February 18, 2005 - link

    I am actually not very surprised, as you would expect SSE3 to have a much bigger impact on the P4 due to its much longer pipeline and weaker FPU.

    #30, wow, wattage has jumped up a lot with strained silicon (no wonder Prescotts are having trouble, esp. since they don't have SOI)...

    While not yet in Prescott territory, AMD has to keep the wattages in check, I think a ~3GHz chip might be tipping the scales at almost 100W, which *is* Prescott territory!
  • saratoga - Thursday, February 17, 2005 - link

    "#29, XviD is an *UNLICENSED MPEG-4 HACK*. That's just a fact. DivX is a MPEG-4 licensee, XviD is not. "

    This is pretty silly. How could a piece of code get an MPEG4 license? Obviously it can't, which is why neither Xvid nor Divx code is licensed. Only a (compiled) product can be licensed to use MPEG4.

    Anyone selling an MPEG4 product is welcome to use Xvid and it's perfectly legal, but they must pay a license fee for each product sold, as Divx does when you buy their product. It's the same situation as LAME: when you use it without having paid for a license, you're violating some patents. But you're free to license it and then you're in the clear legally.

    Also the irony of calling something a hack and mentioning Divx is simply breathtaking.
  • tygrus - Thursday, February 17, 2005 - link

    "This seems to indicate that the K8 architecture is simply resilient when it comes to unaligned 128bit loads. In the case of Intel's NetBurst, the lddqu instruction may have more impact."
    If you have SSE3 enabled Intel CPUs, then test your hypothesis instead of guessing. It would be interesting to see the absolute and percentage increases in performance for the same tests using equivalent Intel chips. From what I can remember, SSE3 gave Intel little performance increase for previously SSE2 optimised code. There may have been a few artificial test cases that showed large benefits, i.e. deliberately unoptimised SSE2 code versus optimised SSE3 code.

    "As the Intel compiler is designed to optimize for Intel processors, we haven't had a viable source for high quality SSE3 compilation." You may be surprised by the performance of so-called 'Intel optimised' code on AMD systems. I say this particularly because of the old case of PIII and early P4 optimised code showing better AMD Athlon scores at the time.

    It would also be interesting to see the difference in performance with the Opteron 252 with the SSE3 turned off in those benchmarks.

    Like always we will have to wait for further optimisations and validations before we can make a better comparison. The way to investigate the features and implementation is to use hand coded SSE2/3 code for an inner loop and compare performance and behaviour under different conditions. It's like, at the moment we only have one side of a six-sided die.

    The other thing would be to compare the power consumption of the two steppings of Opterons (either at the power point and extrapolate or measure power to mainboard/CPU).

    I see that "23 - Posted on Feb 17, 2005 at 10:37 AM by pxc" has added some useful information using an Intel 3.4GHz P4 F. A 2.4GHz Opteron could be considered to compete with an Intel P4 at 3.6GHz. Others have already made comments similar to mine or provided a different view of the benchmarks given.
  • ChronoReverse - Thursday, February 17, 2005 - link

    #31

    Unlicensed MPEG4 implementation, yes. Hack? Hardly.

    XviD at least implements features as per specifications. DivX tends to add in their own "features" that aren't exactly in spec (although some are understandable given the limitations of the AVI container)

    Choose your words carefully.

    #32

    I'm also inclined to believe that it's simply because DivX's implementation of SSE3 simply doesn't do anything much yet.
  • Jigga - Thursday, February 17, 2005 - link

    Sorry I'm a bit of a n00b when it comes to Divx encoding tests but are you sure the SSE3 codepath was enabled on the Opteron? I'm curious if some apps simply test for core/stepping rather than actual SSE3 ability--maybe DivX wasn't even using the right code??
  • Brunnis - Thursday, February 17, 2005 - link

    #25, pxc, wrote:

    "DivX 5.2 now includes:
    ...
    Encoder: Intel SSE3 (Prescott) Optimizations
    The DivX 5.2 encoder features optimizations for Intel Prescott CPU's, improving performance by up to 15%."

    Is it even remotely possible that DivX skips using SSE3 on the Opteron because it's currently only "meant" to run on the Prescott? I realise that SSE3 should work if the program is correctly written, but one never knows...
  • PetNorth - Thursday, February 17, 2005 - link

    #25

    OK, I looked here http://www.divx.com/divx/divxpro/versions/ and it doesn't mention it, so I thought it didn't have it.

    Anyway, it seems either the DivX SSE3 implementation isn't very good or SSE3 is simply useless. I think it is the first possibility, or that it doesn't really use SSE3, because with TmpegENC Xpress, for example, there is a good improvement.

    I say this because we can see here http://www.tomshardware.com/cpu/20041115/pentium4_... that with AG Knot and DivX 5.2 there is no performance difference at all between P4C (SSE2) and P4E (SSE3) at the same clock speed.
  • pxc - Thursday, February 17, 2005 - link

    #29, XviD is an *UNLICENSED MPEG-4 HACK*. That's just a fact. DivX is a MPEG-4 licensee, XviD is not.
  • PrinceGaz - Thursday, February 17, 2005 - link

    #20- the E4 Opterons have a higher TDP, almost as high as the 130nm CG revision.

    2.2 GHz (x48) Opteron TDP:

    CG - 89W (130nm, SOI)
    D4 - 67W (90nm, SOI)
    E4 - 85.3W (90nm, SOI, strained-silicon)

    It's to be expected that the E revision chips will run hotter than the D revision because strained-silicon increases power consumption (but allows for higher speeds). So long as you have good cooling, the E revision chips should be great overclockers.

    It would be nice to see temperature and system power consumption compared between a D4 x48 and an E4 clocked at 2.2GHz (there are no D4 x50 parts, they were all CG revision).
  • ChronoReverse - Thursday, February 17, 2005 - link

    @27

    XviD certainly doesn't have any SSE3 enhancements, but I do believe that calling them a "mpeg4 hack", when DivX has far more hacks and implements fewer features of MPEG4 ASP, is hardly fair at all.


    And I'd also like to see the difference between the new and old Opterons using only SSE2 so that we can see the difference not due to SSE3.
