Thirteen New Instructions - SSE3

Back at IDF we learned about the thirteen new instructions that Prescott would bring to the world; although they were only referred to as the Prescott New Instructions (PNI) back then, it wasn't tough to guess that their marketing name would be SSE3.

The new instructions are as follows:

FISTTP, ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP, MOVDDUP, LDDQU, HADDPS, HSUBPS, HADDPD, HSUBPD, MONITOR, MWAIT

The instructions can be grouped into the following categories:

x87 to integer conversion
Complex arithmetic
Video Encoding
Graphics
Thread synchronization

Keep in mind that, unlike the other Prescott enhancements we've mentioned today, these instructions require updated software before they offer any benefit. Applications will have to be either recompiled or patched with these instructions in mind. With that said, let's highlight what some of these instructions do.

The FISTTP instruction speeds up x87 floating-point to integer conversion: it stores the value as an integer using truncation, regardless of the rounding mode currently set in the x87 control word. It will be used by applications that are not using SSE for their floating-point math.
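To make the conversion issue concrete, here is a minimal C sketch of the two behaviours involved. It is only an illustration, not Intel's code: the function names are hypothetical, and `to_int_nearest_even` is a hand-written model of the default x87 rounding mode (round to nearest, ties to even) that the older FISTP instruction follows. A C cast has to truncate, which is the behaviour FISTTP provides in one instruction.

```c
/* Truncating conversion: a C cast always rounds toward zero.
   This is the behaviour FISTTP delivers directly. */
long to_int_truncate(double x) {
    return (long)x;
}

/* Hypothetical model of a conversion that follows the default x87
   rounding mode (round to nearest, ties to even), as FISTP does. */
long to_int_nearest_even(double x) {
    long t = (long)x;             /* integer part, truncated toward zero */
    double frac = x - (double)t;  /* signed fractional remainder */
    if (frac > 0.5 || (frac == 0.5 && (t & 1)))
        return t + 1;             /* round up, or break a tie to even */
    if (frac < -0.5 || (frac == -0.5 && (t & 1)))
        return t - 1;             /* same for negative values */
    return t;
}
```

The two functions disagree on a value such as 2.7, which is why code that needs C-style truncation previously had to switch the rounding mode around each conversion.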

The ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP and MOVDDUP instructions all fall into the realm of "complex arithmetic" instructions. They are mostly designed to reduce the latency of arithmetic on complex numbers. The move instructions, for example, load a value into a register while duplicating selected elements of it, so that it is ready to be combined with other registers. The remaining complex arithmetic instructions, ADDSUBPS and ADDSUBPD, subtract in some lanes and add in the others; they are particularly useful in Fourier Transforms and convolution operations, which are common in any sort of signal processing (e.g. audio editing) or heavy frequency calculations (e.g. voice recognition).
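As a rough illustration of how these instructions fit together, the sketch below models them in plain scalar C and uses them to multiply two pairs of complex numbers stored as [re, im, re, im]. This is a model of the instruction semantics, not real intrinsics code: the `vec4` type and all function names are assumptions made for the example, and `swap_pairs` stands in for an ordinary SSE shuffle.

```c
/* Scalar model of a 128-bit register holding four floats. */
typedef struct { float f[4]; } vec4;

/* MOVSLDUP: duplicate the even-indexed elements. */
static vec4 movsldup(vec4 s) {
    vec4 r = {{ s.f[0], s.f[0], s.f[2], s.f[2] }};
    return r;
}

/* MOVSHDUP: duplicate the odd-indexed elements. */
static vec4 movshdup(vec4 s) {
    vec4 r = {{ s.f[1], s.f[1], s.f[3], s.f[3] }};
    return r;
}

/* ADDSUBPS: subtract in even lanes, add in odd lanes. */
static vec4 addsubps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] - b.f[0], a.f[1] + b.f[1],
                a.f[2] - b.f[2], a.f[3] + b.f[3] }};
    return r;
}

/* MULPS (original SSE): element-wise multiply. */
static vec4 mulps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] * b.f[0], a.f[1] * b.f[1],
                a.f[2] * b.f[2], a.f[3] * b.f[3] }};
    return r;
}

/* Stand-in for a shuffle that swaps re and im within each pair. */
static vec4 swap_pairs(vec4 s) {
    vec4 r = {{ s.f[1], s.f[0], s.f[3], s.f[2] }};
    return r;
}

/* Multiply two complex numbers in x by the two in y:
   (a+bi)(c+di) = (ac - bd) + (ad + bc)i */
vec4 complex_mul2(vec4 x, vec4 y) {
    vec4 t1 = mulps(movsldup(x), y);             /* [ac, ad, ...] */
    vec4 t2 = mulps(movshdup(x), swap_pairs(y)); /* [bd, bc, ...] */
    return addsubps(t1, t2);                     /* [ac-bd, ad+bc, ...] */
}
```

The point of the sketch is that the duplicate-on-move and add/subtract steps each map to a single instruction, rather than to a sequence of shuffles and sign flips.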

The LDDQU instruction is one Intel is particularly proud of, as it helps accelerate video encoding and is already implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel's developer documentation.

In response to developer requests, Intel has included the following instructions for 3D programs (e.g. games): HADDPS, HSUBPS, HADDPD and HSUBPD. Intel told us that developers are more than happy with these instructions, but just to make sure we asked our good friend Tim Sweeney - Founder and Lead Developer of Epic Games Inc (the creators of Unreal, Unreal Tournament, Unreal Tournament 2003 and 2004). Here's what he had to say:

Most 3D programmers have been requesting a dot product instruction (similar to the shader assembly language dp4 instruction) ever since the first SSE spec was sent around, and the HADDP is a piece of a dot product operation: a pmul followed by two haddp's is a dot product.

This isn't exactly the instruction developers have been asking for, but it allows for performing a dot product in fewer instructions than was possible in the previous SSE versions. Intel's approach with HADDP and most of SSE in general is more rigorous than the shader assembly language instructions. For example, HADDP is precisely defined relative to the IEEE 754 floating-point spec, whereas dp4 leaves undefined the order of addition and the rounding points of the components additions, so different hardware implementing dp4 might return different results for the same operation, whereas that can't happen with HADDP.
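The "multiply followed by two horizontal adds" recipe can be sketched with a scalar model in C. This is an illustration under stated assumptions rather than production code: `vec4` and the function names are hypothetical, and the packed multiply is modeled on MULPS, which is our reading of the "pmul" in the quote.

```c
/* Scalar model of a 128-bit register holding four floats. */
typedef struct { float f[4]; } vec4;

/* MULPS (original SSE): element-wise multiply. */
static vec4 mulps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] * b.f[0], a.f[1] * b.f[1],
                a.f[2] * b.f[2], a.f[3] * b.f[3] }};
    return r;
}

/* HADDPS: add adjacent pairs; the low half of the result comes
   from the first operand, the high half from the second. */
static vec4 haddps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                b.f[0] + b.f[1], b.f[2] + b.f[3] }};
    return r;
}

/* Four-component dot product: one multiply, two horizontal adds. */
float dot4(vec4 a, vec4 b) {
    vec4 p  = mulps(a, b);    /* [p0, p1, p2, p3] */
    vec4 h1 = haddps(p, p);   /* [p0+p1, p2+p3, p0+p1, p2+p3] */
    vec4 h2 = haddps(h1, h1); /* every lane holds the full sum */
    return h2.f[0];
}
```

Note that the order of the additions is fixed by the instruction definition, pairs first and then the pair sums, which is the precision guarantee discussed above.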

As far as where these instructions are used, Tim had the following to say:

Dot products are a fundamental operation in any sort of 3D programming scenario, such as BSP traversal, view frustum tests, etc. So it's going to be a measurable performance component of any CPU algorithm doing scene traversal, collision detection, etc.

The HSUBP ops are just HADDP ops with the second argument's sign reversed (sign-reversal is a free operation on floating-point values). It's natural to support a subtract operation wherever one supports an add.
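The relationship between the two operations can be checked with a small scalar model in C. As before, this is only a sketch: `vec4`, `negate_odd` and the other names are assumptions made for the example, not part of any Intel API.

```c
/* Scalar model of a 128-bit register holding four floats. */
typedef struct { float f[4]; } vec4;

/* HADDPS: add adjacent pairs from each operand. */
static vec4 haddps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                b.f[0] + b.f[1], b.f[2] + b.f[3] }};
    return r;
}

/* HSUBPS: subtract the second element of each pair from the first. */
vec4 hsubps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] - a.f[1], a.f[2] - a.f[3],
                b.f[0] - b.f[1], b.f[2] - b.f[3] }};
    return r;
}

/* Flip the sign of the second element in every pair. */
static vec4 negate_odd(vec4 v) {
    vec4 r = {{ v.f[0], -v.f[1], v.f[2], -v.f[3] }};
    return r;
}

/* The same result built from HADDPS plus a sign flip. */
vec4 hsubps_via_hadd(vec4 a, vec4 b) {
    return haddps(negate_odd(a), negate_odd(b));
}
```

Both functions return the same values for the same inputs, which is the sense in which a horizontal subtract is a horizontal add with one sign reversed in each pair.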

So the instructions are useful and will lead to performance improvements down the road in games that take advantage of them. The instructions aren't everything developers have wanted, but it's good to see that Intel is paying attention to the game development community, which is something it has done a poor job of in the past.

Finally we have the two thread synchronization instructions - MONITOR and MWAIT. These two instructions work hand in hand to improve Hyper-Threading performance. MONITOR tells the processor to watch a range of memory addresses, and MWAIT then lets a logical processor wait in an optimized state until something writes to that range. This gives the OS's idle thread, and other non-productive threads generated by device drivers, a way to wait without taking execution resources away from whatever more useful thread the core is working on at the time. Unfortunately MONITOR and MWAIT both require OS support to be used, meaning that we will be waiting for either Longhorn or the next Service Pack of Windows before these two instructions see any use.

Intel would not confirm whether the instructions can be enabled by a simple service pack update; they simply indicated that they were working with Microsoft on including support for them. We'd assume that Intel would be a bit more excited if it could bring the instructions to Prescott users via a simple service pack update, which may indicate that we will have to wait for the next version of Windows before seeing these two in use.


104 Comments


  • Jeff7181 - Sunday, February 1, 2004 - link

    I'm going to go out on a limb here and say 2004 is the year of the Athlon-64 and Intel will take a back seat this year unless their new socket will help increase clock speeds. When AMD makes the transition to 90nm I think you'll see a jump in clock speed from them too... and I'm willing to bet their current 130nm processors will scale to 2.6 or 2.8 GHz if they want to put the effort into it before switching to 90nm.

    Intel better hope people adopt SSE3 in favor of AMD-64, otherwise they're going to lose the majority of the benchmark tests.

    On second thought... the real question is how high will Prescott scale... will we really see 4.0 GHz by the end of the year? Will performance scale as well as it does with the Athlon-64?

    Right now, looking at the Prescott, the best I can say for it is "huh, 31 stages in the pipeline and they didn't lose too much performance, neat."
  • Barkuti - Sunday, February 1, 2004 - link

    Check out the article at xbitlabs:

    http://www.xbitlabs.com/articles/cpu/display/presc...

    Less technical but with a wider set of tests.
  • Stlr22 - Sunday, February 1, 2004 - link

    ;-)
  • Stlr22 - Sunday, February 1, 2004 - link

    ((((((((((((((CRAMITPAL))))))))))))))))

    Listen, I just want you to know that everything will be alright. Really, life isn't all that bad buddy. It's not good to keep so much hate inside. It's very unhealthy. We are all family here at the Anandtech forums and we care about you. If you ever need to sit down and talk, I'm all ears pal. So that your brother doesn't feel left out, here's a hug for him as well.......


    (((((((((((((AMDjihad)))))))))))))
  • KF - Sunday, February 1, 2004 - link

    Yeah, the Inquirer was right about 30 stages. Maybe I should start reading it! However I did read the one where the news linked to an article purporting that an Inquirer reporter had bumped into a person who had overheard an Intel executive say Prescott was 64 bit. Maybe Derek and Anand didn't have the space to squeeze that tiny detail into the review.

    I saw a paper on the Intel site a while ago, seemingly intended for some professional journal, the premise of which was that it is ALWAYS preferable to make the pipeline longer, no matter how long, while using techniques to reduce the penalties. Like, 100 stages would be a good thing. Right then I knew what one team at Intel was up to. The fact that they didn't explain any new penalty reduction techniques only made it all the more sure what Intel had in the works (otherwise why write the paper?), and that they had the techniques worked out, but still under wraps.
  • ianwhthse - Sunday, February 1, 2004 - link

    Err.. *Cramitpal

    Sorry about that. My mind is wandering.
  • ianwhthse - Sunday, February 1, 2004 - link

    Did we actually just get 26 good posts in before crumpet showed up?
  • FiberOptik - Sunday, February 1, 2004 - link

    I like the part about the new shift/rotate unit on the CPU. Does this mean that Prescott will be noticeably faster for the RC5 project? Athlons usually mop the floor with whatever the Northwood can pump out.
  • eBauer - Sunday, February 1, 2004 - link

    "Botmatch has bots (AI) playing, shooting, running, etc. (deathmatch) while Flyby does not. The number that you should be most interested in is the Botmatch scores."

    No, I am talking about the botmatch scores from previous articles. I'm well aware of the difference between flyby and botmatch. http://www.anandtech.com/cpu/showdoc.html?i=1946&a... In that article, all CPUs had about 10 more fps than the CPUs in the Prescott article.




  • AnonymouseUser - Sunday, February 1, 2004 - link

    "I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?"

    Botmatch has bots (AI) playing, shooting, running, etc. (deathmatch) while Flyby does not. The number that you should be most interested in is the Botmatch scores.
