Thirteen New Instructions - SSE3

Back at IDF we learned about the thirteen new instructions that Prescott would bring to the world; although they were only referred to as the Prescott New Instructions (PNI) back then, it wasn't tough to guess that their marketing name would be SSE3.

The new instructions are as follows:

FISTTP, ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP, MOVDDUP, LDDQU, HADDPS, HSUBPS, HADDPD, HSUBPD, MONITOR, MWAIT

The instructions can be grouped into the following categories:

x87 to integer conversion
Complex arithmetic
Video Encoding
Graphics
Thread synchronization

You have to keep in mind that unlike the other Prescott enhancements we've mentioned today, these instructions do require updated software to take advantage of. Applications will either have to be recompiled or patched with these instructions in mind. With that said, let's get to highlighting what some of these instructions do.

The FISTTP instruction is useful in x87 floating point to integer conversion, which is an instruction that will be used by applications that are not using SSE for their floating point math.

The ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP and MOVDDUP instructions are all grouped into the realm of "complex arithmetic" instructions. These instructions are mostly designed to reduce latencies in carrying out some of these complex arithmetic instructions. One example are the move instructions, which are useful in loading a value into a register and adding it to other registers. The remaining complex arithmetic instructions are particularly useful in Fourier Transforms and convolution operations - particularly common in any sort of signal processing (e.g. audio editing) or heavy frequency calculations (e.g. voice recognition).

The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel's developer documentation here.

In response to developer requests Intel has included the following instructions for 3D programs (e.g. games): haddps, hsubps, haddpd, hsubpd. Intel told us that developers are more than happy with these instructions, but just to make sure we asked our good friend Tim Sweeney - Founder and Lead Developer of Epic Games Inc (the creators of Unreal, Unreal Tournament, Unreal Tournament 2003 and 2004). Here's what he had to say:

Most 3D programmers been requesting a dot product instruction (similar to the shader assembly language dp4 instruction) ever since the first SSE spec was sent around, and the HADDP is piece of a dot product operation: a pmul followed by two haddp's is a dot product.

This isn't exactly the instruction developers have been asking for, but it allows for performing a dot product in fewer instructions than was possible in the previous SSE versions. Intel's approach with HADDP and most of SSE in general is more rigorous than the shader assembly language instructions. For example, HADDP is precisely defined relative to the IEEE 754 floating-point spec, whereas dp4 leaves undefined the order of addition and the rounding points of the components additions, so different hardware implementing dp4 might return different results for the same operation, whereas that can't happen with HADDP.

As far as where these instructions are used, Tim had the following to say:

Dot products are a fundamental operation in any sort of 3D programming scenario, such as BSP traversal, view frustum tests, etc. So it's going to be a measurable performance component of any CPU algorithm doing scene traversal, collision detection, etc.

The HSUBP ops are just HADDP ops with the second argument's sign reversed (sign-reversal is a free operation on floating-point values). It's natural to support a subtract operation wherever one supports an add.

So the instructions are useful and will lead to performance improvements in games that do take advantage of them down the road. The instructions aren't everything developers have wanted, but it's good to see that Intel is paying attention to the game development community, which is something they have done a poor job of doing in the past.

Finally we have the two thread synchronization instructions - monitor and mwait. These two instructions work hand in hand to improve Hyper Threading performance. The instructions work by determining whether a thread being sent to the core is the OS' idle thread or other non-productive threads generated by device drivers and then instructing the core to worry about those threads after working on whatever more useful thread it is working on at the time. Unfortunately monitor and mwait will both require OS support to be used, meaning that we will either be waiting for Longhorn or the next Service Pack of Windows for these two instructions.

Intel would not confirm whether the instructions can be used in a simple service pack update; they simply indicated that they were working with Microsoft of including support for them. We'd assume that they would be a bit more excited about the ability to bring the instructions to Prescott users via a simple service pack update, maybe indicating that we will have to wait for the next version of Windows before seeing these two in use.

Larger, Slower Cache Half-Time Summary
Comments Locked

104 Comments

View All Comments

  • mattsaccount - Sunday, February 1, 2004 - link

    From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."
  • eBauer - Sunday, February 1, 2004 - link

    I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?
  • Pumpkinierre - Sunday, February 1, 2004 - link

    Sorry errata on #20 that was 3.0 Northood result is out of kilter with other cpus in dtata analysis sysmark 2004.
  • Pumpkinierre - Sunday, February 1, 2004 - link

    JFK,Vietnam,Nixon,Monica,Bush/Gore,Iraq and now this! - what is going on with the leader of the free world.I hope it overclocks well- that's all that's going for it. Maybe Intel should rethink their multiplier locked policy. AMD must get in there and profit. I still dont understand why the caches are running at half the latency as Northood if they are the same speed and structure? Is it as a result of a doubling in size for the same associativity?

    Good article- needs re-rereading after digestion. Last chart in Sysmark2004 (data analysis) has 3.0 Prescott totally outperformed by 2.8 Prescott and all other cpus. Look like a benchmark/typing glitch.
  • yak8998 - Sunday, February 1, 2004 - link

    first the error:
    pg 9 -
    The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.

    No link?

    ===
    "What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "

    I'm going to guess a clean running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out rite now...

    "Any of you know what the cache size on the EE's will be?"

    If your talking about the Northwood (the p4c's are still considered northwoods, no?), its 1mb I believe.
    (still finishing the article. man i love these in-depth technical articles)
  • Tiorapatea - Sunday, February 1, 2004 - link

    I agree, some info on power consumption please.

    Thanks for the article, by the way.

    I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
  • AgaBooga - Sunday, February 1, 2004 - link

    Much better than the P4's origional launch...

    All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.

    Any of you know what the cache size on the EE's will be?

    Also, the final CPU's based on Northwood are kind of like a car with the ratio curves or whatever they're called, but basically after a point of revving, going any higher doesn't give you as much of an increase in speed as it would at a lower rpm increasing the same amount.
  • Cygni - Sunday, February 1, 2004 - link

    AMD's roadmap shows a 4000+ Athlon64 by the end of the year... which is the same as Intel's. They are aware, im sure.
  • Stlr22 - Sunday, February 1, 2004 - link

    What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
  • HammerFan - Sunday, February 1, 2004 - link

    Things are gonna get hairy in '04 and '05!!! My take is that AMD nees to get their marketing up-to-spec or the high-clocked prescotts are gonna run the show.

    I have a question for Derek and Anand: What kind of temps does the prescott run at? what type of cooler does it have? (there's nothing there to support or refute claims that the prescott is one hot potato)

Log in

Don't have an account? Sign up now