Thirteen New Instructions - SSE3

Back at IDF we learned about the thirteen new instructions that Prescott would bring to the world; although they were only referred to as the Prescott New Instructions (PNI) back then, it wasn't tough to guess that their marketing name would be SSE3.

The new instructions are as follows:

FISTTP, ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP, MOVDDUP, LDDQU, HADDPS, HSUBPS, HADDPD, HSUBPD, MONITOR, MWAIT

The instructions can be grouped into the following categories:

x87 to integer conversion
Complex arithmetic
Video Encoding
Graphics
Thread synchronization

Keep in mind that unlike the other Prescott enhancements we've mentioned today, these instructions require updated software before they deliver any benefit. Applications will have to be either recompiled or patched with these instructions in mind. With that said, let's highlight what some of these instructions do.

The FISTTP instruction speeds up x87 floating point to integer conversion by always truncating the result, regardless of the FPU's current rounding mode. It will be used by applications that are not using SSE for their floating point math.
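To make the difference concrete, here is a minimal Python sketch of the two conversion behaviours. It is a scalar model, not x87 code: the function names `fist_default` and `fisttp` are made up for illustration, `fist_default` assumes the FPU is left in its default round-to-nearest-even mode, and out-of-range values are ignored.

```python
import math


def fist_default(x: float) -> int:
    """Model of a plain x87 integer store with the FPU in its default
    round-to-nearest-even mode (Python's round() follows the same rule)."""
    return round(x)


def fisttp(x: float) -> int:
    """Model of FISTTP: always truncate toward zero, whatever the rounding mode."""
    return math.trunc(x)


# Print each input next to both conversions so the differences are visible.
for value in (2.7, -2.7, 2.5, 3.5):
    print(value, fist_default(value), fisttp(value))
```

C and C++ casts such as `(int)x` require truncation, which is why x87 code without FISTTP has to change the rounding mode around each conversion and restore it afterwards.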

The ADDSUBPS, ADDSUBPD, MOVSLDUP, MOVSHDUP and MOVDDUP instructions are all grouped into the realm of "complex arithmetic" instructions. They are mostly designed to reduce latencies in carrying out complex arithmetic operations. One example is the move instructions, which duplicate selected elements of a value as they load it into a register, leaving it ready to be combined with other registers. The remaining complex arithmetic instructions, which subtract in some elements and add in the others, are useful in Fourier transforms and convolution operations, both common in any sort of signal processing (e.g. audio editing) or heavy frequency calculations (e.g. voice recognition).
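A short sketch may help show why these instructions belong together. The Python below models one 128-bit register as a list of four floats and multiplies two pairs of complex numbers the way an SSE3 routine typically would. It is a scalar model for illustration only: `swap_pairs` stands in for an ordinary SSE shuffle, `complex_mul` and the `[re0, im0, re1, im1]` layout are assumptions of this example, and nothing here is Intel's reference code.

```python
from typing import List

Vec4 = List[float]  # models one 128-bit register holding four single-precision floats


def movsldup(v: Vec4) -> Vec4:
    """MOVSLDUP model: duplicate the even-indexed elements -> (v0, v0, v2, v2)."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v: Vec4) -> Vec4:
    """MOVSHDUP model: duplicate the odd-indexed elements -> (v1, v1, v3, v3)."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a: Vec4, b: Vec4) -> Vec4:
    """ADDSUBPS model: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a: Vec4, b: Vec4) -> Vec4:
    """MULPS model (plain SSE): lane-by-lane multiply."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v: Vec4) -> Vec4:
    """Stand-in for an SSE shuffle that swaps real and imaginary parts."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul(a: Vec4, b: Vec4) -> Vec4:
    """Multiply two packed complex numbers per register, laid out [re0, im0, re1, im1]."""
    from_real = mulps(movsldup(a), b)              # (ar*br, ar*bi, ...)
    from_imag = mulps(movshdup(a), swap_pairs(b))  # (ai*bi, ai*br, ...)
    return addsubps(from_real, from_imag)          # (ar*br - ai*bi, ar*bi + ai*br, ...)


# (1+2j)*(5+6j) and (3+4j)*(7+8j) in one pass
print(complex_mul([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
```

The mixed subtract/add of ADDSUBPS is exactly the sign pattern of a complex product, which is what saves the extra sign-flip step older SSE code needed.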

The LDDQU instruction is one Intel is particularly proud of, as it helps accelerate video encoding and is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel's developer documentation.
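LDDQU is an unaligned 128-bit load that Intel documents as potentially faster than the older unaligned load when the data crosses a cache line boundary. The sketch below only counts how often that situation comes up. It assumes a 64-byte cache line and a motion-search style scan that loads 16 bytes at every byte offset along a row; the function names are hypothetical and the access pattern is an illustration, not taken from any codec.

```python
CACHE_LINE = 64  # bytes; assumed cache line size
LOAD_SIZE = 16   # bytes in one 128-bit load


def crosses_cache_line(address: int, size: int = LOAD_SIZE, line: int = CACHE_LINE) -> bool:
    """True when a load of `size` bytes starting at `address` touches two cache lines."""
    return address // line != (address + size - 1) // line


def count_split_loads(start: int, count: int) -> int:
    """Count split loads among `count` loads issued at consecutive byte offsets from `start`."""
    return sum(crosses_cache_line(start + offset) for offset in range(count))


# How many of 64 consecutive one-byte-step loads straddle a line boundary?
print(count_split_loads(0, 64))
```

Under these assumptions roughly a quarter of the loads in such a scan straddle two lines, which is the case an instruction like LDDQU is meant to make cheaper.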

In response to developer requests, Intel has included the following instructions for 3D programs (e.g. games): HADDPS, HSUBPS, HADDPD, HSUBPD. Intel told us that developers are more than happy with these instructions, but just to make sure, we asked our good friend Tim Sweeney, Founder and Lead Developer of Epic Games Inc (the creators of Unreal, Unreal Tournament, Unreal Tournament 2003 and 2004). Here's what he had to say:

Most 3D programmers have been requesting a dot product instruction (similar to the shader assembly language dp4 instruction) ever since the first SSE spec was sent around, and the HADDP is a piece of a dot product operation: a pmul followed by two haddp's is a dot product.

This isn't exactly the instruction developers have been asking for, but it allows for performing a dot product in fewer instructions than was possible in the previous SSE versions. Intel's approach with HADDP and most of SSE in general is more rigorous than the shader assembly language instructions. For example, HADDP is precisely defined relative to the IEEE 754 floating-point spec, whereas dp4 leaves undefined the order of addition and the rounding points of the component additions, so different hardware implementing dp4 might return different results for the same operation, whereas that can't happen with HADDP.
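The "multiply followed by two horizontal adds" recipe can be sketched as a scalar model in Python. Registers are modelled as lists of four floats, `mulps` stands for the packed multiply Tim calls pmul, and `dot4` is a name made up for this example.

```python
from typing import List

Vec4 = List[float]  # models one 128-bit register holding four single-precision floats


def mulps(a: Vec4, b: Vec4) -> Vec4:
    """Packed multiply model: lane-by-lane product."""
    return [x * y for x, y in zip(a, b)]


def haddps(a: Vec4, b: Vec4) -> Vec4:
    """HADDPS model: add adjacent pairs -> (a0+a1, a2+a3, b0+b1, b2+b3)."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def dot4(a: Vec4, b: Vec4) -> float:
    """Four-component dot product: one multiply and two horizontal adds."""
    products = mulps(a, b)                # (p0, p1, p2, p3)
    partial = haddps(products, products)  # (p0+p1, p2+p3, p0+p1, p2+p3)
    total = haddps(partial, partial)      # every lane holds (p0+p1) + (p2+p3)
    return total[0]


print(dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
```

Note that the order of additions is fixed as (p0+p1) + (p2+p3), which is the kind of well-defined behaviour the quote contrasts with dp4.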

As far as where these instructions are used, Tim had the following to say:

Dot products are a fundamental operation in any sort of 3D programming scenario, such as BSP traversal, view frustum tests, etc. So it's going to be a measurable performance component of any CPU algorithm doing scene traversal, collision detection, etc.

The HSUBP ops are just HADDP ops with the second argument's sign reversed (sign-reversal is a free operation on floating-point values). It's natural to support a subtract operation wherever one supports an add.
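That relationship is easy to check with a scalar model. The sketch below reads "the second argument" as the second value in each adjacent pair, which is an interpretation of the quote; `negate_odd_lanes` is a hypothetical helper, not an instruction.

```python
from typing import List

Vec4 = List[float]  # models one 128-bit register holding four single-precision floats


def haddps(a: Vec4, b: Vec4) -> Vec4:
    """HADDPS model: (a0+a1, a2+a3, b0+b1, b2+b3)."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def hsubps(a: Vec4, b: Vec4) -> Vec4:
    """HSUBPS model: (a0-a1, a2-a3, b0-b1, b2-b3)."""
    return [a[0] - a[1], a[2] - a[3], b[0] - b[1], b[2] - b[3]]


def negate_odd_lanes(v: Vec4) -> Vec4:
    """Flip the sign of the second value in each adjacent pair."""
    return [v[0], -v[1], v[2], -v[3]]


a = [1.5, 2.25, -3.0, 4.0]
b = [0.1, 0.2, 0.3, 0.4]
# Horizontal subtract equals horizontal add with the odd lanes sign-flipped.
print(hsubps(a, b) == haddps(negate_odd_lanes(a), negate_odd_lanes(b)))
```

Because flipping a floating-point sign is exact, the two forms agree bit for bit, which is why supporting the subtract variants costs so little.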

So the instructions are useful and will lead to performance improvements down the road in games that take advantage of them. The instructions aren't everything developers have wanted, but it's good to see that Intel is paying attention to the game development community, something it has done a poor job of in the past.

Finally we have the two thread synchronization instructions: MONITOR and MWAIT. These two instructions work hand in hand to improve Hyper-Threading performance. MONITOR sets up a watch on a range of memory and MWAIT parks the logical processor until something writes to that range, so a thread with nothing useful to do, such as the OS' idle thread or another non-productive thread generated by a device driver, can wait without taking resources away from whatever more useful thread the core is working on at the time. Unfortunately MONITOR and MWAIT will both require OS support to be used, meaning that we will be waiting for either Longhorn or the next Service Pack of Windows for these two instructions.
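As a loose software analogy for the "watch a location, then sleep until it is written" pattern behind MONITOR and MWAIT, the Python sketch below parks a thread on an event instead of letting it spin. This is only an analogy: the real instructions operate on a memory address range inside the processor, and the names `mailbox`, `written` and `idle_thread` are invented for this example.

```python
import threading

mailbox = {"work": None}     # the location the idle thread cares about
written = threading.Event()  # plays the role of the monitored address range
results = []


def idle_thread() -> None:
    # MONITOR analogue: `written` is the thing being watched.
    # MWAIT analogue: block here without burning CPU time until a write happens.
    written.wait()
    results.append(mailbox["work"])


waiter = threading.Thread(target=idle_thread)
waiter.start()

mailbox["work"] = "packet 1"  # the "store" to the watched location
written.set()                 # wakes the waiting thread
waiter.join()

print(results)
```

The alternative, a loop that keeps re-reading `mailbox` until it changes, is the kind of busy waiting that steals execution resources from the other logical processor on a Hyper-Threading core.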

Intel would not confirm whether the instructions can be enabled by a simple service pack update; they simply indicated that they were working with Microsoft on including support for them. We'd assume that Intel would be a bit more excited if it could bring the instructions to Prescott users via a simple service pack update, so the muted response may indicate that we will have to wait for the next version of Windows before seeing these two in use.


104 Comments


  • terrywongintra - Monday, February 2, 2004

    Has anybody benchmarked Prescott against Northwood in an entry-server environment? I'm installing 3 servers later using the Intel 875P (S875WP1-E) entry server board and a P4 2.8, and need to decide whether to use Prescott or Northwood.
  • sipc660 - Monday, February 2, 2004

    i don't understand why some people are bashing such a good innovation that was long overdue from intel.

    a pc that doubles as a heater and at only 100-200W power consumption.

    Let me remind you that a conventional fan heater eats up about a kilowatt of power.

    Think positive

    * space reduction
    * enormous power savings (pc + fan heater)
    * extremely sophisticated looking fan heater
    * extremely safe casing. reduces burn injuries to pets and children.
    * finely tunable temperature settings (only need to overclock by small increments)
    * coupled with an lcd it features the best looking temperature adjustment one has ever witnessed on a heater
    * child proof as it features thermal shutdown
    * anyone having a laugh thus far
    * will soon feature on american idol: the worst singers will receive one p4 E based unit each. That should make people think twice about auditioning thus making sure only true talent shows up.
    * gives dell new marketing potential and a crack at a long desired consumer heating electronic
    * amd is nowhere near this advancement in thermal technology leaving intel way ahead

    hope you enjoyed some of my thoughts

    Other than that good article and some good comments.

    on another note i don't understand why people run and fill intel's pockets so intel can hide their engineering mistakes with unseen propaganda, while there is an obvious choice.

    choice is Advanced Micro Devices all until intel gets their act together.

    go amd...
  • Stlr22 - Monday, February 2, 2004

    INTC - "Intel roadmap says Prescott will hit 4.2 GHz by Q1 '05. My guess is that it is already running at 4 GHz but just needs to be fine tuned to reduce the heat."

    Maybe they are trying to keep it under the 200 watt mark? ;-)
  • INTC - Monday, February 2, 2004

    I think CRAMITPAL must have sat on a hot Prescott and got it stuck where the sun doesn't shine - that would explain all of the yelling and screaming and friggin this and friggin that going on. "Approved mobo, approved PC case cooling system, approved heatsink & fan - and you better not use Arctic Silver or else it will void your warranty..." gee - didn't we just hear that when Athlon XPs came out? It brings to mind when TechTV put their dual Athlon MP rig together and it started smoking and catching on fire when they fired it up the first time on live television during their show.

    Intel roadmap says Prescott will hit 4.2 GHz by Q1 '05. My guess is that it is already running at 4 GHz but just needs to be fine tuned to reduce the heat. I bet the experts (or self proclaimed experts such as CRAM) were betting that Northwood could not hit 3 GHz and look where it is today. Video card GPUs today are hitting 70 degrees C plus at full load but they do fine with cooling in the same PC cases.
  • CRAMITPAL - Monday, February 2, 2004

    Dealing with the FLAME THROWER's heat issues is only one aspect of Prescott's problems. The chip is a DOG and it requires an "approved Mobo" and an "approved PC case cooling system", a premo PSU cause the friggin thing draws 100+ Watts and this crap all costs money you don't need to spend on an A64 system that is faster, runs cooler, and does both 32/64 bit processing faster. How difficult is THIS to comprehend???

    Ain't no way Intel is gonna be able to Spin this one despite the obvious "press material" they supplied to all the reviewers to PIMP that Prescott was designed to reach 5 Gigs. Pigs will fly lightyears before Prescott runs at 5 Gigs.

    Time to GET REAL folks. Prescott sucks and every hardware review site politely stated so in "political speak".
  • Stlr22 - Monday, February 2, 2004

    ((((((((((((CRAMITPAL)))))))))))))))

    It's ok man. It's ok. Everything will be alright.

    ;-)
  • scosta - Monday, February 2, 2004

    #38 - About your "Did anyone catch the error in Pipelining: 101?".

    There is no error. The time it takes to travel the pipeline is just a kind of process delay. What matters is the rate at which finished/processed results come out of the pipeline. In the case of the 0.5ns/10 stage pipeline you will get one finished result every 0.5ns, twice as many as in the case of the 1ns/5 stage pipeline.

    If the pipelines were building motorcycles, you would get, respectively, 1 and 2 motorcycles every ns from the 5 stage and 10 stage pipelines. And that is the point.
  • LordSnailz - Monday, February 2, 2004

    I'm sure the Prescotts will get hotter as the speed increases but you can't forget there are companies out there that specialize in this area. There are 3 companies that I know of that are doing research on ways to reduce the heat; for instance, they're planning on placing a piece of silicon with etched lines on top of the CPU and running some type of coolant through it. Much like the radiator concept.

    My point is, Intel doesn't have to worry about the heat too much since there are companies out there fighting that battle. Intel will just concentrate on achieving those higher speeds and the temp control solution will come.
  • scosta - Monday, February 2, 2004

    You can find thermal power information in the also excellent "Aces Hardware" Prescott review here:
    http://www.aceshardware.com/read.jsp?id=60000317

    In summary, we have the following Typical Thermal Power:
    P4 3.2 GHz (Northwood) - 82W
    P4E 3.2 GHz (Prescott) - 103W

    Note that, at the same clock speed and with the same or lesser performance, the Prescott dissipates about 25% more power than Northwood. This means that with a similar cooling system, the Prescott has to run substantially hotter.

    As Aces Hardware says,
    "After running a 3DSMax rendering and restarting the PC, the BIOS reported that the 3.2 GHz Northwood was at about 45-47°C, while Prescott was flirting with 64-66°C. Mind you, this is measured on a motherboard completely exposed to the cool air (18°C) of our lab."

    So, what will the ~5 GHz Prescott dissipate? 200W?
    Will we all be forced to run PCs with bulky, expensive, etc., cryogenic cooling systems? I for one won't. This power consumption escalation has to stop. Intel and AMD have to improve the performance of their CPUs by improving the CPU architecture and manufacturing processes, not by throwing more and more electrical power at the problem.

    And those are my 2 cents.
  • CRAMITPAL - Monday, February 2, 2004

    Prescott will never go above 3.8 Gig. even with the 3rd revision of the 90 nano process. Tejas will make it to just over 4.0 Gig. with a little luck but it won't be anything to write home about either based on current knowledge.

    Intel has fallen and can't get it up!
