First Thoughts

Unlike AMD's Lightweight Profiling proposal, today we're not looking at an idea for a possible future, but at a specification for something that will happen. While AMD is still soliciting feedback on that earlier proposal, as of today they are moving full speed ahead with SSE5. An SSE5 software simulator should be released today that will allow developers to experiment with SSE5-optimized code and see how well it will perform, well over a year before they can get real engineering sample processors to test their code on. 2008 and beyond will see SSE5 support coming to the GCC compiler, along with AMD's optimized software libraries. AMD is going to great lengths to spur the adoption of SSE5.

It bears mentioning, however, that there are some potential issues between now and then. First and foremost is the 800-pound gorilla: Intel. Under normal conditions Intel's support is critical for new extensions to take off. The only exception has been AMD64/x86-64, which came at a time when that specification was far better suited to the computing industry than Intel's own IA64 specification. Simply put, we believe AMD will be unsuccessful with SSE5 if Intel is unwilling to add it to their own processors; the performance improvements are important, but this is not an AMD64 situation where AMD has the influence and technical advantages to get what it wants. Without Intel's support, developers will code for the least common denominator among the extensions or, worse, will focus on any Intel-only extensions should the two CPU families diverge further, something that is now more likely given AMD's failure to include full SSE4 support on Bulldozer.

There's also the issue of the all-important compiler. Although AMD will have SSE5 support in GCC, and likely in the Visual Studio compiler too, if Intel does not support SSE5 at the hardware level it will not include it in its compiler either. Intel's compilers are near-legendary in their ability to optimize code in the right situations (and in snubbing AMD chips at times), and developers working on computationally intensive applications are likely to be unwilling to move away from them.

Finally, there are AMD's own hardware issues. It's impossible to predict what their situation will be in two years, but we are always more concerned about them than about Intel because of AMD's operating position. SSE5 is reliant on Bulldozer, so AMD will need to get Bulldozer out on time if they want SSE5 to launch without a hitch. A late CPU could cause SSE5 to miss its chance to get into new software development cycles.

But with those warnings out of the way, don't take this as disapproval of SSE5. We believe that this will be the most important extension to the x86 instruction set since SSE2, and new instructions like MADD in particular can offer the kind of performance improvements AMD wants to hit without having to increase clock speeds significantly and deal with the problems that come with doing so. What that performance increase will be in actual applications, however, is something we will have to wait for Bulldozer silicon to find out.
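
To make the appeal of MADD concrete: a multiply-add instruction computes a product and a sum in one operation, which is the inner step of loops such as a dot product. The following is a minimal C sketch of that pattern, not SSE5 code. The function name `dot_product` is our own illustration, and whether a given compiler maps the loop body onto a single multiply-add instruction depends on the compiler and the target hardware.

```c
#include <stddef.h>

/* Hypothetical example: a dot product written as a chain of
   multiply-accumulate steps. Each iteration computes
   acc = a[i] * b[i] + acc, which is the operation a multiply-add
   instruction is meant to perform in one step instead of two. */
float dot_product(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = a[i] * b[i] + acc;  /* one multiply and one dependent add */
    }
    return acc;
}
```

Without a multiply-add instruction, each iteration needs a separate multiply followed by an add that depends on its result; with one, the pair can be issued as a single instruction.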

Finally, this isn't the last announcement from AMD we will see on the subject of AMD's instruction set performance initiative. We're still waiting on the rest of the proposals for the Hardware Extensions for Software Parallelism to be released, at which point we'll have a better idea of how AMD is going to tackle thread-level parallelism in the next few years. AMD hasn't put a date on that, but we'd expect something before the end of the year.

17 Comments

  • skiboysteve - Friday, August 31, 2007 - link

    this is stupid. they are adding SSE5 before SSE4. wow.
  • tygrus - Friday, August 31, 2007 - link

    SSE numbers/descriptions are becoming like model numbers: confusing and virtually meaningless. Need a CPU core with microcode to convert non-native SSE? instructions into a sequence of native instructions (micro-ops/macro-ops).
    If that doesn't happen then the compilers may need to re-write the code sequences for target(s) at compile time or execution.
  • yyrkoon - Thursday, August 30, 2007 - link

    Just from what I have seen in the past, whenever AMD does something like this, Intel tries to separate themselves by going in a different direction. This is why I think Intel will rename their future instruction sets to something else.

    If AMD and Intel were to actually work together on this, then maybe Intel would opt in on some of the better portions of the instruction set that enhanced their CPUs, but somehow I do not think this is the case.

    I watch the Intel/AMD 'rivalry' from the outside looking in, and I see the Coke/Pepsi 'war' all over again. Little kids going so far as to pull an engine out of a new delivery truck and paint it a color other than blue, because that *is* their rival's color . . . At what cost to your shareholders? Nonsense!
  • jeromekwok - Thursday, August 30, 2007 - link

    I can't think of a good reason why we should care about this SSE5A, or should we call it 3DNow, technically. AMD may gain back a few benchmark scores, but it is hard to get developers to move from Intel compiler suites.

    Do you guys feel the same? When the MADD goes thru the OOO, it should be decoded as MUL and ADD micro-ops. There should not be a big difference if we use two instructions, MUL and ADD, which should get similar micro-ops. Maybe there is something AMD is weak at.
  • saratoga - Thursday, August 30, 2007 - link

    quote:

    Do you guys feel the same? When the MADD goes thru the OOO, it should be decoded as MUL and ADD micro-ops.



    Since muls have a very high latency compared to adds, and a dependency would exist between the ops, this would not be a good way to do things.

    quote:


    There should not be a big difference if we use two instructions, MUL and ADD, which should get similar micro-ops.


    The result is the same (obviously), but it's slower and complicates scheduling for no logical reason. Compared to a multiplier, adders are very cheap.
  • redpriest_ - Thursday, August 30, 2007 - link

    Look at Itanium's fused multiply add.
  • jiulemoigt - Thursday, August 30, 2007 - link

    Well I wrote several versions but what it comes down to is I'm scratching my head at the example, as it looks like it was written by marketing without asking an engineer how to code it. The first can be written in half that many lines of code and more efficiently; it looks like VB code that was automatically translated by a very bad compiler. I've written code for both chips and generally hand coding will give code that is about four times faster, but it is not practical considering time constraints and the number of people that can write assembler code. Yet using instructions is supposed to speed up the rate code goes because the computer performs a series of instructions that have predefined procedures, i.e. store data in A, store data in B, ADD A to B, repeat C times, return B, whereas this looks like store data in A, store data in B, Add A,B store in C, return C, repeat with new numbers multiple times, go back and get data returned from C and store in A compare to data from pass two stored in B store result in C return C, get data just returned compare to data from pass three, store in C return C, get data just returned etc... with the second one using the location but still using a third location! Instead of ADD a,b with the result in B, return B to location 1, return B to loc2, then store loc 1 in A, loc 2 in B ADD A,B return B.

    The interesting thing about the number of instructions in the example is that the time it takes each instruction to complete is far different, as store statements are not equal to compare statements are not equal to ADD/MUL statements; the computer can do an ADD statement faster than it can find data in local cache, let alone system memory. One of the reasons graphics cards are so much faster at MADD has to do with the data being right there, which is why graphics DDR is so much more expensive than system memory. And now AMD wants to join the slowest instruction with the fast ones? This is something people should be really wondering about, since it kills prefetch as it is going to make the system wait for data with every pass, including the ones that should be really fast. That suggests they are going to try and force the scheduler to get longer blocks of data like Intel did with its P4, which was a very bad design since branching logic is only so good, and every miss will cause the CPU to sit idle, covering up misses with longer cycles.

    Anyway, for the non-coders: SSE takes low-level code and packages chunks of code that can be passed to the CPU as one chunk it knows what to do with. Usually this makes the chunks get processed faster, as the scheduler on the CPU takes the chunks as one piece and it all gets pushed through with no waiting. Only in this case it is forcing the CPU to be an in-order CPU for every instruction so coded, which is bad because normally it can crank through fast instructions and idle through slow ones; this will force it to idle through many slow ones, as opposed to simply burning through the fast short ones. The less idle time, the faster the job gets done, but with everything waiting on store statements there will be an increase in idle time, since it is easier to stack a bunch of small legos in a box than four bowling balls. Just think of store statements as getting the legos or the bowling balls to put in the box: you may have to make more trips to get enough legos to fill the box, but the trips are faster, and if you get too many legos the amount that does not fit will be small, whereas that last bowling ball is a significant amount compared to what is in the box. Rough analogy, but I'm supposed to be relaxing, not thinking about work.

    Oh, and MMX when it first came out was a PR stunt, and it was only about two years after being added that someone found a use for it, as a kludge to simplify coding for people who were not willing to do it right. 3DNow was just as bad. SSE was the first set that was actually useful, when added to compilers to speed up certain repetitive tasks like encoding and rendering. Though this new set defeats the purpose of having all those new registers to use!
  • PeteRoy - Thursday, August 30, 2007 - link

    Return of the Jedi anyone?
  • her34 - Thursday, August 30, 2007 - link

    next for amd:

    the geforce 10800xt
  • peldor - Thursday, August 30, 2007 - link

    This strikes me as a way to distract from the lack of a complete SSE4 implementation.
