Single Cycle SSE and Macro-Fusion in Core

Wasting no time, Justin Rattner jumped straight into the five key innovations in Intel's new Core micro-architecture.

The 4-issue core and 14-stage pipeline were both disclosed at the last IDF, and we already knew that the Pentium M's micro-ops fusion would make its way into Conroe. What's new here is support for macro-fusion. Where micro-ops fusion allows decoded instructions to be sent down the pipe together (as "fused" instructions), macro-fusion allows x86 instructions to be fused before the decode stage and sent down as a single instruction. The example Rattner gave was that a Compare followed by a Jump now becomes a single instruction in the pipeline.
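To make the compare-and-jump example concrete, here is a toy sketch of the idea in Python. It is not how the hardware works internally: the function names (`mnemonic`, `macro_fuse`), the sample instruction stream, and the set of jump mnemonics treated as fusable are all our own assumptions for illustration, since Intel has not detailed which pairs qualify.

```python
# Toy model of macro-fusion: pair a compare with the conditional jump
# that immediately follows it, before "decode".
# ASSUMPTION: this mnemonic set is illustrative, not Intel's actual fusion rules.
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def mnemonic(instruction):
    """Return the lowercase opcode of an assembly line, or '' if it is blank."""
    parts = instruction.split()
    return parts[0].lower() if parts else ""


def macro_fuse(instructions):
    """Return a new list in which each cmp + conditional-jump pair is one entry."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (following is not None
                and mnemonic(current) == "cmp"
                and mnemonic(following) in CONDITIONAL_JUMPS):
            # The pair travels down the pipe as a single fused instruction.
            fused.append(current + " + " + following)
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


# Hypothetical instruction stream for illustration only.
stream = ["mov eax, [esi]", "cmp eax, ebx", "jne loop_top", "add ecx, 1"]
print(len(stream), "x86 instructions ->", len(macro_fuse(stream)), "pipeline slots")
```

In this model the four x86 instructions occupy three slots, because the cmp/jne pair is carried as one entry. The benefit in real code would depend on how often such pairs appear back to back.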

The next major feature of the Core micro-architecture is single cycle throughput for all 128-bit SSE instructions, which should offer some pretty hefty gains in any application that makes extensive use of SSE. We've confirmed that this applies to all SSEn instructions (SSE1/SSE2/SSE3).

Updated: Intel clarified the single-cycle SSE item for us. The throughput (not latency) of all SSE instructions is now 1 cycle, whereas in the past it was generally 2 cycles. The increase in throughput will result in some pretty hefty performance gains in SSE-optimized encoding applications.
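The throughput versus latency distinction is easy to blur, so here is a rough back-of-the-envelope model in Python. It assumes a single, fully pipelined execution unit; the function names and the 5-cycle latency are hypothetical placeholders of ours (Intel has only given us the throughput figures), so treat the output as an illustration of the arithmetic rather than a performance prediction.

```python
def cycles_for_independent_ops(n_ops, latency, issue_interval):
    """Cycles to finish n_ops independent operations on one pipelined unit.

    A new operation can start every `issue_interval` cycles (the throughput),
    and each one takes `latency` cycles from start to finish.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency


def cycles_for_dependent_chain(n_ops, latency):
    """Cycles when every operation needs the previous result: latency dominates."""
    return n_ops * latency


LATENCY = 5   # ASSUMPTION: hypothetical latency, not a figure from Intel
N_OPS = 1000  # ASSUMPTION: arbitrary workload size

old = cycles_for_independent_ops(N_OPS, LATENCY, issue_interval=2)  # 2-cycle throughput
new = cycles_for_independent_ops(N_OPS, LATENCY, issue_interval=1)  # 1-cycle throughput
print("independent ops:", old, "cycles ->", new, "cycles")
print("dependent chain:", cycles_for_dependent_chain(N_OPS, LATENCY), "cycles either way")
```

With these assumed numbers, 1,000 independent operations drop from 2,003 cycles to 1,004, close to a 2x improvement, while a chain of dependent operations sees no change because it is bound by latency. That is why the gains should show up mainly in code with plenty of independent SSE work, such as encoding loops.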


  • nicolasb - Wednesday, March 8, 2006 - link

    What do I need to do so that the site will let me see the pictures when I'm reading its articles? :-(
  • stephenbrooks - Sunday, March 12, 2006 - link

    I've found that the pictures load as 1x1 blank GIFs in Opera but appear fine in IE. That really sucks.
  • zephyrprime - Tuesday, March 7, 2006 - link

    Anand says that SSE will execute in a single cycle but I think Intel really meant that SSE will have single cycle throughput, not latency. Notice that in the slide Intel simply writes "single cycle SSE". SSE instructions (except some of the really easy ones) are currently broken down from 128bits -> 2x64bit instructions to actually execute. This has long been the biggest weak point of SSE.

    I expect latency to be 5 cycles for SSE FP multiply (it's currently 6). I expect throughput to be 1 cycle for SSE FP multiply (it's currently 2). So instruction throughput will theoretically double.
  • Anand Lal Shimpi - Tuesday, March 7, 2006 - link

    You are quite correct, Intel just clarified this point to us and I've updated the article. Thanks for the pointer :)

    Take care,
  • Hulk - Tuesday, March 7, 2006 - link

    Do all SSE instructions execute in the same number of cycles?

    These crazy projections are always more exciting when they come from Intel because they do have a track record of NOT producing vaporware.

    On the other hand their performance figures are always way optimistic.

    If you look at the middle ground Conroe will probably be a bit faster than X2 per clock cycle. We'll see if they can ramp up the clockspeeds for release...
  • Doormat - Tuesday, March 7, 2006 - link

    "While we'll get a better idea of performance of Conroe, Merom and Woodcrest later today, Rattner did whet our appetites"

    Is that a typo or a reference (inside joke?) about performance numbers....
  • Rock Hydra - Tuesday, March 7, 2006 - link

    That's the proper use for the word.
    I suppose what he's trying to say is they're satisfied with the info disclosed at the time.
  • xtremejack - Tuesday, March 7, 2006 - link

    Whet means sharpen, right? It means becoming eager for more information, I suppose.
  • adamfilipo - Tuesday, March 7, 2006 - link

    same here. images aren't loading
    hope conroe kicks ass, my next powermac will have it
  • DigitalFreak - Tuesday, March 7, 2006 - link

    show me the benchies!
