SSE128

AMD Architecture Comparison

                              K8                          Barcelona
SSE Execution Width           64-bit                      128-bit
Instruction Fetch Bandwidth   16 bytes/cycle              32 bytes/cycle
Data Cache Bandwidth          2 x 64-bit loads/cycle      2 x 128-bit loads/cycle
L2/Northbridge Bandwidth      64 bits/cycle               128 bits/cycle
FP Scheduler Depth            36 Dedicated x 64-bit ops   36 Dedicated x 128-bit ops

Many of the "major" changes to Barcelona were driven by one significant change: what AMD is calling SSE128. In the K8 architecture AMD can execute two SSE operations in parallel; however, the SSE execution units are only 64 bits wide. The K8 therefore had to handle each 128-bit SSE operation as two 64-bit operations. This also means that when a 128-bit SSE instruction is fetched, it is first decoded into two micro-ops (one for each 64-bit half of the instruction), thus taking up an extra decode port for a single instruction.

Barcelona widens the execution units that handle SSE operations from 64 bits to 128 bits, so 128-bit SSE operations no longer have to be broken up into two 64-bit operations. This also means that you get more usable decode bandwidth, since 128-bit SSE instructions now map to a single micro-op instead of two. The FP scheduler can now handle these 128-bit SSE operations as well.
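To put rough numbers on the micro-op savings, here is a minimal Python sketch. It assumes the simple rule described above (one micro-op per execution-width piece of a 128-bit instruction) and ignores instructions that need extra micro-ops for other reasons. The function name and the six-instruction stream are hypothetical, chosen only for illustration.

```python
SSE_OPERAND_BITS = 128  # width of a full SSE register


def micro_ops_needed(num_instructions: int, execution_width_bits: int) -> int:
    """Micro-ops generated when each 128-bit SSE instruction is split
    into pieces that match the execution unit width (simplified model)."""
    pieces_per_instruction = SSE_OPERAND_BITS // execution_width_bits
    return num_instructions * pieces_per_instruction


# Hypothetical stream of six 128-bit SSE instructions.
k8_ops = micro_ops_needed(6, execution_width_bits=64)
barcelona_ops = micro_ops_needed(6, execution_width_bits=128)

print(f"K8: {k8_ops} micro-ops, Barcelona: {barcelona_ops} micro-ops")
```

Under this model the same instruction stream occupies half as many decode and scheduler slots on Barcelona as it does on K8.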

It's the increase to SSE execution width that drove a number of other changes within the core. Since you effectively have more decode bandwidth when executing 128-bit SSE instructions, AMD discovered a new bottleneck: instruction fetch bandwidth. These 128-bit SSE instructions tend to be quite large, and in order to maximize the number decoded in parallel the Barcelona core can now fetch 32 bytes per cycle, up from 16 bytes in K8. The 32B instruction fetch not only benefits SSE code but also seems to benefit integer code. Bigger instructions in general will see a performance boost here.
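The fetch bottleneck is easy to see with some back-of-the-envelope arithmetic. The sketch below is a rough model, not a description of AMD's actual fetch logic: the 7-byte average instruction length and the 3-wide decoder are assumptions made for illustration, and the function name is hypothetical.

```python
def sustained_decode_rate(fetch_bytes_per_cycle: int,
                          avg_instruction_bytes: float,
                          decode_width: int) -> float:
    """Instructions per cycle the front end can sustain: the smaller of
    what fetch can supply and what the decoders can accept."""
    fetch_limited = fetch_bytes_per_cycle / avg_instruction_bytes
    return min(fetch_limited, decode_width)


AVG_SSE_INSTRUCTION_BYTES = 7  # assumption: large, prefix-heavy SSE instructions
DECODE_WIDTH = 3               # assumption: three instructions decoded per cycle

k8_rate = sustained_decode_rate(16, AVG_SSE_INSTRUCTION_BYTES, DECODE_WIDTH)
barcelona_rate = sustained_decode_rate(32, AVG_SSE_INSTRUCTION_BYTES, DECODE_WIDTH)

print(f"K8 (16 bytes/cycle fetch):        {k8_rate:.2f} instructions/cycle")
print(f"Barcelona (32 bytes/cycle fetch): {barcelona_rate:.2f} instructions/cycle")
```

With these assumed numbers a 16-byte fetch can't keep three decoders fed, while a 32-byte fetch can. Shorter instructions would shrink or remove the gap.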

Now that you can fetch and decode more instructions, you need to be able to get more data to the execution core, and thus AMD widened the interface between the L1 data cache and Barcelona's SSE registers. Barcelona can now perform two 128-bit SSE loads per cycle from the L1-D cache, compared to two 64-bit loads per cycle in K8. AMD then widened the interface between the L2 cache and the memory controller so that 128 bits can now be transferred per cycle, up from 64 bits, once again to balance out all of the aforementioned changes.
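Converting the per-cycle figures into bytes makes the doubling concrete. This is a small sketch using only the widths quoted above; the 2.0GHz clock is a hypothetical value picked to turn cycles into GB/s, not a statement about shipping clock speeds, and the helper name is made up for the example.

```python
def bytes_per_cycle(transfers_per_cycle: int, bits_per_transfer: int) -> int:
    """Peak bytes moved per clock cycle across an interface."""
    return transfers_per_cycle * bits_per_transfer // 8


# L1 data cache loads: two per cycle on both cores, 64-bit vs 128-bit wide.
l1_load = {"K8": bytes_per_cycle(2, 64), "Barcelona": bytes_per_cycle(2, 128)}
# L2/northbridge interface: 64 bits vs 128 bits per cycle.
l2_nb = {"K8": bytes_per_cycle(1, 64), "Barcelona": bytes_per_cycle(1, 128)}

CLOCK_HZ = 2_000_000_000  # assumption: hypothetical 2.0GHz core clock

for chip in ("K8", "Barcelona"):
    l1_gb_per_s = l1_load[chip] * CLOCK_HZ / 1e9
    print(f"{chip}: L1 loads {l1_load[chip]} bytes/cycle "
          f"({l1_gb_per_s:.0f} GB/s peak at 2.0GHz), "
          f"L2/northbridge {l2_nb[chip]} bytes/cycle")
```

These are peak figures only; sustained bandwidth depends on hit rates and on how well the code keeps both load ports busy.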

The culmination of the SSE128 improvements is very similar to some of the changes made in the Yonah to Merom transition. Prior to Conroe/Merom, Yonah could not keep up with AMD's K8 when it came to FP/SSE performance. Almost a year and a half ago we did an article where we compared AMD's K8 to Intel's Yonah running at the same clock speed. While Yonah was able to equal the K8's performance in general applications, professional 3D rendering and games, it could not compete when it came to video encoding.

There were a number of SSE performance improvements made to Yonah but it wasn't until Intel's Core 2 processors that Intel was really able to outperform AMD in our video encoding tests. Whether the improvements were due to the single cycle SSE throughput introduced in Core 2 or the wider front end or a combination of both remains to be seen. Although it's difficult to compare specs between two very different architectures, encoding performance is a sore spot for AMD today, and it's something that the SSE128 changes can only help.


83 Comments


  • agaelebe - Friday, March 2, 2007 - link

    Wow! A lot of discussion in here.
    And, by the way, very interesting article.

    I'm a software engineer from Brazil and I'm planning to change my PC this year.
    I've been using AMD processors since the K6.
    Today I have an XP Mobile 2500+ (@2.2GHz), 1GB RAM, 200GB and an AGP 6600GT.
    My PC is not very slow, but I'm thinking of going dual core to speed things up (office applications, web development and some games).
    I can run some of the newest games, but not at high graphics settings.
    I expect that my PC can run C&C 3 (I already ran the demo at 1024 on medium; it doesn't run slowly, but it has some crashes).

    So, today I'm thinking of 3 options:
    1) Stay with this computer and wait until AMD launches its new architecture (I intend to go with an average-priced Kuma)

    2) Go with an Intel Core 2 Duo (e6300 or e6400). They're not expensive, and for games I can easily overclock to gain more performance.

    3) Buy a good AM2 board and a cheap Athlon X2 (3600), wait for the new AMD processors, and then change only the processor.

    Here in Brazil the taxes are too high, so I'm planning on buying a PC with these specs:

    - CORE 2 Duo e6300/6400 or X2 3600/3800
    - mid-tier motherboard (
    - 2 x 1gb DDR 800 4-4-4-12
    - 2 x 250 gb
    - X1950pro 256 or 512
    - 500watts power

    So the prices are below:

    e6300 box US$ 300 (same price for a X2 4200+ box)

    X2 3800 box US$ 220

    motherboard: US$ 220

    ram: US$ 400

    video: US$ 450

    DVD: US$ 70

    case: US$ 150

    HDs : US$ 250

    Power: us$ 180

    So I plan to spend about 2000 dollars (sadly, I could buy this same PC in the US for half the price).

    My new PC shouldn't draw too much power, so that I can leave it turned on all day long (max 150 watts at idle without the monitor); otherwise I'll keep my old computer turned on just for downloading stuff.

    So, if someone has an opinion, I'd like to "hear" it. You can suggest other options too, or comment on the specs I'm choosing now.

    I had a Pentium 75 and after that only AMD CPUs... Should I now surrender to the Core 2 Duo, or believe that AMD can really beat it by the end of 2008?

    And thanks for the cooperation and patience.
  • Zebo - Saturday, March 3, 2007 - link

    Athlon 64 AM2s aren't exactly slow, so if you're an AMD fan just get one, like a 3800+ or 3600+, and overclock it. It will be at least 4x faster than what you have now and will accept the K8L Agena core later. It will be cheaper than C2D by about $50 USD, and you'll also pay little for a GeForce 6100 motherboard, which is only $50 USD. Overall, expect the AM2 system to be about $100 USD cheaper.

    Keep in mind that C2D is 20% faster clock for clock in most apps, so it's not exactly a quantum leap getting a C2D. The gap gets a lot larger when overclocking, since C2Ds overclock higher: 3.2GHz is common on air vs. only 2.8GHz for AM2. So at the end of the day a C2D setup is able to be about 40% faster over most benchmarks. That is getting significant, and it's why enthusiasts are buying C2Ds.
  • agaelebe - Friday, March 2, 2007 - link

    And, as always, sorry for the errors and the not-so-good writing...
  • Kiijibari - Thursday, March 1, 2007 - link

    Hi,

    never heard of that before, does anybody know what it is?
    So far I see 2 pad areas in the die photo, so I assume that there would also be 2 interfaces, e.g. x8 PCIe like Sun uses?

    bb

    Kiijibari
  • mino - Friday, March 2, 2007 - link

    It should be some management/coordination stuff (can't remember the name of that bus).
    Every northbridge and CPU has that.
  • davecason - Thursday, March 1, 2007 - link

    Anand,

    Great article! I know it took a lot of time and I wanted you to know I really appreciate your effort. It is the kind of article that keeps me coming back to your site.

    -Dave
  • yyrkoon - Thursday, March 1, 2007 - link

    quote:

    On average, about 1/3 of all instructions in a program end up being loads, thus if you can improve load performance you can generally impact overall application performance pretty significantly.


    Page 5, paragraph 4 'pretty significantly'. Well is it, or is it not it ?

    http://www.wikihow.com/Avoid-Colloquial-%28Informa...

    Aside from my gripe concerning writing style, good article :)
  • trisweb2 - Friday, March 16, 2007 - link

    Usually we criticize writing style based on a whole experience... obviously Anand is one of the best technical review writers on the Internet; if you bother to read his articles more fully perhaps you'd realize that. The colloquial writing sometimes brings it to a more personal level that a reader can better relate to and understand -- it works especially well in this case, where it's a future design, we really don't know how it's going to perform. That he can guess and say "pretty significantly" tells me he understands the uncertainty of the situation, and the language communicates that fact perfectly well. It would be more confusing if he said it would impact performance "significantly" as you want him to, as that would imply that he was more certain than he might actually have been.

    Masters are allowed to bend the rules, and Anand is one, so lay off.
  • yyrkoon - Thursday, March 1, 2007 - link

    *Is it, or is it not*

    /me hangs head in shame
  • baronzemo78 - Thursday, March 1, 2007 - link

    Any rough guess as to how Barcelona will compete with Core2 in gaming? Many articles have shown how Core2 gets you a slight FPS boost in games that aren't graphics card limited. I'm curious how Barcelona will fit in with the overall picture of DX10 cards like G80 and R600.
