Last week Johan posted his thoughts from a server/HPC standpoint on AMD's roadmap. Much of my analysis was limited to desktop/mobile, so if you're making million-dollar server decisions then his article is better suited for your needs.

He also unveiled a couple of details about AMD's Bulldozer architecture that I thought I'd call out in greater detail. Johan has been working on a CMP vs. SMT article so I'll try not to step on his toes too much here.

It all started about two weeks ago when I got a request from AMD to have a quick conference call about Bulldozer. I get these sorts of calls for one of two reasons. Either:

1) I did something wrong, or
2) Intel did something wrong.

This time it was the former. I hate when it's the former.

It's called a Module

This is the Bulldozer building block, what AMD is calling a Bulldozer Module:

AMD refers to the module as being two tightly coupled cores, which starts the path of confusing terminology. A few of you wondered how AMD was going to be counting cores in the Bulldozer era; I took your question to AMD via email:

Also, just to confirm, when your roadmap refers to 4 bulldozer cores that is four of these cores:

http://images.anandtech.com/reviews/cpu/amd/FAD2009/2/bulldozer.jpg

Or does each one of those cores count as two? I think it's the former but I just wanted to confirm.

AMD responded:

Anand,

Think of each twin Integer core Bulldozer module as a single unit, so correct.

I took that to mean that my assumption was correct and 4 Bulldozer cores meant 4 Bulldozer modules. It turns out there was a miscommunication and I was wrong. Sorry about that :)
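To make the corrected counting concrete, here is a minimal sketch in Python. It assumes, based on the correction above and AMD's "twin Integer core" wording, that each module counts as two cores on the roadmap. The function and constant names are my own, chosen purely for illustration.

```python
# Assumption taken from the text: one Bulldozer module holds two integer
# cores, and AMD's roadmap counts each integer core as one "core".
CORES_PER_MODULE = 2  # hypothetical constant name

def cores_from_modules(modules: int) -> int:
    """Roadmap core count for a given number of modules."""
    return modules * CORES_PER_MODULE

def modules_from_cores(cores: int) -> int:
    """Number of modules behind a roadmap core count."""
    if cores % CORES_PER_MODULE != 0:
        raise ValueError("core count must be a multiple of cores per module")
    return cores // CORES_PER_MODULE

print(modules_from_cores(4))  # a "4 core" roadmap part -> 2 modules
print(cores_from_modules(4))  # a four module part -> 8 cores
```

In other words, a roadmap entry that says 4 Bulldozer cores describes two of the modules pictured, not four.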

Inside the Bulldozer Module

There are two independent integer cores on a single Bulldozer module. Each one has its own L1 instruction and data cache (thanks Johan), as well as scheduling/reordering logic. AMD is also careful to mention that the integer throughput of one of these integer cores is greater than that of the Phenom II's integer units.

Intel's Core architecture uses a unified scheduler fielding all instructions, whether integer or floating point. AMD's architecture uses independent integer and floating point schedulers. While Bulldozer doubles up on the integer schedulers, there's only a single floating point scheduler in the design.

Behind the FP scheduler are two 128-bit wide FMACs. AMD says that each thread dispatched to the core can take one of the 128-bit FMACs or, if one thread is purely integer, the other can have all of the FP execution resources to itself.
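The sharing rule AMD describes can be sketched as a tiny model. This is a deliberate simplification that treats the split as static per thread, which is my assumption; the real scheduler presumably arbitrates dynamically. The names below are hypothetical.

```python
# From the text: two 128-bit FMACs sit behind one shared FP scheduler.
FMAC_WIDTH_BITS = 128
FMACS_PER_MODULE = 2

def fmacs_per_thread(a_uses_fp: bool, b_uses_fp: bool) -> tuple:
    """Return (FMACs for thread A, FMACs for thread B) in one module.

    Simplified static model (an assumption): if both threads issue FP work
    each gets one FMAC; if only one does, it gets both.
    """
    if a_uses_fp and b_uses_fp:
        return (1, 1)
    if a_uses_fp:
        return (FMACS_PER_MODULE, 0)
    if b_uses_fp:
        return (0, FMACS_PER_MODULE)
    return (0, 0)

for mix in [(True, True), (True, False), (False, False)]:
    a, b = fmacs_per_thread(*mix)
    print(mix, "->", a * FMAC_WIDTH_BITS, "bits /", b * FMAC_WIDTH_BITS, "bits")
```

With one purely integer thread, the other thread sees both FMACs, or 256 bits of FP width in this model.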

AMD believes that 80%+ of all normal server workloads are purely integer operations. On top of that, the additional integer core on each Bulldozer module doesn't cost much die area. If you took a four module (eight core) Bulldozer CPU and stripped out the additional integer core from each module you would end up with a die that was 95% of the size of the original CPU. The combination of the two made AMD's design decision simple.

AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area. That's less than a complete doubling of die size for two cores, but still much more than something like Hyper-Threading.
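To put the revised 50% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the 50% is measured against the area of a hypothetical single-core module, normalized to 1.0; AMD has not spelled out the baseline, so treat both the baseline and the names below as my assumptions.

```python
def module_area(extra_core_fraction: float, single_core_area: float = 1.0) -> float:
    """Area of a two-core module, given the extra area the second core adds.

    single_core_area is a normalized, hypothetical baseline (an assumption).
    """
    return single_core_area * (1.0 + extra_core_fraction)

two_core_module = module_area(0.50)  # AMD's revised ~50% figure
two_full_cores = 2 * 1.0             # simply duplicating the whole core

print(two_core_module)                   # 1.5
print(two_core_module / two_full_cores)  # 0.75
```

Under that assumption, a two-core module takes about three quarters of the area of two fully independent cores.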


94 Comments


  • vsary6968 - Tuesday, December 01, 2009 - link

    This slide was from 2005. This is not the latest slide. You need to do more research.
  • Anand Lal Shimpi - Monday, November 30, 2009 - link

    You're very right, AMD responded and said that the 5% figure was incorrect. Unfortunately it looks like both Johan and I were given the same incorrect info.

    The real figure is closer to 50%; I've updated the article accordingly.

    Thanks again :)

    Take care,
    Anand
  • piesquared - Monday, November 30, 2009 - link

    I think I'd investigate a little further. Judging by the block diagrams, each integer core is nowhere near 50% of the die, so obviously that number can't be correct....
  • JumpingJack - Tuesday, December 01, 2009 - link

    And as we all know, these power point block diagrams are carefully scaled to ensure that blocks are exactly proportional to the actual units located on the floor plan of the die.

    From this, one may extrapolate the L3 cache is not much more than 512 KB.

    Thanks for the knee slapper.

    Jack
  • GaiaHunter - Monday, November 30, 2009 - link

    I'm still not sure. JF's argument is solid.

    That 5% and 50% could be just semantics.

    Because JF said, distinctly and repeatedly, he was talking about total die size, while the 50% is referring to the area of the module, sans L3$/IMC/NB/etc. And more specifically the Int-core area, which clearly doubles when going from 1 Int-core to 2 Int-cores.

    So, while getting up to a 180% increase in integer performance requires doubling the area dedicated to integer operations (or 50% of the total integer area), relative to the total die size that may well take only 5% of the die space.


  • GaiaHunter - Monday, November 30, 2009 - link

    I really think this is semantics again.

    Module and Die.

    Module is 50% bigger but die is only 5% bigger.
  • psychobriggsy - Tuesday, December 01, 2009 - link

    A single integer core (just the unique per-core parts, not the shared functionality in the module) takes up 5% of a typical quad-core Bulldozer die (including uncore and L3)? Or maybe even an octo-core die.

    Also assume rounding up and down and nearest. Could be 47% and 5.4%, etc.

    It's a way away yet. Let's see what happens.
  • smilingcrow - Monday, November 30, 2009 - link

    5% always sounded very unrealistic as that would mean a remarkable increase in IPC for such a small increase in ‘core’ size.
    If it was only 5% we would expect to see a native 8-module version for the desktop, if looked at purely from die size or on a cost basis. But at 50% extra it means that, all other things being equal, 4 modules = 6 ‘simple’ cores in space terms, ignoring the uncore.

    I’m still not 100% clear on the 50% thing. Suppose a die is 50% cores and 50% uncore and measures 100 sq mm. When we add the 50% larger cores to the equation, the cores become 75 sq mm and the die becomes 25% larger, or 125 sq mm. Or is there another portion of the module/core that is excluded, so the total size increase is less than 25 sq mm?
  • Zool - Monday, November 30, 2009 - link

    That 50% sounds much more realistic.
    On the K10 die shot at http://en.wikipedia.org/wiki/File:K10h.jpg you can see
    that doubling the integer pipeline, data cache and load/store unit is clearly more than 5% :P.
    The thing is that the L2 cache and L3 cache are in the Bulldozer module picture, and they take up several times more die area than the core. And there are also other things in the uncore, like the memory controller and HyperTransport. The whole die vs. core is quite different from the whole die vs. module. They say 50% more core area invested, not module or die area.
