The New Way to Count Cores

Henceforth AMD is referring to the number of integer cores on a processor when it counts cores. So a quad-core Zambezi is made up of four integer cores, or two Bulldozer modules. An eight-core would be four Bulldozer modules.


A hypothetical quad-core Bulldozer. Presumably the L3 cache would be shared by both modules.


A hypothetical eight-core Bulldozer. Presumably the L3 cache would be shared by all four modules.

It's a distinct shift from AMD's (and Intel's) current method of counting cores. A quad-core Phenom II X4 is literally four Phenom II cores on a single die; if you disabled three, you would be left with a single-core Phenom II. The same can't be said about a quad-core Bulldozer. The smallest functional block there is a module, which is two cores according to AMD.

Better than Hyper Threading?

Intel doesn't take, at least today, quite as aggressive a step towards multithreading. Nehalem uses SMT to send two threads to a single core, resulting in as much as a 30% increase in performance.

The added die area to enable HT on Nehalem is very small, far less than 5%.

AMD claims that the performance benefit from the second integer core on a single Bulldozer module is up to 80% on threaded code. That's more than what AMD could get through something like Hyper Threading, but as we've recently found out, the impact on die size is not negligible. It really boils down to the sorts of workloads AMD will be running on Bulldozer. If they are indeed mostly integer, then the performance per die area will be quite good and the tradeoff worth it. Part of the integer/FP balance does depend on how quickly the world embraces computing on the GPU, however...
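
To put the tradeoff in rough numbers, here is a minimal Python sketch of performance per die area. It uses the upper-bound gains quoted above (30% for Hyper Threading, 80% for the second integer core) and treats 5% as a ceiling for the HT area cost. The area cost of the second integer core is an assumption: the 0.50 below is only a placeholder, the function name `perf_per_area` is hypothetical, and the result depends heavily on whether area is measured against the core, the module, or the whole die.

```python
def perf_per_area(perf_gain: float, area_increase: float) -> float:
    """Relative performance per unit of area after adding a feature.

    Both arguments are fractions of the baseline (0.30 means +30%).
    A result above 1.0 means the feature more than pays for its area.
    """
    if area_increase <= -1.0:
        raise ValueError("area_increase must be greater than -1.0")
    return (1.0 + perf_gain) / (1.0 + area_increase)


# Nehalem Hyper Threading: up to +30% performance, area cost "far less than 5%".
# Using 0.05 here is a conservative ceiling, not a measured value.
smt_ratio = perf_per_area(perf_gain=0.30, area_increase=0.05)

# Bulldozer second integer core: up to +80% on threaded code.
# ASSUMPTION: 0.50 is a placeholder area cost, not a figure from AMD.
cmt_ratio = perf_per_area(perf_gain=0.80, area_increase=0.50)

print(f"SMT performance per area: {smt_ratio:.2f}")
print(f"Second integer core performance per area: {cmt_ratio:.2f}")
```

With these particular inputs the two approaches land close together (about 1.24 versus 1.20), so the comparison is sensitive to the real area cost and to how often workloads actually reach the quoted peak gains.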

According to AMD's roadmaps, Zambezi will use either 4 or 8 Bulldozer cores (that's 2 or 4 modules). The quad-core Zambezi should have roughly 10 - 35% better integer performance than a similarly clocked quad-core Phenom II. An eight-core Zambezi will be a threaded monster.

No GPU, for Now

The first APU from AMD will be Llano, but it will be based on existing Phenom II cores. The move to a new manufacturing process combined with the first monolithic CPU/GPU is enough to do at once; there's no need to toss in a brand new microarchitecture at the same time.

AMD did add that eventually, in a matter of 3 - 5 years, most floating point workloads would be moved off of the CPU and onto the GPU. At that point you could even argue against including any sort of FP logic on the "CPU" at all. It's clear that AMD's design direction with Bulldozer is to prepare for that future.

In recent history AMD's architectural decisions have predicted, earlier than Intel, where the microprocessor industry was headed. The K8 embraced 64-bit computing, a move that Intel eventually echoed some years later. Phenom was first to migrate to the three-level cache hierarchy that we have today, with private L2 caches. Nehalem mimicked and improved on that philosophy. Bulldozer appears to be similarly ahead of its time, ready for a world where heterogeneous CPU/GPU computing is commonplace. I wonder if we'll see a similar architecture from Intel in a few years.


94 Comments


  • vsary6968 - Tuesday, December 01, 2009

    this slide was 2005. This is not the latest slide. you need to do more research.
  • Anand Lal Shimpi - Monday, November 30, 2009

    You're very right, AMD responded and said that the 5% figure was incorrect. Unfortunately it looks like both Johan and I were given the same incorrect info.

    The real figure is closer to 50%, I've updated the article accordingly.

    Thanks again :)

    Take care,
    Anand
  • piesquared - Monday, November 30, 2009

    I think I'd investigate a little further. Judging by the block diagrams each integer core is nowhere near 50% of the die, so obviously that number can't be correct....
  • JumpingJack - Tuesday, December 01, 2009

    And as we all know, these power point block diagrams are carefully scaled to ensure that blocks are exactly proportional to the actual units located on the floor plan of the die.

    From this, one may extrapolate the L3 cache is not much more than 512 KB.

    Thanks for the knee slapper.

    Jack
  • GaiaHunter - Monday, November 30, 2009

    I'm still not sure. JF's argument is solid.

    That 5% and 50% could be just semantics.

    Because JF said, distinctly and repeatedly, he was talking about total die size, while the 50% is referring to the area of the module, sans L3$/IMC/NB/etc. And more specifically the Int-core area, which clearly doubles when going from 1 Int-core to 2 Int-cores.

    So, while to get up to 180% increase in integer performance you need to double the area (or 50% of the total integer area) dedicated to integer operations, that, relative to the total die size, may well take only 5% of the die space.


  • GaiaHunter - Monday, November 30, 2009

    I really think this is semantics again.

    Module and Die.

    Module is 50% bigger but die is only 5% bigger.
  • psychobriggsy - Tuesday, December 01, 2009

    A single integer core (just the unique per-core parts, not the shared functionality in the module) takes up 5% of a typical quad-core Bulldozer die (including uncore and L3)? Or maybe even an octo-core die.

    Also assume rounding up and down and nearest. Could be 47% and 5.4%, etc.

    It's a way away yet. Let's see what happens.
  • smilingcrow - Monday, November 30, 2009

    5% always sounded very unrealistic as that would mean a remarkable increase in IPC for such a small increase in ‘core’ size.
    If it was only 5% we would expect to see a native 8 module version being for the desktop if looked at purely from die size or on a cost basis. But at 50% extra it means that all other things being equal 4 modules = 6 ‘simple’ cores in space terms ignoring the uncore.

    I’m still not 100% clear on the 50% thing. Say a die is 50% cores and 50% un-core and measures 100 sq mm. When we add the 50% larger cores to the equation, the cores become 75 sq mm and the die becomes 25% larger, or 125 sq mm. Or is there another portion of the module/core that is excluded, so the total size increase is less than 25 sq mm?
  • Zool - Monday, November 30, 2009

    That 50% sounds much more realistic.
    On the K10 die (http://en.wikipedia.org/wiki/File:K10h.jpg) u can see
    that doubling the integer pipeline, data cache and load store unit is clearly more than 5% :P.
    The thing is that L2 cache and L3 cache are in the Bulldozer module picture and they take up several times more die area than the core. And there are also other things in the uncore like the memory controller and HyperTransport. The whole die vs core is quite different than the whole die vs module. They say 50% more core area invested, not module or die area.
