The New Way to Count Cores

Henceforth AMD is referring to the number of integer cores on a processor when it counts cores. So a quad-core Zambezi is made up of four integer cores, or two Bulldozer modules. An eight-core would be four Bulldozer modules.


A hypothetical quad-core Bulldozer. Presumably the L3 cache would be shared by both modules.


A hypothetical eight-core Bulldozer. Presumably the L3 cache would be shared by all four modules.

It's a distinct shift from AMD's (and Intel's) current method of counting cores. A quad-core Phenom II X4 is literally four Phenom II cores on a single die; if you disabled three, you would be left with a single-core Phenom II. The same can't be said about a quad-core Bulldozer. The smallest functional block there is a module, which is two cores according to AMD.

Better than Hyper Threading?

Intel doesn't take, at least today, quite as aggressive a step towards multithreading. Nehalem uses SMT to send two threads to a single core, resulting in as much as a 30% increase in performance.

The added die area to enable HT on Nehalem is very small, far less than 5%.

AMD claims that the performance benefit from the second integer core on a single Bulldozer module is up to 80% on threaded code. That's more than what AMD could get through something like Hyper Threading, but as we've recently found out, the impact on die size is not negligible. It really boils down to the sorts of workloads that will be running on Bulldozer. If they are indeed mostly integer, then the performance per die area will be quite good and the tradeoff worth it. Part of the integer/FP balance does depend on how quickly the world embraces computing on the GPU, however...

According to AMD's roadmaps, Zambezi will use either 4 or 8 Bulldozer cores (that's 2 or 4 modules). The quad-core Zambezi should have roughly 10 - 35% better integer performance than a similarly clocked quad-core Phenom II. An eight-core Zambezi will be a threaded monster.

No GPU, for Now

The first APU from AMD will be Llano, but it will be based on existing Phenom II cores. The move to a new manufacturing process combined with the first monolithic CPU/GPU is enough to do at once; there's no need to toss in a brand new microarchitecture at the same time.

AMD did add that eventually, in a matter of 3 - 5 years, most floating point workloads would be moved off of the CPU and onto the GPU. At that point you could even argue against including any sort of FP logic on the "CPU" at all. It's clear that AMD's design direction with Bulldozer is to prepare for that future.

In recent history AMD's architectural decisions have predicted, earlier than Intel, where the microprocessor industry was headed. The K8 embraced 64-bit computing, a move that Intel eventually echoed some years later. Phenom was first to migrate to the 3 level cache hierarchy that we have today, with private L2 caches. Nehalem mimicked and improved on that philosophy. Bulldozer appears to be similarly ahead of its time, ready for a world where heterogeneous CPU/GPU computing is commonplace. I wonder if we'll see a similar architecture from Intel in a few years.

94 Comments

  • Zool - Monday, November 30, 2009

    This is the K10 core from Wikipedia with the integer pipeline highlighted (and other areas too): http://en.wikipedia.org/wiki/File:K10h.jpg
    The 5% area is quite realistic if you count in the shared L1 and L2 cache for one module.
  • psychobriggsy - Tuesday, December 01, 2009

    The L1 caches are duplicated however. Also the Load/Store units I presume, but maybe there is a way to share some resource there.

    What that diagram does show is that there are two 64-bit SIMDs (one of which can do x87) in K10 (not K10.5).

    In Bulldozer there are two 128-bit SIMDs (that can also do FMA). I presume that they can each do x87 if they deign to lower themselves to the task.

    That's why the FP performance has gone up. FMA counts as two operations when it comes to Linpack. :D FP is doubled compared to K10, even on a per-BDcore basis.

    Will we refer to a Bulldozer module as K11?
  • GaiaHunter - Monday, November 30, 2009

    In the guy's own words:

    http://forums.anandtech.com/showpost.php?p=2893509...

    [quote]I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

    If 2 cores get you 180% performance of 1, then in simple terms, that extra core is 50% that gets you the 80%.

    What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%.

    Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.[/quote]
  • tatertot - Monday, November 30, 2009

    The guy is wrong, or his engineering team misunderstood.

    Moore (the lead designer) said about 50% increase to double the integer resources, L1D, etc. That sounds about right.

    What I COULD believe is this:

    Q: "If I took an 8 core processor (with 4 modules) and removed 1 integer core from ONE module, how much die space would that save?" A: 5%

    In other words, if you removed them from all 4, you'd save 20%.

    If you figure that the uncore takes up a bit more than half of the die, that would be totally consistent with Moore's 50% larger core figure.

    For example (totally made up numbers):

    Die size: 300 mm^2

    Uncore: 160 mm^2
    4 BD modules: 140 mm^2

    1 BD module: 35 mm^2
    1 BD module without extra integer units: 23 mm^2
    (Savings from lopping the extra integer units off 1 BD module: 12 mm^2)

    4 BD modules without extra integer units: 93 mm^2
    (Savings from lopping the extra integer units off all 4 BD modules: 47 mm^2)

    12/300 is 4%, which is what his engineers thought he was asking.

    But really he was asking about 47/300 or ~16%.

    So as stated, the 5% is wrong. It's the area cost of 1 of the module's extra int resources on a 4 module die. All 4 of them cost more.

    And this would be consistent with Moore's estimate that relative to JUST the module, it is a 50% area increase.
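
The arithmetic in this example is easy to rerun with other guesses. Below is a minimal Python sketch using the same made-up figures (300 mm^2 die, 35 mm^2 module, Moore's 50% module growth). The function and constant names are illustrative only and none of the areas are real measurements. It computes the unrounded single-core module size (about 23.3 mm^2), which is why the four-module saving comes out near 47 mm^2 rather than 48.

```python
# Rerun of the made-up example: how much die area the second integer
# core costs when trimmed from one module versus from all four.
# All figures are the commenter's invented numbers, in mm^2.

DIE_AREA = 300.0        # whole die, uncore included
MODULE_AREA = 35.0      # one two-core Bulldozer module
MODULE_COUNT = 4
MODULE_GROWTH = 0.50    # Moore's figure: the second core makes a module 50% larger


def extra_core_area(module_area: float, growth: float) -> float:
    """Area the second integer core adds to one module."""
    single_core_module = module_area / (1.0 + growth)
    return module_area - single_core_module


def die_fraction_saved(modules_trimmed: int) -> float:
    """Fraction of the whole die saved by trimming that many modules."""
    saved = modules_trimmed * extra_core_area(MODULE_AREA, MODULE_GROWTH)
    return saved / DIE_AREA


if __name__ == "__main__":
    for n in (1, MODULE_COUNT):
        print(f"trim {n} module(s): {die_fraction_saved(n):.1%} of the die")
```

Rounded to whole percentages, the two results match the 4% and ~16% quoted above. Changing `DIE_AREA` shows how strongly the "percent of die" figure depends on the assumed uncore size.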

  • psychobriggsy - Tuesday, December 01, 2009

    Thanks for doing the example maths.

    Yes, it looks like adding a single core to each module adds around 15%-20% to the die size of a dual-module/quad-core Bulldozer.

    So 20% die space for 80% performance increase. Well, until you decide to make the L2 larger because there will be more contention for it.

    Of course the 5% die area for SMT in Nehalem is negligible when you start factoring in the uncore portions as above...
  • GaiaHunter - Monday, November 30, 2009

    Meh, I misclicked and reported your post by mistake, sorry about that. :(

    Anyway, imagine an int core is 5% of the area of a total module and that a module size is 100 (size units, not mm^2), so the int core size is 5. 4 modules will then be 400 and 4 int cores will be 20. 20 is 5% of 400, not 20%. Same for 8 modules.

    You have to see that they do need 50% more area to get the 80% Int performance boost, as they are using a second Int core to accomplish that. So the dedicated area of the module for Int operations is 2x the size of a regular Int core.
  • HolKann - Monday, November 30, 2009

    Nah, you don't understand him. His assumptions are:
    1. one int core is about 5% of the whole die (including uncore).
    2. one int core is about 50% of a module.
    3. the uncore makes up about half of the die.

    Put this in numbers:
    Take a module as 100 size units. 4 modules means 400 size units, adding the uncore makes the size of the whole die 800. 5% of 800 is 40 size units. And tadaa, this makes an int core 40% of the size of a module ;) The number gets closer to 50% if one takes the uncore bigger.

    If his assumptions are correct, a 25% total die increase (4 × 5% added on top of the remaining 80%) results in 80% extra performance. This is about as good as Intel's 5% die increase for 15-20% extra performance (I know, this is a bold statement, a lot of unknown variables could alter this situation drastically).
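
As a rough check of these numbers, here is a short Python sketch. It takes the three assumptions at face value (one int core is 5% of the die, a module is 100 size units, four modules per die) and varies the uncore size relative to the modules. The names are hypothetical and the inputs are the guesses from this comment, not measurements.

```python
# Size units follow the comment: one module = 100 units, four modules per die.
MODULE_UNITS = 100.0
MODULE_COUNT = 4
INT_CORE_DIE_SHARE = 0.05   # assumption 1: one int core is 5% of the whole die


def int_core_share_of_module(uncore_to_modules: float) -> float:
    """Share of one module taken by one int core.

    uncore_to_modules is the uncore area divided by the combined module
    area (1.0 means the uncore is as large as all four modules together).
    """
    modules_area = MODULE_UNITS * MODULE_COUNT
    die_area = modules_area * (1.0 + uncore_to_modules)
    int_core_area = INT_CORE_DIE_SHARE * die_area
    return int_core_area / MODULE_UNITS


def die_growth_from_second_cores() -> float:
    """Die growth from adding one 5%-of-die int core to every module."""
    added = MODULE_COUNT * INT_CORE_DIE_SHARE   # 20% of the final die
    return added / (1.0 - added)                # relative to the die without them


if __name__ == "__main__":
    for ratio in (1.0, 1.5):
        share = int_core_share_of_module(ratio)
        print(f"uncore/modules = {ratio}: int core is {share:.0%} of a module")
    print(f"die growth from the second cores: {die_growth_from_second_cores():.0%}")
```

With the uncore equal to the modules the int core is 40% of a module. It reaches 50% once the uncore is 1.5 times the module area, which is the "closer to 50% with a bigger uncore" point made above.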
  • GaiaHunter - Monday, November 30, 2009

    The data we have is: Removing 1 int core from each module would result in a 5% reduction of total die size. 1 int core = 50% of the total area dedicated to integer operations.


    So, for a total die of let's say 1000 units with 8 int cores, 4 int cores represent 5% of the total die size or 50 size units.

    So each int core is 12.5 size units and the 8 int cores take 100 size units or 10% of the die.

    Assuming sizes for total die size or what is the Bulldozer Module size relative to total size is pure speculation, as we don't have any numbers other than that JF affirmation.

    To remember:
    "What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%. "

    That was the affirmation.

    In no way does this contradict the affirmation that AMD increased the Module area dedicated to integer operations by 50% to achieve 80% performance.

    Main point is DIE SIZE ≠ BULLDOZER MODULE.
  • tatertot - Monday, November 30, 2009

    I am disputing the JF claim: "Removing 1 int core from each module would result in a 5% reduction of total die size."

    I suspect that his engineers misunderstood his question, and it is actually the removal of the "extra core" from ONE BD module that would result in 5% overall die savings.

    You can take it to the bank that Moore is correct that adding another integer execution unit group, L1D, etc. to the core (thus making 2 cores, or a 'module') increased the size by 50%. Moore is the designer, not a marketing guy.

    In order for Fruehe's claim to be correct, the uncore area would have to be VERY large:

    Some more (different numbers):

    Assume a BD module is 30 mm^2 (thus increased by 10 mm^2, or 50%, from 20 mm^2 to add the second 'core', per Moore).

    If 5% were actually the correct estimation of the area added for 4 BD modules (4 * 10 mm2 increase = 40 mm2 increase), then the overall die size would need to be... 800 mm^2.

    This is nuts.

    On the other hand, if "5% of the total die area" is the estimate of the space needed to add the integer resources to just 1 BD module, then the overall die can be 200 mm^2, so uncore 80 mm^2, 4 BD modules at 120 mm^2, and then Moore's numbers can be consistent with what JF heard back from the engineers.

    So, my theory is that his engineers thought they were being asked how much of the total die (for a 4 BD module part) the increase in integer units to 1 BD module resulted in, while JF thought he was asking how much the increase to ALL 4 modules would be. This would be an easy misunderstanding to have, and I don't see another way to reconcile Moore's information (which I trust), with JF's claim.
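
The two readings of the ~5% answer can be compared with a few lines of Python. This is only a sketch built on the assumed figures from this comment (a 30 mm^2 module, 10 mm^2 added per module). The function names are made up for illustration.

```python
# Assumed figures from the comment, in mm^2. None of these are measured.
MODULE_AREA = 30.0        # one two-core module
EXTRA_PER_MODULE = 10.0   # area added by the second core (50% over 20 mm^2)
MODULE_COUNT = 4
CLAIMED_PERCENT = 5       # the "~5% of total die" answer


def implied_die_area(extra_area: float, percent: float) -> float:
    """Total die area if extra_area is `percent` percent of the die."""
    return extra_area * 100.0 / percent


def implied_uncore_area(die_area: float) -> float:
    """Whatever is left of the die after the four modules."""
    return die_area - MODULE_COUNT * MODULE_AREA


if __name__ == "__main__":
    readings = {
        "5% covers all four modules": MODULE_COUNT * EXTRA_PER_MODULE,
        "5% covers one module": EXTRA_PER_MODULE,
    }
    for label, extra in readings.items():
        die = implied_die_area(extra, CLAIMED_PERCENT)
        print(f"{label}: die {die:.0f} mm^2, uncore {implied_uncore_area(die):.0f} mm^2")
```

Under the first reading the die would have to be 800 mm^2 with 680 mm^2 of uncore. Under the second it is 200 mm^2 with 80 mm^2 of uncore, matching the figures above.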