The Pursuit of Clock Speed

Thus far I have pointed out that a number of resources in Bulldozer have been scaled back compared to their counterparts in AMD's Phenom II architecture. Many of these tradeoffs were made to keep die size in check while adding new features (e.g. a wider front end, larger queues/data structures, and new instruction support). Everywhere from the Bulldozer front end through the execution clusters, AMD's opportunity to increase performance depends on both efficiency and clock speed: Bulldozer has to make better use of its resources than Phenom II as well as run at higher frequencies to outperform its predecessor. As a result, a major design target for Bulldozer was the ability to scale to higher clock speeds.

AMD's architects called this pursuit a low gate count per pipeline stage design. By reducing the number of gates per pipeline stage, you reduce the time spent in each stage and can increase the overall frequency of the processor. If this sounds familiar, it's because Intel used similar logic in the creation of the Pentium 4.

Where Bulldozer is different is AMD insists the design didn't aggressively pursue frequency like the P4, but rather aggressively pursued gate count reduction per stage. According to AMD, the former results in power problems while the latter is more manageable.

AMD's target for Bulldozer was a 30% higher frequency than the previous generation architecture. Unfortunately that's a fairly vague statement, and I couldn't get AMD to commit to anything more specific, but if we look at the top-end Phenom II X6 at 3.3GHz, a 30% increase in frequency would put Bulldozer at roughly 4.3GHz.

Unfortunately 4.3GHz isn't what the top-end AMD FX CPU ships at. The best we'll get at launch is 3.6GHz, a meager 9% increase over the outgoing architecture. Turbo Core does get AMD close to those initial frequency targets, however turbo frequencies are typically only sustained for very short periods of time.
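A quick back-of-the-envelope check on those numbers (a sketch using only the clocks quoted above):

```python
# Frequency math for AMD's stated goal vs. what shipped
phenom_ii_top = 3.3            # GHz, top-end Phenom II X6
fx_top = 3.6                   # GHz, top-end FX at launch

target = phenom_ii_top * 1.30  # AMD's 30% frequency goal
actual_gain = fx_top / phenom_ii_top - 1

print(f"30% target:    {target:.2f} GHz")   # ~4.29 GHz
print(f"shipping gain: {actual_gain:.1%}")  # ~9.1%
```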

As you may remember from the Pentium 4 days, a significantly deeper pipeline can bring with it significant penalties. We have two prior examples of architectures that increased pipeline length over their predecessors: Willamette and Prescott.

Willamette doubled the pipeline length of the P6, a penalty that was supposed to be offset by a corresponding increase in clock frequency. If you do less per clock cycle, you need to throw more clock cycles at the problem to have a neutral impact on performance. Although Willamette ran at higher clock speeds than the outgoing P6 architecture, the increase in frequency was gated by process technology. It wasn't until Northwood arrived that Intel could hit the clock speeds required to truly put distance between its newest and older architectures.

Prescott lengthened the pipeline once more, this time quite significantly. Much to our surprise however, thanks to a lot of clever work on the architecture side Intel was able to keep average instructions executed per clock constant while increasing the length of the pipe. This enabled Prescott to hit higher frequencies and deliver more performance at the same time, without starting at an inherent disadvantage. Where Prescott did fall short however was in the power consumption department. Running at extremely high frequencies required very high voltages and as a result, power consumption skyrocketed.

AMD's goal with Bulldozer was to have IPC remain constant compared to its predecessor, while increasing frequency, similar to Prescott. If IPC can remain constant, any frequency increases will translate into performance advantages. AMD attempted to do this through a wider front end, larger data structures within the chip and a wider execution path through each core. In many senses it succeeded, however single threaded performance still took a hit compared to Phenom II:


Cinebench 11.5 - Single Threaded

At the same clock speed, Phenom II is almost 7% faster per core than Bulldozer according to our Cinebench results. This takes into account all of the aforementioned IPC improvements. Despite AMD's efforts, IPC went down.

A slight reduction in IPC however is easily made up for by an increase in operating frequency. Unfortunately, it doesn't appear that AMD was able to hit the clock targets it needed for Bulldozer this time around.
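To first order, delivered performance is IPC × frequency, so we can sketch where the shipped clocks leave Bulldozer. The ~7% figure is the Cinebench result above; this is a crude model, not a benchmark:

```python
# Crude first-order model: performance ~ IPC x frequency
ipc_ratio = 1 / 1.07      # Bulldozer per-clock throughput vs Phenom II (~7% deficit)
freq_ratio = 3.6 / 3.3    # FX top clock vs Phenom II X6 top clock

net = ipc_ratio * freq_ratio - 1
print(f"net single-thread change: {net:+.1%}")  # roughly +2% before Turbo Core

# Clock needed at this IPC just to break even with Phenom II per core:
breakeven = 3.3 * 1.07
print(f"break-even clock: {breakeven:.2f} GHz")  # ~3.53 GHz
```

In other words, at shipping base clocks the frequency bump barely covers the per-clock deficit.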

We've recently reported on GlobalFoundries' issues with 32nm yields. I can't help but wonder if the same types of issues that are impacting Llano today are also holding Bulldozer back.


430 Comments


  • silverblue - Thursday, October 13, 2011 - link

    Isn't the AseTek cooler self-contained?
  • jaygib9 - Thursday, October 13, 2011 - link

    Belard, water cooling is what many of the higher end gaming systems already run. What makes it stupid? It's far more effective than air cooling, it just requires more equipment and is a little more costly. You say a 25% overclock won't make up the performance difference, but what about possibly going up to 6 GHz/core with water cooling? Do you really think that wouldn't have some pretty good numbers? Hey silverblue, I'm not sure.
  • dillonnotz24 - Wednesday, October 12, 2011 - link

    This is a rather naïve sounding post, but it just occurred to me and I figured I might share.

    Now, I'm putting a lot of faith in simple marketing gimmicks here, but bear with me, and you might find this excellent food for thought.

    When I first discovered the leak concerning AMD's new Bulldozer consumer CPUs, I was kind of put off by AMD's naming scheme. FX 8150 seems like such a small number, and obviously wouldn't appear appealing to the eyes of un-savvy consumers. Now, one might find this claim a bit irrelevant, but if you look at history, numbers sell. Even AMD confirmed this when it launched its new series naming jargon, the "A4's, A6's, and A8's." This is quite obviously a marketing illusion to make AMD processors appear better than Intel's Core i3's, i5's, and i7's in an area that most unaware consumers will see first: the name!

    That said, I started thinking about processor branding. In the past, AMD has used a really strict branding system for its last two CPU designs. A Phenom II's part name very consistently correlated with the CPU's clockspeed and, therefore, performance. Slap on an extra +5 to the name and you got an extra 0.1 GHz of CPU frequency. Also, the higher end CPUs were always placed in the higher spectrum of the thousands place. The top of the line quad cores populated the numbers 925-1000, while the hexa-cores resided in the 1000-1100's. The rebranded CPUs based on original Phenom and Athlon architectures were given much lower values in the 1000's place, with the very popular 555 BE being a prime example. With Llano, the top-end A8-3850 reiterates this phenomenon. The further the part name extends from the number "4000," the less performance you received from the CPU and GPU, relatively incrementally. So, as you can see, AMD consistently used this strategy to give value to their parts without listing a single specification. Larger numbers generally mean more performance, and to the casual onlooker, unfamiliarity with the performance range you actually received from the processors in comparison to Intel's made that sub-$200 price point look really tasty.

    So, I say all that to present the following theory. Given that these processors can reach 4.6 GHz on air, and the unicorn-like 5.0 GHz (presumably) on AMD's water cooling solution, there seems to be a lot of headroom for AMD to pull off the most unprecedented comeback in the history of computing. That's right, I'm saying that maybe AMD intends to release new Bulldozer variants with upped clockspeeds and an actual included water-cooling solution for a raised pricepoint
  • dillonnotz24 - Wednesday, October 12, 2011 - link

    ...a raised price point. Could we see a future Bulldozer AMD A8 8950 @ 4.5 GHz with water-cooling bundled for $350 once Global F. gets its act together with producing reliable chips? Think about it...AMD's CPU frequency stepping and naming is nowhere to be found with these CPUs, and they are all huddled down around the number 8000. If this is actually the very bottom of the spectrum, this would mean that the very low end Bulldozer variants were on par with the best of Phenom II. Subsequently, the higher end Bulldozers I propose would have nothing to lose and everything to gain with higher clock speeds. All they can do is go up! With higher clockspeeds, Bulldozer could make up for all its woes seen here today in single and double threaded applications, which comprise nearly 50% of consumer level apps. There's potential here, but I will admit to those of you who find this whole concept absurd, I have my doubts. Can AMD do it? They'd have my eternal respect, and wallet, if they do.
  • Belard - Thursday, October 13, 2011 - link

    Sooner or later.... someone (perhaps Anandtech) will benchmark a 5Ghz AMD FX 8000 series CPU.

    If said 5.0Ghz CPU (water cooled) is still SLOWWWER in any way compared to a $200 Intel 2400 (3.1Ghz) or the $210 2500, who would care to buy such a $300~350 chip?

    Okay.. I upclock the 2500k to 4ghz and it kills the 8150 at 5~6Ghz.... Nobody buys the 8150 or higher. It just doesn't matter... its too slow.
  • stephenbrooks - Wednesday, October 12, 2011 - link

    They support FMA instructions but then don't fuse multiply and add micro-ops to *make* FMA instructions (as far as I can tell from the article). That's stupid.

    The way they've done it, everyone has to get a new compiler to take advantage of their chips. If they created FMAs in the muop-fusion stage, then even older software would get a boost too.
  • mczak - Wednesday, October 12, 2011 - link

    You absolutely cannot fuse mul+add on the fly into fma, as the results will be different. Now you can argue the result is "better" by omitting the rounding after the multiplication, but the fact is you need to adhere 100% to the standard, which dictates that you do it. Software might (quite likely some really will) rely on the CPU doing the right thing.
    For the same exact reason compilers can't do such optimizations either, unless you specifically tell them it's ok and you don't care about such things (it's not only standard adherence but also reproducibility - such compiler switches also allow them to reorder things like a+b+c as a+(b+c), which isn't otherwise allowed for floating point either, as you can get different results, which makes things unpredictable).
    (gcc for instance has a -ffast-math switch which I guess might be able to do such fusing; I don't know if it will, though I know you can get surprising bugs with that option if you don't think about it...)
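A quick sketch of that rounding difference in Python (the values are contrived for illustration, and the exact fma result is emulated with rational arithmetic; the point is just that the multiply's intermediate rounding discards the low-order term):

```python
from fractions import Fraction

a = 1.0 + 2.0**-27        # exactly representable as a double
c = -(1.0 + 2.0**-26)     # cancels the high part of a*a

# mul then add: the multiply rounds first and the 2**-54 term is lost
print(a * a + c)          # 0.0

# what a fused multiply-add computes: the exact product, one final rounding
exact = Fraction(a) * Fraction(a) + Fraction(c)
print(float(exact))       # 5.551115123125783e-17 (= 2**-54)

# reassociation changes results too, which is why compilers won't do that
# by default either:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```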
  • stephenbrooks - Thursday, October 13, 2011 - link

    Thanks for explaining that. I'd kind of assumed FMA would just round as if the MUL happened first. Defining it "correctly", they've thrown away a lot of compatibility for a really marginal increase in accuracy!
  • Pipperox - Thursday, October 13, 2011 - link

    Nope, it's not marginal.
    Basically with your "fused madd" you'd get code which on Bulldozer silently gives slightly different results than on any other CPU.
    This is called a bug.
    It is just not acceptable for a CPU to apply "optimizations" which alter the expected numerical output even slightly, because then programs which run on it would fail in subtle and hard-to-track ways.
  • mczak - Thursday, October 13, 2011 - link

    That isn't quite correct. There is real demand for fused multiply-add; skipping the rounding after the mul is actually quite appreciated. You just can't use fma blindly as a mul+add replacement, but it's perfectly well defined in standard floating point nowadays (IEEE 754-2008).
    Besides, it would be VERY difficult for the cpu to fuse mul+adds correctly even if it did do intermediate rounding after the mul. First the cpu would need to recognize the mul+add sequence - doable if they immediately follow each other I guess, though it requires analysis of the operands. And unless the add overwrites the mul's destination register, the cpu couldn't do it anyway, since it cannot know whether the mul result will be used later by some other instruction.
    This is really the sort of optimization which compilers would do, not cpus. Yes, cpus do op fusion these days, but it's quite limited.
