The Pursuit of Clock Speed

Thus far I have pointed out that a number of resources in Bulldozer have been cut back compared to their abundance in AMD's Phenom II architecture. Many of these tradeoffs were made in order to keep die size in check while adding new features (e.g. wider front end, larger queues/data structures, new instruction support). Everywhere from the Bulldozer front end through the execution clusters, AMD's opportunity to increase performance depends on both efficiency and clock speed. Bulldozer has to make better use of its resources than Phenom II as well as run at higher frequencies to outperform its predecessor. As a result, a major target for Bulldozer was to be able to scale to higher clock speeds.

AMD's architects called this pursuit a low gate count per pipeline stage design. By reducing the number of gates per pipeline stage, you reduce the time spent in each stage and can increase the overall frequency of the processor. If this sounds familiar, it's because Intel used similar logic in the creation of the Pentium 4.

Where Bulldozer differs, AMD insists, is that the design didn't aggressively pursue frequency like the P4 did, but rather aggressively pursued gate count reduction per stage. According to AMD, the former results in power problems while the latter is more manageable.
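To make the gate count argument concrete, here is a minimal sketch of the usual first-order relationship between logic depth per stage and achievable clock speed. The function name and every number in it (gate delay, latch overhead, gate counts) are hypothetical values chosen purely for illustration; none of them come from AMD or describe Bulldozer itself.

```python
def max_frequency_ghz(gates_per_stage, gate_delay_ps, latch_overhead_ps):
    """First-order estimate of the highest clock a pipeline stage allows.

    The cycle time is modeled as the logic delay of the stage (gates in
    series times delay per gate) plus a fixed per-stage latch/clocking
    overhead. All inputs are hypothetical illustration values.
    """
    cycle_ps = gates_per_stage * gate_delay_ps + latch_overhead_ps
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1000 / ps gives GHz


# Hypothetical process parameters, NOT measured Bulldozer data.
GATE_DELAY_PS = 15.0
LATCH_OVERHEAD_PS = 50.0

for gates in (24, 18):
    freq = max_frequency_ghz(gates, GATE_DELAY_PS, LATCH_OVERHEAD_PS)
    print(f"{gates} gates per stage -> about {freq:.2f} GHz")
```

The fixed latch overhead is why the relationship isn't linear: cutting the gate count by a quarter raises the clock ceiling by a bit less than the naive one third you would get if logic delay were the only term.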

AMD's target for Bulldozer was a 30% higher frequency than the previous generation architecture. Unfortunately that's a fairly vague statement and I couldn't get AMD to commit to anything more specific, but if we look at the top-end Phenom II X6 at 3.3GHz, a 30% increase in frequency would put Bulldozer at roughly 4.3GHz.

Unfortunately 4.3GHz isn't what the top-end AMD FX CPU ships at. The best we'll get at launch is 3.6GHz, a meager 9% increase over the outgoing architecture. Turbo Core does get AMD close to those initial frequency targets; however, the turbo frequencies are typically seen only for very short periods of time.
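The arithmetic behind those figures can be checked with a short sketch. The clock speeds and the 30% target are the ones quoted above; the variable and function names are just illustrative.

```python
# Figures quoted in the article; names are illustrative only.
PHENOM_II_X6_GHZ = 3.3   # top-end Phenom II X6 base clock
FX_LAUNCH_GHZ = 3.6      # top-end AMD FX base clock at launch
TARGET_UPLIFT = 0.30     # AMD's stated frequency target


def pct_increase(new, old):
    """Percentage increase of new over old."""
    return (new - old) / old * 100.0


target_ghz = PHENOM_II_X6_GHZ * (1 + TARGET_UPLIFT)
launch_uplift = pct_increase(FX_LAUNCH_GHZ, PHENOM_II_X6_GHZ)
shortfall_ghz = target_ghz - FX_LAUNCH_GHZ

print(f"Target clock:    {target_ghz:.2f} GHz")    # 4.29 GHz
print(f"Launch uplift:   {launch_uplift:.1f}%")    # 9.1%
print(f"Shortfall:       {shortfall_ghz:.2f} GHz") # 0.69 GHz
```

In other words, the launch part lands roughly 0.7GHz short of where a literal reading of the 30% target would put it.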

As you may remember from the Pentium 4 days, a significantly deeper pipeline can bring with it significant penalties. We have two prior examples of architectures that increased pipeline length over their predecessors: Willamette and Prescott.

Willamette doubled the pipeline length of the P6 and was meant to make up for it with a corresponding increase in clock frequency. If you do less per clock cycle, you need to throw more clock cycles at the problem to have a neutral impact on performance. Although Willamette ran at higher clock speeds than the outgoing P6 architecture, the increase in frequency was gated by process technology. It wasn't until Northwood arrived that Intel could hit the clock speeds required to truly put distance between its newest and older architectures.

Prescott lengthened the pipeline once more, this time quite significantly. Much to our surprise however, thanks to a lot of clever work on the architecture side Intel was able to keep average instructions executed per clock constant while increasing the length of the pipe. This enabled Prescott to hit higher frequencies and deliver more performance at the same time, without starting at an inherent disadvantage. Where Prescott did fall short however was in the power consumption department. Running at extremely high frequencies required very high voltages and as a result, power consumption skyrocketed.

AMD's goal with Bulldozer was to have IPC remain constant compared to its predecessor while increasing frequency, similar to Prescott. If IPC can remain constant, any frequency increases will translate into performance advantages. AMD attempted to do this through a wider front end, larger data structures within the chip and a wider execution path through each core. In many senses it succeeded; however, single threaded performance still took a hit compared to Phenom II:

Chart: Cinebench 11.5 - Single Threaded

At the same clock speed, Phenom II is almost 7% faster per core than Bulldozer according to our Cinebench results. This takes into account all of the aforementioned IPC improvements. Despite AMD's efforts, IPC went down.

A slight reduction in IPC, however, is easily made up for by an increase in operating frequency. Unfortunately, it doesn't appear that AMD was able to hit the clock targets it needed for Bulldozer this time around.
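A rough way to see how frequency offsets the IPC deficit is to treat single threaded performance as IPC multiplied by clock speed. The sketch below makes two simplifying assumptions that are mine, not AMD's: it treats the "almost 7%" Cinebench gap as exactly 7%, and it assumes that gap applies uniformly across workloads. The function name is illustrative.

```python
def relative_performance(ipc_ratio, freq_ghz, baseline_ghz):
    """Performance relative to the baseline chip, assuming perf = IPC * frequency."""
    return ipc_ratio * (freq_ghz / baseline_ghz)


PHENOM_II_GHZ = 3.3
# Assumption: Phenom II is exactly 7% faster per clock (the article says
# "almost 7%"), so Bulldozer's IPC relative to Phenom II is 1 / 1.07.
BULLDOZER_IPC_RATIO = 1 / 1.07

# Clock speed Bulldozer needs just to match a 3.3GHz Phenom II per thread.
break_even_ghz = PHENOM_II_GHZ / BULLDOZER_IPC_RATIO

at_launch = relative_performance(BULLDOZER_IPC_RATIO, 3.6, PHENOM_II_GHZ)
at_target = relative_performance(BULLDOZER_IPC_RATIO, 4.3, PHENOM_II_GHZ)

print(f"Break-even clock:       {break_even_ghz:.2f} GHz")  # 3.53 GHz
print(f"Relative perf @ 3.6GHz: {at_launch:.2f}x")          # 1.02x
print(f"Relative perf @ 4.3GHz: {at_target:.2f}x")          # 1.22x
```

Under these assumptions the 3.6GHz launch clock barely clears the break-even point, while the roughly 4.3GHz target would have yielded a single threaded advantage of about 20%.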

We've recently reported on GlobalFoundries' issues with 32nm yields. I can't help but wonder if the same type of issues that are impacting Llano today are also holding Bulldozer back.

430 Comments

  • Hrel - Wednesday, October 12, 2011 - link

    yes, I use it as a term that means being cheap. My friends often call me jewish cause I hunt for bargains pretty relentlessly.
  • silverblue - Friday, October 14, 2011 - link

    I get called Scottish for the same thing. ;)
  • poohbear - Wednesday, October 12, 2011 - link

    so disappointing to read this. What on earth were they doing all this time?? AMD's NEW cpu can't even outperform its OLD CPU? well atleast i can stick with my PhenomII X6 till Ivy Bridge comes out & thank goodness i didnt buy a pricey AM3+ before reading reviews.:p So sad to see AMD has come to this.....
  • OutsideLoopComputers - Wednesday, October 12, 2011 - link

    I think when Anand publishes benchmarks with a couple of Bulldozers working together in a dual or quad-socket board (Opteron), THEN we will see why AMD designed it the way they did. If the FX achieves parity and sometimes superiority in heavily multithreaded apps vs Sandy Bridge in a single socket, then imagine how two or four of these working together will do in server applications vs Sandy Bridge Xeon. I'll bet we see superiority in most server disciplines.

    I don't think this silicon was designed to go after Intel desktop processors, but to perform directly with dual and quad socket Xeon.

    It's intended to be an Opteron right now, and as an afterthought, to be sold as an FX desktop single socket part, to bridge the gap between A-series and Opteron.
  • JohanAnandtech - Wednesday, October 12, 2011 - link

    Indeed. The market for high-end desktop parts is very small, with low margins, and shrinking! The mobile market is growing, so AMD A6 and A8 CPUs make a lot more sense.

    The server market keeps growing, and the profit margins are excellent because a large percentage of the market wants high end parts (compare that to the desktop market, where almost everyone wants the midrange and budget parts). The Zip and encryption benchmarks show that Bulldozer is definitely not a complete failure. We'll see :-)
  • g101 - Wednesday, October 12, 2011 - link

    Good to see an intelligent reviewer that knows how to do more than run synthetic benchmarks and games.

    It's funny seeing all the uneducated gamer "complete failure" comments.
  • bassbeast - Thursday, February 9, 2012 - link

    I'm sorry but you are wrong sir and here is why: They are marketing this chip at the CONSUMER and NOT the server, which makes it a total Faildozer.

    If they would have kept P2 for the consumer and kept BD for the Opteron then you sir would have been 100% correct, but by killing their P2 they have just admitted they are out of the desktop CPU business and for a company that small that is a seriously DUMB move. Their Athlon and P2 have been the "go to" chip for many of us system builders because it gave "good enough" performance in the apps that people use, but Faildozer is a hot pig of a chip that is worse for consumer loads in every. single. way. over the P2.

    I'm just glad i bought an X6 when i did, but when i can no longer get the P2 and Athlon II for new builds i'll be switching to intel, the BD simply is worthless for the consumer market and NEVER should have been marketed to it in the first place! so please get off your high horse and admit the truth, the BD chip should have never been sold for anything but servers.
  • haplo602 - Wednesday, October 12, 2011 - link

    This is a server CPU abused for the desktop.

    Have a look at FPU performance. Almost clock for clock (3.3G vs 3.6G) it beats 6 FPU units in Phenom X6. That's quite nice.

    Once they do some optimisations on a mature process, this will achieve SB performance levels. However until then I am going for 2389 optys ....
  • GourdFreeMan - Wednesday, October 12, 2011 - link

    You introduce the fact that AMD lengthened the pipeline transitioning to Bulldozer without explicitly mentioning the pipeline length. How many stages exactly is Bulldozer's pipeline?
  • duploxxx - Wednesday, October 12, 2011 - link

    Well, there clearly seems to be something wrong with the usage of the modules in combination with the way too high latency on every cache and memory. Single threaded performance is hit by that, and so is gaming performance.

    So I hope anandtech can have a clear look at the following thread and continue to seek further:
    http://www.xtremesystems.org/forums/showthread.php...

    Secondly, during OC, just like with the previous gen, do something more with NB OC instead of just upping the GHz; there is more to an architecture than just the GHz....
