The Impact of Bulldozer's Pipeline

With a new branch prediction architecture and an unknown but presumably significantly deeper pipeline, I was eager to find out just how much of a burden AMD's quest for frequency had placed on Bulldozer. To do so I turned to the trusty N-Queens solver, now baked into the AIDA64 benchmark suite.

The N-Queens problem is simple. On an N x N chessboard, how do you place N queens so they cannot attack one another? Solving the problem is incredibly branch intensive, and as a result it serves as a great measure of the impact of a deeper pipeline.
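To make the branchy nature of the problem concrete, here is a minimal backtracking sketch in Python. It is not AIDA64's implementation, which isn't public; the function name `count_n_queens` and the array layout are purely illustrative assumptions. The point is the conflict test inside the loop: whether it is taken depends on every queen placed so far, which is the kind of data-dependent branch that is hard to predict.

```python
def count_n_queens(n):
    """Count placements of n mutually non-attacking queens on an n x n board.

    Illustrative sketch only; not the AIDA64 code.
    """
    cols = [False] * n               # column already holds a queen
    diag_down = [False] * (2 * n - 1)  # diagonals indexed by row + col
    diag_up = [False] * (2 * n - 1)    # diagonals indexed by row - col + n - 1

    def place(row):
        if row == n:                 # every row has a queen: one full solution
            return 1
        total = 0
        for col in range(n):
            d1 = row + col
            d2 = row - col + n - 1
            # Data-dependent branch: outcome depends on all earlier placements.
            if cols[col] or diag_down[d1] or diag_up[d2]:
                continue
            cols[col] = diag_down[d1] = diag_up[d2] = True
            total += place(row + 1)
            # Undo the placement before trying the next column (backtrack).
            cols[col] = diag_down[d1] = diag_up[d2] = False
        return total

    return place(0)


if __name__ == "__main__":
    print(count_n_queens(8))  # 92
```

Nearly all of the work here is test, branch, recurse and undo, with very little arithmetic in between, so every mispredicted branch costs a pipeline flush that a deeper pipeline makes more expensive.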

The AIDA64 implementation of the N-Queens algorithm is heavily threaded, but I wanted to first get a look at single-core performance, so I disabled all but a single integer/FP core on Bulldozer, as well as on the competing processors. I looked at both constant-frequency and turbo-enabled speeds:

Single Core Branch Predictor Performance—AIDA64 Queens Benchmark

Unfortunately things don't look good. Even with turbo enabled, the 3.6GHz Bulldozer part needs another 25% higher frequency to equal a 3.6GHz Phenom II X4. Even a 3.3GHz Phenom II X6 does better here. Without being fully aware of the optimizations at work in AIDA64 I wouldn't put too much focus on Sandy Bridge's performance here, but Intel is widely known for focusing on branch prediction performance.

If we let the N-Queens benchmark scale to all available threads, the performance issues are easily masked by throwing more threads at the problem:

SMP Branch Predictor Performance—AIDA64 Queens Benchmark

However, it is quite clear that for single-threaded or lightly threaded workloads that are branch heavy, Bulldozer will be in for a fight.


430 Comments


  • nofumble62 - Thursday, October 13, 2011 - link

    Crappy building block will mean crappy building.
  • richaron - Friday, October 14, 2011 - link

    At first I was pissed off by being strung along for this pile of tripe. After sleeping on it, I am not completely giving up on this SERVER CHIP:
    1) FX is a performance moniker, scratch stupid amount of cache & crank clock
    2) I'm sure these numbties can get single thread up to thuban levels
    3) Patch windows scheduler ffs
    Fix those (relatively simple) things & it will kick ass. But it means most enthusiasts won't be spending money on AMD for a while yet.
  • 7Enigma - Friday, October 14, 2011 - link

    Biggest problem for a server chip is the load power levels. It just doesn't compete on that benchmark, and it's one that is VERY important for a server environment from a cost/heat standpoint.

    Let's hope that's just a crappy leaky chip due to manufacturing but it's too early to tell.
  • richaron - Friday, October 14, 2011 - link

    I've worked in a 'server environment'. of course power consumption is an issue. at the lower clock speeds & considering multithread performance, this is already a good/great contender. virtual servers & scientific computing this is already a winnar.
    with a few (hardware & software) tweaks it could be a GREAT pc chip in the long term.
  • ryansh - Friday, October 14, 2011 - link

    Anyone have a BETA copy of WIN8 to see if BD's performance increases like AMD says it will?
  • silverblue - Friday, October 14, 2011 - link

    There's benchmarks here and there but nothing to say it'll improve performance more than 10% across the board. In any case, the competition also benefits from Windows 8, so it's still not a sign of AMD closing any sort of gap in a tangible fashion.
  • Pipperox - Friday, October 14, 2011 - link

    But Bulldozer is different.
    Windows 7 scheduler does not have a clue about its "modules" and "cores".
    So for example it may find it perfectly legit to schedule 2 FP intensive threads to the same module.
    Instead this will result in reduced performance on Bulldozer.
    Also one may want to schedule two integer threads which share the same memory space to the same module, instead of 2 different modules.
    This way the two threads can share the same L2 cache, instead of having to go to the L3 which would increase latency.

    All of the above does not apply to Thuban; to a lesser degree it applies to Sandy Bridge, but Windows 7 scheduler is already aware of Sandy Bridge's architecture.
  • nirmv - Saturday, October 15, 2011 - link

    Pipperox, It's not different than Intel's Hyper Threading.
  • Pipperox - Sunday, October 16, 2011 - link

    It is, although they're similar concepts.
    Let's make an example: you have 2 integer threads working on the same address space (for example two parallel threads working in the same process).
    All cores are idle.
    What is the best scheduling for a Hyperthreading cpu?
    You schedule each thread to a different core, so that they can enjoy full execution resources.

    What is best on Bulldozer?
    You schedule them to the SAME module.
    This because the execution resources are split in a BD module, so there would be no advantage to schedule the threads to different modules.
    HOWEVER if the 2 threads are on the same module, they can share the L2 cache instead of the L3 cache on BD, so they enjoy lower memory latency and higher bandwidth.

    There are cases where the above is not true, of course.

    But my example shows that optimal scheduling for Hyperthreading can be SUB-optimal on Bulldozer.

    Hence the need for a Bulldozer-aware scheduler in Windows 8.
  • Regs - Friday, October 14, 2011 - link

    AMD needs a 40-50% performance gain and they're not going to see it using windows 8. What AMD needs is...actually I have no clue what they need. I've never been so dumbfounded about a product that makes no sense or has any position in the market.
