POST A COMMENT

84 Comments

Back to Article

  • thunderising - Wednesday, May 30, 2012 - link

    Glad AMD has "Greater Performance" planned sometime in the future. Wow! Reply
  • themossie - Wednesday, May 30, 2012 - link

    "There are also other factors at play, though, as it's already known that StarCraft II doesn't use more than two cores; theinstead, it's likely the..."

    (feel free to remove comment after fixing this)
    Reply
  • Nightraptor - Wednesday, May 30, 2012 - link

    I am wondering if it would be possible to compare the processor performance of the Trinity A10 with a underclocked FX-4100 set to the same frequencies (I don't know if it is possible to disable the L3 cache on the FX-4100). This might give us a rough idea of how much the improvements of Piledriver have bought us. Just doing rough math in my head it would seem that they have to be pretty significant given how a FX-4100 compared to the Phenom II X4's (it lost alot, if not most of the time t of the time). The new A10 Trinity's on the other hand seem to win most of the time compared to the old architecture. Given that the A10 is a Piledriver based FX-4XXX series equivalent minus the L3 cache it would seem that Piledriver brought very significant enhancements. Either that or the Phenom II era processors responded much more poorly to the lack of L3 than Piledriver does. Reply
  • coder543 - Wednesday, May 30, 2012 - link

    I was hoping they would be doing the same thing, even though it would be challenging to draw real information out of comparing a desktop processor and a mobile processor. Reply
  • SleepyFE - Wednesday, May 30, 2012 - link

    +1 Reply
  • kyuu - Wednesday, May 30, 2012 - link

    +2

    If you can figure out some way to do a comparison and analysis of Piledriver's performance vs. Bulldozer, I think a great many of us are interested to see that. From benchmarks, it seems like Piledriver improved a great deal over Bulldozer, but it's difficult to tell without being able to compare two similar processors.
    Reply
  • Aone - Wednesday, May 30, 2012 - link

    You can compare A10-4600M or A8-4500M versus mobile Llano or Phenom or even Turion to see tweaked BD is nothing of spectacular. For instance, in most cases A8-4500M (2.3GHz base) loses to Llano A8-3500M (1.5GHz base). Reply
  • Nightraptor - Wednesday, May 30, 2012 - link

    Where are you getting the benchmarks from that a Trinity loses to Llano. In almost all the benchmarks I have been able to find (with the exception of a few) it seems that Trinity beats Llano, hence the original post. If the Piledriver enhancements were very minor I would've expected Trinity (a hacked quad core) to lose to Llano most of the time (a true quad core). This didn't appear to happen - at least not in the anandtech review. Reply
  • Aone - Wednesday, May 30, 2012 - link

    "http://www.notebookcheck.net/AMD-A-Series-A8-4500M...
    Look at "Show comparison chart". Great info!
    Reply
  • Nightraptor - Thursday, May 31, 2012 - link

    I'm not a big fan of the reliability of that website - They tend to be pretty scant on the test circumstances and configurations. Furthermore I'm curious where they are getting the informaiton for the A8-4500 as to the best of my knowledge the only Trinity in the wild at the moment is the A10 which AMD sent out in a custom made review laptop. All they list is a "K75D Sample". Reply
  • Schmide - Wednesday, May 30, 2012 - link

    I do remember from some analysis that the L2 cache reads were as slow as main memory. That's great if you hit a L2 cache, but it's not going to buy you anything if it's that slow. Reply
  • SocketF - Wednesday, May 30, 2012 - link

    Impossible, you probably mix some things up, maybe latency and bandwidth? Reply
  • Schmide - Wednesday, May 30, 2012 - link

    Yup. It was late at night, I was thinking writes. the L1 write through basically makes L1 writes the same as L2 writes. Reply
  • Homeles - Wednesday, May 30, 2012 - link

    Not even close. L2 is about 10 times faster than main memory.

    http://www.anandtech.com/show/4955/the-bulldozer-r...
    Reply
  • jcollake - Wednesday, May 30, 2012 - link

    Through research here at Bitsum on the AMD Bulldozer platform (specifically the 9150), I found a couple things of interest.

    First, disabling CPU core parking seems to make a big difference in performance. I believe that by default the CPU core parking is just too aggressive. I wrote a tool to let you enable or disable CPU parking in *real time* without a reboot, so you can test this yourself. It is called ParkControl, http://bitsum.com/about_cpu_core_parking.php . For *me*, it seemed to make a night and day difference.

    Second, I am working on a neat little benchmarking tool called ThreadRacer, currently only in alpha prototype. It allows you to really see the effects of these paired cores, and how much it matters that the scheduler is properly aware of them. Take this 1 second or so sample, as seen in the screenshot here (downloads available, but it is an early prototype that I'll quickly be finishing up): http://bitsum.com/forum/index.php/topic,1434.0.htm...

    The scheduler update that Microsoft issued of course treats these paired cores as it would a hyper-threaded core. Indeed, the concept is very similar, except perhaps to avoid patents, AMD took the 'share a little' instead of 'share a lot' approach when it comes to shared computational resources. This was the proper way to *quickly* address the issue, but I believe the scheduler is still suboptimal on these processors (likely to be resolved in Windows 8 or a later update to Windows 7/Vista).

    For Bulldozer, as you know, they are two real processors, but because they have shared dependencies, the performance can really be drained if the other processor in the 'pair' is busy. You can see the effects from ThreadRacer, the core without its pair busy quickly out-paced the paired cores that were both busy.
    Reply
  • jcollake - Wednesday, May 30, 2012 - link

    I should have also mentioned that ThreadRacer also allows you to see how a single CPU consuming thread gets swapped around to different cores (the multi-core thread in the utility). This is its other use. The less the thread gets swapped from core to core, the greater the performance will be. It is interesting to compare and contrast the behavior of the scheduler. I fully believe that most the problems with Bulldozer are due to the Windows scheduler, something that could be tested by using linux and replacing the scheduler with a custom one, or an off the shelf alternative that may behave substantially differently than the Windows scheduler. Reply
  • SocketF - Wednesday, May 30, 2012 - link

    Some people running BOINC programs have reported that Windows-applications run faster when they use a Linux and WINE or a VM.

    The Win-scheduler especially hurts AMD chips, because of the huge exclusive caches. If a thread on an intel CPU is switched to another core, it can load the warmed up L2 portion from the L2 inclusive L3.

    I did some google-search and it seems that under Linux, each core has its own run-queue, whereas on Windows, there is only one run queue for all cores.

    But i didn't delve into it deeply, there are so many different schedulers for Linux, seems to be a complex issue ;-)

    Btw. your link to download is off limits for non-members of your discussion board:
    -------------------------
    Warning!

    The topic or board you are looking for appears to be either missing or off limits to you.
    Please login below or register an account with Bitsum Forums.
    ----------------------------

    Maybe you can upload it somewhere else?
    Reply
  • jcollake - Saturday, September 01, 2012 - link

    Sorry for the late reply. First, the forum permissions were fixed. Second, the utility (still in early stages) is included in Process Lasso *and* available here: http://bitsum.com/threadracer.php Reply
  • eoerl - Wednesday, May 30, 2012 - link

    Very interesting article, together with the hardware.fr report there's a lot of information. One question though, if you read commentaries : you didn't speak much about the influence of compilers. This proved to change a lot of things on Linux (see phoronix extensive tests on both ivy bridge and bulldozer depending on compiler used and compiler options, for example
    http://www.phoronix.com/scan.php?page=article&...
    http://www.phoronix.com/scan.php?page=article&...
    Benchmark results really change a lot with bulldozer, much more than with ivy or sandy bridge. Do you think AMD lost being oversensitive to compiler optimisations, due to a very original architecture ?
    Reply
  • JohanAnandtech - Thursday, May 31, 2012 - link

    I deliberately avoided the compiler issues as this would make the article too convoluted. But notice that what we found is not influenced by compiler choice: we find the same indications in SAP and SQL server (compiled by "conservative" compilers and compiler settings) as in CPU CPU 2006, which uses the best optimized settings and compiler as possible. Reply
  • ArteTetra - Wednesday, May 30, 2012 - link

    "A core this complex in my opinion has not been optimized to its fullest potential. Expect better performance when AMD introduces later steppings of this core with regard to power consumption and higher clock frequencies."

    You don't say?
    Reply
  • JohanAnandtech - Thursday, May 31, 2012 - link

    A quote by a reader, not ours :-). The idea is probably that Bulldozer was AMD's very first implementation of their new architecture. Reply
  • haplo602 - Wednesday, May 30, 2012 - link

    now this was a great read. finaly something interesting (the consumer benchmarks are NOT intereseted anymore for me).

    I hope there will be a differential analysis once you have Piledriver CPUs available.
    Reply
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Piledriver analysis: definitely. Thanks for the encouraging words :-) Reply
  • mikato - Friday, June 01, 2012 - link

    I agree - great critical thinking in this article! This subject definitely needed more research. Reply
  • Spunjji - Wednesday, June 06, 2012 - link

    +1. This is the sort of thing I come here for! Reply
  • Beenthere - Wednesday, May 30, 2012 - link

    Expecting Vishera to be an Intel killer is foolish as it's not going to happen and there is no need for it to happen. Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot. At least FX chips OC and scale well unlike Ivy Bridge.

    If AMD can use some of the techniques imployed in Trinity they should be able to get a 15+% improvement over the FX CPUs. This combined with higher clockspeeds now that GloFo has sorted 32nm production should provide a nice performance bump in Vishera.

    95% of consumers do not buy the fastest, most over-hyped and over-priced CPU on the planet for their PC or server apps. Mainstream use is what AMD is shooting for at the moment and doing pretty well at it. Eventually they will release APUs for all PC market segments that perform well, use less power and cost less than discrete CPU/GPU combo. THAT is what 95% of the X86 world will be using.
    Reply
  • Homeles - Wednesday, May 30, 2012 - link

    "Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot"

    I think you need to go read about Intel's tick-tock strategy.

    Also, unlike Bulldozer, Ivy Bridge was a step forward. A small one, but performance per watt went up, while with Bulldozer it often went backwards.

    Process maturity from GloFo will help, but probably not as much as you would think.

    Finally, "95% of users" aren't going to benefit best from a processor built with server workloads in mind. Even with server workloads, Bulldozer fails to deliver. APUs are definitely the future, but keep in mind that Intel's had an APU out for as long as AMD has. If you think that AMD's somehow going to pull a fast one on Intel, you're delusional. Intel and Nvidia as well are very, very well aware of heterogeneous computing.
    Reply
  • The_Countess - Wednesday, May 30, 2012 - link

    looking at how much the performance per watt went up with piledriver compared with llano, I think they''ll have a lot more headroom on the desktop and server space to increase the clock frequencies to where they are suppose to be with the bulldozer launch. Reply
  • Homeles - Wednesday, May 30, 2012 - link

    Yeah, Piledriver will likely perform the way AMD had intended Bulldozer to perform. Reply
  • Spunjji - Wednesday, June 06, 2012 - link

    Agreed. That will be nice! Reply
  • haukionkannel - Wednesday, May 30, 2012 - link

    Very nice article! Can we get more thorough explanation about µop cache? It seems to be important part of Sandy bridge and you predict that it would help bulldoser...
    How complex it is to do and how heavily it has been lisensed?
    Reply
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Don't think there is a license involved. AMD has their own "macro ops" so they can do a macro ops cache. Unfortunately I can not answer your question of the top of head on how easy it is to do, I would have to some research first. Reply
  • name99 - Thursday, May 31, 2012 - link

    Oh for fsck's sake.
    The stupid spam filter won't let me post a URL.

    Do a google search for
    sandy bridge Real World Technologies
    and look at the main article that comes up.
    Reply
  • SocketF - Friday, June 01, 2012 - link

    It is already planned, AMD has a patent for sth like that, google for "Redirect Recovery Cache". Dresdenboy found it already back in 2009:

    http://citavia.blog.de/2009/10/02/return-of-the-tr...

    The BIG Question is:
    Why did AMD not implement it yet?

    My guess is that they were already very busy with the whole CMT approach. Maybe Streamroller will bring it, there are some credible rumors in that direction.
    Reply
  • yuri69 - Wednesday, May 30, 2012 - link

    Howdy,
    FOA thanks for the effort to investigate the shortcomings of this march :)

    Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures. ' This gives us 12 stages on K8/K10 => 12 * 1.25 = 15.

    Btw all the major and significant architectural improvements & features for the upcoming BD successor line were set in stone long time ago. Remember, it takes 4-5 years for a general purpose CPU from the initial draft to mass availability. The stage when you can move and bend stuff seems to be around half of this period.
    Reply
  • BenchPress - Wednesday, May 30, 2012 - link

    "This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."

    This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
    Reply
  • Arnulf - Wednesday, May 30, 2012 - link

    Ugh, so much crap in a single article ... this should never have been posted on AT.

    You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
    Reply
  • erikvanvelzen - Wednesday, May 30, 2012 - link

    Yes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.

    Don't fool yourself.
    Reply
  • Homeles - Wednesday, May 30, 2012 - link

    It's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really? Reply
  • Taft12 - Wednesday, May 30, 2012 - link

    Johan, this is the best article I've read on Anandtech in quite some time, even better than Jarred, Ryan and Anand have come up with lately.

    The level of analysis goes far, far beyond just what the benchmarks show.

    Bravo!
    Reply
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Great! Good to read there are still people that like these kinds of analysis!

    :-)
    Reply
  • ct760ster - Wednesday, May 30, 2012 - link

    Would be interesting if they could test the aforementioned benchmark in an OS with a customizable kernel like GNU-Linux since code optimization is not possible in most of the proprietary format benchmark. Reply
  • alpha754293 - Wednesday, May 30, 2012 - link

    What about the lacklustre FPU performance?

    The very fact that the FP has to be shared between two integer cores and as far as I know, it cannot run two FP threads at the same time, so for a lot of HPC/computationally heavy workloads - Bulldozer takes a HUGE performance hit. (almost regardless of anything/everything else; although yes, it counts, but remembering that CPUs are glorified calculators, when you take out one of the lanes of the highway and two-lane traffic is now squeezed down to one lane, it's bound to get slower.)
    Reply
  • The_Countess - Wednesday, May 30, 2012 - link

    except the FP CAN run 2 threads at the same time.
    only for the as yet pretty much unused 256bit instructions does it need the whole FP unit per clock.

    in fact the FP can run 2 threads of 128bit, or 4 even of 64bit.
    and a single CPU can use 2x128bit or both can use 1x128.
    intel and AMD previously had only 1x128bit capability per core.
    so there is no regression in FP performance per core. its just much more flexible.
    Reply
  • Zoomer - Wednesday, May 30, 2012 - link

    FPU throughput is much more irrelevant nowadays, as many FP intensive HPC computations have already been ported to GPUs. Yes, there may be instances where there might be FP heavy and branchy, not easily parallelization or otherwise unsuitable, but such beasts are few and far between. I can't think of any, to be honest. Reply
  • Iger - Wednesday, May 30, 2012 - link

    Thanks a lot, that was a very interesting read! Reply
  • Rael - Wednesday, May 30, 2012 - link

    AMD should fire all its marketing department, because these guys accustomed to lie at every announcement they make. The performance gains are multiplied by five or ten, and the per-core advancement, which is close to zero, is presented as 'significant'.
    I don't believe these announcements anymore.
    Reply
  • jabber - Wednesday, May 30, 2012 - link

    What the whole of the AMD Marketing team?

    Thats Tim the caretaker and Trisha on the front desk isnt it?

    I thought AMD's marketing budget was around $42.
    Reply
  • kyuu - Wednesday, May 30, 2012 - link

    Oh hai. You must be new to the human race. Marketing and "stretching the truth" have been synonymous since... forever. AMD is hardly exceptional in this regard. Stop believing anything any marketing department sells you, period. Reply
  • Homeles - Wednesday, May 30, 2012 - link

    This. Read 3rd party reviews (like AnandTech!) -- several of them -- and draw your conclusions from there. That's pretty much the point of reviews; if marketing teams could provide honest, reliable benchmarks over a wide range of applications, we'd have little need for 3rd party reviews. Reply
  • Mugur - Thursday, May 31, 2012 - link

    Well... they actually did! Reply
  • moravista - Wednesday, May 30, 2012 - link

    Great article Johan! I have been reading your articles since the Pentium III / K6-2 days and have really enjoyed them! Thanks for sharing your insight! Keep 'em coming! Reply
  • JohanAnandtech - Friday, June 01, 2012 - link

    Great to hear from you. Did you used to participate at the different forums on a different callsign? Reply
  • muy - Wednesday, May 30, 2012 - link

    i want a phenom II x4 980+ on 32 nm. this whole idea of "lets put as many crippled dual cores on a die and smack a level 3 cache on top and call it out next cpu" is utter crap stuff that doesn't multi thread well (95 % of all stuff).

    6 core bulldozer i bought to replace my amd x3 450 is slower than the chip i wanted to replace at the same clock speed. now i have a shiny asus rog mb, a x3 450 powering it, and a 6 core bulldozer gathering dust. what a waste of money that was.

    shame i can't find any x4 970+'s anymore and amd is to foolhardy to keep manufacturing their best gaming cpu's, let alone do a shrink on them to 32 nm.

    i can only imagine how much better a phenom 2 x4 9xx, default clocked at 4.2 ghz+ would be than any bulldozer. (and how much cheaper to manufacture considering the die size compared to the die size of bulldozer).

    i just don't understand amd.
    Reply
  • Roland00Address - Wednesday, May 30, 2012 - link

    Microcenter has these following processors
    1045t six core for $99
    965 quad core black edition for $99
    960t quad core black edition for $89 (this model is a disabled six core and has a possibility of unlocking to a 6 core. The 960t is a clearance processor so it is while supplies last.
    Reply
  • fic2 - Thursday, May 31, 2012 - link

    Those are all 45 nm. He is wanting a tick - a die shrunk Phenom II.
    Would have to agree with him. If AMD would do a die shrink they would have a killer product - assuming GloFo didn't f*ck it up.
    Reply
  • muy - Wednesday, May 30, 2012 - link

    bulldozer doesn't do single threaded, highly branching (cough games cough) stuff well.

    and before you say "some games use multiple cores", i'll say that 1 core running on 100 % and 7 cores at 5 % is not a good use of multi threading.

    (1 * 100) + (7 * 5) = (1 * 100) + (1 * 35) - 1.35 cores used. this means that a DUAL core going at 10 % higher speed than the exampled 8 core would be 10 % faster than the 8 core 'using' it's 8 cores.

    clock speed + ipc are the only things that matter 90% + of the time for games.
    Reply
  • wolfman3k5 - Wednesday, May 30, 2012 - link

    People don't buy CPUs based on theoretical performance, ideology or brand loyalty (OK, some fan-boys do). Most of us are not computer engineers, and even if we where, it wouldn't matter, because at the end of the day only the end result would matter: performance, efficiency and price. Just like I didn't buy Intel because it looked good on paper back in the glory days of AMD (cca. 2005). So no matter how deep and involved these articles are, AMD still trails Intel when it comes to performance, and it will do so until their lazy and incompetent CPU engineers will get off their lazy buts and start working. The sole reason why Bulldozer was such a massive fail was because most of the design process was highly automated. So, stop slacking and start working lazy AMD engineers! Reply
  • Homeles - Wednesday, May 30, 2012 - link

    Being a "lazy" electrical engineer is practically impossible. The amount of work that has to go into making these processors simply function is quite massive. These guys work hard to get to where they are with their careers and work even harder to keep those careers. The margin of error here is also quite huge... a small flaw can create enormous performance penalties.

    I'd be willing to bet that many, if not most of Bulldozer's shortcomings could be blamed on management. Saying it was "lazy engineers" is callous and ignorant.
    Reply
  • Zoomer - Thursday, May 31, 2012 - link

    True. It's probably better out way back then, but synthesized, than to come out maybe next year with all their lovingly fully customized, hand placed transistors. That's if they don't go bankrupt first.

    wolfman3k5's probably going to call nVidia, 3dfx, ATi (then), most FPGA program design houses, etc, lazy, too.
    Reply
  • misiu_mp - Monday, June 11, 2012 - link

    A large margin of error means that you have a lot of space to make errors with little consequence.
    You meant of course that engineers have small margins of error in their work.
    Reply
  • 500MM - Wednesday, May 30, 2012 - link

    http://images.anandtech.com/graphs/graph5057/42770...

    If lower was better, AMD would have one kickass CPU. The caption is wrong.
    Reply
  • JohanAnandtech - Friday, June 01, 2012 - link

    Fixed, thx! Reply
  • weebnuts - Wednesday, May 30, 2012 - link

    The problem with all these benchmarks is that most organizations are going to be using this is Xen or Vmware uses. The idea is that with more cores, you can run more VM's especially if you are trying to implement Virtual Desktops. How do the processors compare when you are loading the server to 80-90% capacity with lots of VM's? That's a real world comparison I want to see. Reply
  • Iketh - Wednesday, May 30, 2012 - link

    I was dying for information like this. Thank you!

    And as for that quote on the first page by Iketh, that guy is a genious!! :D
    Reply
  • Aone - Thursday, May 31, 2012 - link

    1) Maybe i missed something but, Should "Higher is better" be for "Data Cache hitrate", i.e. opposite to cache misses?

    2) And on the chart "L2 Cache hitrate", is it correct that "Opteron 6276" tag is shown on first line while "Opteron 6174" on the last line? I thought Opteron 6174 was faster in MS SQL than Opteron 6276.
    Reply
  • mrdcook - Thursday, May 31, 2012 - link

    There are a few new instructions in Bulldozer's architecture that, for certain specific computations, can make it 10X faster than Intel. For example, FMA. An FMA does a multiply and then an add as one instruction, rounding only once. Combining the multiply and the add isn't such a big deal (and in many cases can even be counter-productive), but rounding only once is very important in some cases.

    For example, assume you have 3 digits of accuracy and want to calculate (1.23 * 2.31 - 2.84). Without FMA, you calculate Round(1.23 * 2.31) = 2.84, then you calculate Round(2.84 - 2.84) = 0. With FMA, you calculate 1.23 * 2.31 = 2.8413, then you calculate Round(2.8413 - 2.84) = 0.0013. While that may seem contrived (it was!), the difference is significant in certain simulations and calculations.

    When doing math, computers have a very specific level of accuracy -- a certain number of digits of precision. If you want your simulations to come out right, you have to take these limits into account. Learning how to account for the computer's rounding errors is a bit of a black art.

    Mathematicians design algorithms in terms of matrix multiplications and dot products, and if you translate those algorithms directly into computer multiplications and additions, you tend to end up with a lot of cancellation errors like the example given above. You can hire a computer science grad student to rework your algorithm to not lose accuracy, but that is expensive and has to be done for every new algorithm. Or you can use an FMA for the dot products and the matrix multiplications (the high-accuracy dot product and matrix multiplication libraries already do this).

    FMA in software is slow. Single-precision emulated FMA isn't too bad since you can use double-precision to help with the hardest bits of the emulation. The result is that you can do one fmaf in about 4X the amount of time it would take to do a single a*b+c. However, SSE2 allows you to do 4 a*b+c at a time, so emulated single-precision FMA ends up being about 15X slower than optimized SSE2 non-fused multiply-add. Double-precision is harder, taking about 10 times longer than a single a*b+c, so it ends up being 20X slower than non-fused multiply-add.

    Admittedly, the target market for FMA is probably smaller than a breadbox, but those who need it really need it. And as it becomes more common, it'll only become more important. For now, since only Bulldozer has it, nobody is going to care.
    Reply
  • BaronMatrix - Thursday, May 31, 2012 - link

    There are admittedly only two viable X86 licensees in America and one of them sucks... Reply
  • shodanshok - Thursday, May 31, 2012 - link

    Hi Johan,
    first of all, let me thank you for your wonderful analysis on Bulldozer architecture. I read it with great interest.

    However, I think that you left out a very important thing to mention: L1/L2 cache read/write bandwidth. Especially for L2, while latency is an important thing, throughput can be an even more crucial one.

    The key point is that Bulldozer has an write-through L1 cache, so all L1 writes are more or less immediately broadcasted to L2 cache. Some small writes can be effectively cached inside a write-back combining buffer called Write Combining Cache (WCC), but this cache is only 4KB in size per the entire module. So, streaming writes will immediatly fill the WCC and bring down L1 cache speed to L2 levels.

    This can really hamper CPU performance. Obviously, AMD went this road for some understandable reasons, however, the WCC is really too small to cache much data and the L2 is way too slow to efficiently serve L1 write requests.

    This bring us to another point: L2 cache is slow. Comparing this with the super-fast (but much smaller) L2 Intel cache, it has no hope; it is more or less at Intel's L3 level.

    Here you can find my analysis of AMD Bulldozer architecture: http://www.ilsistemista.net/index.php/hardware-ana...">AMD Bulldozer analysis
    Note that, while I collected and normalized data from multiple web site, I left very clear what was the original reference (so that you can easily verify my data).

    Thanks.
    Reply
  • shodanshok - Thursday, May 31, 2012 - link

    Mmm... the link was malformed in the previous message.

    The correct one is: http://www.ilsistemista.net/index.php/hardware-ana...

    Thanks.
    Reply
  • name99 - Thursday, May 31, 2012 - link

    "First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache."

    I know you guys are interested in the question --- why does Bulldozer frequently suck? --- rather than the question ---why is Sandy Bridge so much better? --- but it is this latter question that interests me the most.

    What strikes me, on going through all this data (including information that is NOT in the article, and on my experience back in the day when I was writing assembly and counting cycles) is that the "eventual cost" of misses that go all the way to RAM is not covered in the article, and I suspect this is a large part of the issue.

    What I mean here is the following: consider an extremely simple model --- an L1 hit takes 1 cycle, an L1 miss that goes to RAM takes 100 cycles. Then a 97% L1 hit rate takes a total of basically 400 cycles; a 99% L1 hit rate takes 200 cycles --- apparently minor differences have a huge effect! But that's not the point I want to focus on.
    Let's make the model more complicated. First let's make L1 hit cost more realistic --- 4 cycles. As the article says, this is, for the most part, trivially hidden by the OoO engine. But then why can't the OoO engine also hide all or most of the cost of all those cycles to RAM?
    And that, I think, is where the Intel advantage is. They do such a good job with their OoO engine.

    At a gross level, OoO engines all look kinda the same --- look at a PPC 750 and an IB and, at a superficial level, they look similar. But firstly the IB has just so much larger buffers (what, 168 or so, compared to the 750s what, 6 or so) that, of course, it has a vastly larger stock of instructions it can keep chewing through as it waits for the RAM.

    But, you say, AMD also has large buffers now. Yes, but it's not only the raw buffers. Whenever you start looking at these chips, you discover all sorts of weird limitations on what they can actually do to use all those buffers. I've no idea what the current exact limitations are, but the sort of thing you would have in the past is that maybe all the buffers are flushed on an interrupt or system call; or there'd be strange conditions that could occur where, although in theory the integer engine could keep going past a blocking FP instruction, it turned out to be easier to prevent some race condition by freezing the integer engine under these conditions.

    Secondly while you're executing other instructions, waiting on your RAM, you may well execute a few more load/store instructions that again miss in RAM. How well do you handle these? Can you just keep firing out these load/stores, or do you block at the second (or third, or fourth)? Frequently these load-stores refer to the same cache-line that's already in play from the first L1 miss, and how do you handle that? the truly dumb thing, of course, is to send out ANOTHER memory request. Smarter is to suppress that, but you're still using load-store entries in the main "miss to RAM" data structures. Smarter still is to be aware that this line will be coming eventually, and use auxiliary data structures to hold info about this load/store.

    It's these sorts of technical details, which don't appear in the gross specs (and sometimes not even in the detailed CPU descriptions) that make so much difference. They are obviously astonishingly difficult to get right. Intel has the manpower to worry about every one of them, AMD does not.

    Point is --- if I had to look for a single difference between the the two, that's what I'd be looking at --- how much time is REALLY wasted waiting on DRAM in SB vs on Bulldozer.
    Reply
  • misiu_mp - Monday, June 11, 2012 - link

    It is the compiler's and Out-Of-Order engine's job to order loads, stores and other instructions to minimize the total execution time.
    So making sure no stupid and unnecessary loads are being committed is what the OOO mechanism normally does.
    There is no reason to suspect it is fundamentally broken in Bulldozer.
    Reply
  • IceDread - Friday, June 01, 2012 - link

    It really is simple, Amd did a Huge mistake.

    The product is a bust, simple as that.

    The next generation or the generations after that might be a whole different matter, but guess what? No one cares. It wont help the poor souls that bought this busted product.

    It's annoying that Amd could not do better because now Intel reigns supreme and competes with itself .
    Reply
  • mikato - Friday, June 01, 2012 - link

    I know I shouldn't feed the trolls but...
    You say next generations might be a whole different matter - well what do you think is the point of learning about the Bulldozer architecture? The next generations are based on it.
    Reply
  • IceDread - Monday, June 04, 2012 - link

    What is the point of releasing a product that does not outperform it's predecessor?
    Hope that people will purchase the product anyway and learn it?
    Which companies would be interested in this, how many? Why would they invest money into this?
    Reply
  • _vor_ - Saturday, June 02, 2012 - link

    Yes. I too would be interested in exactly what aspects you think Bulldozer failed and your design ideas and approach on how you would fix them. Do tell. Reply
  • wiyosaya - Friday, June 01, 2012 - link

    Personally, I think it is always nice to see in-depth articles like this that explain the details of the structure of a processor.

    To me, it sounds like AMD has a foundation that with a few well-directed tweaks, may put them in contention with Intel again in the CPU arena. Though AMD has said that they are through competing with Intel, I truly hope this is not the case. Perhaps this is a marketing tactic remove focus from themselves after the enthusiast arena panned BD and its siblings.

    I've built my systems with AMD for a long time; however, this time I went with Intel because I thought they had the better value. Perhaps the future will bring me back to AMD, however, I cannot see doing so right now simply because Intel has become the "value" line over AMD.

    With an i7-3820 in my most recent rig, I think I picked the SB-E value processor. I run more than games, and some of what I run takes advantage of quad-channel memory.

    In any event, I'm set for a while. Perhaps AMD will once again produce a superior product by the time I am ready for my next build.
    Reply
  • jamyryals - Friday, June 01, 2012 - link

    What a great read, thanks! Reply
  • SocketF - Friday, June 01, 2012 - link

    Hi Johan,

    thanks for the test, it is great.

    However, on page 9 you have some trouble with percentage calculations. You wrote:

    quote:
    -------------------
    We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT.
    -------------------
    This numbers are totally correct and within AMD's predictions. AMD promised 80% performance for the CMT-Bulldozer module, compared to an hypothetical Bulldozer CMP core, i.e. 2 (single) cores.

    So you have to double your single-thread results, to get the score of 2 (single) Bulldozer cores (2 CMP cores). That gives: 0.86 x 2 = 1.72

    Now compare that to the real performance of 2 CMT cores of one module, which is 0.71 x 2 = 1.42

    1.42 are 82.6% of 1.72, which is better than AMD's 80% claim. Thus their claim holds. Everything's fine, don't worry.

    Source of AMD's claim is e.g. here:
    http://techreport.com/r.x/bulldozer-uarch/bulldoze...
    (sorry, didn't find it on anandtech)

    Please update your article accordingly.

    Oh and one last question, why did you add up the SMT scores but not the CMT scores? Seems odd, an IPC of "two threads", This is just weired. Furthermore it is somehow useless, because you cannot compare it directly with the CMT scores. A diagram should visualize the results not force the reader to do some re-calculations.

    Thanks again

    Erik
    Reply
  • Aone - Monday, June 04, 2012 - link

    Bulldozer's conception was wrong from the scratch.
    I told it a few time, let's me explain it here again.

    I'm sure everyone of you do remember AMD's own words "one BD module has 80% of throughput of two independent cores".
    What does this mean in figures?
    Let's take the performance of one core as 1.0 point. Therefore two BD modules would have 3.2 points or in other words less than 10% than 3.0 (performance of three independent cores).
    Should I remind that with development of independent cores AMD wouldn't had wasted resources (engineering, transistors, money and time) on design and debugging the shared logic. The chip could have been much smaller due to the fact that the chip would have had only 1MB L2 and 2MB L3 per each core and no shared logic. And all of those released resources could have been allocated for development of a more advanced core.

    You see that packing two cores inside a one module was wrong even on the conceptional level. I'm very curious who was the main supporter and decision maker of this approach in AMD.

    AMD must through away BD conception and return to standard practice. The only question remains: Does AMD have long enough TTL to do it?

    BTW, I recommend to look through Spec results again. The comparison of 12c Opteron 62xx w/ 12c Opteron 61xx is of special interest. And let's not forget that Opteron 62xx submissions have higher freq, faster memory and as well as more advanced compiler version and extended instruction set.
    Reply
  • TC2 - Monday, June 18, 2012 - link

    I'm agree in 100%!!!
    The BD uA is "unsuccessful" port from graphics uA. There is many and major drawbacks! Note for example one - to write an optimal software you must adopt an application at algorithmic level (in sense of thread specialization)! This is because the both BD-cores are not the same! Also they shares L1 IC, the number of elements is high, ... and many others uA weaknesses.
    Reply
  • evolucion8 - Tuesday, June 17, 2014 - link

    Northwood was 20 stage pipelines and Prescott was 31, not 39... Reply
  • tipoo - Wednesday, October 08, 2014 - link

    Where is the aftermath?

    "But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued...."
    Reply

Log in

Don't have an account? Sign up now