Caching Analysis

It does not take us long to find a suspect for the lower single-threaded performance of the CMT enabled module: the instruction cache.

Instruction Cache Hitrate

The instructions of Cinebench and 7-Zip fit almost perfectly in the instruction cache, but that cannot be said about our MS SQL Server SQL statements. The 8-way 32KB Instruction caches of the latest Intel CPUs are clearly not large enough and shed some light on why the Opteron 6174 performed so well in this benchmark. The older AMD CPU has up to 40% fewer instruction cache misses.

The 2-way 64KB instruction cache was clearly not the optimal choice for caching two threads: the hit rate goes from an excellent 97% down to a mediocre 95% once we enable the second integer thread. It will take some engineering, but increasing the associativity of the L1 instruction cache seems necessary to make sure that the two CMT threads do not hinder each other. Let's move on to the data cache.

Data Cache hit rate

Reducing the data cache from 64KB to 16KB was probably necessary in order to keep the die size of the module under control. (A Bulldozer module is less than 80 mm², while two Magny-Cours cores are good for 115 mm².) However this reduction comes with a price: the data cache suffers twice as many misses as before. Intel's 8-way cache does a bit better, but it is not spectacular. Now let's check out the L2 caches.

L2 Cache hit rate

The very low L2 cache hit rates on the older Opteron and Xeon seem like a fluke but that is not the case. In  the case of Cinebench, don't forget that this benchmark has an extremely low miss rate in the L1 cache, so most of the easy to cache code and data is already there. The relatively high L2 cache miss rate on the Xeon means that 44% of less than 1% misses the L2 cache--or in other words, almost nothing. The data is almost perfectly cache inside the caches and the data cache hit rate is 99.99%. Most of the L2 cache misses are a few hardly used instructions.

The same is true for the relatively bad hit rate of the Opteron 6174 L2 cache in 7-Zip. The Opteron has a higher L1 data cache hit rate than the other CPUs, so the L2 cache is less accessed. The bad L2 hit rate is not the reason for the lower performance of the older Opteron. Which brings us to the final area of analysis....

IPC Analysis Branch Prediction Analysis
Comments Locked

84 Comments

View All Comments

  • Spunjji - Wednesday, June 6, 2012 - link

    Agreed. That will be nice!
  • haukionkannel - Wednesday, May 30, 2012 - link

    Very nice article! Can we get more thorough explanation about µop cache? It seems to be important part of Sandy bridge and you predict that it would help bulldoser...
    How complex it is to do and how heavily it has been lisensed?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Don't think there is a license involved. AMD has their own "macro ops" so they can do a macro ops cache. Unfortunately I can not answer your question of the top of head on how easy it is to do, I would have to some research first.
  • name99 - Thursday, May 31, 2012 - link

    Oh for fsck's sake.
    The stupid spam filter won't let me post a URL.

    Do a google search for
    sandy bridge Real World Technologies
    and look at the main article that comes up.
  • SocketF - Friday, June 1, 2012 - link

    It is already planned, AMD has a patent for sth like that, google for "Redirect Recovery Cache". Dresdenboy found it already back in 2009:

    http://citavia.blog.de/2009/10/02/return-of-the-tr...

    The BIG Question is:
    Why did AMD not implement it yet?

    My guess is that they were already very busy with the whole CMT approach. Maybe Streamroller will bring it, there are some credible rumors in that direction.
  • yuri69 - Wednesday, May 30, 2012 - link

    Howdy,
    FOA thanks for the effort to investigate the shortcomings of this march :)

    Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures. ' This gives us 12 stages on K8/K10 => 12 * 1.25 = 15.

    Btw all the major and significant architectural improvements & features for the upcoming BD successor line were set in stone long time ago. Remember, it takes 4-5 years for a general purpose CPU from the initial draft to mass availability. The stage when you can move and bend stuff seems to be around half of this period.
  • BenchPress - Wednesday, May 30, 2012 - link

    "This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."

    This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
  • Arnulf - Wednesday, May 30, 2012 - link

    Ugh, so much crap in a single article ... this should never have been posted on AT.

    You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
  • erikvanvelzen - Wednesday, May 30, 2012 - link

    Yes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.

    Don't fool yourself.
  • Homeles - Wednesday, May 30, 2012 - link

    It's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really?

Log in

Don't have an account? Sign up now