The Bulldozer Aftermath: Delving Even Deeper

Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

The Current Situation

It's not hard to explain why an 8-thread processor with slightly lower single-threaded performance does not do well in many desktop applications. If you compare, for example, the hex-core Core i7-3960X with the quad-core Core i7-3820, four games did not benefit from the extra two cores: Civilization V, Crysis, Dirt 3 and Metro 2033. In StarCraft II, World of Warcraft, and Dawn of War 2, the 50% higher core count was good for a 10% performance boost at best. In other words, the situation has improved, but most games don't scale well beyond four cores. There are also other factors at play, though: it's already known that StarCraft II doesn't use more than two cores, so it's likely the 15MB L3 cache (vs. 10MB in the i7-3820) that helps improve performance.
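
As a rough illustration of why a 50% higher core count can be worth only about 10%, the sketch below applies Amdahl's law to the four-core and six-core cases. It assumes both chips have identical clock speed and cache, which is not true of the i7-3960X and i7-3820, so the parallel fraction it derives is illustrative only. The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def gain_from_extra_cores(parallel_fraction: float, base: int, more: int) -> float:
    """Relative gain (0.10 == 10%) when going from `base` to `more` cores."""
    return amdahl_speedup(parallel_fraction, more) / amdahl_speedup(parallel_fraction, base) - 1.0


def implied_parallel_fraction(gain: float, base: int, more: int) -> float:
    """Invert Amdahl's law: the parallel fraction that yields `gain` from base -> more cores."""
    # Solve (1 - p + p/base) = (1 + gain) * (1 - p + p/more) for p.
    return gain / ((1.0 + gain) * (1.0 - 1.0 / more) - (1.0 - 1.0 / base))


if __name__ == "__main__":
    p = implied_parallel_fraction(0.10, base=4, more=6)
    print(f"Implied parallel fraction: {p:.2f}")
```

Under these simplifying assumptions, a 10% gain from four to six cores corresponds to only about 60% of the work running in parallel.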

The situation in the server space is a lot harder to explain. The Opteron 6100 was able to keep up, more or less, with the Xeon 5600 performance-wise. However, the Xeon 5600 was equipped with much better power management, and the Xeon won the performance/watt race in most applications, with the exception of HPC applications.

The Opteron 6200 added a bit of performance but sips much less power at low and medium load, so it was capable of offering a better performance per Watt ratio than its older brother. However, since the Xeon E5 came out, the situation has become pretty dramatic for the Opteron. One telling example is the fact that only one VMmark 2.0 result on the Opteron 6200 exists, and it has been withdrawn. Even if the reported 12.77 score is close to the truth, we need four AMD Opteron 6276 (2.3GHz) to beat the best dual Xeon E5 (2690 at 2.9GHz) by 15%.
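
The per-socket gap implied by that VMmark comparison can be worked out directly. This is a minimal sketch that takes the figures above at face value (four Opteron sockets ahead of two Xeon E5 sockets by 15%); the helper name is hypothetical.

```python
def per_socket_ratio(sockets_a: int, sockets_b: int, lead_of_a: float) -> float:
    """Per-socket throughput of system A relative to system B.

    `lead_of_a` is A's total-throughput lead over B (0.15 == 15%).
    """
    total_ratio = 1.0 + lead_of_a
    return total_ratio * sockets_b / sockets_a


ratio = per_socket_ratio(sockets_a=4, sockets_b=2, lead_of_a=0.15)
print(f"One Opteron socket delivers about {ratio:.1%} of one Xeon socket")
```

In other words, each Opteron socket would be doing a little under 60% of the work of each Xeon socket in that comparison.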

We have already shown quite a few benchmarks in two Opteron 6276 articles and one Xeon E5 review. We summarized the relevant numbers from those articles in the table below. In our opinion, these benchmarks are real-world and very relevant to the professional.

Software (importance in the market)	Benchmark	Opteron 6276 vs. Opteron 6174	Xeon E5-2660 vs. Opteron 6276
Virtualization (20-50%)	ESXi + Linux (vApusMark FOS)	+1%	+40%
OLAP Databases (10-15%)	MS SQL Server 2008 R2 (OLAP throughput)	-9%	+34%
HPC (5-7%)	LS-Dyna (Neon-Refined)	+21%	+26%
Rendering software (2-3%)	Cinebench	+2%	+37%
ERP	SAP	+18%	+13%

Now consider that all these applications are highly threaded and scale well. Despite the 33% higher integer core count, the Opteron 6276 is not able to meaningfully outperform the older Magny-Cours in the OLAP, virtualization and rendering benchmarks (-9%, +1% and +2%, respectively). However, the architecture is showing its promise by offering about 20% better performance in SAP and HPC applications.
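
To put these numbers on a per-core footing, the sketch below divides each Opteron 6276 vs. Opteron 6174 result by the 33% core-count increase (taken here as 16 integer cores vs. 12). It assumes throughput would otherwise scale linearly with core count, which is a simplification, and the names are illustrative.

```python
# Opteron 6276 vs. Opteron 6174 throughput change, as listed in the table.
gains = {
    "vApusMark FOS": 0.01,
    "MS SQL Server OLAP": -0.09,
    "LS-Dyna": 0.21,
    "Cinebench": 0.02,
    "SAP": 0.18,
}

CORE_RATIO = 16 / 12  # 33% more integer cores (assumed 16 vs. 12)


def per_core_change(total_gain: float, core_ratio: float = CORE_RATIO) -> float:
    """Change in throughput per core, given the total change and the core-count ratio."""
    return (1.0 + total_gain) / core_ratio - 1.0


for name, gain in gains.items():
    print(f"{name:20s} total {gain:+.0%}  per core {per_core_change(gain):+.1%}")
```

Under that assumption, per-core throughput is lower in every benchmark, from roughly 9% in LS-Dyna to roughly 32% in the OLAP test.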

What makes the Bulldozer cores fail in the OLAP benchmark and succeed in SAP? We now have some interesting profiling details on SAP as well as our OLAP benchmark, so we can delve deeper.

84 Comments

Schmide - Wednesday, May 30, 2012 - link
I do remember from some analysis that the L2 cache reads were as slow as main memory. It's great if you hit the L2 cache, but that's not going to buy you anything if it's that slow.

SocketF - Wednesday, May 30, 2012 - link
Impossible. You are probably mixing some things up, maybe latency and bandwidth?

Schmide - Wednesday, May 30, 2012 - link
Yup. It was late at night, and I was thinking of writes. The write-through L1 basically makes L1 writes the same as L2 writes.

Homeles - Wednesday, May 30, 2012 - link
Not even close. L2 is about 10 times faster than main memory.

http://www.anandtech.com/show/4955/the-bulldozer-r...

jcollake - Wednesday, May 30, 2012 - link
Through research here at Bitsum on the AMD Bulldozer platform (specifically the 9150), I found a couple of things of interest.

First, disabling CPU core parking seems to make a big difference in performance. I believe that by default the CPU core parking is just too aggressive. I wrote a tool to let you enable or disable CPU parking in *real time* without a reboot, so you can test this yourself. It is called ParkControl, http://bitsum.com/about_cpu_core_parking.php . For *me*, it seemed to make a night and day difference.

Second, I am working on a neat little benchmarking tool called ThreadRacer, currently only in alpha prototype. It allows you to really see the effects of these paired cores, and how much it matters that the scheduler is properly aware of them. Take this 1 second or so sample, as seen in the screenshot here (downloads available, but it is an early prototype that I'll quickly be finishing up): http://bitsum.com/forum/index.php/topic,1434.0.htm...

The scheduler update that Microsoft issued of course treats these paired cores as it would a hyper-threaded core. Indeed, the concept is very similar, except perhaps to avoid patents, AMD took the 'share a little' instead of 'share a lot' approach when it comes to shared computational resources. This was the proper way to *quickly* address the issue, but I believe the scheduler is still suboptimal on these processors (likely to be resolved in Windows 8 or a later update to Windows 7/Vista).

For Bulldozer, as you know, they are two real processors, but because they have shared dependencies, the performance can really be drained if the other processor in the 'pair' is busy. You can see the effects in ThreadRacer: the core whose pair was idle quickly outpaced the paired cores that were both busy.

jcollake - Wednesday, May 30, 2012 - link
I should have also mentioned that ThreadRacer also allows you to see how a single CPU-consuming thread gets swapped around to different cores (the multi-core thread in the utility). This is its other use. The less the thread gets swapped from core to core, the greater the performance will be. It is interesting to compare and contrast the behavior of the scheduler. I fully believe that most of the problems with Bulldozer are due to the Windows scheduler, something that could be tested by using Linux and replacing the scheduler with a custom one, or with an off-the-shelf alternative that may behave substantially differently from the Windows scheduler.

SocketF - Wednesday, May 30, 2012 - link
Some people running BOINC programs have reported that Windows applications run faster when they use Linux with WINE or a VM.

The Windows scheduler especially hurts AMD chips because of their huge exclusive caches. If a thread on an Intel CPU is switched to another core, it can load the warmed-up L2 portion from the L3, which is inclusive of the L2.

I did some Google searching, and it seems that under Linux each core has its own run queue, whereas on Windows there is only one run queue for all cores.

But I didn't delve into it deeply; there are so many different schedulers for Linux that it seems to be a complex issue ;-)

Btw. your link to download is off limits for non-members of your discussion board:
-------------------------
Warning!

The topic or board you are looking for appears to be either missing or off limits to you.
Please login below or register an account with Bitsum Forums.
----------------------------

Maybe you can upload it somewhere else?

jcollake - Saturday, September 1, 2012 - link
Sorry for the late reply. First, the forum permissions were fixed. Second, the utility (still in early stages) is included in Process Lasso *and* available here: http://bitsum.com/threadracer.php

eoerl - Wednesday, May 30, 2012 - link
Very interesting article; together with the hardware.fr report, there's a lot of information. One question though, if you read the comments: you didn't say much about the influence of compilers. This proved to change a lot of things on Linux (see Phoronix's extensive tests on both Ivy Bridge and Bulldozer, which vary the compiler and the compiler options), for example:
http://www.phoronix.com/scan.php?page=article&...
http://www.phoronix.com/scan.php?page=article&...
Benchmark results change a lot more with Bulldozer than with Ivy Bridge or Sandy Bridge. Do you think AMD lost out by being oversensitive to compiler optimisations, due to a very original architecture?

JohanAnandtech - Thursday, May 31, 2012 - link
I deliberately avoided the compiler issues, as this would make the article too convoluted. But notice that what we found is not influenced by compiler choice: we find the same indications in SAP and SQL Server (compiled with "conservative" compilers and compiler settings) as in SPEC CPU 2006, which uses the most aggressively optimized compilers and settings possible.
