The Opteron 6276: a closer look

by Johan De Gelas on February 9, 2012 6:00 AM EST

Threading Tricks or Not?

AMD claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT). We wanted to check this, so for our next tests we disabled and enabled CMT and HTT. Below you can see how we disabled CMT in the Supermicro BIOS Setup:

First, we look at raw throughput (TP in the table). All measurements were done with the "High Performance" power policy.

Concurrency	CMT (q/s)	No CMT (q/s)	TP CMT as % of No CMT	HTT (q/s)	No HTT (q/s)	TP HTT as % of No HTT
25	24	24	100%	24	25	100%
40	39	39	100%	39	39	100%
80	77	77	100%	78	78	100%
100	96	96	100%	97	98	100%
125	120	118	101%	122	122	100%
200	189	183	103%	193	192	100%
300	275	252	109%	282	278	102%
350	312	269	116%	321	315	102%
400	344	276	124%	350	339	103%
500	380	281	135%	392	367	107%
600	390	286	136%	402	372	108%
800	389	285	137%	405	379	107%

Only at 300 concurrent users (or queries per second) do the CPUs start to get close to their maximum throughput (around 400 q/s). That is roughly the point at which the multi-threading technologies start to pay off.
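To make the "start to pay off" point concrete, here is a minimal Python sketch that recomputes the throughput gain from the rounded values in the table above and looks for the first concurrency level at which the gain reaches a threshold. The function names and the 5% threshold are assumptions made for this illustration, not something the review defines.

```python
# Throughput in queries/s, copied from the table above.
# Each entry maps concurrency -> (multi-threading enabled, multi-threading disabled).
CMT_TP = {
    25: (24, 24), 40: (39, 39), 80: (77, 77), 100: (96, 96),
    125: (120, 118), 200: (189, 183), 300: (275, 252), 350: (312, 269),
    400: (344, 276), 500: (380, 281), 600: (390, 286), 800: (389, 285),
}
HTT_TP = {
    25: (24, 25), 40: (39, 39), 80: (78, 78), 100: (97, 98),
    125: (122, 122), 200: (193, 192), 300: (282, 278), 350: (321, 315),
    400: (350, 339), 500: (392, 367), 600: (402, 372), 800: (405, 379),
}


def gain_pct(enabled, disabled):
    """Throughput gain of the enabled run over the disabled run, in percent."""
    return 100.0 * (enabled / disabled - 1.0)


def first_payoff(table, threshold_pct=5.0):
    """Return the lowest concurrency whose gain reaches threshold_pct, or None.

    threshold_pct is a hypothetical cut-off chosen for this sketch.
    """
    for concurrency in sorted(table):
        enabled, disabled = table[concurrency]
        if gain_pct(enabled, disabled) >= threshold_pct:
            return concurrency
    return None


if __name__ == "__main__":
    for name, table in (("CMT", CMT_TP), ("HTT", HTT_TP)):
        print(name, "first reaches a 5% gain at concurrency", first_payoff(table))
```

With the rounded table values and the assumed 5% threshold, CMT crosses the line at a concurrency of 300 and HTT at 500. Because the table lists rounded throughput numbers, percentages recomputed this way can differ from the published column by about a point.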

It is interesting to note that the average IPC of one MS SQL Server thread is about 0.95-1.0 (measured with Intel VTune). That is low enough to leave quite a few execution slots unused in the Xeon, which is ideal for Hyper-Threading. However, Hyper-Threading delivers only a 3-8% performance boost.

On the AMD Opteron we measured an IPC of 0.72-0.8 (measured with AMD CodeAnalyst). That should also be more than low enough to allow two threads to pass through the shared front-end without obstructing each other. While it is not earth shattering, CMT does not disappoint: we measure a very solid 24-37% increase in throughput. Now let's look at the response times (RT in the table).

Concurrency	CMT (ms)	No CMT (ms)	RT Change (CMT vs. No CMT)	HTT (ms)	No HTT (ms)	RT Change (HTT vs. No HTT)
25	29	28.5	2%*	20.4	18.9	8%*
40	31.1	32.1	-3%*	21.7	20.3	7%*
80	36	39	-9%*	24	23	2%*
100	39	46	-14%	28	25	13%
125	46	57	-20%	28	28	0%
200	59	92	-35%	38	40	-4%
300	92	189	-51%	62	79	-21%
350	121	303	-60%	91	112	-19%
400	164	452	-64%	143	182	-21%
500	320	788	-59%	278	335	-17%
600	545	1111	-51%	498	621	-20%
800	1003	1825	-45%	989	1120	-12%

* Difference between results is within error margin and thus unreliable.

The SQL Server engine shows excellent scaling and is ideal for CMT and Hyper-Threading. CMT seems to reduce the response time even at low loads. This is not the case for Hyper-Threading, but we must be careful when interpreting the results. At the lower concurrencies, the measured response times are so small that the differences fall within the error margin. A 21.7 ms response time is indeed 7% more than a 20.3 ms response time, but the error margin of these measurements is much higher at these very low concurrencies than at the higher concurrencies, so take these percentages with a grain of salt.
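The arithmetic behind that caveat can be sketched in a few lines of Python. The first function reproduces the percentage in the response-time table; the second shows how a fixed amount of measurement noise weighs far more heavily on a short response time than on a long one. The 1 ms noise figure is purely an assumption for illustration, since the review does not quantify its error margin.

```python
def rt_change_pct(rt_enabled_ms, rt_disabled_ms):
    """Response-time change with multi-threading enabled, relative to disabled, in percent."""
    return 100.0 * (rt_enabled_ms - rt_disabled_ms) / rt_disabled_ms


def relative_noise_pct(rt_ms, noise_ms):
    """How large a fixed absolute noise (in ms) is relative to a response time, in percent."""
    return 100.0 * noise_ms / rt_ms


if __name__ == "__main__":
    ASSUMED_NOISE_MS = 1.0  # hypothetical value, not taken from the review

    # HTT at a concurrency of 40: 21.7 ms with HTT, 20.3 ms without.
    print("RT change at 40 users: %.1f%%" % rt_change_pct(21.7, 20.3))
    print("Noise share at 20.3 ms: %.1f%%" % relative_noise_pct(20.3, ASSUMED_NOISE_MS))

    # HTT at a concurrency of 800: 989 ms with HTT, 1120 ms without.
    print("RT change at 800 users: %.1f%%" % rt_change_pct(989, 1120))
    print("Noise share at 1120 ms: %.2f%%" % relative_noise_pct(1120, ASSUMED_NOISE_MS))
```

Under that assumed noise level, a 1 ms wobble is roughly 5% of a 20 ms measurement but less than 0.1% of a 1120 ms one, which is why the low-concurrency percentages deserve the asterisk.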

What we can say is that Hyper-Threading only starts to reduce the response times when the CPU goes beyond 50% load. CMT reduces the response times much more than HTT, but the non-CMT response times are already twice (and more) as high as the non-HTT response times.
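The baseline comparison can be checked directly against the response-time table. The following Python sketch divides the non-CMT response time by the non-HTT response time at each concurrency level; the dictionary and function names are hypothetical, and the inputs are the rounded values from the table.

```python
# Response times in ms with multi-threading disabled, copied from the table above.
NO_CMT_RT_MS = {
    25: 28.5, 40: 32.1, 80: 39, 100: 46, 125: 57, 200: 92,
    300: 189, 350: 303, 400: 452, 500: 788, 600: 1111, 800: 1825,
}
NO_HTT_RT_MS = {
    25: 18.9, 40: 20.3, 80: 23, 100: 25, 125: 28, 200: 40,
    300: 79, 350: 112, 400: 182, 500: 335, 600: 621, 800: 1120,
}


def baseline_ratio(concurrency):
    """Non-CMT Opteron response time divided by non-HTT Xeon response time."""
    return NO_CMT_RT_MS[concurrency] / NO_HTT_RT_MS[concurrency]


def levels_at_least(factor):
    """Concurrency levels where the non-CMT time is at least `factor` times the non-HTT time."""
    return [c for c in sorted(NO_CMT_RT_MS) if baseline_ratio(c) >= factor]


if __name__ == "__main__":
    for c in sorted(NO_CMT_RT_MS):
        print("%4d users: %.2fx" % (c, baseline_ratio(c)))
    print("At least 2x at:", levels_at_least(2.0))
```

With the rounded table values, the ratio is at or above 2 from 125 to 500 concurrent users and lies between roughly 1.5 and 1.8 at the lowest and highest concurrencies.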

In the end, both multi-threading technologies improve performance. CMT seems to be quite a bit more efficient than SMT; however, it must be said that the Xeon with HTT disabled already offers response times that are much lower than the Opteron without CMT. So you could also argue the other way around: the Xeon already does a very good job of filling its pipelines (IPC of 1 versus 0.72), and there is less headroom available.

46 Comments

Scali - Saturday, February 11, 2012 - link
"It also reduces throughput."

No, it improves throughput, assuming we are talking about the improvement from going from 1 physical core to 2 logical cores.
Clearly two logical cores (on the same physical core) have less throughput than two physical cores, but that's obvious since you only have half the hardware.

And that, together with the fact that Intel's SMT chips have far better single-threaded performance to begin with, results in very good performance per die area (you know, that thing that people used to praise AMD GPUs for).

"Yes, it does, via the implementation of all that shared hardware on the chip."

You can't say that, since there is no non-modular version of Bulldozer (just as there is no non-HT version of the Intel architectures).
However, if you compare a 4-core HT architecture with a non-HT architecture, be that a Core2 Quad or a Phenom X4, Intel's transistor count is clearly in the same ballpark, so HT does not add much in terms of transistor count.

With CMT we see little or no indication of reduced transistor count. AMD's 4-module chips are MUCH larger than regular 4-core chips have been. In fact, AMD's 4-module design is even larger than Intel's 6-core design with HT.

"Two different approaches to the same idea."

I disagree. SMT is a very different idea from CMT (which is a bogus marketing term invented by AMD anyway). CMT is more of a marketing excuse for not having proper SMT, and shows no merit in actual silicon.

"but I don't think we can label one as inherently better than the other yet."

Well clearly we disagree on that then.
I think SMT is clearly inherently better than CMT. SMT has far more flexible sharing of resources than AMD's half-baked approach.
And any theoretical disadvantages (fighting over resources and all that) can be put to bed with benchmarking such as in this review: the disadvantages may exist, but the net performance is unbeatable anyway. A midrange Xeon schools a CMT-based chip of twice the size.

Andexxx - Wednesday, February 15, 2012 - link
Well, there are a lot of factors affecting single-threaded performance in real life. So CMT indeed has its scaling advantages, as the tests suggest. At least most things are held constant when comparing CMT-on and CMT-off, which is not the case when comparing SMT and CMT across different implementations. Lack of single-threaded performance is not a valid reason to blame CMT.

If you want to *prove* CMT is half-baked marketing crap while SMT is the only solution, what you need is an SMT-based AMD BD monolithic core or a CMT-based Intel monolithic module for comparison.

As for transistor counts, well, that's a result of their choice of cache and uncore configuration. You can keep saying the 4-module chip is blahblahblah, but in some cases it beats a 4C8T Xeon chip. Transistor count does not matter much from the customer's viewpoint, only from the producer's. If you want to argue using GPU performance metrics: a GPU is a data-parallel processor with a bunch of logic units, while a CPU is a latency-sensitive girlfriend of caches. A large amount of cache can make your performance/mm^2 or performance/transistor look worse. So the trade-offs on the amount of cache should have been made before they started designing the chip.

Scali - Wednesday, February 15, 2012 - link
Well, one of the reasons why AMD's current CPUs have such poor single-threaded performance is because they moved from 3 ALUs per thread to 2 ALUs per thread.
This is part of the whole CMT design.
So in that sense, CMT can be blamed for the poor single-threaded performance at least.
And since single-threaded performance is so bad, it is only logical that scaling to more threads is relatively good.
On a CPU with faster single-threaded performance, you run into IO limits sooner (memory, disk etc), so it is more difficult to maintain similar scaling with increased thread count.

The strength of SMT is that Intel did not have to cut any ALUs when implementing HT. Pentium 4 Northwood with HT still had two double-pumped ALUs, like the non-HT Willamette that went before it.
Likewise, Core i7 still has 3 ALUs, like Core2.
Another strength of SMT is that even with one less ALU per 2 threads than CMT, it still reaches similar performance in multithreaded scenarios. CMT can not share these ALUs between threads, while SMT can.
Conclusion: CMT is nonsense.
For the full version, see: http://scalibq.wordpress.com/2012/02/14/the-myth-o...

slycer.tech - Monday, February 13, 2012 - link
If the Bulldozer arch is really bad, how about this?
http://www.marketwatch.com/story/amd-opterontm-620...
Can someone prove this award is a big lie?

duploxxx - Tuesday, February 14, 2012 - link
read the article, the baseline they use for price/performance is based on spec results....lots of companies still use these results to decide on a platform.

but then again, benchmarks don't always show the real-world value, or are even hard to compare, since many have in-house applications that don't scale or scale differently from the ones benchmarked in reviews. 90% of the datacenters don't even require more than any midrange cpu, anything above midrange is wasted money and both vendors provide more than adequate solutions to that. It's the human mind that is often blocking sanity. Investing that wasted money in other solutions often provides a better total performing solution.

anti_shill - Monday, April 2, 2012 - link
shill_detector by anti_shill on Monday, April 02, 2012
Here's a more accurate reflection of Bulldozer/ interlagos performance, untainted by intel ad bucks...

http://www.phoronix.com/scan.php?page=article&...

But if u really want to see what the true story is, have a look at AMD's stock price lately, and their server wins. They absolutely smoke intel on virtualization, and anything that requires a lot of threads. It's not even close.
