Threading Tricks or Not?

AMD has claimed more than once that Clustered Multi-Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi-Threading (SMT), aka Hyper-Threading (HTT). We wanted to check this, so for our next tests we ran with CMT and HTT alternately enabled and disabled; CMT is toggled in the Supermicro BIOS setup.
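A quick way to confirm from within Windows that the BIOS change actually took effect is to check how many logical processors the OS enumerates: disabling either HTT or CMT should halve that number. The small sketch below is our own illustration (Python plus the standard wmic utility), not part of the benchmark itself.

```
# Sanity check after toggling CMT / Hyper-Threading in the BIOS: the number of
# logical processors the OS reports should halve when either feature is
# disabled (e.g. 24 -> 12 on a dual Xeon X5650, 32 -> 16 on a dual
# Opteron 6276). Assumes a Windows host with the standard wmic utility.
import multiprocessing
import subprocess

print("Logical processors seen by the OS: %d" % multiprocessing.cpu_count())

# Per-socket breakdown straight from WMI (header row plus one line per socket)
print(subprocess.check_output(
    ["wmic", "cpu", "get", "NumberOfCores,NumberOfLogicalProcessors"],
    universal_newlines=True))
```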

First, we look at raw throughput (TP in the table, expressed in queries per second). All measurements were done with the "High Performance" power policy.

Concurrency    CMT   No CMT   TP increase          HTT   No HTT   TP increase
                              (CMT vs. No CMT)                    (HTT vs. No HTT)
25              24       24   100%                  24       25   100%
40              39       39   100%                  39       39   100%
80              77       77   100%                  78       78   100%
100             96       96   100%                  97       98   100%
125            120      118   101%                 122      122   100%
200            189      183   103%                 193      192   100%
300            275      252   109%                 282      278   102%
350            312      269   116%                 321      315   102%
400            344      276   124%                 350      339   103%
500            380      281   135%                 392      367   107%
600            390      286   136%                 402      372   108%
800            389      285   137%                 405      379   107%

Only at 300 concurrent users (or queries per second) do the CPUs start to get close to their maximum throughput (around 400 q/s), and that is roughly the point where the multi-threading technologies start to pay off.
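For readers who want a feel for how such a concurrency sweep is driven: conceptually it is nothing more than N worker threads firing queries in a closed loop while we count completed queries and record their latencies. The sketch below is a minimal, hypothetical illustration; run_query() is a placeholder we made up for this example, not our actual benchmark harness or database query.

```
# Minimal sketch of a closed-loop concurrency sweep: N workers each fire
# queries back-to-back for a fixed window; throughput is completed queries
# per second and RT is the mean per-query latency. run_query() is a
# hypothetical placeholder for a real MS SQL Server query.
import threading
import time

def run_query():
    time.sleep(0.02)   # stand-in for a ~20 ms database query

def sweep(concurrency, duration=30.0):
    lock = threading.Lock()
    latencies = []     # per-query latencies in seconds

    def worker(stop_at):
        while time.time() < stop_at:
            start = time.time()
            run_query()
            with lock:
                latencies.append(time.time() - start)

    stop_at = time.time() + duration
    threads = [threading.Thread(target=worker, args=(stop_at,))
               for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    throughput = len(latencies) / duration             # queries per second
    avg_rt_ms = 1000.0 * sum(latencies) / len(latencies)
    return throughput, avg_rt_ms

if __name__ == "__main__":
    for users in (25, 100, 300, 600):
        tp, rt = sweep(users, duration=10.0)
        print("%4d users: %6.1f q/s, %7.1f ms avg RT" % (users, tp, rt))
```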

It is interesting to note that the average IPC of one MS SQL Server thread is about 0.95-1.0 (measured with Intel VTune). That is low enough to leave quite a few unused execution slots in the Xeon, which is ideal for Hyper-Threading. However, Hyper-Threading only manages a 3-8% performance boost.
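To make the "unused execution slots" argument a bit more tangible, here is a back-of-the-envelope calculation. It assumes a roughly 4-wide core, which is our simplification for the sake of illustration rather than a measured figure.

```
# Back-of-the-envelope: fraction of issue slots left idle at a given IPC,
# assuming a 4-wide core (an assumed width, purely illustrative).
CORE_WIDTH = 4

def idle_slot_fraction(ipc, width=CORE_WIDTH):
    return 1.0 - ipc / width

for label, ipc in (("Xeon, SQL Server thread", 1.0),
                   ("Opteron, SQL Server thread", 0.75)):
    print("%-28s IPC %.2f -> %2.0f%% of issue slots idle"
          % (label, ipc, 100 * idle_slot_fraction(ipc)))
```

Of course, idle issue slots are only part of the story: two threads on one core also compete for caches and memory bandwidth, which helps explain why the Hyper-Threading gains stay in the single digits.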

On the AMD Opteron we measured an IPC of 0.72-0.8 (with AMD CodeAnalyst). That should also be more than low enough to allow two threads to pass through the shared front end without obstructing each other. While it is not earth-shattering, CMT does not disappoint: we measured a very solid 24-37% increase in throughput. Now let's look at the response times (RT in the table, in milliseconds).

Concurrency    CMT   No CMT   RT increase          HTT   No HTT   RT increase
              (ms)     (ms)   (CMT vs. No CMT)    (ms)     (ms)   (HTT vs. No HTT)
25              29     28.5   2%*                 20.4     18.9   8%*
40            31.1     32.1   -3%*                21.7     20.3   7%*
80              36       39   -9%*                  24       23   2%*
100             39       46   -14%                  28       25   13%
125             46       57   -20%                  28       28   0%
200             59       92   -35%                  38       40   -4%
300             92      189   -51%                  62       79   -21%
350            121      303   -60%                  91      112   -19%
400            164      452   -64%                 143      182   -21%
500            320      788   -59%                 278      335   -17%
600            545     1111   -51%                 498      621   -20%
800           1003     1825   -45%                 989     1120   -12%

* Difference between results is within error margin and thus unreliable.

Microsoft's SQL Server engine shows excellent scaling and is ideal for CMT and Hyper-Threading. CMT seems to reduce the response time even at low loads. This is not the case for Hyper-Threading, but we must be careful when interpreting these results. At the lower concurrencies, the measured response times are so small that the differences fall within the error margin. A 21.7 ms response time is indeed 7% higher than a 20.3 ms response time, but the error margin is much larger at these very low concurrencies than at the higher ones, so take those percentages with a grain of salt.
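If you want to apply that same caution systematically, a simple rule of thumb is to flag any delta that is smaller than the measurement noise. The snippet below illustrates the idea with an assumed noise floor of 2 ms, a number we picked purely for illustration.

```
# Flag response-time deltas that fall inside an assumed measurement noise
# floor (the 2 ms figure is an illustrative assumption, not a measured value).
NOISE_FLOOR_MS = 2.0

def rt_change(rt_on, rt_off, noise_ms=NOISE_FLOOR_MS):
    """Return (percent change, reliable?) for RT with/without the feature."""
    pct = 100.0 * (rt_on - rt_off) / rt_off
    reliable = abs(rt_on - rt_off) > noise_ms
    return pct, reliable

# Two data points from the table: HTT at 40 users, CMT at 300 users
for rt_on, rt_off in ((21.7, 20.3), (92.0, 189.0)):
    pct, ok = rt_change(rt_on, rt_off)
    print("%6.1f ms vs %6.1f ms: %+5.1f%% %s"
          % (rt_on, rt_off, pct, "" if ok else "(within error margin)"))
```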

What we can say is that Hyper-Threading only starts to reduce the response times when the CPU goes beyond 50% load. CMT reduces the response times much more than HTT, but the non-CMT response times are already twice (and more) as high as the non-HTT response times.

In the end, both multi-threading technologies improve performance. CMT seems to be quite a bit more efficient than SMT; however, it must be said that the Xeon with HTT disabled already offers response times that are much lower than the Opteron without CMT. So you could also argue the other way around: the Xeon already does a very good job of filling its pipelines (IPC of 1 versus 0.72), and there is less headroom available.

Comments

  • sonofgodfrey - Thursday, February 9, 2012 - link

    Have you explicitly tested one socket vs. two sockets? We've found an immense increase in contention once a cache-line has to be shared between sockets on some systems.
  • JohanAnandtech - Friday, February 10, 2012 - link

    That is one suggestion I will try out next week. Thanks!
  • Klimax - Thursday, February 9, 2012 - link

    Hello.

    Nice tests.

    However, I would like to see MySQL tested on Windows Server 2008 R2.
    It would be an interesting comparison.

    (Especially due to http://channel9.msdn.com/shows/Going+Deep/Arun-Kis... )
  • Klimax - Thursday, February 9, 2012 - link

    Title of post is wrong... (I have deleted second thing and forgot to fix title)
  • Scali - Thursday, February 9, 2012 - link

    Unless I'm mistaken, the Xeon 5650 is a 1.17B transistor chip, whereas the Interlagos 6276 is a 2.4B transistor chip.
    In that light, doesn't that make Intel's SMT implementation a lot better than CMT?
    I mean, yes, CMT may give more of a performance boost when you increase the thread count. But considering the fact that AMD spends more than twice the number of transistors on the chip... well, that's pretty obvious.
    AMD might as well just have used conventional cores.
    The true strength of SMT is not so much that it improves performance in multithreaded scenarios, but that it does so at virtually no extra cost in terms of transistors (and with little or no impact on the single-threaded performance either).
  • JohanAnandtech - Friday, February 10, 2012 - link

    Interlagos is a 1.2 billion transistor chip (maybe 1.3, but anyway). Most of those transistors are spent on the L3 cache: about 0.5 billion. Only 213 million transistors are in a module, and each module contains a 2 MB L2 cache, probably good for 120 million transistors. That leaves 90 million transistors for the core, and it has been stated that the second cluster added 12%. So that second cluster costs about 12 million transistors, or 48 million on the total 4-module die. That is less than 5% of the total transistor count, but you get a 30-90% performance boost!

    So for AMD, this was clearly a great choice.

    SMT is perfect for Intel, as the Intel architecture puts all instructions in one big ROB.

    For very low IPC server workloads, I think the CMT approach gives better results. Unfortunately, AMD lowered some of the CMT benefits by keeping the data cache so small and the associativity of the instruction cache so low.
  • Scali - Friday, February 10, 2012 - link

    Uhhh, I think you're wrong here... the 4-module Bulldozer is a 1.2B chip (Zambezi). But you tested the 8-module Interlagos (16 threads), which is TWO Zambezi dies in one package.
    Hence 2*1.2 = 2.4B transistors.
  • JohanAnandtech - Friday, February 10, 2012 - link

    Ok, it is two chips of 1.2 billion each. That doesn't change anything about our analysis of CMT.
  • Scali - Friday, February 10, 2012 - link

    Not in the article, because you did not factor in transistor count (which is the flaw I tried to point out in the first place... comparing two chips, where one has twice the transistor count of the other, is quite the apples-to-oranges comparison. One would expect a chip with twice the transistor count to be considerably better in multithreading scenarios, not 'catching up' to the smaller chip).

    But in your above post, I think it changes everything about your analysis. All your figures have to be done times two.
    Which makes it a very poor comparison, not only to Intel, but also to AMD's own previous line of CPUs.
    The 6174 Magny Cours is actually beating Interlagos, with 'only' 12 threads, no kind of CMT/SMT, and 'only' 1.8B transistors.

    How does that make CMT look like a great choice for AMD?
  • slycer.tech - Friday, February 10, 2012 - link

    From what I read on the benchmark configuration page, Anand used 2x Intel Xeon X5650. So 2x 1.17B = 2.34B. I think that is comparable to the AMD CPUs used in this test. Am I right?
