The Opteron 6276: a closer look
by Johan De Gelas on February 9, 2012 6:00 AM EST
Posted in: IT Computing, CPUs, Bulldozer, AMD, Opteron, Cloud Computing, Interlagos
Threading Tricks or Not?
AMD has claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT). We wanted to check this, so for our next tests we ran with CMT and HTT both disabled and enabled. CMT was disabled in the Supermicro BIOS Setup.
First, we look at raw throughput (TP in the table). All measurements were done with the "High Performance" power policy.
Concurrency | CMT (q/s) | No CMT (q/s) | TP Increase CMT vs. No CMT | HTT (q/s) | No HTT (q/s) | TP Increase HTT vs. No HTT |
25 | 24 | 24 | 100% | 24 | 25 | 100% |
40 | 39 | 39 | 100% | 39 | 39 | 100% |
80 | 77 | 77 | 100% | 78 | 78 | 100% |
100 | 96 | 96 | 100% | 97 | 98 | 100% |
125 | 120 | 118 | 101% | 122 | 122 | 100% |
200 | 189 | 183 | 103% | 193 | 192 | 100% |
300 | 275 | 252 | 109% | 282 | 278 | 102% |
350 | 312 | 269 | 116% | 321 | 315 | 102% |
400 | 344 | 276 | 124% | 350 | 339 | 103% |
500 | 380 | 281 | 135% | 392 | 367 | 107% |
600 | 390 | 286 | 136% | 402 | 372 | 108% |
800 | 389 | 285 | 137% | 405 | 379 | 107% |
Only at 300 concurrent users (or queries per second) do the CPUs start to get close to their maximum throughput (around 400 q/s). That is the point where the multi-threading technologies start to pay off.
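As a rough sketch, the two percentage columns can be recomputed from the throughput figures in the table. The names `THROUGHPUT`, `relative_throughput`, and `gains` are made up for this illustration, and because the table values are rounded, the results can differ from the published column by about a percentage point.

```python
# Throughput in queries/s, copied from the table above (rounded values).
# Each row: concurrency -> (CMT, no CMT, HTT, no HTT)
THROUGHPUT = {
    25: (24, 24, 24, 25),
    40: (39, 39, 39, 39),
    80: (77, 77, 78, 78),
    100: (96, 96, 97, 98),
    125: (120, 118, 122, 122),
    200: (189, 183, 193, 192),
    300: (275, 252, 282, 278),
    350: (312, 269, 321, 315),
    400: (344, 276, 350, 339),
    500: (380, 281, 392, 367),
    600: (390, 286, 402, 372),
    800: (389, 285, 405, 379),
}


def relative_throughput(enabled, disabled):
    """Throughput with the feature enabled, as a percentage of the disabled case."""
    return 100.0 * enabled / disabled


def gains(table):
    """Return {concurrency: (CMT %, HTT %)} for every row of the table."""
    return {
        users: (relative_throughput(cmt, no_cmt), relative_throughput(htt, no_htt))
        for users, (cmt, no_cmt, htt, no_htt) in table.items()
    }


# Print one line per concurrency level; 100% means no change.
for users, (cmt_pct, htt_pct) in gains(THROUGHPUT).items():
    print(f"{users:>4} users: CMT {cmt_pct:6.1f}%  HTT {htt_pct:6.1f}%")
```

Both ratios stay at or near 100% at low concurrency, and the gap only opens once the CPUs approach saturation.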
It is interesting to note that the average IPC of one MS SQL Server thread is about 0.95-1.0 (measured with Intel VTune). That is low enough to leave quite a few execution slots unused in the Xeon, which is ideal for Hyper-Threading. However, Hyper-Threading is only capable of delivering a 3-8% performance boost.
On the AMD Opteron we measured an IPC of 0.72-0.8 (measured with AMD CodeAnalyst). That should also be more than low enough to allow two threads to pass through the shared front-end without obstructing each other. While it is not earth-shattering, CMT does not disappoint: we measure a very solid 24-37% increase in throughput. Now let's look at the response times (RT in the table).
Concurrency | CMT (ms) | No CMT (ms) | RT Increase CMT vs. No CMT | HTT (ms) | No HTT (ms) | RT Increase HTT vs. No HTT |
25 | 29 | 28.5 | 2%* | 20.4 | 18.9 | 8%* |
40 | 31.1 | 32.1 | -3%* | 21.7 | 20.3 | 7%* |
80 | 36 | 39 | -9%* | 24 | 23 | 2%* |
100 | 39 | 46 | -14% | 28 | 25 | 13% |
125 | 46 | 57 | -20% | 28 | 28 | 0% |
200 | 59 | 92 | -35% | 38 | 40 | -4% |
300 | 92 | 189 | -51% | 62 | 79 | -21% |
350 | 121 | 303 | -60% | 91 | 112 | -19% |
400 | 164 | 452 | -64% | 143 | 182 | -21% |
500 | 320 | 788 | -59% | 278 | 335 | -17% |
600 | 545 | 1111 | -51% | 498 | 621 | -20% |
800 | 1003 | 1825 | -45% | 989 | 1120 | -12% |
* Difference between results is within error margin and thus unreliable.
The SQL Server engine shows excellent scaling and is ideal for CMT and Hyper-Threading. CMT seems to reduce the response time even at low loads. This is not the case for Hyper-Threading, but we must be careful when interpreting the results. At the lower concurrencies, the measured response times are so small that the differences fall within the error margin. A 21.7 ms response time is indeed 7% more than a 20.3 ms response time, but the error margin of these measurements is much higher at very low concurrencies than at the higher ones, so take these percentages with a grain of salt.
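The response time percentages can be checked the same way. The sketch below assumes the "RT Increase" column is the relative change of the enabled case against the disabled case. The names `RESPONSE_MS`, `WITHIN_ERROR_MARGIN`, and `percent_change` are illustrative only. The inputs are the rounded values from the table, so a few rows land a point away from the published figures.

```python
# Response times in ms, copied from the table above (rounded values).
# Each row: concurrency -> (CMT, no CMT, HTT, no HTT)
RESPONSE_MS = {
    25: (29, 28.5, 20.4, 18.9),
    40: (31.1, 32.1, 21.7, 20.3),
    80: (36, 39, 24, 23),
    100: (39, 46, 28, 25),
    125: (46, 57, 28, 28),
    200: (59, 92, 38, 40),
    300: (92, 189, 62, 79),
    350: (121, 303, 91, 112),
    400: (164, 452, 143, 182),
    500: (320, 788, 278, 335),
    600: (545, 1111, 498, 621),
    800: (1003, 1825, 989, 1120),
}

# Rows the table marks with an asterisk (difference within the error margin).
WITHIN_ERROR_MARGIN = {25, 40, 80}


def percent_change(enabled, disabled):
    """Relative change in response time; negative means the feature helps."""
    return 100.0 * (enabled - disabled) / disabled


for users, (cmt, no_cmt, htt, no_htt) in RESPONSE_MS.items():
    flag = " (within error margin)" if users in WITHIN_ERROR_MARGIN else ""
    print(
        f"{users:>4} users: CMT {percent_change(cmt, no_cmt):+6.1f}%  "
        f"HTT {percent_change(htt, no_htt):+6.1f}%{flag}"
    )
```

Keeping the flagged rows in a separate set makes it easy to exclude them before drawing conclusions from the low-concurrency end of the table.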
What we can say is that Hyper-Threading only starts to reduce the response times when the CPU goes beyond 50% load. CMT reduces the response times much more than HTT, but the non-CMT response times are already roughly 1.5 to 2.7 times as high as the non-HTT response times.
In the end, both multi-threading technologies improve performance. CMT seems to be quite a bit more efficient than SMT; however, it must be said that the Xeon with HTT disabled already offers response times that are much lower than the Opteron without CMT. So you could also argue the other way around: the Xeon already does a very good job of filling its pipelines (IPC of 1 versus 0.72), and there is less headroom available.
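To put a number on how far apart the two platforms are before any multi-threading is enabled, one can divide the Opteron's no-CMT response time by the Xeon's no-HTT response time at each concurrency level. This is a minimal sketch using the rounded values from the response time table; `NO_MT_RESPONSE_MS` and `opteron_to_xeon_ratio` are hypothetical names chosen for the example.

```python
# Response times in ms with multi-threading disabled, from the table above.
# Each row: concurrency -> (Opteron without CMT, Xeon without HTT)
NO_MT_RESPONSE_MS = {
    25: (28.5, 18.9),
    40: (32.1, 20.3),
    80: (39, 23),
    100: (46, 25),
    125: (57, 28),
    200: (92, 40),
    300: (189, 79),
    350: (303, 112),
    400: (452, 182),
    500: (788, 335),
    600: (1111, 621),
    800: (1825, 1120),
}


def opteron_to_xeon_ratio(table):
    """Return {concurrency: Opteron response time / Xeon response time}."""
    return {users: opteron / xeon for users, (opteron, xeon) in table.items()}


ratios = opteron_to_xeon_ratio(NO_MT_RESPONSE_MS)
for users, ratio in ratios.items():
    print(f"{users:>4} users: Opteron (no CMT) is {ratio:.2f}x the Xeon (no HTT)")
```

With these inputs the ratio is smallest at 25 concurrent users and largest at 350, which is why the larger percentage gain from CMT has to be read against a slower baseline.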
46 Comments
Jaguar36 - Thursday, February 9, 2012
I too would love to see more HPC related benchmarks. Finite Element Analysis (FEA) or Computational Fluid Dynamic (CFD) programs scale very well with increased core count, and are something that is highly CPU dependent. I've found it very difficult to find good performance information for CPUs under this load.
I'd be happy to help out developing some benchmark problems if need be.
dcollins - Thursday, February 9, 2012
These would indeed be interesting benchmarks to see. These workloads are very floating point heavy so I imagine that the new Opterons will perform poorly. 16 modules won't matter when they only have 8 FPUs. Of course, I am speculating here.
Going forward, these types of workloads should be moving toward GPUs rather than CPUs, but I understand the burden of legacy software.
silverblue - Friday, February 10, 2012
They have 8 FPUs capable of 16x 128-bit or 8x 256-bit instructions per clock. On that level, it shouldn't be at a disadvantage.
bnolsen - Sunday, February 12, 2012
GPUs are pretty poor for general purpose HPC. If someone wants to fork out tons of $$$ to hack their problem onto a gpu (or they get lucky and somehow their problem fits a gpu well) that's fine but not really smart considering how short release cycles are, etc.
I have access to a quad socket magny cours built mid last year. In december I put together a sandy-e 3930k portable demo system. Needless to say the 3930k had at least 10% more throughput on heavy processing tasks (enabling all intel sse dropped in another 15%). It also handily beat our dual xeon nehalem development system as well. With mixed IO and cpu heavy loads the advantage dropped but was still there.
I'd love to be able to test these new amds just to see but its been much easier telling customers to stick with intel, especially with this new amd cpu.
MySchizoBuddy - Friday, March 9, 2012
"GPUs are pretty poor for general purpose HPC."
tell that to the #2, #4 and #5 most powerful supercomputers in the world. I'm sure no one told them.
hooflung - Thursday, February 9, 2012
I think I'd rather see some benchmarks based around Java EE6 and an appropriate container such as Jboss AS 7. I'd also like to see some Java 7 application benchmarks ( server oriented ).
I'd also like to see some custom Java benchmarks using Akka library so we can see some Software transactional memory benchmarks. Possibly a node.js benchmark as well to see if these new technologies can scale.
What I've seen here is that the enterprise circa 2006 has a love hate relationship with AMD. I'd also like to see some benchmarks of the Intel vs AMD vs SPARC T4 in both virtualized and non virtualized J2EE environments. But this article does have some really interesting data.
jibberegg - Thursday, February 9, 2012
Thanks for the great and informative article! Minor typo for you...
"Using a PDU for accurate power measurements might same pretty insane"
should be
"Using a PDU for accurate power measurements might seem pretty insane"
phoenix_rizzen - Thursday, February 9, 2012
MySQL has to be the absolute worst possible choice for testing multi-core CPUs (as evidenced in this review). It just doesn't scale beyond 4-8 cores, depending on CPU choice and MySQL version.
A much better choice for "alternative SQL database" would be PostgreSQL. That at least scales to 32 cores (possibly more, but I've never seen a benchmark beyond 32). Not to mention it's a much better RDBMS than MySQL.
MySQL really is only a toy. The fact that many large websites run on top of MySQL doesn't change that fact.
PixyMisa - Friday, February 10, 2012
This is a very good point. While it can be done, it's very fiddly to get MySQL to scale to many CPUs, much simpler to just shard the database and run multiple instances of MySQL. (And replication is single-threaded anyway, so if you manage to get one MySQL instance running with very high inserts/updates, you'll find replication can't keep up.)
Same goes for MongoDB and, of course, Redis, which is single-threaded.
We have ten large Opteron servers running CentOS 6, five 32-core and five 48-core, and all our applications are sharded and virtualised at a point where the individual nodes still have room to scale. Since our applications are too large to run un-sharded anyway, and the e7 Xeons cost an absolute fortune, the Opteron was the way to go.
The only back-end software we've found that scales smoothly to large numbers of CPUs is written in Erlang - RabbitMQ, CouchDB, and Riak. We love RabbitMQ and use it everywhere; unfortunately, while CouchDB and Riak scale very nicely, they start out pretty darn slow.
We actually ran a couple of 40-core e7 Xeon systems for a few months, and they had some pretty serious performance problems for certain workloads too - where the same workload worked fine on either a dual X5670 or a quad Opteron. Working out why things don't scale is often more work than just fixing them so that they do; sometimes the only practical thing to do is know what platform works for what workload, and use the right hardware for the task at hand.
Having said all that, the MySQL results are still disappointing.
JohanAnandtech - Friday, February 10, 2012
"It just doesn't scale beyond 4-8 cores, depending on CPU choice and MySQL version."
You missed something: it does scale beyond 12 Xeon cores, and I estimate that scaling won't be bad until you go beyond 24 cores. I don't see why the current implementation of MySQL should be called a toy.
PostgreSQL: interesting, several readers have told me this too. I hope it is true, because the last time we tested it, PostgreSQL was worse than the current MySQL.