Magny-Cours

You have probably heard by now that the new Opteron 6100 is in fact two six-core Istanbul CPUs bolted together. That is not too far from the truth if you look at the microarchitecture: little has changed inside the core. It is the "uncore" that has changed significantly: the memory controller now supports DDR3-1333, and a lot of time has been invested in keeping cache coherency traffic under control. The 1944-pin (!) organic Land Grid Array (LGA) Multi-Chip Module (MCM) is pictured below.

The red lines are memory channels, the blue lines internal cache coherent HyperTransport (HT) connects. The gray lines are external cache coherent HT connections, while the green line is a simple non-coherent I/O HT connect.

Each CPU has two DDR3 channels (red lines). That is exactly the strongest point of this MCM: four fast memory channels that can use DDR3-1333, good for a theoretical peak bandwidth of 42.7 GB/s. But that kind of bandwidth is not attainable, not even in theory, because the next link in the chain, the Northbridge, only runs at 1.8 GHz. We have two 64-bit Northbridges, both working at 1.8 GHz, limiting the maximum bandwidth to 28.8 GB/s. That is the price AMD's engineers had to pay to keep the maximum power consumption of a 45 nm 2.2 GHz chip below 115 W (TDP).
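A quick back-of-the-envelope check of those two figures, assuming 64-bit (8-byte) DDR3 channels and 64-bit Northbridge data paths as described above:

```python
# Sanity check of the bandwidth figures: four DDR3-1333 channels vs.
# two 64-bit Northbridges at 1.8 GHz (the actual ceiling).

channels = 4
dram_channel_gbps = 1333e6 * 8 / 1e9       # 10.664 GB/s per DDR3-1333 channel
dram_peak = channels * dram_channel_gbps   # ~42.7 GB/s theoretical DRAM peak

northbridges = 2
nb_gbps = 1.8e9 * 8 / 1e9                  # 14.4 GB/s per 64-bit 1.8 GHz NB
nb_peak = northbridges * nb_gbps           # 28.8 GB/s usable ceiling

print(f"DRAM peak: {dram_peak:.1f} GB/s, Northbridge ceiling: {nb_peak:.1f} GB/s")
```

So roughly a third of the theoretical DRAM bandwidth is sacrificed to the slower Northbridge clock.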

Adding more cores makes the amount of snoop traffic explode, which can easily result in very poor scaling. It can get so bad that extra cores actually reduce performance. The key technology to avoid this is HT Assist, which we described here. By eliminating unnecessary probes, local memory latency is significantly reduced and bandwidth is saved. It costs Magny-Cours 1 MB of L3 cache per die (2 MB in total), but bandwidth increases by 100% (!) and latency drops to 60% of what it would be without HT Assist.
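The idea behind HT Assist is a probe filter: the home node spends part of its L3 on a directory of which node caches each line, so a request probes only the node that actually holds the line instead of broadcasting to everyone. A minimal sketch of that mechanism (the data structures and numbers here are illustrative, not AMD's actual implementation):

```python
# Toy probe filter: compare broadcast snooping with directory-directed snoops.

NODES = 4

def snoops_without_filter(requester):
    # Classic broadcast coherency: probe every other node on every request.
    return [n for n in range(NODES) if n != requester]

def snoops_with_filter(directory, line, requester):
    # Directed probe: consult the home node's directory first.
    owner = directory.get(line)
    if owner is None or owner == requester:
        return []          # line is cached nowhere else: zero probes needed
    return [owner]         # probe exactly one node

directory = {0x1000: 2}    # the directory says node 2 caches line 0x1000

print(snoops_without_filter(0))                   # [1, 2, 3]
print(snoops_with_filter(directory, 0x1000, 0))   # [2]
print(snoops_with_filter(directory, 0x2000, 0))   # []
```

The common case (a line not cached remotely) goes from three probes to zero, which is where the latency and bandwidth savings come from.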

Even with HT Assist, a lot of probe activity is going on. As HT Assist allows the cores to perform directed snoops, it pays to be able to reach each node quickly. Ideally, each Magny-Cours MCM would have six HT3 ports: one for I/O with a chipset, two per CPU node to communicate with the nodes that are off-package, and two to communicate very quickly between the CPU nodes inside the package. But at 1944 pins, Magny-Cours probably already blew the pin budget, so AMD's engineers limited themselves to four HT links.

One of the links is reserved for non-coherent communication with a possible x16 GPU. One x16 coherent port communicates with the CPU node that is closest, but not on the same package. One port is split into two x8 ports. The first x8 port communicates with the CPU node that is farthest away: for example, between CPU node 0 and CPU node 3. The remaining x16 and x8 ports are used to make communication on the MCM as fast as possible: those 24 lanes connect the two CPU nodes on the package.
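Putting the link split together for a 2P (four-node) system gives a simple lane map. The sketch below encodes the widths described above (node numbering is our assumption: nodes 0 and 1 on one package, 2 and 3 on the other):

```python
# Hypothetical lane map for a 2P Magny-Cours system: 24 lanes (x16 + x8)
# between the two dies on a package, x16 to the nearest off-package die,
# and x8 across the diagonal to the farthest die.

lanes = {
    frozenset({0, 1}): 24,   # same package (x16 + x8)
    frozenset({2, 3}): 24,   # same package (x16 + x8)
    frozenset({0, 2}): 16,   # nearest off-package neighbour
    frozenset({1, 3}): 16,
    frozenset({0, 3}): 8,    # diagonal: farthest node
    frozenset({1, 2}): 8,
}

def width(a, b):
    """Link width in lanes between two CPU nodes."""
    return lanes[frozenset({a, b})]

# Every pair of nodes has a direct one-hop link, and the diagonal link
# has half the lanes of the nearest off-package link.
print(width(0, 2) / width(0, 3))   # 2.0
```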

 

The end result is that a 2P configuration allows fast communication between the four CPU nodes: each CPU node is connected directly (one hop) to every other one. Bandwidth between CPU node 0 and node 2 is twice that between node 0 and node 3, however.

While it looks like two Istanbuls bolted together, what we're looking at is the hard work of AMD's engineers. They invested quite a bit of time to make sure that this 12-piston muscle car does not spin its wheels all the time. Of course, if the road is wet (badly threaded software), that will still be the case. And that'll be the end of our car analogies... we promise :)

58 Comments

  • wolfman3k5 - Monday, March 29, 2010 - link

Great review, thanks! When will you guys be reviewing the AMD Phenom II X6 for us mere mortals? I wonder how the Phenom II X6 will stack up against the Core i7 920/930.

    Keep up the good work!
  • ash9 - Tuesday, March 30, 2010 - link

Since SSE4.1 and SSE4.2 are not in AMD's CPUs, it's AnandTech's way of getting an easy benchmark win, seeing as some of these benchmark tests probably use them -

    http://blogs.zdnet.com/Ou/?p=719
    August 31st, 2007
    SSE extension wars heat up between Intel and AMD

    "Microprocessors take approximately five years to go from concept to product and there is no way Intel can add SSE5 to their Nehalem product and AMD can’t add SSE4 to their first-generation 45nm CPU “Shanghai” or their second-generation 45nm “Bulldozer” CPU even if they wanted to. AMD has stated that they will implement SSE4 following the introduction of SSE5 but declined to give a timeline for when this will happen."

    asH
  • mariush - Tuesday, March 30, 2010 - link

    One of the best optimized and multi threaded applications out there is the open source video encoder x264.

    Would it be possible to test how well 2 x 8 and 2x12 amd configurations work at encoding 1080p video at some very high quality settings?

A workstation with 24 cores from AMD would cost almost as much as a single-socket six-core system from Intel, so it would be interesting to see whether the increase in frequency and the additional SSE instructions would be more of an advantage than the number of cores.
  • Aclough - Tuesday, March 30, 2010 - link

I wonder if the difference between the Windows and Linux test results is related to the recentish changes in the scheduler? From what I understand, the introduction of the CFS in 2.6.23 was supposed to be really good for large numbers of cores, and I'm given to understand that before that the Linux scheduler worked similarly to the recent Windows one. It would be interesting to try running that benchmark with a 2.6.22 kernel, or one with the old O(1) scheduler patched in.

    Or it could just be that Linux tends to be more tuned for throughput whereas Windows tends to be more tuned for low latency. Or both.
  • Aclough - Tuesday, March 30, 2010 - link

    In any event, the place I work for is a Linux shop and our workload is probably most similar to Blender, so we're probably going to continue to buy AMD.
  • ash9 - Tuesday, March 30, 2010 - link

    http://www.egenera.com/pdf/oracle_benchmarks.pdf


    "Performance testing on the Egenera BladeFrame system has demonstrated that the platform
    is capable of delivering high throughput from multiple servers using Oracle Real Application
    Clusters (RAC) database software. Analysis using Oracle’s Swingbench demonstration tool
    and the Calling Circle schema has shown very high transactions-per-minute performance
    from single-node implementations with dual-core, 4-socket SMP servers based on Intel and
    AMD architectures running a 64-bit-extension Linux operating system. Furthermore, results
    demonstrated 92 percent scalability on either server type up to at least 10 servers.
    The BladeFrame’s architecture naturally provides a host of benefits over other platforms
    in terms of manageability, server consolidation and high availability for Oracle RAC."
  • nexox - Tuesday, March 30, 2010 - link

    It could also be that Linux has a NUMA-aware scheduler, so it'd try to keep data stored in ram which is connected to the core that's running the thread which needs to access the data. I probably didn't explain that too well, but it'd cut down on memory latency because it would minimize going out over the HT links to fetch data. I doubt that Windows does this, given that Intel hasn't had NUMA systems for very long yet.

    I sort of like to see more Linux benchmarks, since that's really all I'd ever consider running on data center-class hardware like this, and since apparently Linux performance has very little to do with Windows performance, based on that one test.
  • yasbane - Wednesday, May 19, 2010 - link

    Agreed. I do find it disappointing that they put so few benchmarks for Linux for servers, and so many for windows.

    -C
  • jbsturgeon - Tuesday, March 30, 2010 - link

I like the review and enjoyed reading it. I can't help but feel the benchmarks are less a comparison of CPUs and more a study of how well the apps can be threaded, as well as of the implementation of that threading -- higher-clocked CPUs will be better for serial code and more cores will win for apps that are well threaded. In scientific number crunching (the code I write), more cores always wins (AMD). We use Fluent too, so thanks for including those benchmarks!!
  • jbsturgeon - Tuesday, March 30, 2010 - link

    Obviously that rule can be altered by a killer memory bus :-).
