Multiprocessor Mecca

Whereas the K7 was architected to be AMD's flagship desktop processor, the K8 was designed from the start to be an excellent enterprise processor with versions eventually trickling down to the desktop. With this in mind, it makes perfect sense that the K8 core should be well suited for multiprocessor environments.

To understand how the K8 (and Opteron) are so perfectly tailored to MP environments you have to first understand the limitations in conventional architectures. The Athlon MP brought to light the first limitation of conventional MP Intel architectures - the shared FSB. Regardless of whether you had one, two or four processors, they all shared the same 64-bit wide FSB that connected the CPUs to the rest of the system. The obvious bottleneck here is that the more CPUs you have, the less FSB bandwidth each individual CPU gets.

AMD got around this limitation with the Athlon MP by giving each CPU its own 64-bit connection to the North Bridge, introducing the world's first point-to-point FSB protocol used on an x86 machine. AMD's approach was higher performing than Intel's, but at a much higher cost; the 760MP chipset was an extremely expensive chipset, and that's only for a 2-way setup. AMD never built a 4-way Athlon MP chipset, mainly because of a lack of demand, but it wouldn't have been very easy had there been tremendous demand for the solution.

With the K8, AMD took things one step further and offered an even higher performing MP solution that was also cheaply scalable courtesy of a technology AMD pioneered called Hyper Transport. Hyper Transport is a serial point-to-point bus that AMD uses to connect everything from I/O controllers to AGP/PCI bridges and even CPUs.

The Opteron features three 16-bit wide HT links, with each link offering up to 3.2GB/s of bandwidth in each direction (for a total of 6.4GB/s of bandwidth per link). Each Opteron CPU can connect to other Opteron CPUs using two of the three links, the third link is present for any I/O chips that the CPU must connect to.

The Multiprocessing Capabilities of Opteron

The beauty of this setup is that a configuration of 8 CPUs is just as easy to implement as a configuration of 2 CPUs, no need for expensive chipsets.

A side effect of each CPU having its own memory controller is that memory bandwidth scales with the number of CPUs you have present. Whereas in conventional MP architectures CPUs must share the memory bandwidth much like they do FSB bandwidth, with Opteron, each CPU gets a dedicated 128-bit DDR memory bus to itself. With multiple CPUs in a system, each CPU can use both its own memory controller as well as a non-local memory controller to pull in data, thus increasing effective memory bandwidth even further. For example, the Opteron supports a maximum of DDR333 SDRAM currently, giving it a peak bandwidth of 5.3GB/s per CPU. The CPU can also pull data from other memory controllers in a n-way machine as quickly as 3.2GB/s, the maximum transfer rate of the HT link between two CPUs.

In order for the performance benefit of this sort of memory access to be truly taken advantage of, the OS needs to be smart enough to not put all data in the first xxxMB of memory. Instead, the OS must keep data in memory in such a way as to optimize for local and non-local memory accesses. For example, in a 4-way Opteron server with each CPU having 1GB of memory, if the working dataset is only 512MB in size it shouldn't all be placed in CPU0's memory - especially if all four CPUs are using the data. It should either be copied to all four sets of memory, or it should be divided up so that all CPUs can have at least some local access to the data at their full 5.3GB/s rate. This type of memory access is known as NUMA, which stands for Non-Uniform Memory Access; Windows 2003 Server supposedly has support for NUMA.

The culmination of all of this is that the K8 core (and thus the Opteron) scales very well with the number of CPUs you have in a system, much better so than any Intel processor. To prove this we've taken an excerpt from one of our tests in Part 2 of our Opteron coverage and compare the benefit of moving to dual processors from an Opteron standpoint vs. a Xeon standpoint:

Performance Increase over Uniprocessor Configuration
8x Load Ad DB Test - Opteron 244 vs. Xeon 2.80GHz
AMD Opteron 244 (1.80GHz)

Intel Xeon 2.80GHz

23.5%

11.4%

|
0
|
5
|
9
|
14
|
19
|
24
|
28

Whereas the Xeon only sees an 11% increase in performance from going to two CPUs, the Opteron sees an impressive 24% performance boost! These are not numbers to scoff at; AMD has clearly designed the Opteron for serious multiprocessing environments. We hope to be able to bring you 4-way scaling benchmarks very soon.

Another interesting thing about the K8 architecture is that it has already been engineered for use in multicore designs. AMD's Fred Weber mentioned to us that the logic for multicore, single die Opteron processors has already been verified, although nothing has taped out. The process is actually quite simple; AMD produces two Opteron cores, removes the physical layers of the Hyper Transport links and connects the two on a single die. AMD could do this today if they desired, however according to Mr. Weber, a multicore Opteron only makes sense if they can keep their die size below 120 mm^2. Two Opertons on a single die at below 120 mm^2 will be possible on AMD's 65nm process, so whenever that transition occurs you can expect to hear a bit more from AMD on multicore Opteron solutions.

Look what we found, an on-die memory controller Let's get physical
Comments Locked

1 Comments

View All Comments

Log in

Don't have an account? Sign up now