Original Link: http://www.anandtech.com/show/780



Support for multiprocessor (MP) configurations has almost always been present in AMD CPUs.  In fact, the AMD K6 supported MP operation; however, it lacked the chipset support to bring it to the MP market.  The CPU wouldn’t have done too well in that market in any case, but the technology to take it there was present.

The original Athlon released in 1999 was perfect for MP systems as well, especially since when it first came out it was offering performance greater than that of Dual Pentium III systems while only being used in a single processor configuration.  Unfortunately, AMD had enough problems getting the Athlon accepted as a single processor desktop solution, much less an MP workstation/server platform. 

Since then, the Athlon has enjoyed tremendous success in the performance desktop market; it was only a matter of time before it was finally paired up with a truly high-end platform to try to attain the same type of success in the server and workstation markets.  However, in order to succeed in these two markets AMD cannot be dependent on companies like ALi and VIA to provide chipsets for their processors as neither of the aforementioned companies even intends to branch out into the truly high-end workstation and server markets anytime soon.  Instead, AMD took it upon themselves to design the chipset that would drive their Athlon processor into the workstation/server markets: the 760MP chipset.

Unlike the desktop 760 chipset, AMD does intend to manufacture the 760MP chipset for as long as there is demand.  Their roadmap doesn't have the 760MP being replaced by a third party solution anytime soon, mainly because of reliability issues.  The 760MP has been going through revisions for two years now, with AMD insistent on making its launch as picture-perfect as possible.  While AMD has gained clout in the enthusiast market because they are much more reliable than they once were, the same reputation doesn't follow in the workstation and server markets.  The immense amount of testing that the 760MP has been through virtually guarantees it to be the most reliable Socket-A chipset ever to be made available. 

On the CPU front, the 760MP's release is accompanied by the first server version of the Athlon core.  Just three weeks ago AMD announced their mobile Athlon 4 processor, which is based on the Palomino core.  The server version of the Athlon is also based on the same Palomino core but carries a different name, much like how Intel’s Xeon uses the Willamette core but carries a different name from the desktop Pentium 4.  The name of the server Athlon is the Athlon MP, the MP obviously coming from the fact that the CPU is validated by AMD for operation in multiprocessor mode.  From an architectural standpoint, the Athlon MP is no different from the mobile Athlon 4.  Although AMD introduced new naming for some of the features of the Athlon MP such as Smart MP Technology, the same features are present in the mobile Athlon 4. 

Although Athlon MP is the only CPU that is validated by AMD for operation in dual processor (DP) mode, it isn’t the only processor that works in DP configurations.  In fact, the first and only 760MP motherboard being released today was tested and debugged using regular Athlons with Thunderbird cores.  Even Durons will work in DP mode without any problems, but AMD is only officially supporting configurations with Dual Athlon MPs.  This is somewhat like Intel’s insistence that the Celeron would not work in MP mode, though ABIT obviously proved them wrong with their BP6, which was designed with DP Celerons in mind. 

We will get into the technology behind the Athlon MP and the 760MP chipset later in this article, but first it’s important to establish the rules of the game when it comes to the high-end workstation and server markets that the Athlon MP and 760MP target. 



The Requirements

The average AnandTech reader is already quite familiar with what kind of system will run the latest games at their fastest.  You know that a powerful video card with a lot of memory bandwidth is one of the most basic requirements for a high performance gaming system.  You also know that in order to build an efficient home/office system, a platform with enough memory and on-chip cache can help performance tremendously; but when it comes to building the most powerful high-end workstations and servers, the rules of the game change considerably.

A GeForce what?

There are a few types of high-end workstations and servers that can be built, but for the purposes of discussion of graphics cards we will generalize them into two categories: those that handle 3D graphics and those that don't.  The workstations that are used in programs like 3D Studio MAX and Pro/ENGINEER depend on having an extremely fast video subsystem.  These systems use graphics cards that cost thousands of dollars and actually require the 110W of power provided by an AGP Pro110 slot.  Cards such as the NVIDIA Quadro DCC and the 3DLabs Oxygen GVX420 are quite commonplace in these types of workstations.

However, having a high-performance graphics subsystem isn't necessary at all if your computer is just going to be used for displaying 2D graphics.  This is the case in many servers where the only reason to have a graphics card installed is so that your system will actually boot.  Administration is handled remotely so there isn't even a need for a monitor unless something goes horribly wrong with the system.  Having a motherboard with integrated video is actually a very attractive feature in the server market since it means that there is one less expansion card to install.  The ideal situation is to have everything on-board so that your motherboard can fit in as small a case as possible.  With many web and database servers, this helps keep the costs of colocating the server to a minimum.

North/South Bridge Bandwidth Matters

It is rare that your average power user worries about being bottlenecked by the PCI bus or the connection between the North and South bridges on their chipset.  With a single hard drive, a DVD drive and an Ethernet card in your system you are probably not consuming more than 50MB/s of bandwidth.  This bandwidth is of course offered by the 32-bit PCI bus running at 33MHz in most of today's systems, which provides a theoretical maximum of 133MB/s of bandwidth for your peripherals. 

In modern day chipsets, the PCI bus is actually an extension off of the South Bridge (or I/O Controller Hub - ICH - as Intel likes to call it).  In these chipsets, such as the Apollo Pro266 and the Intel 850, the connection between the South Bridge and the North Bridge is made by a special bus offering 266MB/s of bandwidth so that even if the PCI bus is completely saturated there is enough bandwidth between the North and South Bridges to allow for unrestricted traffic.

In some older chipsets, however, the PCI bus is an extension off of the North Bridge and it is used to connect the North and South Bridges.  This isn't a problem at all for most desktop users today since they rarely become bottlenecked by the bandwidth offered by the PCI bus.  This is why the new interconnect technologies such as Intel's Hub Architecture and VIA's V-Link don't offer any tangible performance gains for a lot of AnandTech readers.  But once again, in the workstation and server worlds, this isn't the case.

This next requirement is actually specific to servers that depend on having fast disk I/O such as a file or database server.  With these types of servers it is quite common to have massive RAID arrays of at least three or four drives.  Once you get into RAID configurations with four or more drives, the total amount of sustained bandwidth offered by the disk array can often exceed what the PCI bus is capable of handling.  For example, if you had a four drive RAID 0 array where each drive can deliver a sustained throughput of 40MB/s then the array could deliver a sustained 160MB/s of data.  Remember that the 32-bit PCI bus can only offer 133MB/s of bandwidth to the North Bridge, so if your RAID array can deliver 160MB/s of data then it will be limited by the amount of bandwidth that your PCI bus can offer.

The high-end market got around this by using 64-bit PCI, which is available in two flavors: one running at 33MHz and one that runs at 66MHz.  The 64-bit/33MHz bus offers 266MB/s of bandwidth while the 64-bit/66MHz bus offers 533MB/s of bandwidth, which is definitely enough for heavy server I/O.  This also helps when you throw in things like gigabit Ethernet adapters that are capable of transferring over 100MB/s of data across a network.  Just two of these cards can easily eat up the limited bandwidth that 64-bit/33MHz PCI can offer, which is why most truly high-end systems offer multiple 64-bit PCI buses that generally operate at 66MHz (although, for reasons of backwards compatibility, they offer a 33MHz operating mode as well). 
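The PCI figures quoted above all come from the same arithmetic: bus width times clock, divided by 8 to get bytes.  As a rough sketch (the function names are our own, and the nominal 33.33MHz/66.67MHz clocks are an assumption used to reproduce the rounded figures), the four-drive RAID 0 example can be checked against each bus like this:

```python
# Peak theoretical bandwidth of a parallel bus: width (bits) / 8 * clock (MHz).
# Assumption: nominal PCI clocks of 33.33 MHz and 66.67 MHz, one transfer per clock.

def bus_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Return peak bandwidth in MB/s for a bus doing one transfer per clock."""
    return width_bits / 8 * clock_mhz

PCI_BUSES = {
    "32-bit/33MHz": bus_bandwidth_mb_s(32, 100 / 3),
    "64-bit/33MHz": bus_bandwidth_mb_s(64, 100 / 3),
    "64-bit/66MHz": bus_bandwidth_mb_s(64, 200 / 3),
}

def raid0_throughput_mb_s(drives: int, per_drive_mb_s: float) -> float:
    """Idealized RAID 0 sustained throughput: the drives simply add up."""
    return drives * per_drive_mb_s

if __name__ == "__main__":
    array = raid0_throughput_mb_s(4, 40)  # the four-drive, 40MB/s example
    for name, peak in PCI_BUSES.items():
        verdict = "limited by the bus" if array > peak else "fits"
        print(f"{name}: {peak:.1f} MB/s peak; {array:.0f} MB/s array {verdict}")
```

These are peak numbers only; real PCI throughput is lower once arbitration and protocol overhead are taken into account, so the array in the example would be even more constrained than the arithmetic suggests.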

If your peripherals eat up over 266MB/s of bandwidth, getting that data to your CPU and main memory is a bit more complicated. In this case, even a 266MB/s connection between the North and South Bridges won't be enough.  This is why technologies such as ServerWorks' Inter Module Bus (IMB) are used; the ServerWorks IMB in particular is capable of transferring up to 1GB/s of data between the North and South Bridges.  And people wonder why they are called ServerWorks.



Memory: 1GB is barely enough

Here's something to try with your overclocked desktop boards: fill all of your memory banks and run a constant barrage of CPU/memory intensive tests on your system.  If your system can stand this sort of torture then you've truly got a solid desktop machine, but chances are that it wouldn't be able to handle this load for months on end without a single reboot.  This is the type of stress a mission critical server must deal with. 

It is rare that you see a high-end server outfitted with less than 1GB of memory.  This is one of the reasons that RDRAM has been kept out of the server market: when you're dealing with multiple gigabytes of memory, a more expensive memory technology can easily raise server cost by thousands of dollars.  This is also the reason that Iwill's i860 motherboard that we previewed two weeks ago has 8 memory slots - to allow for up to 4GB of RDRAM to be installed.  Even at AnandTech, our Forums Database server is outfitted with 1.5GB of SDRAM and it doesn't handle nearly as much traffic as the database servers at websites such as Yahoo and CNet.  You can only imagine the type of memory configurations they have running over there.

Adding more memory can easily improve performance by hundreds of percentage points in database serving applications depending on load.  In the workstation arena, when modeling extremely complex designs, it isn't rare to see systems outfitted with 16GB of memory.  Remember that one of the reasons Intel designed the 64-bit Itanium processor was so that they could use it to design their next-generation microarchitecture.  One of the biggest benefits of 64-bit processors is their ability to address beyond 4GB of memory, a limitation of 32-bit processors (2^32 bytes).  This means that 64-bit microprocessors will be able to address more than 18,446,744,073GB of memory.  Although that is quite a bit, remember that it was once thought that nobody would ever need more than 640K of memory.  To put it simply, there is a need for massive amounts of memory in these types of systems, much more than what you'd use in your desktop computer.
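For those who want to check the address space figures, the arithmetic is simply 2 raised to the number of address bits.  The short sketch below is ours, not AMD's or Intel's; note that the 4GB figure uses binary gigabytes (2^30 bytes) while the 18,446,744,073GB figure works out in decimal gigabytes (10^9 bytes).

```python
GB = 10 ** 9    # decimal gigabyte
GIB = 2 ** 30   # binary gigabyte

def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a flat address of the given width."""
    return 2 ** address_bits

if __name__ == "__main__":
    print(addressable_bytes(32) // GIB)  # 32-bit limit in binary gigabytes
    print(addressable_bytes(64) // GB)   # 64-bit limit in decimal gigabytes
    print(addressable_bytes(64) // GIB)  # the same limit in binary gigabytes
```

This is the theoretical ceiling for a full 64-bit address; actual processors and chipsets typically implement fewer physical address lines.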

Reliability is not second to memory size; it is easily an equal when it comes to workstations and servers.  This is why ECC memory is often a requirement in the high-end environment.  In many cases, in order to allow for higher density modules, registered memory is required as well.  This is the case with the Tyan 760MP motherboard we are taking a look at today that does, in fact, require the use of Registered DDR SDRAM. 

For those of you not familiar with Registered DIMMs, they are generally not the same type of modules you use in your personal systems except for a few unusual cases (you will most likely know if you have registered memory or not). In contrast to unbuffered SDRAM (conventional SDRAM), Registered SDRAM features small registers present between the module's interface and the actual SDRAM chips on the PCB. They are often used to decrease loading and allow for more physical SDRAM devices to be used on a single DIMM.

Bandwidth actually matters

AMD, Intel and VIA have all been guilty of promoting higher bandwidth memory technologies on the desktop platform, but the performance improvement over previous "low bandwidth" technologies is relatively small.  The reason behind this is that most of today's applications aren't bandwidth intensive enough to take advantage of this additional bandwidth that they've been given.  It's like having a four-lane highway but only three cars to take advantage of it.  However, it is best to plan ahead and add the fourth lane instead of getting caught with your pants down by only having three lanes when there are more cars present later.

With workstations and servers, the applications already take advantage of the bandwidth that these new memory technologies offer.  This is partially why the single processor Athlon on an AMD 760 motherboard did so well in our recent database server tests, compared to the dual processor Pentium III.

As more processors are added to the equation, the need for more memory bandwidth increases as there are now multiple CPUs competing for the same memory bandwidth.  This is why the ServerWorks Grand Champion HE with support for up to four Intel Xeon processors is outfitted with 6.4GB/s of memory bandwidth courtesy of its quad channel DDR200 memory bus. 
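The 6.4GB/s figure can be reproduced with a quick calculation.  This is only a sketch: the function name is ours, and we are assuming each DDR channel is 64 bits wide, which the text above does not spell out.

```python
def memory_bandwidth_gb_s(channels: int, transfers_mt_s: float,
                          width_bits: int = 64) -> float:
    """Peak memory bandwidth in GB/s.

    Assumption: every channel is width_bits wide and moves one word
    per transfer (DDR200 = 200 million transfers per second).
    """
    bytes_per_transfer = width_bits / 8
    return channels * transfers_mt_s * bytes_per_transfer / 1000

if __name__ == "__main__":
    # Quad channel DDR200, as on the ServerWorks Grand Champion HE
    print(memory_bandwidth_gb_s(channels=4, transfers_mt_s=200))
    # A single DDR266 channel for comparison
    print(memory_bandwidth_gb_s(channels=1, transfers_mt_s=266))
```

The same formula applied to a single DDR266 channel gives roughly 2.1GB/s, which is the memory bandwidth figure that comes up again when we look at the 760MP.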

Bandwidth to the CPUs also matters, which is determined by the speed and bus width of the FSB.  That's another reason why a single Athlon, with its 266MHz EV6 bus, did so well in many of our recent server tests.

With these requirements in mind, we can now take a look at the architecture behind the Athlon MP processor and the 760MP chipset. 



Athlon MP Technology

As you will remember from our original story on the Athlon 4, there are a number of improvements that allow the Palomino core to maintain a somewhat noticeable performance advantage over its predecessor.  With the knowledge that the Athlon MP uses the same Palomino core as the Athlon 4, these same improvements are present in this processor as well.  The only thing the Athlon MP won't boast as a feature is PowerNow!, though it is supported by the processor. 

The Athlon MP is the first non-mobile AMD processor to bring a full implementation of Intel's Streaming SIMD Extensions (SSE) to the table.  This allows the Athlon MP to run code optimized for 3DNow! or SSE instruction sets although it doesn't necessarily mean that it can run SSE optimized code as fast as a Pentium III/Pentium 4.  AMD calls the Athlon MP's 3DNow! + SSE support their 3DNow! Professional technology.  AMD will eventually include full SSE2 compliance in their Hammer line of CPUs. 

The second improvement the Athlon MP offers over the Athlon is its improved data prefetch mechanism.  This feature allows the Athlon MP to automatically take advantage of otherwise unused FSB bandwidth for prefetching data that the processor thinks it may be requested to gather, before it is actually instructed to do so.  This increases the Athlon MP’s dependency on a high-speed FSB and memory bus as well, and it also accounts for the majority of the Athlon MP's performance advantage over the Athlon.  As we’ve noticed in the Pentium 4's performance characteristics, data prefetch can help in applications that require a great deal of bandwidth and have easily predictable memory accesses, such as video editing or, more specific to this article, 3D rendering and database serving.  Data prefetch is actually quite useful in the case of the Athlon MP since its chipset platform offers a considerable amount of FSB bandwidth, which is more easily consumed with data prefetch enabled, but more on that later.

The third improvement offered by the Athlon MP is a set of three enhancements to the processor's Translation Look-aside Buffers (TLBs).  As taken from AMD’s tech docs on the Palomino core, the three TLB enhancements are:

1. The L1 Data TLB increases from 32 to 40 entries
2. Both the L2 Instruction TLB and L2 Data TLB use an exclusive architecture
3. TLB entries can be speculatively reloaded

As you will remember from our initial story on the Athlon 4 processor, the task of the TLB is to cache translated memory addresses.  This translation process is necessary for the CPU to gain access to the data stored in main memory, and by caching the translated addresses, it becomes much quicker to find data in main memory. 

The first improvement comes by increasing the number of entries in the L1 Data TLB.  This increase allows for a greater hit rate (probability of finding what the CPU needs in the TLB) in the L1 Data TLB.  You will also remember that the Pentium III has an L1 Data TLB with significantly more entries than even the new 40 entry TLB on the Athlon MP.
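To see why a few extra entries can matter, here is a toy model of a TLB.  It is a sketch under loud assumptions: a fully associative TLB with least-recently-used replacement (AMD's actual organization and replacement policy are not described here), a made-up page table, and a deliberately unfriendly access pattern that loops over 36 pages.

```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB with LRU replacement (an assumption)."""

    def __init__(self, entries: int):
        self.entries = entries
        self.cache = OrderedDict()  # virtual page -> physical frame
        self.hits = 0
        self.misses = 0

    def translate(self, vpage: int, page_table: dict) -> int:
        if vpage in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpage)   # mark as most recently used
            return self.cache[vpage]
        self.misses += 1
        frame = page_table[vpage]           # slow path: walk the page table
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)  # evict the least recently used
        self.cache[vpage] = frame
        return frame

def hit_rate(entries: int, working_set_pages: int, passes: int = 10) -> float:
    """Loop over the working set several times and report the TLB hit rate."""
    page_table = {p: p + 1000 for p in range(working_set_pages)}  # hypothetical
    tlb = TLB(entries)
    for _ in range(passes):
        for page in range(working_set_pages):
            tlb.translate(page, page_table)
    return tlb.hits / (tlb.hits + tlb.misses)

if __name__ == "__main__":
    for size in (32, 40):
        print(f"{size}-entry TLB, 36-page loop: {hit_rate(size, 36):.0%} hits")
```

With this particular pattern the 32-entry model never hits while the 40-entry model hits on every access after the first pass.  That is a worst case picked to make the point; real workloads will see a far smaller difference.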

The next Athlon MP TLB enhancement comes by moving the L2 TLBs to an exclusive architecture.  This means that data contained within the L1 TLBs is not duplicated in the L2 TLBs, which obviously saves space in the L2 TLBs meaning that they can be used to store even more translated addresses.  The downside to this exclusive architecture is that there is a latency sacrifice that is made since the addresses aren't duplicated in the L2 TLBs.

The final improvement is that the TLB entries can be speculatively reloaded.  This means that in the event that an address is not found in the TLB, the address can be loaded into the TLB before the instruction that requested the address is finished executing.  On older Athlon cores, this was not possible, resulting in a bit of a performance hit in this situation.  According to AMD, this situation is usually observed in "high-end software applications." 

In fact, AMD states that the TLB enhancements of the Athlon MP are most useful in these "high-end software applications."  Hopefully, we will see whether or not they are correct with our benchmarks, which are composed of a number of very high-end tests.



The importance of Cache Coherency

Imagine for a moment that you have a building with two programmers working.  They are in adjacent cubicles and are working on the same project.  Their manager is across the hall in another office and can see what the two are working on.  Obviously it is necessary for the two programmers to communicate with one another so that they know what each other is doing.  Assuming they can't talk, there are two ways for the programmers to communicate with one another.  One way is by reaching around the cubicle and passing notes to one another; and the other way is to send a message to the manager and have him deliver it to the other programmer.  Clearly, the first way is the most efficient and most appropriate.  If you haven’t figured it out thus far, this is an example of the communication that must occur between two CPUs. 

Now it is quite useful for one programmer to find out if the other one has a particular function already written, but it would require constant communication between the two, and as we already established, these two programmers don’t talk.  This is another problem that MP systems encounter: how does one CPU know what is stored in the other CPU’s cache?

In most SMP systems, the individual CPUs monitor for requests across the FSB and return the data if it is present within the CPU’s cache.  For example, let’s take a dual processor Athlon MP system with two Athlon MP CPUs: CPU0 and CPU1.  First, CPU0 requests a block of data that is contained within main memory and not within CPU0’s cache or CPU1’s cache.  The data is delivered from main memory, through the North Bridge, up to the CPU that requested it, in this case CPU0. 

Then, CPU0 requests another block of data that is located within CPU1’s L2 cache.  CPU1 is always monitoring (also called snooping) the FSB for requests for data; this time around, the data is in its cache and it sends it out.  Now there are two ways of getting the data to CPU0: it can either be written to main memory by CPU1 and read by CPU0, or it can be transferred directly from CPU1 to CPU0.

In the case of a Shared Front Side Bus, where all of the CPUs in an MP system share the same connection to a North Bridge, inter-CPU communication must be carried through main memory, which was the first example we gave.  In the case of a Point-to-Point Front Side bus, where each of the CPUs gets its own dedicated path to the North Bridge, inter-CPU communication can occur without going to main memory, simply within the North Bridge.

The Shared FSB and Point-to-Point FSB aren’t functions of the CPU; all the Athlon MP can do is make sure it works with a particular protocol.  Instead, this is a chipset function, and in the case of the 760MP, it implements a Point-to-Point bus protocol.  This helps reduce memory bus traffic since all inter-CPU communication occurs without even hitting the memory bus.  For comparison’s sake, all MP chipsets for Intel processors use a Shared FSB including the recently released i860 chipset for the Intel Xeon.  It is arguable whether or not the ability to direct all snooping traffic internally within the North Bridge helps performance; all indications seem to point to this being a feature that is nice to have but not necessarily a performance booster.

Another benefit of the Athlon MP’s EV6 FSB is that there are two unidirectional address ports (address in and address out) and one bidirectional data port in every EV6 bus link.  This means that an Athlon MP can snoop for data it needs while fulfilling a data request at the same time.  The Pentium 4’s AGTL+ FSB only has a single bidirectional address port and a single bidirectional data port, meaning that addresses can only travel in one direction at a time, either to or from the processor, not both simultaneously.

Taking our Athlon MP system out for another test, we have the following situation: CPU0 has a block of data in its cache, and CPU1 has the same data in its cache.  CPU1 then changes the data that both processors have in their caches after which CPU0 attempts to read that data.  At this point the copy of the data stored in CPU0’s cache isn’t the most recent copy; in fact it has been changed since CPU0 pulled it into its cache.  Keeping the data in each CPU’s cache up to date, or coherent with one another, is what we mean when we refer to cache coherency. 

There are only a couple of major cache coherency protocols but many variants of them.  By far the most common cache coherency protocol is known as write invalidate.  Generally speaking, the write invalidate coherency protocol simply dictates which processor’s cache to invalidate the data in during the event of a coherency conflict.  The invalidate function is one that takes place over the address bus alone, meaning that the EV6’s dual ported address bus comes in handy once again, allowing for a cache line invalidate and a data request to be executed simultaneously.

There are many forms of the write invalidate coherency protocol, the most common being a MESI protocol.  The four-letter acronym stands for the four states (Modified, Exclusive, Shared or Invalid) that a cache line may take.  The meanings of the four states are as follows:

Modified – The data in the line has been modified, meaning that the copy in main memory is out of date.

Exclusive – This cache holds the only cached copy of the data, and the copy in main memory is valid.

Shared – The data is in more than one processor’s cache and the copy in memory is valid.

Invalid – The data in cache is invalid.

The MESI protocol is present in the majority of x86 processors including the AMD K6, Intel Pentium III, Pentium 4 and Xeon.  Even the PowerPC processor uses the MESI protocol.

The Athlon MP (including all previous Athlon variants and the Duron) uses a five-state MOESI protocol instead.  The MOESI protocol adds another state known as the “Owned” state.  This is a state that is triggered when the data being requested is in more than one processor’s cache and the data in one cache has been modified. 
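The difference between the two protocols is easiest to see by replaying the stale-data scenario from earlier.  What follows is a heavily simplified sketch of our own, not AMD's implementation: it tracks a single cache line across two CPUs, follows the usual textbook transitions, and only counts how often main memory has to be written.  Every class and function name here is made up for illustration.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    OWNED = "O"      # only used when modelling MOESI
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

DIRTY = (State.MODIFIED, State.OWNED)

class SharedLine:
    """State of one cache line in two CPUs (textbook transitions, simplified)."""

    def __init__(self, use_owned: bool):
        self.use_owned = use_owned  # True models MOESI, False models MESI
        self.state = [State.INVALID, State.INVALID]
        self.memory_writes = 0      # write-backs forced by coherency traffic

    def read(self, cpu: int) -> None:
        other = 1 - cpu
        if self.state[cpu] is not State.INVALID:
            return                  # cache hit, nothing changes
        if self.state[other] in DIRTY:
            if self.use_owned:
                # MOESI: the other cache supplies the data and keeps ownership
                self.state[other] = State.OWNED
            else:
                # MESI: the modified line is written back to memory first
                self.memory_writes += 1
                self.state[other] = State.SHARED
            self.state[cpu] = State.SHARED
        elif self.state[other] is State.INVALID:
            self.state[cpu] = State.EXCLUSIVE   # nobody else has it
        else:
            self.state[other] = State.SHARED
            self.state[cpu] = State.SHARED

    def write(self, cpu: int) -> None:
        self.state[1 - cpu] = State.INVALID     # write invalidate
        self.state[cpu] = State.MODIFIED

def stale_read_scenario(use_owned: bool):
    """CPU0 and CPU1 share a line, CPU1 modifies it, CPU0 reads it again."""
    line = SharedLine(use_owned)
    line.read(0)
    line.read(1)
    line.write(1)
    line.read(0)
    states = "".join(s.value for s in line.state)  # CPU0 then CPU1
    return states, line.memory_writes

if __name__ == "__main__":
    print("MESI :", stale_read_scenario(use_owned=False))
    print("MOESI:", stale_read_scenario(use_owned=True))
```

In this model the MESI run ends with both caches in the Shared state after one write to main memory, while the MOESI run leaves CPU1 in the Owned state and never touches main memory.  Real implementations handle many more cases (evictions, write-backs of Owned lines, more than two CPUs), so treat this as an illustration of the idea only.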

Implementing MOESI cache coherency is much more complicated than the basic four-state MESI implementation and thus requires many more transistors to implement.  However, it works perfectly with the Athlon MP’s Point-to-Point FSB’s dual address ports and actually increases bus efficiency.

The MOESI cache coherency protocol had been previously reserved for high-end server CPUs such as the Sun UltraSPARC II, but the Athlon actually debuted with it back in 1999.



Athlon MP – The Chip

The Athlon MP will be available in two versions at launch: 1.0GHz and 1.2GHz.  Both versions operate on the 133MHz DDR bus (effectively 266MHz). 

The future of the Athlon MP is pretty much the same as the desktop Athlon.  The Athlon MP will receive an upgrade to the Thoroughbred and eventually the Barton core with SOI.  For more information on Thoroughbred, Barton and SOI (as well as how it works) consult our Athlon 4 preview.



The rumors are obviously true as AMD shows in their above roadmap that they want the future Durons to be used in entry level servers. Since the current Durons work with the 760MP chipset (although again, they aren't validated for DP operation) we are able to get a sneak preview of what to expect from DP Durons with the Morgan core later this year.

The Athlon MP 1.2 and 1.0GHz parts are priced at $265 and $215 respectively in 1,000 unit quantities.  This is a markup of around $90 for the CPUs over their desktop counterparts.



Far from just 760

Just one year ago AMD promised that we'd see their first multiprocessor solution in Q3/Q4 of 2000.  The AMD 770, as it was called, was to be paired up with the Mustang processor outfitted with a very large L2 cache.  It may have been more appropriate to keep the name of the chipset as 770 since it is significantly different from the desktop 760 chipset, but AMD did decide to call it the 760MP. 

The main reason that the 760MP would have been better off being called the 770 is its sheer size advantage over the desktop 760 chipset.  The 760 chipset is made up of two major parts, the AMD 761 North Bridge and the AMD 766 South Bridge, and the two are connected by the 32-bit/33MHz PCI bus.  The 761 North Bridge is manufactured in a BGA (Ball Grid Array) package with a total of 569 balls (essentially interface pins) that connect the chip to the various buses and parts of the motherboard.  The AMD 760MP on the other hand uses a different North Bridge, the AMD 762 North Bridge, which features approximately 1000 balls, making this the most complex North Bridge AMD has ever manufactured.

We mentioned earlier that the 760MP chipset features a Point-to-Point FSB protocol, meaning that each processor in a 760MP system gets its own connection to the North Bridge.  Unfortunately, this means that there are significantly more traces going between the North Bridge and CPUs and basically double the number of interface pins on the North Bridge – accounting for some of the increase in North Bridge pin-count.

The benefits of the Point-to-Point FSB are numerous.  We’ve already explained the benefits when it comes to inter-CPU communication (although the tangible performance benefits resulting from this may be limited), but there is obviously a much larger benefit: an increase in overall FSB bandwidth.

Each Athlon MP gets 2.1GB/s of bandwidth to/from the North Bridge in a 760MP system.  Let’s take a two-processor 760MP system for example with two Athlon MPs running at 1.2GHz.  Since all Athlon MPs run on the 133MHz double pumped FSB, the effective clock of the FSB is 266MHz.  Multiply the 266MHz effective frequency of the bus by the 64-bit bus width and you get roughly 17 Gigabits per second of bandwidth available to one processor; dividing by 8 results in our 2.1 Gigabytes per second figure.  Remember that 2.1GB/s is just to a single processor in the system; the second Athlon MP receives just as much bandwidth to the North Bridge.
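Written out step by step, the calculation looks like this (a small sketch with names of our own choosing; the inputs are the figures given in the paragraph above):

```python
def fsb_bandwidth_mb_s(base_clock_mhz: float, transfers_per_clock: int,
                       width_bits: int) -> float:
    """Peak FSB bandwidth in MB/s for one processor link."""
    effective_mhz = base_clock_mhz * transfers_per_clock  # 133 x 2 = 266
    megabits_per_s = effective_mhz * width_bits           # 266 x 64
    return megabits_per_s / 8                             # bits -> bytes

if __name__ == "__main__":
    athlon_mp = fsb_bandwidth_mb_s(133, 2, 64)
    print(f"Athlon MP EV6 link: {athlon_mp:.0f} MB/s (~{athlon_mp / 1000:.1f} GB/s)")
```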

Although a single Intel Xeon has more FSB bandwidth than a single Athlon MP (3.2GB/s vs. 2.1GB/s), a dual Intel Xeon setup must share that 3.2GB/s of FSB bandwidth while each individual Athlon MP gets the full 2.1GB/s of bandwidth to the North Bridge. 

The one thing to keep in mind here is that the performance doesn’t immediately double because of this; the reason being that there is still only 2.1GB/s of bandwidth available to/from the memory that must be shared by the two processors.  This means that the 760MP still only has a single channel 64-bit DDR memory bus; there is no more memory bandwidth present in a 760MP system than there is in a desktop 760 system.
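To put numbers on the shared versus point-to-point comparison, here is a rough model using only the figures quoted above.  It assumes both CPUs are busy at the same time and split any shared resource evenly, which is our simplification and not something either chipset guarantees; the function names are ours as well.

```python
def per_cpu_fsb_gb_s(fsb_gb_s: float, cpus: int, shared: bool) -> float:
    """FSB bandwidth each CPU sees when all CPUs are active (even split assumed)."""
    return fsb_gb_s / cpus if shared else fsb_gb_s

def per_cpu_memory_gb_s(fsb_share_gb_s: float, memory_gb_s: float,
                        cpus: int) -> float:
    """Memory bandwidth per CPU: limited by its FSB share or its memory share."""
    return min(fsb_share_gb_s, memory_gb_s / cpus)

if __name__ == "__main__":
    xeon_fsb = per_cpu_fsb_gb_s(3.2, cpus=2, shared=True)      # shared FSB
    athlon_fsb = per_cpu_fsb_gb_s(2.1, cpus=2, shared=False)   # point-to-point
    athlon_mem = per_cpu_memory_gb_s(athlon_fsb, memory_gb_s=2.1, cpus=2)
    print(f"Dual Xeon, FSB per CPU:      {xeon_fsb:.2f} GB/s")
    print(f"Dual Athlon MP, FSB per CPU: {athlon_fsb:.2f} GB/s")
    print(f"Dual Athlon MP, memory per CPU when both stream: {athlon_mem:.2f} GB/s")
```

The point-to-point links leave each Athlon MP with its full 2.1GB/s to the North Bridge, but once both processors pull from main memory at the same time, the single DDR channel becomes the limit.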

What we will undoubtedly see, however, is that the two EV6 links on the 762 North Bridge will allow for much greater usage of the memory bandwidth that is available.  If you will recall from recent Sandra STREAM scores, the Athlon offers between 500 – 800MB/s of usable memory bandwidth while the specs call for a peak of 2.1GB/s.  If the Athlon is saturating its FSB then moving to dual processors on the 760MP should allow for greater utilization of memory bandwidth since it essentially has twice as much FSB bandwidth to saturate. 

The 762 North Bridge features the same AGP 4X controller that was found on the 761 North Bridge (AMD 760 chipset), and we already mentioned that the chipset has the same 64-bit DDR memory controller.  Because it uses the same memory controller, the memory bus is always synchronous to the FSB (e.g. 100MHz FSB results in 100MHz memory bus clock). 

The other major difference is the PCI controller on the 762 North Bridge.  While the 761 North Bridge only had a 32-bit 33MHz PCI bridge, the 762 contains a 32/64-bit 33MHz PCI bridge.  This is the bus that connects the North Bridge to the AMD 766 South Bridge, which remains unchanged from the original 766 South Bridge that was launched with the AMD 760 chipset. 

Obviously, the support for only 33MHz 64-bit PCI is a limitation of the 760MP chipset; luckily, AMD does have something planned to fix this. 



The Mass Production Chipset: The 760MPX

If you haven’t heard by now, the only motherboard manufacturer that will have a 760MP chipset in the near future will be Tyan.  The Tyan Thunder K7 is actually available today and it will most likely have the exclusive for at least a couple of months.  In fact, no other motherboard manufacturers are scheduled to produce a 760MP-based motherboard.  Instead, they will be producing motherboards based on a different revision of the 760MP chipset known as the 760MPX.

The 760MPX was actually already announced by AMD at WinHEC back in April but few picked up on it.  The main difference between the 760MP and the 760MPX is that the latter features a 66MHz 64-bit PCI bus with support for up to two devices off of the North Bridge, and a new South Bridge with support for a legacy 32-bit 33MHz PCI bus.



The new South Bridge, the AMD 768, is connected to the North Bridge via a 32-bit 66MHz PCI bus offering a total of 266MB/s of bandwidth between the two chips – which is exactly what Intel’s Hub Architecture and VIA’s V-Link offer. 

Is the AMD 760MPX worth holding out for?  Most of the motherboard manufacturers such as ABIT and MSI have already announced their MPX based platforms and we know from MSI that they are targeting the sub-$200 market.  Since AMD will eventually announce Morgan-based Durons validated for MP operation, this could make for some very cheap MP systems. 

AMD's 760MPX Reference Board





Tyan: Still the King

When AMD launched the Athlon processor in 1999, they picked a handful of motherboard manufacturers to be their launch partners.  These partners included manufacturers like FIC and Gigabyte and it was their duty to make sure that the Athlon processor had motherboards to run on upon its release.  For the 760MP chipset, AMD decided to pick one and only one motherboard manufacturer to deliver a motherboard to the market.  And of course, they picked the one motherboard manufacturer that has always prided themselves on producing server motherboards: Tyan.



Tyan essentially has the exclusive on the AMD 760MP release, meaning that you won’t see motherboards from another manufacturer until the end of this year at the earliest.  There are a few reasons for this; for one, AMD wanted someone very experienced in building server motherboards to handle the task.  You already know the requirements for a server platform, and Tyan is very well versed in them. 

Being the only 760MP motherboard manufacturer isn’t necessarily a bad thing, especially when you realize that the Tyan Thunder K7 is quite possibly the most feature-filled motherboard we have ever seen in our four years of reviewing motherboards. 

The Thunder K7 is actually the reference design for AMD’s 760MP platform, which AMD internally refers to as the Guinness platform.  The features are as follows:

-         2 – Socket-A interfaces (100MHz/133MHz DDR FSB)
-         4 – 45-degree angled DDR SDRAM DIMM slots
-         5 – 64-bit/33MHz PCI slots (backwards compatible with 32-bit/33MHz PCI devices)
-         1 – AGP Pro110 slot delivering a maximum of 110W of power to AGP Pro110 cards
-         1 – onboard ATI RageXL graphics
-         1 – Adaptec 7899W dual channel Ultra160 SCSI controller
-         2 – onboard 3Com 10/100 Ethernet controllers

There are a number of interesting points about the motherboard’s feature set.  First of all, the angled memory slots allow the Thunder K7 to be used in a 1U server case.  This is a huge accomplishment since rack space can be very expensive if you have quite a few servers.  Designing a 1U server around a dual Athlon MP Thunder K7 platform requires special cooling and power to be implemented, since there isn’t enough space for regular heatsinks to be mounted in the case, but it is possible.  In the near future, you will see systems from at least one manufacturer produced in a 1U form factor (1.75” high). 





The Thunder K7 also only provides support for registered DDR SDRAM and will not work with regular DDR SDRAM.  Luckily, courtesy of extremely low DDR memory prices from companies like Crucial, registered DDR SDRAM isn’t much more expensive than regular DDR SDRAM. 

The Thunder K7 uses a heat spreader on the 762 North Bridge which gets extremely hot. It may eventually make more sense for manufacturers to stick with a heatsink and fan in order to cool the North Bridge. It is definitely the hottest running North Bridge we have ever seen.

The on-board PCI video and SCSI, coupled with the two 3Com 10/100 Ethernet ports, make this the perfect 1U server board since absolutely no add-in cards are necessary for operation as a full-fledged server.  While it’s difficult to get more than three or four hard drives into a 1U server, the Thunder K7 can easily be the base for a very powerful 1U web server.

Many of Tyan’s OEM customers are interested in using the Thunder K7 in truly high-end 3D workstations, most of which require AGP Pro50 or AGP Pro110 support.  Because of the incredible power required by an AGP Pro110 card (110W), the Thunder K7 comes with some unusual power requirements. 


24-pin WTX, 20-pin ATX, 8-pin secondary WTX connector (from top to bottom)

The motherboard features two power connectors on it: a 24-pin WTX connector and a secondary 8-pin power connector (standard ATX power connectors have 20 pins).  While the connectors themselves are compatible with the WTX specification and look identical to the two connectors that were present on the Iwill Dual Xeon motherboard, the pinouts are different.  Unfortunately, this difference means that the Thunder K7 uses a non-standard power supply. 



Currently, we are aware of two manufacturers that make power supplies for the Thunder K7: Delta and NMB Technologies.  Delta has a 450W model and NMB uses a 460W power supply; we used the NMB unit for our tests. 

Tyan’s goal with the Thunder K7 was to make it a time-to-market board, meaning that it would bring the technology to the market as soon as possible.  When other manufacturers release their channel boards later this year, these may either be WTX specification or even potentially use regular ATX power supplies.  Future boards should also be able to work with regular DDR SDRAM and not just registered DDR.  The Thunder K7 also has no overclocking options; you will have to wait for a board from another manufacturer for things such as clock multiplier, FSB and voltage adjustment.

ABIT’s first 760MPX board will be a WTX solution while MSI has been showing off a regular ATX design. 

The Tyan board ships to distributors at $500, meaning that the retail price will be above that.  The final retail price could be as “low” as $550 or as high as $700. 



The Test

Windows 98SE / 2000 Test System

Hardware

CPU(s): Intel Pentium III 933MHz x 2; Intel Xeon 1.7GHz x 2; AMD Athlon MP 1.2GHz x 2; AMD Athlon-C "Thunderbird" 1.2GHz x 2; AMD Duron 850MHz x 2
Motherboard(s): Iwill DVD266-R; Iwill DX400-SN; Tyan Thunder K7
Memory: 1GB PC2100 Corsair DDR SDRAM; 1GB PC800 Toshiba RDRAM
Hard Drive: IBM Deskstar 30GB 75GXP 7200 RPM Ultra ATA/100
CDROM: Philips 48X
Video Card(s): NVIDIA GeForce2 Ultra 64MB DDR (default clock - 250/230 DDR)
Ethernet: Linksys LNE100TX 100Mbit PCI Ethernet Adapter

Software

Operating System: Windows 2000 Professional SP2; Windows 2000 Server SP2
Video Drivers: NVIDIA Detonator3 v6.50 @ 1024 x 768 x 16 @ 75Hz; NVIDIA Detonator3 v6.50 @ 1280 x 1024 x 32 (SPECviewperf) @ 75Hz
Chipset Drivers: VIA 4-in-1 4.31V was used for all VIA based boards

 



Memory Bandwidth Comparison

Although we're not big fans of SiSoft Sandra as a benchmark, the memory bandwidth test contained within the suite has proved to be quite useful. Without a doubt the Xeon offers the greatest amount of useable memory bandwidth of the collection; the i860's dual channel RDRAM interface provides for a theoretical maximum of 3.2GB/s of bandwidth to main memory. According to these results, approximately 44% of the bandwidth is actually being used by the two Xeon processors. Even with only a single Xeon processor present in the system, the bandwidth usage figures do not change. This indicates that either the FSB or memory bus is saturated already by a single Xeon at 1.7GHz.

Things get even more interesting when you look at the Athlon MP Single vs. Dual bandwidth figures. By going to a DP configuration, the Athlon MP's memory bandwidth figures increase a whopping 37%. A single processor Athlon MP manages to use approximately 33% of its 2.1GB/s of memory bandwidth, but going to two processors takes the memory bandwidth utilization up to 45%; even higher than that of the Xeons on the i860. Similar increases in memory bandwidth utilization are present when looking at the regular Athlon and Duron scores.

Let's take a look at the FP STREAM results before we begin dissecting exactly what's going on here.

Similar results are present in the FP STREAM tests under Sandra 2001. For starters, the Xeon platform offers and takes advantage of the most memory bandwidth of the group. The performance difference between the single and dual Xeon configurations is next to nothing, both offering around 1400MB/s of bandwidth.

Again the really interesting numbers come from the Single vs. Dual comparisons on the 760MP. Here the Athlon MP goes from using 36% of its memory bandwidth to using 46%. The question is why?

Remember that the 760MP is the only dual platform here to have a point-to-point FSB. A single Athlon MP CPU in this case can eat up approximately 700 - 800MB/s of memory bandwidth; adding a second CPU means that the total system (since each CPU gets a full 2.1GB/s path to the North Bridge) could consume at most twice that much. Obviously, since performance doesn't scale perfectly when going to multiple processors, we don't see this sort of scaling, but the fact that we see any increase in memory bandwidth utilization indicates that in a single processor Athlon system the FSB is limiting the processor's utilization of its memory bandwidth.
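
The utilization percentages quoted in this section are simply measured bandwidth divided by theoretical peak. A rough Python sketch of that relationship follows, using the 3.2GB/s and 2.1GB/s peaks and the 1400MB/s Xeon FP STREAM result from the text. The Athlon MP bandwidth values it prints are back-computed from the quoted 36% and 46% figures rather than read from the graphs, and the function names are our own:

```python
def utilization(measured_mb_s: float, peak_mb_s: float) -> float:
    """Fraction of theoretical peak memory bandwidth actually achieved."""
    return measured_mb_s / peak_mb_s

def implied_bandwidth(fraction: float, peak_mb_s: float) -> float:
    """Convert a quoted utilization fraction back into MB/s."""
    return fraction * peak_mb_s

I860_PEAK = 3200.0    # dual channel PC800 RDRAM, MB/s (from the text)
DDR266_PEAK = 2100.0  # 760MP DDR memory bus, MB/s (from the text)

print(f"Dual Xeon FP STREAM: {utilization(1400, I860_PEAK):.0%} of peak")
for label, fraction in (("single", 0.36), ("dual", 0.46)):
    mb_s = implied_bandwidth(fraction, DDR266_PEAK)
    print(f"Athlon MP {label}: roughly {mb_s:.0f} MB/s")
```

The implied single-processor figure lands inside the 700 - 800MB/s range mentioned above, which is a useful sanity check on the percentages.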

If this is true then it could also mean that one of the reasons that the Pentium 4 (and Xeon) do so well in these tests is because of the 100MHz quad-pumped FSB.



Database Server Performance

In our Intel Xeon review we introduced a new test to our benchmark suite for high-end processors: a Database Server Performance test. The premise behind it was simple; record every single transaction that occurred on the AnandTech Forums Database server for a period of 30 minutes then play it back on a similar server using one of these platforms as fast as possible. This sort of benchmarking is much like a Quake III Arena timedemo where the demo is played back as quickly as possible and an average frame rate is displayed; the only difference here is that instead of an average frame rate, we get a time-to-complete result.

The test remains unchanged from our original review, and for more information about the history behind the benchmark you can read our description of it here.

In order to minimize the I/O bottlenecks, the test systems were not only outfitted with four Quantum Atlas 10K hard drives in RAID 0 (offering more write bandwidth than, but similar read bandwidth to, our Forums DB server's 4 drive RAID 10 array) but were also given 1GB of memory.

During the 30 minute recording there were: 105267 selects, 4984 updates, 701 inserts and 5 deletes performed on the database. The names of the tasks describe exactly what they are; selects are reads, updates are reads and writes, inserts are writes and deletes remove data from the database (and are quite rare).

The first thing to notice is that the test is extremely read intensive, meaning that the I/O bottlenecks aren't as great as if the test was more write intensive. You can always read faster than you can write so this should mean that the test will be more dependent on a fast platform, provided that it isn't I/O bottlenecked from the start.
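
To put a number on how read intensive the trace is, here is a small Python sketch that tallies the operation counts listed above. Treating selects as the only read-only operation follows our own description of the four operation types; the variable names are illustrative:

```python
# Operation counts from the 30 minute recording.
trace = {"select": 105267, "update": 4984, "insert": 701, "delete": 5}

total = sum(trace.values())
read_only = trace["select"]
# Updates read and write; inserts and deletes also modify the database.
modifying = total - read_only

for op, count in trace.items():
    print(f"{op:<7}{count:>8}  {count / total:7.2%}")
print(f"total operations: {total}")
print(f"read-only share:  {read_only / total:.1%}")
print(f"modifying share:  {modifying / total:.1%}")
```

Roughly 95 out of every 100 operations in the trace are pure reads, which is why the write side of the disk subsystem plays such a small role here.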

If your particular database application is more write intensive the performance results should be similar in terms of the standings of the processors, but the performance gap will be decreased provided that the I/O doesn't change.

The nature of the AnandTech Forums database is that there are very few computationally intensive functions performed on the database; most of the functions are straight reads and writes. This places the performance dependency on having a fast platform, not necessarily CPUs with powerful integer/floating point units.

In our original test, the Dual Xeon 1.7GHz platform dominated and we're about to find out how the 760MP when coupled with Dual Athlon MPs stacks up.

If you're not shocked, you should be. AMD's first try at the server market is without a doubt an incredible success, and we've only shown you one real world benchmark thus far. A pair of Athlon MPs running at 1.2GHz on the 760MP are able to complete this 30 minute benchmark run in a little over 12 minutes; that's close to 20% faster than the Dual Xeon 1.7GHz system which happens to have over 50% more memory bandwidth.

This test clearly shows you that memory bandwidth isn't all that matters in these systems; factors like FSB bandwidth and even raw CPU power can come into play. If you are looking to upgrade your database server, look no further than the 760MP and a pair of Athlon MPs.

Interestingly enough, there is a fairly large difference between a pair of Athlon MPs and regular Athlons in this benchmark. The regular Athlons are not able to outperform the dual Xeon configuration, while the Athlon MPs do so quite easily. Even more interesting is that this performance difference isn't nearly as pronounced in single processor configurations. Remember that the enhancements made to the Athlon MP core are very data transfer oriented, and with the amount of data processing that goes on when serving content out of a 3GB database it isn't too much of a surprise that the Athlon MP does so well in a DP configuration here.

Another golden nugget to keep an eye on is the Duron. If AMD can ramp up the speed of the Duron, it would make for a great low-cost DB server platform, especially if 760MPX motherboards eventually hit the $200 price point. The biggest problem with the Duron being used in an MP system is that a lot of the tasks that MP systems are put to work on are very dependent on having large processor caches; even the 384KB total cache on the Athlon is cutting it pretty close. It will be quite incredible to see what a pair of Athlons with 512KB or 1MB of cache would be able to do, yet all current indications point to the Athlon sticking with 384KB total cache and only the Hammer series (K8) breaking that barrier.

As if the platform and performance alone weren't impressive enough, another major selling point here is the extremely low cost of Registered DDR SDRAM vs. the RDRAM that the Dual Xeon platform requires. You can purchase 1GB of Registered DDR SDRAM for less than $350 while an equivalent amount of RDRAM can be found for more than twice that at around $740. With most of these servers running 2 - 4GB of memory, the savings can add up.



3D Rendering Performance

The 760MP with Athlon MP has started out on the right foot, taking the lead in database serving scenarios but what about playing a role in a high-end 3D workstation? Sticking with the real-world theme of our benchmarks we decided to explore another problem faced by one of our team members.

Just recently, AnandTech Senior Hardware Editor Matthew Witheiler had begun working under 3D Studio MAX on the creation and subsequent rendering of some pretty interesting 3D scenes. The biggest complaint he had was that at 640 x 480, his scenes often took over two hours to render. While Matthew was just playing around with 3D Studio MAX, there are AnandTech readers out there that depend on high rendering performance for their jobs and actually use Kinetix's popular software package on a daily basis. If Matthew's fairly simple Space scene took 2 hours to render on a decently spec'd workstation, we can only imagine how long some of the more complicated scenes would take to render.

This brings us to our next benchmark, a 3D Studio MAX rendering test. The test scene consists of 4 objects total (2 spheres, a modified sphere, and a quad patch with noise modifiers) plus an omni light and a camera. Effects added to the scene are lens flare and lens glow on the omni light, and fire on the sun sphere. The asteroid (modified sphere) is set to have motion blur. The animation occurs over 300 frames and consists of moving the camera, rotating the planet, and moving and rotating the asteroid. The rendering is done in video post rendering so that the camera effects are added. AVI compression used is Cinepak Codec by Radius with a compression quality of 100. Size of movie is 320x240. Monitor was set to 32-bit color at 1024x768x32 and all boards were tested with GeForce2 Ultras. The program was set to use OpenGL and ran off of NVIDIA’s 12.01 drivers. 3D Studio MAX version 4.02 was used.

Performance here is dependent on not only a fast processor but a fast graphics subsystem; luckily 3D Studio MAX is multithreaded so it can easily take advantage of moving to multiple processors.

We've known that the Xeon (and Pentium 4) don't have the world's strongest FPUs and generally rely on SSE2 optimizations to deliver superior FPU performance; however, the Dual Xeon platform does perform quite well in this test. Once again the Dual Athlon MP, and this time around the Dual Athlon as well, are able to outperform Intel's latest Xeon creation.

The Dual Athlon MP setup is able to complete the animation process in 94% of the time of the Dual Athlon 1.2 which continues to support the theory that the enhancements made in the Athlon MP core are definitely of some use in a lot of high-end applications.

The Duron 850 in dual again offers performance that is nearly identical to the Dual Pentium III 933 setup, but it also provides a much more flexible upgrade path since you can drop a pair of Athlons or Athlon MPs into a dual Duron system later on when CPU prices fall even further. The weak points of the Duron continue to be its slower FSB (200MHz effective vs. 266MHz effective) and its smaller cache (192KB vs. 384KB).



Image Editing Performance

Today we live in an increasingly 3D world, at least when it comes to the hardware industry. There is a constant drive to bring as many technologies as possible to the third dimension; first it was modeling, then games, and now even scanners. But we can't forget that a lot of content creation and manipulation still involves two dimensional pictures.

Case in point would be Adobe Photoshop; undeniably the most well known 2D image editing package available today. In 3D Studio MAX, rendering and animation take the bulk of the time but in Photoshop, applying filters is what eats up the CPU cycles. Photoshop isn't as multithreaded as 3D Studio MAX but certain filters in Photoshop are multithreaded. With the latest patches and updates to the software, Photoshop 6.0.1 can also boast improved performance with the Pentium 4 and thus the Xeon as well.

In order to benchmark Photoshop we used PSBench which takes us through a sequence of applying around 20 filters individually (and reverting back to the original image afterwards) to a 50MB image. The performance is measured as time in seconds and is broken down according to the following table:

 
Time to Complete in Seconds (lower is better)

| Filter/Action | Duron 850 | Duron 850 (Dual) | Athlon 1.2GHz | Athlon 1.2GHz (Dual) | Athlon MP 1.2GHz | Athlon MP 1.2GHz (Dual) | Intel Xeon 1.7GHz (Dual) | Intel Xeon 1.7GHz | Intel Pentium III 933MHz (Dual) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rotate 90 | 8.0 | 8.7 | 7.2 | 7.5 | 7.8 | 8.2 | 7.9 | 7.8 | 6.8 |
| Rotate 9 | 15.2 | 13.2 | 14.4 | 10.3 | 11.4 | 10.5 | 10.9 | 13.8 | 10.7 |
| Rotate .9 | 16.0 | 12.5 | 14.4 | 10.6 | 11.8 | 10.8 | 11.1 | 13.0 | 10.3 |
| Gaussian Blur 1 pixel | 7.4 | 6.2 | 6.9 | 6.0 | 6.4 | 6.0 | 6.3 | 6.6 | 5.1 |
| Gaussian Blur 3.7 pixels | 15.9 | 12.4 | 16.6 | 10.5 | 12.1 | 10.7 | 10.5 | 11.8 | 11.4 |
| Gaussian Blur 85 pixels | 18.0 | 13.7 | 18.5 | 12.1 | 13.4 | 11.6 | 10.9 | 12.6 | 12.4 |
| 50%, 1 pixel, 0 level Unsharp Mask | 6.5 | 5.4 | 5.7 | 4.3 | 4.3 | 4.2 | 4.4 | 5.3 | 4.4 |
| 50%, 3.7 pixel, 0 level Unsharp Mask | 16.4 | 12.7 | 17.1 | 11.7 | 13.0 | 11.1 | 10.7 | 12.2 | 11.7 |
| 50%, 10 pixel, 5 level Unsharp Mask | 16.5 | 12.9 | 17.3 | 11.8 | 13.3 | 11.0 | 10.9 | 12.7 | 11.9 |
| Despeckle | 10.6 | 7.3 | 10.0 | 5.1 | 7.1 | 5.1 | 6.7 | 9.5 | 7.1 |
| RGB-CMYK | 29.5 | 29.5 | 28.2 | 21.0 | 20.5 | 20.2 | 26.6 | 26.5 | 26.7 |
| Reduce Size 60% | 5.2 | 3.7 | 5.3 | 2.9 | 3.7 | 2.9 | 2.8 | 3.3 | 3.2 |
| Lens Flare | 19.7 | 13.6 | 19.7 | 12.2 | 15.8 | 12.6 | 12.6 | 16.0 | 14.4 |
| Color Halftone | 28.7 | 28.4 | 28.8 | 20.0 | 18.8 | 19.2 | 30.1 | 30.8 | 3.3 |
| NTSC Colors | 11.1 | 11.4 | 9.8 | 7.6 | 7.1 | 7.3 | 8.6 | 8.4 | 9.4 |
| Accented Edges Brush Strokes | 34.3 | 34.5 | 34.0 | 24.1 | 23.3 | 23.6 | 25.1 | 25.7 | 28.3 |
| Pointillize | 57.4 | 34.7 | 57.7 | 26.7 | 41.6 | 26.7 | 26.2 | 42.1 | 29.7 |
| Water Color | 69.2 | 70.0 | 69.9 | 48.7 | 45.2 | 45.5 | 54.1 | 54.4 | 58.5 |
| Polar Coordinates | 33.0 | 23.8 | 35.6 | 16.0 | 22.7 | 15.5 | 17.3 | 28.1 | Failed |
| Radial Blur | 154.7 | 88.1 | 153.8 | 66.8 | 109.4 | 66.3 | 57.5 | 101.8 | 70.2 |
| Lighting Effects | 25.1 | 17.4 | 24.0 | 12.7 | 8.0 | 8.1 | 6.4 | 7.7 | 8.4 |

 

The total time taken to complete the filters is represented by the following graph:

The Xeon continues to remain competitive but the Athlon MP does nothing short of dethroning its Intel based counterpart in yet another test. Even with Pentium 4/Xeon optimizations, Photoshop is clearly faster on 760MP.

Before you let the tempting taste of dual Durons for less than $100 (for the CPUs at least) win you over, notice that the Dual Duron 850 system performed worse than a single Athlon MP 1.2GHz. Moving to two slower CPUs is almost never more beneficial than sticking with a single but noticeably faster CPU; remember that SMP isn't nearly as efficient as we'd like it to be.

The Duron's smaller cache size also plays a part in its performance placement.
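
The Dual Duron versus single Athlon MP comparison can be checked by summing the per-filter times from the PSBench table. The Python sketch below does that for those two columns only; the totals it prints are our own arithmetic on the table values and are not read from the graph:

```python
# Per-filter PSBench times in seconds, copied from the table above
# (same row order, Rotate 90 through Lighting Effects).
dual_duron_850 = [8.7, 13.2, 12.5, 6.2, 12.4, 13.7, 5.4, 12.7, 12.9, 7.3, 29.5,
                  3.7, 13.6, 28.4, 11.4, 34.5, 34.7, 70.0, 23.8, 88.1, 17.4]
single_athlon_mp = [7.8, 11.4, 11.8, 6.4, 12.1, 13.4, 4.3, 13.0, 13.3, 7.1, 20.5,
                    3.7, 15.8, 18.8, 7.1, 23.3, 41.6, 45.2, 22.7, 109.4, 8.0]

total_duron = sum(dual_duron_850)
total_athlon_mp = sum(single_athlon_mp)

print(f"Dual Duron 850:          {total_duron:.1f}s")
print(f"Single Athlon MP 1.2GHz: {total_athlon_mp:.1f}s")
print(f"Dual Duron takes {total_duron / total_athlon_mp - 1:.0%} longer overall")
```

By this tally the pair of Durons needs about 460 seconds against roughly 417 seconds for the lone Athlon MP, even though the Durons do win a handful of the well-threaded filters such as Radial Blur and Pointillize.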



Workstation Performance

The next aspect of performance to look at is workstation performance in CAD, imaging and development software. For this we used Ziff Davis Media's Dual Processor Inspection tests which run through three applications that can take advantage of multiple processors: Microstation SE, Photoshop 4, and Visual C++ 6.

Microstation SE is a CAD/Design package that is very stressful when it comes to x87 FPU calculations. This is an area where the Athlon has clearly excelled, and the benchmarks definitely prove it. Even the Dual Duron 850 is able to give the Dual Xeon 1.7 a run for its money.

Running applications with a good deal of legacy x87 code will prove to perform well on a Pentium III, even better on an Athlon and extremely well on a DP Athlon system. The performance here speaks for itself.

The Photoshop 4.0 test is the only one we have run into thus far where the Dual Xeon is able to outperform the Dual Athlon MP setup. Here the performance advantage is a noticeable 12% in favor of the Dual Xeon.

Visual C++ itself won't split up the compiling process into multiple threads; however, this particular test compiles two separate programs simultaneously, which definitely illustrates a situation in which having multiple processors can come in handy. Here the Athlon MP regains its performance lead over its biggest competitor with a respectable 14% performance gap.

The overall workstation performance crown obviously goes to the Dual Athlon MP; you've already seen the benchmarks that make up these impressive results.



Linux Performance

The Linux market stands to gain much from the release of AMD's dual processor platform. Linux is a much more server-oriented operating system than Microsoft's Windows products and thus its applications often demand and benefit from having multiple processors. Take Linux systems companies like Pogo, Penguin Computing and VA Linux, for example, who sell largely to the web serving market. Tyan's 760MP board appears designed specifically for this market, with slanted DIMM slots to allow it to fit within a 1U case. Given the simple-task, many-users nature of web serving, nearly perfect speed gains are possible with the increased parallel processing power. A rack of these machines would be impressive indeed.

While the extra CPU tends to benefit server tasks, desktop applications often see little or no performance improvement. For example, XFree86 (the low-level windowing server beneath your GUI) is single-threaded, as are most of the applications you would typically use. OpenGL acceleration can even suffer from reduced stability, as is the case with NVIDIA's drivers, which were used in our test beds, although we didn't stress these machines enough to find the limits.

To benchmark these systems, we used kernel compilation, which works well given its real-life nature and the ability to stipulate the number of concurrent processes.

We used Linux 2.4.4 for compilation even though 2.4.5 was just released, because it lines up with our previous article on the Xeon, which used the same test. We reran the Xeon tests anyway for consistency. Note that on 10 platforms, compiling with 3 different make parameters 3 times each means that we managed to compile 90 kernels! Kernels were compiled with default options -- run make menuconfig and quit or make config and hold return until you've exhausted the options. Kernels were compiled uncompressed.
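
For readers who want to try something similar, here is a hedged Python sketch of a timing harness along these lines. It is not the script we used. It assumes GNU make's -j option is what sets the number of concurrent processes, that the uncompressed image is built with the vmlinux target, and that the job counts are 1, 2 and 3; the function names are hypothetical:

```python
import subprocess
import time

def build_command(jobs, target="vmlinux"):
    """GNU make invocation for an uncompressed kernel image using `jobs` processes."""
    return ["make", f"-j{jobs}", target]

def time_build(src_dir, jobs):
    """Clean the tree, run one build in src_dir, and return wall-clock seconds."""
    subprocess.run(["make", "clean"], cwd=src_dir, check=True,
                   stdout=subprocess.DEVNULL)
    start = time.perf_counter()
    subprocess.run(build_command(jobs), cwd=src_dir, check=True,
                   stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

def benchmark(src_dir, job_counts=(1, 2, 3), runs=3):
    """Average build time in seconds for each job count over `runs` builds."""
    return {jobs: sum(time_build(src_dir, jobs) for _ in range(runs)) / runs
            for jobs in job_counts}
```

Calling benchmark() with the path to an already configured kernel source tree returns a dictionary mapping each job count to its average build time. Averaging three runs per setting mirrors the three-runs-each approach described above.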

Given that these scores are in seconds, the lower the value the better.

First, we can use this data to determine how much an architecture actually benefits from increased parallelism. The Athlons and Durons have a FSB with a point-to-point architecture, which yields more efficient usage of its 266MHz and 200MHz (respectively) DDR clock. The Xeon introduced a 400MHz quad-pumped FSB, a large improvement over the comparatively cramped 133MHz Pentium III FSB. Despite these differences, almost every platform showed similar improvements between 1 and 2 process SMP builds, with 2 process builds taking roughly 60% of the time 1 process builds took. Thus, the kernel compilation benchmark does not stress the FSB enough for one architecture to prove better than the others.
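
To make the 60% figure more tangible, the Python sketch below converts it into a speedup, a parallel efficiency, and an implied serial fraction. The last of these uses Amdahl's law, which is a model we are bringing in for illustration; it is not something the benchmark itself reports:

```python
def speedup(time_ratio):
    """Speedup implied by (parallel time / serial time)."""
    return 1.0 / time_ratio

def parallel_efficiency(time_ratio, n):
    """Speedup divided by the number of processors."""
    return speedup(time_ratio) / n

def serial_fraction(time_ratio, n):
    """Solve Amdahl's law, T(n)/T(1) = s + (1 - s)/n, for the serial fraction s."""
    return (time_ratio - 1.0 / n) / (1.0 - 1.0 / n)

ratio = 0.60  # 2-process build time / 1-process build time, from the text
cpus = 2

print(f"speedup:                 {speedup(ratio):.2f}x")
print(f"parallel efficiency:     {parallel_efficiency(ratio, cpus):.0%}")
print(f"implied serial fraction: {serial_fraction(ratio, cpus):.0%}")
```

Under this simple model, about a fifth of the kernel build behaves as if it cannot be spread across the second processor, which fits with every platform landing near the same 60% mark regardless of FSB design.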

Second, we can easily see that clock rate can be deceiving, as the 1.2GHz Athlon MPs outperformed the Xeon. The Athlon MP finished the 2-process SMP build in less than 90% of the Xeon's time. Further, the new Athlon MP core bested its previous generation by 4% in the same benchmark.

Third, the difference in 1-process builds between SMP and non-SMP architectures is very interesting. Note that every architecture improved when running with SMP enabled, meaning that despite serial compiling, there was still room for improvement. We particularly like this stat because it means one benefits from SMP without even running multithreaded / multi-process applications. Perhaps an odd and unimportant thing to notice, the Athlon and Athlon MP showed better improvements between 1-process non-SMP and 1-process SMP builds, with the SMP build taking 82% of the time the non-SMP test took. For comparison, the Xeon's SMP build took 87% as long as its non-SMP build. We can't think of a reason for this, but it's interesting anyway.

Finally, note that increasing the number of processes to more than the number of available CPUs actually degrades performance. Our only explanation is that in battling for CPU time, instruction, memory and filesystem cache hits become less common.



IT/Constant Computing Performance

We've been preaching about CSA Research's Benchmark Studio 2001 for a few months now, and the benchmark has gotten even better. In an effort to simplify the benchmarking process, CSA Research has built in various loading levels that can be used to stress the system. As you will remember from our other reviews using Benchmark Studio, the benchmark measures performance in a Constant Computing environment. In such an environment a user may be connecting to an exchange server to check email, pulling data off of the company's database, and streaming video off of an intranet server all while working on a Word document or creating a Power Point presentation. This type of a situation is what Benchmark Studio 2001 measures performance in by using stressors to simulate all of these tasks in the benchmark. As you can tell, the idea of Constant Computing is definitely far beyond that of a basic office PC; luckily this is another area where MP systems can be of some use.

We benchmarked in three separate configurations; the first is baseline office performance with no additional loading tasks working in the background. The second configuration is with the database, email and media player stress simulators all set to the lowest loading level and the third test setup is with all of the stress simulators set to the second highest loading level. The office benchmark scripts being executed while all of these loading simulators are being run never change, and performance is measured as a function of time in seconds.

With no major stress being placed on the system, the Athlon MP isn't able to offer any performance advantage over the Athlon and there is really no benefit to going to dual processors. The Xeon gets a bit of a performance boost since the single processor is already lagging behind in terms of office performance (high branch mis-predict rates causing the penalties associated with a long pipeline to appear), but for the most part, you shouldn't be considering a multiprocessor system if all you're going to be running is Word, Excel and PowerPoint.

Turning things up a notch immediately shifts the standings around. For one thing, almost all of the dual platforms rise to the top of this pot with the exception of the Dual Duron 850 which is struggling because of a relatively small cache. The performance difference between the top systems is noticeable but still too small to justify preferring one over another.

Increasing the load yet again finally causes the contenders to spread out even more and also illustrates a need for more computing power in a lot of these high-end workstation corporate LAN systems. Keep in mind that you really do have to be running more than just MS Office in order to justify this sort of power; but with multiple video streams being handled by your CPU along with persistent database and exchange server accesses taking place it isn't too surprising that the performance comes out the way it does.

AMD wasn't lying when they said that the Athlon MP core would offer up to a 15% performance improvement over the regular Athlon core. In this case the Athlon MP is able to complete the test in about 10% less time than the regular Athlon (when both are in DP configuration).



Overall System Performance

The last set of performance figures we're going to take a look at are from SYSMark 2001, a benchmark that isn't designed to show off MP performance by any means. The reason we include it is because it provides a great example of how power enthusiast users can benefit from MP systems without using applications that are specifically designed with parallel processing in mind.

These performance figures are for those of you that run a lot of applications at once but not necessarily ones that are multithreaded.

The Dual Xeon manages to pull ahead in one more benchmark; to find out why, we must consult the individual components of the SYSMark 2001 suite.



Internet Content Creation Performance

The Pentium 4 and Xeon have always done very well in this portion of the SYSMark 2001 test simply because of the bandwidth intensive nature of the benchmark. Although 760MP is quite impressive, even a regular Pentium 4 platform offers more memory bandwidth which can be very useful depending on the application scenario at hand. In this case, with Windows Media Encoder converting a video file while the system is working on a number of other content creation tasks, it is obvious that memory bandwidth is a necessity.

Here the Dual Xeon is still close to 20% faster than a pair of Athlon MPs on the Tyan Thunder K7 760MP platform. Being constrained by bandwidth limitations that won't go away by adding a second processor, the 760MP setup must take a second place position to the Xeon in this test.

The average response time metric mimics the SYSMark 2001 score for the Internet Content Creation test but it provides us with a tangible performance figure to look at and relate to the real world. In this case, for the Athlon MP, moving to dual processors decreased average response time of applications by 35%. Again you will notice that the performance increase is far from double when you move to a dual processor setup mainly because of the overhead associated with going to multiple processors including cache coherency issues.



Office Productivity Performance

Not everyone likes to encode videos while working in Photoshop; in fact, a good number of AnandTech readers don't even have Photoshop installed. The Office Productivity user is heavily running tasks like MS Office while browsing the net, unzipping files and actively scanning for viruses. The performance here is much more governed by disk I/O and won't scale nearly as well with dual processors as the Internet Content Creation test, but it is good to look at nevertheless.

Performance is very difficult to improve in this situation as it honestly doesn't take much to run well. Because of this, the Athlon MP and Xeon platforms perform nearly identically to one another.

Going to Dual Athlon MPs reduced average system response time by only 7%. This is obviously not the target market for the 760MP or multiprocessor systems in general.



Final Words

AMD has indeed come a long way in the past two years; they have gone from the unreliable low-cost manufacturer to the company that offers the highest performing DP x86 platform on the market - and that is exactly what the 760MP is.

Less than a month ago Intel announced their new Xeon processor and today AMD takes away any reason to purchase that very platform. Intel can continue this game of pretending that AMD isn't a threat, but the truth of the matter is that they are and they happen to be a very big one.

Intel isn't ill prepared; with the upcoming Prestonia (Xeon MP) chip supposedly shipping with Jackson technology enabled for the first time, it should be able to give AMD a run for their money, especially the larger cache versions. However that is still over half a year away, and AMD has a lot of time to prepare for that match up. Until then, the real question is, what configuration is best suited for the 760MP?

The new Athlon MP processors are obviously what AMD wants you to use on the platform, in spite of the fact that regular Thunderbirds will work just fine. For all intents and purposes, the Athlon MPs are your best bet. The architectural improvements contained within the Athlon MP core help considerably in the types of applications that would demand a dual processor solution such as the 760MP.

The Duron is an interesting match for the 760MP but currently the processor isn't too well paired with the platform. In many cases a Dual Pentium III would offer similar if not superior performance. In the future this may change but a crippling factor for the Duron will continue to be its small cache size. In fact, this is a limiting factor for the Athlon cores as well. Larger caches help considerably in these multiprocessor systems, and the market isn't going to want to wait until Hammer before seeing anything larger than 384KB from AMD.

The move to 0.13-micron cores should ideally be met with an increase in L2 cache size for the Athlon MP; it would help tremendously in a lot of the server performance figures.

If you can't tell by now, we're very impressed with the AMD 760MP chipset. In fact, one of the reasons we spent so long on this comparison is that the results would influence what platform we used in our next database server for the AnandTech Forums. The choice is simple; the AMD 760MP is the DP workstation and server platform to have. It's reliable, it's high performing and it's very flexible; everything you'd expect from an Intel based server solution, except that it's from AMD instead.

Three years ago we wouldn't be caught dead recommending an AMD based server solution. Today, we'd be crazy not to do so.
