Original Link: http://www.anandtech.com/show/1164



It was an unexpected addition to the meeting; apparently the call had just been made prior to my arrival. I was standing in front of two systems running AMD "Hammer" processors, clocked at 800MHz, in both 32-bit and 64-bit OSes. Granted, the demos that AMD was running involved nothing more than a simple web server and a ball bouncing around the screen, but coming off the strong launch and execution of the Athlon XP, we all had high hopes for this next-generation chip.

Many will remember the aforementioned demo, as it happened almost two years ago just outside the convention center at the Intel Developer Forum; AMD always had a way of crashing the party, it seemed. It was at that show that we proclaimed that AMD had stolen the show from Intel, criticizing the CPU giant for giving us a fairly lackluster showing at IDF that year.

The AMD from IDF had promised us a chip by the end of the year, and given that we had all forgotten about the horribly executed K5 and mediocre K6 deployments, why were we to believe that they would do otherwise? Everyone expected AMD to deliver on their word because prior to Hammer, it was Intel that was coming up short on promises. A series of competitive paper launches in the early days of the Athlon and a poorly performing, overpriced Pentium 4 plagued Intel and tarnished their reputation in the community.

Fast forward almost two years and the Hammer is only now being released on the desktop as the Athlon 64 and the Athlon 64 FX. AMD has lost a lot of face in the community and in the industry as a whole, but can the Athlon 64 elevate them back to a position of leadership?

We've covered the Athlon 64 and its server brother, the Opteron, in great detail already, so be sure to check out our previous coverage for even more information before continuing on here.

AMD Opteron Coverage - Part 1: Intro to Opteron/K8 Architecture
AMD Opteron Coverage - Part 2: Enterprise Performance
AMD Opteron Coverage - Part 3: The First Servers Arrive
AMD Opteron Coverage - Part 4: Desktop Performance
AMD Athlon 64 Preview: nForce3 at 2.0GHz



An Early Christmas present from AMD: More Registers

In our coverage of the Opteron we focused primarily on the major architectural enhancements the K8 core enjoyed over the K7 (Athlon XP) - the on-die memory controller, improved branch predictor and more robust TLBs. For details on exactly what these improvements do and why they matter, we'll direct you back to our Opteron coverage; the same information applies to the Athlon 64 as we are talking about the same fundamental core.

What we didn't spend much time talking about in our Opteron coverage was the benefit of additional registers, a benefit that is enabled in 64-bit mode. To understand why this is a benefit let's first discuss the role registers play in a microprocessor.

Although we think of main memory and cache as a CPU's storage areas, the often overlooked yet very important storage areas are registers. Registers are individual storage locations that can hold numbers; these numbers can be values to add together, they can be memory addresses where the CPU can find the next piece of information it will need or they can be temporary storage for the outcome of one operation. For example, in the following equation:

A = 2 + 4

The number 2, the number 4 and the resulting number 6 will all be stored in registers, with each number taking up one register. These high speed storage locations are located very close to the processor's functional units (the ALUs, FPUs, etc.) and are fixed in size. In a 32-bit x86 processor like the Athlon XP or Pentium 4, the majority of registers will be 32 bits in width, meaning they can store a single 32-bit value. In 32-bit mode, the Athlon 64's general purpose registers are treated as being 32 bits wide, just like in its predecessor. However, in 64-bit mode all of the general purpose registers (GPRs) become 64 bits wide, and we gain twice as many GPRs. Why are more registers important and why haven't AMD or Intel added more registers in the past? Let's answer these two questions next.

Take the example of A = 2 + 4 from before; in a microprocessor with at least 3 registers, this operation could be carried out successfully without ever running out of registers. Internal to the microprocessor, the operation would be carried out something like this:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in Register 3

After the operation has been carried out, all three values are able to be used, so if we wanted to add 2 to the answer, the processor would simply add register 1 and register 3.

If the microprocessor had only 2 registers, however, and we ever needed to use the values 2 or 4 again, the result could not be allowed to overwrite either of them; it would have to be sent to main memory instead. Things would change in the following manner:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in a location in main memory

Here you can see that there is now an additional memory access that wasn't there before, and what we haven't even taken into account is that the location in main memory the CPU will store the result in will also have to be placed in a register so that the CPU knows where to tell the load/store unit to send the data. If we wanted to use that result for anything, the CPU would have to first go to main memory to retrieve the result, evict a piece of data from one of the occupied registers and put it in main memory, and then store the result in a register. As you can see, the number of memory accesses increases tremendously; and the more memory accesses you have, the longer your CPU has to wait in order to get work done - thus you lose performance. Simple enough? Now here's where things get a little more complicated: why don't we just keep on adding more registers?
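The effect of running out of registers described above can be illustrated with a toy model. The sketch below is not how any real CPU or compiler manages registers; the function name `count_memory_accesses`, the least-recently-used eviction policy and the rule that every eviction costs one store are all assumptions made purely for illustration. It simply counts trips to memory for the same sequence of value uses as the number of registers changes.

```python
from collections import OrderedDict


def count_memory_accesses(value_uses, num_registers):
    """Count memory accesses needed to service a sequence of value uses.

    Toy model: every value starts out in memory. Using a value that is not
    in a register costs one load. If all registers are occupied, the least
    recently used value is evicted first, which costs one store.
    """
    registers = OrderedDict()  # value name -> None, ordered by recency of use
    accesses = 0
    for name in value_uses:
        if name in registers:
            registers.move_to_end(name)    # already in a register: no memory traffic
            continue
        if len(registers) == num_registers:
            registers.popitem(last=False)  # evict the least recently used value
            accesses += 1                  # ...and store it back to memory
        registers[name] = None
        accesses += 1                      # load the requested value
    return accesses


# Three values, each used twice in turn.
uses = ["a", "b", "c", "a", "b", "c"]
print(count_memory_accesses(uses, 3))  # 3
print(count_memory_accesses(uses, 2))  # 10
```

With three registers every value is loaded once and then reused; with two, each use forces an eviction and a reload, which is the extra memory traffic the text describes.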

The beauty of the x86 Instruction Set Architecture (ISA) is that there are close to two decades of software that will run on even today's x86 microprocessors. One way this sort of backwards compatibility is maintained is by keeping the ISA the same from one microprocessor generation to the next; while this doesn't include things like functional units, cache sizes, or anything of that nature, it does include the number and names of registers. When a program is compiled to be run on an x86 CPU, the compiler knows that the architecture has 8 general purpose registers and when translating the programmer's code into machine code that the CPU can understand it references only those 8 general purpose registers. If Intel were to have 10 general purpose registers, anything that was compiled for an Intel CPU would not be able to run on an AMD CPU as the extra 2 general purpose registers would not be found on the AMD processor.

Microprocessor designers have gotten around this by introducing a technique known as register renaming, which makes only the allowed number of registers visible to software while the hardware renames other internal registers to juggle data around without going to main memory. Register renaming does fix a large percentage of the issues associated with register conflicts, where a CPU simply runs out of registers and must start swapping to main memory; however, there are some cases where we simply need more registers.
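The basic idea behind register renaming can be sketched in a few lines. This is a simplified illustration and not a description of the K8's actual rename hardware: the three-operand instruction format, the `rename` function and the internal register names `P0`, `P1`, ... are hypothetical, and the sketch never recycles internal registers the way real hardware does. It shows software using only a handful of architectural names while every new write is quietly given a fresh internal register.

```python
def rename(instructions, num_physical):
    """Rewrite instructions so that every write gets a fresh internal register.

    Each instruction is a tuple (dest, src1, src2) of architectural register
    names. Returns the same instructions using internal register names.
    """
    free = [f"P{i}" for i in range(num_physical)]
    mapping = {}  # architectural name -> internal register holding its latest value

    def allocate():
        if not free:
            raise RuntimeError("out of internal registers (real hardware would stall)")
        return free.pop(0)

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever internal register currently holds the value.
        # A source that has not been written yet is treated as a live-in value.
        for src in (src1, src2):
            if src not in mapping:
                mapping[src] = allocate()
        phys_src1, phys_src2 = mapping[src1], mapping[src2]
        # The destination always gets a new internal register, so reusing an
        # architectural name does not clobber a value that is still needed.
        mapping[dest] = allocate()
        renamed.append((mapping[dest], phys_src1, phys_src2))
    return renamed


program = [
    ("eax", "ebx", "ecx"),  # eax = ebx + ecx
    ("edx", "eax", "ebx"),  # edx = eax + ebx
    ("eax", "ecx", "ecx"),  # eax reused for an unrelated result
]
for instruction in rename(program, 8):
    print(instruction)
# ('P2', 'P0', 'P1')
# ('P3', 'P2', 'P0')
# ('P4', 'P1', 'P1')
```

The third instruction reuses the name `eax`, but because it is given `P4` rather than `P2`, it does not have to wait for the second instruction to finish reading the old value.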

When AMD introduced their AMD64 architecture, they had a unique opportunity on their hands. Because no other x86 processor would be able to run 64-bit code anyway, they decided to double the number of general purpose and SSE/SSE2 registers that were made available in 64-bit mode. Since AMD didn't have to worry about compatibility, doubling the register count in 64-bit mode wasn't really a problem, and the majority of the performance increases you will see for 64-bit applications on the desktop will be due to the additional registers.

What is important to note is that although AMD has increased the number of visible registers in 64-bit mode, the number of internal registers for renaming has not increased - most likely for cost/performance ratio constraints.



Where does 64-bit help?

Although the performance that will sell the Athlon 64 today has nothing to do with this, the 64-bit part of the equation will definitely play a role in the processor's future. With no final release of the 64-bit version of Windows XP, there is no popular OS support (we will touch on Linux support as well as Win64 support shortly) and no real application support at this time, but where will the 64-bitness of the Athlon 64 help?

There are three main categories that you can split up the performance benefits into: 32-bit applications running on a 32-bit OS, 32-bit applications on a 64-bit OS and 64-bit applications on a 64-bit OS; we will be analyzing each one of these scenarios individually.

Case 1: 32-bit apps under a 32-bit OS

At the launch of the Athlon 64, the predominant operating environment will be running 32-bit applications under a 32-bit OS. All performance benefits the K8 architecture will show here are courtesy of the on-die memory controller, improved branch predictor, higher clock speed and more robust TLBs - none of the performance improvements you'll see in this case will have anything to do with the 64-bit capabilities of the processor.

Case 2: 32-bit apps under a 64-bit OS

When Windows XP 64-bit Edition is officially released (a public beta is due out at the time of publication), many users will be running their 32-bit applications under the 64-bit OS.

Outside of the performance improvements that we just outlined in Case 1, there are a couple of additional benefits the Athlon 64 may offer users. Currently under Windows, although you have a physical memory limit of 4GB, any given process can only use up to 2GB of memory; the remaining 2GB is reserved for use by the OS. With the 32-bit applications under a 64-bit OS scenario, each 32-bit application could be given a full 4GB of memory to work with, instead of being limited to the 2GB Windows process size limitation. Unfortunately this benefit isn't really "plug 'n play" as the application would have to be aware that it can use the added memory, which in the vast majority of cases would require a new patch to be made available.
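The memory limits quoted above follow directly from the width of a 32-bit address. The short sketch below only restates that arithmetic; the variable names are illustrative and the even 2GB/2GB split is the one described in the text.

```python
GIB = 2 ** 30  # bytes in one (binary) gigabyte

address_bits = 32
total_address_space = 2 ** address_bits         # bytes reachable with a 32-bit address
user_space_32bit_os = total_address_space // 2  # half reserved for the OS, half per process
user_space_64bit_os = total_address_space       # 32-bit app gets the whole 32-bit range

print(total_address_space // GIB)  # 4
print(user_space_32bit_os // GIB)  # 2
print(user_space_64bit_os // GIB)  # 4
```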

The second benefit the Athlon 64 could offer in this scenario comes from the availability of additional registers. Although the 32-bit application would still only be compiled to use the regular set of 8 general purpose registers and standard set of FP and SSE2 registers, the 64-bit OS would be able to reference and use all of the registers at its disposal. The performance benefits that you would see here exist in any sort of task handling that the OS would be doing (switching between applications) as well as just regular Windows performance. Granted, the performance improvements seen here should be negligible considering the extra overhead that does exist when running 32-bit applications in a 64-bit environment (more on this in a bit).

Case 3: 64-bit applications under a 64-bit OS

The final scenario is the one that shows the most promise, yet has the least amount of application support today - running a 64-bit app under a 64-bit OS. Here, the benefits are numerous; not only do you get the performance improvements courtesy of the Athlon 64's architecture, but each application now has full access to the increased number of registers and each application can use much more than 4GB of memory.

Although the Athlon 64 can support 64-bit memory addressability, for demand reasons it only supports 40 bits of physically addressable memory - or 1TB (roughly 1,100GB), not exactly a limiting factor at this point.
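The amount of memory a given number of address bits can reach is a power of two, which makes the figure easy to check. The helper name `addressable_bytes` below is just an illustrative choice.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


physical = addressable_bytes(40)
print(physical)                           # 1099511627776
print(physical // 2 ** 30)                # 1024 binary gigabytes, i.e. 1TB
print(physical // addressable_bytes(32))  # 256 times the 32-bit limit
```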

The performance improvements developers are expecting to see under this final scenario have been estimated to be in the 10 - 20% range in tasks that are not memory bound, meaning that even applications using less than 2 - 4GB of memory in the first place will still see sizable performance gains courtesy of the availability of more registers. We will investigate a few of these scenarios to substantiate (or refute) these claims later on in the article.

Performance improvements where you are memory bound will be even more impressive; just think about how slow swapping to disk is and how much faster keeping everything in memory makes your computer.



AMD's Gem: Athlon 64

When you have an architecture that has been talked about publicly for a couple of years and when all of your partners have had access to CPUs for almost as long, it becomes very tough to keep things a secret. Leaks occur and it would be an understatement to say that AMD was plagued by a few leaks, so most of the information you're about to hear has been published elsewhere and already alluded to.

With that said, AMD has brought two versions of their K8 architecture to the desktop market - branded the Athlon 64 and the Athlon 64 FX. The Athlon 64 is the 754-pin ClawHammer that we've been hearing about all this time, while the Athlon 64 FX is little more than a higher clocked 940-pin Opteron.

Let's start with the regular Athlon 64; contrary to surprisingly popular belief, the regular Athlon 64 does include an on-die memory controller - what it doesn't include is the on-die 128-bit memory controller found on the Opteron. Instead, you will find only a single-channel 64-bit memory controller with the Athlon 64. This on-die memory controller supports regular unbuffered DDR SDRAM at speeds of up to DDR400.

The other major difference between the Athlon 64 and the Opteron is that the Athlon 64 only has a single Hyper Transport link. Remember that the K8 architecture does not have any external "Front Side Bus"; instead, serial Hyper Transport links connect the CPU to external chips such as a South Bridge, AGP controller or another CPU. With only one Hyper Transport link, there's no hope for the Athlon 64 to be used in multiprocessor environments as the sole Hyper Transport link would be tied up by the South Bridge/AGP controller. This lack of multiprocessor support is in direct contrast to the "lack" of multiprocessor support with the Athlon XP, which you could use in multiprocessor configurations; with the Athlon 64 it is physically impossible (unless you don't want any BIOS, hard drive or expansion slot support).

AMD originally announced that the Athlon 64 would have a 512KB L2 cache, however after continued delays and increased competition the Athlon 64 was given a full 1MB L2 cache. As we mentioned before, the 128KB L1 cache remains unchanged from the original Athlon XP and its exclusive nature means that the Athlon 64 has a total of 1088KB of cache for data storage (the remaining 64KB is for instruction storage).

The Athlon 64 will continue with AMD's model numbering system, although with a revised test suite. The end result is that AMD is much more conservative with their ratings, meaning that an Athlon 64 3200+ is inherently faster than an Athlon XP 3200+, despite carrying the same model number. As you've undoubtedly heard, the only Athlon 64 available at launch will be the 3200+, which will run at a 2.0GHz clock speed. The 2.0GHz clock speed is arrived at by taking the 200MHz Hyper Transport clock and multiplying it by a 10.0x clock multiplier. Currently there isn't a way to adjust the multiplier of Athlon 64 CPUs, so the only route to overclocking is to increase the Hyper Transport clock.
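The relationship between the reference clock, the multiplier and the core clock is simple multiplication. In the sketch below the 200MHz and 10.0x figures come from the text, while the 220MHz value is a hypothetical overclock used only to show how the result scales.

```python
def core_clock_mhz(reference_mhz, multiplier):
    """Core clock = reference (Hyper Transport) clock x CPU multiplier."""
    return reference_mhz * multiplier


print(core_clock_mhz(200, 10.0))  # 2000.0 -> the stock 2.0GHz
print(core_clock_mhz(220, 10.0))  # 2200.0 -> hypothetical 10% reference clock increase
```

Because the multiplier is fixed, any percentage increase in the reference clock shows up as the same percentage increase in core clock.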

In Q4 AMD will launch the Athlon 64 3400+, which we'd assume would be clocked at 2.2GHz. The 3400+ will be the last Athlon 64 for 2003, and although we will see lower clocked versions in the mobile space, that will be it for desktops. The 3400+ will be introduced at around $600.

The Athlon 64 3200+ will sell for $417 in 1,000 unit quantities.



Sigh, the Athlon 64 FX

With the release of the 865PE and 865G chipsets, Intel has ensured that virtually all Pentium 4 processors on the market are paired with very high-bandwidth dual-channel memory subsystems. Ignoring the performance boost Intel gains by going to dual-channel, OEMs demanded a dual-channel solution from AMD simply as a checkbox feature.

Not having the time or resources to undertake introducing a brand new dual-channel desktop processor, AMD simply took their existing dual-channel design and called it an Athlon 64 FX. The existing design was the Opteron of course, and the first incarnation of the Athlon 64 FX is almost directly borrowed from the Opteron. What do we mean by directly borrowed?

For starters, the Athlon 64 FX gets the Opteron's memory controller with a slight change - support for DDR400. Offering DDR400 support on the server side is a little trickier than on the desktop for a couple of reasons; server processors must go through more validation than their desktop counterparts and adding DDR400 to the list of validated configurations would increase testing time. Then there's the issue of bringing DDR400 support to motherboards; an issue whose complexity increases tremendously as the number of memory slots you have to support grows. Given the memory requirements of the server market (and associated memory slots), it's just easier to wait on DDR400 support.

On the desktop, DDR400 support is great and the 128-bit memory controller from the Opteron is also nice to have; however, there is one issue with the Opteron's memory controller that made its way to the desktop - the memory controller only supports buffered (aka registered) DIMMs. Although AMD is launching with Kingston releasing a line of HyperX registered DDR400 DIMMs, the vast majority of desktop users have invested in unbuffered DDR400 DIMMs and spending more on registered DIMMs isn't exactly an easy pill to swallow.

AMD's justification for no unbuffered support is that the Athlon 64 FX is for the "enthusiast" community and these "enthusiasts" will want to use lots of memory of densities that are currently only available in registered module sizes. Given that very few "enthusiasts" have registered DDR400 it seems much more likely that it was simply easier to re-badge the Opteron than modify the CPU to support unbuffered memory.

What is necessary to add unbuffered support? Unfortunately, it is a CPU packaging issue and not something that can be added on the motherboard (remember, the memory controller is on-die now). AMD plans on adding unbuffered support to the Athlon 64 FX, but that will come at a later date as they will have to redo the chip's packaging. It seems likely that AMD would introduce unbuffered support with the rumored 939-pin Athlon 64 FX due out next year since they are changing the package anyway to support a different pinout.

Although AMD says that the Athlon 64 FX is for use in single processor environments only, the current version appears to have all three Hyper Transport links - meaning that it can work in multiprocessor environments just like the Opteron. AMD has indicated that future versions of the Athlon 64 FX would only have a single Hyper Transport link, but there's no way of knowing when that will be.

With the Athlon 64 FX, AMD has abandoned their model number system in favor of a series nomenclature similar to the Opteron. For example, the first Athlon 64 FX is the series 51 CPU, running at 2.2GHz. The number 51 was chosen arbitrarily (AMD confirmed this) and indicates nothing about its performance relative to any chip other than the Athlon 64 FX. The next CPU due out next year will be the Athlon 64 FX 53, and all you are expected to know is that 53 is faster than 51.

There's no criticizing AMD for their Athlon 64 FX series numbers simply because it was our distaste with their original model numbers that brought this nomenclature about. We criticized the Athlon XP for using model numbers in the first place, we complained when AMD rated their processors too conservatively and then we lashed out at them for being too aggressive with the model numbers. Look at the facts: AMD labels the Athlon 64 FX as an "enthusiast" processor and only sends Athlon 64 FX parts out to reviewers. The fact of the matter is that AMD doesn't want to face criticism about their naming system any longer, so they've removed it where possible and kept it where they thought it was necessary. AMD will get no complaints from us about the series numbers attached to the Athlon 64 FX; it remains to be seen if the Athlon 64's model numbers will suffer the same fate as the Athlon XP's.

The FX goes back to using a ceramic package, as opposed to the organic packaging that the Athlon 64 uses. Both processors have an identical 193mm^2 die size (which is massive; these will be expensive chips to make) and are made up of 105.9 million transistors. The chips run at a 1.50V core voltage.

The 940-pin Athlon 64 FX will work in all 940-pin motherboards and the Athlon 64 FX 51 will be priced at $733 in 1,000 unit quantities.



Socket-939: Athlon 64 FX DOA?

As we've already mentioned, AMD is planning on releasing a 939-pin version of the Athlon 64 FX sometime next year. We're hearing rumors that it will be very early next year, which would leave early FX adopters in a not-so-great situation.

Unlike previous situations where a chip manufacturer has switched sockets, the 940 situation won't leave users completely abandoned as the Opteron uses the same socket and thus you can always upgrade your CPU or motherboard to an Opteron down the road. AMD is planning on making 940-pin CPUs for a while so that shouldn't be a big problem; the only issue will be that the number of performance enthusiast boards available in a 940-pin version will decrease over time, and the boards you find for Opterons will obviously not be made with the enthusiast in mind.

The Socket-939 Athlon 64 FX will most likely have unbuffered memory support from the start (it doesn't make sense for AMD not to offer the support as they're redoing the package anyway, unless they simply chop off a pin with this part) and will be shipping at higher initial clock speeds than the current FX, so it makes sense to wait.

Combine the launch of a 939-pin version with the outstanding performance of the Athlon 64 and you will see why we are fairly negative on the FX at this point in time. If you want our advice, stay away from the FX for now.



Motherboard Support Powered by NVIDIA & VIA

Unlike the release of the original Athlon, AMD has full industry support behind the Athlon 64 (although the same can't be said for industry confidence). We have seen chipsets from ALi, NVIDIA, SiS and VIA; however, only NVIDIA and VIA have a real presence at AMD's launch.

Our own Wesley Fink has prepared an article comparing the NVIDIA and VIA solutions, so be sure to check that out if you're interested in the detailed differences between the implementations of the two chipsets.

Because AMD has integrated the memory controller on the Athlon 64's die, the amount of work that needs to be done by the chipset vendors has been reduced significantly. The performance difference you'll see between chipsets should be negligible (even more so than in conventional architectures) as the only variables between chipsets are the South Bridge (IDE, PCI, SATA controllers) and the AGP controller.

AMD has no favorites in the chipset game; although they shipped all initial review systems with nForce3 boards, their reasoning was primarily one of availability, as they had to ship systems out in the summer to meet the deadlines faced by print publications.

As you will find out in Wesley's review, the nForce3 is currently limited to a 600MHz Hyper Transport link between the CPU and the chipset, while VIA's solution runs at 800MHz. The performance difference due to VIA's bandwidth lead is negligible however; remember, we're not talking about memory bandwidth, rather bandwidth between the CPU and the AGP controller. NVIDIA will have 800MHz support in the next version of the nForce3, the 250.

Despite the fact that chipset costs have gone down (as there's no more memory controller), motherboards will not reflect the lower price initially according to motherboard manufacturers. AMD is positioning the Athlon 64 as a premium part and thus the motherboard manufacturers will position their solutions competitively, but don't expect to see lower-than-Socket-A prices.

What's also interesting is the incredible recognition that NVIDIA has managed to establish in the chipset industry with the nForce brand. We are seeing incredible support for nForce3, despite the fact that it doesn't really offer anything above and beyond VIA's solution. We're expecting the nForce3 to be positioned as a premium solution, while VIA will compete for the lower end of the Athlon 64 market - all because of the success of NVIDIA's nForce2 brand; the name nForce3 somehow just sounds all that much more powerful, even though NVIDIA's powerful memory controller isn't being used.

At the start, it looks like NVIDIA will begin to pull ahead as the market leader, but it is unclear how VIA's support for an 800MHz HT bus and potentially lower price point will change things (if at all).



Where is the software?

AMD sent out evaluation systems with a beta copy of Windows XP 64-bit Edition, and if you remember back several months the alleged point of delaying the Athlon 64's launch was to coincide with the release of Windows XP 64-bit Edition. From what we're hearing, although a beta of the OS will be available very soon, the final release will not be until Q1 2004.

We managed to do some 32-bit performance testing on the 64-bit Edition of Windows XP; however, out of our entire benchmark suite only one test would complete, and that test was actually slower under Win64. The latest build of the beta is supposed to have better performance, but for now don't expect 64-bit Windows to be a reality. Without application support, the 64-bit "experience" is quite anticlimactic. We browsed the web in 64-bit Internet Explorer for a while before rebooting the system and getting some work done in 32-bit Windows.

The story is much different under Linux, where we managed to run tests under Red Hat Enterprise 2.9.5WS (Taroon), also a beta release. We will talk more about our results under Linux later in the article.



Intel's Preemptive Strike - Pentium 4 Extreme Edition

As we announced at last week's Intel Developer Forum, Intel preempted AMD's Athlon 64 launch with a release of their own - the Pentium 4 Extreme Edition.

The Extreme Edition is a 169 million transistor Pentium 4, currently running at 3.20GHz (800MHz FSB) with Hyper-Threading support, and featuring a 2MB on-die L3 cache in addition to the standard 512KB on-die L2 cache.

The point of adding such a large L3 cache is basically to give the Pentium 4 as many of the benefits of an on-die memory controller as possible, without actually integrating one. Intel is wary of the on-die memory controller approach, simply because of the horrible experience they had with attempting to push the market in the direction of RDRAM 4 years ago; thus a large L3 cache is the next best option.

A large L3 cache helps to hide the overall memory latency by keeping more frequently used data in the L3 cache, and Intel chose the size of the cache very wisely. For example, a single frame of DVD quality video can't fit into a 1MB cache but a 2MB cache is more than enough to store it. The vertex buffer data in most modern day games also happens to fit quite nicely in the 2MB that Intel chose for the Extreme Edition (EE).
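Whether a frame fits in a cache of a given size is a matter of simple multiplication, although the answer depends on the resolution and pixel format, neither of which is specified here. The sketch below assumes a 720x480 frame stored at 4 bytes per pixel; both values are assumptions chosen for illustration, and other formats will produce different sizes.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Size in bytes of one uncompressed video frame."""
    return width * height * bytes_per_pixel


MIB = 2 ** 20  # bytes in one (binary) megabyte

# Assumed format: 720x480 at 4 bytes per pixel (illustrative only).
frame = frame_bytes(720, 480, 4)
print(frame)             # 1382400
print(frame > 1 * MIB)   # True: too large for a 1MB cache
print(frame <= 2 * MIB)  # True: fits in a 2MB cache
```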

Intel is toying with the idea of releasing an Extreme Edition version of every high-end Pentium 4 (e.g. Prescott 3.40GHz Extreme Edition), however nothing is set in stone yet. We have already passed along the information that an Extreme Edition processor would truly be worthy of the name if Intel would unlock the processors, allowing overclockers to freely push their processors. In order to combat remarking, we also passed along the suggestion that only lower multipliers be made available.

Both of these suggestions were provided by AnandTech readers and were very well received by Intel; it may take some time, but we may be able to get the chip-giant to budge on this one.

The Pentium 4 3.2 EE will be available in the next month or two and will sell for around $740 in 1,000 unit quantities. The processor will work in all current motherboards, most of which will not require a BIOS update.

The Test

We used nForce3 boards from ASUS (Socket-940) and Shuttle (Socket-754) to keep our Athlon 64 vs. Athlon 64 FX numbers as comparable as possible. All systems were configured with 512MB of DDR400 SDRAM and used ATI Radeon 9800 Pro 256MB cards with the latest Catalyst 3.7 drivers.



Memory Latency & Bandwidth Performance

Given AMD's on-die memory controller, memory latencies should be reduced significantly and bandwidth efficiency improved a bit. In order to validate these hypotheses we turn to ScienceMark 2.0 to give us an indication of memory latency and bandwidth performance:

Although the Pentium 4 and the Athlon 64 FX have the same amount of theoretical memory bandwidth, the Athlon 64 FX comes out ahead by a significant margin thanks to the on-die memory controller.
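Theoretical peak bandwidth is the number of transfers per second multiplied by the width of each transfer. The sketch below works this out for the configurations discussed in this article; the 64-bit width of the Pentium 4's front-side bus and the 400 million transfers per second implied by DDR400 are not stated in the text and are included here as assumptions, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_second, bus_width_bits):
    """Peak bandwidth in GB/s (decimal) = transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_second * 1e6 * bytes_per_transfer / 1e9


athlon64 = peak_bandwidth_gb_s(400, 64)      # single-channel DDR400
athlon64_fx = peak_bandwidth_gb_s(400, 128)  # dual-channel (128-bit) DDR400
pentium4 = peak_bandwidth_gb_s(800, 64)      # 800MHz FSB, assumed 64 bits wide

print(f"{athlon64:.1f}")     # 3.2
print(f"{athlon64_fx:.1f}")  # 6.4
print(f"{pentium4:.1f}")     # 6.4
```

These are upper bounds only; the measured results show how much of that peak each memory controller actually delivers.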

Memory latency is reduced significantly over the Athlon XP thanks to the on-die memory controller, although there is much to be said about Intel's 865/875 memory controllers by looking at the latency comparison seen above.

Here we have the same comparison as before, just with the performance measured in CPU clocks and not ns.
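Converting between the two units is a single multiplication: a latency in nanoseconds times the clock speed in GHz gives the latency in CPU clocks. The 50ns figure below is a made-up value used only to show the conversion, not a measured result from these tests.

```python
def latency_in_clocks(latency_ns, cpu_clock_ghz):
    """Convert a memory latency in nanoseconds into CPU clock cycles."""
    return latency_ns * cpu_clock_ghz


# Hypothetical 50ns latency seen from CPUs running at different clock speeds.
print(f"{latency_in_clocks(50, 2.0):.0f}")  # 100
print(f"{latency_in_clocks(50, 3.2):.0f}")  # 160
```

The same absolute latency costs a higher-clocked CPU more cycles, which is why the two views of the data can rank processors differently.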



Business Application Performance

When we use the term "business applications" we normally think about computer users sitting in cubicles using their PCs as a part of their daily jobs - checking email, editing documents, flipping through presentations and such. It turns out that a good number of users outside the workplace use their PCs in a "business" oriented fashion, basically using their systems for email, web browsing and document editing.

The Business Winstone 2002 benchmark suite has been with us for quite some time and stresses multitasking environments that incorporate actions such as checking email, browsing the web, editing Word documents, Excel spreadsheets and PowerPoint presentations.

The nature of these tasks predominantly stresses the integer execution units of modern day microprocessors as well as their load/store units for memory accesses. Given that we're dealing with mostly integer code, a good deal of it happens to be filled with conditional branches (read: if action A then do action B). As we've seen in previous investigations, architectures with shorter pipelines tend to do much better with branch-happy code. So let's look at the results:
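A common back-of-the-envelope way to see why pipeline depth matters for branch-heavy code is to add the average misprediction cost to a base cycles-per-instruction figure. Everything in the sketch below is an assumption made for illustration: the formula is a simplified textbook-style model, and the branch fraction, misprediction rate and penalty values are placeholders rather than measurements of any processor discussed here.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once branch mispredictions are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Placeholder workload: 20% of instructions are branches, 5% of those mispredict.
shorter_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=12)
longer_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)

print(f"{shorter_pipeline:.2f}")  # 1.12
print(f"{longer_pipeline:.2f}")   # 1.20
```

In this model the penalty grows with the number of pipeline stages that must be flushed, so the same misprediction rate costs the longer pipeline more cycles per instruction.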

Here the performance advantage is clearly AMD's; the shorter pipeline of the Athlon 64 combined with the large L2 cache and the on-die memory controller make the Athlon 64 a very strong performer under business applications.



Content Creation Performance

Content Creation Applications are the new performance drivers for CPUs; applications like Photoshop, Dreamweaver and media encoding applications are all examples that fall into the content creation category.

The performance paradigm is much different under content creation applications compared to business apps, as you will see from the performance results below:

Intel is still ahead in content creation applications; in this particular benchmark AMD does do much better if you apply a new media encoder patch after installing the benchmark, so you can choose whether or not to take these results at face value. We're not fans of modifying benchmark configurations and would rather wait for the next version of Winstone to be released in order to depict a more accurate picture of performance here (we have many more content creation benchmarks for you to look at to help you decide today, however).



Gaming Performance

We are at an interesting point in PC gaming history; as DX9 titles are released, the way we look at performance will change considerably. Unfortunately we are not able to bring you performance numbers based on titles such as Half-Life 2 just yet, but come September 30th the release of the benchmarking demo will allow us to do just that. Until then, we're left with our usual DX7/DX8 benchmarking titles to show off:

The Athlon 64 does extremely well under Quake III Arena, historically an Intel-dominated test. The Pentium 4 3.2 EE and Athlon 64 FX perform very similarly in this test.

Not all tests are easily influenced by CPU performance, as is made evident by these Splinter Cell results that show virtually no performance difference between all platforms.

Under Unreal Tournament 2003, AMD and Intel are virtually tied for first place...

Although when we're talking pure physics and AI performance, AMD is the clear leader. The Pentium 4 EE manages to regain some lost ground for Intel, but not enough.



Development Workstation Performance

One test we've always been asked to run is a Visual Studio/Visual C compile test; however, we never had a project large enough to compile. Sitting around talking about Athlon 64 testing one day, we came up with the idea of using the publicly available Quake 3 source code as a compile test for CPUs. For this next test we timed how long it took for the Quake 3 source code to compile.

This compile test should be a relatively good indicator of overall compile performance, which will be very useful for those of you with very large projects that can take anywhere from many minutes to well over an hour to compile.

These results shouldn't be too surprising, as compilers are very branch-happy applications, which definitely penalizes long pipelines like the Pentium 4's. There will be more optimized compilers available for the Pentium 4 in the future that may exploit multithreaded compiling with the generation of helper threads to show performance improvements courtesy of Hyper Threading, but for now, if you want a good development workstation, AMD is the way to go.



DivX Encoding

We have been using a DivX encoding test as a part of our CPU benchmarking suite for quite some time now; however, the performance test has never been truly realistic, as it wasn't geared towards producing a high quality DivX rip. Rather, it was designed to stress CPU performance.

We have since revised our benchmark and now follow the DivX 5 encoding guide published at Doom9.net. For our test title we use Chapter 9 from The Sum of All Fears DVD. We conduct a 2-pass encoding process and report the encoded FPS from both passes averaged together. The results are lower than our previous Xmpeg tests, however they are much more applicable to real-world usage.
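As a rough illustration of how a two-pass result collapses into a single number, the sketch below assumes that "averaged together" means a plain arithmetic mean of the two per-pass frame rates. The function names and the sample rates in the usage note are hypothetical and are not measurements from this review. A second helper shows the time-weighted alternative (total frames over total time, assuming both passes process the same number of frames), which comes out slightly lower whenever the two passes run at different speeds.

```c
#include <assert.h>

/* Hypothetical helper: arithmetic mean of the two per-pass encode rates.
   Assumption: this is what "averaged together" means in the text. */
double average_fps(double pass1_fps, double pass2_fps)
{
    assert(pass1_fps > 0.0 && pass2_fps > 0.0);
    return (pass1_fps + pass2_fps) / 2.0;
}

/* Alternative for comparison: total frames divided by total time when both
   passes process the same frame count (the harmonic mean of the two rates). */
double time_weighted_fps(double pass1_fps, double pass2_fps)
{
    assert(pass1_fps > 0.0 && pass2_fps > 0.0);
    return 2.0 * pass1_fps * pass2_fps / (pass1_fps + pass2_fps);
}
```

With made-up rates of 40 and 60 FPS, `average_fps` gives 50 while `time_weighted_fps` gives 48, so the choice of averaging method matters more as the two passes diverge.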

Intel continues to do extremely well under content creation applications such as DivX encoding; the clear leader here is still Intel.



3D Rendering

Finally we have our 3D Rendering tests, which are composed of Lightwave and 3dsmax. 3D rendering has been increasingly favorable to the Pentium 4 due to the vast number of SSE2 optimizations that are present in current applications, but with the Athlon 64's SSE2 support will the playing field be leveled at all?

AMD manages to come very close to Intel, finally, in 3D rendering performance that is accelerated by the use of SSE2 optimizations. Intel still holds the lead, but AMD is finally competitive.

AMD even manages to take the lead in this Lightwave test, something that was never thought possible from AMD before the Athlon 64.

Closing off with 3dsmax, Intel still holds the performance lead here as well.



32-bit vs. 64-bit Performance

Our entire benchmark suite to this point has been on 32-bit applications under a 32-bit OS, mostly because there are no good desktop 64-bit applications at this point in a popular 64-bit OS (not to mention the issues with 64-bit Windows XP we described earlier).

Under Linux, however, we don't have to wait for applications to be released in a 64-bit version; we can simply recompile them. Linux thus provides us with an excellent venue to see the tangible performance increases from exposing the additional general purpose registers in 64-bit mode.

We ran all benchmarks on Red Hat Enterprise 2.9.5WS (Taroon), a beta release, booted in single user mode to avoid system services interfering with benchmark results. Neither Red Hat 9 nor 9.0.93 Beta (Severn) supplies a 64-bit compiler or libraries, which is why we used Taroon.

The Taroon kernel initially had issues on startup, requiring us to disable APIC and ACPI support to get it to install. Once actually running, the OS was quite stable; however, DMA disk access was disabled for some reason.

We used the following compiler that came with Taroon:

gcc 3.2.3 20030502 (Red Hat Linux 3.2.3-16)

And the following kernel:

2.4.21-1.1931.2.393.ent

With this compiler and kernel we ran the following tests:

Whetstone

A simple C loop measuring floating point performance, configured to do double precision calculations.

Compiled with:
-O3 -msse2 -mfpmath=sse (and -m32 for 32bit, -m64 for 64bit)

The performance improvements due to 64-bit are in the 10 - 20% range we mentioned earlier.
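To make the shape of such a test concrete, here is a minimal sketch of a double-precision timing loop. It is a hypothetical stand-in, not the Whetstone source: the kernel, the function names and the iteration count are all assumptions made for illustration. The idea is that the same file can be built twice with the flags listed above (once with -m32, once with -m64) and the two timings compared.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in kernel, NOT the Whetstone source: a chain of
   dependent double-precision multiply-adds that converges toward 2.0. */
double fp_kernel(long iterations)
{
    double x = 1.0;
    long i;
    assert(iterations >= 0);
    for (i = 0; i < iterations; i++)
        x = x * 0.5 + 1.0;
    return x;
}

/* CPU seconds consumed by one run of the kernel; the kernel's result is
   stored through `result` so the compiler cannot discard the loop. */
double time_kernel(long iterations, double *result)
{
    clock_t start = clock();
    *result = fp_kernel(iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Reports the pointer width of the build: 32 for an -m32 binary,
   64 for an -m64 binary on x86 targets. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * 8);
}
```

Calling `time_kernel` with a large iteration count from each binary and printing `pointer_bits()` alongside the elapsed time is enough to confirm which mode produced which number.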

Bytemark

An old integer CPU benchmark (FP results were discarded) - for more information on the tests visit this site.

Compiled with:
-O3 -msse2 -mfpmath=sse (and -m32 for 32bit, -m64 for 64bit)

Here we do see a small 2% drop in performance when moving to 64-bit in one test, however the rest of the tests show a 0 - 15% improvement across the board.

Lame 3.93

An MP3 encoder; we encoded a 40-minute .wav file (403MB).
Lame args: -b 192 -m s -h --quiet <file> - >/dev/null
(192kbps, simple stereo, high quality, output to nothing to avoid disk hits)

Compiled with:
-O3 -fomit-frame-pointer -fno-strength-reduce -malign-functions=4 -funroll-loops -ffast-math -msse2 -mfpmath=sse (again, -m32 for 32bit, -m64 for 64bit)

The performance improvement here is astounding: in 64-bit mode the Athlon 64 FX managed to finish the encode 34% quicker than in 32-bit mode. If these results are any hint of what could be in store for Windows users, there's a lot of promise behind the Athlon 64, assuming we get software support in time.

We wanted to do a transcode benchmark but that didn't work out: one library ran into a bug in gcc, and transcode refused to compile. It actually forced a compile error because a structure came out padded, meaning they didn't expect anyone to run it on a 64-bit machine just yet.
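A figure like "34% quicker" can be read two ways, and the sketch below shows both. It assumes the phrase means the 64-bit run took 34% less wall time; if it instead meant 34% higher throughput, the time saved would be smaller. The function names and the sample times in the usage note are hypothetical and are not the measured encode times.

```c
#include <assert.h>

/* Percentage of wall time saved going from the 32-bit run to the 64-bit run. */
double percent_time_saved(double t32_seconds, double t64_seconds)
{
    assert(t32_seconds > 0.0 && t64_seconds > 0.0);
    return (t32_seconds - t64_seconds) / t32_seconds * 100.0;
}

/* Speedup factor: how many times faster the 64-bit run is. */
double speedup(double t32_seconds, double t64_seconds)
{
    assert(t32_seconds > 0.0 && t64_seconds > 0.0);
    return t32_seconds / t64_seconds;
}

/* If a percentage is a throughput gain instead, this converts it to the
   equivalent percentage of time saved. */
double time_saved_from_throughput_gain(double gain_percent)
{
    assert(gain_percent > -100.0);
    return (1.0 - 1.0 / (1.0 + gain_percent / 100.0)) * 100.0;
}
```

With made-up times of 100 seconds and 66 seconds, `percent_time_saved` returns about 34 and `speedup` about 1.52, whereas a 34% throughput gain corresponds to only about 25% less time.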
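For readers unfamiliar with this failure mode, the sketch below shows how a structure can pick up padding when pointers grow from 4 to 8 bytes, and how a compile-time guard can turn a layout assumption into a build error. The struct, the macro and the function names are hypothetical; none of this is taken from transcode or its libraries, and the negative-array-size trick is only one assumed way such a check might have been written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, not taken from transcode or its libraries. */
struct record {
    int32_t id;       /* 4 bytes in both modes */
    void   *payload;  /* 4 bytes with -m32, 8 bytes with -m64 */
};

/* Bytes the compiler inserted for alignment: typically 0 for a 32-bit x86
   build and 4 for a 64-bit build, where `payload` must sit on an 8-byte
   boundary. */
size_t record_padding(void)
{
    return sizeof(struct record) - (sizeof(int32_t) + sizeof(void *));
}

/* Compile-time guard: an array type with a negative size is illegal, so a
   false condition stops the build. */
#define LAYOUT_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

/* This guard holds on common 32-bit and 64-bit ABIs, so the file builds. */
LAYOUT_CHECK(payload_is_pointer_aligned,
             offsetof(struct record, payload) % sizeof(void *) == 0);

/* A guard written with only 32-bit machines in mind would fail under -m64:
   LAYOUT_CHECK(record_is_8_bytes, sizeof(struct record) == 8);            */
```

Enabling the commented-out guard and building with -m64 reproduces the kind of forced compile error described above, since the structure grows to 16 bytes once the padding is added.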



Final Words

Seemingly overnight AMD went from being about to fall off the performance charts to being competitive with Intel's latest and greatest. But there's much more to this situation than proclaiming a winner and leaving it at that; AMD has lost a considerable amount of credibility, and the Athlon 64 (and FX) of today will not bring AMD back to the heyday of the Athlon.

For starters, at 192mm^2, the Athlon 64 and Athlon 64 FX are well above AMD's "sweet spot" for manufacturing. When we last talked with AMD's Fred Weber, we were told that a 100 - 120mm^2 die size is ideal for mass production given AMD's wafer size, yields and other manufacturing characteristics, and the Athlon 64 is close to twice that size. For the Athlon 64 to become the mainstream part that AMD wants it to be, they need to significantly reduce the die size, which is exactly what the move to 90nm would allow. The mass market success of the Athlon 64 is directly dependent on AMD's ability to move to 90nm; until then the 64 will be exclusively a high-end part.

You can also understand AMD's desire to bring to market a 256KB L2 version of the Athlon 64, as reducing the cache size would cut down not only on the ~106M transistors but also on a significant amount of die area.
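A back-of-the-envelope estimate shows why the 90nm move matters so much. The sketch below assumes the current part is built on a 130nm process (a figure not stated in this section) and that die area scales with the square of the feature size, which is an idealization that real shrinks rarely achieve. The function names are hypothetical.

```c
#include <assert.h>

/* Idealized die area after a process shrink: area scales with the square
   of the linear feature size. Real designs usually shrink less than this. */
double scaled_die_area(double area_mm2, double from_nm, double to_nm)
{
    double linear;
    assert(area_mm2 > 0.0 && from_nm > 0.0 && to_nm > 0.0);
    linear = to_nm / from_nm;
    return area_mm2 * linear * linear;
}

/* True when a die is no larger than the top of the 100 - 120mm^2 range
   quoted in the text. */
int at_or_below_sweet_spot(double area_mm2)
{
    return area_mm2 <= 120.0;
}
```

Under these assumptions, `scaled_die_area(192.0, 130.0, 90.0)` comes out to roughly 92mm^2, which would put the same design at or below the quoted range, whereas 192mm^2 is well above it.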

AMD has also priced the Athlon 64 and Athlon 64 FX very much like the Pentium 4s they compete with, which is a mistake for a company that has lost so much credibility. AMD needed to significantly undercut Intel (but not as much as they did with the Athlon XP) in order to offer users a compelling reason to switch from Intel. However, given the incredible costs of production (SOI wafers are more expensive as well) and AMD's financial status, AMD had very few options with the pricing of their new chips.

When it comes down to recommendations, the Athlon 64 offers very compelling performance at a much more reasonable price point than the Athlon 64 FX. We cannot recommend the FX until AMD does release a version with unbuffered memory support and we would strongly suggest waiting until the Socket-939 version is released if you are considering the FX.

What is promising, however, are the performance gains we saw when recompiling for 64-bit on the Athlon 64; if AMD can actually get 64-bit applications and a compatible OS from Microsoft out in the market, then the recommendations become much more positive for AMD. Until then, it's wait and see. AMD has done well, but execution isn't a singular task; it is continued execution that will guarantee AMD a spot at the top of the market again.
