Original Link: https://www.anandtech.com/show/1254



AMD got the attention of Microsoft with their 64-bit Athlon 64/Opteron platform, and it was enough attention to warrant a new OS port to x86-64. Just weeks ago AMD scored another victory, with Intel announcing the adoption of AMD's 64-bit extensions to x86.

Future Xeon and Pentium 4 processors will ship with the x86-64 extensions enabled, but architecturally they will be identical to the currently available Prescott based Pentium 4. The architectural similarity between Intel's IA-32e and IA-32 processors (IA-32e is Intel's marketing equivalent to AMD64) is an important point to note, as it means that if Opteron is able to outperform Xeon in 32-bit mode, it will maintain a performance advantage in 64-bit mode as well. We are assuming that Intel has no specialized hardware to improve 64-bit performance over AMD's solution, so the Xeon vs. Opteron comparisons we've brought you in the 32-bit world should still hold true in the 64-bit world later this year.

There has been much editorializing about Intel's recent 64-bit announcement, and we'll add nothing more than this to it all: it's a very good thing that Intel has gone the x86-64 route, because it means that we will see software support, drivers and overall market acceptance sooner. We have AMD to thank for Intel's backing of x86-64, which is a big feather in AMD's cap, but if there's one thing to be said about business it's that there's no room for pride.

Intel made the right decision; they would be losing sales if they didn't adopt x86-64, leaving those who needed a 64-bit x86 solution no option other than Opteron. However, Intel gives AMD nothing by adopting x86-64 in its own CPUs; AMD's sales don't increase, and remember what we said about pride in business.

We'll talk more about Intel's upcoming 64-bit Xeons (Nocona and Potomac) in the conclusion, but let's get to what we're all here to see today: AMD's Opteron and Intel's Xeon go head to head in a real-world database serving comparison.

We compared the two titans in our web serving tests late last year, where AMD left Intel in a cloud of dust. Now the stakes are much higher: can Intel's deeply pipelined architecture contend with AMD's server-grown Opteron?



A Confusing Market

IT managers have it tough; Intel's Xeon line honestly does not make much sense. Initially things were simple: Intel had dual processor Xeons simply branded as the Intel Xeon, and quad processor Xeons that were aptly named Xeon MP. The regular Xeon processors were validated for up to 2-way operation, while the Xeon MP could be used in 2-way, 4-way and 8-way servers.

The regular 2-way Xeons were basically desktop Pentium 4s, while the Xeon MPs included an on-die L3 cache. Fast forward to today and things have definitely changed.

We are comparing three different Intel cores to AMD's one and only Opteron core, so let's focus on the Intel cores first. Intel's Prestonia core is the 0.13-micron heart and soul of the 2-way Xeon processor now. The latest and greatest Prestonia based Xeon runs at 3.2GHz and features a 512KB L2 cache as well as a 2MB on-die L3 cache. This Prestonia should sound very familiar as it is basically a Xeon version of the Pentium 4 Extreme Edition, which was a Pentium 4 version of the Xeon MP at a higher clock speed. Yes, Prestonia is a server version of a desktop version of a server processor. In fact, the only difference between Prestonia and the Pentium 4 Extreme Edition (other than packaging) is that the Prestonia only supports the 533MHz FSB. Front Side Bus bandwidth is actually a very serious issue when it comes to Intel CPUs, but we'll talk about that shortly.

Next we have the Xeon MP processors based on Intel's 0.13-micron Gallatin core. The Gallatin core is what the Pentium 4 Extreme Edition was derived from, and offers 1MB, 2MB and now 4MB on-die L3 cache configurations. Prior to this article the largest cache size available on a Gallatin core was 2MB, but today Intel is launching their 4MB Gallatin parts. Both the Gallatin 2MB and 4MB parts continue to use a 400MHz FSB, which is the Xeon MP's Achilles' heel. The Gallatin 4MB parts are available in speeds of up to 3.0GHz, and we are including the 3.0GHz part in this review today.

AMD's offerings are much simpler; the Opteron is available in 1-way, 2-way and 4-way+ configurations: the 1xx, 2xx and 8xx series respectively. AMD's offerings haven't changed since our web server comparison, although we should see 2.4GHz Opterons debut in the near future.



FSB Impact on Performance

We've alluded to FSB bandwidth being a fundamental limitation in Intel's multiprocessor architecture, and now we're here to address the issue a bit further.

A major downside to Intel's reliance on an external North Bridge is that implementing multiple high-speed FSB interfaces becomes very expensive, as well as a difficult engineering problem, once you grow beyond 2-way configurations. Unfortunately, Intel's solution isn't a very elegant one: regardless of whether you're running 1, 2 or 4 Xeon processors, they all share the same 64-bit FSB connection to the North Bridge.

The following diagram should help illustrate the bottleneck:

In the case of a 4-way Xeon MP system with a 400MHz FSB, the four processors split 3.2GB/s of bus bandwidth, so each processor can be offered a maximum of 800MB/s of bandwidth to the North Bridge. If you try running a single processor Pentium 4 3.0GHz with a 400MHz FSB, you'll note a significant performance decrease, and that's while still giving the processor a full 3.2GB/s of FSB bandwidth; now if you cut that down to 800MB/s, the performance of the processor would suffer tremendously.
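The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes peak bandwidth is the effective FSB rate multiplied by the 64-bit bus width, and that the bus is divided evenly between processors, which real traffic never does exactly. The function names are ours, chosen for illustration.

```python
def fsb_bandwidth_mb_s(fsb_mhz: int, bus_width_bits: int = 64) -> float:
    """Peak FSB bandwidth in MB/s: effective MHz times bytes per transfer."""
    return fsb_mhz * (bus_width_bits // 8)


def per_cpu_share_mb_s(fsb_mhz: int, cpus: int) -> float:
    """Idealized even split of one shared FSB between `cpus` processors."""
    return fsb_bandwidth_mb_s(fsb_mhz) / cpus


# The 400MHz Xeon MP bus from the text, shared by 1, 2 and 4 processors.
for cpus in (1, 2, 4):
    share = per_cpu_share_mb_s(400, cpus)
    print(f"400MHz FSB, {cpus} CPU(s): {share:.0f} MB/s per CPU")
```

With four processors the share comes out at the 800MB/s quoted above; the same function shows why the 533MHz bus of the 2-way Xeon is in a far more comfortable position.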

It is because of this limitation that Intel must rely on larger on-die L3 caches to hide the FSB bottleneck; the more information that can be stored locally in the Xeon's on-die cache, the less frequently the Xeon must request data over the heavily trafficked FSB.

What's even worse about this shared FSB is that the problem grows larger as you increase the number of CPUs and their clock speed. A 2-way Xeon system won't experience the negative effects of this FSB bottleneck as much as a 4-way Xeon MP, and a 4-way Xeon MP running at 3GHz will be hurting even more than a 4-way 2.0GHz Xeon MP. It's not a nice situation to be in, but there's nothing you can do to skirt the issue, which is where AMD's solution begins to look much more appealing:

First, remember that each Opteron has its own on-die North Bridge and memory controller, so there are no external chipsets to deal with. Each Opteron CPU features three point-to-point HyperTransport links, each delivering 3.2GB/s of bandwidth in each direction (6.4GB/s full duplex). The advantage is clear: as you scale the number of CPUs in an Opteron server there are no FSB bottlenecks to worry about. Scalability on the Opteron is king, which is the result of designing the platform first and foremost for enterprise level server applications.

Intel may be able to add 64-bit extensions to their Xeon MPs, but the performance bottlenecks that exist today will continue to plague the Xeon line until there's a fundamental architecture change.



Hyper-Threading

Intel's Hyper-Threading technology has been widely accepted in the enterprise and desktop markets, to the point where the vast majority of systems ship with Hyper-Threading enabled and are left that way.

Our tests have shown that Hyper-Threading improved performance by 3-5% on average, and thus we left it enabled for all of our tests here.

The Tests

We ran two sets of tests for this comparison: an updated version of our own home-grown tests on the AnandTech Forums Database, as well as another more strenuous test representative of enterprise-class transactional database serving applications. We will discuss the two tests in greater detail in the coming pages, but first the basic hardware configuration for our tests:

AMD Opteron 848/248 and Intel Xeon/Xeon MP (Prestonia/Gallatin)
4GB DDR333 (NUMA was enabled for the Opteron)
8 x 36GB 15,000RPM Ultra320 SCSI drives in RAID-0
Windows 2003 Enterprise Server

Days, and then weeks, went by as we researched and regression-tested various benchmark methodologies in order to come up with fair, repeatable and, most of all, real-world database benchmarks. In the past, we've used a trace playback methodology to stress the database. While it served its purpose for the hardware that was tested, it was time for a change. This time around, we wanted to have two different tests: one that represented an average database load, like the AnandTech Forums, and another that represented an enterprise level workload.



Constructing a database benchmark (average load)

Our first new benchmark was custom written in .NET, using ADO.NET to connect to the database. The AnandTech Forums database, which was over 14GB in size at the time of the benchmark, was used as the source database. We'll dub this benchmark tool "SQL Loader" for the purposes of discussing what it does.

SQL Loader allows us to specify the following: an XML based workload file for the test, how long the test should run, and how many threads it should use to load the database. The XML workload file contains queries that we want executed against the database, and some random ID generator queries that populate a memory resident array with IDs to be used in conjunction with our workload queries. The purpose of using random IDs is to keep the test as real-world as possible by selecting random data. This test should give us a lot of room for growth, as the workload can be whatever we want in future tests.

Example workload:

<workload>
  <!-- A SAMPLE WORKLOAD QUERY THAT RETURNS ALL THE FIELDS FROM THE PRIVATEMESSAGES TABLE RANDOMLY -->
  <query>
    <code>select * from privatemessages where imessageid = @pmessageid</code>
    <type>read</type>
    <randkey>pmessageid</randkey>
  </query>
  <!-- RANDOM ID GENERATOR FOR SELECTING RANDOM PRIVATE MESSAGES -->
  <randomid>
    <rcode>select imessageid,newid() as pmsgid from privatemessages order by pmsgid</rcode>
    <name>pmessageid</name>
  </randomid>
</workload>
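To make the structure of the workload file concrete, here is a minimal Python sketch of how such a file could be read. It is not the SQL Loader source (that tool was written in .NET and is not reproduced here); the `parse_workload` function and the shape of the values it returns are assumptions made purely for illustration. Only the element names come from the sample above.

```python
import xml.etree.ElementTree as ET

# Same elements as the sample workload, with the comments left out.
WORKLOAD_XML = """<workload>
  <query>
    <code>select * from privatemessages where imessageid = @pmessageid</code>
    <type>read</type>
    <randkey>pmessageid</randkey>
  </query>
  <randomid>
    <rcode>select imessageid,newid() as pmsgid from privatemessages order by pmsgid</rcode>
    <name>pmessageid</name>
  </randomid>
</workload>"""


def parse_workload(xml_text):
    """Return (queries, generators) parsed from a workload document.

    queries    -- list of dicts with the SQL text, its type and its random key
    generators -- dict mapping a random key name to the SQL that fills its ID pool
    """
    root = ET.fromstring(xml_text)
    queries = [
        {
            "code": q.findtext("code"),
            "type": q.findtext("type"),
            "randkey": q.findtext("randkey"),
        }
        for q in root.findall("query")
    ]
    generators = {r.findtext("name"): r.findtext("rcode") for r in root.findall("randomid")}
    return queries, generators


queries, generators = parse_workload(WORKLOAD_XML)

# Every query that asks for a random key needs a matching generator.
missing = [q["randkey"] for q in queries if q["randkey"] not in generators]
print(f"{len(queries)} query, {len(generators)} generator, missing keys: {missing}")
```

The point of the split is that each `randomid` query runs once to fill an in-memory pool, and each `query` then draws its parameter from the pool named by `randkey`.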


A screenshot of the SQL Loader

Test Information

The workload used for the test was based on everyday use of the Forums, which are running FuseTalk. We took the most popular queries and put them in the workload. Functions such as reading threads and messages, getting user information, inserting threads and messages, and reading private messages were in the spotlight. Each iteration of the test was run for 10 minutes, with the first being from a cold boot. SQL Server was restarted in between consecutive test runs.
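For readers who want a feel for what a fixed-duration, multi-threaded load run looks like, the following Python sketch shows one way to structure it. Everything here is an assumption for illustration: `run_load`, its parameters and the `fake_execute` stand-in are hypothetical, and the real tool talked to SQL Server through ADO.NET rather than sleeping.

```python
import random
import threading
import time


def run_load(queries, id_pools, execute, threads=4, duration_s=1.0):
    """Run `threads` workers until the deadline; return total queries executed."""
    deadline = time.monotonic() + duration_s
    counts = [0] * threads  # one slot per worker, so no lock is needed

    def worker(slot):
        rng = random.Random(slot)
        while time.monotonic() < deadline:
            query = rng.choice(queries)
            key = query.get("randkey")
            # Pull a random ID from the pre-loaded pool for this query, if it uses one.
            param = rng.choice(id_pools[key]) if key else None
            execute(query["code"], param)
            counts[slot] += 1

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(counts)


def fake_execute(sql, param):
    """Hypothetical stand-in for a database call; it only simulates latency."""
    time.sleep(0.001)


demo_queries = [{"code": "select * from privatemessages where imessageid = @pmessageid",
                 "randkey": "pmessageid"}]
demo_pools = {"pmessageid": [101, 102, 103]}
total = run_load(demo_queries, demo_pools, fake_execute, threads=2, duration_s=0.2)
print(f"executed {total} queries in 0.2 seconds")
```

A real 10-minute iteration would simply pass `duration_s=600` and a database-backed `execute`.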

The importance of this test is that it is as real world as you can get; for us, the performance in this test directly influences what upgrade decisions we make for our own IT infrastructure.



AnandTech Forums Database Test Results

The results are split up into two categories: 2-way and 4-way setups. Remember that the 3.2GHz Prestonia based Xeon is only available in 2-way configurations and is thus absent from the 4-way graphs. The labels are as follows: CPU Name Clock Speed/FSB Speed/Cache Size (e.g. Xeon 3.0GHz/400/4MB = Xeon 3.0GHz, 400MHz FSB, 4MB L3 cache). Keep in mind that all Xeons have a 512KB on-die L2 cache, and all Opterons have a 1MB on-die L2 cache (but no L3 cache).



...but add another 2 processors to all of the systems and the Opteron flexes its muscle once again. It's clear that AMD put together a very scalable design with Opteron and it's paying off.



"Order Entry" Stress Test: Measuring Enterprise Class Performance

One complaint we've historically received about our Forums database test was that it isn't strenuous enough for some of the Enterprise customers to make a good decision based on the results.

In our infinite desire to please everyone, we worked very closely with a company that could provide us with a truly Enterprise Class SQL stress application. We cannot reveal the identity of the Corporation that provided us with the application because of non-disclosure agreements in place. As a result, we will not go into specifics of the application, but rather provide an overview of its database interaction so that you can grasp the profile of this application and better understand the results of the tests (and how they relate to your database environment).

We will use an Order Entry system as an analogy for how this test interacts with the database. All interaction with the database is via stored procedures. The main stored procedures used during the test are:

sp_AddOrder - inserts an Order
sp_AddLineItem - inserts a Line Item for an Order
sp_UpdateOrderShippingStatus - updates an Order's status to "Shipped"
sp_AssignOrderToLoadingDock - inserts a record to indicate which Loading Dock the Order should be shipped from
sp_AddLoadingDock - inserts a new record to define an available Loading Dock
sp_GetOrderAndLineItems - selects all information related to an Order and its Line Items

The above is only intended as an overview of the stored procedure functionality; obviously the stored procedures also perform other validation and audit operations.

Each Order had a random number of Line Items, ranging from one to three. The Line Items chosen for an Order were also randomized, drawn from a pool of approximately 1500 line items.

Each test was run for 10 minutes and was repeated three times. The average of the three runs was used. The ratio of reads to writes was maintained at 10 reads for every write. We debated for a long while about which ratio of reads to writes would best serve the benchmark and we decided there was no correct answer... so we went with 10.
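As an illustration of the workload shape described above, the Python sketch below generates randomized orders and a 10:1 read/write operation mix. The details are assumptions, not the vendor's implementation: we assume line item IDs run from 1 to 1500 and are drawn independently, and we enforce the ratio by issuing ten reads after each write, since the text does not say how the real application maintained it.

```python
import random

LINE_ITEM_POOL_SIZE = 1500  # "approximately 1500" in the text; exact size assumed
READS_PER_WRITE = 10


def make_order(rng):
    """Return a list of one to three randomly chosen line item IDs."""
    item_count = rng.randint(1, 3)
    return [rng.randint(1, LINE_ITEM_POOL_SIZE) for _ in range(item_count)]


def operation_stream(rng, writes):
    """Yield ("write", line_items) followed by ten ("read", None) operations, `writes` times."""
    for _ in range(writes):
        yield ("write", make_order(rng))
        for _ in range(READS_PER_WRITE):
            yield ("read", None)


rng = random.Random(2004)  # fixed seed so the sketch is repeatable
ops = list(operation_stream(rng, writes=5))
reads = sum(1 for kind, _ in ops if kind == "read")
writes = sum(1 for kind, _ in ops if kind == "write")
print(f"{writes} writes, {reads} reads, ratio {reads // writes}:1")
```

In the real test a "write" would correspond to inserting an Order and its Line Items through the stored procedures listed earlier, and a "read" to fetching an Order back.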

The application was developed in C#, with all database connectivity handled through ADO.NET. It used 20 threads: 10 for reading and 10 for inserting.

So as to ensure that IO was not the bottleneck, each test was started with an empty database that had been pre-expanded so that autogrow activity did not occur during the test. Additionally, a gigabit switch was used between the client and the server. During the execution of the tests, no other applications or monitoring software were running on the server. Task Manager, Profiler, and Performance Monitor were used when establishing the baseline for the test, but never during execution of the tests.

Before testing each platform, both the server and the client workstation were rebooted to ensure a clean and consistent environment. The database was always copied to the 8 disk RAID 0 array with no other files present to ensure that file placement and fragmentation were consistent between runs. In between each of the three tests, the database was deleted and the empty one was copied again to the clean array. SQL Server was not restarted.



To give you an idea of the scale of this benchmark we have graphs of stored procedure calls per second. We decided to focus on Stored Procedures / Second rather than Transactions / Second as the definition of a Transaction can have a business context or a technical context.
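The metric itself is simple to compute. The short Python sketch below turns a per-run count of stored procedure calls into calls per second and averages three runs, matching the procedure described earlier. The call counts used here are placeholders invented for the example, not measurements from any of the tested systems, and the function names are ours.

```python
from statistics import mean

RUN_SECONDS = 10 * 60  # each run lasted 10 minutes


def calls_per_second(total_calls, duration_s=RUN_SECONDS):
    """Stored procedure calls per second for a single run."""
    return total_calls / duration_s


def reported_score(calls_per_run):
    """Average the per-run throughput over all runs (three in the article)."""
    return mean([calls_per_second(calls) for calls in calls_per_run])


# Placeholder counts only; these are NOT results from the benchmark.
placeholder_runs = [600_000, 660_000, 630_000]
print(f"average throughput: {reported_score(placeholder_runs):.1f} stored procedures / second")
```

Counting stored procedure calls sidesteps the question of what a "transaction" is, since each call is one unambiguous unit of work from the client's point of view.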



The Opteron goes from lagging slightly behind the Xeon to offering an 8.5% performance advantage in a 4-way configuration. The Xeon's shared FSB severely clips its wings when moving to a 4-way setup.

If you're familiar with these sorts of database applications, the above graphs will give you a good idea of what sort of stress we're putting on these systems; we are pushing enterprise class performance limits. Now onto the results:



Order Entry Stress Test Results

The results are split up into two categories: 2-way and 4-way setups. Remember that the 3.2GHz Prestonia based Xeon is only available in 2-way configurations and is thus absent from the 4-way graphs. The labels are as follows: CPU Name Clock Speed/FSB Speed/Cache Size (e.g. Xeon 3.0GHz/400/4MB = Xeon 3.0GHz, 400MHz FSB, 4MB L3 cache). Keep in mind that all Xeons have a 512KB on-die L2 cache, and all Opterons have a 1MB on-die L2 cache (but no L3 cache).



And once again, moving to 4-way configurations shows that the Opteron is a much better scaler, and much better suited for 4-way servers than the Xeon MP.



Final Words

The 533MHz FSB 2MB L3 Prestonia based Xeon helps Intel tremendously in staying competitive with the Opteron. In fact, under heavy enough workloads there is virtually no performance difference between a 3.2GHz Xeon and a 2.2GHz Opteron (x48). It isn't until you move to 4-way configurations that AMD's platform architecture begins to flex its muscle. That being said, Intel has done an incredible job of keeping up performance-wise in 2-way configurations; we have a much better showing here than we did in the web server test.

Interestingly enough, while the new Gallatin Xeon MPs have a massive 4MB L3 cache, most of that cache will end up being used to keep traffic off of the bandwidth-starved 400MHz FSB. The performance gap between the Opteron 848 and the Xeon MP is amplified significantly once you move to a 4-way setup; the Xeon's shared bus just can't cut it anymore, not at 400MHz. AMD's point-to-point HyperTransport implementation helps extend their performance advantage significantly. An 8-way Opteron vs. Xeon comparison would not be pretty.

In a matter of months, Intel will begin transitioning their Xeon line to 90nm cores - more specifically Nocona (the replacement for the current Prestonia Xeon). The 90nm Xeons will be Prescott derived, which means they get all of the bittersweet changes that went into Prescott. At the same time, this next generation of Xeon processors will enable Intel's 64-bit IA-32e instruction set (read: x86-64). From a performance perspective we would expect the 90nm cores to perform noticeably worse than the current Xeons on a clock for clock basis, but it seems that Intel is avoiding an embarrassing launch by releasing the first Nocona based Xeons at 3.6GHz. With Nocona, Intel will also introduce the 800MHz FSB to the Xeon family - definitely a much needed step in the right direction. For 4-way servers, Intel will have to wait a bit longer; it won't be until the first quarter of 2005 before 64-bit extensions make their way into the Xeon MP processors using the 90nm Potomac core.

The comparison we've made here is a very important one; it identifies Intel's strengths and their weaknesses with Xeon, and it crowns Opteron a clear multiprocessor winner. An area that we didn't touch on is cost, which is where AMD truly shines. The Opteron 848 processors we tested are around 1/2 the price of Intel's 2MB L3 Xeon MPs and we have not seen retail data on how expensive the 4MB parts will be.

In a 4-way configuration AMD's Opteron cannot be beat, and thus it is our choice for the basis for our new Forums database server. We'll be documenting that upgrade in a separate article so stay tuned.
