It was less than three years ago that Intel released the Pentium II Xeon processor.  Based on the same core as the Pentium II and Celerons of the day, the Pentium II Xeon was introduced as a high-end workstation/server processor that could pick up where the Pentium Pro left off.

One of the main goals behind the Xeon was to offer a processor powerful enough to handle the most CPU-intensive workstation and server tasks while retaining the features of the P6 core that allowed it to perform well on home/office tasks.  The Pentium II Xeon countered the idea that a specialized work computer could not also be used for home/gaming applications.  It also helped Intel gain further ground in the multiprocessor workstation market, which had previously been dominated by non-x86 offerings.

The very first Pentium II Xeon had a full-speed L2 cache of up to 2MB.  However, because the 0.25-micron Pentium II die was already fairly large, the L2 cache wasn’t on-die; rather, it was contained in a separate chip that was connected to the CPU core by an external bus.  The Xeon family has come a long way since its first days in 1998.  With the Pentium III Xeon’s shrink to a 0.18-micron process, the processor core was able to house an on-die L2 cache of up to 2MB, tremendously increasing the cache performance of the platform.

Today Intel is continuing its trend of segmenting its flagship processors by introducing the next-generation Xeon processor, based on the Pentium 4’s Willamette core.  This processor, branded simply as the Intel Xeon processor, is being launched at 1.4GHz, 1.5GHz and 1.7GHz and has a core that is almost identical to the current desktop Pentium 4 with a few minor changes.



The Architecture of the Intel Xeon

The Intel Xeon processor shares essentially the same core as the desktop Pentium 4, meaning that every feature the Pentium 4 can boast, the Xeon can boast as well.  Unfortunately, this also means that the shortcomings that affect the Pentium 4 affect the Xeon too.

We’ve explained the architecture behind the Pentium 4 many times, so here is a brief rundown of all of the major features behind the Pentium 4 and the Xeon:

Hyper Pipelined Technology – The Xeon features a much longer pipeline than either the Pentium III or the Athlon.  This unfortunately means that the Xeon accomplishes less per clock; however, it does pave the way for the Xeon to achieve much higher clock speeds.  The theory is that doing less per clock doesn’t matter if the clock speed is high enough to more than make up the difference, giving the Xeon a net performance advantage over its predecessors.  Case in point: the Pentium III was only able to reach 1GHz on its 0.18-micron process, while the Xeon is currently at 1.7GHz on the same 0.18-micron process.  And as you’re about to see, there is a clear performance difference between the two.
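To put rough numbers on that trade-off, here is a minimal sketch that models performance as work-per-clock multiplied by clock speed.  This is a deliberately simplified model, and the 0.8 per-clock ratio used in the second calculation is a made-up illustrative value, not a measured figure; only the 1.0GHz and 1.7GHz clock speeds come from the text above.

```python
def break_even_ipc_ratio(clock_ghz, baseline_clock_ghz):
    """Fraction of the baseline's per-clock work the faster-clocked chip
    must retain just to match the baseline (simple IPC x clock model)."""
    return baseline_clock_ghz / clock_ghz


def relative_performance(ipc_ratio, clock_ghz, baseline_clock_ghz):
    """Performance relative to the baseline under the same simple model."""
    return ipc_ratio * clock_ghz / baseline_clock_ghz


# Clock speeds from the article: Pentium III at 1.0GHz, Xeon at 1.7GHz.
print(round(break_even_ipc_ratio(1.7, 1.0), 3))  # 0.588

# Hypothetical: if the Xeon did 80% of the Pentium III's work per clock.
print(round(relative_performance(0.8, 1.7, 1.0), 2))  # 1.36
```

In other words, under this model the 1.7GHz part only falls behind a 1GHz Pentium III if it does less than about 59% of the work per clock.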

Improved Branch Prediction – With such a long pipeline, an improved Branch Prediction Unit is a necessity, and the Xeon has one.  Its BPU is arguably the most advanced in this sector, and branch prediction is an area that has held back the Athlon’s performance somewhat.  In any case, the Xeon’s BPU must be solid; otherwise the penalties associated with its Hyper Pipelined Technology would cripple the P4 beyond repair.

Rapid Execution Engine – Two of the Xeon’s ALUs (Arithmetic Logic Units; they handle Integer operations) are double pumped, meaning they handle twice as many operations per clock, effectively giving them throughput identical to that of ALUs operating at twice the core frequency.  In the case of the 1.7GHz Xeon, this means that the ALUs operate as if they were normal ALUs (not double pumped) clocked at 3.4GHz.  As we have discovered in the past, this is necessary in order to provide the Xeon with respectable performance when running Integer code.  Integer code is generally much more susceptible to mis-predicted branches; the lower-latency, higher effective clock ALUs help minimize the branch mis-predict penalties associated with the Xeon’s extremely long pipeline when dealing with integer operations.

12K micro-op trace cache – This special cache replaces and improves upon the traditional L1 instruction cache.  The 8-way set associative Execution Trace Cache stores micro-ops after they have been decoded, and it stores them in the predicted path of execution.  This helps to hide some of the performance penalties caused by such a long pipeline.

256KB Advanced Transfer Cache – The Xeon’s L2 cache subsystem is quite incredible, to say the least.  Not only does the processor have a 256-bit internal pathway to its L2 cache, it is also able to transfer data from the cache once every clock, meaning that it has the highest peak cache bandwidth of any processor in its class.  At 1.7GHz, the Xeon has a maximum of 54.4GB/s of bandwidth to/from its L2 cache.  In comparison, a Pentium III at 1.0GHz can only offer 16GB/s of bandwidth for L2 data transfers, and an Athlon at 1.33GHz can only offer around 10GB/s of peak bandwidth (the Athlon only has a 64-bit datapath to its L2).
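The L2 figures above follow from multiplying bus width by clock speed by transfers per clock, as this small sketch shows.  The Xeon and Athlon inputs come from the text; the Pentium III's 256-bit width and transfer on every other clock are assumptions that are not stated above, chosen because they reproduce the quoted 16GB/s.

```python
def peak_cache_bandwidth_gb_s(bus_width_bits, clock_ghz, transfers_per_clock=1.0):
    """Peak bandwidth in GB/s: bytes per transfer x GHz x transfers per clock."""
    return bus_width_bits / 8 * clock_ghz * transfers_per_clock


# Xeon: 256-bit path, one transfer every clock, 1.7GHz (from the article).
xeon = peak_cache_bandwidth_gb_s(256, 1.7)

# Pentium III: assumed 256-bit path with a transfer every other clock.
pentium3 = peak_cache_bandwidth_gb_s(256, 1.0, transfers_per_clock=0.5)

# Athlon: 64-bit path at 1.33GHz (from the article).
athlon = peak_cache_bandwidth_gb_s(64, 1.33)

print(round(xeon, 1), round(pentium3, 1), round(athlon, 1))  # 54.4 16.0 10.6
```

The Athlon works out to roughly 10.6GB/s, which the text rounds to 10GB/s.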

Hardware Prefetch – The Xeon is able to predict what data it will need from main memory before that data is actually requested and fetch it directly into cache, so that when the request comes the data is already there.  In the event that the data isn’t needed, this becomes a waste of cache space and of FSB/memory bandwidth.  In either case, Hardware Prefetch is a FSB/memory bandwidth hog; luckily, the next feature of the Xeon architecture helps keep that from being a problem.

Quad Pumped 100MHz FSB + Dual Channel RDRAM – The Xeon has a 100MHz FSB that is quad pumped to offer data bandwidth equivalent to that of a 400MHz FSB, meaning it can transfer at most 3.2GB/s of data to the Xeon.  This bus runs synchronously with the i850’s (P4 chipset) dual channel RDRAM setup, which runs at 400MHz over two 16-bit wide buses for a total of 3.2GB/s of peak memory bandwidth.  While RDRAM was not necessary on the Pentium III platform, when coupled with the Xeon the bandwidth RDRAM offers is very much appreciated.
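As a quick check of how the two 3.2GB/s figures line up, the sketch below recomputes both.  Two inputs are assumptions not stated in the text above: a 64-bit wide FSB data path, and RDRAM transferring data twice per 400MHz clock.

```python
def bandwidth_gb_s(clock_mhz, transfers_per_clock, width_bits, channels=1):
    """Peak bandwidth in GB/s for a bus with the given clock, pumping and width."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer * channels / 1e9


# FSB: 100MHz, quad pumped, assumed 64-bit data path.
fsb = bandwidth_gb_s(100, 4, 64)

# RDRAM: 400MHz, assumed two transfers per clock, 16-bit channels, two channels.
rdram = bandwidth_gb_s(400, 2, 16, channels=2)

print(round(fsb, 1), round(rdram, 1))  # 3.2 3.2
```

Under those assumptions the memory subsystem can, at peak, feed the FSB exactly as fast as the FSB can feed the processor.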

SSE2 – The Xeon improves upon the original 70 SSE instructions with its 144 new SSE2 instructions; however, even under SPEC CPU2000, the performance improvement offered by SSE2 optimizations alone is supposedly around 5%.  With SPEC CPU2000 being a highly synthetic benchmark, it is unlikely that SSE2 would translate into any real world performance gains in today’s applications.  One thing that isn’t being taken into account here is SSE2’s ability to handle two 64-bit SIMD-Int and SIMD-FP (Single Instruction Multiple Data) operations.  This ability isn’t being taken advantage of in SPEC CPU2000 and could prove to be one of SSE2’s greatest assets.

Jackson Technology: Not this time around
