Original Link: http://www.anandtech.com/show/1363




Introduction

When we first heard that Intel would be continuing the Celeron tradition with a Prescott based "D" line, we were a little skeptical. When we further heard that the Celeron D would only be getting a quarter of the cache its underperforming Pentium 4 parent has, our eyes widened with doubt. Sure, a bump up to a 533MHz FSB would help, but it couldn't possibly make up for the kind of performance issues that we saw with the Pentium 4 E; could it?

Looking back over the past couple of months, we can almost imagine Intel knowing what everyone was thinking and going along quietly with a little smirk on its face. That's right, our first inclinations that Celeron D performance would be worse than Intel's already atrocious budget performance were utterly and completely wrong.

In fact, the new Celeron D is a big step up in performance over the Northwood-based Celeron.

We've gone from thinking that this would be a quick article on the hastening demise of the lowest value "value" chip on the market to an article about how Intel is taking a step in the right direction, while we are once again reminded that knowing the ins and outs of an architecture is no substitute for performance numbers. Of course, that was the point of requiring scaling graphs and analysis along with our simulators back in Microprocessor Architecture class.

Before getting to the numbers, we'll take a brief look back at what's inside the new Prescott based Celeron, and we'll try to understand exactly what makes Celeron D so special.

UPDATE: When this article was first published, the L2 cache size of Northwood based Celeron processors was incorrect. The information has been corrected, and the article updated accordingly. Thanks to everyone who pointed out our error. We apologize for any inconvenience we may have caused.


Under The Hood of Celeron D

For an in-depth look at what's different with the new Celeron, the first 11 or so pages of our Pentium 4 E (Prescott) launch article do an excellent job of covering the bases. For a quick summary, here's a look at the major changes inside the Prescott core:
  • 90nm Strained Silicon Process - more, faster transistors in less space
  • 31 Pipeline Stages - for clock speed ramping
  • Improved Branch Predictor - helps avoid pipeline stall
  • Improved Scheduler - helps avoid doing unnecessary work
  • Improved Execution Core - added integer multiply and fast shift to ALU
  • Larger, Slower Caches - higher latency caches for speed and size scaling
  • SSE3 - 13 new instructions
The Celeron D gets an additional bonus of an FSB speed increase from 400MHz to 533MHz as well.

Even with the ominous 31-stage pipeline and higher latency caches, we get better performance with the new Celeron D. So, how does all this stack up to make Prescott a better Celeron than Northwood? Well, let's take it step by step.

First of all, the 16kb L1 cache size of Prescott has a significant impact on the Celeron. Northwood based Celerons only have 8kb of L1 cache. With 8kb more of the on die data stored "closer" (in terms of latency) to the processor, we will definitely see more cache hits get to the processor quicker, in spite of the fact that cache latency on Celeron D is the same as on Pentium 4 E, and Prescott's cache latency is much higher than Northwood's. Improving this ability to recover is critical: even though Celeron D has an increased L2 cache, the size of on die memory is still small, and cache misses will occur more often than on the Pentium 4.

When dealing with a processor short on cache and prone to very painful pipeline stalls, improving the average cache hit latency can really help to keep extra stalls from happening (a fast L2 hit will come back in about 25 cycles on Prescott), and can help to refill the pipeline once it's stalled (as more data will be able to get back into the pipeline faster).
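The way hit rates and latencies combine can be sketched with the textbook average memory access time formula. In the sketch below, only the roughly 25-cycle L2 hit figure comes from the text above; the L1 latency, the memory latency, and every hit rate are hypothetical placeholders chosen for illustration, not measurements. L1 latency is also held constant to isolate the effect of a higher hit rate.

```python
def amat(l1_hit_rate, l1_latency, l2_hit_rate, l2_latency, mem_latency):
    """Average memory access time in cycles for a two-level cache.

    l2_hit_rate is the local hit rate: the fraction of L1 misses that hit in L2.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_latency + l1_miss * (l2_latency + l2_miss * mem_latency)


# Hypothetical values, except the ~25 cycle L2 hit quoted in the text.
L1_LATENCY = 4       # assumed
L2_LATENCY = 25      # from the text
MEM_LATENCY = 300    # assumed

# Assumed L1 hit rates for a smaller and a larger L1 cache.
small_l1 = amat(0.90, L1_LATENCY, 0.80, L2_LATENCY, MEM_LATENCY)
large_l1 = amat(0.93, L1_LATENCY, 0.80, L2_LATENCY, MEM_LATENCY)

print(f"smaller L1: {small_l1:.2f} cycles per access")
print(f"larger L1:  {large_l1:.2f} cycles per access")
```

Under these assumed numbers, a few points of extra L1 hit rate trim the average access time noticeably, because every avoided L1 miss also avoids a chance of going all the way to memory.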

This 8kb of extra L1 cache is a much smaller portion of Pentium 4's total cache size. Since Pentium 4 E has fewer cache misses than Celeron D (it has 4 times the L2 cache), improvements to the L1 cache size don't have as much opportunity to shine.

Speaking of L2, the Celeron D has received an increase from 128kb in the current Celeron to 256kb. Even though this is still a quarter of the (still insufficient) 1MB cache the Pentium 4 E has, we aren't going to see the same type of performance drop we saw when moving from the Northwood Pentium 4 to Celeron (which also had a quarter of its big brother's cache). The reason is that the number of cache hits increases rapidly with cache size at first and then hits a point of diminishing returns. The curve is similar to a logarithmic curve (benefits increase rapidly as cache size increases at first, but then level off quickly).

What it comes down to is that doubling a small cache (say, going from 128kb to 256kb) will have a much higher impact on performance (because the number of cache hits is significantly increased) than doubling a larger cache (like going from 512kb to 1MB). In other words, P4 E gets less benefit from its doubled L2 cache than Celeron D.
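The diminishing-returns argument can be illustrated with a common rule of thumb in which miss rate falls with the square root of cache size. This power-law model and its constants are assumptions made for illustration; the article does not state a specific curve, and real workloads vary widely.

```python
def miss_rate(size_kb, base_miss=0.20, base_size_kb=128, exponent=0.5):
    """Rule-of-thumb model: miss rate scales as size ** -exponent.

    base_miss is a hypothetical miss rate at base_size_kb.
    """
    return base_miss * (base_size_kb / size_kb) ** exponent


def hits_gained_by_doubling(size_kb):
    """Absolute drop in miss rate (extra hits per access) from doubling the cache."""
    return miss_rate(size_kb) - miss_rate(2 * size_kb)


small = hits_gained_by_doubling(128)   # 128KB -> 256KB, the Celeron case
large = hits_gained_by_doubling(512)   # 512KB -> 1MB, the Pentium 4 case

print(f"128KB -> 256KB: {small:.4f} fewer misses per access")
print(f"512KB -> 1MB:   {large:.4f} fewer misses per access")
```

Under this assumed model, the first doubling removes twice as many misses per access as the second, which is the shape of the curve described above.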

While we're on the subject of caches and memory, the 533MHz frontside bus effectively gets data from memory to the processor faster in case of a cache miss. This is very important in the low-cache environment of the Celeron world. Unfortunately, we couldn't increase our multiplier and run our 2.8 GHz Celeron 335 at 28x100 to see just what kind of impact bus speed has on the new processor.
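For reference, the multiplier and bus arithmetic behind these figures can be sketched as follows. It assumes the Pentium 4 style quad-pumped, 64-bit frontside bus, so a "533MHz" FSB corresponds to a base clock of about 133MHz and a "400MHz" FSB to 100MHz. The function names are illustrative only.

```python
def core_clock_mhz(multiplier, base_clock_mhz):
    """Core clock is the multiplier times the base (un-pumped) bus clock."""
    return multiplier * base_clock_mhz


def fsb_peak_bandwidth_gbs(base_clock_mhz, pumps=4, bus_width_bytes=8):
    """Theoretical peak FSB bandwidth in GB/s (1 GB = 10**9 bytes)."""
    transfers_per_sec = base_clock_mhz * 1e6 * pumps
    return transfers_per_sec * bus_width_bytes / 1e9


BASE_533 = 400 / 3   # ~133.33MHz base clock behind the 533MHz FSB
BASE_400 = 100       # base clock behind the 400MHz FSB

stock_335 = core_clock_mhz(21, BASE_533)       # 2.8GHz part on the 533MHz bus
slow_bus_335 = core_clock_mhz(28, BASE_400)    # the 28x100 setting mentioned above

bw_533 = fsb_peak_bandwidth_gbs(BASE_533)
bw_400 = fsb_peak_bandwidth_gbs(BASE_400)

print(f"21 x 133MHz = {stock_335:.0f}MHz, 28 x 100MHz = {slow_bus_335:.0f}MHz")
print(f"533MHz FSB peak: {bw_533:.2f} GB/s")
print(f"400MHz FSB peak: {bw_400:.2f} GB/s")
```

Both settings give the same core clock, so any difference between them would come from the bus alone, which is why the 28x100 run would have been interesting.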

The enhancements Intel made to branch prediction and scheduling round out the factors that help make Prescott an excellent Celeron core. Since we're working with a small L2 cache, it is especially important to work with good data and avoid stalls for reasons other than cache misses. Northwood is at a disadvantage to Prescott here. Better branch prediction will help avoid filling the cache with data from a mis-predicted branch as well as aid in averting unnecessary bubbles in the pipeline for the same reason. Better scheduling means more efficient use of the data available to the processor as well. Northwood is stuck on these two counts. Adding an integer multiply and fast shift/rotate to Prescott also helped the Celeron D maintain a high level of efficiency, but this really shouldn't have any greater impact on Celeron D than on Pentium 4.
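Why better branch prediction matters more on a deeper pipeline can be sketched with a simple stall model. The pipeline depths (20 stages for Northwood, 31 for Prescott) are the commonly cited figures; the branch frequency and misprediction rates are hypothetical, and treating the flush penalty as equal to the pipeline depth is a simplification.

```python
def mispredict_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


BRANCH_FRACTION = 0.20   # assumed: one instruction in five is a branch

# Assumed 5% misprediction rate on a 20-stage pipeline.
northwood = mispredict_cpi(BRANCH_FRACTION, 0.05, 20)
# Same assumed predictor accuracy on a 31-stage pipeline.
prescott_same = mispredict_cpi(BRANCH_FRACTION, 0.05, 31)
# Assumed improved predictor (3% mispredictions) on a 31-stage pipeline.
prescott_better = mispredict_cpi(BRANCH_FRACTION, 0.03, 31)

print(f"20 stages, 5% mispredicts: {northwood:.3f} extra CPI")
print(f"31 stages, 5% mispredicts: {prescott_same:.3f} extra CPI")
print(f"31 stages, 3% mispredicts: {prescott_better:.3f} extra CPI")
```

In this toy model, the deeper pipeline costs more per misprediction, so the predictor has to improve just to break even; any improvement beyond that point is a net gain.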

It all comes down to being resilient and efficient. Northwood is very dependent on its L2 cache size. The enhancements Intel made to Prescott in order to avoid that large negative impact of adding so many pipeline stages really benefit the processor when it is starved for data. Prescott has to be more careful not to stall just to keep up with the current Pentium 4 line. As a result, the Celeron flavor can deal with tighter constraints on L2 cache size, which help even more when paired with a larger cache than the Northwood derived version.



CPU Model Numbers and Pricing

A little more than a month ago, we brought you an update on Intel's roadmaps that included the new Celeron D processors and their model numbers. Aside from a nice handy guide to how Celeron D stacks up to the Northwood Celeron, we can fill in pricing information on the new processors.

Intel Celeron Processors
CPU Name         Clock Speed   L1 Cache Size   L2 Cache Size   FSB Speed   Fab Process   Est. Price
Celeron D 335    2.8GHz        16KB            256KB           533MHz      90nm          $117
Celeron D 330    2.66GHz       16KB            256KB           533MHz      90nm          $89
Celeron D 325    2.53GHz       16KB            256KB           533MHz      90nm          $79
Celeron 2.6GHz   2.6GHz        8KB             128KB           400MHz      130nm         $91
Celeron 2.0GHz   2.0GHz        8KB             128KB           400MHz      130nm         $65


And just to make sure we've got all the useful info in one neat little package, we'll include our Celeron D core enhancement list from the previous page as well.
  • 90nm Strained Silicon Process - more, faster transistors in less space
  • 31 Pipeline Stages - for clock speed ramping
  • Improved Branch Predictor - helps avoid pipeline stall
  • Improved Scheduler - helps avoid doing unnecessary work
  • Improved Execution Core - added integer multiply and fast shift to ALU
  • Larger, Slower Caches - higher latency caches for speed and size scaling
  • SSE3 - 13 new instructions
So now that we know what we're dealing with, let's take a look at the performance tests.




The Test

There are a few numbers that we are going to want to pay attention to in the following tests. First, obviously, will be the performance of the Celeron D compared to the Northwood based 2.6GHz and 2.0GHz parts and competing Athlon XP parts. Ideally, we would have dug up a multiplier unlocked Northwood Celeron and run it at 2.8GHz, but the performance advantage of both the 330 and 325 over the 2.6 should be enough to show how a 2.8GHz Northwood based Celeron would perform versus the 335.

The second set of numbers that we want to look at are our FSB underclocked Celeron D numbers. We ran our Prescott Celeron at 20x100 for a direct comparison to the Northwood based core. With the same multiplier, FSB, and platform, we are able to take a focused look at the Celeron D performance difference due to architecture and L1/L2 size changes in the Prescott core. These numbers will be collected on the first page of benchmarks.

This time around, our D865PERL board could not be resurrected for testing the Celeron D. We had no choice but to retest our Celeron 2.0 and 2.6 on an ABit 865 board (which performs a little better than an Intel D875PBZ). The extra 5% or so performance improvement wasn't enough to help push the Northwood based Celerons out from under the bottom of the pile. Other than that difference, our testing set-up is the same as the one used in December.

Performance Test Configuration
Processor(s):          Intel Celeron D 335 (2.8GHz)
                       Intel Celeron D 330 (2.66GHz)
                       Intel Celeron D 325 (2.53GHz)
                       Intel Celeron 2.6GHz
                       Intel Celeron 2.0GHz
                       AMD Athlon XP 2600+
                       AMD Athlon XP 2500+
                       AMD Athlon XP 2400+
                       AMD Athlon XP 2200+
                       AMD Athlon XP 1700+
                       AMD Duron 1.6GHz
RAM:                   2 x 256MB DDR400 @ 2:3:3:6
Hard Drive(s):         2 x Western Digital Special Edition
Chipset Drivers:       Intel Chipset Driver 5.00.1009
Video Card(s):         ATI Radeon 9800 Pro 256
Video Drivers:         ATI Catalyst 3.9
Operating System(s):   Windows XP Professional SP1
Motherboards:          ASUS A7N8X Deluxe
                       ABit IS7 (Intel 865)



Celeron D vs. Celeron

As the following graph shows, over all our benchmarks, the new Celeron D outperforms the Northwood based Celeron even when clock speeds and FSB speeds are equal. This leaves only the core improvements and extra L1/L2 cache as variables in performance difference.

We would be happy with a small performance increase considering we first expected a performance drop. Of course, the reality is that most of our benchmarks see more than 10% performance increases (with one improving over 25%).

For these benchmarks, both the Celeron D and the Celeron were run at a 100 MHz FSB with a 20x multiplier.
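The percentage figures quoted in these comparisons can be computed as in the sketch below. Scores where higher is better (frame rates, Winstone) and results where lower is better (render and compile times) need different formulas. The sample numbers are made up for illustration and are not results from this review.

```python
def gain_higher_is_better(new_score, old_score):
    """Percent improvement for metrics such as frames per second."""
    return (new_score / old_score - 1.0) * 100.0


def gain_lower_is_better(new_time, old_time):
    """Percent speedup for metrics such as render or compile time."""
    return (old_time / new_time - 1.0) * 100.0


# Hypothetical sample values, not measurements from the review.
fps_gain = gain_higher_is_better(new_score=58.0, old_score=50.0)
time_gain = gain_lower_is_better(new_time=100.0, old_time=110.0)

print(f"frame rate gain: {fps_gain:.1f}%")
print(f"compile speedup: {time_gain:.1f}%")
```

Expressing time-based results as a speedup (old time divided by new time) keeps them comparable with score-based results, since both then describe how much more work gets done per unit of time.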



We definitely didn't see numbers like this when comparing Pentium 4 E to Northwood. But enough ado about this. Now let's see how Intel's new Celeron really stacks up.




General Usage and Content Creation Performance

The Business Winstone test shows the 330 and 335 falling in line between the 2200+ and 2400+ Athlon XP processors. This is a major step up from the 2.6GHz Celeron, which comes in under the P4 1.8A, the Athlon XP 1700+, and the 1.6GHz Duron. We see similar results with the Content Creation Winstone: the 335 performs on par with the 2500+ Barton, and the 325 falls in just below the 2200+.

Business Winstone 2004

Content Creation Winstone 2004




DivX 5.1 Encoding

In looking at Gordian Knot's encoding speed with DivX 5.1, we see the trend continue: Celeron D is a strong performer in the budget segment. This is one of the stronger showings for the Northwood based Celerons, and the 2.53GHz Celeron D still comes in at over 8% faster than the 2.6GHz Celeron.

It looks like FSB increases bring in the most benefit here, as the 20x100 Celeron D is only 3% faster than the 2.0GHz Northwood Celeron. Not that it is anything to worry about.

It's probable that if we retested with DivX 5.1.1, the Prescott based Celerons would perform even better in comparison to everything else due to the added SSE3 support.

DivX 5.1 Encoding




DirectX 9 Gaming Performance

Here, in all but Halo, the Celeron D 335 takes the lead in performance. In Halo, the 335 comes in second, which still isn't anything to shake a stick at.

In the Gunmetal benchmark, all the Celeron D processors line up in front of everything else. Of course, with these DX9 games we are only talking about tiny differences, but this is still a complete change of character for Intel's low end.

Pay attention to the performance differences between the 20x100 Celeron D and the 2.0GHz Celeron. Performance advantages start at 5% and go way up. The 20x100 Celeron D outperforms the 600MHz faster Northwood based Celeron every time.

Aquamark 3

Gunmetal Benchmark 2

Halo




DirectX 8 Gaming Performance

Once again, in all but Simcity, the top Celerons take the lead. In Warcraft III, the Athlons have a significant performance advantage over the Northwood Celerons, and the slowest Prescott Celeron has a 5.8% lead over the fastest Athlon.

In looking at the 20x100 Celeron D, we can see that the core and cache enhancements provide a more than 16% performance advantage under Warcraft III. In light of the performance of the Pentium 4 flavor of Prescott, this is simply remarkable.

Command & Conquer Generals: Zero Hour

Simcity 4

Warcraft III: Frozen Throne




Unreal Tournament 2003 Benchmark

Just to keep the DX8 performance page a little cleaner, we decided to give UT2K3 its own page. There is still a good bit of DX7 in UT2K3 code as well, so it's not completely unprecedented.

Anyway, under the flyby (which is traditionally GPU limited with faster processors), we see the Celeron D 335 once again putting in numbers at the top of the heap. The Athlon XP and Celeron D processors are fairly well matched here.

But looking at the Botmatch benchmark numbers tells a different story. We see the Athlon processors sitting well above the Celeron Ds. Of course, the Prescott core processors still hold their own and maintain a position well above the Northwood Celerons. The 20x100 Celeron D has a 14% performance advantage over the 2.0GHz Celeron in Botmatch.

Since Botmatch is more like actual game play, we would expect that the Athlon processors would perform better when actually playing UT2K3.

Unreal Tournament 2003 Flyby

Unreal Tournament 2003 Botmatch




OpenGL Gaming Performance

The performance of Northwood Celerons is horrible here, but Prescott takes the lead in Quake III.

Wolfenstein: Enemy Territory is based on an extended version of the Quake III engine, and under this benchmark, the Athlons come out on top.

Quake III Arena

Wolfenstein: Enemy Territory




3D Rendering Performance

While not pulling in top honors in rendering performance as the fastest Pentium 4 processors are accustomed to doing, the Celeron D is able to make huge leaps beyond Northwood based Celeron performance. Holding in the middle of the pack is definitely not a disgrace for these budget processors.

3D Studio Max Render Time




Development Workstation Performance

In compilation, AMD still takes the cake, but the Celeron D 335 manages to best the 2500+ Barton in this benchmark. We can see the more than 7% performance improvement when looking at the 20x100 Celeron D, and the 2.53GHz Celeron D 325 performs head and shoulders above the 2.6GHz Celeron with the addition of the 533MHz FSB.

Quake III Compile Times




Final Words

The numbers don't lie: Prescott is very well suited for Celeron. Not only do the new Celeron 3xx line of processors perform better than the previous Northwood based Celerons, but even when hampered by a 400MHz FSB, the Prescott Celerons consistently showed improved performance over their predecessors. Even more impressive is the fact that the Celeron 3xx line is able to keep pace with AMD's 2600+ and under Athlon XP line.

The improvements in Prescott's core weren't enough to help it keep up with Northwood as a Pentium 4, even with a double-sized L2 cache. With a 128kb cache, many of Northwood's strong points were ripped away, causing plenty of costly pipeline stalls as we mentioned in last year's budget CPU roundup. With the Prescott based Celeron, Intel's architectural enhancements aimed at avoiding pipeline stalls really had a chance to shine. This is due in no small part to the increased returns on performance when smaller caches are doubled in size.

It's clear that at this L2 cache size, Prescott is able to avoid enough potential pipeline stalls that the increased impact of refilling a 31-stage pipeline is much less significant than Northwood's constant struggle to keep itself busy. It's hard to tell how much of this is due to the improved branch prediction and scheduling as opposed to the fact that the extra 8kb of L1 cache that the Prescott has is a more significant percentage of the overall cache size on Celerons than Pentiums, but either way, the performance advantage over Northwood is there.

The bottom line is that Intel's newest architecture scales down with cache size and bus speed in a much more graceful manner than Northwood.

The only issue left for Intel to deal with is pricing. With both the Athlon XP 2500+ and 2600+ easily available at under $80.00 (as per our RealTime Pricing Engine), Intel really shouldn't be charging much more for their essentially comparable Celeron D parts. The information we were able to track down tells us that the 325 will be priced at $79, the 330 at $89, and the 335 at $117. It's a welcome change to see Intel close the gap between price and performance, as it was distasteful for us to look around and see people being pulled in by ads pushing something like "only $200 more to upgrade from an Athlon XP 2500+ to a 2.7GHz Celeron!" Stuff like that almost makes my stomach cringe. Now that Intel's gotten a handle on performance, we're happy that they are tightening up their belts and starting to rely on quality (rather than the Intel name) to sell products in the value space.

A very special thanks to MonarchComputer.com for sponsoring this review.

