Intel Core Duo (Yonah) Performance Preview - Part II

Author: Anand Lal Shimpi

by Anand Lal Shimpi on December 19, 2005 12:55 PM EST

Posted in
CPUs


What about Clock Speeds?

Whereas the Pentium 4's extremely deep pipeline made clock-for-clock comparisons to the Athlon 64 virtually meaningless, the Pentium M and Yonah processors feature far shorter pipelines akin to AMD's architecture.

The Athlon 64 features a 12-stage integer pipeline, and while Intel has never specifically disclosed the length of Yonah's pipeline, they have made two important statements: it is longer than the Pentium III's 10-stage integer pipeline, and shorter than Conroe/Merom's 14-stage pipeline. That leaves only 11 to 13 stages, so Yonah's pipeline is within a stage of the Athlon 64's 12.

The net result is that we can draw some valid conclusions based on comparisons of Yonah to the Athlon 64 X2 at similar clock speeds.

But our Yonah sample ran at 2.0GHz, which ends up being the speed of the slowest Athlon 64 X2 that is currently available: the 3800+. The highest end Athlon 64 X2s currently run at 2.4GHz, with higher speeds just around the corner. So the question isn't just how competitive Yonah is at 2.0GHz, but rather, how high can Yonah go? Unfortunately, our test platform wouldn't allow us to overclock our chip very far, but thankfully, we have access to a decent amount of Intel's future roadmaps, so we can at least see what's going to happen to Yonah over the next year.

While Yonah will make its debut at a maximum speed of 2.16GHz, it will actually only receive a single speed bump before Merom's release at the end of the year. That means that we'll see a 2.33GHz Yonah after the middle of the year, but we'll have to turn to Merom to get any higher clock speeds.

Looking back to our initial articles on the Pentium M's architecture, you'll remember that one of the important aspects of its design is that all critical paths in the chip were slowed down to meet a maximum clock target. This means that Intel set a clock target for the CPU and made sure that the chip ran at that speed or below, and did not optimize any paths that would have allowed the CPU to run higher. Instead, the Pentium M team depended on the manufacturing folks to give them additional clock speed headroom by providing smaller manufacturing processes every 2 years. In other words, the Pentium M was never designed for high clock speeds, which is why it debuted at 1.5GHz and still has not even reached 2.33GHz today.

Intel's next-generation microarchitecture hopes to change that approach ever so slightly by introducing a longer pipeline into the equation, but on a much more conservative basis than the Pentium 4 did just 5 years ago. Conroe (desktop), Merom (mobile) and Woodcrest (server) will feature a 14-stage integer pipeline, which will allow for higher clock speeds than what Yonah could pull through. We would expect a debut at a minimum of 2.4GHz and probably at least one speed grade higher. Learning from their mistakes with the Pentium 4, Intel will balance the reduction in efficiency of a deeper pipeline with a wider 4-issue core (vs. the current 3-issue core used in Yonah).
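The trade-off between pipeline depth, issue width and clock speed can be sketched with a toy model. Everything in it is an assumption made for illustration: the function name, the sustained IPC figures, the branch statistics, and the rule that a mispredicted branch costs about one cycle per pipeline stage. None of these numbers come from Intel or from our benchmarks.

```python
def time_per_instruction_ns(clock_ghz, sustained_ipc, pipeline_stages,
                            branch_fraction=0.20, mispredict_rate=0.05):
    """Average nanoseconds per instruction under a toy pipeline model."""
    # Cycles per instruction when nothing stalls.
    base_cpi = 1.0 / sustained_ipc
    # Assumption: a mispredicted branch costs about one cycle per stage.
    branch_cpi = branch_fraction * mispredict_rate * pipeline_stages
    # Cycles divided by GHz gives nanoseconds.
    return (base_cpi + branch_cpi) / clock_ghz


# All three configurations are hypothetical, not measured parts.
shorter = time_per_instruction_ns(2.0, 1.5, 12)
deeper_same_clock = time_per_instruction_ns(2.0, 1.5, 14)
deeper_wider_faster = time_per_instruction_ns(2.4, 1.8, 14)

for name, t in [("12-stage at 2.0GHz", shorter),
                ("14-stage at 2.0GHz, same width", deeper_same_clock),
                ("14-stage at 2.4GHz, wider core", deeper_wider_faster)]:
    print(f"{name}: {t:.3f} ns per instruction")
```

In this sketch the deeper pipeline loses at equal clock and equal width, and only pulls ahead once the extra clock speed and the wider core are factored in. How much a 4-issue core actually raises sustained IPC is exactly the open question.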

So it looks like Intel's plan for 65nm is to rely on their deeper pipelined processors (Conroe/Merom/Woodcrest) for higher clock speed, with Yonah falling below the 2.5GHz mark. And based on what we've seen in the first article, a 2.33GHz Yonah would be competitive with an Athlon 64 X2 4600+, but definitely not outpacing it. This does bode well for Intel's next-generation processors, especially on the desktop with Conroe.

If the move to a 4-issue core is able to balance out the negative impact of a deeper pipeline (which admittedly it may or may not do in all cases), a higher clock speed desktop part should be very good competition for AMD's offerings. Although based on what we've seen thus far, we would be surprised if Conroe vs. Athlon 64 was a blow-out in favor of either manufacturer; more and more, it is looking like Conroe will simply bring Intel up to par with AMD, ahead in some areas, behind in others, and with the lower power advantage as long as AMD is still at 90nm.

Why the X2 and why not Turion?

One of the other questions that we were asked a lot after the first article was why we insisted on comparing a mobile Yonah processor to a desktop Athlon 64 X2, and not an AMD Turion 64. Our reasoning was obvious to some, but we felt it made sense to present it more clearly here:

As much as Yonah is a mobile processor, it is a great indicator of the performance of Intel's future desktop processors based on the Conroe core. AMD has already stated that beyond moving to Socket-M2 and some minor updates, there will be no significant architectural changes to the Athlon 64 line next year. In other words, we know for the most part how AMD's going to be performing next year, but we have no clue how Intel will towards the end of 2006; Yonah helps us fill in the blanks.
AMD will have a dual core Turion based mobile processor out sometime in 2006. However, it will be based on AMD's Socket-M2 platform, meaning that it will include DDR2 support. Given that we don't know exactly how DDR2 is going to impact the Athlon 64's performance, we couldn't accurately simulate the performance of AMD's upcoming dual core Turion. Comparing a dual-core Yonah to AMD's single-core Turion wouldn't be a very valid comparison either.

103 Comments


Furen - Monday, December 19, 2005 - link
Well, the memory controller is the major difference between the K7 and the K8, but if you compare performance between the two, the K8 performs much better. This means that the memory controller directly led to this increase in performance.

In truth, the K8's performance is a combination of its microarchitecture and the low-latency access to the memory, but since the microarchitecture came first (and was insanely bottlenecked by the FSB at higher clocks), the one improvement that led to the performance difference between the K7 and the K8 was the memory controller. In fact, when AMD launched its K8, it said that there would be a 20-30% performance improvement because of the on-die memory controller.
tfranzese - Monday, December 19, 2005 - link
The K8 saw a new instruction set, a slightly lengthened pipeline, SSE2 extensions, SSE3 extensions (eventually), dual-core/multi-cpu design strategies, etc. Oh, and it got an on-die memory controller among other architectural tweaks.

I don't think it's valid to attribute so many factors that could have benefitted the architecture to just the memory controller. A lot of small differences add up to a lot.
Furen - Monday, December 19, 2005 - link
Longer pipelines lead to lower performance, the "dual-core design strategies" have nothing to do with a single-core K8's performance benefits over a K7, SSE3 is useless even now and, of course, AMD64 does not benefit 32-bit execution. The only thing that you mentioned that makes a difference is SSE2 and it doesn't really make as much of a difference on A64s as it does on P4s since SIMD vector instructions require multiple passes on the A64. The deeper buffers help, as do the increased L2 cache bandwidth and the increase in L2, but the biggest benefit does come from the integrated memory controller. Cutting access latency is insanely important but having a faster frontside bus (the bus that connects the execution core/cache to the memory controller) is probably what makes A64s perform how they perform.
fitten - Tuesday, December 20, 2005 - link

quote:
Longer pipelines lead to lower performance,

This is not always the case. On branchy code, it is typically true. On non-branchy code, longer pipelines can be very efficient. The problem is that typical codes on the x86 are very branchy so longer pipelines aren't that good on typical x86 codes.

As far as latency numbers and the like, you should do the math to understand why the latency helps. For large cache sizes (512K and larger), the L2 should get above 96% hit rate typically. For 1M L2, hit rates should be 98% or more. Obviously, the application you have will govern these hit rates but this is for "typical" codes. Some applications will see almost no benefit from having an L2 cache at all, for example. The latency of the main memory accesses is felt in the misses (that other 4% or 2%). If the L1 is pretty good (1 cycle penalty), you can zero that out for the calculation. Use some numbers on L2 and main memory access times to get an idea of how it really helps.

So many people just chant "integrated memory controller" as some kind of mantra without even knowing how much it *really* affects memory access times.
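A rough sketch of that math, using the standard average-access-time formula. The 96% and 98% hit rates are the ones quoted above; the 70% low-locality case, the 12-cycle L2 latency and the two main memory latencies (200 cycles through an external northbridge, 120 cycles with an on-die controller) are made-up placeholders, not measurements of any real CPU.

```python
def average_access_cycles(l2_hit_rate, l2_latency, memory_latency):
    """Average cycles for an access that missed L1 (L1 cost zeroed out)."""
    # Every such access pays the L2 latency; misses also pay for main memory.
    return l2_latency + (1.0 - l2_hit_rate) * memory_latency


L2_LATENCY = 12        # assumed, in cycles
EXTERNAL_MEMORY = 200  # assumed off-die controller latency, in cycles
ON_DIE_MEMORY = 120    # assumed integrated controller latency, in cycles


def imc_gain(l2_hit_rate):
    """Ratio of average access time without and with the on-die controller."""
    slow = average_access_cycles(l2_hit_rate, L2_LATENCY, EXTERNAL_MEMORY)
    fast = average_access_cycles(l2_hit_rate, L2_LATENCY, ON_DIE_MEMORY)
    return slow / fast


for hit_rate in (0.98, 0.96, 0.70):
    print(f"L2 hit rate {hit_rate:.0%}: {imc_gain(hit_rate):.2f}x faster average access")
```

With these placeholder numbers the gain is modest at high hit rates and grows quickly as the hit rate drops, which is the pattern the calculation is meant to expose.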
Furen - Tuesday, December 20, 2005 - link
Longer pipelines do not help non-branchy code, higher clock speeds do. Longer pipelines allow you to raise clock speeds but if you compare two equally clocked CPUs with similar architectures but different pipeline lengths then the longer-pipelined one will ALWAYS be slower, since both will eventually mispredict a branch and the longer-pipelined one will pay a greater penalty. In the case of the K8 compared to the K7, however, the branch predictor was improved, the TLBs increased and so on, so you probably end up having the same performance per clock.

"Typical" code is code that operates on very small data sets, like a word processor. This is not what I'm talking about, however, I'm referring to code that handles massive data sets that cannot fit inside the L2 cache. These include games and streaming media. A K7 performs pretty much the same as a K8 (clock for clock) in office applications and the like, but once you have data traveling down the frontside bus (the K8's frontside bus equivalent is the link between the execution core and the memory controller, which runs at CPU clock) then the performance differences are massive. It may be true that most of the code we execute on a PC does not even touch the main memory to a significant degree but it is also true that we perceive the times when it does as a massive drop in performance. Saying that memory bandwidth (and latency, as the two are directly related) is useless is like saying that a P3 is enough for everyone.
fitten - Wednesday, December 21, 2005 - link

quote:
Longer pipelines do not help non-branchy code, higher clock speeds do.

Yes... and longer pipelines is one of the design parameters to achieve higher clock speeds.

quote:
"Typical" code is code that operates on very small data sets, like a word processor. This is not what I'm talking about, however, I'm referring to code that handles massive data sets that cannot fit inside the L2 cache.

Yes, which is why I used "typical" there with a caveat that some workflows do not match that pattern. The math that I mentioned is not difficult to do and the percentages for hit/miss are simply parameters into the equation. You can take any instruction mix and data access pattern, analyze it, and plug the newly found percentages into the equation for a comparison. And... I never said that memory bandwidth is useless. However, I would be inclined into discussion about your bandwidth and latency being directly related (in the general form). Quite obviously, satellite communication has high bandwidth and it is equally obvious that satellite communication has a very high latency, for example.

So, your post confirms exactly what I have said and that AnandTech's benchmarks show (and the conclusions stated in the article). For the majority of applications, since data locality is high, the IMC doesn't do all that much (simply because it isn't used that much). For applications such as games and other applications with data access patterns that do not have a high degree of data locality, the IMC starts to shine. I would also argue that streaming does not fall into that category unless you have poorly optimized code. Intelligent use of prefetching, for example, can hide most of the latency penalties of main memory. I guess we could discuss what "majority of things" means and whether or not games fall into that category. ;)
Furen - Thursday, December 22, 2005 - link
[quote] Yes... and longer pipelines is one of the design parameters to achieve higher clock speeds. [/quote]

That's exactly what I said in the line that followed what you quoted. When I said that longer pipelines themselves don't help performance I meant that the clock-for-clock performance benefits of the K8 over the K7 can be mostly attributed to its on-die memory controller. Of course the bigger caches help, as do SSE2 and the other improvements, but the lion's share of the improvement comes from the integrated northbridge (the FSB was a horrible choke point in the K7).

[quote] I would be inclined into discussion about your bandwidth and latency being directly related (in the general form). Quite obviously, satellite communication has high bandwidth and it is equally obvious that satellite communication has a very high latency, for example. [/quote]

Sorry, let me clarify that a bit. When dealing with DRAM (at a set frequency) in a computer system the usable memory bandwidth is directly related to the latency. It is not directly proportional but a higher latency will mean a lower usable bandwidth. This is because the memory subsystem in a PC is not just a data transport mechanism but also functions as a data storage array, which gives latency more importance (Satellite communication, on the other hand, only moves data from point to point, it does not store it or modify it in any way, it's just a conduit, which makes its bandwidth somewhat independent of the latency). Now, remember that I'm talking about usable memory bandwidth, not peak bandwidth (which is what manufacturers love to quote). Peak bandwidth is pretty much unrealizable when doing anything useful.
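One way to put numbers on that relationship is Little's law: if the CPU can only keep a fixed number of memory requests in flight, the bandwidth it can actually use is bounded by the bytes in flight divided by the latency. The sketch below assumes 8 outstanding 64-byte requests and a 6.4 GB/s peak; those figures and the latencies are illustrative assumptions, not specs of any particular chip.

```python
def usable_bandwidth_gb_s(outstanding_requests, bytes_per_request,
                          latency_ns, peak_gb_s):
    """Upper bound on usable bandwidth for a fixed number of requests in flight."""
    # Little's law: throughput = data in flight / latency.
    # Bytes per nanosecond is numerically the same as GB/s.
    latency_bound = outstanding_requests * bytes_per_request / latency_ns
    # The bus can never deliver more than its peak rate.
    return min(latency_bound, peak_gb_s)


PEAK_GB_S = 6.4  # assumed peak bandwidth of the memory bus

for latency_ns in (150, 100, 60):
    bw = usable_bandwidth_gb_s(8, 64, latency_ns, PEAK_GB_S)
    print(f"{latency_ns} ns latency: at most {bw:.2f} GB/s usable")
```

Under these assumptions, cutting latency raises usable bandwidth until the peak figure becomes the limit, so the two are related but not proportional.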

Anyway, I agree with you on the caches, I wanted to point out that the pipeline length itself provides no performance improvements whatsoever, and wanted to say that an integrated memory controller is a wonderful thing. Now, I say that an IMC is wonderful but it does have huge drawbacks, the main one being what AMD is currently dealing with, having to change sockets in order to update memory technology. The thing is, Intel needs flexibility because it is always updating to the newest technologies out there but AMD, on the other hand, actually gained control over the part of the traditional northbridge that affects performance the most without having to go all out and design its own chipsets like Intel does, which is why pretty much all AMD chipsets perform very similarly.
Furen - Thursday, December 22, 2005 - link
Now, can someone tell me how to make decent looking quotes?!!
Xenoterranos - Tuesday, December 20, 2005 - link
Considering that bus was developed with dual core and multi-cpu design in mind, I'd say that "dual-core design strategies" had a lot to do with the increase in performance of K8 over K7. AMD's technical director said something to that effect in so many words in an interview here a few years back; he said they'd built K8 from the ground up for dual core, multi-cpu applications.
blackbrrd - Monday, December 19, 2005 - link
Actually, last time I checked, AMD's FPU (since K7) has had 3 execution units, while Intel's has had 2 execution units (since Pentium 2, or the original Pentium, can't remember).
