Intel Q3'05 Roadmap: Conroe Appears, Speculation Ensuesby Kristopher Kubicki & Jarred Walton on August 8, 2005 3:13 AM EST
- Posted in
In order to have any inkling of what Conroe will offer, we need to take a step back for a minute. The last truly new architecture that Intel introduced was the IA-64/EPIC platform for Itanium (although depending on how you look at it, some would say that NetBurst actually came after IA-64). Prior to that, Intel had the P6 architecture, which was preceded by P5 (Pentium), 486, 386, etc. all the way back to the first parts Intel made. At present there are three major architectures that are all in production at Intel: P6 (Pentium Pro/II/III now evolved to Pentium M), NetBurst (Pentium 4 and derivatives), and IA-64/EPIC used in Itanium processors. P6 isn't actually the real name of the architecture for Pentium M, of course - Intel has never come forth with an official name. While Pentium M does use an extension of P6, the Banias and Dothan cores really change things quite a bit. We'll talk about how in a moment, but we'll refer to the architecture as P6-M for the remainder of this article. When we say P6-M, we mean Banias, Dothan, and Yonah. Let's take a quick look a the benefits and problems of each architecture before we talk about Conroe.
Prescott is used on the recent Pentium and Celeron chips and has a 31 stage pipeline, coupled to a separate 8 stage fetch/decode front end. (Earlier Northwood and Willamette cores use a 20 stage pipeline with the 8 stage front end.) Together the total pipeline length comes in at 39 stages - over twice the length of the current AMD K8 pipeline. In fact, the next longest pipelines outside of Intel aren't even out yet: Cell and Xenon are both around 21 stages long. The benefits of a long pipeline are in raw clock speeds. It's no surprise that NetBurst is the only chip currently shipping in speeds greater than 3 GHz, and Cell and Xenon are slated to join that "elite" group of processors in the future.
While a lengthy pipeline allows for high clock speeds, it also introduces inefficiencies in cases where a branch prediction misses. When that occurs, everything following the branch instruction has to be cleared from the CPU pipeline and execution begins again - a penalty of as much as 30 cycles in the case of Prescott. (Of course, it could be even longer if there's a cache miss and main memory needs to be accessed, but that delay would occur with or without the branch miss so we'll ignore it.) In order to avoid the full penalty of a branch misprediction (39 cycles), Intel decoupled the fetch/decode unit from the main pipeline and turned the L1 cache into a "trace cache" where instructions are stored in decoded form. The trace cache is actually a very interesting concept and certainly helped improve performance. It basically allows many instructions to skip 1/4 to 1/3 of the standard pipeline. While Intel no longer holds the performance crown, it wasn't until the launch of the K8 that Intel really lost the lead.
In terms of internal functioning of the NetBurst pipeline, each clock cycle at most three traces (instructions decoded into micro-ops) can be issued from the trace cache to the queues within the main pipeline. The NetBurst queues (schedulers) can then dispatch up to six micro-ops per cycle, but there are restrictions and in many cases there are execution slots that can't be filled on any given cycle. Based on the number of traces issued per clock, most would call NetBurst a three-wide issue design. That makes NetBurst the same as AMD's K7/K8 cores as well as the P6/P6-M cores in terms of issue rate. Purely from a theoretical standpoint, NetBurst could execute 3 instructions per clock, multiplied by the clock speed to give the final performance. Nothing ever reaches the theoretical performance of course - if it did, then NetBurst would still be over 35% faster than any other architecture, given its high clock speeds. Branch misses, cache misses, instruction dependencies, etc. all serve to reduce the theoretical performance offered.
Moving on to the Pentium M core, you can find out some of the details of what was changed in our Dothan investigation from last year. The basic idea is to take the P6 core and add some of the latest technologies to the design. To recap the earlier article, the Pentium M has several major design features. First, it goes with a more moderate pipeline length: longer than P6 to allow higher clock speeds, but shorter than NetBurst. (Intel isn't saying more than that, though guesstimates would put the length around 14 to 17 stages.) Next, Intel added micro-ops fusion to the core, which helps some instructions move through the core faster and avoids delays associated with out-of-order cores. Micro-ops fusion in essence eliminates dependency problems on certain instructions, since they are "fused" together. The core also has a dedicated stack manager that helps improve memory access efficiency as well as lower power use. Better branch prediction is another major improvement relative to the P6 design - take something like the branch prediction of NetBurst and put it on the P6 core and that's a rough description of what was done. Branch prediction is one of the features of an architecture that generally makes all code run a bit faster, and it once again reduces inefficiencies. The number of execution units remains the same as in P6, which means there's less wasted power on idle parts of the chip, while the faster system bus of NetBurst helps to keep the processor fed with data. Finally, power saving features were added to the cache, allowing the CPU to only fully power up small areas of the L2 cache for each cache access. The end result is a processor that has certain limitations but ends up achieving a very high performance per Watt rating, which is important for a mobile part. As we've shown in several articles, Pentium M makes for an attractive laptop processor but still can't compete with desktop parts in certain tasks.
Moving on to the final architecture, we come to IA-64/EPIC. While similar in some ways to VLIW (Very Long Instruction Word) architectures of the past, Intel worked to overcome some of the problems with VLIW (specifically the need to recompile code for every processor update) and called their new approach EPIC: "Explicitly Parallel Instruction Computer". In contrast to the P6, NetBurst, K7, and K8 architectures that can issue up to three instructions per cycle, the current Itanium 2 chips can issue six instructions per clock. From a purely theoretical standpoint, the fastest Itanium 2 running at 1.6 GHz actually has more computational power than any other Intel chip. Throw in dual core designs with HyperThreading - HyperThreading that actually works much better than NetBurst HTT due to the wide design of EPIC - and each chip not only has the potential to issue six instructions per clock, but it should actually come relatively close to that number. Another difference between Itanium and the other designs is that large amounts of cache are present in order to keep the pipelines fed with data. Current models ship with up to 9MB of L3 cache, while future parts like the Montecito will have 24MB of L3 cache (and a transistor count of 1.7 billion transistors - about eight times the transistor count of the Pentium D Smithfield core)!
Of course, with the wide issue rate of Itanium 2 (the original Itanium had a 6-wide core as well, but could generally only get 3.5 to 4.0 IPC at best), you need a lot of execution units. NetBurst has 7 execution units in Prescott: two simple integer units (which can function as 4 integer units if you count the double pumped design), a complex integer unit, two FP/SIMD units, and dedicated memory load and store units. If you want to count the simple integer units as 2 each, you could make a stretch and say NetBurst has nine execution units. AMD's K7 and K8 both have nine execution units as well, only they go for a less customized approach and instead have three each of the integer, FP/SIMD, and memory units. Each of AMD's units is fully functional, unlike the "simple" and "complex" integer units in NetBurst. In contrast to these architectures, the current Itanium 2 chips have six ALUs (Arithmetic Logic Units), three BRUs (Branch Units), two FPUs, one SIMD, two load units, and two store units - call it 16 functional units if you prefer, though the specialization of some of them makes it slightly less than that. While Itanium 2 is very wide, the length of the pipeline is only 8 stages - less than any other modern x86 processor by a significant amount. That certainly plays a role in the reduced clock speeds, but like Athlon 64, lower clock speeds with a more efficient architecture can outperform long pipelines in many instances. In order to extract all of the potential performance from Itanium, however, a lot of work needs to be done during code compilation. This is the Achilles' heel of VLIW designs; Processor updates require the code to be recompiled. While EPIC doesn't require that you recompile the code, newer compiler optimizations can improve performance significantly.
All that talk about other Intel architectures (as well as some of AMD), and yet we still haven't said exactly what Conroe is. The simple truth is that no one other than Intel and people under strict NDA really know for sure what the Conroe architecture will entail. There is a point to all of this discussion of previous architectures, though. While we've really only skimmed the surface of the designs, hopefully you can see how wildly different each architecture is from the others. NetBurst is long and narrow, EPIC is short and wide, and P6-M is a medium length pipeline that is narrower than either of the others but requires less power. The high clock speeds and resultant power levels have created problems for NetBurst, but there are still cases where it substantially outperforms P6-M. Itanium is still a better solution for certain types of big business work (databases in particular) than any of the other Intel architectures. While all three architectures have their strong points, none of them qualify as a universally superior solution. Having fallen behind AMD performance in many areas, we seriously doubt that Intel wants to create a design that merely aims at being "faster than AMD in most areas." Whether or not they can succeed is of course a question for the future.
If we don our speculation hats for a minute, we'd say that Conroe will return to more typical pipeline lengths and also reduce the maximum clock speed of the processors based off it relative to NetBurst. A 20 pipeline stage design, give or take, seems to be reasonable - we heard a few people at WinHEC suggest that NetBurst was hubris in terms of pipeline lengths, and that 20 or fewer stages is where all foreseeable pipelines - Intel and otherwise - are heading. The concept of a trace cache also seems to have merit, so some variant of that concept could show up in Conroe - micro-ops fusion plus a trace cache larger than that of NetBurst sounds interesting to us at least, though we're not at all sure it's feasible. Along with the shorter, more efficient pipeline, Conroe could also look into going to a wider issue rate. Some people have argued (rather convincingly) that x86 code is not conducive to issuing more than 3 instructions per clock without expending significant die resources, however, and current designs rarely manage issuing three instructions per clock anyway. A better solution could be to simply add more execution units, branch prediction, prefetch logic, etc. to ensure that the core can actually reach the maximum issue rate more frequently. Taking something like Pentium M and adding more FP/SIMD computational power isn't too much of a stretch (though that seems to be where Yonah is already heading).
If any of these ideas make the final design of Conroe, it's basically just an educated guess. The main point right now is that new architectures from Intel are not a frequent occurrence, so we expect it to be substantially different than P6/P6-M, NetBurst, and EPIC. Depending on how much collaboration there is between the various CPU design teams, we could see many elements of all three architectures or we could see a design largely derived from one or two of the others. If you consider that the Northwood to Prescott changes were pretty significant and Intel still didn't dub Prescott a new architecture, Conroe (and derivatives) ought to be a more significant redesign than going from 20 to 31 pipeline stages, adding EM64T, and changing cache sizes. Chances are that by the time we know more, we'll be under NDA as well until the official launch, so consider this our last chance at some enthusiast speculation.