Celeron, Pentium II and III Processors

I'm going to forgo listing the various models of these processors for the time being; if anyone has a real desire to see them listed, feel free to let me know. If you're still running one of these processors, I feel for you. I still use one at work, and I only upgraded from my P3 back in March. Given the price of upgrading, though - $225 will get you a decent motherboard, 512 MB of RAM, and an Athlon XP 2500+ - you really should upgrade if at all possible.

The old Pentium Pro P6 architecture used a 12-stage pipeline, more or less concluding with the Pentium III (more on that later). It had three specialized AGUs, two ALUs - one for simple instructions and a second for more complex instructions - and one FPU. The FPU also added support for MMX and SSE (which AMD lacked until the Athlon XP, but by then Intel was pushing the P4), and Intel's chips were generally faster at these instructions than AMD's. That's not too surprising, considering that Intel created the technologies and AMD had to license them.

Intel certainly could have stuck with the design a lot longer, as the last-gasp Tualatin core offered pretty competitive clock-for-clock performance with the Athlon up to 1.4 GHz (the last Pentium III-S). In fact, the later 1.0A to 1.4A Celeron processors were very good overclocking chips, and a 1.1A running on a 133 MHz bus gave pretty decent performance. (I have just such a system powering my Home Theater PC.) Newer and better chipsets could have improved speed further, but Intel cut off the line and focused on pushing the Pentium 4 and NetBurst. In hindsight, that appears to have been more of a marketing-driven decision, although it can't really be said that it was the worst idea ever.

Celeron 2 and Pentium 4 Processors

Pentium 4 and Celeron (Desktop)
Model | Clock (MHz) | Core | L2 (KB) | Bus (MHz) | Multiplier | Socket | L3 (KB)
P4 1.3 | 1300 | Willamette | 256 | 100 | 13.0X | 423 | -
C 1.7 | 1700 | Willamette | 128 | 100 | 17.0X | 478 | -
P4 1.4 | 1400 | Willamette | 256 | 100 | 14.0X | 423 | -
P4 1.4 | 1400 | Willamette | 256 | 100 | 14.0X | 478 | -
C 1.8 | 1800 | Willamette | 128 | 100 | 18.0X | 478 | -
P4 1.5 | 1500 | Willamette | 256 | 100 | 15.0X | 423 | -
P4 1.5 | 1500 | Willamette | 256 | 100 | 15.0X | 478 | -
C 2.0 | 2000 | Northwood | 128 | 100 | 20.0X | 478 | -
P4 1.6 | 1600 | Willamette | 256 | 100 | 16.0X | 423 | -
P4 1.6 | 1600 | Willamette | 256 | 100 | 16.0X | 478 | -
C 2.1 | 2100 | Northwood | 128 | 100 | 21.0X | 478 | -
P4 1.7 | 1700 | Willamette | 256 | 100 | 17.0X | 423 | -
P4 1.7 | 1700 | Willamette | 256 | 100 | 17.0X | 478 | -
C 2.2 | 2200 | Northwood | 128 | 100 | 22.0X | 478 | -
P4 1.6 | 1600 | Northwood | 512 | 100 | 16.0X | 478 | -
C 2.3 | 2300 | Northwood | 128 | 100 | 23.0X | 478 | -
C 2.4 | 2400 | Northwood | 128 | 100 | 24.0X | 478 | -
C 2.5 | 2500 | Northwood | 128 | 100 | 25.0X | 478 | -
P4 1.8 | 1800 | Northwood | 512 | 100 | 18.0X | 478 | -
C 2.6 | 2600 | Northwood | 128 | 100 | 26.0X | 478 | -
C 2.7 | 2700 | Northwood | 128 | 100 | 27.0X | 478 | -
C 2.8 | 2800 | Northwood | 128 | 100 | 28.0X | 478 | -
P4 2.0 | 2000 | Northwood | 512 | 100 | 20.0X | 478 | -
P4 2.2 | 2200 | Northwood | 512 | 100 | 22.0X | 478 | -
C D 320 | 2400 | Prescott | 256 | 133.3 | 18.0X | 478 | -
P4 2.4 | 2400 | Northwood | 512 | 100 | 24.0X | 478 | -
C D 325 | 2533 | Prescott | 256 | 133.3 | 19.0X | 478 | -
C D 325/J | 2533 | Prescott | 256 | 133.3 | 19.0X | T/775 | -
P4 2.26B | 2267 | Northwood | 512 | 133.3 | 17.0X | 478 | -
C D 330 | 2667 | Prescott | 256 | 133.3 | 20.0X | 478 | -
C D 330/J | 2667 | Prescott | 256 | 133.3 | 20.0X | T/775 | -
P4 2.4B | 2400 | Northwood | 512 | 133.3 | 18.0X | 478 | -
P4 2.6 | 2600 | Northwood | 512 | 100 | 26.0X | 478 | -
P4 2.4A* | 2400 | Prescott | 1024 | 133.3 | 18.0X | 478 | -
C D 335 | 2800 | Prescott | 256 | 133.3 | 21.0X | 478 | -
C D 335/J | 2800 | Prescott | 256 | 133.3 | 21.0X | T/775 | -
P4 2.53B | 2533 | Northwood | 512 | 133.3 | 19.0X | 478 | -
C D 340 | 2933 | Prescott | 256 | 133.3 | 22.0X | 478 | -
C D 340/J | 2933 | Prescott | 256 | 133.3 | 22.0X | T/775 | -
P4 2.4C | 2400 | Northwood | 512 | 200 | 12.0X | 478 | -
P4 2.66B | 2667 | Northwood | 512 | 133.3 | 20.0X | 478 | -
P4 2.8B | 2800 | Northwood | 512 | 133.3 | 21.0X | 478 | -
P4 2.6C | 2600 | Northwood | 512 | 200 | 13.0X | 478 | -
P4 2.8A* | 2800 | Prescott | 1024 | 133.3 | 21.0X | 478 | -
P4 2.8E | 2800 | Prescott | 1024 | 200 | 14.0X | 478 | -
P4 520/J | 2800 | Prescott | 1024 | 200 | 14.0X | T/775 | -
P4 3.06B HTT | 3067 | Northwood | 512 | 133.3 | 23.0X | 478 | -
P4 2.8C | 2800 | Northwood | 512 | 200 | 14.0X | 478 | -
P4 3.0E | 3000 | Prescott | 1024 | 200 | 15.0X | 478 | -
P4 530/J | 3000 | Prescott | 1024 | 200 | 15.0X | T/775 | -
P4 3.0C | 3000 | Northwood | 512 | 200 | 15.0X | 478 | -
P4 3.2E | 3200 | Prescott | 1024 | 200 | 16.0X | 478 | -
P4 3.2C | 3200 | Northwood | 512 | 200 | 16.0X | 478 | -
P4 3.4E | 3400 | Prescott | 1024 | 200 | 17.0X | 478 | -
P4 550/J | 3400 | Prescott | 1024 | 200 | 17.0X | T/775 | -
P4 3.4C | 3400 | Northwood | 512 | 200 | 17.0X | 478 | -
P4 560/J | 3600 | Prescott | 1024 | 200 | 18.0X | T/775 | -
P4EE 3.2 | 3200 | Gallatin | 512 | 200 | 16.0X | 478 | 2048
P4 540/J | 3200 | Prescott | 1024 | 200 | 16.0X | T/775 | -
P4 570J | 3800 | Prescott | 1024 | 200 | 19.0X | T/775 | -
P4EE 3.4 | 3400 | Gallatin | 512 | 200 | 17.0X | 478 | 2048
P4EE 3.4 | 3400 | Gallatin | 512 | 200 | 17.0X | T/775 | 2048
P4 580J | 4000 | Prescott | 1024 | 200 | 20.0X | T/775 | -
P4EE 3.46 | 3467 | Gallatin | 512 | 266 | 13.0X | T/775 | 2048
P4EE 3.73 | 3733 | Prescott | 2048 | 266 | 14.0X | T/775 | -
* Prescott 2.4A and 2.8A processors have HyperThreading Technology (HTT) disabled.
*** Front Side Bus (FSB) Speeds are "quad pumped", so Intel's FSB numbers are four times the actual bus speed on Pentium 4/Celeron, Pentium M/Celeron M, and Itanium processors. Multipliers are based off the base bus speed, not the FSB value.
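
To make the footnote concrete, here is a minimal C sketch (my own illustration, not from Intel documentation) showing how the clock and FSB figures in the table relate to the base bus speed and multiplier; the 200 MHz bus and 14.0X multiplier are taken from the P4 2.8C row and can be swapped for any other row.

    /* Illustrative only: relate the table's clock and "quad pumped" FSB
     * ratings to the base bus speed and multiplier. The 200 MHz bus and
     * 14.0X multiplier below come from the P4 2.8C row. */
    #include <stdio.h>

    int main(void)
    {
        double base_bus_mhz = 200.0;                 /* actual bus clock       */
        double multiplier   = 14.0;                  /* CPU multiplier         */

        double core_mhz = base_bus_mhz * multiplier; /* 200 x 14 = 2800 MHz    */
        double fsb_mhz  = base_bus_mhz * 4.0;        /* quad pumped: "800 MHz" */

        printf("Core clock:   %.0f MHz\n", core_mhz);
        printf("Marketed FSB: %.0f MHz (4 x %.0f MHz)\n", fsb_mhz, base_bus_mhz);
        return 0;
    }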

NetBurst couples a deep 20-stage pipeline to an 8-stage fetch/decode unit. Because of the time spent fetching and decoding instructions, Intel created a new type of cache called a trace cache, which holds pre-decoded micro-ops; for a large percentage of instructions, then, NetBurst runs as a 20-stage pipeline. Certain types of code run very well on NetBurst, while others - specifically branch-heavy code, like that seen in compilers and some games - do not. An incorrect branch prediction on the P4 costs about twice as many lost cycles as one on the P3 or Athlon, which is why Intel added a more robust branch prediction unit.
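
As a rough illustration of what "branch-heavy code" means here, the following C fragment (a generic example of mine, not taken from any benchmark mentioned in this article) sums only the random values above a threshold. The loop branch is predicted almost perfectly, but the data-dependent branch is essentially a coin flip, and each miss stalls a long pipeline like NetBurst's for far longer than it would a P3 or Athlon.

    /* A generic example of hard-to-predict, data-dependent branching.
     * The first loop's branch is taken almost every iteration and predicts
     * nearly perfectly; the second loop's branch depends on random data and
     * mispredicts roughly half the time, which is what hurts a deep pipeline. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { N = 1 << 20 };
        static int data[N];
        long sum = 0;

        for (int i = 0; i < N; i++)
            data[i] = rand() & 0xFF;    /* fill with pseudo-random bytes */

        for (int i = 0; i < N; i++) {
            if (data[i] >= 128)         /* ~50% taken on random data     */
                sum += data[i];
        }

        printf("sum = %ld\n", sum);
        return 0;
    }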

The long pipeline allowed clock speeds to scale very quickly with NetBurst. It was also a bandwidth-hungry design, so increasing bus speeds combined with dual-channel memory eventually pushed the P4 beyond the reach of the Athlon XP. On the server front, the Xeon processors made up for the bandwidth shortfall by adding L3 cache.

Prescott further extended the NetBurst pipeline to 23 stages, in addition to the 8 fetch/decode stages. For whatever reason, Intel generally describes the Prescott pipeline as 31 stages while calling the earlier design a 20-stage pipeline. Besides the additional stages, Prescott doubled the L2 cache of the Northwood, added SSE3 support, and - to the best of my knowledge - contains deactivated x86-64 support, called EM64T by Intel and AMD64 by its creator, AMD. Xeon versions of Prescott with the 64-bit support enabled are now shipping, and by the time XP-64 is released we will likely see 64-bit enabled desktop processors as well.

The Pentium 4 architecture also saw the introduction of Simultaneous Multi-Threading (SMT) on Intel processors, which Intel chose to call Hyper-Threading Technology (HTT). It appears to have been part of the core from the very beginning, but Intel didn't enable the functionality until the P4 3.06 launched, at which point it became available on the Xeon platforms as well. Later, it was enabled on all of the 800 FSB "C" processors. Due to the length of the P4 pipeline, HTT helps keep the execution units busy in the event of an incorrect branch prediction: the second thread can continue to run while the other thread recovers. In an ideal scenario, HTT could potentially increase performance by 20 or even 50 percent. In real-world tests, however, it rarely improves performance by more than 5 to 10 percent, and there are even times when it hurts performance.
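
For context, HTT requires no special instructions; the operating system simply sees two logical processors and schedules ordinary threads onto them. The sketch below (a hypothetical POSIX/pthreads example of mine, not Intel sample code) shows the kind of two-thread workload that can benefit, assuming the OS and chipset expose both logical CPUs.

    /* Hypothetical pthreads sketch: on an HTT-enabled P4 the OS reports two
     * logical processors, so these two ordinary threads can be scheduled to
     * run at the same time on one physical core. Not a benchmark. */
    #include <pthread.h>
    #include <stdio.h>

    static void *busy_work(void *arg)
    {
        volatile double x = 1.0001;
        for (long i = 0; i < 50000000L; i++)
            x *= 1.0000001;             /* keep the execution units busy */
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, busy_work, NULL);  /* may land on logical CPU 0 */
        pthread_create(&t2, NULL, busy_work, NULL);  /* may land on logical CPU 1 */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("both threads finished");
        return 0;
    }

Compile with -pthread; whether the pair finishes faster than running the same work back-to-back depends entirely on how well the two instruction streams share the core's execution units, which is why real-world HTT gains usually land in that 5 to 10 percent range.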

With the switch to Socket 775 (LGA), Intel has also adopted model numbers. This likely has something to do with the recent difficulties Intel has encountered in scaling the NetBurst architecture to higher speeds - although an even bigger problem is Intel's own Pentium M architecture (covered in the next section). Anyway, we now have model numbers that are supposed to reflect the overall capabilities of the chip, with higher numbers indicating more desirable chips. Comparisons between families of chips should not be made based solely on the model number, however - there will certainly be instances where a 5xx chip offers better performance than a 7xx chip, and perhaps we'll also see some 3xx chips outperform their "superiors". For the time being, all of the 5xx chips are Prescott cores with 1 MB of L2 cache and an 800 MHz FSB. Future processors are also listed, and you can see where they will likely fall in the performance spectrum.

Mobile Celeron, Mobile P4, Celeron M and Pentium M Processors

Mobile Pentium/Celeron Chips**
Model | Clock (MHz) | Core | L2 (KB) | Bus (MHz) | Multiplier | Socket
MC 1.4 | 1400 | Willamette | 128 | 100 | 14.0X | 478M
MC 1.5 | 1500 | Willamette | 128 | 100 | 15.0X | 478M
MC 1.6 | 1600 | Willamette | 128 | 100 | 16.0X | 478M
MC 1.7 | 1700 | Willamette | 128 | 100 | 17.0X | 478M
MC 1.4 | 1400 | Northwood | 256 | 100 | 14.0X | 478M
MC 1.8 | 1800 | Willamette | 128 | 100 | 18.0X | 478M
MC 1.5 | 1500 | Northwood | 256 | 100 | 15.0X | 478M
MC 2.0 | 2000 | Willamette | 128 | 100 | 20.0X | 478M
MC 1.6 | 1600 | Northwood | 256 | 100 | 16.0X | 478M
CM 353/J | 900 | Dothan | 1024 | 100 | 9.0X | 478M
MC 2.1 | 2100 | Willamette | 128 | 100 | 21.0X | 478M
CM 333 | 900 | Banias | 1024 | 100 | 9.0X | 478M
PM 900 (ULV) | 900 | Banias | 1024 | 100 | 9.0X | 478M
MC 1.7 | 1700 | Northwood | 256 | 100 | 17.0X | 478M
MC 2.2 | 2200 | Willamette | 128 | 100 | 22.0X | 478M
MC 1.8 | 1800 | Northwood | 256 | 100 | 18.0X | 478M
CM 373J | 1000 | Dothan | 1024 | 100 | 10.0X | 478M
MC 2.3 | 2300 | Willamette | 128 | 100 | 23.0X | 478M
PM 1.0 (ULV) | 1000 | Banias | 1024 | 100 | 10.0X | 478M
MC 2.4 | 2400 | Willamette | 128 | 100 | 24.0X | 478M
MC 2.0 | 2000 | Northwood | 256 | 100 | 20.0X | 478M
PM 723/J (ULV) | 1000 | Dothan | 2048 | 100 | 10.0X | 478M
PM 1.1 (LV) | 1100 | Banias | 1024 | 100 | 11.0X | 478M
CM 350/J | 1300 | Dothan | 512 | 100 | 13.0X | 478M
MC 2.2 | 2200 | Northwood | 256 | 100 | 22.0X | 478M
PM 1.2 (LV) | 1200 | Banias | 1024 | 100 | 12.0X | 478M
CM 320 | 1300 | Banias | 512 | 100 | 13.0X | 478M
MC 2.4 | 2400 | Northwood | 256 | 100 | 24.0X | 478M
PM 1.3 | 1300 | Banias | 1024 | 100 | 13.0X | 478M
PM 718 (LV) | 1300 | Banias | 1024 | 100 | 13.0X | 478M
CM 330 | 1400 | Banias | 512 | 100 | 14.0X | 478M
MC 2.5 | 2500 | Northwood | 256 | 100 | 25.0X | 478M
CM 360/J | 1400 | Dothan | 1024 | 100 | 14.0X | 478M
MC 2.6 | 2600 | Northwood | 256 | 100 | 26.0X | 478M
CM 340 | 1500 | Banias | 512 | 100 | 15.0X | 478M
PM 1.4 | 1400 | Banias | 1024 | 100 | 14.0X | 478M
PM 713 (ULV) | 1400 | Banias | 1024 | 100 | 14.0X | 478M
MC 2.7 | 2700 | Northwood | 256 | 100 | 27.0X | 478M
CM 370J | 1500 | Dothan | 1024 | 100 | 15.0X | 478M
MC D 325 | 2533 | Prescott | 256 | 133.3 | 19.0X | T/775
MC 2.8 | 2800 | Northwood | 256 | 100 | 28.0X | 478M
PM 1.5 | 1500 | Banias | 1024 | 100 | 15.0X | 478M
PM 705 | 1500 | Banias | 1024 | 100 | 15.0X | 478M
PM 733/J (ULV) | 1400 | Dothan | 2048 | 100 | 14.0X | 478M
PM 738/J (LV) | 1400 | Dothan | 2048 | 100 | 14.0X | 478M
MC D 330 | 2667 | Prescott | 256 | 133.3 | 20.0X | T/775
MC D 335 | 2800 | Prescott | 256 | 133.3 | 21.0X | T/775
PM 1.6 | 1600 | Banias | 1024 | 100 | 16.0X | 478M
PM 715 | 1500 | Dothan | 2048 | 100 | 15.0X | 478M
PM 758J (LV) | 1500 | Dothan | 2048 | 100 | 15.0X | 478M
MC D 340 | 2933 | Prescott | 256 | 133.3 | 22.0X | T/775
PM 1.7 | 1700 | Banias | 1024 | 100 | 17.0X | 478M
MC D 345 | 3066 | Prescott | 256 | 133.3 | 23.0X | T/775
MP4 2.8 | 2800 | Northwood | 512 | 133.3 | 21.0X | 478M
MP4 2.8 HT | 2800 | Northwood | 512 | 133.3 | 21.0X | 478M
PM 735 | 1700 | Dothan | 2048 | 100 | 17.0X | 478M
MC D 350 | 3200 | Prescott | 256 | 133.3 | 24.0X | T/775
PM 730/J | 1600 | Dothan | 2048 | 133.3 | 12.0X | 478M
MP4 518 | 2800 | Prescott | 1024 | 133.3 | 21.0X? | 478M
PM 745 | 1800 | Dothan | 2048 | 100 | 18.0X | 478M
PM 753J (ULV) | 1800 | Dothan | 2048 | 100 | 18.0X | 478M
MP4 3.0 | 3000 | Northwood | 512 | 133.3 | 22.5X | 478M
MP4 3.0 HT | 3000 | Northwood | 512 | 133.3 | 22.5X | 478M
PM 740/J | 1733 | Dothan | 2048 | 133.3 | 13.0X | 478M
MP4 532 | 3067 | Prescott | 1024 | 133.3 | 23.0X? | 478M
MP4 3.2 HT | 3200 | Northwood | 512 | 133.3 | 24.0X | 478M
MP4 538 | 3200 | Prescott | 1024 | 133.3 | 24.0X? | 478M
PM 750/J | 1867 | Dothan | 2048 | 133.3 | 14.0X | 478M
PM 755 | 2000 | Dothan | 2048 | 100 | 20.0X | 478M
PM 760/J | 2000 | Dothan | 2048 | 133.3 | 15.0X | 478M
MP4 552 | 3467 | Prescott | 1024 | 133.3 | 26.0X? | 478M
MP4 558 | 3600 | Prescott | 1024 | 133.3 | 27.0X? | 478M
PM 770/J | 2133 | Dothan | 2048 | 133.3 | 16.0X | 478M
PM 765 | 2400 | Dothan | 2048 | 100 | 24.0X | 478M
** There are several types of chips in the mobile sector: PM is the Pentium M, MP4 is the Mobile Pentium 4, CM is the Celeron M, and MC is the Mobile Celeron (P4 core).
*** Front Side Bus (FSB) Speeds are "quad pumped", so Intel's FSB numbers are four times the actual bus speed on Pentium 4/Celeron, Pentium M/Celeron M, and Itanium processors. Multipliers are based off the base bus speed, not the FSB value.

With the scaling clock speeds of the Pentium 4, not even the specially designed Mobile versions were really suited for use in laptops. (Of course, they were still used, but Intel had other plans.) Higher clock speeds mean higher power requirements as well as increased heat output, which makes it very difficult to improve battery life. In response to pressure from companies such as Transmeta, Intel commissioned a design team in Israel to put together a high-performance, low-power processor. The end result was the Pentium M. Where the push for high clock speeds was the driving force behind the NetBurst design, the Pentium M was targeted at meeting specific thermal requirements. While specific details are rather hard to come by - Intel is trying to protect its lead in the mobile space - the Pentium M appears to be a modified version of the venerable P6 architecture.

One of the improvements made to the P6 architecture was a large L2 cache that can be powered and accessed in 32K sections. This allows large portions of the cache to sit in a low-power "sleep" mode at any given time, so the chip gets the performance benefit of a large cache without incurring as much of the usual power increase. The L1 cache was also doubled from the PIII to 32K+32K (data and instruction). Floating-point performance was increased by doubling the MMX/SSE units - although this really only helps with SSE-optimized code - and there were a few other architectural changes. Overall, the Pentium M provides performance roughly equivalent to an Athlon of the same clock speed while requiring much less power. Battery life in laptops that use the Pentium M can often be 25 to 50 percent longer than in equivalent laptops that use Mobile Pentium 4, Mobile Celeron, or Mobile Athlon XP chips.

The length of the above chart should be an indication of how big the mobile market has become. One of the reasons for this growth is likely the cut-throat conditions that exist in the desktop CPU market. Intel charges a hefty premium for most of its mobile processors since, generally speaking, anyone looking for a high-performance laptop has more money to burn. This is what I call the "mobility tax": you should only buy a laptop if portability is a primary concern; otherwise, your money will go a lot further with a desktop system. Certainly, business users who rely on computers for presentations and work on the road will be willing to pay this so-called tax.

With the release of the Dothan core Pentium M chips, Intel has also switched to model numbers. Here, however, many factors influence the overall number: Ultra-Low Voltage processors running at lower clock speeds can end up rated higher than faster processors that require more power. This is supposed to reflect the relative desirability of certain features, as increased battery life can be more important to some people than raw performance. Of course, Intel specifically states that the model numbers are not measures of performance, but only the technically literate are likely to know this. In their own words: "Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details."

Itanium and Itanium 2 Processors

Itanium (Server)
Model | Clock (MHz) | Core | L2 (KB) | Bus (MHz) | Multiplier | Socket | L3 (KB)
Itanium | 733 | Merced | 96 | 66 | 11.0X | PAC-418 | 2048
Itanium | 733 | Merced | 96 | 66 | 11.0X | PAC-418 | 4096
Itanium | 800 | Merced | 96 | 66 | 12.0X | PAC-418 | 2048
Itanium | 800 | Merced | 96 | 66 | 12.0X | PAC-418 | 4096
Itanium 2 | 900 | McKinley | 256 | 100 | 9.0X | PAC-611 | 1536
Itanium 2 | 900 | McKinley | 256 | 100 | 9.0X | PAC-611 | 3072
Itanium 2 | 1000 | McKinley | 256 | 100 | 10.0X | PAC-611 | 1536
Itanium 2 | 1000 | McKinley | 256 | 100 | 10.0X | PAC-611 | 3072
Itanium 2 LV | 1000 | Deerfield | 256 | 100 | 10.0X | PAC-611 | 1536
Itanium 2 LV | 1500 | Deerfield | 256 | 100 | 15.0X | PAC-611 | 1536
Itanium 2 | 1300 | Madison | 256 | 100 | 13.0X | PAC-611 | 3072
Itanium 2 | 1400 | Madison | 256 | 100 | 14.0X | PAC-611 | 4096
Itanium 2 | 1500 | Madison | 256 | 100 | 15.0X | PAC-611 | 6144
*** Front Side Bus (FSB) Speeds are "quad pumped", so Intel's FSB numbers are four times the actual bus speed on Pentium 4/Celeron, Pentium M/Celeron M, and Itanium processors. Multipliers are based off the base bus speed, not the FSB value.

The Itanium is probably one of the least understood CPUs among computer enthusiasts. Given that the cheapest models still cost over $1000, that's not really surprising. These processors target the high-end corporate world. They are often used in massively parallel processing situations, and Itaniums are capable of working in up to 512-way SMP systems. Of course, that doesn't really explain what the Itanium is.

For starters, Itanium is the way Intel envisioned 64-bit computing, and it is built on a new instruction set dubbed IA-64 (Intel Architecture 64). IA-64 was a clean break from x86 legacy code, designed for the future. Its real competition isn't the Xeon or Opteron CPUs, although some mistakenly compare it with those processors. Itanium is meant to compete in the high-end corporate 64-bit computing world, going up against servers based on the IBM Power4/5, HP PA-RISC, Sun UltraSparc-III, and DEC Alpha. If none of those names ring a bell, that's not very surprising. The quad-processor IBM Power4 system that was used as the main server at a company I worked for (they had two units for redundancy) cost somewhere in the neighborhood of $500,000, and the RAID-5 array that provided data storage was another $500,000. Perhaps more important than the hardware was the service contract with IBM that helped guarantee everything stayed running. The cost of that support contract (covering dozens of such setups) was supposedly around $300 million a year!

The Alpha technology, interestingly enough, was purchased by Compaq, which later merged with HP. HP, meanwhile, worked with Intel on the design of the Itanium, with the intention of using it in place of PA-RISC once it was complete. I believe that some (all?) of the Alpha technology was later transferred to Intel, most likely to further the design of the Itanium processors. Compaq/HP has continued to support the Alpha for the past several years, but they haven't invested a lot of money into researching new iterations of the design. This makes sense, since HP is encouraging its enterprise customers to switch to its Itanium platforms. Recently, HP announced that the 1.3 GHz (I think that was the speed) EV7 version of the Alpha chip will be the last.

These systems are often referred to as "Big Tin" systems, and they're in a league of their own. They are frequently used in systems that process huge amounts of data - their 64-bit addressing allows the use of many gigabytes of physical RAM - and they are usually optimized for input/output functions. Of course, reliability and up-time are far more important than actual performance numbers, and often once a system has been built around a specific architecture, large corporations will stick with that hardware unless there is tremendous incentive to switch to something else. Switching usually consists of several years of coding, testing, debugging, and validation - a task not to be undertaken lightly, to be sure.

For the processor design, Intel continued its radical departure from accepted norms. Instead of a RISC or CISC approach, Intel went back to a technology that had been used in old mainframes and other computers of yore: VLIW (Very Long Instruction Word). Itanium is not a strict VLIW machine, though, as VLIW has some well-known drawbacks that Intel worked to overcome, and Intel chose to call its new approach EPIC (Explicitly Parallel Instruction Computing). In contrast to designs such as the Xeon and Opteron, which can issue up to three instructions per cycle, the Itanium 2 (forget the original Itanium for a minute) can issue eight instructions per clock, and unlike strict VLIW designs, future Itanium chips could further increase the issue width without requiring the code to be recompiled. In theory, then, a 1 GHz Itanium chip could perform roughly as fast as a 2.66 GHz Xeon/Opteron, and the 1.5 GHz Itanium 2 would be roughly as fast as a 4 GHz Xeon/Opteron. That's just theoretical performance, of course, and the overall system design plays a large role in determining how much of that potential is actually realized.
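
The arithmetic behind that comparison is simply peak issue width times clock speed. The sketch below (my own back-of-the-envelope calculation, using only the figures quoted in the paragraph above) makes the 1.5 GHz Itanium 2 versus 4 GHz Xeon/Opteron equivalence explicit; real code falls well short of either peak.

    /* Back-of-the-envelope issue-rate comparison using the figures above.
     * Peak rate = issue width x clock; real code never sustains either peak. */
    #include <stdio.h>

    int main(void)
    {
        double itanium2_ghz   = 1.5;
        double itanium2_issue = 8.0;    /* 8-wide EPIC          */
        double x86_issue      = 3.0;    /* ~3-wide Xeon/Opteron */

        double peak = itanium2_ghz * itanium2_issue;    /* 12 G instr/s */
        double equivalent_x86_ghz = peak / x86_issue;   /* 4.0 GHz      */

        printf("1.5 GHz Itanium 2 peak: %.0f billion instructions/s\n", peak);
        printf("A 3-issue x86 chip would need %.1f GHz to match that peak\n",
               equivalent_x86_ghz);
        return 0;
    }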

To help reach that potential, Itanium chips run off a 128-bit, quad-pumped system bus, using standard SDRAM (for the time being). The lower clock speeds combined with the wider bus make the SDRAM less of an issue than it is with high-speed desktop systems. The initial Itanium design, Merced, had four integer units (ALUs), two floating-point units (FPUs), three branch units (BRUs), two SIMD units (think MMX/SSE), and two load/store units - also called address generation units (AGUs) in other CPUs. The modified McKinley (and later) designs have six ALUs, three BRUs, two FPUs, one SIMD unit, two load units, and two store units - sort of like having four AGUs, except that they're more specialized. In addition, McKinley has roughly three times the cache bandwidth of Merced. Merced was also a six-issue design with a deeper pipeline (10 stages) and less memory bandwidth - a rather problematic combination. McKinley and later designs are eight-issue designs with shorter pipelines (8 stages) and more memory bandwidth. While Merced rarely made full use of its six-issue width, McKinley's enhancements help it come much closer to issuing the maximum eight instructions per clock.
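
For a sense of what that bus provides, a quick calculation (mine, based on the 128-bit, quad-pumped, 100 MHz figures given above rather than an official spec sheet) puts peak bandwidth at 6.4 GB/s:

    /* Rough peak bandwidth of a 128-bit, quad-pumped, 100 MHz bus.
     * These are the figures from the text, not an official spec sheet. */
    #include <stdio.h>

    int main(void)
    {
        double width_bytes = 128.0 / 8.0;   /* 16 bytes per transfer */
        double base_mhz    = 100.0;         /* base bus clock        */
        double pumps       = 4.0;           /* "quad pumped"         */

        double gbps = width_bytes * base_mhz * pumps / 1000.0;
        printf("Peak bus bandwidth: %.1f GB/s\n", gbps);   /* 6.4 GB/s */
        return 0;
    }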

That doesn't really tell a whole lot about the architecture, and I don't want to go much deeper than that right now. Suffice it to say that Itanium depends in large part on compiler technology to reach its potential, and Intel has apparently had more difficulty in that area than it initially anticipated, although lately this seems to be less of a problem. The initial Merced design was also flawed, if you couldn't tell from my description of the architecture above, but Itanium 2 goes a long way toward rectifying the problems.

Many have called the Itanium a failure - coming up with names like "Itanic" to describe the processor - especially now that AMD has launched the Opteron and Intel is following suit with x86-64 support. However, the two have very different goals, and in its target market segment, Itanium is still managing to compete. Needless to say, it helps that Intel has very deep pockets thanks to the income generated by its desktop and mobile processor divisions. Itanium may or may not live on in the long term, but in the short term Intel plans to keep it around for at least another three or four years, and it will likely stay around longer than that to support existing clients. Honestly, though, I doubt any of us will ever be running an IA-64 processor on our desktop systems.

Comments

  • JarredWalton - Wednesday, September 1, 2004 - link

    Jenand - thanks for the information. There are certainly some errors in the Itanium charts, but very few people seem to know much about the architecture, so I haven't gotten any corrections. Most of the future IA64 chips are highly speculative in terms of features.

    Incidentally, it looks like Tukwilla (and Dimona) will be 4 core designs, with motherboards supporting 4 CPUs, thus "16C" - or something like that. As for Fanwood, I really don't know much about it other than the name and some speculation that it *might* be the same as Madison9M. Or it might be a Dual Processor version of Madison, which is multi-processor.

    http://endian.net/details.asp?ItemNo=3835
    http://www.xbitlabs.com/news/cpu/display/200311101...

    At the very least, Fanwood will have more than just a 9 MB cache configuration, it's probably safe to say.
  • JarredWalton - Wednesday, September 1, 2004 - link

    If Prescott and Pentium M both use the exact same branch predictor, then yes, the Prescott would be more accurate than Banias. However, with the doubling of the cache size on Dothan, I can't imagine Intel would leave it with inferior branch prediction. So perhaps it goes something like this in terms of branch prediction accuracy:

    P6 cores
    Willamette/Northwood
    Banias
    Prescott
    Dothan

    Possibly with the last two on the same level.

    I'm still waiting to see if we can get pipeline stage information from Intel, but I have encountered several other sources online that refer to the Willamette/Northwood as having a 28-stage pipeline. Guess there's no use in beating a dead horse, though - either Intel will pass on the information and we can have a definitive answer, or it will remain an unknown. Don't hold your breath on Intel, though. :)
  • IntelUser2000 - Wednesday, September 1, 2004 - link

    "Intel claims that the combination of the loop detector and indirect branch predictor gives Centrino a 20% increase in overall branch prediction accuracy, resulting in a 7% real performance increase."

    Sure, but Prescott also has Pentium M's branch predictor enhancements in addition to the enhancements made to Willamette, while Pentium M didn't get Willamette's enhancements, just the indirect branch predictor.

    Yes it says 20% increase, but from what? PIII, P4? Prescott?
  • jenand - Tuesday, August 31, 2004 - link

    There are a few errors and some missing information on the IPF sheet:
    1) Fanwood will get 4M(?) L3 or so, not 9M. You probably mixed it up with its bigger brother Madison9M, both to be released soon.

    2)Foxton and Pelleston are code names for technologies used in Montecito, not CPU code names.

    3) Dimona and Tukwila are "pairs" (just like Madison/Deerfield, Madison9M/Fanwood and Montecito/Millington) both will be made on 45nm nodes and are scheduled for 2007. Montvale is probably a shrink of Montecito or Millington to the 65nm node and will probably be launched in 2006.

    4) Montecito and Millington will be made on 90nm and use the PAC-611 socket. The FSB of Montecito will be 100MHZ for compatibility reasons, but will also be introduced at a higher FSB (166MHz?) late in 2005.

    5) Fanwood will probably get 100MHz and 133MHz FSB, not 166MHz. Same goes for Millington.

    I hope it was helpful. Please note that I don't have any internal information; I only read the rumors.
  • JarredWalton - Tuesday, August 31, 2004 - link

    Heh... one last link. Hannibal discusses why the PM is able to have better branch prediction with a smaller BTB in his article about the PM. At the bottom of the following page is where he specifically discusses the improvements to the P4:

    http://castor.arstechnica.com/cpu/004/pentium-m/pe...

    And his summary: "Intel claims that the combination of the loop detector and indirect branch predictor gives Centrino a 20% increase in overall branch prediction accuracy, resulting in a 7% real performance increase. Of course, the usual caveats apply to these statistics, i.e. the increase in branch prediction accuracy and that increase's effect on real-world performance depends heavily on the type of code being run. Improved branch prediction gives the PM a leg up not only in terms of performance but in terms of power efficiency as well. Because of its improved branch prediction capabilities, the PM wastes less energy speculatively executing code that it will then have to throw away once it learns that it mispredicted a branch."

    He could be wrong, of course, but personally I trust his research on CPUs more than a lot of other sites - after all, he does *all* architectures, not just x86. Hopefully, Intel will provide me (Kristopher) with some direct answers. :)
  • JarredWalton - Tuesday, August 31, 2004 - link

    In case that last wasn't clear, I'm not saying the CPU detection is really that blatant, but if the CPU detection is required for accuracy, it *could* be that bad. Rumor, by the way, puts the Banias core at 14 or 15 stages, and the Dothan *might* add one more stage.
  • JarredWalton - Tuesday, August 31, 2004 - link

    Regarding Pentium M, I believe the difference to the branch prediction isn't merely a matter of size. It has a new indirect branch predictor, as well as some other features. Basically, P-M is designed for power usage first, and so they made a lot more elegant design decisions at times, whereas Northwood and Prescott are more of a brute force approach.

    As for the differences between various AT articles, it's probably worth pointing out that this is the first article I've ever written for Anandtech, so don't be too surprised that it has some differences of opinion. Who's right? It's difficult to say.

    As for the program mentioned in that thread, I downloaded it and ran it on my Athlon 64. You know what the result was? 13.75 to 13.97 cycles. Since a branch miss doesn't actually necessitate a flush of the entire pipeline, that would mean that it's estimating the length of the A64 as probably 15 or 16 stages - off by a factor of 33% or so. If it were off by that same amount on Prescott, that would put Prescott at [drumroll...] 23 stages.

    I've passed on some questions for Intel to Kristopher Kubicki, so maybe we can get the real poop. Until then, it's still a case of "nobody knows for sure". Estimating pipeline lengths based on a program that reports accurate results on P4 and Northwood cores is at best a guess, I would say.

    Incidentally, I looked at the source code, and while I haven't really studied it extensively, there is a CPU detection, so the mispredict penalty is calculated differently on P4, P6, and *other* architectures. Maybe it's okay, maybe it's not, but if accurate results are dependent on CPU detection, that sort of calls the whole thing into question.

    if CPU=P6 then printf("12 stages.\n")
    else if CPU=P4 then printf("10 stages.\n")
    else if....

    Hopefully, it *is* relatively accurate, but as I said, ~14 cycles mispredict penalty on an Athlon 64 is either incorrect, or AMD actually created a 15 stage pipeline and didn't tell anyone. :)
  • IntelUser2000 - Monday, August 30, 2004 - link

    Okay, I don't know further than that. But one question: since the old P4 article from Anandtech states a 10-stage pipeline for the P6 core, and Prescott is claimed to have 31 stages while you claim otherwise, that means there are individual errors on the SAME site. So whether Hannibal's site can be trusted is doubtful because of that fact too, no? Also, take a look at this link: http://www.realworldtech.com/forums/index.cfm?acti...

    I asked a guy in the forums about it and that link is about the responses to it.

    One example Hannibal's site may be wrong is this: http://arstechnica.com/cpu/004/prescott-future/pre...

    At the end of that link it says: "There's actually another reason why the Pentium M won't benefit as much from hyperthreading. The Pentium M's branch predictor is superior to Prescott's, so the Pentium M is less likely to suffer from instruction-related pipeline stalls than the Prescott. This improved branch prediction, in combination with its shorter pipeline, means improved execution efficiency and less of a need for something like hyperthreading."

    Now, we know Pentium M has a shorter pipeline than Prescott, but better branch prediction? I really think it's wrong, since one of the major improvements of BOTH Prescott and Pentium M in branch prediction is the improvement in indirect branch prediction. PLUS, Prescott and Northwood, I believe, have a bigger BTB buffer size - somewhere in the order of 8x - because Pentium M used the indirect branch prediction improvements to save die size, and putting in more buffer definitely doesn't coincide with that.
  • Fishie - Monday, August 30, 2004 - link

    This is a great summary of the processor cores. I would like to see the same thing done with video cards.
  • JarredWalton - Monday, August 30, 2004 - link

    #49 - Did you even read the links in post #44? Did you read post #44? Let's make it clear: the Willamette and Northwood cores were 20 stage pipelines coupled to an 8 stage prefetch/decode unit (which feeds into the trace cache). This much, we know for sure. The Prescott core appears to be 23 stages with the same (essentially) 8 stage prefetch/decode unit. So, you can call early P4 cores 20 stages, in which case Prescott is 23 stages, or you can call Prescott 31 stages, in which case early P4 cores were 28 stages.

    If you look at the chart in the link to Anandtech, notice how the P4 pipeline is lacking in fetch and decode stages? Anyway, there's nothing that says the AT chart you linked from Aug 2000 is the DEFINITIVE chart. People do make errors, and Intel hasn't been super forthcoming about their pipelines. I'll give you a direct link to where Hannibal talks about the P6 and P4 pipelines - take it up with him if you must:

    http://arstechnica.com/cpu/004/pentium-1/pentium-1...

    Synopsis: In the AT picture, the P6 pipeline has 2 fetch and 2 decode stages, while Hannibal describes it as 3.5 BTB/Fetch stages and 2.5 Decode stages.

    http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-...

    Here, the P4 and G4e architectures are compared, but if you read this page, it explains the trace cache and how it affects things. Specifically: "Only when there's an L1 cache miss does that top part of the front end kick in in order to fetch and decode instructions from the L2 cache. The decoding and translating steps that are necessitated by a trace cache miss add another eight pipeline stages onto the beginning of the P4's pipeline, so you can see that the trace cache saves quite a few cycles over the course of a program's execution."
    -----------------------
    Further reading:

    http://episteme.arstechnica.com/eve/ubb.x?a=tpc&am...

    The comments in the "Discuss" section of the article contain further elaboration by Hannibal on the Prescott: "The 31 stages came from the fact that if you include the trace cache in the pipeline (which Intel normally doesn't and I didn't here) then the P4's pipeline isn't 20 stages but 28 (at least I think that's the number). So if you add three extra stages to 28 you get 31 total stages."

    The problem is, Intel simply isn't coming out and directly stating what the facts are. It *could* be that Prescott is really 31 stages (as Intel has said) plus another 8 to 10 stages of fetch/decode logic, putting the "total" length at 39 to 41 stages. However, given the clockspeed scaling - rather, the lack thereof - it would not be surprising to have it "only" be 23 stages plus 8 fetch/decode stages. After all, the die shrink to 90 nm should have been able to push the Northwood core to at least 4 GHz, which seems to be what the Prescott is hitting as well.

    Unless you actually work for Intel and can provide a definitive answer? I, personally, would love some charts from Intel documenting all of the stages of both the initial NetBurst pipeline as well as the Prescott pipeline. (Maybe I should mention this to Anand...?)
