Updated CPU Cheatsheet - Seven Years of Covert CPU Operations

Author: Jarred Walton

by Jarred Walton on August 28, 2004 9:00 AM EST


AMD Cheat Sheet


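AMD Processors
Core Name	Processor	Socket	Clock Speed (MHz)	L2 Cache	Transistors (millions)	Process (nm)	Die Size (mm²)	Bus Speed (MHz)	64-bit	SMP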
Argon (K7)	Athlon	Slot A	500-700	512K	22 + cache	250	184	100
Pluto (K75)	Athlon	Slot A	550-850	512K	22 + cache	180	102	100
Orion (K75)	Athlon	Slot A	900-1000	512K	22 + cache	180	102	100
Spitfire	Duron	462	600-950	64K	25	180	100	100
Morgan	Duron	462	900-1300	64K	25.2	180	106	100
Thunderbird	Athlon "B"	462	650-1400	256K	37	180	117	100
Thunderbird	Athlon "C"	462	1000-1400	256K	37	180	117	133
Palomino	Athlon XP/M	462	850-1733	256K	37.5	180	129	100/133
Palomino	Athlon MP	462	1000-1733	256K	37.5	180	129	100/133		1-2
Thoroughbred A	Athlon XP	462	1467-1833?	256K	37.5	130	80	133
Thoroughbred B	Athlon XP/M	462	1200-2133	256K	37.5	130	84	133
Thoroughbred B	Athlon XP	462	2083-2250	256K	37.5	130	84	166
Thoroughbred B	Athlon MP	462	1667-2133	256K	37.5	130	84	133		1-2
Barton	Athlon XP/M	462	1467-2133	512K	54.3	130	101	133
Barton	Athlon XP	462	1833-2167	512K	54.3	130	101	166
Barton	Athlon XP	462	2100-2200	512K	54.3	130	101	200
Barton	Athlon MP	462	2133	512K	54.3	130	101	166		1-2
Applebred	Duron	462	1400-1800	64K	25.2*	130	84*	133
Thorton	Athlon XP	462	1667-2067	256K	37.5*	130	101*	133
Thoroughbred B	Sempron	462	1500-2000+	256K	37.5	130	84	166
Sledgehammer	Athlon FX	940	2200-???	1024K	105.9	130 SOI	193	200	Y
Sledgehammer	Opteron	940	1400-2400	1024K	105.9	130 SOI	193	200	Y	1-8
Sledgehammer	Athlon FX	939	2400-???	1024K	105.9	130 SOI	193	200	Y
Clawhammer	Athlon 64	754	1800-2200(?)	512K	105.9	130 SOI	193	200	Y
Clawhammer	Athlon 64	754	2000-2400(?)	1024K	105.9	130 SOI	193	200	Y
Newcastle	Athlon 64	754	1800-2600(?)	512K	68.5	130 SOI	144	200	Y
Newcastle	Athlon 64	939	2200-2600(?)	512K	68.5	130 SOI	144	200	Y
San Diego	Athlon FX	939	2600-???	1024K	105.9(?)	90 SOI	114(?)	200	Y
Paris	Sempron	754	1800-???	256K	~50(?)	130 SOI	118	200	N
Venus	Opteron 1xx	940				90 SOI		200?	Y
Troy	Opteron 2xx	940				90 SOI		200?	Y	1-2
Athens	Opteron 8xx	940				90 SOI		200?	Y	1-8
Odessa	Athlon 64 M?	754?		512K		130 SOI		200?	Y
Winchester	Athlon 64	939		512K	68.5(?)	90 SOI	83(?)	200	Y
Dublin	Athlon XP-M	462			37.5	130 SOI	128	200?	N
Newark	Athlon 64-M LP	754?				90 SOI		200?	Y
Lancaster	Athlon 64 M	754?				90 SOI		200?	Y
Georgetown	Athlon XP M	462/754?				90 SOI		200?	N?
Sonora	Athlon XP-M LP	462/754?				90 SOI		200?	N?
Denmark	Opteron 1xx	940				90 SOI		200?	Y
Italy	Opteron 2xx	940				90 SOI		200?	Y	1-2
Egypt	Opteron 8xx	940				90 SOI		200?	Y	1-8
Toledo	Dual Core FX	939				90 SOI		200?	Y	2C
Palermo	Sempron (?)	939 (?)		256K?	~50(?)	90 SOI	62(?)	200	N?
Oakville	Athlon 64 Mobile	754?		512K?		90 SOI		200?	Y
Victoria	Sempron (?)	754 (?)		256K?	~50(?)	90 SOI	62(?)	200	N?
* Die size and/or transistor count is based on a larger CPU core with a portion of the die disabled.
** Various steppings/sources list different die sizes.
*** The bus speed of all Athlons/Durons is double-pumped, but the CPU multiplier is based on the listed speed.
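
To make the last footnote concrete, here is a minimal Python sketch of how the listed bus speed relates to the core clock and to the effective transfer rate. The function names are hypothetical, and the 2200 MHz / 200 MHz pairing is taken from the fastest Barton row in the table; the multiplier is derived here and is not a value listed in the chart.

```python
# Footnote sketch: the Athlon/Duron bus is double-pumped (two transfers per
# clock), but the CPU multiplier applies to the listed (base) bus speed.

def cpu_multiplier(core_mhz: float, listed_bus_mhz: float) -> float:
    """Multiplier implied by a core clock and the listed bus speed."""
    return core_mhz / listed_bus_mhz


def effective_bus_rate(listed_bus_mhz: float) -> float:
    """Effective rate (million transfers per second) of a double-pumped bus."""
    return 2 * listed_bus_mhz


# Barton at 2200 MHz on the 200 MHz bus (values from the table).
print(cpu_multiplier(2200, 200))   # 11.0
print(effective_bus_rate(200))     # 400
```

In other words, a chip listed with a 200 MHz bus would typically be marketed with a "400 MHz" front side bus, while its multiplier is still computed against 200 MHz.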

A few notes to clarify the information. The stated die sizes and transistor counts for the Applebred and Thorton reflect the fact that these processors are Thoroughbred and Barton cores, respectively, with half of the L2 cache disabled, which is why they have a single asterisk next to them. There have been reports of hacking the Thorton processors and turning them into full Barton CPUs, but considering the insignificant cost difference these days, it's probably not worth worrying about. AMD plans on discontinuing the Barton soon anyway, and will use the old Thoroughbred core for the Socket A Sempron chips.

Transistor counts on Paris, Victoria, and Palermo are likely off, but it remains to be seen how AMD actually configures these chips. Early Athlon 64 512K cache chips for socket 754 were Clawhammer cores with half the cache disabled, but the newer models (i.e. 3200+ at 2.2 GHz with 512K, 3400+ 2.4 GHz 512K, and 3700+ 2.6 GHz with 512K) appear to be actual Newcastle cores. The same could very well happen with the Paris cores, where initial shipments are "downgraded" Newcastle cores, and later versions may physically remove the ~18.7 million transistors used in the L2 cache. Regardless, values on these cores should be taken with a grain of salt.
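
The ~18.7 million figure and the ~50 million estimate for Paris can be reproduced from the table, assuming that the whole difference between the 1024K Clawhammer and the 512K Newcastle is L2 cache. That assumption is an inference, not something the chart states. A rough Python sketch, with hypothetical variable names:

```python
# Transistor counts in millions, taken from the table above.
CLAWHAMMER_1024K = 105.9
NEWCASTLE_512K = 68.5

# Assumption: the whole difference between the two cores is 512K of L2 cache.
per_512k = CLAWHAMMER_1024K - NEWCASTLE_512K   # about 37.4 million
per_256k = per_512k / 2                        # about 18.7 million

# Paris has 256K of L2, treated here as a Newcastle minus 256K of cache.
paris_estimate = NEWCASTLE_512K - per_256k     # about 49.8 million

print(f"L2 cost per 256K: {per_256k:.1f}M")
print(f"Paris estimate:   {paris_estimate:.1f}M")
```

The result lines up with the ~50 million listed for Paris, Victoria, and Palermo, which is one more reason to treat those entries as estimates.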

Unreleased processors will likely change from these current estimates, and question marks indicate best guess data at present. If you notice any errors or if you have additional information on forthcoming processors, let us know in the comments section or email.

Take note of the Toledo, Denmark, Italy, and Egypt cores; the 2C in the last column stands for dual core. All four models use the same basic core and should come out around the same time in early 2005. Whether they launch as planned remains to be seen, and precise details about the internal layout are not yet clear - recent news suggests that each core will have its own L2 cache. Dual core is best described as SMP on a single chip, and while on the subject of SMP, please note that all of the Athlon XP processors could unofficially support multi-processor configurations. 2-way SMP was almost certain to work, but AMD never verified any of these CPUs in such a configuration. While it would not be prudent for a business to take such a risk, quite a few enthusiasts saved themselves a lot of money by putting XP chips into SMP motherboards instead of spending the extra money on MP chips.

The basic core of the Athlon, from the Pluto all the way through the latest Newcastle and Paris processors, has changed very little since its inception. It has a 10 stage integer pipeline and a 15 stage floating point pipeline, with three identical Arithmetic/Logic Units (ALUs), Address Generation Units (AGUs), and Floating Point Units (FPUs). The FPUs also handle MMX, 3DNow!/+, and SSE/SSE2 support. Opteron increased the length to 12/17 stages, in addition to bringing 64-bit support. Future versions of the Athlon 64 will likely lengthen the pipeline past the current 12/17 stages in order to increase clock speeds, but I doubt that AMD will ever show the hubris of Intel by creating a 31 stage pipeline - at least, not on any iteration of the Athlon architecture. Long pipelines are especially problematic given the increasing leakage power that comes with high clock speeds and ever smaller process technology. Until those issues are resolved, I think it's safe to say that AMD's integer pipelines will stay in the 10 to 15 stage range.

Update: One reader was good enough to send a link to AMD's site where they actually list the Opteron as being a 12/17 design. (Thanks Tom!) Finding any good details on the Intel and AMD sites can be a major chore, most likely due to the level of competition between the companies as well as their size. There's a rule somewhere that the larger a company gets, the less informative and helpful their web site becomes! For those that want the link, here's the Opteron information. That means that all Athlon 64 designs are also 12/17, of course. The Denmark, Italy, and Egypt CPUs are also dual core, it appears, and their entries have been updated to reflect this. (The old roadmap didn't include that information.)

74 Comments


JarredWalton - Wednesday, September 1, 2004
Jenand - thanks for the information. There are certainly some errors in the Itanium charts, but very few people seem to know much about the architecture, so I haven't gotten any corrections. Most of the future IA64 chips are highly speculative in terms of features.

Incidentally, it looks like Tukwila (and Dimona) will be 4 core designs, with motherboards supporting 4 CPUs, thus "16C" - or something like that. As for Fanwood, I really don't know much about it other than the name and some speculation that it *might* be the same as Madison9M. Or it might be a Dual Processor version of Madison, which is multi-processor.

http://endian.net/details.asp?ItemNo=3835
http://www.xbitlabs.com/news/cpu/display/200311101...

At the very least, Fanwood will have more than just a 9 MB cache configuration, it's probably safe to say.

JarredWalton - Wednesday, September 1, 2004
If Prescott and Pentium M both use the exact same branch predictor, then yes, the Prescott would be more accurate than Banias. However, with the doubling of the cache size on Dothan, I can't imagine Intel would leave it with inferior branch prediction. So perhaps it goes something like this in terms of branch prediction accuracy:

P6 cores
Willamette/Northwood
Banias
Prescott
Dothan

Possibly with the last two on the same level.

I'm still waiting to see if we can get pipeline stage information from Intel, but I have encountered several other sources online that refer to the Willamette/Northwood as having a 28 stage pipeline. Guess there's no use in beating a dead horse, though - either Intel will pass on information and we can have a definite, or it will remain an unknown. Don't hold your breath on Intel, though. :)

IntelUser2000 - Wednesday, September 1, 2004
"Intel claims that the combination of the loop detector and indirect branch predictor gives Centrino a 20% increase in overall branch prediction accuracy, resulting in a 7% real performance increase."

Sure, but Prescott also has Pentium M's branch predictor enhancements in addition to the enhancements made to Willamette, while Pentium M didn't get Willamette's enhancements, just the indirect branch predictor.

Yes it says 20% increase, but from what? PIII, P4? Prescott?

jenand - Tuesday, August 31, 2004
There are a few errors and some missing information on the IPF sheet:
1) Fanwood will get 4M(?) L3 or so, not 9M. You probably mixed it up with its bigger brother Madison9M, both to be released soon.

2) Foxton and Pelleston are code names for technologies used in Montecito, not CPU code names.

3) Dimona and Tukwila are "pairs" (just like Madison/Deerfield, Madison9M/Fanwood and Montecito/Millington); both will be made on 45nm nodes and are scheduled for 2007. Montvale is probably a shrink of Montecito or Millington to the 65nm node and will probably be launched in 2006.

4) Montecito and Millington will be made on 90nm and use the PAC-611 socket. The FSB of Montecito will be 100MHz for compatibility reasons, but it will also be introduced at a higher FSB (166MHz?) late in 2005.

5) Fanwood will probably get 100MHz and 133MHz FSB, not 166MHz. Same goes for Millington.

I hope it was helpful. Please note that I don't have any internal information; I only read the rumors.

JarredWalton - Tuesday, August 31, 2004
Heh... one last link. Hannibal discusses why the PM is able to have better branch prediction with a smaller BTB in his article about the PM. At the bottom of the following page is where he specifically discusses the improvements to the P4:

http://castor.arstechnica.com/cpu/004/pentium-m/pe...

And his summary: "Intel claims that the combination of the loop detector and indirect branch predictor gives Centrino a 20% increase in overall branch prediction accuracy, resulting in a 7% real performance increase. Of course, the usual caveats apply to these statistics, i.e. the increase in branch prediction accuracy and that increase's effect on real-world performance depends heavily on the type of code being run. Improved branch prediction gives the PM a leg up not only in terms of performance but in terms of power efficiency as well. Because of its improved branch prediction capabilities, the PM wastes less energy speculatively executing code that it will then have to throw away once it learns that it mispredicted a branch."

He could be wrong, of course, but personally I trust his research on CPUs more than a lot of other sites - after all, he does *all* architectures, not just x86. Hopefully, Intel will provide me (Kristopher) with some direct answers. :)

JarredWalton - Tuesday, August 31, 2004
In case that last wasn't clear, I'm not saying the CPU detection is really that blatant, but if the CPU detection is required for accuracy, it *could* be that bad. Rumor, by the way, puts the Banias core at 14 or 15 stages, and the Dothan *might* add one more stage.

JarredWalton - Tuesday, August 31, 2004
Regarding Pentium M, I believe the difference in branch prediction isn't merely a matter of size. It has a new indirect branch predictor, as well as some other features. Basically, P-M is designed for power usage first, and so they made a lot more elegant design decisions at times, whereas Northwood and Prescott are more of a brute force approach.

As for the differences between various AT articles, it's probably worth pointing out that this is the first article I've ever written for Anandtech, so don't be too surprised that it has some differences of opinion. Who's right? It's difficult to say.

As for the program mentioned in that thread, I downloaded it and ran it on my Athlon 64. You know what the result was? 13.75 to 13.97 cycles. Since a branch miss doesn't actually necessitate a flush of the entire pipeline, that would mean that it's estimating the length of the A64 as probably 15 or 16 stages - off by a factor of 33% or so. If it were off by that same amount on Prescott, that would put Prescott at [drumroll...] 23 stages.
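
The back-of-the-envelope reasoning in that paragraph can be written out as follows. This is only a sketch of the arithmetic in Python: the 16 stage estimate, the 12 stage Athlon 64 figure, and the 31 stages claimed for Prescott all come from the discussion here, while the function name is made up for illustration.

```python
def rescale_stage_count(claimed_stages: float, measured: float, actual: float) -> float:
    """Scale a claimed pipeline depth by the error seen on a known CPU."""
    error_factor = measured / actual   # how much the tool overestimates
    return claimed_stages / error_factor


# Athlon 64: the tool implies about 16 stages, AMD documents 12.
overestimate = 16 / 12
print(f"overestimate: {overestimate - 1:.0%}")

# Apply the same correction to the 31 stages claimed for Prescott.
prescott = rescale_stage_count(31, measured=16, actual=12)
print(f"rescaled Prescott depth: {prescott:.2f} stages")
```

This assumes the tool is wrong by the same proportion on both architectures, which is itself a guess.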

I've passed on some questions for Intel to Kristopher Kubicki, so maybe we can get the real poop. Until then, it's still a case of "nobody knows for sure". Estimating pipeline lengths based off of a program that reports accurate results on P4 and Northwood cores is at best a guess, I would say.

Incidentally, I looked at the source code, and while I haven't really studied it extensively, there is a CPU detection, so the mispredict penalty is calculated differently on P4, P6, and *other* architectures. Maybe it's okay, maybe it's not, but if accurate results are dependent on CPU detection, that sort of calls the whole thing into question.

if (cpu == P6) printf("12 stages.\n");
else if (cpu == P4) printf("10 stages.\n");
/* else if ... and so on for every detected CPU */

Hopefully, it *is* relatively accurate, but as I said, ~14 cycles mispredict penalty on an Athlon 64 is either incorrect, or AMD actually created a 15 stage pipeline and didn't tell anyone. :)

IntelUser2000 - Monday, August 30, 2004
Okay, I don't know anything further than that. But one question: since the old P4 article from Anandtech states that the P6 core has a 10 stage pipeline, and Prescott is claimed to have 31 stages and you claim otherwise, that shows there are individual errors on the SAME site. So whether Hannibal's site can be trusted is doubtful for the same reason, no? Also, take a look at this link: http://www.realworldtech.com/forums/index.cfm?acti...

I asked a guy in the forums about it and that link is about the responses to it.

One example where Hannibal's site may be wrong is this: http://arstechnica.com/cpu/004/prescott-future/pre...

At the end of that link it says: "There's actually another reason why the Pentium M won't benefit as much from hyperthreading. The Pentium M's branch predictor is superior to Prescott's, so the Pentium M is less likely to suffer from instruction-related pipeline stalls than the Prescott. This improved branch prediction, in combination with its shorter pipeline, means improved execution efficiency and less of a need for something like hyperthreading."

Now, we know the Pentium M has a shorter pipeline than Prescott, but better branch prediction? I really think it's wrong. One of the major branch prediction improvements in BOTH Prescott and Pentium M is better indirect branch prediction. PLUS, Prescott (and Northwood, I believe) has a bigger BTB, somewhere on the order of 8x, because Pentium M used the indirect branch prediction improvements to save die size, and adding more buffer definitely doesn't fit with that.

Fishie - Monday, August 30, 2004
This is a great summary of the processor cores. I would like to see the same thing done with video cards.

JarredWalton - Monday, August 30, 2004
#49 - Did you even read the links in post #44? Did you read post #44? Let's make it clear: the Willamette and Northwood cores were 20 stage pipelines coupled to an 8 stage prefetch/decode unit (which feeds into the trace cache). This much, we know for sure. The Prescott core appears to be 23 stages with the same (essentially) 8 stage prefetch/decode unit. So, you can call early P4 cores 20 stages, in which case Prescott is 23 stages, or you can call Prescott 31 stages, in which case early P4 cores were 28 stages.

If you look at the chart in the link to Anandtech, notice how the P4 pipeline is lacking in fetch and decode stages? Anyway, there's nothing that says the AT chart you linked from Aug 2000 is the DEFINITIVE chart. People do make errors, and Intel hasn't been super forthcoming about their pipelines. I'll give you a direct link to where Hannibal talks about the P6 and P4 pipelines - take it up with him if you must:

http://arstechnica.com/cpu/004/pentium-1/pentium-1...

Synopsis: In the AT picture, the P6 pipeline has 2 fetch and 2 decode stages, while Hannibal describes it as 3.5 BTB/Fetch stages and 2.5 Decode stages.

http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-...

Here, the P4 and G4e architectures are compared, but if you read this page, it explains the trace cache and how it affects things. Specifically: "Only when there's an L1 cache miss does that top part of the front end kick in in order to fetch and decode instructions from the L2 cache. The decoding and translating steps that are necessitated by a trace cache miss add another eight pipeline stages onto the beginning of the P4's pipeline, so you can see that the trace cache saves quite a few cycles over the course of a program's execution."
-----------------------
Further reading:

http://episteme.arstechnica.com/eve/ubb.x?a=tpc&am...

The comments in the "Discuss" section of the article contain further elaboration by Hannibal on the Prescott: "The 31 stages came from the fact that if you include the trace cache in the pipeline (which Intel normally doesn't and I didn't here) then the P4's pipeline isn't 20 stages but 28 (at least I think that's the number). So if you add three extra stages to 28 you get 31 total stages."

The problem is, Intel simply isn't coming out and directly stating what the facts are. It *could* be that Prescott is really 31 stages (as Intel has said) plus another 8 to 10 stages of fetch/decode logic, putting the "total" length at 39 to 41 stages. However, given the clockspeed scaling - rather, the lack thereof - it would not be surprising to have it "only" be 23 stages plus 8 fetch/decode stages. After all, the die shrink to 90 nm should have been able to push the Northwood core to at least 4 GHz, which seems to be what the Prescott is hitting as well.
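
The competing stage counts in this thread differ only in whether the roughly 8 fetch/decode stages in front of the trace cache are counted. A small Python sketch of that bookkeeping, using the numbers quoted above (the names are hypothetical, and none of these figures are confirmed by Intel):

```python
FRONT_END_STAGES = 8  # prefetch/decode stages ahead of the trace cache

# Pipeline depth counted from the trace cache onward.
core_stages = {"Willamette/Northwood": 20, "Prescott": 23}

for name, stages in core_stages.items():
    total = stages + FRONT_END_STAGES
    print(f"{name}: {stages} stages, or {total} including fetch/decode")

# Alternative reading: Prescott's 31 stages already exclude fetch/decode,
# which would add another 8 to 10 stages on top.
low, high = 31 + 8, 31 + 10
print(f"Prescott (alternative reading): {low} to {high} stages")
```

Either way of counting is internally consistent; the only mistake is comparing a 20 stage Northwood against a 31 stage Prescott, which mixes the two conventions.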

Unless you actually work for Intel and can provide a definitive answer? I, personally, would love some charts from Intel documenting all of the stages of both the initial NetBurst pipeline as well as the Prescott pipeline. (Maybe I should mention this to Anand...?)