The limits of TLP...

Hardware engineers do not believe in massive superscalar CPUs anymore[2]. Increasing ILP a tiny bit requires exponentially bigger out-of-order hardware, which exponentially requires more power.

TLP and multi-core is hot and trendy. But the same problems that were true for squeezing ILP out of hardware are true about TLP in software. On the exception of naturally parallel applications such as rendering and database servers, getting more and more threads out of the majority of hardware will require exponentially more programming and debugging time. There are more high TLP designs such as Sun's Niagara that increase throughput, but also response time. And while the numbers of users that can update and read the database matters, the response time of the database can be important too. For example, while OLTP loads consist of many relatively simple SQL selects, Decision Support Systems (DSS or OLAP) fire off very complex queries with a high response time. To offer a good "data mining" experience, the single thread performance must not be neglected. The same can be said for some HPC and many typical workstation applications.

So, while designs that sacrifice ILP completely on the altar of TLP, such as Sun's Niagara, they may well be very popular in some markets such as webserving. Single thread performance is going to make the difference between the different multi-core solutions.

Here, the Itanium can leverage two big advantages: higher ILP and smaller cores. This last comment might seem ridiculous, given that the current Itanium Madison is about 432 mm2 large. However, if we look at the core (L1 inclusive), there about 25 million transistors that take about 80 mm2 die space. Note that the pictures below have been scaled and resized to reflect the relative proportions of the different cores.


Madison 9 MB die

However, about 10-15 mm² is used for the x86 compatibility, excess baggage that most competitors don't have. Let us take a closer look at the Itanium Madison core.


Madison core parts

As you can see, the IA 32 or x86 is a big chunk of the die. If we ignore this x86 part and compare the Itanium core with the 0.13µ 190 mm2 large Opteron, we can see that the Opteron core is about as large as the Itanium core.


Opteron die, rotated 90°

In the next part, we study this even more closely.


Itanium: a slim figure

We compare the different CPU cores in the table below. We consider the L1-cache being part of the core, but we list the number of transistors separately to be clear. To keep it fair, we compare all CPU using the same process technologies, 0.13µ, with the exception of the Intel Xeon.

There is no doubt that a 0.13µ Xeon MP is no match for either of the IBM Power 5, the Itanium or Opteron with both Spec FP2000 and SpecInt around 1200 for a Xeon MP 3 GHz. The Xeon MP 0.13µ is also a 32 bit CPU, so it does not belong in the list below.

The reason why I listed the 90nm Xeon (DP) is to show how complex an x86 architecture can get when it has 64 bit and an extremely deep pipeline in a quest for high clock speeds, which must negate the low ILP. With more than 50 million transistors, it is no wonder that Xeon "Irwindale/Nocona" is the hottest CPU (per core) of the bunch despite being manufactured in a more advanced process.

Why do we use Spec FP2000 and Spec Int2000[3]? It is true that these benchmarks are close to meaningless when you want to compare server or workstation performance in the real world. Spec FP is a decent predictor of HPC/scientific performance, but fails to predict Digital Content Creation performance despite containing a few OpenGL benchmarks. The reason why we use these two benchmarks is that currently, we are evaluating the CPU architecture, its future potential and current compiler performance, and not the complete system.

CPU feature Intel Itanium "Madison" Intel Xeon P4 Irwindale IBM Power 5 (+) AMD Opteron
Process technology 0.13 µ CU 0.09 µ CU 0.13 µ CU SOI 0.13 µ CU SOI
Die Size (mm2) 432 130 389 190
Number of transistors (Million) 592 169 276 106
Number of transistors (Million) L1-cache 1.8 +/- 6 5.3 7.7
Number of transistors (Million) L2 Cache 14 113 107 57
Number of transistors (Million) L3 Cache 510 0 off die 0
Number of transistors (Million) Tag (L2 + L3) 23 4 33 4
Number of transistors (Million) Core 20 50 35* 40
Pure logic core (-L1) 18 44 30 32
Top clock speed 1600 3800 1900 2600
Best Spec FP2000 Score 2712 1898 2839 1955
Best Spec Int2000 Score 1590 1810 1470 1713
TDP 107 W 115-130 W 200 W** <95W
* per core, two cores: about 70 million transistors
** for two cores


To calculate the cache sizes, we used the following formula:
Cache size expressed in Bytes x 9 bits per byte (8 + 1 bit ECC/parity protection) x 6 Transistors per bit (SRAM)
We calculated the Power 5 core as follows. The PowerPC 970FX, aka Apple's G5, is essentially a Power 4 core with Altivec, but without the L3 cache tag. If we subtract the number of transistors of the L2 cache (28 million) from the total number of transistors in the PowerPC 970, we end up with about 30 million transistors. The Power 5 core is a bit more complex (SMT and a few tweaks have been added), so we estimate it at about 35 million transistors.

HP and Intel have stated that the Itanium 2 core, including the L2-cache, has about 40 million transistors. If we subtract the L2 cache, we end up with about 26 million transistors, which still includes the x86 compatibility transistors (about 4 million) and the L2-tag. It wouldn't be fair to include the x86 transistors when we compare the merits of EPIC with x86 and RISC.


The Itanium core is twice as small as the Xeon's

So, we end up with 20 to 22 million transistors for the core, which is truly remarkable for a CPU that is considered the fastest FP CPU out there (together with the Power 5), and is better than all the RISC players in Spec Int. The 0.13 micron Opteron (2.6 GHz) beats the best Itanium by about 10%; still is remarkable how good 20 million transistors can perform.

And, what about the Pentium M? Well, the core is about 25 million transistors, but it is pretty hard to compare this CPU with Itanium as the Pentium M is optimised for low power consumption. If we keep it fair and compare the two cores using the same process technology, the Pentium M isn't even close when it comes to performance. A 2 GHz Pentium M scores a respectable 1500 in SpecInt, but trails far behind with a specfp score of about 1000.

The CPU industry in three words The benefits of TLP...
Comments Locked

43 Comments

View All Comments

  • lifeguard1999 - Wednesday, November 9, 2005 - link

    Johan has written a good article on the Itanium and its advantages. However, just because something is good from an engineers point of view, does not mean that it will be a market success. There is the business side of the equation that is just as important, and I will be looking forward to future articles on this.

    I live and play in the HPC world. Historically, this world has been small and based on (for lack of a better word), big-iron chips such as those found in the Cray C-90 (early 1990's technology) to the Cray X1E (today's tech). In the 1990's people clustered together "commodity" PCs (commonly called Beowulf Clusters) which culminated in 1997 Gordon Bell Prize at SC97 for a cluster of 16 Intel Pentium Pros (200 MHz). Today, these cluster-based supercomputers are everywhere (Cray sells a XT3 based on Opterons). The advantage of the cluster-based supercomputer is price/performance, or said another way: cost.

    And that is where this ties back into the business case. Can Itanium compete based on cost? Now cost is more than just how much to produce the physical chip. There are systems adminstrator costs, cooling costs, user-needs-to-learn-to-program-it costs, etc. Cooling concerns are coming to the forefront now as supercomputers may need a dedicated power plant in the near future. Imagine, if you will, how much heat 10,000 Opterons can produce and much electricity it consumes (we only have 4096).

    SGI was a big seller of supercomputers based on the MIPS chips (low power, low performance, but easy to use). They transitioned over to the Itanium chips and have had a successful run of supercomputers called the Altix. The problem is that anyone can buy a Opteron cluster supercomputer for much less cost than an Itanium supercomputer. While this is not the only reason for the decline of SGI and its recent delisting from the NYSE (inept management is the main reason) it is a contributing factor.

    That leaves HP as the largest seller of Itaniums. Did I mention inept management two sentences back? Maybe I should mention it here again.

    Itanium may be a great architecture, and it may survive and thrive. Right now however, it appears that there are dark days ahead.
  • highlandsun - Wednesday, November 9, 2005 - link

    There are still a variety of problems that the Altix design can handle more easily than any cluster-based approach. I'm not convinced that the Altix architecture is tied to Itanium, though. It'd be cool to see an Altix-like machine based on Opterons.
  • ksherman - Wednesday, November 9, 2005 - link

    seems like a really good article! Too bad most of it goes over my head :(
  • ceefka - Wednesday, November 9, 2005 - link

    Me too, I can finally make some sense of what Itanium is all about. It may have potential in a technical sense but until it comes at an affordable price it doesn't stand a chance imho. It's not always the best tech that sells best, Intel knows ;-)
  • xbdestroya - Wednesday, November 9, 2005 - link

    Nice Article. I personally don't see too much of a future for Itanium with the environment it's presently operating in coupled with Intel's missteps, but I feel that for all the heat Itanium: 'The Project' often takes, the architecture itself is unduly maligned.

    Plus, I love to see articles analysing architectures other than the bread-and-butter x86 ones we're used to seeing. Some more on EPIC, Power... Sun/Fujitsu chips - maybe some NEC - let's spice things up!

    There was one problem though with the article on the last page though:

    "But the best x86 design - the AMD Opteron - does about 60% less work per clock cycle in integer, and about 115% less work per cycle in floating point than the Itanium."

    How does something do 115% *less* work per cycle? Obviously not.
  • JohanAnandtech - Wednesday, November 9, 2005 - link

    Mathematiques was never my best course. :-) Indeed the Itanium does 60% more integer work and 115% more FP.
  • Calin - Thursday, November 10, 2005 - link

    So the Opteron does just 62.5% integer work and 46.5% floating point work per clock cycle compared to Itanium.
    I learned this mostly after introduction of VAT in economy
  • snorre - Wednesday, November 9, 2005 - link

    You write:

    "it is clear, however, that the Itanium has time on its side and is most likely the architecture with the highest potential."

    No, that is not true by any standards. I've tested Itanium systems from day one, including several compilers and development tools and I don't see any high potential with this platform. It's over-expensive, under-performing and quite frankly a big flop.

    Don't keep this pace maker going any further, please let it die in peace. Some good ideas just dosen't work well in practice, and EPIC is just another like them.
  • Starglider - Wednesday, November 9, 2005 - link

    Here's a scenario I like to imagine. After many years of research, marketing and general toil Intel claim that their new Itanium-5 chip will finally be the one to popularise the platform. The day before the launch, AMD announce their new x86-64+++ architecture, which extends x86 (again) to allow a scheduling/cache/decoding-hints metadata stream interleaved with the main instruction stream. The new design combines the code density and dynamic optimisation of x86 with all the static optimisation power and execution width of Itanium (but done better becasue AMD have learned from Intel's mistakes), is binary compatible with legacy applications at full speed, and has AMD's onboard memory and PCI express controllers. AMD own 90% of the high-end space by the end of the year and Itanium is finally killed off. ;)
  • dexvx - Wednesday, November 9, 2005 - link

    You have no idea what you're talking about do you?

    I'd like to see a scheduler that is both dynamic and static at the same time.

Log in

Don't have an account? Sign up now