The CPU industry in three words

If we were to summarize the current trends in the CPU industry in three words, those three words would be "TLP, caches and power consumption"[2]. TLP gets exploited more and more with the introduction of multi-threaded and multi-core CPUs. Caches get bigger and bigger, as they don't increase power consumption, but rather save power by preventing costly accesses to the memory controller. Power consumption determines which performance-increasing techniques get the spotlight: wasteful techniques such as Dynamic Multi-Threading, double-pumped ALUs and extremely deep Out-of-Order (OOO) windows have fallen out of grace because they consume too much power.

It is pretty clear that these three trends - bigger caches, power consumption as a deciding factor in CPU design, and TLP - will continue to influence CPU architectures heavily in the coming years. How would this be beneficial to an EPIC CPU?


The Cache story

Bigger caches are exactly what the EPIC CPU needs. One of the biggest disadvantages of the EPIC CPU is code inflation. When we compiled some source code on the Itanium back in 2001, the resulting (64-bit) code was about 2.5 to 3 times bigger than the (32-bit) x86 code. That is not really surprising: an IA-64 128-bit bundle contains 3 instructions, or about 43 bits per instruction. An x86 instruction can be from 1 to 17 bytes long, but is on average a little less than 3 bytes or 24 bits long. That means that x86 instructions are, on average, almost twice as compact. There are other reasons why EPIC code is more bloated than x86 code: because of restrictions on the types of instructions that can be placed in each slot of an IA-64 bundle, and because every bundle has the same fixed length, the compiler must insert NOPs into slots that it cannot fill, and those useless instructions take up space.
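To make the density gap concrete, here is a rough back-of-the-envelope sketch using only the averages quoted above (the ~24-bit x86 average is an estimate that varies by workload, and NOP padding would inflate the IA-64 side further):

```python
# Back-of-the-envelope code density comparison, using the article's averages.
IA64_BUNDLE_BITS = 128      # one IA-64 bundle
IA64_INSNS_PER_BUNDLE = 3   # three instructions per bundle
X86_AVG_BITS = 24           # a little less than 3 bytes per x86 instruction

ia64_bits = IA64_BUNDLE_BITS / IA64_INSNS_PER_BUNDLE  # ~42.7 bits/instruction
ratio = ia64_bits / X86_AVG_BITS                      # ~1.8x

print(f"IA-64: {ia64_bits:.1f} bits per instruction")
print(f"IA-64 code is roughly {ratio:.1f}x the size of x86 code, before NOPs")
```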

The whole complex x86 architecture was built to conserve RAM, as RAM was very expensive in the days when x86 was developed. In more recent years, this has worked in x86's favor, as it doesn't need the big caches that RISC and EPIC CPUs need. A RISC instruction is (at least) 32 bits long, or at least 33% bigger than the average x86 instruction.

Currently, it seems that EPIC compilers produce code that is - roughly estimated - at least twice as big as AMD64 or EM64T code. This means that if you want to compare the Itanium instruction cache to the Opteron instruction cache, you have to divide the Itanium instruction cache by two.

So, the effective L1 instruction cache of 8 KB (16 KB/2) looks tiny compared to the massive 64 KB of the Opteron. If we assume that instructions and data each take up half of the shared L2, the Itanium 2's L2 is effectively 192 KB big (128 KB/2 of instructions + 128 KB of data), which is small compared to the Opteron's 1 MB and the Xeon's 2 MB L2. That is the reason why Montecito has a 1 MB L2 instruction cache and a 256 KB L2 data cache. This will increase IPC significantly: cache misses are deadly for the in-order Itanium.
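A minimal sketch of this "effective size" arithmetic, assuming the rough 2x code bloat figure and a 50/50 instruction/data split in the shared L2 (both are assumptions, not measurements):

```python
# Effective Itanium 2 cache sizes, in KB, under the ~2x code bloat rule of
# thumb from the text; the 50/50 L2 split is an assumption.
CODE_BLOAT = 2

l1i = 16          # Itanium 2 L1 instruction cache (KB)
l2_shared = 256   # Itanium 2 shared L2 (KB)

effective_l1i = l1i / CODE_BLOAT                 # 8 KB vs. the Opteron's 64 KB
l2_insn = l2_data = l2_shared / 2                # half instructions, half data
effective_l2 = l2_insn / CODE_BLOAT + l2_data    # 64 KB + 128 KB = 192 KB

print(f"Effective L1I: {effective_l1i:.0f} KB")
print(f"Effective L2:  {effective_l2:.0f} KB")
```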

Time is on the side of the Itanium. As new process technologies have been introduced, cache sizes have grown very quickly over the past years, without introducing extra cost or high latency. No competitor has the advantages that Itanium has:
  1. As caches get bigger, Itanium benefits more than the x86 competition. x86 CPUs target higher clock speeds, which makes it more difficult for them to use large, low-latency caches.
  2. Intel has mastered like no other the skill of producing very dense and fast cache structures.
In 2001, the Itanium had only 96 KB of L2 on the die. In 2002, the Itanium "McKinley" had a 256 KB L2 cache and a 1.5 MB L3 cache. In 2003, the Itanium 2 had 256 KB of L2 and 6 MB of L3 cache on the die, which was increased to 9 MB in 2004. The fact that Itanium needs much larger caches than an x86 CPU has morphed from a catastrophic problem (Merced's integer performance) into a minor nuisance (Itanium 2 "Madison"). There is no reason to believe that this trend won't continue.

Comments

  • lifeguard1999 - Wednesday, November 9, 2005 - link

    Johan has written a good article on the Itanium and its advantages. However, just because something is good from an engineer's point of view does not mean that it will be a market success. The business side of the equation is just as important, and I will be looking forward to future articles on this.

    I live and play in the HPC world. Historically, this world has been small and based on (for lack of a better word) big-iron chips, such as those found in the Cray C-90 (early 1990's technology) through the Cray X1E (today's tech). In the 1990's, people clustered together "commodity" PCs (commonly called Beowulf clusters), which culminated in the 1997 Gordon Bell Prize at SC97 for a cluster of 16 Intel Pentium Pros (200 MHz). Today, these cluster-based supercomputers are everywhere (Cray sells the XT3, based on Opterons). The advantage of the cluster-based supercomputer is price/performance, or said another way: cost.

    And that is where this ties back into the business case. Can Itanium compete based on cost? Cost is more than just how much it takes to produce the physical chip. There are system administrator costs, cooling costs, users-need-to-learn-to-program-it costs, etc. Cooling concerns are coming to the forefront now, as supercomputers may need a dedicated power plant in the near future. Imagine, if you will, how much heat 10,000 Opterons can produce and how much electricity they consume (we only have 4096).

    SGI was a big seller of supercomputers based on the MIPS chips (low power, low performance, but easy to use). They transitioned over to the Itanium chips and have had a successful run of supercomputers called the Altix. The problem is that anyone can buy an Opteron cluster supercomputer for much less than an Itanium supercomputer. While this is not the only reason for the decline of SGI and its recent delisting from the NYSE (inept management is the main reason), it is a contributing factor.

    That leaves HP as the largest seller of Itaniums. Did I mention inept management two sentences back? Maybe I should mention it here again.

    Itanium may be a great architecture, and it may survive and thrive. Right now, however, it appears that there are dark days ahead.
  • highlandsun - Wednesday, November 9, 2005 - link

    There are still a variety of problems that the Altix design can handle more easily than any cluster-based approach. I'm not convinced that the Altix architecture is tied to Itanium, though. It'd be cool to see an Altix-like machine based on Opterons.
  • ksherman - Wednesday, November 9, 2005 - link

    seems like a really good article! Too bad most of it goes over my head :(
  • ceefka - Wednesday, November 9, 2005 - link

    Me too. Still, I can finally make some sense of what the Itanium is all about. It may have potential in a technical sense, but until it comes at an affordable price, it doesn't stand a chance, imho. It's not always the best tech that sells best; Intel knows ;-)
  • xbdestroya - Wednesday, November 9, 2005 - link

    Nice article. I personally don't see much of a future for the Itanium, given the environment it's presently operating in coupled with Intel's missteps, but I feel that for all the heat Itanium 'The Project' often takes, the architecture itself is unduly maligned.

    Plus, I love to see articles analysing architectures other than the bread-and-butter x86 ones we're used to seeing. Some more on EPIC, Power... Sun/Fujitsu chips - maybe some NEC - let's spice things up!

    There was one problem, though, with the article on the last page:

    "But the best x86 design - the AMD Opteron - does about 60% less work per clock cycle in integer, and about 115% less work per cycle in floating point than the Itanium."

    How does something do 115% *less* work per cycle? Obviously, it can't.
  • JohanAnandtech - Wednesday, November 9, 2005 - link

    Mathematics was never my best course. :-) Indeed, the Itanium does 60% more integer work and 115% more FP work.
  • Calin - Thursday, November 10, 2005 - link

    So the Opteron does just 62.5% of the integer work and 46.5% of the floating-point work per clock cycle compared to the Itanium (see the quick check below).
    I learned this kind of math mostly after the introduction of VAT in the economy.
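
    For anyone double-checking the arithmetic, here is a quick sketch (the 1.60 and 2.15 multipliers come from the corrected 60%/115% figures in Johan's reply above):

    ```python
    # "Itanium does X% more work per cycle" inverts to "the Opteron does
    # 1/(1+X) of Itanium's work per cycle" (figures from the thread above).
    itanium_int = 1.60  # Itanium: 60% more integer work per cycle
    itanium_fp = 2.15   # Itanium: 115% more floating-point work per cycle

    print(f"Opteron integer work per cycle: {1 / itanium_int:.1%}")  # 62.5%
    print(f"Opteron FP work per cycle:      {1 / itanium_fp:.1%}")   # 46.5%
    ```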
  • snorre - Wednesday, November 9, 2005 - link

    You write:

    "it is clear, however, that the Itanium has time on its side and is most likely the architecture with the highest potential."

    No, that is not true by any standard. I've tested Itanium systems from day one, including several compilers and development tools, and I don't see any high potential in this platform. It's overly expensive, under-performing and quite frankly a big flop.

    Don't keep this pacemaker going any further; please let it die in peace. Some good ideas just don't work well in practice, and EPIC is just another one of them.
  • Starglider - Wednesday, November 9, 2005 - link

    Here's a scenario I like to imagine. After many years of research, marketing and general toil, Intel claims that their new Itanium-5 chip will finally be the one to popularise the platform. The day before the launch, AMD announces their new x86-64+++ architecture, which extends x86 (again) to allow a scheduling/cache/decoding-hints metadata stream interleaved with the main instruction stream. The new design combines the code density and dynamic optimisation of x86 with all the static optimisation power and execution width of Itanium (but done better, because AMD has learned from Intel's mistakes), is binary compatible with legacy applications at full speed, and has AMD's onboard memory and PCI Express controllers. AMD owns 90% of the high-end space by the end of the year and Itanium is finally killed off. ;)
  • dexvx - Wednesday, November 9, 2005 - link

    You have no idea what you're talking about do you?

    I'd like to see a scheduler that is both dynamic and static at the same time.
