The CPU industry in three words

If we were to summarize the current trends in the CPU industry in three words, those three words would be "TLP, caches and power consumption"[2]. Thread-level parallelism (TLP) is exploited more and more with the introduction of multi-threaded and multi-core CPUs. Caches get bigger and bigger as they don't increase power consumption, but rather save power by preventing costly accesses to the memory controller. Power consumption determines which performance-increasing techniques get the spotlight: wasteful techniques such as Dynamic Multi-Threading, double-pumped ALUs and extremely deep Out Of Order (OOO) windows have fallen out of grace as they consume too much power.

It is pretty clear that these three trends - bigger caches, power consumption being a deciding factor in CPU design, and TLP - will continue to influence CPU architectures heavily in the coming years. How would this benefit an EPIC CPU?

The Cache story

Bigger caches are what the EPIC CPU needs. One of the biggest disadvantages of the EPIC CPU is code inflation. When we compiled some (64-bit) source code on the Itanium back in 2001, the code was about 2.5 to 3 times bigger than (32-bit) x86 code. That is not really surprising: an IA-64 128-bit bundle contains 3 instructions, or roughly 43 bits per instruction. An x86 instruction can be from 1 to 15 bytes long, but is on average a little less than 3 bytes or 24 bits long. That means that x86 instructions are on average almost 2 times more compact. There are other reasons why EPIC code is more bloated than x86. Because of restrictions on the types of instructions that can be placed in each slot of an IA-64 bundle, and because every bundle has the same fixed length, the compiler must insert NOPs, useless instructions that only take up space, in the slots that it cannot fill.
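To make the NOP padding concrete, here is a minimal sketch of a greedy bundle packer. It is only an illustration under loud assumptions: it uses a simplified subset of the IA-64 bundle templates, ignores stop bits, the MLX template and all dependency rules, and the names `fill` and `pack_bundles` are hypothetical. A real EPIC compiler schedules far more cleverly.

```python
# Simplified subset of IA-64 bundle templates (one letter per slot type:
# M = memory, I = integer, F = floating point, B = branch).
# Assumption: stop bits, the MLX template and dependency rules are ignored.
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MMB", "MFB", "MBB", "BBB"]

def fill(template, stream, start):
    """Fill one bundle in program order; return (slots, instructions consumed)."""
    slots, pos = [], start
    for slot_type in template:
        if pos < len(stream) and stream[pos] == slot_type:
            slots.append(stream[pos])
            pos += 1
        else:
            # The next instruction does not fit this slot: pad with a NOP.
            slots.append("nop")
    return slots, pos - start

def pack_bundles(stream):
    """Greedily pick, for each bundle, the template that consumes the most instructions."""
    bundles, pos = [], 0
    while pos < len(stream):
        slots, used = max((fill(t, stream, pos) for t in TEMPLATES),
                          key=lambda result: result[1])
        bundles.append(slots)
        pos += used
    return bundles

if __name__ == "__main__":
    # Three memory operations followed by one integer operation.
    for bundle in pack_bundles(list("MMMI")):
        print(bundle)
```

In this toy model, four instructions need two bundles (six slots), so a third of the slots hold NOPs. Two back-to-back floating point instructions fare even worse, because no template in the subset has two F slots.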

The whole complex x86 architecture was built to conserve RAM space, as RAM was very expensive in the days when x86 was developed. In more recent years, this feature has helped x86, as it didn't need the big caches that RISC and EPIC CPUs need. A RISC instruction is (at least) 32 bits long, or at least 33% bigger than an average x86 instruction.
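The size comparison can be checked with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted in the text (a 128-bit bundle with 3 slots, an estimated x86 average of about 24 bits, and 32-bit RISC instructions); the variable names are ours.

```python
# Figures taken from the text; the ~24-bit x86 average is the article's estimate.
BUNDLE_BITS = 128
SLOTS_PER_BUNDLE = 3
X86_AVG_BITS = 24
RISC_BITS = 32

ia64_bits = BUNDLE_BITS / SLOTS_PER_BUNDLE

print(f"IA-64: {ia64_bits:.1f} bits per instruction slot")
print(f"IA-64 vs x86: {ia64_bits / X86_AVG_BITS:.2f}x")
print(f"RISC vs x86: {RISC_BITS / X86_AVG_BITS:.2f}x")
```

The raw encoding alone gives a factor of about 1.78 for IA-64 and 1.33 for RISC. That is before any NOP padding is counted, which helps explain why the measured inflation of 2.5 to 3 times is higher than the encoding ratio suggests.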

Currently, it seems that EPIC compilers produce code that is, roughly estimated, at least twice as big as AMD64 or EM64T code. This means that if you want to compare the Itanium instruction cache to the Opteron instruction cache, you have to divide the size of the Itanium instruction cache by two.

So, the Itanium 2's L1 instruction cache of effectively 8 KB (16 KB/2) looks tiny compared to the massive 64 KB of the Opteron. If we assume that data and instructions take up about the same space in the shared L2, the Itanium 2's L2 is effectively 192 KB (128 KB/2 I + 128 KB D), which is small compared to the Opteron's 1 MB and the Xeon's 2 MB L2. That is the reason why Montecito has a 1 MB L2 instruction cache and a 256 KB L2 data cache. This will increase IPC significantly: cache misses are deadly for the in-order Itanium.
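The "x86-equivalent" cache sizes above follow from one simple rule: halve the instruction portion and leave the data portion alone. A small sketch of that rule, assuming the rough 2x code inflation estimated in the text (the function name `x86_equivalent_kb` is ours):

```python
def x86_equivalent_kb(instr_kb, data_kb=0, code_inflation=2.0):
    """Scale only the instruction portion; data is assumed to take the same space."""
    return instr_kb / code_inflation + data_kb

# Itanium 2 L1 instruction cache: 16 KB
print(x86_equivalent_kb(16))          # 8.0
# Itanium 2 shared 256 KB L2, assumed to be split 50/50 between code and data
print(x86_equivalent_kb(128, 128))    # 192.0
# Montecito: 1 MB L2 instruction cache plus 256 KB L2 data cache
print(x86_equivalent_kb(1024, 256))   # 768.0
```

Under the same assumption, Montecito's split L2 is worth roughly 768 KB in x86 terms, four times the Itanium 2's effective 192 KB.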

Time is on the side of the Itanium. As new process technologies have been introduced, cache sizes have grown very quickly over the past years, without introducing extra cost or high latency. No competitor has the advantages that Itanium has:
  1. As caches get bigger, Itanium benefits more than the x86 competition. x86 CPUs target higher clock speeds and, as such, it is more difficult for them to use large low latency caches.
  2. Intel has mastered, like no other, the skill of producing very dense and fast cache structures.
In 2001, the Itanium had only 96 KB of L2 on the die. In 2002, the Itanium "McKinley" had a 256 KB L2 cache and a 1.5 MB L3 cache. In 2003, the Itanium 2 had 256 KB of L2 and 6 MB of L3 cache on the die, which was increased to 9 MB in 2004. The fact that Itanium needs much larger caches than an x86 CPU has morphed from a catastrophic problem (Merced's integer performance) into a minor nuisance (Itanium 2 Madison). There is no reason to believe that this trend won't continue.



Comments

  • ravedave - Thursday, November 10, 2005

    Who cares about TLP in the consumer space? Nothing can take advantage of it, HT showed that. I think whoever comes out with the best individual core next will do some sweet business...
  • eastvillager - Thursday, November 10, 2005

    That train is Opteron. All aboard!

    Itanium had a window where it could've shone, Intel missed it by a mile. Well, on the bright side, they killed HP's Unix Server business at the same time. I remember when HP announced they were stopping R&D on new PA-RISC processors and were switching to Itanium.

  • ElFenix - Thursday, November 10, 2005

    the story of how intel killed it for a processor that was about a decade and a half, if not more, ahead of its time? and mostly because it wasn't invented at intel, but rather was bought as part of the dec compaq hp debacle (ooo, inept management again!). that was about the most promising processor on the planet for a while, but now it's buried.
  • Zebo - Thursday, November 10, 2005

    Wow Johan I don't even care about Itanium but your prose kept me all the way through. :) Excellent write up.
  • WhoBeDaPlaya - Wednesday, November 09, 2005

    Interesting read, especially after having just talked with two engineers from the Itanium team (they were from the HP side) at Fort Collins. *Keeping fingers crossed for career prospects there* :D
  • Matthias - Wednesday, November 09, 2005

    "The Itanium is also wider than the competition, which results in bigger benefits from threading techniques."

    I don't buy that. Montecito's current implementation of TLP only uses "Switch-on-Event Multithreading", which is another name for Coarse Grain MT. At any specific time there is only one thread being executed per Montecito core. How then can a wider CPU benefit more than a narrower CPU? You cannot fill the unused execution units with instructions from another thread. So, where is the advantage of having more execution units available?

    The multithreading approach in Montecito helps to hide latencies, but it does not do more in parallel. You can't execute two instructions from different threads at the same time! The P4 can do so, although its capability for parallel instruction execution is limited by its rather narrow design.

    Of course, we are talking about one specific EPIC implementation. Nobody can guarantee that the next EPIC microarchitecture won't have an SMT implementation instead of SoE-MT. In that case the above statement would be correct, although I doubt that we will ever see an SMT implementation for Itanium. The static instruction issue used in Itanium does not fit very well with the rather dynamic issuing introduced with SMT.
  • IntelUser2000 - Wednesday, November 09, 2005

    SoEMT hides memory latency, which is in a way taking advantage of the increased ILP that Itanium has, since memory latency may otherwise limit the benefit.

    Also, it seems the performance among various apps varies as MUCH as opinions about the chip vary :). Some people really like it, while some hate it.

    About the performance, I can't find the link. There was an IDF presentation on PC World (found via Google) that showed relative Montecito performance. It was around 20% faster per clock in integer, but they were very ambiguous about it, citing MT, higher frequency, more cache, and dual cores. But for all that, 20% is so little. There was another IDF presentation about Foxton Technology that showed Montecito benchmarks on TPC-C, which from the numbers was almost 25% faster at the same clock, with half the sockets (same number of cores) and the same platform.

    Intel and HP usually introduce better compilers at the same time, so I think it's reasonable to expect 20-25% per clock. One other significant improvement on Montecito will be that it will have another shift unit, making the total two, along with others like more instructions and some little improvements here and there.

    Montecito has the x86 compatibility unit taken out, using the software based IA32-EL instead.
  • IntelUser2000 - Wednesday, November 09, 2005

    Itanium 2 Madison cache latency:
    32KB L1: 1 cycle
    256KB L2: 6 cycles
    9MB L3: 14 cycles

    Montecito cache latency:
    32KB L1: 1 cycle
    1MB L2I and 256KB L2D: 6 cycles (same as Madison)
    24MB L3: 14 cycles (same as Madison)
  • stephenbrooks - Wednesday, November 09, 2005

    I like this bit.

    --[HP and Intel have stated that the Itanium 2 core, including the L2-cache, has about 40 million transistors. If we subtract the L2 cache, we end up with about 26 transistors,]--
  • fic - Wednesday, November 09, 2005

    These are PowerPC chips, but not from IBM. From the website: "dual-core device, operates at 2GHz with typical power dissipation in the range of 5 to 13 watts". According to articles SPECint is >1000 per core and SPECfp >2000 per core.
