The benefits of TLP...

It is clear that the Itanium core has a big advantage in the area of threading and power dissipation constraints. If you are not convinced, the dual core Itanium Montecito (90 nm process) has no less than 1.72 billion transistors, but it is still able to consume less than 130 W. Compare this with the 300 million transistor Power 5+, which consumes about 170 W on a 90 nm SOI process.

And there is more. X86 CPUs are limited to a maximum of 3 decoded, issued and retired instructions. This might increase to 4 next year. But compared to the best x86 design today - the AMD Opteron -, the Itanium does about 60% more work per clock cycle in integer, and about 115% more work per cycle in floating point. Don't get me wrong, these numbers are no indication of superiority of any kind - clock speed matters just as much. But what these numbers tell you is that x86 designs are less brainiac in nature, and that the x86 ISA limits the ILP much more than IA-64 (we will give more proof in a later article). x86 designs prefer the speed-demon approach with deeper pipelines.

The Itanium can sustain 6 instructions per cycle and can issue up to 11 instructions. A lot of this potential goes to waste, but it also means that the potential gains for Multi-Threading techniques are much higher. While the Pentium 4 Xeon was unable to show any significant performance advantage due to SMT in our server tests[4], Montecito is claimed to be 30% faster in typical database loads, thanks to a Coarse Multi-Threading technique that is less advanced than Hyper Threading.


Itanium's future...

There is no doubt about it, the delay of Montecito and Intel's poor execution is a serious blow to the Itanium family. The Montecito based Itanium 2 has the features that it needs to be competitive in the server world for the next years: dual core, multi-threading and virtualization (Silverdale). Without these features, Itanium is hopelessly behind the competition, especially the dual core Xeon, Opteron and Power 5+. The Xeon and Opteron might still be a bit behind on the RAS features, but this can change quickly and is only important for a small part of the market.

If we ignore Intel's poor execution during the past months and the economic realities, and focus on the architecture, it is clear, however, that the Itanium has time on its side and is most likely the architecture with the highest potential.

Although the Itanium is capable of sustaining a theoretical maximum of 6 instructions and executing up to 11 instructions, and despite its massive register set, it uses fewer transistors for its core than all competitors. The main disadvantage is that it needs much more cache and instruction fetch width, but the disadvantage of needing more cache diminish as process technology gets better (smaller). To improve performance, the Itanium needs much bigger caches than its competitors, but this adds very little to the overall power consumption. As superscalar RISCs in x86 competitors increase their instruction execution width, they need to upgrade the Out-Of-Order buffers and more importantly, increase the complexity of the schedulers. This leads to a much higher complexity and power consumption.

As the focus shifts to Thread Level Parallellism, the Itanium's small cores make it easier to use more cores without increasing the power consumption too much. Montecito will be the living proof of this. The Itanium is also wider than the competition, which results in bigger benefits from threading techniques.

While Itanium may not be very popular in the hardware enthusiast community, it is definitely an architecture that, from an academic and technical point of view, deserves a lot more attention. We'll delve deeper in upcoming articles.


References

[1] The Quest for More Processing Power, Part One: "Is the single core CPU doomed?"
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2343

[2] Hyper-Threading Technology Architecture and Microarchitecture
http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm

[3] Ace's hardware Specmine
http://www.aceshardware.com/SPECmine/

[4] Linux database server CPU comparison
http://www.anandtech.com/IT/showdoc.aspx?i=2447

The limits of TLP...
Comments Locked

43 Comments

View All Comments

  • ravedave - Thursday, November 10, 2005 - link

    Who cares about TLP in the consumer space? Nothign can take advantage of it, HT showed that. I think whoever comes out with the best individual core next will do some sweet buisiness...
  • eastvillager - Thursday, November 10, 2005 - link

    That train is Opteron. All aboard!

    Itanium had a window where it could've shown, Intel missed it by a mile. Well, on the bright side, they killed HP's Unix Server business at the same time. I remember when HP announced they were stopping r&d on new PA-RISC processors and were switching to Itanium.

  • ElFenix - Thursday, November 10, 2005 - link

    the story of how intel killed it for a processor that was about a a decade and a half, if not more, ahead of its time? and mostly because it wasn't invented at intel, but rather was bought as part of the dec compaq hp debacle (ooo, inept management again!). that was about the most promising processor on the planet for a while, but now its buried.
  • Zebo - Thursday, November 10, 2005 - link

    Wow Johan I don't even care about Itanium but your prose kept me all the way through. :) Excellent write up.
  • WhoBeDaPlaya - Wednesday, November 9, 2005 - link

    Interesting read, especially after having just talked with two engineers from the Itanium team (they were from the HP side) at Fort Collins. *Keeping fingers crossed for career prospects there* :D
  • Matthias - Wednesday, November 9, 2005 - link

    "The Itanium is also wider than the competition, which results in bigger benefits from threading techniques."

    I don't buy that. Current Montecito's implementation of TLP only uses "Switch-on-Event Multithreading" which is a another name for Course Grain MT. At any specific time there is only one thread being executed per Montecito core. How can then a wider cpu benefit more than a more narrow cpu? You cannot use the unused execution units with instruction from another thread. So, where is the advantage of having more execution units available?

    The multithreading approach in Montecito helps hiding latencies but not doing more in parallel. You can't execute two instructions from different threads at the same time! The P4 can do so, although its capabilities in parallel instruction execution is limited by its rather narrow design.

    Of course, we are talking about one specific EPIC implementation. Nobody can't guarantee that with the next EPIC microarchitecture there will be an SMT in favor of a SoE-MT implementation. In this case the above statement would be correct, although I doubt that we will ever see an SMT implementation for Itanium. The static instruction issue used in Itanium does not fit very well with the rather dynamic issuing introduced with SMT.
  • IntelUser2000 - Wednesday, November 9, 2005 - link

    SoEMT hides memory latency, which is in a way taking advantage of increased ILP Itanium has since memory latency may limit the benefit.

    Also, it seems the performance among various apps vary as MUCH as opinions about the chip vary :). Some people really like it, while some hate it.

    About the performance, I can't find the link. There was an IDF presentation on PC World(found by google) and showed relative Montecito performance. It was around 20% faster per clock in integer, but they were very ambiguous about it. Citing MT, higher frequency, more cache, and dual cores. But for all that, 20% is so little. There was another IDF presentation about Foxton Technology, and showed Montecito benchmarks on TPC-C, which from numbers was almost 25% faster at same clock, half the sockets(same number of cores) and same platform.

    Intel and HP usually introduce better compilers at the same time, so I think its reasonable to expect 20-25% per clock. One other significant improvement on Montecito will be that it will have another shift unit, making the total two, along with others like more instructions and some little improvements here and there.


    Montecito has x86 compatibility unit taken out, using the software based IA32-EL.
  • IntelUser2000 - Wednesday, November 9, 2005 - link

    Itanium 2 Madison cache latency:
    32KB L1: 1 cycle
    256KB L2: 6 cycles
    9MB L3: 14 cycles

    Montecito:
    32KB L1: 1 cycle
    1MB L2I, and 256KB L2D: 6 cycles(same as Madison)
    24MB L3: 14 cycles(same as madison)
  • stephenbrooks - Wednesday, November 9, 2005 - link

    I like this bit.

    --[HP and Intel have stated that the Itanium 2 core, including the L2-cache, has about 40 million transistors. If we subtract the L2 cache, we end up with about 26 transistors,]--
  • fic - Wednesday, November 9, 2005 - link

    http://www.pasemi.com/">http://www.pasemi.com/
    These are PowerPC chips, but not from IBM. From the website: "dual-core device, operates at 2GHz with typical power dissipation in the range of 5 to 13 watts". According to articles SPECint is >1000 per core and SPECfp >2000 per core.

Log in

Don't have an account? Sign up now