The benefits of TLP...

It is clear that the Itanium core has a big advantage when threading has to fit within power dissipation constraints. If you are not convinced, consider that the dual core Itanium Montecito (90 nm process) has no fewer than 1.72 billion transistors, yet it consumes less than 130 W. Compare this with the 300 million transistor Power 5+, which consumes about 170 W on a 90 nm SOI process.
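To put those figures side by side, here is a minimal sketch that works out the power budget per transistor. It uses only the numbers quoted above (130 W is an upper bound for Montecito, 170 W an approximation for the Power 5+); the function name is our own. Keep in mind that this is a crude metric: most of Montecito's transistors sit in its caches, which switch far less often than core logic.

```python
# Transistor counts and power figures as quoted in the text above.
# 130 W is an upper bound and 170 W an approximation, so the results are rough.
chips = {
    "Itanium Montecito": {"transistors": 1.72e9, "watts": 130.0},
    "Power 5+": {"transistors": 300e6, "watts": 170.0},
}


def nanowatts_per_transistor(transistors, watts):
    """Average power per transistor in nanowatts (hypothetical helper)."""
    return watts / transistors * 1e9


for name, chip in chips.items():
    print(f"{name}: {nanowatts_per_transistor(**chip):.0f} nW per transistor")
```

By this measure the Power 5+ spends roughly seven to eight times as much power per transistor, which says more about the cache-heavy transistor budget of the Itanium than about either core on its own.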

And there is more. x86 CPUs are limited to a maximum of 3 decoded, issued and retired instructions per clock cycle. This might increase to 4 next year. But compared to the best x86 design today, the AMD Opteron, the Itanium does about 60% more work per clock cycle in integer code and about 115% more work per cycle in floating point. Don't get me wrong, these numbers are no indication of superiority of any kind: clock speed matters just as much. But what these numbers tell you is that x86 designs are less brainiac in nature, and that the x86 ISA limits the ILP much more than IA-64 does (we will give more proof in a later article). x86 designs prefer the speed-demon approach with deeper pipelines.
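The "clock speed matters just as much" caveat is easy to quantify. The sketch below models performance as work per clock multiplied by clock speed. The per-clock ratios (1.60 for integer, 2.15 for floating point) come from the text; the clock speeds of 1.6 GHz and 2.8 GHz are illustrative assumptions, not measurements, and the function names are our own.

```python
def relative_performance(work_per_clock_ratio, clock_a_ghz, clock_b_ghz):
    """Performance of chip A relative to chip B, modeled as
    (work per clock) x (clock speed). A first-order model only."""
    return work_per_clock_ratio * clock_a_ghz / clock_b_ghz


def break_even_clock_fraction(work_per_clock_ratio):
    """Fraction of B's clock speed that A needs to match B's performance."""
    return 1.0 / work_per_clock_ratio


# Per-clock advantages quoted in the text: +60% integer, +115% floating point.
INT_RATIO = 1.60
FP_RATIO = 2.15

# Assumed clock speeds, chosen purely for illustration.
ITANIUM_GHZ = 1.6
OPTERON_GHZ = 2.8

print(f"integer: {relative_performance(INT_RATIO, ITANIUM_GHZ, OPTERON_GHZ):.2f}x")
print(f"floating point: {relative_performance(FP_RATIO, ITANIUM_GHZ, OPTERON_GHZ):.2f}x")
print(f"break-even clock (integer): {break_even_clock_fraction(INT_RATIO):.1%} of the x86 clock")
```

Under these assumed clocks, the per-clock lead in integer code is more than eaten up by the lower clock speed, while the floating point lead survives. That is exactly why the per-clock numbers alone prove nothing about overall superiority.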

The Itanium can sustain 6 instructions per cycle and can issue up to 11 instructions. A lot of this potential goes to waste, but it also means that the potential gains from Multi-Threading techniques are much higher. While the Pentium 4 Xeon was unable to show any significant performance advantage due to SMT in our server tests[4], Montecito is claimed to be 30% faster in typical database loads, thanks to a Coarse Multi-Threading technique that is less advanced than Hyper-Threading.
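To see why a core that idles a lot gains so much from even a simple threading scheme, consider the toy model below. It assumes that a coarse-grained multi-threaded core runs one thread until that thread hits a long-latency cache miss and then switches to another thread that is ready. Every parameter (cycles of work between misses, miss latency, switch penalty) is made up for illustration and none of it describes Montecito's actual behavior.

```python
def simulate(num_threads, compute_cycles, miss_latency, total_cycles, switch_penalty=0):
    """Toy switch-on-miss model: return the fraction of cycles doing useful work.

    Each thread alternates between `compute_cycles` cycles of work and a
    memory miss that takes `miss_latency` cycles. All values are hypothetical.
    """
    ready_at = [0] * num_threads                # cycle at which each thread's miss returns
    remaining = [compute_cycles] * num_threads  # work left before the next miss
    current = 0
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        if ready_at[current] > cycle:
            # The running thread is stalled on a miss: look for a ready thread.
            ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if not ready:
                cycle += 1                      # nobody can run, the core idles
                continue
            current = ready[0]
            cycle += switch_penalty             # cost of flushing and switching
            continue
        busy += 1                               # one cycle of useful work
        cycle += 1
        remaining[current] -= 1
        if remaining[current] == 0:             # the thread now misses the cache
            ready_at[current] = cycle + miss_latency
            remaining[current] = compute_cycles
    return busy / total_cycles


single = simulate(1, compute_cycles=60, miss_latency=40, total_cycles=10_000)
dual = simulate(2, compute_cycles=60, miss_latency=40, total_cycles=10_000, switch_penalty=5)
print(f"one thread: {single:.0%} busy, two threads: {dual:.0%} busy")
```

The more time a single thread leaves the core waiting, the more a second thread can fill in. Real workloads are far less regular than this model, so it only shows the direction of the effect, not its size.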


Itanium's future...

There is no doubt about it: the delay of Montecito and Intel's poor execution are a serious blow to the Itanium family. The Montecito based Itanium 2 has the features that it needs to be competitive in the server world for the next few years: dual core, multi-threading and virtualization (Silvervale). Without these features, Itanium is hopelessly behind the competition, especially the dual core Xeon, Opteron and Power 5+. The Xeon and Opteron might still be a bit behind on the RAS features, but this can change quickly and is only important for a small part of the market.

If we ignore Intel's poor execution during the past months and the economic realities, and focus on the architecture, it is clear, however, that the Itanium has time on its side and is most likely the architecture with the highest potential.

Although the Itanium is capable of sustaining a theoretical maximum of 6 instructions per cycle and executing up to 11 instructions, and despite its massive register set, it uses fewer transistors for its core than all of its competitors. The main disadvantage is that it needs much more cache and instruction fetch width than they do, but this disadvantage diminishes as process technology gets better (smaller), and the bigger caches add very little to the overall power consumption. As competing superscalar RISC and x86 designs increase their instruction execution width, they need to enlarge their Out-Of-Order buffers and, more importantly, increase the complexity of their schedulers. This leads to much higher complexity and power consumption.

As the focus shifts to Thread Level Parallelism, the Itanium's small cores make it easier to use more cores without increasing the power consumption too much. Montecito will be the living proof of this. The Itanium is also wider than the competition, which results in bigger benefits from threading techniques.

While Itanium may not be very popular in the hardware enthusiast community, it is definitely an architecture that, from an academic and technical point of view, deserves a lot more attention. We'll delve deeper into it in upcoming articles.


References

[1] The Quest for More Processing Power, Part One: "Is the single core CPU doomed?"
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2343

[2] Hyper-Threading Technology Architecture and Microarchitecture
http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm

[3] Ace's hardware Specmine
http://www.aceshardware.com/SPECmine/

[4] Linux database server CPU comparison
http://www.anandtech.com/IT/showdoc.aspx?i=2447

43 Comments

  • fitten - Thursday, November 10, 2005 - link

    I'm guessing they'll write an article on it when it actually exists... it's at least two years out still before they expect to have *real* silicon for it and a lot can change between now and then.
  • fic - Thursday, November 10, 2005 - link

    Hmmm, their press release says Q3 '06. I know that dates can and do slip, but I doubt they will slip a year.

    Besides, most of the Itanic stuff that was talked about in the article isn't shipping and probably never will. How late is the "next" version - 2+ years? - with no real expected ship date in the foreseeable future. It would be nice to see an article about the architecture of the chips, decisions made and trade offs for the power efficiency that they are driving toward. Also, this was started a few years ago; what led them down the power efficiency path before some of the major companies (notably Intel) even realized it was an issue?
  • fitten - Friday, November 11, 2005 - link

    From the press release:
    "It will sample in the third calendar quarter of 2006, with single-core and quad-core versions due in early and late 2007, respectively, and an eight-core version planned for 2008."

    Sampling doesn't mean general availability... not even close. The closest thing they have to availability is "early and late 2007" for availability of single- and quad-core versions.
  • xelpmoc - Wednesday, November 9, 2005 - link

    "TLP, caches and power consumption" is more than three words!

    Interesting article, though.
  • Questar - Wednesday, November 9, 2005 - link

    Excellent article.

    I've been telling people for years the Itanium architecture is the future (not the chip). In 20 years there will be no OOE chips on the market, everything will be similar to EPIC. AMD will be there too.

  • highlandsun - Wednesday, November 9, 2005 - link

    I don't see any need for EPIC or VLIW. The Itanium is basically using a 41 bit instruction word. The allocation of bits is only slightly different from the allocation used in a 32 bit RISC instruction. Indeed, point a 128-bit memory channel at a stream of 32 bit instructions and you'll get higher instruction dispatch rates and greater code density. EPIC is philosophically the same as hyperthreading - running multiple instruction streams in parallel in a single CPU core. But that just makes CPU designs unnecessarily complex. With the trend to multi-core CPUs, you get parallelism by using separate cores. Let each core crunch on a single instruction stream at a time, and all of that extra baggage is unnecessary. What is the point of having 11 execution units in a single core if you can only feed it 3 instructions per cycle? An efficient design would keep the number of execution units matched to the number of instructions available, any more is just wasted.

    Personally I would have invested more effort into scaling speeds on the MIPS design. The Itanium's predicated instructions are cool, but the MIPS architecture has those too. Anything you can do to avoid branching is definitely a win. But if you can pre-fetch 4 32-bit instructions in one cycle and decode and detect branches in advance, that's going to give higher IPC than this VLIW implementation.
  • Questar - Wednesday, November 9, 2005 - link

    You don't know what EPIC is. It's not hyperthreading, and it makes CPUs LESS complex as there is no need for all the hardware needed to support OOE. Cell and Xenos are examples.

    Think what you want, but the brightest minds in the CPU world are all looking this way.
  • highlandsun - Wednesday, November 9, 2005 - link

    Actually, having handwritten IA64 assembly code I'm acutely aware of what EPIC is and isn't. The point is that it's another lame attempt at increasing parallelism in one core. The problem is that it tries to give the illusion of independent execution units, just as hyperthreading tries to give the illusion of multiple execution units, and neither implementation is sufficiently flexible. You would get more throughput from truly independent cores, letting the programmer (or some layer above the processor) explicitly allocate instructions to execution units.
  • roymbrown - Thursday, November 10, 2005 - link

    "it's another lame attempt at increasing parallelism in one core"
    It sounds like you are confusing different types of parallelism here. You are referring to TLP (thread level), but EPIC attempts to address ILP (instruction level). Hyperthreading is focused on running multiple independent threads on a single core. Hyperthreading improves TLP, often at the expense of ILP. EPIC is focused on executing non-dependent instructions within a single thread in parallel. This is more analogous to the work done by a complex out-of-order scheduler. EPIC attempts to push this scheduling work onto the compiler.

    "You would get more throughput from truly independent cores"
    Yes, you would, if you have lots of independent threads. Adding more independent cores improves TLP, but does nothing about ILP.
  • Thunder 57 - Monday, May 6, 2019 - link

    It may not be 20 years later, but OoOE is very much alive and Itanium is dead. We've been hearing for years now that ARM will kill off x86-64. I wonder where we will be in another 20 years.