Introduction

On HP's website, these prophetic words are hidden, but can still be found:
"EPIC is the old term for what is now known as the ItaniumTM processor family architecture, co-developed by HP and Intel®. This design philosophy will one day replace RISC and CISC. It is a gateway into the 64-bit future but it still remains completely 32-bit compatible."
These sentences showed how bullish HP and Intel were a few years ago about their new creation. But in 2005, the reality is somewhat different:
"Dell will phase out its remaining computer based on Intel's Itanium microprocessor, in another sign of the waning interest in a chip that cost an estimated several billion dollars to develop." The Wall Street Journal, September 15th 2005.
While it is hardly news that Dell, who doesn't believe in "big iron" anyway, is dropping Itanium, the rest of the sentences that the WSJ journalist wrote down seem to spell doom.

As the Itanium market is still limited to HPC and (ultra) high-end servers, Microsoft is losing interest in the Itanium. IA 64 versions of Longhorn are low priority and only the future of the High Performance Computing version for Itanium seems certain. Visual Studio 2005 does not even support the Itanium platform. Dell and IBM are no longer interested. It is not going too well for Itanium.

A few years ago, analysts predicted doom for Sun; not completely without reason, as the Intel Itanium 2 and IBM Power 5 clearly wipe the floor performance-wise with the UltraSparc CPUs. However, Sun's revenge is very sweet. Sun's newest Galaxy servers with up to 16 Opteron cores are a very competitive platform for the expensive Itanium servers. The Galaxy servers are well suited for clustering, so even in the market niche that requires more than 16 CPUs, are the Itanium based machines threatened by a cheaper alternative?

Although the AMD Opteron targets a different market than the Intel Itanium, the Opteron market is expanding towards the high end, thanks to Sun, which in turn forces Intel to expand the feature set of the Xeon. Back in 2004 when EM64T was introduced, Intel pointed out that EM64T was only introduced on the Xeon DP. Intel probably expected the Opteron to be limited to workstations and entry level servers. However, the Opteron was very successful in the quad CPU market, and then it entered the 8-way and 16-way CPU market too. Intel had no choice then to counter attack and equip the Xeon MP with EMT64 and much higher clockspeeds than before the Opteron era, better RAS features and massive (for x86) L3 caches, up to 8MB big.

Is Itanium nothing more than over an ambitious project that resulted in a CPU of titanic proportions? In this article, we try to answer the question of whether or not the EPIC CPU has a bright future ahead. To answer that question, we'll focus on the technical advantages and disadvantages of the chip, and look ahead to see if the architecture can still grow enough to outpace the competition.


The End of a Generation

Indeed, you might ask yourself, why do we even bother writing articles about Itanium? It is, after all, a massive CPU that ends up in very expensive machines, mostly huge database servers and HPC machines for scientific purposes; machines that most of us will never consider buying, not even for business purposes.


Sturdy heatsinks for the Itanium

And Itanium is in a lot of trouble. The newest generation, Montecito, was projected to arrive in 2004 when Intel first mentioned it. Then the PowerPoint slides mentioned 2005, and it became clear now that the newest Itanium wouldn't make its appearance before mid-2006. Many people feel that this is one of the many signs that the "Itanic" is sinking slowly, but steadily.

Still, despite its rather dull reputation of a big iron CPU, and the flood of negative predictions, the EPIC has something fascinating. From a purely technical and academic point of view - completely ignoring the economical and business logic - there are some strong indications that time may well be on the side of the EPIC CPU despite all doom scenarios. That might sound insane right now, but allow me to explain this statement.

As we stated in the "The Quest for More Processing Power, Part One", the CPU performance increase that we enjoyed during the golden era of the PC from 1981 to 2002 has hit the brakes, and is decreasing quickly. Back in the nineties, Intel and others introduced techniques like superscalar wide issue, out of order execution with big reorder buffers, speculative execution, integrated L2-caches, register renaming and dynamic branch prediction, which all increased the number of instructions that could be processed per cycle (IPC) on average. The AMD Athlon, which was introduced in 1999, and the Thunderbird incarnation in 2000 could be considered as the last representatives of this superscalar generation. Macro ops fusion, introduced in the Athlon, where two operations are travelling down the pipeline together until they get separated to get executed, was one of the last major tricks of this generation.

Since then, only one improvement has really pushed performance per cycle forward: the on die memory controller (ODMC). Sure, there have been other "little tricks" that have steadily improved performance, but nothing spectacular. The CPU engineers still have a few tricks upon their sleeves that can improve IPC somewhat, but are limited to those that do not increase leakage and dynamic power loss. The focus is no longer on IPC or Instruction Level Parallelism (ILP). It is on Thread Level Parallelism (TLP).

A good example of how the engineering focus has shifted is branch prediction. Quite a bit of resources have been spent on the Pentium 4's branch predictor, involving a whole team of Intel engineers. The result was that, on average, the Pentium 4 branch predictor is accurate 95-97% of the time, while the P6 BPU was accurate only 90% of the time.

At Spring IDF 2005, when Anand, Derek and I asked Justin Rattner what Intel is doing in the field of even more advanced Branch prediction, he smiled. He told us that the current team who works on branch prediction is very small...around one person.

There is no doubt that the whole industry has shifted their focus away from ramping clock speed and improving ILP to increasing performance by exploiting TLP. So, how does this affect Itanium and its EPIC foundation? Before we answer that, let us quickly review the basics behind the Itanium/EPIC philosophy.

EPIC 101
POST A COMMENT

42 Comments

View All Comments

  • Spoonbender - Wednesday, November 09, 2005 - link

    "The main philosophy behind Itanium is, of course, that a compiler can statically schedule instructions much better than a hardware scheduler" - Not always.
    Of course, the compiler can do all this with the static information within the same translation unit (or in some cases, only within the same basic code block), but not based on runtime behavior. Global optimizations are a pain to implement on a compiler, and a lot of them are simply too complex to even think about, while the hardware scheduler can easily see, for example, where a function is called from, meaning it can figure out some dependencies that might be practically impossible to do in the compiler.
    Dynamic and static scheduling can achieve different results based on the different data available to them (at compile-time vs runtime), but it's wrong to say that one is much better than the other. The trick is to use the best of both worlds. x86 compilers already lets the compiler do as much scheduling as possible, and then at runtime the hardware scheduler tweaks everything to fit the particular pipeline, and uses the runtime info available that the compiler didn't have.
    Of course, the Itanium could do the same, but relying solely on the compiler is a mistake.

    Another disadvantage with the Itanium is that everything becomes a lot more architecture-specific. For example, the same compiler can write decent code for either a P4 or an Athlon 64 (or even a 386).

    But because so much of the responsibility for scheduling and instruction bundles is put on the compiler, it's the compiler that has to reflect each particular architecture. So far, there's only Itanium and Itanium 2. What when we get to Itanium 5? Or AMD Athlanium? ;)
    Different compilers for each? Or should we accept that the same compiler just generates inefficient code on all other EPIC CPU's than the original target?

    And how much headroom does the architecture have then?
    (What if in the future we want wider instruction bundles? Or if they find out that reaing bigger amounts of smaller bundles is more efficient? Or if they want to remove some of the current restrictions on instruction order inside a bundle?
    I just can't see how EPIC can ever become a viable long-term architecture. And honestly, I don't want to go back to the old days of "New CPU? Have to recompile everything. Binary compatibility? What's that?"
    Reply
  • JohanAnandtech - Wednesday, November 09, 2005 - link

    You bring up very valid points that I will definitely address in a follow up. Indeed statically scheduling is not always better than dynamically. Most of the time it is, as you can look ahead much more far ahead, but it is less flexible.

    x86 compilers can never extract much ILP as they are limited by the ISA. With 20% branches and 8 registers, your options are very limited.

    But your comment about binary compatibility is a mistake. The 128 bit bundle hasn't changed, so your binary compatibility is saved. It is true that the Itanium 2 can use bundles that the Itanium can't, but the same can be said about the P4 using SSE-2 instructions that the Pentium II can't use. You just provide two codepaths in the same code like we do now in apps where you can enable or disable SSE. Secondly, there are almost no Itanium I out there, so it is sufficient to make your code Itanium 2 compatible.

    Wider bundles aren't going to happen. There is no reason to do so, as the groups of independent instructions can be as large as you want, you chain bundles together via the template. Montecito is perfectly compatible with Madison and mckinley





    Reply
  • mkruer - Wednesday, November 09, 2005 - link

    <mindless ramblings>
    I think one of the key things to point out is that the current x86 has very little in common with the original ISA, and that the ISA has been adapting over time. The current internal cores are more like RISC then the original CISC design which will probably lead to some low level VLIW implementation mainly in the area of the FP units.

    My predictions are that we are going to start seeing some low level implementations of VLIW most likely as a sub core options at first. As time progresses we will see those sub cores become more and more powerful and functional, and as time progresses more and more of the current x86 ISA will fall off to be replaced by an updated x86 ISA. </mindless ramblings>
    Reply
  • saratoga - Wednesday, November 09, 2005 - link

    Yes very little in common aside from almost complete binary compatability. You're confusing ISA (the binary format for operations) with microarch (the layout of transistors in a processor).

    Also, "low level VLIW", WTF?
    Reply
  • Brian23 - Wednesday, November 09, 2005 - link

    If Intel would drop the x86 compatability, L3 cache, and up the L1 and L2 chaches significantly and add an on die memory controller, this chip would be incredable. Then they could do something like transmetta did for backwards compatability until they can coax MS to write an os and compiler that runs natively on chip. At that point x86 would be dead. Reply
  • JohanAnandtech - Wednesday, November 09, 2005 - link

    If they up the L1 and L2, it would result in higher latencies. Right now, the L1-cache has a 1 cycle L1. So L1-accesses are as good as free, you don't want that to change for an in-order CPU.

    The L3-cache is important as it lowers the accesses to the memory significantely. But I agree that x86 hardware support should be dropped, and only software emulation should be available. That opens up a few million transistors that can be used for a primitive OOO system or improved prefetching.
    Reply
  • highlandsun - Wednesday, November 09, 2005 - link

    As a server chip there's really no reason to beg MS for anything. Linux and gcc can take it from here. Note that big Itanium servers from HP and SGI all run Linux anyway, MS is irrelevant in this space. But yes, they really ought to jettison the x86 baggage. In an open source world there's no need to do on-chip emulation to execute legacy binaries - just recompile the source and get a native binary instead. Reply
  • PeteRoy - Wednesday, November 09, 2005 - link

    YEah Reply
  • IntelUser2000 - Wednesday, November 09, 2005 - link

    Johan, Do you know that the 30% performance advantage is SoEMT only on Montecito??? Not comparing against Madison??

    Whether by major compiler improvements or core improvements, Montecito should be 25% faster per clock, per core over Madison.

    Its sad that Intel had problems with Montecito. At 2GHz it would have been amazing.
    Reply
  • JohanAnandtech - Wednesday, November 09, 2005 - link

    Yes, very nice remark. Part of that 25% is thanks to the L2-I cache which is now better adapted to the bigger instructions without increasing latency. Most RISC have a bigger I cache than D-cache.

    Do you have an URL handy? I have been searching all over the web to find that 25%.
    Reply

Log in

Don't have an account? Sign up now