The CPU industry in three words

If we were to summarize the current trends in the CPU industry in three words, we think those words would be "TLP, caches and power consumption"[2]. TLP is exploited more and more with the introduction of multi-threaded and multi-core CPUs. Caches get bigger and bigger, as they don't increase power consumption, but rather save power by preventing costly accesses to the memory controller. Power consumption determines which performance-increasing techniques get the spotlight: wasteful techniques such as Dynamic Multi-Threading, double-pumped ALUs and extremely deep Out-of-Order (OOO) windows have fallen out of grace because they consume too much power.

It is pretty clear that these three trends - bigger caches, power consumption as a deciding factor in CPU design, and TLP - will continue to influence CPU architectures heavily in the coming years. How would this benefit an EPIC CPU?


The Cache story

Bigger caches are exactly what the EPIC CPU needs. One of the biggest disadvantages of the EPIC CPU is code inflation. When we compiled some (64 bit) source code on the Itanium back in 2001, the resulting binaries were about 2.5 to 3 times bigger than (32 bit) x86 code. That is not really surprising: an IA-64 bundle is 128 bits long and contains 3 instructions, or about 43 bits per instruction. An x86 instruction can be from 1 to 17 bytes long, but is on average a little less than 3 bytes (24 bits) long. That means that x86 instructions are on average almost twice as compact. There are other reasons why EPIC code is more bloated than x86 code: because of restrictions on the types of instructions that can be placed in each slot of an IA-64 bundle, and the fact that every bundle must be the same length, the compiler has to fill slots it cannot use with NOPs or useless instructions that take up space.
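To put numbers on this, here is a minimal sketch (in C, using only the rough averages quoted above rather than measured values) of the raw code-density gap before any NOP padding is counted:

```c
#include <stdio.h>

int main(void) {
    /* Figures from the text: one 128-bit IA-64 bundle holds 3
       instructions; an x86 instruction averages ~3 bytes (24 bits). */
    const double ia64_bits_per_insn = 128.0 / 3.0;   /* ~42.7 bits */
    const double x86_bits_per_insn  = 24.0;

    printf("IA-64: %.1f bits per instruction\n", ia64_bits_per_insn);
    printf("x86:   %.1f bits per instruction\n", x86_bits_per_insn);
    printf("raw ratio: %.2fx\n", ia64_bits_per_insn / x86_bits_per_insn);
    return 0;
}
```

The raw encoding gap of roughly 1.8x, widened further by the NOP padding described above, is consistent with the 2.5 to 3 times larger binaries we measured.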

The whole complex x86 architecture was built to conserve RAM, as RAM was very expensive in the days when x86 was developed. In more recent years, this legacy has helped x86: it doesn't need the big caches that RISC and EPIC CPUs do. A RISC instruction is (at least) 32 bits long, or at least 33% bigger than the average x86 instruction.

Currently, it seems that EPIC compilers produce code that is - roughly estimated - at least twice as big as AMD64 or EM64T code. This means that if you want to compare the Itanium instruction cache to the Opteron instruction cache, you have to divide the Itanium instruction cache in two.

So, the effective L1 instruction cache of 8 KB (16 KB/2) looks tiny compared to the massive 64 KB of the Opteron. If we assume that data and instructions take up about the same space in the shared L2, the Itanium 2's 256 KB L2 is effectively 192 KB big (128 KB/2 of instructions + 128 KB of data), which is small compared to the Opteron's 1 MB and the Xeon's 2 MB L2. That is the reason why Montecito gets a 1 MB L2 instruction cache and a 256 KB L2 data cache. This should increase IPC significantly: cache misses are deadly for the in-order Itanium.
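The halving rule used in the previous two paragraphs can be written out explicitly; the snippet below simply restates that arithmetic (cache sizes in KB as quoted above):

```c
#include <stdio.h>

int main(void) {
    /* Sizes in KB, as quoted in the text. */
    const double itanium2_l1i = 16.0, opteron_l1i = 64.0;
    const double itanium2_l2  = 256.0;  /* shared between I and D */

    /* EPIC code is roughly twice as big, so halve the instruction side. */
    double eff_l1i = itanium2_l1i / 2.0;        /* 8 KB */
    double eff_l2  = itanium2_l2 / 2.0 / 2.0    /* instruction half, halved */
                   + itanium2_l2 / 2.0;         /* data half unchanged      */

    printf("effective L1I: %.0f KB (Opteron: %.0f KB)\n", eff_l1i, opteron_l1i);
    printf("effective L2:  %.0f KB\n", eff_l2);  /* 192 KB */
    return 0;
}
```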

Time is on the side of the Itanium. As new process technologies have been introduced, cache sizes have grown very quickly over the past few years, without adding extra cost or high latency. No competitor has the advantages that Itanium has:
  1. As caches get bigger, Itanium benefits more than the x86 competition. x86 CPUs target higher clock speeds and, as such, it is more difficult for them to use large, low-latency caches.
  2. Intel has mastered like no other the skill of producing very dense and fast cache structures.
In 2001, the Itanium had only 96 KB of L2 cache on the die. In 2002, the Itanium 2 "McKinley" had a 256 KB L2 cache and a 1.5 MB L3 cache. In 2003, the Itanium 2 "Madison" had 256 KB of L2 and 6 MB of L3 cache on the die, which was increased to 9 MB in 2004. The fact that Itanium needs much larger caches than an x86 CPU has morphed from a catastrophic problem (Merced's integer performance) into a minor nuisance (Itanium 2 "Madison"). There is no reason to believe that this trend won't continue.

Comments

  • Starglider - Wednesday, November 9, 2005 - link

    Well, back in university I passed my classes on CPU design, and I know a couple of flavours of assembly language and have worked on compilers professionally, so yes, I'd say I know what I'm talking about.

    Hell, why am I being polite? /Of course/ you can combine static and dynamic optimisation of instruction order. All x86 compilers /already/ do this. Virtual machine based programming languages (e.g. C# and Java) actually have /three/ tiers of optimisation: the primary compiler optimises the bytecode based on static global information, the runtime compiler optimises for the target instruction set based on medium-scale runtime information (at least Sun's HotSpot does), and then the CPU does instruction reordering and register renaming based on very local information. The efficiency of the final stage, i.e. the processor-level scheduling, can be improved by embedding hints in the instruction stream, in exactly the same way that JIT compilation can be improved by embedding hints in the bytecode of a VM language. Indeed, arguably some RISC designs already do this to a limited extent, so implementing it for x86 isn't much of a stretch.
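    As a concrete (GCC-specific, purely illustrative) sketch of what "embedding hints" can look like at the source level: __builtin_expect lets the static compiler record a branch-probability hint that shapes code layout for the hardware's dynamic predictor:

    ```c
    #include <stdio.h>

    /* GCC/Clang built-in: annotates the expected truth value of a branch,
       so the compiler can lay out the hot path to suit the hardware. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(int value) {
        if (unlikely(value < 0)) {  /* hinted as the cold path */
            fprintf(stderr, "bad value\n");
            return -1;
        }
        return value * 2;           /* hinted as the hot path */
    }

    int main(void) {
        printf("%d\n", process(21));  /* prints 42 */
        return 0;
    }
    ```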
  • Spoonbender - Wednesday, November 9, 2005 - link

    "The main philosophy behind Itanium is, of course, that a compiler can statically schedule instructions much better than a hardware scheduler" - Not always.
    Of course, the compiler can do all this with the static information within the same translation unit (or in some cases, only within the same basic block), but not based on runtime behavior. Global optimizations are a pain to implement in a compiler, and a lot of them are simply too complex to even think about, while the hardware scheduler can easily see, for example, where a function is called from, meaning it can figure out dependencies that are practically impossible to resolve in the compiler.
    Dynamic and static scheduling can achieve different results based on the different data available to them (at compile time vs. at runtime), but it's wrong to say that one is much better than the other. The trick is to use the best of both worlds. On x86, the compiler already does as much scheduling as possible, and then at runtime the hardware scheduler tweaks everything to fit the particular pipeline, using the runtime information that the compiler didn't have.
    Of course, the Itanium could do the same, but relying solely on the compiler is a mistake.
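    A standard illustration of runtime-only information (a hypothetical sketch, not the commenter's own example) is pointer aliasing: the compiler cannot prove that the two pointers below never overlap, so it must schedule the loads and stores conservatively, while an out-of-order core simply checks the actual addresses at runtime:

    ```c
    #include <stdio.h>

    /* Without knowing that dst and src never overlap, the compiler must
       keep these loads and stores in order; dynamic hardware can
       disambiguate the real addresses and reorder more aggressively. */
    void scale(double *dst, const double *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2.0;
    }

    int main(void) {
        double a[4] = {1, 2, 3, 4}, b[4];
        scale(b, a, 4);
        printf("%.1f %.1f\n", b[0], b[3]);  /* 2.0 8.0 */
        return 0;
    }
    ```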

    Another disadvantage of the Itanium is that everything becomes a lot more architecture-specific. For example, the same compiler can generate decent code for either a P4 or an Athlon 64 (or even a 386).

    But because so much of the responsibility for scheduling and instruction bundling is put on the compiler, it's the compiler that has to reflect each particular architecture. So far, there are only the Itanium and Itanium 2. What happens when we get to Itanium 5? Or an AMD Athlanium? ;)
    Different compilers for each? Or should we accept that the same compiler just generates inefficient code on every EPIC CPU other than the original target?

    And how much headroom does the architecture have then?
    (What if, in the future, we want wider instruction bundles? Or if they find out that reading larger numbers of smaller bundles is more efficient? Or if they want to remove some of the current restrictions on instruction order inside a bundle?)
    I just can't see how EPIC can ever become a viable long-term architecture. And honestly, I don't want to go back to the old days of "New CPU? Have to recompile everything. Binary compatibility? What's that?"
  • JohanAnandtech - Wednesday, November 9, 2005 - link

    You bring up very valid points that I will definitely address in a follow-up. Indeed, static scheduling is not always better than dynamic scheduling. Most of the time it is, as you can look much further ahead, but it is less flexible.

    x86 compilers can never extract much ILP, as they are limited by the ISA. With 20% of instructions being branches and only 8 registers, your options are very limited.

    But your comment about binary compatibility is mistaken. The 128 bit bundle hasn't changed, so binary compatibility is preserved. It is true that the Itanium 2 can use bundles that the Itanium can't, but the same can be said about the P4 using SSE2 instructions that the Pentium II can't use. You just provide two codepaths in the same binary, like we do now in apps where you can enable or disable SSE. Secondly, there are almost no Itanium 1 systems out there, so it is sufficient to make your code Itanium 2 compatible.
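    Purely as an illustration of such a dual-codepath scheme (this sketch uses GCC's x86-specific __builtin_cpu_supports; an Itanium binary would test the CPU model instead):

    ```c
    #include <stdio.h>

    static void work_generic(void) { puts("generic code path"); }
    static void work_sse2(void)    { puts("SSE2 code path"); }

    int main(void) {
        /* GCC/Clang built-in backed by CPUID: ship both code paths in
           one binary and pick the right one at runtime. */
        if (__builtin_cpu_supports("sse2"))
            work_sse2();
        else
            work_generic();
        return 0;
    }
    ```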

    Wider bundles aren't going to happen. There is no reason to do so: the groups of independent instructions can be as large as you want, since you chain bundles together via the template. Montecito is perfectly compatible with Madison and McKinley.

  • mkruer - Wednesday, November 9, 2005 - link

    <mindless ramblings>
    I think one of the key things to point out is that current x86 has very little in common with the original ISA, and that the ISA has been adapting over time. The current internal cores are more like RISC than the original CISC design, which will probably lead to some low-level VLIW implementation, mainly in the area of the FP units.

    My prediction is that we are going to start seeing some low-level implementations of VLIW, most likely as sub-core options at first. As time progresses, those sub-cores will become more and more powerful and functional, and more and more of the current x86 ISA will fall away, to be replaced by an updated x86 ISA. </mindless ramblings>
  • saratoga - Wednesday, November 9, 2005 - link

    Yes, very little in common aside from almost complete binary compatibility. You're confusing the ISA (the binary format for operations) with the microarchitecture (the layout of transistors in a processor).

    Also, "low level VLIW", WTF?
  • Brian23 - Wednesday, November 9, 2005 - link

    If Intel would drop x86 compatibility and the L3 cache, up the L1 and L2 caches significantly, and add an on-die memory controller, this chip would be incredible. Then they could do something like Transmeta did for backwards compatibility, until they can coax MS into writing an OS and compiler that run natively on the chip. At that point, x86 would be dead.
  • JohanAnandtech - Wednesday, November 9, 2005 - link

    If they up the L1 and L2, it would result in higher latencies. Right now, the L1 cache has a 1 cycle latency, so L1 accesses are as good as free; you don't want that to change on an in-order CPU.

    The L3 cache is important, as it lowers the number of accesses to memory significantly. But I agree that x86 hardware support should be dropped, with only software emulation available. That would free up a few million transistors that could be used for a primitive OOO system or improved prefetching.
  • highlandsun - Wednesday, November 9, 2005 - link

    As a server chip, there's really no reason to beg MS for anything. Linux and gcc can take it from here. Note that the big Itanium servers from HP and SGI all run Linux anyway; MS is irrelevant in this space. But yes, they really ought to jettison the x86 baggage. In an open source world, there's no need for on-chip emulation to execute legacy binaries - just recompile the source and get a native binary instead.
  • PeteRoy - Wednesday, November 9, 2005 - link

    Yeah
  • IntelUser2000 - Wednesday, November 9, 2005 - link

    Johan, do you know that the 30% performance advantage on Montecito is from SoEMT only??? Not from comparing against Madison??

    Whether through major compiler improvements or core improvements, Montecito should be 25% faster per clock, per core than Madison.

    It's sad that Intel had problems with Montecito. At 2 GHz, it would have been amazing.
