The CPU industry in three words

If we had to summarize the current trends in the CPU industry in three words, they would be "TLP, caches and power consumption"[2]. Thread-level parallelism (TLP) is exploited more and more with the introduction of multi-threaded and multi-core CPUs. Caches get bigger and bigger, as they do not increase power consumption much and actually save power by preventing costly accesses to the memory controller. Power consumption determines which performance-increasing techniques get the spotlight: wasteful techniques such as Dynamic Multi-Threading, double-pumped ALUs and extremely deep Out Of Order (OOO) windows have fallen out of favor because they consume too much power.

It is pretty clear that these three trends - bigger caches, power consumption being a deciding factor in CPU design, and TLP - will continue to influence CPU architectures heavily in the coming years. How does this benefit an EPIC CPU?


The Cache story

Bigger caches are what the EPIC CPU needs. One of the biggest disadvantages of the EPIC CPU is code inflation. When we compiled some (64-bit) source code on the Itanium back in 2001, the code was about 2.5 to 3 times bigger than (32-bit) x86 code. That is not really surprising: an IA-64 128-bit bundle contains 3 instructions, or roughly 43 bits per instruction. An x86 instruction can be from 1 to 15 bytes long, but is on average a little less than 3 bytes or 24 bits long. That means that x86 instructions are on average almost 2 times more compact. There are other reasons why EPIC code is more bloated than x86. Every bundle has the same fixed length, and there are restrictions on the types of instructions that can be placed in each slot. Slots that the compiler cannot fill must therefore hold NOPs: useless instructions that take up space.
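The density argument above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not a measurement: the function names are made up for illustration, the 24-bit x86 average is the estimate quoted in the text, and the 25% NOP share is a purely hypothetical value chosen to show the effect of unfillable slots.

```python
# Back-of-the-envelope code-density estimate from the figures in the text.
BUNDLE_BITS = 128        # size of one IA-64 bundle
SLOTS_PER_BUNDLE = 3     # instruction slots per bundle
X86_AVG_BITS = 24        # the article's estimate of an average x86 instruction


def ia64_bits_per_useful_instruction(nop_fraction: float = 0.0) -> float:
    """Code bits spent per useful IA-64 instruction.

    nop_fraction is the (hypothetical) share of slots filled with NOPs.
    """
    if not 0.0 <= nop_fraction < 1.0:
        raise ValueError("nop_fraction must be in [0, 1)")
    useful_slots = SLOTS_PER_BUNDLE * (1.0 - nop_fraction)
    return BUNDLE_BITS / useful_slots


def size_ratio_vs_x86(nop_fraction: float = 0.0) -> float:
    """How many times more bits an IA-64 instruction takes than an average x86 one."""
    return ia64_bits_per_useful_instruction(nop_fraction) / X86_AVG_BITS


if __name__ == "__main__":
    # 0.25 is an illustrative guess, not a measured NOP share.
    for nop_fraction in (0.0, 0.25):
        bits = ia64_bits_per_useful_instruction(nop_fraction)
        ratio = size_ratio_vs_x86(nop_fraction)
        print(f"NOP share {nop_fraction:.0%}: {bits:.1f} bits per instruction, "
              f"{ratio:.2f}x the x86 average")
```

With perfectly filled bundles the ratio is about 1.78; if a quarter of the slots held NOPs it would rise to about 2.37. The sketch compares instruction sizes only. It does not account for the two architectures needing different numbers of instructions for the same work.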

The complex x86 instruction set was designed to conserve RAM, as RAM was very expensive in the days when x86 was developed. In more recent years, this code density has helped x86, as it does not need the big caches that RISC and EPIC CPUs need. A RISC instruction is (at least) 32 bits long, which is at least 33% bigger than the average 24-bit x86 instruction.

Currently, it seems that EPIC compilers produce code that is, roughly estimated, at least twice as big as AMD64 or EM64T code. This means that if you want to compare an Itanium instruction cache to the Opteron instruction cache, you have to divide the size of the Itanium instruction cache by two.

So, the Itanium 2's L1 instruction cache is effectively 8 KB (16 KB/2), which looks tiny compared to the massive 64 KB of the Opteron. If we assume that data and instructions each take up about half of the shared 256 KB L2, the Itanium 2's L2 is effectively 192 KB (128 KB/2 for instructions + 128 KB for data), which is small compared to the Opteron's 1 MB and the Xeon's 2 MB L2. That is the reason why Montecito has a 1 MB L2 instruction cache and a 256 KB L2 data cache. This will increase IPC significantly: cache misses are deadly for the in-order Itanium.
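The "x86-equivalent" cache sizes above follow from one simple rule: instruction capacity is divided by the code inflation factor, while data capacity is left alone. Here is a rough sketch of that arithmetic in Python. The function names, the inflation factor of 2 and the 50/50 instruction/data split in the shared L2 are assumptions taken from the estimates in the text, not measured values.

```python
# Effective ("x86-equivalent") cache capacity under an assumed code inflation factor.
CODE_INFLATION = 2.0  # the article's rough estimate: EPIC code is about twice as big


def effective_icache_kb(icache_kb: float, inflation: float = CODE_INFLATION) -> float:
    """Instruction cache capacity expressed in x86-sized code."""
    return icache_kb / inflation


def effective_shared_kb(total_kb: float,
                        instruction_share: float = 0.5,
                        inflation: float = CODE_INFLATION) -> float:
    """Shared cache capacity: only the instruction part shrinks, data is unaffected."""
    instruction_kb = total_kb * instruction_share
    data_kb = total_kb - instruction_kb
    return instruction_kb / inflation + data_kb


if __name__ == "__main__":
    print("Itanium 2 L1-I:", effective_icache_kb(16), "KB effective")
    print("Itanium 2 L2:", effective_shared_kb(256), "KB effective")
    print("Montecito L2-I:", effective_icache_kb(1024), "KB effective")
```

Under these assumptions, Montecito's 1 MB L2 instruction cache would hold about as much code as a 512 KB x86 instruction cache. A different inflation factor or instruction/data split changes the numbers accordingly.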

Time is on the side of the Itanium. As new process technologies have been introduced, cache sizes have grown very quickly over the past years, without adding extra cost or high latency. No competitor has the advantages that Itanium has:
  1. As caches get bigger, Itanium benefits more than the x86 competition. x86 CPUs target higher clock speeds, which makes it more difficult to use large, low-latency caches.
  2. Intel has mastered, like no other company, the skill of producing very dense and fast cache structures.
In 2001, the Itanium had only 96 KB of L2 on the die. In 2002, the Itanium 2 "McKinley" had a 256 KB L2 cache and a 1.5 MB L3 cache. In 2003, the Itanium 2 had 256 KB of L2 and 6 MB of L3 cache on the die, which was increased to 9 MB in 2004. The fact that Itanium needs much larger caches than an x86 CPU has morphed from a catastrophic problem (Merced's integer performance) into a minor nuisance (Itanium 2 Madison). There is no reason to believe that this trend won't continue.


43 Comments


  • fitten - Thursday, November 10, 2005 - link

    I'm guessing they'll write an article on it when it actually exists... it's at least two years out still before they expect to have *real* silicon for it and a lot can change between now and then.
  • fic - Thursday, November 10, 2005 - link

    Hmmm, their press release says Q3 '06. I know that dates can and do slip, but I doubt they will slip a year.

    Besides, most of the Itanic stuff that was talked about in the article isn't shipping and probably never will. How late is the "next" version - 2+ years? - with no real expected ship date in the foreseeable future. It would be nice to see an article about the architecture of the chips, decisions made and trade-offs for the power efficiency that they are driving toward. Also, this was started a few years ago: what led them down the power efficiency path before some of the major companies (notably Intel) even realized it was an issue?
  • fitten - Friday, November 11, 2005 - link

    From the press release:
    "It will sample in the third calendar quarter of 2006, with single-core and quad-core versions due in early and late 2007, respectively, and an eight-core version planned for 2008."

    Sampling doesn't mean general availability... not even close. The closest thing they have to availability is "early and late 2007" for availability of single- and quad-core versions.
  • xelpmoc - Wednesday, November 9, 2005 - link

    "TLP, caches and power consumption" is more than three words!

    Interesting article, though.
  • Questar - Wednesday, November 9, 2005 - link

    Excellent article.

    I've been telling people for years the Itanium architecture is the future (not the chip). In 20 years there will be no OOE chips on the market, everything will be similar to EPIC. AMD will be there too.

  • highlandsun - Wednesday, November 9, 2005 - link

    I don't see any need for EPIC or VLIW. The Itanium is basically using a 41 bit instruction word. The allocation of bits is only slightly different from the allocation used in a 32 bit RISC instruction. Indeed, point a 128-bit memory channel at a stream of 32 bit instructions and you'll get higher instruction dispatch rates and greater code density. EPIC is philosophically the same as hyperthreading - running multiple instruction streams in parallel in a single CPU core. But that just makes CPU designs unnecessarily complex. With the trend to multi-core CPUs, you get parallelism by using separate cores. Let each core crunch on a single instruction stream at a time, and all of that extra baggage is unnecessary. What is the point of having 11 execution units in a single core if you can only feed it 3 instructions per cycle? An efficient design would keep the number of execution units matched to the number of instructions available, any more is just wasted.

    Personally I would have invested more effort into scaling speeds on the MIPS design. The Itanium's predicated instructions are cool, but the MIPS architecture has those too. Anything you can do to avoid branching is definitely a win. But if you can pre-fetch 4 32-bit instructions in one cycle and decode and detect branches in advance, that's going to give higher IPC than this VLIW implementation.
  • Questar - Wednesday, November 9, 2005 - link

    You don't know what EPIC is. It's not hyperthreading, and it makes CPUs LESS complex as there is no need for all the hardware needed to support OOE. Cell and Xenos are examples.

    Think what you want, but the brightest minds in the CPU world are all looking this way.
  • highlandsun - Wednesday, November 9, 2005 - link

    Actually, having handwritten IA64 assembly code I'm acutely aware of what EPIC is and isn't. The point is that it's another lame attempt at increasing parallelism in one core. The problem is that it tries to give the illusion of independent execution units, just as hyperthreading tries to give the illusion of multiple execution units, and neither implementation is sufficiently flexible. You would get more throughput from truly independent cores, letting the programmer (or some layer above the processor) explicitly allocate instructions to execution units.
  • roymbrown - Thursday, November 10, 2005 - link

    "it's another lame attempt at increasing parallelism in one core"
    It sounds like you are confusing different types of parallelism here. You are referring to TLP (thread level), but EPIC attempts to address ILP (instruction level). Hyperthreading is focused on running multiple independent threads on a single core. Hyperthreading improves TLP, often at the expense of ILP. EPIC is focused on executing non-dependent instructions within a single thread in parallel. This is more analogous to the work done by a complex out-of-order scheduler. EPIC attempts to push this scheduling work onto the compiler.

    "You would get more throughput from truly independent cores"
    Yes, you would, if you have lots of independent threads. Adding more independent cores improves TLP, but does nothing about ILP.
  • Thunder 57 - Monday, May 6, 2019 - link

    It may not be 20 years later, but OoOE is very much alive and Itanium is dead. We've been hearing for years now that ARM will kill off x86-64. I wonder where we will be in another 20 years.
