The Quest for More Processing Power, Part One: "Is the single core CPU doomed?"

Author: Johan De Gelas

by Johan De Gelas on February 8, 2005 4:00 PM EST

Posted in
CPUs


When a CPU becomes a sieve

The real problem is leakage power, and the Intel power graph below illustrates this perfectly.

Fig 2. "Leakage power grows exponentially".

As you can see, dynamic power - which does useful work - has increased relatively slowly despite the increase in CPU complexity. Leakage power, however, increases exponentially, and not linearly. It has grown quickly from a "minor nuisance" to a "circuit killing monster".

Leakage is comparable to a small hole in a firefighter's water hose. The more pressure (i.e. the higher the core voltage), the bigger the hole gets, and thus, the more water leaks to the ground. The thinner the walls of the hose (i.e. the smaller the process technology), the quicker the holes grow. The more water you lose, the harder the pumps must work to get the same amount of water to the fire. If the pumps overheat, you had better throttle them down, or they will cease to work after a while.

Power leakage happens when part of the current that is supposed to make our transistors switch leaks away into the substrate and finally to ground. There are several leakage currents, but the two most important ones are the gate oxide tunnelling current and sub-threshold leakage.^[3]

Fig 3. I₃ is the gate oxide tunnelling current, I₂ is the sub-threshold leakage current.

Gate oxide tunnelling (I₃) currents get more important with smaller process technology as the gate oxide that is supposed to insulate the transistor becomes thinner and thinner. As a result, current that is going through the transistors leaks away - the gate oxide becomes a sieve instead of being the "wall of a tube".

Sub-threshold leakage (I₂) is the leakage current flowing through the transistor when it is supposed to be turned off. To understand this, we have to go back to basic transistor technology.

Normally, a certain voltage, the threshold voltage, is needed to get current across the transistor. This way, the transistor is used as a switch with a binary function: greater than or equal to the threshold voltage = ON = 1, less than the threshold voltage = OFF = 0.

The point that you have to remember is this: ideally, as long as the threshold voltage is not reached, no current should run through the transistor. However, as transistors and interconnects get smaller and smaller (smaller process technology), the insulation between drain and source gets worse and worse. As a result, a small leakage current (I₂) gets through the transistor even though the threshold voltage is not reached (the transistor is off).
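The gap between the ideal switch and a real transistor can be sketched in a few lines of Python. This is only a rough illustration: it assumes the common textbook approximation that the current below threshold drops by a factor of ten for every fixed number of millivolts (the "subthreshold swing") that the gate voltage sits below the threshold. The function names and all numeric values are made up for the example and are not taken from Intel's data.

```python
def ideal_switch_current(v_gate, v_threshold, i_on):
    """Ideal transistor: full current at or above threshold, none below."""
    return i_on if v_gate >= v_threshold else 0.0


def subthreshold_current(v_gate, v_threshold, i_at_threshold,
                         swing_mv_per_decade=100.0):
    """Textbook approximation of the current below threshold.

    The current drops by 10x for every `swing_mv_per_decade` millivolts
    that the gate voltage sits below the threshold voltage.
    """
    shortfall_mv = (v_threshold - v_gate) * 1000.0
    return i_at_threshold * 10 ** (-shortfall_mv / swing_mv_per_decade)


# Hypothetical values: gate at 0 V (transistor "off"), 1 microamp at threshold.
i_ideal = ideal_switch_current(0.0, 0.4, 1e-6)    # ideal switch: exactly zero
i_high_vt = subthreshold_current(0.0, 0.4, 1e-6)  # threshold of 0.4 V
i_low_vt = subthreshold_current(0.0, 0.3, 1e-6)   # threshold of 0.3 V

print("ideal off current:", i_ideal)
print("real off current at 0.4 V threshold:", i_high_vt)
print("leakage ratio, 0.3 V vs 0.4 V threshold:", i_low_vt / i_high_vt)
```

In this simplified model the "off" current is never truly zero, and it is very sensitive to the threshold voltage: a threshold only 100 mV lower leaks roughly ten times as much. Multiplied by many millions of transistors, small currents like these add up.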

That subthreshold leakage has become a major problem, which has been made clear by Shekhar Borkar ^[5] (Intel Fellow, Director of Circuit Research). He illustrated this by the logarithmic graph below.

Fig 4. Subthreshold leakage - notice the logarithmic scale!

Subthreshold leakage was only a small problem at the time of Willamette - it wasted a few watts at 180 nm. The graph is based on Moore's law: every two years, the number of transistors doubles. As you can see, without countermeasures, it wouldn't be worthwhile to build devices on 45 nm technology. They would simply leak too much power, up to 100 watts!

And subthreshold leakage is only part of the leakage problem. Together with gate oxide tunnelling, CPUs made with 65 nm technology would leak more power than they need for making the transistors switch. It is comparable to a fuel tank with so many holes that it leaks more gasoline onto the ground than the fuel pump can deliver to the engine.
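To get a feel for how leakage can overtake switching power, here is a small sketch using two standard first-order formulas: dynamic power as activity x capacitance x voltage squared x frequency, and leakage power as supply voltage x total leakage current. Every number below is a hypothetical placeholder chosen for illustration and does not describe any real CPU.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order switching power: P = activity * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def leakage_power(voltage_v, leakage_current_a):
    """Static power lost to leakage currents: P = V * I_leak (watts)."""
    return voltage_v * leakage_current_a


def leakage_share(p_dynamic, p_leakage):
    """Fraction of the total power budget that does no useful work."""
    return p_leakage / (p_dynamic + p_leakage)


# Hypothetical chip: 100 nF of switched capacitance, 10% activity,
# 1.2 V core voltage, 3 GHz clock, 40 A of total leakage current.
p_dyn = dynamic_power(activity=0.1, capacitance_f=100e-9,
                      voltage_v=1.2, frequency_hz=3e9)
p_leak = leakage_power(voltage_v=1.2, leakage_current_a=40.0)

print(f"dynamic: {p_dyn:.1f} W, leakage: {p_leak:.1f} W")
print(f"leakage share of total: {leakage_share(p_dyn, p_leak):.0%}")
```

With these made-up inputs the leaking "fuel tank" loses more than the engine receives. Note that lowering the core voltage helps both terms, which is one reason voltage is such an important lever.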

Let us check the third and last problem for high performance CPUs.

Wire delay

It is hard to imagine that the little wires - the metal interconnects - between transistors can be a limiting factor. About twenty years ago, transistor switching speeds were pretty low, and wire delays were completely ignored. However, as process technology became better, transistors were capable of switching much faster. Right now, the fastest transistors in the labs can attain 100 GHz and more (the record being around 300-500 GHz). So, transistor switching speed still has a lot of headroom.

The tiny wires between the individual transistors are still not the problem. However, functional blocks are also wired to the TLBs (Translation Lookaside Buffers) and caches, and these global wires are the real problem - they are a lot longer. If their RC delay is too high, the clock speed will have to be reduced to get a working CPU.

Signals travel through the global wires (from logic blocks to the caches, for example) quite a bit slower than the theoretical maximum (the speed of light) allows. The reason is the resistance (R, in ohms) and capacitance (C, in farads) of the wire. As the whole CPU was made with smaller process technology, the wires also shrank. You probably know from your physics lessons that resistance increases as the cross section of the wire gets smaller and as the wire gets longer. So, if you shrink a wire, the benefit of the shorter length is completely negated by the smaller cross section. You could make the wires thicker, but it wouldn't be easy and it would increase the capacitance of the wire. The result is that wire delay remains, more or less, the same (in nanoseconds).
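The scaling argument can be checked with a deliberately simple model. The sketch below assumes a parallel-plate approximation, with R = resistivity x length / (width x thickness) and C = permittivity x length x width / spacing, and ignores effects such as coupling to neighbouring wires. The function name, the wire dimensions, and the 0.7x shrink factor are assumptions picked for the example.

```python
def wire_rc_delay(length, width, thickness, spacing,
                  resistivity=1.7e-8, permittivity=3.5e-11):
    """RC product (seconds) of a wire in a simple parallel-plate model.

    Defaults are approximate values for copper (ohm-metres) and for a
    silicon dioxide insulator (farads per metre). Dimensions in metres.
    """
    resistance = resistivity * length / (width * thickness)
    capacitance = permittivity * length * width / spacing
    return resistance * capacitance


# Hypothetical wire: 1 mm long, 0.2 um wide, 0.4 um thick, 0.2 um spacing.
base = wire_rc_delay(1e-3, 0.2e-6, 0.4e-6, 0.2e-6)

s = 0.7  # one process shrink, assumed to scale every dimension by 0.7

# A wire that shrinks in every dimension, including its length.
shrunk = wire_rc_delay(1e-3 * s, 0.2e-6 * s, 0.4e-6 * s, 0.2e-6 * s)

# A global wire that keeps its length while its cross section shrinks.
global_wire = wire_rc_delay(1e-3, 0.2e-6 * s, 0.4e-6 * s, 0.2e-6 * s)

print("shrunk wire vs. original:", shrunk / base)
print("same-length wire vs. original:", global_wire / base)
```

In this model the fully shrunk wire ends up with the same RC delay as before, which matches the "more or less the same" conclusion. A wire that has to span the same distance with a thinner cross section comes out roughly twice as slow, which hints at why the long global wires are the ones that hurt.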

However, gate switching speed improves a lot with smaller transistors. So, while RC delay improves by a very small percentage (or not at all), gates might switch up to 100% faster (a simplified example) as process technology improves. The RC delay of the global wires therefore becomes more and more of a bottleneck that makes bumping up the clock speed hard. Modern integrated circuits (ICs), such as CPUs, must be partitioned, as a signal can only travel for slightly less than the duration of one clock pulse.
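A quick back-of-the-envelope sketch shows why a fixed wire delay forces partitioning as clocks rise. The 0.5 ns wire delay and the clock speeds below are hypothetical, and the helper names are invented for this example.

```python
import math


def clock_period_ns(clock_ghz):
    """Length of one clock cycle in nanoseconds."""
    return 1.0 / clock_ghz


def cycles_to_cross(wire_delay_ns, clock_ghz):
    """Whole clock cycles a signal needs to travel down the wire."""
    return math.ceil(wire_delay_ns / clock_period_ns(clock_ghz))


WIRE_DELAY_NS = 0.5  # assumed constant across process generations

for ghz in (1.0, 2.0, 4.0):
    print(f"{ghz:.0f} GHz: period {clock_period_ns(ghz):.2f} ns, "
          f"cycles to cross: {cycles_to_cross(WIRE_DELAY_NS, ghz)}")
```

With these numbers the signal fits comfortably within one cycle at 1 GHz, uses up the whole cycle at 2 GHz, and needs two cycles at 4 GHz. At that point the path has to be split into stages, or the clock has to come down.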

65 Comments

Zak - Wednesday, August 22, 2007 - link
I seem to remember reading somewhere, probably couple of years ago, about research being done on hyperconductivity in "normal" temperatures. Right now hyperconductivity occurs only in extremely low temperatures, right? If materials were developed that achieve the same in normal temperatures it'd solve lots of these issues, like wire delay and power loss, wouldn't it?

Z.
Tellme - Monday, February 21, 2005 - link
Carl what i meant was that soon we might not see much improved performance with multicores as well because the data comes too late to the processor for quick execution. (That is true for single cores as well).

Did you check the link?
Their idea is simple.
"If you can't bring the memory bandwidth to the processor, then bring the processors to the memory."
Interesting, no?
Currently the processor spends most of its time waiting for data to process.
carl0ski - Saturday, February 19, 2005 - link
#61 i thought p4 already had memory bandwidth problems,
AMD has a temporary work around (on die memory controller) which aids in multiple CPU's/Dies using the same fsb to access the Ram.

Intel has proposed multiple fsb's , one each CPU/die.

Does anyone know if that means they will need sperate RAM dimms for each FSB? because that would prove an expensive system.
carl0ski - Saturday, February 19, 2005 - link
[quote]59 - Posted on Feb 12, 2005 at 11:28 AM by fitten Reply
#57 What was the performance comparison of the 1GHz Athlon vs. the 1GHz P3? IIRC, the Athlon was faster by some margin. If this was the case, then there was a little more than tweaking that went on in the Pentium-M line. Because they started out looking at the P3 doesn't mean that what they ended up with was the P3 with a tweak here or there. :)[/quote]

#59 didnt P3 1ghz run 133mhz sdram? on a 133fsb?
Athlon 1ghz had a nice DDR 266 fsb to support it.
Tellme - Monday, February 14, 2005 - link
Nice article.

I think dual cores will soon hit the wall, i.e. memory bandwidth.

Hopefully memory and processors will be integrated in the near future.

See
http://www.ee.ualberta.ca/~elliott/cram/
ceefka - Monday, February 14, 2005 - link
Though still a little too technical for me, it makes a good read.

It's good to know that Intel has eaten their words and realized they had to go back to the drawing board.

I believe rather sooner than later multicore will mean 4 - 8 cores providing the power to emulate everything that is not necessarily native, like running MAC OSX on an AMD or Intel box. Iow the CELL will meet its match.
fitten - Saturday, February 12, 2005 - link
#57 What was the performance comparison of the 1GHz Athlon vs. the 1GHz P3? IIRC, the Athlon was faster by some margin. If this was the case, then there was a little more than tweaking that went on in the Pentium-M line. Because they started out looking at the P3 doesn't mean that what they ended up with was the P3 with a tweak here or there. :)
avijay - Friday, February 11, 2005 - link
EXCELLENT Article! One of the very best I've ever read. Nice to see all the references at the end as well. Could someone please point me to Johan's first article at AT please. Thanks.
Great Work!
fishbreath - Friday, February 11, 2005 - link
For those of you who don't actually know this:

1) The Dotham IS a Pentium 3. It was tweaked by Intel in Israel, but it's heart and soul is just a PIII.

1b) All P4's have hyperthreading in them, and always have had. It was a fuse feature that was not announced until there were applications to support them. But anyone who has HT and Windows XP knows that Windows simply has a smoother 'feel' when running on an HT processor!

2) Complex array processors are already in the pipeline (no pun intended). However the lack of an operating system or language to support them demands they make their first appearance in dedicated applications such as h264 encoders.
blckgrffn - Friday, February 11, 2005 - link
Yay for Very Large Scale Integration (more than 10,000 transistors per chip)! :) I wonder when the historians will put down in the history books that we have hit the fifth generation of computing org....
