A Crash Course in CPU Architecture
It’s been years since I’ve gone through the life of an instruction, and when I last did it it was about a very high end desktop processor. I realize that not everyone interested in what’s powering the iPhone 3GS or Palm Pre may have been taken down this path, so I thought some of that knowledge might be useful here.
Applications spawn threads, threads are made up of instructions and instructions are what a CPU “processes”. The actual processing of an instruction is pretty simple; the CPU must fetch the instruction from memory, decode or somehow understand what the instruction is telling it to do (e.g. add two numbers), grab any data that is required by the instruction (e.g. find the numbers to be added), actually execute the instruction and finally write the result of the operation either to a register or memory.
Our basic microprocessor with a 5-stage pipeline
Based on the example above, executing an instruction requires five distinct stages. In a pipelined microprocessor, a different instruction can be active at each stage of the execution pipeline. For example, you can be grabbing data for one instruction, while decoding another and fetching yet another. All modern day processors work this way.
Multiple instructions can exist in the pipeline at once, but only one instruction may be active at any given stage
Each one of these stages should take the same amount of time for the processor to work efficiently; the length of time required at the longest stage actually determines the clock speed of the CPU. If the most complex stage in my example above is the decode stage and it requires 3ns to complete, then my CPU can run no faster than 333MHz (1 / 3ns).
To reach faster frequencies, we need to speed up each stage of the pipeline. You can speed up a stage by implementing some sweet new algorithms, or simply by splitting up complicated stages into simpler ones and increasing the number of stages in your pipeline.
In our previous example, the decode stage required 3ns to complete but if we split decode into three separate stages, each requiring 1ns, then we remove that bottleneck. Let’s say we do that but now some of our other stages become the bottleneck; with a target of a 1ns clock period (1ns spent per stage) we go from five stages to eight:
Now, with each stage running at 1ns, our maximum clock speed goes up from 333MHz to 1000MHz (1GHz). Sweet. Right?
With less work being done in each stage, we reach a higher clock speed, but we also depend on each stage being full in order to operate at peak efficiency.
5-stage pipeline (top) vs 8-stage pipeline (bottom). The 8 stage pipe is more desirable, but also requires more instructions to fill.
In the first CPU example we had a 5 stage pipeline, which meant that we needed to have the pipe full of 5 instructions at any given time to be operating at peak efficiency of 1 instruction completed every cycle. The second example has a ginormous 8 stage pipeline, which requires 8 instructions in the pipe for peak efficiency. In both cases you can only get one instruction out of the pipe every cycle, but the second chip can give us more completed instructions in say, 10 seconds.
Now think for a moment about the time periods we’re talking about here. The first CPU had a clock period of 3ns, where each stage took 3ns to complete. The second CPU had a clock period of 1ns. A single trip to main memory can easily take 60ns for a CPU with a very fast on-die memory controller, or over 100ns otherwise. For the sake of argument let’s say that we’re talking about a 100ns trip to main memory. Remember the Fetch Operands stage? Well if those operands are located in main memory that stage won’t take 3ns to complete, but rather 103ns since it has to get the operands from main memory.
Modern processors will perform a context switch upon any memory access to avoid stalling the pipeline for such an absurd length of time. The contents of the pipeline get flushed and filled with another thread while the data request goes off to main memory. Once the data is ready, the processor switches contexts once more and continues on its execution path. Here’s the problem: it takes time to refill the pipeline, and the longer the pipeline, the longer it takes to refill it. This is a bad, but regular occurrence in a microprocessor. Our instruction throughput drops from its 1 instruction per clock peak to 0; not good.
Other scenarios can create interruptions in the normal flow of things within our microprocessor. Some instructions may take multiple cycles at a single stage to complete. More complex arithmetic may spend significantly longer at the execute stage while the operation works out. With an in-order microprocessor, all instructions behind it must wait.
Again, the more stages in your pipeline, the bigger the penalty for a stall. But when the pipeline is full, a deeper pipeline will give us a higher clock speed and better overall performance - we just need to worry about keeping the pipeline full (which takes a great deal of additional transistors). And yes, there is an upper limit to how deep you can pipeline your processor before you start running into diminishing returns in both a performance and power sense, this was ultimately the downfall of the Pentium 4’s architecture.