A Crash Course in CPU Architecture

It’s been years since I’ve gone through the life of an instruction, and the last time I did, it was for a very high end desktop processor. I realize that not everyone interested in what’s powering the iPhone 3GS or Palm Pre has been taken down this path, so I thought some of that knowledge might be useful here.

Applications spawn threads, threads are made up of instructions and instructions are what a CPU “processes”. The actual processing of an instruction is pretty simple; the CPU must fetch the instruction from memory, decode or somehow understand what the instruction is telling it to do (e.g. add two numbers), grab any data that is required by the instruction (e.g. find the numbers to be added), actually execute the instruction and finally write the result of the operation either to a register or memory.


Our basic microprocessor with a 5-stage pipeline

Based on the example above, executing an instruction requires five distinct stages. In a pipelined microprocessor, a different instruction can be active at each stage of the execution pipeline. For example, you can be grabbing data for one instruction, while decoding another and fetching yet another. All modern day processors work this way.


Multiple instructions can exist in the pipeline at once, but only one instruction may be active at any given stage
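
To make that overlap concrete, here's a rough Python sketch that prints which instruction sits in each of the five stages on every clock cycle. The function name and the I1, I2... labels are my own, purely for illustration, and it assumes an ideal pipeline where nothing ever stalls:

```python
STAGES = ["Fetch", "Decode", "Fetch Operands", "Execute", "Write Output"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per cycle mapping stage name -> instruction index (or None).

    Assumes an ideal pipeline: one instruction enters per cycle, nothing stalls.
    """
    depth = len(stages)
    total_cycles = depth + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, name in enumerate(stages):
            instr = cycle - s  # instruction i reaches stage s on cycle i + s
            row[name] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule

if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(7)):
        cells = [f"I{i + 1}" if i is not None else "--" for i in row.values()]
        print(f"cycle {cycle + 1:2d}: " + "  ".join(cells))
```

Each printed row is one clock cycle and each column is one stage. No instruction label ever appears twice in a row, and from the fifth cycle onward every stage is busy.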

Each one of these stages should take the same amount of time for the processor to work efficiently; the length of time required at the longest stage actually determines the clock speed of the CPU. If the most complex stage in my example above is the decode stage and it requires 3ns to complete, then my CPU can run no faster than 333MHz (1 / 3ns).
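
The arithmetic is simple enough to sketch in a few lines of Python. Only the 3ns decode figure comes from my example; the other stage latencies below are assumed values I picked for illustration, as is the function name:

```python
def max_clock_mhz(stage_latencies_ns):
    """Peak clock in MHz: the slowest stage sets the clock period."""
    clock_period_ns = max(stage_latencies_ns.values())
    return 1000.0 / clock_period_ns  # 1 / ns is GHz, so scale by 1000 for MHz

# Decode at 3ns is from the example above; every other latency is an assumption.
five_stage = {
    "Fetch": 1.0,
    "Decode": 3.0,
    "Fetch Operands": 1.0,
    "Execute": 2.0,
    "Write Output": 1.0,
}

print(f"{max_clock_mhz(five_stage):.0f} MHz")  # prints "333 MHz"
```

Notice that the four faster stages don't help at all; they simply sit idle for part of every 3ns cycle.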

To reach faster frequencies, we need to speed up each stage of the pipeline. You can speed up a stage by implementing some sweet new algorithms, or simply by splitting up complicated stages into simpler ones and increasing the number of stages in your pipeline.

In our previous example, the decode stage required 3ns to complete but if we split decode into three separate stages, each requiring 1ns, then we remove that bottleneck. Let’s say we do that but now some of our other stages become the bottleneck; with a target of a 1ns clock period (1ns spent per stage) we go from five stages to eight:

Fetch
Decode 1
Decode 2
Decode 3
Fetch Operands
Execute 1
Execute 2
Write Output

Now, with each stage running at 1ns, our maximum clock speed goes up from 333MHz to 1000MHz (1GHz). Sweet. Right?
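
Here's a minimal sketch of that splitting step in Python. It assumes a stage's work divides evenly into sub-stages with no overhead for the extra latches between them, which real designs don't get for free. As before, only the 3ns decode latency is from my example; the other latencies are assumptions chosen so the result matches the eight stages listed above:

```python
import math

def split_stages(stage_latencies_ns, target_period_ns):
    """Split every stage slower than the target period into equal sub-stages."""
    new_stages = []
    for name, latency in stage_latencies_ns.items():
        pieces = math.ceil(latency / target_period_ns)
        if pieces <= 1:
            new_stages.append(name)
        else:
            new_stages.extend(f"{name} {n}" for n in range(1, pieces + 1))
    return new_stages

# Assumed latencies in ns (only Decode = 3 comes from the text).
latencies = {"Fetch": 1, "Decode": 3, "Fetch Operands": 1, "Execute": 2, "Write Output": 1}

for stage in split_stages(latencies, target_period_ns=1):
    print(stage)
```

With a 1ns target, Decode becomes three stages and Execute becomes two, taking us from five stages to eight.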

With less work being done in each stage, we reach a higher clock speed, but we also depend on each stage being full in order to operate at peak efficiency.


5-stage pipeline (top) vs 8-stage pipeline (bottom). The 8-stage pipe is more desirable, but also requires more instructions to fill.

In the first CPU example we had a 5-stage pipeline, which meant that we needed to have the pipe full of 5 instructions at any given time to be operating at peak efficiency of 1 instruction completed every cycle. The second example has a ginormous 8-stage pipeline, which requires 8 instructions in the pipe for peak efficiency. In both cases you can only get one instruction out of the pipe every cycle, but the second chip can give us more completed instructions in, say, 10 seconds, because each of its cycles is a third as long.
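
To put numbers on that, here's a small Python sketch that counts how many instructions each design finishes in a fixed window of time. The function name and the 300ns window are mine, and it assumes an ideal single-issue pipeline that starts empty and never stalls:

```python
def instructions_completed(window_ns, depth, clock_period_ns):
    """Instructions retired within a time window, starting from an empty pipeline.

    Assumes an ideal single-issue pipeline that never stalls: the first
    instruction finishes after `depth` cycles, then one more every cycle.
    """
    cycles = window_ns // clock_period_ns
    return max(0, cycles - (depth - 1))

window_ns = 300  # an arbitrary window, chosen only for illustration
print("5-stage @ 3ns:", instructions_completed(window_ns, depth=5, clock_period_ns=3))
print("8-stage @ 1ns:", instructions_completed(window_ns, depth=8, clock_period_ns=1))
```

The deeper pipe takes more cycles to fill (8 instead of 5), but since each cycle is shorter it still comes out well ahead over the same window.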

Now think for a moment about the time periods we’re talking about here. The first CPU had a clock period of 3ns, where each stage took 3ns to complete. The second CPU had a clock period of 1ns. A single trip to main memory can easily take 60ns for a CPU with a very fast on-die memory controller, or over 100ns otherwise. For the sake of argument let’s say that we’re talking about a 100ns trip to main memory. Remember the Fetch Operands stage? Well if those operands are located in main memory that stage won’t take 3ns to complete, but rather 103ns since it has to get the operands from main memory.
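
It helps to convert that memory trip into clock cycles. This quick Python sketch (the function name is my own) uses the 100ns figure from above and the two clock periods from our example CPUs:

```python
import math

def stall_cycles(memory_latency_ns, clock_period_ns):
    """Whole clock cycles that elapse while one request goes to main memory."""
    return math.ceil(memory_latency_ns / clock_period_ns)

for label, period_ns in (("5-stage, 3ns clock", 3), ("8-stage, 1ns clock", 1)):
    cycles = stall_cycles(100, period_ns)
    print(f"{label}: {cycles} cycles per 100ns memory trip")
```

The wall-clock delay is the same for both chips, but the faster chip gives up roughly three times as many cycles, each of which could have completed an instruction.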

Modern processors will perform a context switch when a memory access has to go all the way out to main memory, to avoid stalling the pipeline for such an absurd length of time. The contents of the pipeline get flushed and filled with another thread while the data request goes off to main memory. Once the data is ready, the processor switches contexts once more and continues on its execution path. Here’s the problem: it takes time to refill the pipeline, and the longer the pipeline, the longer it takes to refill it. This is a bad but regular occurrence in a microprocessor. Our instruction throughput drops from its 1 instruction per clock peak to 0; not good.

Other scenarios can create interruptions in the normal flow of things within our microprocessor. Some instructions may take multiple cycles at a single stage to complete. More complex arithmetic may spend significantly longer at the execute stage while the operation works out. With an in-order microprocessor, all instructions behind it must wait.

Again, the more stages in your pipeline, the bigger the penalty for a stall. But when the pipeline is full, a deeper pipeline will give us a higher clock speed and better overall performance - we just need to worry about keeping the pipeline full (which takes a great deal of additional transistors). And yes, there is an upper limit to how deep you can pipeline your processor before you start running into diminishing returns in both a performance and power sense; this was ultimately the downfall of the Pentium 4’s architecture.
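
Here's a rough way to model that tradeoff in Python. This is a simplification of my own, not how any real chip accounts for it: it assumes a single-issue pipeline where every flush wastes (depth - 1) cycles while the pipe refills, and the instruction and flush counts are made up for illustration:

```python
def effective_ipc(num_instructions, depth, num_flushes):
    """Average instructions per clock when each flush costs a full refill.

    Assumes a single-issue pipeline and that every flush wastes (depth - 1)
    cycles before the next instruction can complete.
    """
    ideal_cycles = depth + num_instructions - 1
    wasted_cycles = num_flushes * (depth - 1)
    return num_instructions / (ideal_cycles + wasted_cycles)

# 1000 instructions and 50 flushes are arbitrary, illustrative numbers.
for label, depth, period_ns in (("5-stage @ 3ns", 5, 3), ("8-stage @ 1ns", 8, 1)):
    ipc = effective_ipc(1000, depth, num_flushes=50)
    print(f"{label}: {ipc:.2f} IPC, {ipc / period_ns:.2f} instructions per ns")
```

Under these assumptions the deeper pipeline loses more of its peak IPC to each flush, yet still finishes more instructions per nanosecond thanks to its faster clock. Raise the flush count and that advantage shrinks.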


60 Comments


  • psonice - Tuesday, July 07, 2009

    My understanding is that the iphone 3gs GPU is actually a 535, not a 520. At least, this is the current understanding among iphone developers, and there's an SGX535 driver on the phone to support that. The extra power might explain the hit on battery life when playing games.

    Real numbers are pretty hard to come by, but it seems the 535 is roughly 4x faster than the 520. If so, that's a massive upgrade rather than just a decent one. The 535 also supports HD video decoding where the 520 doesn't - not that apple seem to be supporting it if it does.

    I heard too that the palm pre has a 530 GPU, which is 2x faster than the 520. That puts the iphone a long way ahead for graphics instead of behind.

    One thing in the article I really disagree with btw: you say that the phone makers should provide detailed specs. I think they shouldn't, as it's not helpful at all for the average buyer. If you go into a shop without having much clue and ask for an iphone because it's the latest thing, and the shop assistant says "well this is like an iphone, but it runs 200mhz faster" you'll end up buying the "better" phone based on the spec sheet, even if it's running win mobile 5.

    I was in Japan a while back, and they tend to buy phones based on the spec sheets there. The phones all compete on having the most features. They're all really big and HORRIBLE to actually use. None of that please!

    I think apple actually get their commercials right with the iphone on the whole: show somebody actually using the phone to do stuff. If the other manufacturers did the same, that would be a perfect way to compare.
  • christinme7890 - Thursday, July 09, 2009

    I agree with you holistically. There are not many people in this world that even understand the specs. Not to mention when it comes to specs, and the person has no clue, they end up getting the one with the highest numbers. This is bad. I think you are right in saying that the way apple works their commercials is perfect for people. They show people all the great apps that they could use and they say that ALL of these apps can be on one phone.

    This is why I hate the Best buy MS commercials where the kid goes into the BB and buys a PC instead of a mac. The person always buys the computer with the best specs and cares little about the OS, which is what they will be using. Windows, imo after using a Mac for a year, sucks in comparison to Mac. I rarely have a problem with a mac. I sit in class everyday and watch all the pc people have startup errors and os sleep or hibernation errors. I can close my mac and KNOW WITHOUT A DOUBT that it will wake up totally fine. Not to mention it wakes up seamlessly without load screens or anything. I will not compare the two but for business and usability the MAC gets my vote and I think Apple does their commercials for the macs just as well. Sure most people are still using MS but that is because MS strong arms people into buying their stuff every time you buy a Computer (not to mention Apple is very strict with their software and rightly so).
  • Anand Lal Shimpi - Tuesday, July 07, 2009

    Ooh, very interesting - do you have any links to discussions on the 535 being in the 3GS?

    I don't think end users need to be bombarded with specs, but I think there needs to be more information put out about these things. We shouldn't have to play guessing games about clocks and specs; don't market them, but don't hide them either - that's my thinking.

    Take care,
    Anand
  • BlazingDragon - Tuesday, July 07, 2009

    Anand, here it is:
    http://www.macrumors.com/2009/06/25/iphone-3gs-has...
  • Anand Lal Shimpi - Tuesday, July 07, 2009

    Very interesting - thanks guys, I've updated the article.

    Take care,
    Anand
  • ltcommanderdata - Tuesday, July 07, 2009

    It should probably also be noted that the MBX-Lite supports OpenGL ES 1.1 as implemented by Apple not just OpenGL ES 1.0. I believe it's Android's implementation that currently only supports OpenGL ES 1.0.

    It's also been reported that the iPhone OS 3.1 betas include improvements to the OpenGL stack that include additional OpenGL extensions. Whether these are focused on OpenGL ES 2.0 and the SGX or are also for OpenGL ES 1.1 and the MBX remains to be seen. Although on the issue of reducing market segmentation, it'd be great if Apple could implement the OpenGL ES 1.1 Extension Pack although I don't know if the MBX-Lite can actually support it in hardware.
  • BlazingDragon - Tuesday, July 07, 2009

    Anand, here it is:
    iPhone 3GS Has More Powerful PowerVR SGX 535 GPU?
  • kelmerp - Tuesday, July 07, 2009

    I'm trying to decide between the MyTouch or a jailbroken iphone.
  • sxr7171 - Wednesday, July 08, 2009

    JB iPhone vs. MyTouch? They're not even in the same league. Pre vs. iPhone is a comparison.
  • pennyfan87 - Tuesday, July 07, 2009

    anand,

    i love your writing and tech analysis.

    but please, drop the fanboyism.
    3 articles on such a minor upgrade? please.

    more SSD stuff please.