A Crash Course in CPU Architecture

It’s been years since I last walked through the life of an instruction, and back then it was for a very high end desktop processor. I realize that not everyone interested in what’s powering the iPhone 3GS or Palm Pre has been taken down this path, so I thought some of that knowledge might be useful here.

Applications spawn threads, threads are made up of instructions and instructions are what a CPU “processes”. The actual processing of an instruction is pretty simple; the CPU must fetch the instruction from memory, decode or somehow understand what the instruction is telling it to do (e.g. add two numbers), grab any data that is required by the instruction (e.g. find the numbers to be added), actually execute the instruction and finally write the result of the operation either to a register or memory.


Our basic microprocessor with a 5-stage pipeline

Based on the example above, executing an instruction requires five distinct stages. In a pipelined microprocessor, a different instruction can be active at each stage of the execution pipeline. For example, you can be grabbing data for one instruction, while decoding another and fetching yet another. All modern day processors work this way.
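If it helps to see the overlap, here is a minimal sketch of an ideal 5-stage pipeline, assuming every stage takes exactly one cycle and nothing ever stalls. The function name and the way instructions are numbered are my own choices for illustration.

```python
STAGES = ["Fetch", "Decode", "Fetch Operands", "Execute", "Write Output"]

def pipeline_timeline(num_instructions, stages=STAGES):
    """Return, for each clock cycle, which instruction occupies each stage.

    Instruction i enters the first stage on cycle i and advances one stage
    per cycle, so on a given cycle stage s holds instruction (cycle - s).
    Empty stages are reported as None.
    """
    total_cycles = num_instructions + len(stages) - 1
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for s, name in enumerate(stages):
            instr = cycle - s
            row[name] = instr if 0 <= instr < num_instructions else None
        timeline.append(row)
    return timeline

# Print one row per cycle; "--" marks a stage with nothing in it.
for cycle, row in enumerate(pipeline_timeline(7)):
    cells = ["I%d" % i if i is not None else "--" for i in row.values()]
    print("cycle %2d: %s" % (cycle, "  ".join(cells)))
```

The first four cycles have empty stages while the pipe fills; from the fifth cycle on, every stage is busy with a different instruction.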


Multiple instructions can exist in the pipeline at once, but only one instruction may be active at any given stage

Each one of these stages should take the same amount of time for the processor to work efficiently; the length of time required at the longest stage actually determines the clock speed of the CPU. If the most complex stage in my example above is the decode stage and it requires 3ns to complete, then my CPU can run no faster than 333MHz (1 / 3ns).
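The arithmetic is simple enough to sketch. Only the 3ns decode figure comes from my example; the latencies for the other four stages below are made up purely for illustration.

```python
# Per-stage latencies in nanoseconds. Only Decode (3ns) is from the example;
# the remaining values are hypothetical.
STAGE_LATENCY_NS = {
    "Fetch": 1.0,
    "Decode": 3.0,
    "Fetch Operands": 1.0,
    "Execute": 2.0,
    "Write Output": 1.0,
}

def max_clock_mhz(stage_latency_ns):
    """The slowest stage sets the clock period; frequency is its inverse."""
    slowest_ns = max(stage_latency_ns.values())
    # 1 / (period in ns) gives GHz, so multiply by 1000 for MHz.
    return 1000.0 / slowest_ns

print(round(max_clock_mhz(STAGE_LATENCY_NS)))  # 333
```

Making any of the faster stages quicker changes nothing here; only the slowest stage matters.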

To reach faster frequencies, we need to speed up each stage of the pipeline. You can speed up a stage by implementing some sweet new algorithms, or simply by splitting up complicated stages into simpler ones and increasing the number of stages in your pipeline.

In our previous example, the decode stage required 3ns to complete but if we split decode into three separate stages, each requiring 1ns, then we remove that bottleneck. Let’s say we do that but now some of our other stages become the bottleneck; with a target of a 1ns clock period (1ns spent per stage) we go from five stages to eight:

Fetch
Decode 1
Decode 2
Decode 3
Fetch Operands
Execute 1
Execute 2
Write Output
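One way to arrive at a list like the one above is to split every stage into as many pieces as it takes to fit the target clock period. This is a rough sketch: the 3ns decode latency is from the example, the other latencies are assumed, and it pretends a stage can be split perfectly evenly with no overhead, which real hardware does not give you.

```python
import math

# (stage name, latency in ns). Only the 3ns decode figure is from the
# example; the other latencies are hypothetical.
FIVE_STAGE = [
    ("Fetch", 1.0),
    ("Decode", 3.0),
    ("Fetch Operands", 1.0),
    ("Execute", 2.0),
    ("Write Output", 1.0),
]

def split_stages(stages, target_period_ns):
    """Split each stage into enough equal pieces to fit the target period."""
    pipeline = []
    for name, latency_ns in stages:
        pieces = math.ceil(latency_ns / target_period_ns)
        if pieces <= 1:
            pipeline.append(name)
        else:
            pipeline.extend("%s %d" % (name, i) for i in range(1, pieces + 1))
    return pipeline

for stage in split_stages(FIVE_STAGE, target_period_ns=1.0):
    print(stage)
```

With these assumed latencies the five original stages become the eight listed above.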

Now, with each stage running at 1ns, our maximum clock speed goes up from 333MHz to 1000MHz (1GHz). Sweet. Right?

With less work being done in each stage, we reach a higher clock speed, but we also depend on each stage being full in order to operate at peak efficiency.


5-stage pipeline (top) vs 8-stage pipeline (bottom). The 8 stage pipe is more desirable, but also requires more instructions to fill.

In the first CPU example we had a 5 stage pipeline, which meant that we needed to have the pipe full of 5 instructions at any given time to be operating at peak efficiency of 1 instruction completed every cycle. The second example has a ginormous 8 stage pipeline, which requires 8 instructions in the pipe for peak efficiency. In both cases you can only get one instruction out of the pipe every cycle, but because the second chip’s cycles are a third as long, it can give us more completed instructions in, say, 10 seconds.
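To put rough numbers on that, here is a small sketch that assumes an ideal pipeline with no stalls: the first instruction needs one cycle per stage to come out the other end, and every instruction after it completes one cycle later. The function name is mine.

```python
def time_to_finish_ns(num_instructions, depth, period_ns):
    """Time for an ideal, never-stalling pipeline to complete N instructions.

    The first instruction takes `depth` cycles to traverse the pipe; each
    additional instruction finishes one cycle after the previous one.
    """
    if num_instructions == 0:
        return 0
    return (depth + num_instructions - 1) * period_ns

# 5 stages at 3ns per cycle vs 8 stages at 1ns per cycle.
print(time_to_finish_ns(1000, depth=5, period_ns=3))  # 3012
print(time_to_finish_ns(1000, depth=8, period_ns=1))  # 1007
```

The deeper pipe pays three extra cycles up front to fill, then wins because each of its cycles is a third as long.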

Now think for a moment about the time periods we’re talking about here. The first CPU had a clock period of 3ns, where each stage took 3ns to complete. The second CPU had a clock period of 1ns. A single trip to main memory can easily take 60ns for a CPU with a very fast on-die memory controller, or over 100ns otherwise. For the sake of argument let’s say that we’re talking about a 100ns trip to main memory. Remember the Fetch Operands stage? Well if those operands are located in main memory that stage won’t take 3ns to complete, but rather 103ns since it has to get the operands from main memory.
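It helps to turn that latency into clock cycles. A quick sketch, assuming the memory latency is a fixed number of nanoseconds regardless of CPU clock speed:

```python
import math

def stall_cycles(memory_latency_ns, clock_period_ns):
    """Whole clock cycles that pass while waiting on main memory."""
    return math.ceil(memory_latency_ns / clock_period_ns)

print(stall_cycles(100, 3))  # 34 cycles on the 333MHz CPU
print(stall_cycles(100, 1))  # 100 cycles on the 1GHz CPU
```

The trip to memory takes the same wall-clock time either way, but the faster chip gives up roughly three times as many cycles waiting for it.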

Rather than stall the pipeline for such an absurd length of time, a processor can switch contexts when an access has to go all the way out to main memory. The contents of the pipeline get flushed and filled with another thread while the data request goes off to main memory. Once the data is ready, the processor switches contexts once more and continues on its execution path. Here’s the problem: it takes time to refill the pipeline, and the more stages the pipeline has, the more cycles it takes to refill it. This is a bad but regular occurrence in a microprocessor. While the pipe refills, our instruction throughput drops from its 1 instruction per clock peak to 0; not good.
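Here is a back-of-the-envelope model of what flushes do to throughput. It is a sketch with a big simplifying assumption: every flush costs one lost cycle per pipeline stage, and nothing else ever goes wrong. The flush rate is a made-up parameter, not a measured number.

```python
def effective_ipc(depth, flush_rate):
    """Average instructions per clock when some instructions force a flush.

    Assumes each flush wastes `depth` cycles refilling the pipe, on top of
    the one cycle per instruction an ideal pipeline needs.
    """
    cycles_per_instruction = 1.0 + flush_rate * depth
    return 1.0 / cycles_per_instruction

# Hypothetical: 2% of instructions trigger a flush and refill.
for depth in (5, 8):
    print("%d stages: %.3f instructions per clock" % (depth, effective_ipc(depth, 0.02)))
```

Under this model the 8-stage pipe loses more of its peak than the 5-stage pipe at the same flush rate, which is the cost side of the deeper pipeline.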

Other scenarios can create interruptions in the normal flow of things within our microprocessor. Some instructions may take multiple cycles at a single stage to complete. More complex arithmetic may spend significantly longer at the execute stage while the operation works out. In an in-order microprocessor, every instruction behind the slow one must wait.
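A small sketch of that in-order behavior, assuming only the execute stage can take more than one cycle and nothing is allowed to pass a slow instruction. The function and its inputs are hypothetical.

```python
def in_order_cycles(depth, execute_cycles):
    """Total cycles for an in-order pipeline to finish a run of instructions.

    `execute_cycles` lists how many cycles each instruction spends in the
    execute stage. Every other stage takes one cycle. Because instructions
    cannot pass each other, the execute stage handles them back to back,
    and the remaining (depth - 1) stages add a fixed fill/drain cost.
    """
    if not execute_cycles:
        return 0
    return (depth - 1) + sum(execute_cycles)

print(in_order_cycles(5, [1, 1, 1, 1]))  # 8: four simple instructions
print(in_order_cycles(5, [1, 4, 1, 1]))  # 11: one 4-cycle operation in the mix
```

The single slow instruction costs three extra cycles, and the two instructions behind it finish three cycles late even though they are simple themselves.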

Again, the more stages in your pipeline, the bigger the penalty for a stall. But when the pipeline is full, a deeper pipeline will give us a higher clock speed and better overall performance - we just need to worry about keeping the pipeline full (which takes a great many additional transistors). And yes, there is an upper limit to how deep you can pipeline your processor before you start running into diminishing returns in both a performance and power sense; this was ultimately the downfall of the Pentium 4’s architecture.


60 Comments


  • MrJim - Wednesday, July 08, 2009

    Why no mention of the heat issues?
  • ViRGE - Wednesday, July 08, 2009

    Anand, if you haven't already, jailbreak the 3GS and grab SysInfoPlus from Cydia. It may be able to tell you the clock speed of the 3GS's ARM, although to what extent I'm not sure since it hasn't been specifically programmed for the A8.
  • ltcommanderdata - Wednesday, July 08, 2009

    I don't suppose that program can also tell the GPU clock speed too?

    I always thought that the MBX works at bus speed, i.e. 103MHz for the iPhone/3G and 133MHz for the 2nd Gen iPod Touch instead of the 60MHz that Anand has speculated. Assuming the iPhone 3G S has a 150MHz bus speed, the SGX could run at 150MHz which is a reasonable compromise between Anand's 100MHz and 200MHz estimates.
  • fyleow - Wednesday, July 08, 2009

    How useful is the new GPU? The iPhone's performance has come a long way from the first generation but I don't see developers taking full advantage of the jump. If you bump up the graphics of your game it might run smoothly on the 3GS but end up lagging on the 1st gen iPhone.

    The increase in load times and battery life is much welcomed, but when do we get to see some apps that take advantage of the upgraded hardware in other more interesting ways? I can see a resolution increase as being one way to do that. The game would look better on a higher resolution screen but performance wouldn't suffer on the older models because the lower resolution would place less demand on the hardware.

    2010 will be an interesting year. There should be a bigger upgrade to the iPhone, most likely a resolution bump and a significantly modified OS that supports background tasks. Apple has been keeping all the devices on the iPhone platform on feature parity so far with the OS upgrades (minus obvious limitations due to hardware differences). It would be interesting to see how they handle the switch and the resulting two classes of phones that come from it (i.e. old "legacy" iPhones/Touch vs new iPhones/Touch).
  • ltcommanderdata - Wednesday, July 08, 2009

    You're right that it's difficult to take full advantage of the SGX without writing a separate dedicated code path for it and one for the MBX. However, there are simpler ways to take advantage of the iPhone 3G S power without writing 2 separate code paths. For example, you can scale draw distance based on hardware. Firemint demonstrated the iPhone 3G S accelerating 40 cars in Real Racing compared to 6 in the iPhone 3G, so the potential for better scalable AI is there. For a RPG, perhaps having more NPCs walking around to make the environment more lifelike. This can all be done using existing OpenGL ES 1.1 code playable on all iPhones/Touches, optimizing for each device, without making older iPhone users feel like they are playing some Lite version of the game as implementing shaders and HDR using OpenGL ES 2.0 in the iPhone 3G S might do.

    I believe the reluctance of Apple to change the resolution is that it could break the interface layout for existing apps and/or make things ugly if apps haven't used vector graphics. It would have been nice if they had enforced resolution independence early on, but I don't believe they did. Resolution independence is also what is needed for Apple to introduce an iPhone nano with a smaller screen and presumably smaller resolution.
  • smallpot - Wednesday, July 08, 2009

    Thanks for the article Anand. Your long-form articles are the reason Anandtech is my number one tech website. I'm thinking of articles such as this, your articles on SSD performance, and the long-form story behind the RV770. After reading such articles, I really feel like I've learned something, rather than just had performance metrics thrown at me without context.
  • Baron Fel - Tuesday, July 07, 2009

    Interesting article.

    As far as portable gaming goes, the Ipod Touch/iPhone/Zune HD dont have a chance against the DS or even the PSP. The software support just isnt there.

    PSP hardware runs circles around the DS, so why is the DS killing it in sales? Good games.

    and are we getting more SSD articles anytime soon? I think thats what we want to see :D
  • ltcommanderdata - Tuesday, July 07, 2009

    Given all the media attention about discoloration and possible heat issues with the iPhone 3G S, I was wondering if you could comment on your experience in this area. Do you think it's a real concern or just stories popularized to generate page hits as Apple related stories tend to do? The latest reports on discoloration indicate that it might actually be from a reaction with some third-party cases that may be reversed by cleaning with alcohol.

    Similarly, there have been lower-key reports of build quality issues with the Palm Pre having a wobbly screen from its slide-out keyboard. Has this been a major issue for you and do you think it'll be an issue over time?
  • Anand Lal Shimpi - Tuesday, July 07, 2009

    I haven't seen anything to indicate heat as being a bigger concern with the 3GS. It's a new processor so there's bound to be some bad chips out there, but I wouldn't be too concerned.

    The build quality on the Pre did bother me. It's something that I think bothered me more because of my experience with the iPhone. The screen was a bit wobbly and overall the device just didn't feel as well put together. Part of it is because of the slide out keyboard, but part of it has to be cost/experience related. I think you'd get used to it over time, but if you then held an iPhone you'd quickly grow tired of the build quality issues once again :)

    Take care,
    Anand
  • tomoyo - Tuesday, July 07, 2009

    Btw Anand, the chart for number of stages in the cpus shows the Iphone 3GS as 8 stage instead of 13.
