OoOE

You’re going to come across the phrase out-of-order execution (OoOE) a lot here, so let’s go through a quick refresher on what that is and why it matters.

At a high level, the role of a CPU is to read instructions from whatever program it’s running, determine what they’re telling the machine to do, execute them and write the result back out to memory.

The program counter within a CPU points to the address in memory of the next instruction to be executed. The CPU’s fetch logic grabs instructions in order. Those instructions are decoded into an internally understood format (a single architectural instruction sometimes decodes into multiple smaller instructions). Once an instruction is decoded, all necessary operands are fetched from memory (if they’re not already in local registers) and the combination of instruction + operands is issued for execution. The results are committed to memory (registers/cache/DRAM) and it’s on to the next one.
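
To make that flow concrete, here’s a minimal sketch in Python of the fetch/decode/execute/commit loop for a made-up three-operand register machine. The instruction set, the tuple encoding and the register file are all invented for illustration - a real decoder obviously deals with far messier encodings:

    def run(program, num_regs=8):
        regs = [0] * num_regs
        pc = 0                            # program counter: next instruction
        while pc < len(program):
            op, dst, a, b = program[pc]   # fetch + decode into (opcode, operands)
            if op == "addi":              # add register + immediate
                regs[dst] = regs[a] + b
            elif op == "add":             # add two registers
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":             # multiply two registers
                regs[dst] = regs[a] * regs[b]
            pc += 1                       # commit the result and move on, in order
        return regs

    # r1 = 5; r2 = 7; r3 = r1 * r2
    print(run([("addi", 1, 0, 5), ("addi", 2, 0, 7), ("mul", 3, 1, 2)])[3])  # 35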

In-order architectures complete this pipeline in order, from start to finish. The obvious problem is that many steps within the pipeline are dependent on having the right operands immediately available. For a number of reasons, this isn’t always possible. Operands could depend on other earlier instructions that may not have finished executing, or they might be located in main memory - hundreds of cycles away from the CPU. In these cases, a bubble is inserted into the processor’s pipeline and the machine’s overall efficiency drops as no work is being done until those operands are available.

Out-of-order architectures attempt to fix this problem by allowing independent instructions to execute ahead of others that are stalled waiting for data. In both cases instructions are fetched and retired in-order, but in an OoO architecture instructions can be executed out-of-order to improve overall utilization of execution resources.
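
The difference is easiest to see with a toy model. The sketch below runs a hypothetical four-instruction stream - one long-latency load, a dependent use, and two independent instructions - through a single-issue machine twice, once issuing strictly in order and once allowing any ready instruction to issue. The latencies are made up, and real hardware adds register renaming and a reorder buffer for in-order retirement, which this sketch omits entirely:

    def cycles(prog, in_order):
        done = {}      # instruction name -> cycle when its result becomes available
        issued = set()
        cycle = 0
        while len(issued) < len(prog):
            cycle += 1
            for name, latency, deps in prog:
                if name in issued:
                    continue
                if all(done.get(d, float("inf")) <= cycle for d in deps):
                    issued.add(name)                 # operands ready: issue it
                    done[name] = cycle + latency
                if in_order or name in issued:
                    break  # in-order never looks past the oldest unissued instruction
        return max(done.values())

    prog = [("load", 10, set()),      # long-latency load (e.g. a cache miss)
            ("use",   1, {"load"}),   # depends on the load's result
            ("x",     1, set()),      # independent work
            ("y",     1, set())]      # independent work
    print(cycles(prog, in_order=True))   # 14: x and y stall behind the load
    print(cycles(prog, in_order=False))  # 12: x and y execute inside the bubble

Even in this trivial case the independent work hides part of the load’s latency; scale that up to dozens of in-flight instructions and the utilization win is substantial.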

The move to an OoO paradigm generally comes with penalties to die area and power consumption, which is one reason the earliest mobile CPU architectures were in-order designs. The ARM11, ARM’s Cortex A8, Intel’s original Atom (Bonnell) and Qualcomm’s Scorpion core were all in-order. As performance demands continued to go up, and with new, smaller/lower-power transistors available, all of the players here started introducing OoO variants of their architectures. Although often referred to as out-of-order designs, ARM’s Cortex A9 and Qualcomm’s Krait 200/300 are only mildly OoO compared to the Cortex A15. Intel’s Silvermont joins the ranks of the Cortex A15 as a fully out-of-order design by modern day standards. The move to OoO alone should be good for around a 30% increase in single-threaded performance vs. Bonnell.

Pipeline

Silvermont changes the Atom pipeline slightly. Bonnell featured a 16-stage in-order pipeline. One side effect of that design was that all operations, including those that didn’t access the cache (e.g. operations whose operands were already in registers), had to go through three data cache access stages even though nothing happened during those stages. In going out-of-order, Silvermont allows instructions to bypass those stages if they don’t need data from memory, effectively shortening the mispredict penalty from 13 stages down to 10. The integer pipeline depth now varies depending on the type of instruction, but you’re looking at a range of 14 - 17 stages.
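
For a rough sense of what those three stages are worth, here’s a back-of-the-envelope sketch. The 13- and 10-stage penalties come from above; the branch frequency and mispredict rate are purely illustrative assumptions, not Intel’s numbers:

    branch_freq = 0.20      # assumption: 1 in 5 instructions is a branch
    mispredict_rate = 0.05  # assumption: 5% of branches are mispredicted

    def cpi(base_cpi, penalty_stages):
        # Every mispredicted branch throws away 'penalty_stages' cycles of work.
        return base_cpi + branch_freq * mispredict_rate * penalty_stages

    bonnell = cpi(1.0, 13)     # old 13-stage mispredict penalty
    silvermont = cpi(1.0, 10)  # new 10-stage mispredict penalty
    print(f"IPC gain from the shorter penalty alone: {bonnell / silvermont - 1:.1%}")
    # -> 2.7%, with everything else held constant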

Branch prediction, a staple of any progressive microprocessor architecture, improves tremendously with Silvermont. Silvermont takes the gshare branch predictor of Bonnell and significantly increases the size of all associated data structures. Silvermont also adds an indirect branch predictor. The combination of the larger predictors and the new indirect predictor should increase branch prediction accuracy.
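
For reference, gshare is a simple, well-documented scheme: the branch address is XORed with a global register recording recent branch outcomes, and the result indexes a table of 2-bit saturating counters. Here’s a minimal sketch with arbitrary toy sizes - Silvermont’s actual structure sizes aren’t modeled:

    class GShare:
        def __init__(self, index_bits=10):
            self.mask = (1 << index_bits) - 1
            self.history = 0                         # global history of outcomes
            self.counters = [1] * (1 << index_bits)  # 2-bit counters, weakly not-taken

        def _index(self, pc):
            return (pc ^ self.history) & self.mask   # gshare: PC XOR global history

        def predict(self, pc):
            return self.counters[self._index(pc)] >= 2   # 2 or 3 means predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)
            self.history = ((self.history << 1) | taken) & self.mask

    bp = GShare()
    hits = 0
    for taken in ([True] * 3 + [False]) * 10:  # a 3-iteration loop, run 10 times
        hits += bp.predict(0x400) == taken
        bp.update(0x400, taken)
    print(f"{hits}/40 correct")  # 30/40: perfect once the counters warm up

The XOR with history is what lets the predictor distinguish the same branch in different contexts: in the toy loop above, the exit iteration sees a different history than the body iterations, so it gets its own counter.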

Couple better branch prediction with a lower mispredict latency and you’re talking about another 5 - 10% increase in IPC over Bonnell.

Comments

  • silverblue - Monday, May 6, 2013

    I do wonder how much having a dual channel memory interface helps Silvermont, though. It's something that neither Atom nor Bobcat has enjoyed previously, and I've not heard much about Jaguar on this subject (ignoring the PS4, that is). AMD certainly has the lead on ISAs though, so regardless of how good Silvermont is, it's going to trail in some places.

    I'm a little confused as to the virtual lack of a comparison to AMD in this piece; yes, Intel did say they wanted to beat ARM at its own game, but with Jaguar-powered devices already in the wild and AMD sporting a new custom-CPU team for whoever wants whatever, this is going to be interesting.

    Benchmarks, please! ;)
  • powerarmour - Monday, May 6, 2013

    Atom had dual channel memory with the ION chipset btw.
  • silverblue - Monday, May 6, 2013

    Really? Oh well, in that case then, maybe not too much.
  • Spunjji - Wednesday, May 8, 2013

    Only until Intel murdered that, of course :|
  • ajp_anton - Monday, May 6, 2013

    Where did you find "8x" in the slides?
  • Gigaplex - Tuesday, May 7, 2013

    AMD’s HSA is most definitely something to be enthusiastic about.
  • theos83 - Monday, May 6, 2013

    You're right, I've seen this tendency in AT's reviews and discussions as well. I understand that a lot of it comes from reviewing PC components and processors where Intel dominated the market. Also, most of the slides here are marketing material. For example, the 22nm Ivy Bridge tri-gate plots have been out since 2011. True, Intel is the first and only foundry to bring FinFETs to the market successfully and I applaud them for that. However, the performance vs power advantage is not that evident, since even though Tri-gates allow 100mV reduction in threshold voltage and hence, supply voltage, various blogs have reported that most Ivy bridge processors did not scale down supply voltage below 0.9V. FinFETs are great for high performance parts, however, you need to really pay attention to reliability and variation to make it successful for SoCs, they are a completely different ball-game.

    Also, the rest of the SoC makers already have roadmaps ready for the future; they are a fast-moving target. Hence, let’s see benchmarked numbers from Intel processors before jumping on the marketing bandwagon.
  • Pheesh - Monday, May 6, 2013

    "However, the performance vs power advantage is not that evident, since even though Tri-gates allow 100mV reduction in threshold voltage and hence, supply voltage, various blogs have reported that most Ivy bridge processors did not scale down supply voltage below 0.9V." Didn't the start of the article cover that they are using a different manufacturing process for these lower power SOC's as compared to ivy bridge processors?
  • saurabhr8here - Monday, May 6, 2013

    The SoC process has some differences in the metal stack for higher density and has additional transistor flavors (longer channel lengths). Check Intel's IEDM 2012 paper for more information; however, the truth is that there's a significant gap between the tri-gate process improvements claimed in the 'plots' shown and the actual performance improvements in processors. I think that Intel tri-gates are great, but they aren't as 'wonderful' as presented in the marketing slides.
  • Krysto - Monday, May 6, 2013

    Thank you! People are starting to get it.
