It’s an Out of Order Atom

Ever since the Pentium Pro (P6), we have been blessed with out of order microprocessor architectures: designs that can execute instructions out of program order to improve performance. Out of order architectures let you schedule independent instructions ahead of others that are either waiting for data from main memory or waiting for specific execution resources to free up. The resulting performance boost comes at the expense of power and die size. All of the tracking logic needed to make sure that instructions executed out of order still retire in order eats up die area and burns additional power.
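To make the scheduling idea concrete, here is a toy sketch of a single-issue machine that can run either in order or out of order. It is not a model of any real CPU: the `Instr` type, the `run` function, and the latencies (including the 10-cycle "load") are all hypothetical values picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple = ()   # names of instructions whose results this one needs
    latency: int = 1   # cycles from issue until the result is available


def run(program, out_of_order):
    """Toy single-issue model: return the cycle at which the last result is ready."""
    done_at = {}             # instruction name -> cycle its result becomes available
    pending = list(program)  # not yet issued, kept in program order
    cycle = 0
    while pending:
        # In order: only the oldest pending instruction may issue.
        # Out of order: any pending instruction with ready operands may issue.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(d in done_at and done_at[d] <= cycle for d in ins.deps):
                done_at[ins.name] = cycle + ins.latency
                pending.remove(ins)
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return max(done_at.values())


program = [
    Instr("load", latency=10),     # stand-in for a slow memory access
    Instr("use", deps=("load",)),  # must wait for the load
    Instr("a"), Instr("b"), Instr("c"),  # independent work
]

in_order_cycles = run(program, out_of_order=False)
ooo_cycles = run(program, out_of_order=True)
print(in_order_cycles, ooo_cycles)
```

In this made-up program the in-order run finishes at cycle 14 and the out of order run at cycle 11, because the three independent instructions slip in while the dependent one waits on the load. The sketch ignores the retirement and tracking logic entirely, which is exactly the part that costs die area and power.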

When Intel designed the Atom processor it went back to an in-order design as a way of reducing power. Intel has committed to using in-order architectures in Atom for 4 - 5 years post introduction (that would end sometime in the 2012 - 2013 time frame).

For smartphones, Intel’s commitment to in-order makes sense. Average power consumption under load needs to remain at less than 1W and you simply can’t hit that with an out-of-order Atom at 45nm.

For netbooks and notebooks however, the tradeoff makes less sense. Jarred has often argued that a CULV notebook is a far better performer than a netbook at very similar price/battery life metrics. No one is pleased with Atom’s performance in a netbook, but there’s clearly demand for the form factor and price point. Where there’s an architectural opportunity like this, AMD is usually there to act.

Over the past decade AMD has refrained from copying Intel designs. Instead, AMD usually looks to leapfrog Intel by implementing forward looking technologies earlier than its competitor. We saw this with the 64-bit K8 and the cache hierarchy of the original Phenom and Phenom II processors. Both featured design decisions that Intel would later adopt; they were simply ahead of their time.

With Atom stuck in an in-order world for the near future, AMD’s opportunity to innovate is clear.

The Architecture

Admittedly I was caught off guard by Bobcat’s architecture: it’s a dual-issue design, the first AMD has introduced since the K6 and also the same issue width Intel chose for Atom. Where AMD and Intel diverge, however, is on the execution side: Bobcat is a fully out of order architecture.

The move to out of order should provide a healthy single threaded performance boost over Atom, assuming AMD can ramp clocks up. Bobcat has a 15 stage integer pipeline, very close to Atom's 16 stage pipe. The two pipeline diagrams are below:




Intel's Atom pipeline

You’ll note that there are technically six fetch stages, although only the first three are included in the 15 stage number I mentioned above. AMD mentioned that the remaining three stages are used for branch prediction, but in a manner it is unwilling to disclose at this time due to competitive concerns.

Bobcat has two independent, dual ported integer schedulers. One feeds two ALUs (one of which can perform integer multiplies) while the other feeds two AGUs (one for loads and one for stores).

The FPU has a single dual ported scheduler that feeds two independent FPUs. Similar to the Atom processor, only one of the ports can handle floating point multiplies. The FP mul and add units can perform two single precision (32-bit) multiplies/adds per cycle. Like the integer side, the FPU uses a physical register file to reduce power.
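As a back-of-the-envelope sketch, the FPU description above can be turned into peak numbers. This reads "two single precision multiplies/adds per cycle" as two 32-bit lanes on each of the two FP units, which is my interpretation of the sentence. The function names are hypothetical, and the 1.0GHz clock is a placeholder since no clock speed is given here.

```python
SP_LANES_PER_UNIT = 2  # from the text: two 32-bit operations per cycle per unit
FP_UNITS = 2           # one multiply unit and one add unit


def peak_sp_flops_per_cycle(lanes=SP_LANES_PER_UNIT, units=FP_UNITS):
    """Best case: both FP units busy with full-width work every cycle."""
    return lanes * units


def cycles_per_packed_op(vector_lanes=4, lanes_per_cycle=SP_LANES_PER_UNIT):
    """Cycles one unit needs for a packed op, e.g. a 4-wide SSE single precision op."""
    return -(-vector_lanes // lanes_per_cycle)  # ceiling division


def peak_gflops(clock_ghz, cores=1):
    """Peak single precision GFLOPS; clock_ghz is a placeholder, not a disclosed spec."""
    return peak_sp_flops_per_cycle() * clock_ghz * cores


print(peak_sp_flops_per_cycle())      # 4
print(cycles_per_packed_op())         # 2
print(peak_gflops(1.0, cores=2))      # two cores at the placeholder clock
```

Under that reading, a 128-bit packed single precision instruction would occupy a unit for two cycles, and a core tops out at four single precision operations per cycle when multiplies and adds are perfectly balanced. Real code will rarely hit that mix.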

Bobcat supports SSE1-3, with future versions adding more instructions as necessary.

Like Intel’s Core architecture, Bobcat also supports out of order loads and stores.

The Bobcat core has a 3-cycle 64KB L1 (32KB instruction + 32KB data cache) that’s 8-way set associative. The L2 cache is a 17-cycle, 512KB 16-way set associative cache. I originally measured Atom’s L1 and L2 at 3 and 18 cycles respectively (I’ve heard numbers as low as 15 for Atom’s L2) so AMD is definitely in the right ballpark here.
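For a rough feel of what those cache numbers imply, the sketch below works out the number of sets in each cache and a simple average access time. Several inputs are assumptions rather than disclosed figures: the 64-byte line size, the miss rates, and the 150-cycle memory latency are all hypothetical, and the L2 latency is treated as the added cost of an L1 miss, which is a simplification.

```python
KB = 1024


def num_sets(size_bytes, ways, line_bytes=64):
    """Sets = capacity / (associativity * line size). line_bytes is an assumption."""
    return size_bytes // (ways * line_bytes)


def average_access_cycles(l1_cycles, l2_cycles, mem_cycles, l1_miss, l2_miss):
    """Two-level average access time; miss rates are fractions between 0 and 1."""
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


# Sizes, associativity, and the 3/17-cycle latencies come from the text above.
l1d_sets = num_sets(32 * KB, ways=8)    # 64 sets with the assumed 64-byte line
l2_sets = num_sets(512 * KB, ways=16)   # 512 sets with the assumed 64-byte line

# Hypothetical workload: 5% L1 miss rate, 20% L2 miss rate, 150-cycle memory.
amat = average_access_cycles(3, 17, mem_cycles=150, l1_miss=0.05, l2_miss=0.20)

print(l1d_sets, l2_sets, round(amat, 2))
```

With these made-up miss rates the average lands a little above five cycles, and a one cycle difference in L2 latency moves it by only a few hundredths of a cycle. That is the sense in which a 17 versus 18 cycle L2 is the same ballpark.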


Intel's Atom Microarchitecture

Unlike the original Atom, Bobcat will never ship as a standalone microprocessor. Instead it will be integrated with other cores and a GPU and sold as a single SoC. The first incarnation of Bobcat will be a processor due out in early 2011 for netbooks and thin and light notebooks called Ontario. Ontario will integrate two Bobcat cores with an AMD GPU manufactured on TSMC’s 40nm process (Bobcat will be the first x86 core made at TSMC). This will be the first Fusion product to hit the market.

Note that there's an on-die memory controller but it's actually housed in between the CPU and GPU in order to equally serve both masters.


76 Comments


  • ROad86 - Thursday, August 26, 2010 - link

    I think, without being a PC expert, that AMD was trying to maximize multi-threaded performance in less die area while being more power efficient. But I believe they are still developing Bulldozer in order to maximize single-thread performance too. On the desktop not many applications are threaded well enough, so they have to be competitive in single-thread performance too. That's why I believe they haven't announced a release date yet. Between the new 32 nm manufacturing process and, I think, waiting for the release of Sandy Bridge to see how much better Intel's new processors are, the release date will probably be Q4 2011. But these are just speculations.
  • Vallwesture - Thursday, August 26, 2010 - link

    It has been over two years...
  • ROad86 - Thursday, August 26, 2010 - link

    New architecture, completely new design, and maybe software needs to be optimized too (Windows 7 for example). In the end let's hope it brings something truly amazing. On paper it does, but let's wait for reviews!
  • KonradK - Thursday, August 26, 2010 - link

    "The basic building block is the Bulldozer module. AMD calls this a dual-core module because it has two independent integer cores and a single shared floating point core that can service instructions from two independent threads"

    I'm curious whether CPU schedulers can distinguish cores located in the same module from cores located in other modules of Bulldozer.
    Because two cores located in the same module share one FPU, running two FPU-heavy threads on two cores in the same module while leaving cores in other modules idle would be suboptimal at the least.
  • Simen1 - Tuesday, August 31, 2010 - link

    From page 6: "Aggressive prefetching usually means there’s a good amount of memory bandwidth available so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller in high end configurations similar to what we have today with Gulftown."

    AMD already has a 4 channel DDR3 design. It's in the Opteron 6100 line of processors on the G34 socket (LGA1974). AMD has promised it will be compatible with future Bulldozer-based processors.
  • liem107 - Monday, September 6, 2010 - link

    I wonder how Bobcat would fare against the VIA Nano. Considering VIA's portfolio, it would be a good acquisition for Nvidia, for example, to get its hands on a fairly good x86 core and license.
