Today at the annual Hot Chips conference, AMD’s new CTO Mark Papermaster unveiled the first details about the Steamroller x86 CPU core.

Steamroller is the third instantiation of AMD’s Bulldozer architecture, first conceived in the mid-2000s and finally brought to market in late 2011. Committed to this architecture for at least one more design after Steamroller, AMD has settled on roughly yearly updates to the architecture. For 2012 we have the introduction of Piledriver, the optimized Bulldozer derivative that formed the CPU foundation for AMD’s Trinity APU. By the end of the year we’ll also see a high-end, Piledriver-based desktop CPU without processor graphics.

Piledriver saw a switch to hard edge flip flops, which allowed for a considerable decrease in power consumption at the expense of careful design and validation work. Performance didn’t change, but AMD saw a 10% - 20% reduction in active power. Piledriver also brought some scheduling efficiency improvements, with better prefetching and branch prediction as the two other major design changes.

Steamroller is designed to keep the ball rolling. It takes fundamentals from the Bulldozer/Piledriver architectures and offers a healthy set of evolutionary improvements on top of them. In Intel speak Steamroller wouldn’t be a tick as it isn’t accompanied by a significant process change (28nm bulk is pretty close to 32nm SOI), but it’s not a tock either, as the architecture is enhanced but fundamentally unchanged. Steamroller fits somewhere in between those two extremes when it comes to changes.
 

Front End Improvements

 
One of the biggest issues with the front end of Bulldozer and Piledriver is the shared fetch and decode hardware. This table from our original Bulldozer review helps illustrate the problem:
 
Front End Comparison
                                  AMD Phenom II           AMD FX             Intel Core i7
Instruction Decode Width          3-wide                  4-wide             4-wide
Single Core Peak Decode Rate      3 instructions          4 instructions     4 instructions
Dual Core Peak Decode Rate        6 instructions          4 instructions     8 instructions
Quad Core Peak Decode Rate        12 instructions         8 instructions     16 instructions
Six/Eight Core Peak Decode Rate   18 instructions (6C)    16 instructions    24 instructions (6C)
 
Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller. 
 
The penalties are pretty obvious: both area and power consumption go up. However, the tradeoff is likely worth it, and both of these downsides can be offset in other areas of the design as you’ll soon see.
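
To make the decode arithmetic concrete, here is a minimal sketch that reproduces the peak decode figures from the table above and extends them to a design with one decoder per core. The function name and parameters are our own, and the Steamroller column is simply an extrapolation from "each core has its own 4-wide instruction decoder", not a figure AMD has published.

```python
import math

def peak_decode_rate(active_cores, decode_width, cores_per_decoder):
    """Peak instructions decoded per cycle across all active cores.

    cores_per_decoder is 2 when two cores in a module share one decoder
    (Bulldozer/Piledriver) and 1 when every core has its own decoder.
    """
    decoders_in_use = math.ceil(active_cores / cores_per_decoder)
    return decoders_in_use * decode_width

# Compare a shared 4-wide decoder per module against a dedicated
# 4-wide decoder per core (the latter is our extrapolation).
for cores in (1, 2, 4, 8):
    shared = peak_decode_rate(cores, decode_width=4, cores_per_decoder=2)
    dedicated = peak_decode_rate(cores, decode_width=4, cores_per_decoder=1)
    print(f"{cores} cores: shared={shared}, dedicated={dedicated}")
```

These are peak numbers only. As noted above, a 4-issue front end rarely runs anywhere near full utilization, so the real-world gain will be much smaller than the doubling this arithmetic suggests.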

Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches by up to 20%. 
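
AMD hasn’t disclosed how its predictor is built, so the following is only a sketch of the textbook perceptron predictor described by Jiménez and Lin, not AMD’s implementation. The table size, history length and class name are arbitrary choices for illustration.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, not AMD's design)."""

    def __init__(self, history_length=8, table_size=64):
        self.table_size = table_size
        # Training threshold suggested by Jimenez and Lin: floor(1.93 * h + 14).
        self.threshold = int(1.93 * history_length + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_length + 1) for _ in range(table_size)]
        # Global history: +1 for taken, -1 for not taken, oldest first.
        self.history = [1] * history_length

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        # A non-negative dot product means "predict taken".
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % self.table_size]
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        # Shift the actual outcome into the global history.
        self.history = self.history[1:] + [t]
```

In use, the front end would call predict() with the branch address at fetch time and update() once the branch resolves. A bigger table, like a bigger branch target buffer, means fewer branches competing for the same entry.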
 

Execution Improvements

 
AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer many specifics, saying only that the sharing applies to mutually exclusive MMX/FMA/FP operations and thus wouldn’t result in a performance penalty.
 
The reduction of pipeline resources is supposed to deliver the same throughput at lower power and area. It’s basically a smarter implementation of the Bulldozer/Piledriver FPU.

There’s no change to the integer execution units themselves, but other changes around them should improve integer performance.
 
The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF. 
 
The scheduling windows also increased in size, which should enable greater utilization of existing execution resources. 
 
Store to load forwarding sees an improvement. Steamroller is better than its predecessors at detecting interlocks, cancelling the load and getting the data from the store instead.
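
For readers unfamiliar with the idea, store to load forwarding can be sketched in a few lines. This is a deliberately simplified model under our own assumptions (whole-word accesses, exact address matches, hypothetical names); real hardware also has to deal with partial overlaps, differing access sizes and addresses that aren’t known yet, which is where the interlock detection comes in.

```python
from dataclasses import dataclass

@dataclass
class PendingStore:
    """A store that has executed but not yet been written to the cache."""
    address: int
    value: int

def load(address, store_queue, memory):
    """Return (value, forwarded) for a load issued after the queued stores.

    store_queue is ordered oldest first; memory is a dict of address -> value.
    """
    # Search youngest to oldest so the most recent matching store wins.
    for store in reversed(store_queue):
        if store.address == address:
            return store.value, True  # data comes straight from the store
    # No pending store to this address: read memory as usual.
    return memory.get(address, 0), False
```

The point of the model is the first return: when a matching store is still in flight, the load never needs to touch the cache at all.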

126 Comments


  • CeriseCogburn - Friday, October 12, 2012 - link

    " AMD will catch up with Intel sooner than most of you thought "
    LOL
    Most think never, so how much sooner is sooner than never ?
    Reply
  • aesthetics84 - Sunday, May 26, 2013 - link

    So.... PS4 and Xbox One are both releasing with 8 core AMD cpus, "Never" is looking pretty damn close on that horizon, eh bud? Intel fanboys like you are about to be throwing more money away to try and keep up. Be sure to put that Haswell, you'll inevitably get, under water and fill the res with your sweet, sweet fanboy tears. Reply
  • phdchristmas - Sunday, September 09, 2012 - link

    Extreme editions are priced that high because they are the bleeding edge that pave the way for the next generation of chips. Funding for continued research on producing a high production chip of its kind. Reply
  • rarson - Tuesday, September 18, 2012 - link

    No, they're priced that high because demand for them is low. Supply and demand. Reply
  • CeriseCogburn - Friday, October 12, 2012 - link

    Wrong again rarson. Demand for any item can be very high, and drive the lacking supply price higher and higher.
    In this case they are priced high because there is sufficient demand to sustain that top tier price. If the demand was low, the price would drop, YOU AMD FANBOY brainfarter.

    ( do you feel better your self installed idiot version of economics "proved" to you internally that demand for the big top Intel chip is very low ?)
    LOL - so sad.....the emotions of a fanboy farting out uncontrollably, econ dumb oh one we'll call it.

    I'm beginning to understand why you amd freaks have twisted penny pinching frustrated price obsessions, you haven't a clue about the very basics, but your mind is very willing to arrogantly and in error, attempt to "correct others" with amd fanboyism as the leading call in the emotionally fulfilling statements you offer.

    It is like a crazy girl having her period and blurting out her out of control emotions. LOL
    No wonder I told giraradou or whatever miss sensitive's name is to take the midol.
    Reply
  • rarson - Tuesday, September 18, 2012 - link

    Regardless of what prices are now, they'd be even better with better competition from AMD. It's called "economics." Reply
  • CeriseCogburn - Friday, October 12, 2012 - link

    In this case it's called " amd fanboy fantasy " Reply
  • rocketbuddha - Tuesday, August 28, 2012 - link

    Anand was that a typo or really AMD is going to use TSMC 28nm to manufacture Steamroller based APUs? Reply
  • Paedric - Tuesday, August 28, 2012 - link

    They're supposed to switch from GF to TSMC sometime soon.
    I guess that's the when, if it hasn't happened already.
    Reply
  • Anand Lal Shimpi - Tuesday, August 28, 2012 - link

    Er that's my mistake, GF 28nm is correct. Fixed :)

    Take care,
    Anand
    Reply
