Cache Improvements

The shared L1 instruction cache grew in size with Steamroller, although AMD isn’t telling us by how much. Bulldozer featured a 2-way 64KB L1 instruction cache, with each “core” using one of the ways. This approach gave Bulldozer less cache per core than previous designs, so the increase here makes a lot of sense. AMD claims the larger L1 can reduce i-cache misses by up to 30%. There’s no word on any possible impact to L1 d-cache sizes.

Although AMD doesn’t like to call it a cache, Steamroller now features a decoded micro-op queue. As x86 instructions are decoded into micro-ops, the address and decoded op are both stored in this queue. Should a fetch come in for an address that appears in the queue, Steamroller’s front end will power down the decode hardware and simply service the fetch request out of the micro-op queue. This is similar in nature to Sandy Bridge’s decoded uop cache, although it is likely smaller. AMD wasn’t willing to disclose how many micro-ops could fit in the queue, other than to say that it’s big enough to get a decent hit rate.
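
As a rough illustration of the idea, the Python sketch below models a decoded micro-op queue as a small address-indexed table. The capacity, the first-in-first-out replacement, and the names MicroOpQueue, fake_decode and run_loop are all assumptions made for illustration; AMD has not disclosed the real size or replacement policy.

```python
from collections import OrderedDict


class MicroOpQueue:
    """Toy model of a decoded micro-op queue keyed by fetch address.

    Capacity and FIFO replacement are assumptions for illustration;
    the real size and policy are undisclosed.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> decoded micro-ops
        self.hits = 0     # fetches served without the decoders
        self.decodes = 0  # fetches that had to wake the decoders

    def fetch(self, address, decode):
        # Hit: serve the stored micro-ops; the decoders stay idle.
        if address in self.entries:
            self.hits += 1
            return self.entries[address]
        # Miss: run the decoder and remember its output.
        self.decodes += 1
        uops = decode(address)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[address] = uops
        return uops


def fake_decode(address):
    # Hypothetical stand-in for x86 decode: returns a placeholder micro-op.
    return [f"uop@{address:#x}"]


def run_loop(queue, loop_addresses, iterations):
    # Replay the same loop body and report (hits, decodes).
    for _ in range(iterations):
        for address in loop_addresses:
            queue.fetch(address, fake_decode)
    return queue.hits, queue.decodes
```

In this model, a four-instruction loop run through an eight-entry queue only needs the decoders on its first pass. A loop larger than the queue never hits under simple FIFO replacement, which is one reason the queue has to be big enough for the loops that matter.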
 
The L1 to L2 interface has also been improved. Some queues have grown and the logic has been improved.
 
 
Finally on the caching front, Steamroller introduces a dynamically resizable L2 cache. Based on workload and hit rate in the cache, a Steamroller module can choose to resize its L2 cache (powering down the unused slices) in increments of one quarter. AMD believes this is a huge power win for mobile client applications such as video decode (not so much for servers), where the CPU only has to wake up for short periods of time to run minor tasks that don’t have large L2 footprints. The L2 cache accounts for a large chunk of AMD’s core leakage, so shutting half or more of it down can definitely help with battery life. The resized cache is no faster (same access latency); it just consumes less power.
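
To make the power argument concrete, here is a minimal sketch of a resizing decision. It assumes a 2MB L2, a policy that simply keeps enough quarters powered to hold the working set, and leakage proportional to the powered fraction. None of these details come from AMD, which only said the choice is based on workload and hit rate; the function names are hypothetical.

```python
import math

L2_SIZE_KB = 2048  # assumed L2 size per module, for illustration only


def quarters_to_keep(footprint_kb, l2_size_kb=L2_SIZE_KB):
    """Return how many L2 quarters (1 to 4) stay powered.

    Hypothetical policy: keep the fewest quarters that hold the footprint.
    """
    quarter_kb = l2_size_kb / 4
    needed = math.ceil(footprint_kb / quarter_kb)
    return min(4, max(1, needed))


def relative_leakage(quarters):
    """L2 leakage relative to a fully powered cache (assumed linear)."""
    return quarters / 4
```

Under these assumptions, a task with a 300KB footprint would run with a single quarter powered, leaving L2 leakage at a quarter of its full value, while anything approaching the full cache size keeps all four quarters on.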
 
Steamroller brings no significant reduction in L2/L3 cache latencies. According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, but fixing it isn’t a top priority. Given that most consumers (read: notebooks) will only see L3-less processors (e.g. Llano, Trinity), and many server workloads are less sensitive to latency, AMD’s stance makes sense.
 

Looking Forward: High Density Libraries

 
This one falls into the reasons-we-bought-ATI column: future AMD CPU architectures will employ higher levels of design automation and new high density cell libraries, both heavily influenced by AMD’s GPU group. Automated place and route is already commonplace in AMD CPU designs, but AMD is going even further with this approach.
 
The methodology comes from AMD’s work in designing graphics cores, and we’ve already seen some of it used in AMD’s ‘cat cores (e.g. Bobcat). As an example, AMD demonstrated a 30% reduction in area and power consumption when these new automated procedures with high density libraries were applied to a 32nm Bulldozer FPU.

The power savings come from not having to route clocks and signals as far, while the area savings are a result of the computer-automated transistor placement/routing and higher density gate/logic libraries.
 
The tradeoff is peak frequency. These heavily automated designs won’t be able to clock as high as the older hand drawn designs. AMD believes the sacrifice is worth it, however, because in power constrained environments (e.g. a notebook) you won’t hit max frequency regardless, and you’ll instead see a 15 to 30% energy reduction per operation. AMD equates this with the power savings you’d get from a full process node improvement.
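
As a back-of-the-envelope check on what that range means for battery life, the short Python sketch below converts an energy-per-operation reduction into the extra work a fixed energy budget can cover. It assumes energy per operation is the only thing that changes; the function name is hypothetical.

```python
def extra_work_at_fixed_energy(energy_reduction):
    """Fractional gain in operations completed from the same energy budget.

    If each operation costs (1 - energy_reduction) of its former energy,
    the same budget covers 1 / (1 - energy_reduction) times the work.
    """
    if not 0 <= energy_reduction < 1:
        raise ValueError("energy_reduction must be in [0, 1)")
    return 1 / (1 - energy_reduction) - 1
```

By this arithmetic, a 15% reduction in energy per operation buys roughly 18% more work from the same battery, and a 30% reduction buys roughly 43% more.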
 
We won’t see these new libraries and automated designs in Steamroller, but rather its successor in 2014: Excavator.
 

Final Words

 
Steamroller seems like a good evolutionary improvement to AMD’s Bulldozer and Piledriver architectures. While Piledriver focused more on improving power efficiency, Steamroller should make a bigger impact on performance.
 
The architecture is still slated to debut in 2013 on GlobalFoundries' 28nm bulk process. The improvements look good on paper, but the real question remains whether or not Steamroller will be enough to go up against Haswell.

126 Comments


  • just4U - Thursday, August 30, 2012 - link

    Overall.. the FX4100 is a better chip in a multi purpose computer. (in my opinion) Normally you will get a better Motherboard (more feature rich) then what you'd get from paying for the Intel boards as well.. so it's all a factor. As to these $469 dream machine.. well hell..

    A standard gaming rig that I'd be comfortable building (without an OS and at cost) will run .. $574.00
    - FX 4100
    - 8 G PC 1280
    - Gigabyte 970A -D3
    - Radeon 7770 1G
    - 1 TB Western Digitial Blue
    - LG DVDRW
    - Antec One Casing
    - Corsair Builder 500W

    Now going intel you could get an I3 and a H61 at a similiar price.. or you could go for a comparable MB to the Amd one for aprox $10 more over this system.

    To get it down into the $400 range I'd have to hmmm.. No video look for a $30 PSU and a $25 Case (save $50..) that would get it around the $450 range.. Drop down to 4G of ram would bring it into the 430.. (get where im going with this?)
  • just4U - Thursday, August 30, 2012 - link

    That was thru price match here in Canada btw.. I know you can still get some things cheaper down south via combo's and such.. but putting together a half ways competent gaming machine with new computer parts.. in the $450 range.. Good luck with that. Unless you got some killer sale on it's not going to happen.. and forget about the rebates as we all know how those work out 75% of the time. At the end of the day you'd seriously have to compromize with your hardware selection just to get it into such a budget.
  • Spunjji - Thursday, August 30, 2012 - link

    It's cool dude, the people here who aren't mentally defective already know that what you're saying is broadly accurate, pissing matches over AMD/Intel aside. There's not much sense demonstrating it to the rest... :/
  • Spunjji - Thursday, August 30, 2012 - link

    You have a funny definition of "STOMP".
  • Galidou - Sunday, September 02, 2012 - link

    3 fps more in one benchmark out of 8 and the other 7 are equal, quite a nice stomping of a core i3 vs a fx 4100. Oh and the core i3 can't be overclocked... what an amazing stomping.
  • CeriseCogburn - Friday, October 12, 2012 - link

    It can be overclocked, and dumb dumb, the op liar said $20 more for your amd loser.

    So in this case, the amd fanboys blow their freaking brains out through their backsides again, they actually LOSE, and pay $20 more.

    Thanks, I'll keep that in mind when you idiots all collude in the GPU reviews, and pull the EXACT OPPOSITE in clan idiot mode and fail to notice how stupid you all are even after it is explicitly pointed out, and "coming to grips" "with reality" and admitting you supported the big fat lie, of course will never occur.

    That's the fruitcake liar amd fan. Of course anyone else who takes exception to it, they are in the wrong...

    The amd fanboy mind is a terribly wasted thing, throw it out.
  • Spunjji - Thursday, August 30, 2012 - link

    You couldn't really be a more transparent shill. Nobody mentioned "non competitive consumer screwing" here, yet you post an essay countering said imaginary comments backed up by some hand-waving and supposition which is disproven by easily-obtained facts. You started a whole argument though, so gz on that.
  • nicamarvin - Thursday, August 30, 2012 - link

    15% IPC improvement right out of the box? keep dreaming, Ivy max performance boost is 5% on "some" benches and 1 to 3% on most benches and seeing how it cant OC as much as SandyB I say AMD will catch up with Intel sooner than most of you thought

    keeping in mind that AMD plans to encrease IPC by 15% on each of their updated Modules, PileDriver is already doing just that(15% IPC performance boost clock per clock against BD) and that Piledriver module was lacking the L3 cache the BD Module had and still was pulling the 15% performance boost
  • seapeople - Friday, August 31, 2012 - link

    Ivy Bridge also came out with higher clocks for the same price, so add in the 1-5% IPC advantage and you get close to the 10-15% advantage mentioned.

    Note that he didn't say IPC.
  • nicamarvin - Friday, August 31, 2012 - link

    SB can Oc much higher than Ivy, so thats a moot point, whats a 1-5% IPC gain when SB can OC 10% higher than Ivy? I suspect Haswell will not OC as high as the best SB could
