Microprocessor architectures these days are largely limited, and thus defined, by power consumption. When it comes to designing an architecture around a power envelope, the rule of thumb is that any given microprocessor architecture can scale to target roughly an order of magnitude of TDPs. For example, Intel's Core architectures (Sandy/Ivy Bridge) effectively target the 13W - 130W range. They can certainly be used in parts that consume less or more power, but at those extremes it's more efficient to build another microarchitecture to target those TDPs instead.

Both AMD and Intel feel similarly about this order of magnitude rule, and thus both have two independent microprocessor architectures that they leverage to build chips for the computing continuum. From Intel we have Atom for low power, and Core for high performance. In 2010 AMD gave us Bobcat for its low power roadmap, and Bulldozer for high performance.

Both the Bobcat and Bulldozer lines would see annual updates. In 2011 we saw Bobcat used in Ontario and Zacate SoCs, as a part of the Brazos platform. Last year AMD announced Brazos 2.0, using slightly updated versions of those very same Bobcat based SoCs. Today AMD officially launches Kabini and Temash, APUs based on the first major architectural update to Bobcat: the Jaguar core.


Jaguar: Improved 2-wide Out-of-Order
 

At the core level, Jaguar still looks a lot like Bobcat. The same dual-issue, out-of-order architecture that AMD introduced in 2010 remains intact in Jaguar. The same L1 cache, front end and execution blocks are all still here. Given ARM's transition from a dual-issue, out-of-order core with the Cortex A9 to a three-issue, out-of-order design with the Cortex A15, I expected something similar from AMD. Despite moving to a smaller manufacturing process (28nm), AMD was very focused on increasing performance within the same TDP or lower with Jaguar. The driving motivator? While Bobcat ended up in netbooks, nettops and other low cost but thick machines, Jaguar needed to go into even thinner form factors: tablets. AMD still has no intention of getting into the smartphone SoC space, but the Windows 8 (and Android?) tablet market is fair game. Cellular connectivity isn't a requirement there, particularly at the lower price points, and AMD can easily be a second-source alternative to Intel Atom based designs.

The average number of instructions executed per clock (IPC) is still below 1 for most client workloads. There’s a certain amount of burst traffic to be expected but given the types of dependencies you see in most use cases, AMD felt the gain from making the machine wider wasn’t worth the power tradeoff. There’s also the danger of making the cat-cores too powerful. While just making them 3-issue to begin with wouldn’t dramatically close the gap between the cat-cores and the Bulldozer family, there’s still a desire for there to be clear separation between the two microarchitectures.
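
As a rough illustration of why issue width matters less than it might seem (this sketch is purely mine, not AMD's data or code): in the first loop below every add depends on the previous sum, so the core averages at most about one of these adds per cycle no matter how many issue slots it has. Only when the work is restructured into independent chains, as in the second loop, can extra width actually be put to use, and typical client code looks a lot more like the first version.

    #include <stddef.h>

    /* Serially dependent: every add needs the previous result, so issue
     * width beyond ~1 buys nothing here (assuming the compiler doesn't
     * reassociate or vectorize the loop). */
    long dependent_sum(const long *data, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }

    /* Two independent chains: a wider machine can overlap the two adds
     * in each iteration. */
    long split_sum(const long *data, size_t n)
    {
        long s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += data[i];
            s1 += data[i + 1];
        }
        if (i < n)
            s0 += data[i];
        return s0 + s1;
    }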

The move to a three-issue design would certainly increase performance, but AMD’s tablet ambitions and power sensitivity meant it would save that transition for another day. I should point out that ARM is increasingly looking like the odd-man-out here, with both Jaguar and Intel’s Silvermont retaining the dual-issue design of their predecessors. Part of this has to do with the fact that while AMD and Intel are very focused on driving power down, ARM has aspirations of moving up in the performance/power chain.

The width of the front end is only one lever AMD could have used to increase performance. While it was a pretty big lever that AMD chose not to pull, there are other smaller levers that were exercised in Jaguar.

There's now a 4 x 32-byte loop buffer for the instruction cache. Whenever a loop is detected, the instructions executed in the loop are serviced from this small buffer instead of being fetched from the L1 I-cache over and over again. If this sounds like a trace cache or a decoded micro-op cache, don't get too excited: Jaguar's loop buffer is neither of those things. There are no pipeline savings or powered-down fetch/decode units. The only benefit of the new loop buffer is that the instruction cache doesn't have to be fired up during every iteration of a buffered loop. In other words, this is a very specific play to reduce power consumption, not to improve performance.
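
To make the behaviour concrete, here is a toy software model of the general loop-buffer idea. It is purely illustrative: the capture/replay policy, struct layout and function names are assumptions made for the sketch, not a description of Jaguar's actual hardware. The point it shows is simply that once a small loop body has been captured, fetches that land inside it are served from the tiny buffer and the larger, more power-hungry I-cache array stays idle.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LB_ENTRIES    4                      /* 4 x 32-byte entries, per the text above */
    #define LB_ENTRY_SIZE 32
    #define LB_CAPACITY   (LB_ENTRIES * LB_ENTRY_SIZE)

    struct loop_buffer {
        uint64_t base;                           /* start address of the captured loop body */
        uint32_t length;                         /* bytes captured (0 = nothing captured)   */
        bool     replaying;                      /* true while the detected loop is running */
        uint8_t  bytes[LB_CAPACITY];             /* raw instruction bytes of the loop body  */
    };

    /* Stand-in for the real L1 I-cache read path. */
    static void icache_read(uint64_t addr, uint8_t *out, uint32_t len)
    {
        (void)addr;
        memset(out, 0x90, len);                  /* pretend every byte is a NOP */
    }

    /* Fetch 'len' instruction bytes starting at 'addr'. While a loop is
     * replaying, requests that fall entirely inside the captured body are
     * served from the loop buffer, so the I-cache is not accessed at all
     * for that iteration. */
    void fetch_bytes(struct loop_buffer *lb, uint64_t addr, uint8_t *out, uint32_t len)
    {
        if (lb->replaying &&
            addr >= lb->base &&
            addr + len <= lb->base + lb->length) {
            memcpy(out, &lb->bytes[addr - lb->base], len);   /* hit: I-cache stays idle */
            return;
        }
        icache_read(addr, out, len);             /* miss: normal I-cache path */
    }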

All microprocessors see tons of simulation work before they’re ever brought to market. Even once a design is done, additional profiling is used to identify bottlenecks, which are then prioritized for addressing in future designs. All bottleneck removal has to be vetted against power, cost and schedule constraints. Given an infinite budget across all vectors you could eliminate all bottlenecks, but you’d likely take an infinite amount of time to complete the design. Taking all of those realities into account usually means making tradeoffs, even when improving a design.

We saw the first example of a clear tradeoff when AMD stuck with a 2-issue front end for Jaguar. Skipping a decoded micro-op cache and opting for a simpler loop buffer instead is another. AMD likely noticed a lot of power being wasted during loops, and the addition of a loop buffer was probably the best balance of complexity, power savings and cost.

AMD also improved the instruction cache prefetcher, not because of any overabundance of bandwidth but simply by revisiting the Bobcat design and spending more time on the implementation in Jaguar. The IC prefetcher improvements are a case of AMD doing things better in Jaguar, freed from the pressure to introduce a brand new architecture that it faced with Bobcat.

The instruction buffer between the instruction cache and decoders grew in size with Jaguar, a sort of half step towards the more heavily decoupled fetch/decode stages in Bulldozer.

Jaguar adds support for new instructions (SSE4.1/4.2, AES, CLMUL, MOVBE, AVX, F16C, BMI1) as well as 40-bit physical addressing.
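
From a software point of view, the practical question is simply whether those extensions are present at run time. Below is a minimal detection sketch using the x86 CPUID instruction; the bit positions are the architectural ones, and I'm assuming the GCC/Clang <cpuid.h> helpers (__get_cpuid, plus __get_cpuid_count in recent compilers) are available.

    #include <stdio.h>
    #include <cpuid.h>               /* GCC/Clang wrapper around the CPUID instruction */

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* Leaf 1, ECX: most of the extensions Jaguar adds are reported here. */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("PCLMULQDQ: %u\n", (ecx >> 1)  & 1);
            printf("SSE4.1   : %u\n", (ecx >> 19) & 1);
            printf("SSE4.2   : %u\n", (ecx >> 20) & 1);
            printf("MOVBE    : %u\n", (ecx >> 22) & 1);
            printf("AES-NI   : %u\n", (ecx >> 25) & 1);
            printf("AVX      : %u\n", (ecx >> 28) & 1);
            printf("F16C     : %u\n", (ecx >> 29) & 1);
        }

        /* Leaf 7 (sub-leaf 0), EBX bit 3: BMI1. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            printf("BMI1     : %u\n", (ebx >> 3) & 1);

        /* Leaf 0x80000008, EAX[7:0]: supported physical address width in bits
         * (40 on Jaguar, per the text above). */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
            printf("Physical address bits: %u\n", eax & 0xff);

        return 0;
    }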

The final change to Jaguar's front end was the addition of another decode stage, purely for frequency gains. It turns out that in Bobcat the decoder was one of the critical paths limiting maximum frequency. Adding another decode stage gave AMD enough wiggle room to hit its frequency targets for Jaguar at 28nm.

Comments

  • blacks329 - Thursday, May 23, 2013 - link

    I know it's definitely not that high for any individual platform, but I do remember a lot of major publishers (Ubi, EA and a bunch of other smaller studios) saying early to mid gen that because porting to PS3 was such a nightmare and so resource intensive, it was more efficient to spend extra resources initially and use the PS3 as the lead, then have it ported over to 360, which was significantly easier.

    While I'm sure quite a large chunk still use the 360 as their lead platform, I would say that 90% figure was probably from very early in this gen and has since dropped so it's much closer between 360 and PS3.

    Although at this point both architectures are well enough understood and accounted for that most engines should make it easy to develop for both, regardless of which platform you start with.
  • mr_tawan - Sunday, May 26, 2013 - link

    I don't think using x86 will benefit devs as much as many expect. Sure, sharing the same hardware-level architecture may simplify low-level code like asm, but seriously, I don't think many devs use asm intensively anymore. (I worked on current-gen console titles for a little while and never wrote a single line of asm.) Current-gen games are complex and need the best software architecture, otherwise you end up with a delayed-to-death shipping schedule. Using asm would lead to premature optimisation that gains little to nothing.

    What would really affect devs heavily is the SDK. XB1 uses a custom OS, but the SDK should be close to Windows' DirectX (just like the XB360). PS4, if it's done in the same fashion as the PS3, would use a custom-made SDK with an OpenGL/OpenGL ES API (the PS3 uses OpenGL ES, if I'm not mistaken). It would need another layer of abstraction to make it fully cross-platform, just like the current generation.

    The one thing that might be shared across the two platforms is the shader code, if AMD can convince both MS and Sony to use the same language.

    These are only guesses, I might be wrong.
  • mganai - Thursday, May 23, 2013 - link

    That, and Intel's been making a bigger push for the smartphone market; it even says so in the article!

    Silvermont should change things up quite favorably.
  • mschira - Thursday, May 23, 2013 - link

    Well, all this is pointless if nobody makes good hardware using it.
    It's the old story. The last generation Trinity would have allowed very decent mid-range notebooks with very long battery run time and more than sufficient power at reasonably low cost.

    Have we seen anything?
    Nope.

    So where is a nice 11" Trinity Laptop?
    Or a 10" Brazos?
    Everything is either a horrible cheap Atom or an expensive ULV Core part.

    Are the hardware makers afraid that AMD can't deliver enough chips?
    Are they worried about stepping on Intel's toes?
    Are they simply uncreative, all running in the same direction some stupid mainstream guide tells them to?

    I suspect it is largely the latter - and most current notebooks are simply uncreative. The loss of sales comes as no surprise, I think. And it's not all M$'s fault.
    M.
  • Mathos - Thursday, May 23, 2013 - link

    It could be another instance of Intel paying OEMs not to use certain AMD parts. They've done it before; I wouldn't be surprised if it happens again in areas where AMD might have a better component.

    But it's also not totally true. Having worked at Wal-Mart and other big chain stores, I can tell you that many do carry laptops and ultrathins that use Trinity A-series chips and Brazos E-series chips. But right now everyone still wants that iPad or Galaxy Tab, and in general the only people I saw buying laptops and ultrathins were the back-to-school and back-to-college crowds. And of course the Black Friday hordes.

    And with AMD having both next-gen consoles under its belt, it and many OEMs may be able to leverage that to drive sales of Jaguar-based systems.
  • Gest - Saturday, May 25, 2013 - link

    So does Jaguar have any new hardware instructions that Intel processors don't? (Will Intel add them in Haswell?) I think game makers will use them actively during the consoles' lifetime.
  • scaramoosh - Monday, May 27, 2013 - link

    Doesn't this just mean the console CPU power is lacking compared to what PCs currently have?
  • Silma - Wednesday, May 29, 2013 - link

    Absolutely.
