Introduction and Piledriver Overview

Brazos and Llano were both immensely successful parts for AMD. The company sold tons despite not delivering leading x86 performance. The success of these two APUs gave AMD a lot of internal confidence that it was possible to build something that didn't prioritize x86 performance but rather delivered a good balance of CPU and GPU performance.

AMD's commitment to the world was that we'd see annual updates to all of its product lines. Llano debuted last June, and today AMD gives us its successor: Trinity.

At a high level, Trinity combines 2-4 Piledriver x86 cores (1-2 Piledriver modules) with up to 384 VLIW4 Northern Islands generation Radeon cores on a single 32nm SOI die. The result is a 1.303B-transistor chip (up from 1.178B in Llano) that measures 246mm² (compared to 228mm² in Llano).

Trinity Physical Comparison
                          Manufacturing Process   Die Size   Transistor Count
AMD Llano                 32nm                    228mm²     1.178B
AMD Trinity               32nm                    246mm²     1.303B
Intel Sandy Bridge (4C)   32nm                    216mm²     1.16B
Intel Ivy Bridge (4C)     22nm                    160mm²     1.4B

Without a change in manufacturing process, AMD faces the tough job of increasing performance without ballooning die size. Die size has gone up by only around 8%, yet both CPU and GPU performance see double-digit increases over Llano. Power consumption also improves over Llano, making Trinity a win across the board for AMD compared to its predecessor. If you liked Llano, you'll love Trinity.
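As a quick sanity check on those figures, the Python sketch below recomputes Trinity's die-area and transistor growth over Llano, plus the transistor density of each chip, straight from the physical comparison table. It's simple arithmetic on the table's numbers; the helper names are our own and not anything published by AMD or Intel.

```python
# Figures copied from the physical comparison table
# (die area in mm², transistor count in billions).
chips = {
    "AMD Llano": (228, 1.178),
    "AMD Trinity": (246, 1.303),
    "Intel Sandy Bridge (4C)": (216, 1.16),
    "Intel Ivy Bridge (4C)": (160, 1.4),
}


def pct_change(old: float, new: float) -> float:
    """Percentage change going from old to new."""
    return (new - old) / old * 100


def density(area_mm2: float, transistors_b: float) -> float:
    """Transistor density in millions of transistors per mm²."""
    return transistors_b * 1000 / area_mm2


llano_area, llano_xtors = chips["AMD Llano"]
trinity_area, trinity_xtors = chips["AMD Trinity"]
area_growth = pct_change(llano_area, trinity_area)
xtor_growth = pct_change(llano_xtors, trinity_xtors)

print(f"Die area growth, Llano -> Trinity: {area_growth:.1f}%")
print(f"Transistor growth, Llano -> Trinity: {xtor_growth:.1f}%")
for name, (area, xtors) in chips.items():
    print(f"{name}: {density(area, xtors):.2f}M transistors/mm²")
```

The density column makes the process gap visible: the three 32nm chips all land a little above 5M transistors/mm², while 22nm Ivy Bridge is well beyond that.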

The problem is what happens when you step outside of AMD's world. Llano had a difficult time competing with Sandy Bridge outside of GPU workloads. AMD's hope with Trinity is that its hardware improvements combined with more available OpenCL accelerated software will improve its standing vs. Ivy Bridge.

Piledriver: Bulldozer Tuned

While Llano featured as many as four 32nm x86 Stars cores, Trinity features up to two Piledriver modules. Given the not-so-great reception of Bulldozer late last year, we were worried about how a Bulldozer derivative would stack up in Trinity. I'm happy to say that Piledriver is a step forward from the CPU cores used in Llano, largely thanks to a lot of cleanup work on the Bulldozer foundation.

Piledriver picks up where Bulldozer left off. Its fundamental architecture is unchanged; instead, it has been refined in every area. Piledriver is very much a second pass on the Bulldozer architecture, tidying everything up, capitalizing on low-hanging fruit and significantly improving power efficiency. If you were hoping for an architectural reset with Piledriver, you will be disappointed. AMD is committed to Bulldozer, and that's quite obvious if you look at Piledriver's high-level block diagram:

Each Piledriver module is the same 2+1 INT/FP combination that we saw in Bulldozer. You get two integer cores, each with its own scheduler, L1 data cache, and execution units. Between the two is a shared floating point core that can handle instructions from one of two threads at a time. The single FP core shares the data caches of the dual integer cores.

Each module appears to the OS as two cores, but you don't get as many resources as you would from two traditional AMD cores. This table from our Bulldozer review highlights part of the problem when looking at the front end:

Front End Comparison
                                  AMD Phenom II         AMD FX                Intel Core i7
Instruction Decode Width          3-wide                4-wide                4-wide
Single Core Peak Decode Rate      3 instructions        4 instructions        4 instructions
Dual Core Peak Decode Rate        6 instructions        4 instructions        8 instructions
Quad Core Peak Decode Rate        12 instructions       8 instructions        16 instructions
Six/Eight Core Peak Decode Rate   18 instructions (6C)  16 instructions (8C)  24 instructions (6C)
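The peak decode rates in this table follow from a simple rule: a conventional core has its own decoder, while the two integer cores in a Bulldozer module share a single 4-wide front end. The sketch below assumes exactly that sharing (and ignores how the shared decoder alternates between threads in practice) and reproduces every row of the table. The function and dictionary names are hypothetical.

```python
import math

# (decode width, cores sharing one decoder): a simplified model in which
# a Bulldozer module's two integer cores share a single decoder.
CPUS = {
    "AMD Phenom II": (3, 1),
    "AMD FX": (4, 2),
    "Intel Core i7": (4, 1),
}


def peak_decode(cores: int, width: int, cores_per_decoder: int) -> int:
    """Peak instructions decoded per clock across `cores` cores."""
    decoders = math.ceil(cores / cores_per_decoder)
    return decoders * width


for name, (width, shared) in CPUS.items():
    rates = {n: peak_decode(n, width, shared) for n in (1, 2, 4, 6, 8)}
    print(name, rates)
```

The FX column is the only one where doubling the core count within a module doesn't double the peak, which is exactly the tradeoff discussed next.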

It's rare that you get anywhere near peak hardware utilization, so don't be too put off by these deltas, but it is a tradeoff that AMD made throughout Bulldozer. In general, AMD opted for better utilization of fewer resources (partially through enlarging some data structures and other elements that feed the execution units) vs. simply throwing more transistors at the problem. AMD also opted to reduce the ratio of FP to integer resources within the x86 portion of its architecture, clearly to support a move to the APU world where the GPU can provide a significant amount of FP throughput. Piledriver doesn't fundamentally change any of these balances. The pipeline depth remains unchanged, as does the focus on pursuing higher frequencies.

Fundamental to Piledriver is a significant switch in the type of flip-flops used throughout the design. Flip-flops, or flops as they are commonly called, are simple pieces of logic that store some form of data or state. In a microprocessor they can be found in many places, including the start and end of a pipeline stage. Work is done prior to a flop and committed at the flop or array of flops. The output of these flops becomes the input to the next array of logic. Normally flops are hard-edge elements: data is latched at the rising edge of the clock.
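To make the hard-edge behavior concrete, here is a minimal Python model of an idealized rising-edge D flip-flop. It's a behavioral sketch only (no setup or hold times, no propagation delay), and the class name is our own: the output changes only when the clock goes from 0 to 1, so any change on the data input between edges is ignored.

```python
class DFlipFlop:
    """Idealized hard-edge D flip-flop: Q only updates on a 0 -> 1 clock edge."""

    def __init__(self) -> None:
        self.q = 0
        self._prev_clk = 0

    def tick(self, clk: int, d: int) -> int:
        """Apply one sample of clock and data; return the current output Q."""
        if self._prev_clk == 0 and clk == 1:  # rising edge: latch D
            self.q = d
        self._prev_clk = clk
        return self.q


ff = DFlipFlop()
clock = [0, 1, 1, 0, 0, 1, 0, 1]
data = [1, 1, 0, 0, 1, 0, 1, 1]
outputs = [ff.tick(c, d) for c, d in zip(clock, data)]
print(outputs)  # [0, 1, 1, 1, 1, 0, 0, 1]
```

Note how the data input drops to 0 on the third sample and rises to 1 on the fifth, yet the output only follows it at the next rising edge.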

In very high-frequency designs, however, there can be a considerable amount of variability, or jitter, in the clock. You either have to spend a lot of time ensuring that your design can account for this jitter, or you can incorporate logic that's more tolerant of it. The former requires more engineering effort, while the latter burns more power. Bulldozer opted for the latter.

In order to get Bulldozer to market as quickly as possible after far too many delays, AMD used soft-edge flops quite often in the design. Soft-edge flops are the opposite of their harder counterparts: they have a brief transparency window around the clock edge, so they still function correctly when timing spills over the edge. Piledriver, on the other hand, was the result of a systematic effort to swap in smaller, hard-edge flops wherever there was timing margin in the design. The result is a tangible reduction in power consumption. Across the board there's a 10% reduction in dynamic power consumption compared to Bulldozer, and some workloads apparently see as much as a 20% reduction in active power. Given Trinity's role as a mostly mobile-focused product, this power reduction was well worth the effort.
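The jitter tradeoff can be illustrated with a toy timing model. In the sketch below, times are in picoseconds relative to the nominal clock edge and all numbers are made up for illustration: a hard-edge flop captures data only if it arrives by the actual (jittered) clock edge, while a soft-edge flop also accepts data that arrives within a small window after the edge. The model deliberately ignores setup and hold times, and it doesn't capture the extra power soft-edge flops cost, which is what AMD clawed back in Piledriver.

```python
def captures(arrival_ps: int, edge_ps: int, window_ps: int = 0) -> bool:
    """Return True if data arriving at arrival_ps is latched correctly.

    window_ps == 0 models a hard-edge flop (data must be there by the edge);
    window_ps > 0 models a soft-edge flop's transparency window.
    """
    return arrival_ps <= edge_ps + window_ps


# Hypothetical numbers: data settles 20 ps before the nominal edge (0 ps),
# while the actual clock edge jitters around that nominal position.
arrival = -20
jittered_edges = [-50, -30, 0, 20, -10]

hard_ok = sum(captures(arrival, e) for e in jittered_edges)
soft_ok = sum(captures(arrival, e, window_ps=50) for e in jittered_edges)
print(f"hard-edge: {hard_ok}/{len(jittered_edges)} cycles captured")
print(f"soft-edge: {soft_ok}/{len(jittered_edges)} cycles captured")
```

In this toy case the hard-edge flop misses the two cycles where the clock arrives early, which is the margin a designer must otherwise guarantee through careful timing work.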

At the front end, AMD put in additional work to improve IPC. The schedulers are now more aggressive about freeing up tokens. Similar to the soft vs. hard flip-flop debate, it's always easier to be conservative when you retire an instruction from a queue. It eases verification, as you don't have to be as concerned about conditions where you might accidentally overwrite an instruction too early. With the major effort of getting a brand new architecture off the ground behind them, Piledriver's engineers could focus on greater refinement in the schedulers. The structures didn't get any bigger; AMD just makes better use of them now.
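Why does freeing queue entries sooner help when the structures stay the same size? Little's law offers a rough intuition: the sustainable throughput of a queue is its number of entries divided by how long each instruction holds an entry. The entry count and cycle counts below are purely hypothetical (AMD hasn't published such figures for Piledriver), but they show how trimming the release delay raises throughput without adding a single entry.

```python
def sustainable_rate(entries: int, cycles_held: float) -> float:
    """Little's law: throughput = occupancy / residence time."""
    return entries / cycles_held


# Hypothetical scheduler: 24 entries, each instruction waits 8 cycles
# before it can issue.
ENTRIES = 24
WAIT_CYCLES = 8

# Conservative: the entry is freed 4 cycles after it could have been.
conservative = sustainable_rate(ENTRIES, WAIT_CYCLES + 4)
# Aggressive: the entry is freed as soon as it is safe to do so.
aggressive = sustainable_rate(ENTRIES, WAIT_CYCLES)

print(f"conservative release: {conservative:.1f} instructions/cycle")
print(f"aggressive release:   {aggressive:.1f} instructions/cycle")
```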

The execution units are also a bit beefier in Piledriver, but not by much. AMD claims significant improvements in floating point and integer divides, as well as calls and returns. For client workloads, however, these changes yield minimal (sub-1%) gains.
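The small client gains are what Amdahl's law predicts: speeding up one class of instructions only helps in proportion to how much of the total runtime those instructions account for. The fraction and local speedup in this sketch are hypothetical, not AMD's numbers.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical: divides, calls and returns take 1% of a client workload's
# runtime, and the new execution units make them twice as fast.
overall = amdahl_speedup(0.01, 2.0)
print(f"overall gain: {(overall - 1) * 100:.2f}%")

# Upper bound: even if those operations became free, the gain stays near 1%.
ceiling = 1.0 / (1.0 - 0.01)
print(f"best case:    {(ceiling - 1) * 100:.2f}%")
```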

Prefetching and branch prediction are both significantly improved with Piledriver. Bulldozer did a simple sequential prefetch, while Piledriver can prefetch variable lengths of data and across page boundaries in the L1 (mainly a server workload benefit). In Bulldozer, prefetched data that wasn't used (because it was incorrectly prefetched) would clog up the cache, since it came in as the most recently accessed data. However, if prefetched data isn't used right away, it's likely it will never be used. Piledriver now immediately tags unused prefetched data as least recently used, allowing the cache controller to quickly evict it if the prefetch was incorrect.
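The prefetch change is essentially a cache insertion policy. The sketch below models a single 4-way cache set with LRU replacement and compares inserting a useless prefetch at the most-recently-used position (the Bulldozer behavior described above) with inserting it at the least-recently-used position (the Piledriver behavior). It's a simplified model under our own assumptions, not AMD's cache controller; the class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement. Order: first = LRU, last = MRU."""

    def __init__(self, ways: int, prefetch_at_lru: bool) -> None:
        self.ways = ways
        self.prefetch_at_lru = prefetch_at_lru
        self.lines = OrderedDict()  # tag -> None, ordered from LRU to MRU

    def _insert(self, tag: int, at_lru: bool) -> None:
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the current LRU line
        self.lines[tag] = None  # a new key lands at the MRU end
        if at_lru:
            self.lines.move_to_end(tag, last=False)  # demote it to LRU

    def access(self, tag: int) -> bool:
        """Demand access; returns True on a hit."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # promote to MRU
            return True
        self._insert(tag, at_lru=False)
        return False

    def prefetch(self, tag: int) -> None:
        if tag not in self.lines:
            self._insert(tag, at_lru=self.prefetch_at_lru)


def misses_after_bad_prefetch(prefetch_at_lru: bool) -> int:
    """Warm the set, issue one useless prefetch, then re-touch the working set."""
    s = LRUSet(ways=4, prefetch_at_lru=prefetch_at_lru)
    hot = [1, 2, 3, 4]
    for t in hot:
        s.access(t)
    s.prefetch(99)  # never used again
    return sum(not s.access(t) for t in hot)


print("insert prefetch at MRU:", misses_after_bad_prefetch(False), "misses")
print("insert prefetch at LRU:", misses_after_bad_prefetch(True), "misses")
```

With MRU insertion, the useless line pushes the whole working set out one line at a time (4 misses). With LRU insertion, only the line displaced by the prefetch itself misses, and the useless line is the next one evicted (1 miss).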

Another change is that Piledriver includes a perceptron branch predictor that supplements the primary branch predictor carried over from Bulldozer. The perceptron algorithm is a history-based predictor that's better suited to predicting certain branches. It works in parallel with the old predictor and simply tags the branches it is good at predicting. If the old predictor and the perceptron predictor disagree on a tagged branch, the perceptron's path is taken. Improving branch prediction accuracy is a challenge, but it's necessary in deeply pipelined designs. These sorts of secondary predictors are a must, as there's no one-size-fits-all approach to branch prediction.
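For readers unfamiliar with perceptron predictors, here is a minimal Python sketch of the classic global-history perceptron from the academic literature (Jiménez and Lin), plus the override rule described above. AMD hasn't published Piledriver's table sizes, history length or tagging logic, so every parameter and name here is an assumption; weight saturation and the primary predictor itself are left out.

```python
class PerceptronPredictor:
    """Global-history perceptron branch predictor (simplified sketch)."""

    def __init__(self, history_len: int = 8, table_size: int = 64) -> None:
        self.table_size = table_size
        # One weight vector per table entry: a bias weight plus one per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history as +1 (taken) / -1 (not taken), oldest first.
        self.history = [1] * history_len
        # Training threshold used in the perceptron predictor literature.
        self.theta = int(1.93 * history_len + 14)

    def _output(self, pc: int) -> int:
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction, or when the output isn't confident enough.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, xi in enumerate(self.history):
                w[i + 1] += t * xi
        self.history = self.history[1:] + [t]


def combined_prediction(primary: bool, perceptron: bool, tagged: bool) -> bool:
    """Hypothetical override rule: on a tagged branch, the perceptron wins."""
    if tagged and primary != perceptron:
        return perceptron
    return primary


# Usage: a branch that alternates taken/not-taken depends purely on history,
# which a perceptron learns within a few iterations.
p = PerceptronPredictor()
PC = 0x400
outcomes = [i % 2 == 0 for i in range(300)]
correct = 0
for i, taken in enumerate(outcomes):
    if i >= 200:  # score only after a warm-up period
        correct += p.predict(PC) == taken
    p.update(PC, taken)
accuracy = correct / 100
print(f"accuracy after warm-up: {accuracy:.0%}")
```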

Finally, Piledriver also adds new instructions to better align its ISA with Haswell: FMA3 and F16C.
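Python can't emit FMA3 or F16C instructions directly, but it can show the numerical behavior those instructions provide in hardware. The sketch below computes a fused multiply-add with a single rounding (using exact rational arithmetic via `fractions.Fraction`) and compares it with a separate multiply and add, then rounds a value to IEEE 754 half precision and back, the kind of float16/float32 conversion F16C accelerates. The helper names are our own, and this illustrates semantics only, not performance.

```python
import struct
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once (FMA semantics)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def to_half_and_back(x: float) -> float:
    """Round x to IEEE 754 binary16 and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


a = b = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29)

unfused = a * b + c  # a*b is rounded first, losing the 2**-60 term
fused = fused_multiply_add(a, b, c)  # keeps it: exactly 2**-60
print(unfused, fused)

print(to_half_and_back(0.1))  # 0.1 rounded to the nearest half-precision value
```

The unfused result is exactly 0.0 because the product's lowest term falls below double precision before the add; a single-rounding FMA preserves it.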

271 Comments

  • Burticus - Tuesday, May 15, 2012

    I wonder if they will release standalone mobile chips and if they are the same socket as the current Llano? Currently my laptop has an A8-3500 and I wouldn't mind upping to an A10.

    They did this in the past with the S1 socket, I wonder if it will be an option nowadays...

    For the most part I've been pretty impressed with the A8 for a $500 laptop (especially with some overclocking). Games are playable at moderate settings. Civ 5 still kicks it in the teeth though, and I see that the A10 got a 10fps jump which would be nice.
  • JarredWalton - Tuesday, May 15, 2012

    The sockets are different: FS1r2 this time. I don't know precisely what changed, but apparently it's enough that AMD isn't making them backwards compatible.
  • Fallen Kell - Tuesday, May 15, 2012

    The biggest problem with the design is that the OS doesn't know how to work with the CPU. Take the case where you have 2 of these Piledrivers, with 1 floating point intensive job and 1 non-floating point intensive job already running. The OS will place the first job on one Piledriver and the next on the other Piledriver. Then a user starts a new floating point intensive job, and the OS simply puts it on the next free core, which happens to be the one already running a floating point intensive application, and thus you just bottlenecked both of those processes. The OS doesn't know whether a process is floating-point heavy, and thus cannot properly schedule it to a core whose floating point unit is not in heavy use. That is why Bulldozer failed. It is also why my work will never purchase it, as they run floating point intensive applications.
  • Beenthere - Tuesday, May 15, 2012

    Most every reviewer has indicated that Trinity is a significant jump in performance in both CPU and GPU with extended battery performance yet some reviewers seem hard pressed to admit that for 90% of the laptop market Trinity is superior to Intel's best offerings.

    Some reviewers are trying to pretend that Intel's faster CPU performance some how is of importance to the majority of the laptop market when in fact it is not unless all you do is crunch numbers. I think Trinity sales, just like Llano and Brazos, will drive home the point of who is leading the laptop market segment with what consumers actually desire.
  • JarredWalton - Tuesday, May 15, 2012

    Beenthere, you have to be the biggest AMD fanatic I've seen around here. EVERY article where AMD comes up, you're there making things up to justify your worldview. As I indicate in the article, Trinity is 10-20% faster than Llano on CPU and 20% faster on GPU, which is a decent improvement. Unfortunately, a lot of places are quoting AMD's "up to 29% faster CPU and 56% faster GPU" and calling it a day. Those are results that just didn't show up in any testing that I conducted.

    Oh, wait, I've got one: using OpenCL in GIMP, Trinity is 72% faster than Llano! There, we now have one statistic you can point to where Trinity is better. For the 0.1% of the population that uses GIMP, and not even them really -- it's the 0.1% of people that use GIMP and will some day benefit when the next major release comes out and incorporates OpenCL. If you can't see the problem with that statement, I can't help you.

    For 90% of the market, Trinity might be enough, but to say it's "better than Intel's best" is pure fanaticism and nothing more. You are more biased than AMD's own marketing department. To pretend that moderately faster graphics with substantially less CPU performance is somehow more important than any other metric is insane. Sandy Bridge with GT 540M can be had for $600 right now, and it will beat Trinity in pretty much every single metric. Lucky for AMD, a lot of people like you will blindly purchase anything with AMD on it without regard for reality.
  • bji - Tuesday, May 15, 2012

    While I agree with your points overall, I think there is a fine detail you need to consider:

    Benchmarks are only an approximation of the performance results that would be achieved on a whole variety of processor tasks. You can rightly point out that only a small fraction of tested programs benefitted greatly from improved OpenCL performance, but you can't claim that this only benefits the 0.1% of people that use GIMP and care about OpenCL, because there may be other programs available now, or in the future, that would see similar performance increases. What your benchmarking shows is that *most* programs don't see a huge OpenCL performance benefit, but that *some* do. This is likely to lead to a more significant performance benefit than would be enjoyed by 0.1% of the users of a particular application.

    However, I think that CPU reviewers are kind of in a hard place these days, since we're arguing over how big of an overkill one given processor is than another when considered for a wide variety of tasks, which starts to make any benchmarking about trying to find benchmarks where the performance difference would really matter. And that invites all kinds of debate about which kinds of performance actually matter to the average user, which is not a very fun or interesting argument.

    CPU performance can still matter for targeted tasks, but that kind of analysis requires a very different approach and is very user-specific, when compared to standard benchmarking.
  • JarredWalton - Tuesday, May 15, 2012

    You're correct, and the real difficulty is first in finding anything where OpenCL is clearly faster, and then seeing similar techniques used in other software. Office for example isn't going to really get any faster because of your GPU or OpenCL -- and it doesn't need to be. Office spends its time waiting for user input. So what we really need are technologies that make the slow parts of using a computer faster. SSDs are a perfect example, because they make the initial boot and application load times all faster. OpenCL isn't doing that for the vast majority of applications, and neither is Quick Sync or DirectX or whatever other GPU related task you want to throw out there. They make graphics faster, but in my experience that's mostly important to gamers, or for high-end workstation stuff where you want OpenGL support.

    For many people, Core 2 Duo is fast enough, and Llano is fast enough, and Trinity is fast enough, etc. So for those users, it's about delivering the lowest cost. Trinity is twice the size of quad-core Ivy Bridge, so Intel could easily start a price war if they wanted, but they'd rather keep higher margins. Sandy Bridge laptops at $600 are still faster for general use than Llano and Trinity, particularly if they have an Optimus GPU around. Unless something is significantly faster in some important metric -- and I really don't see any single area where that's the case for Trinity -- then you just get whichever is the best price.
  • Beenthere - Tuesday, May 15, 2012

    Wow, Jarred is having an unhappy day! :(

    Obviously AMD's testing is different than yours, as is other websites'. My comments were NOT in regard to your article, which I thought was pretty balanced. The website I was referring to is listed below.

    Your knee-jerk reaction to my comment, however, shows you're losing it. If you really believe that Intel's platform provides as good a result for mainstream consumers, you'd be in error, especially when Trinity Ultrathins will be hundreds cheaper.

    It's pretty obvious you can't deal with differing POVs and you get upset when your opinion is not shared by others. Losing your objectivity makes it difficult for anyone to take your articles seriously - even though this one was pretty balanced. You should consider a CHILL PILL before over-reacting.

    You really should THINK before you react. In this case my comment had NOTHING to do with your story. If your article has merit then you should not need to go POSTAL even if my comment was about your story. Being a reactionary and calling people names for having a different POV than you shows immaturity. The really funny part about your knee-jerk reaction was my comment was in regard to another story on Trinity on a different website. (see below).

    You must have a guilty conscience? Below is the story I commented on. Oops, I'm sure you are embarrassed now, but it's OK? I don't hold grudges. <LOL>

    Maybe the Intel fanbois are just beating you up too much because Trinity is a far better choice for laptops than anything Intel has at the moment? They'll get over it.

    http://www.pcper.com/reviews/Mobile/AMD-A10-4600M-...

    Cheer up Jarred. You can look forward to Piledriver/Vishera in a few months and more hate from the Intel fanbois.
  • bji - Tuesday, May 15, 2012

    Sorry, but when you start a paragraph with "Some reviewers are trying to pretend" you are VERY CLEARLY implicating that the reviewer is being dishonest by trying to mislead people reading the review by stating intentionally false commentary.

    If you start with that kind of premise, then you deserve a response that, in kind, accuses you of doing the same, which is exactly what you got.

    Trying to then pretend that you're innocent and didn't deserve that response is just more lameness.
  • JarredWalton - Tuesday, May 15, 2012

    Beenthere is your typical passive aggressive anonymous Internet poster. I called him on his post, and now he backpedals. You know what's hilarious, Beenthere? That article you link. Let me give you a quote from the conclusion to show what I'm talking about:

    "I can’t find a way to look at Trinity that paints a favorable picture. Though certainly an improvement over Llano, it’s not enough. AMD is way behind Intel in processor performance, and the graphics performance does not offer redemption. The only way systems based off Trinity will be made competitive is by slashing and burning the prices."

    Okay, that's pretty much what I said as well. Perhaps they're even more negative than I am. And yet... that paragraph is followed by a Silver Award? WTF is up with that? They're awarding something that they can't find a way to describe in a positive fashion? And then you suggest that "Some reviewers are trying to pretend that Intel's faster CPU performance some how is of importance to the majority of the laptop market when in fact it is not unless all you do is crunch numbers." I'd say the opposite: some reviewers are trying to kiss up to AMD with an award or backhanded praise when everything else they say is negative at best.

    But hey, let's not forget how open and unbiased Beenthere is. Here's a quote from page three of the comments that shows his amazing analytical skills and not-at-all-anti-Intel mindset:

    Subject: Excellent by Beenthere on Tuesday, May 15, 2012

    As expected Trinity delivers in all areas and should meet most people's needs quite well. Good job AMD. You get my money!


    Wow. Yup, Trinity is a far better choice for laptops than anything Intel has at the moment. Because Acer's AS4830TG with GT 540M and i5-2410M at $600 offers better CPU performance and better GPU performance. Yup. Far better. I like to pay more for less!