Introduction and Piledriver Overview

Brazos and Llano were both immensely successful parts for AMD. The company sold tons despite not delivering leading x86 performance. The success of these two APUs gave AMD a lot of internal confidence that it was possible to build something that didn't prioritize x86 performance but rather delivered a good balance of CPU and GPU performance.

AMD's commitment to the world was that we'd see annual updates to all of its product lines. Llano debuted last June, and today AMD gives us its successor: Trinity.

At a high level, Trinity combines 2-4 Piledriver x86 cores (1-2 Piledriver modules) with up to 384 VLIW4 Northern Islands generation Radeon cores on a single 32nm SOI die. The result is a 1.303B transistor chip (up from 1.178B in Llano) that measures 246mm^2 (compared to 228mm^2 in Llano).

Trinity Physical Comparison
                        | Manufacturing Process | Die Size | Transistor Count
AMD Llano               | 32nm                  | 228mm^2  | 1.178B
AMD Trinity             | 32nm                  | 246mm^2  | 1.303B
Intel Sandy Bridge (4C) | 32nm                  | 216mm^2  | 1.16B
Intel Ivy Bridge (4C)   | 22nm                  | 160mm^2  | 1.4B

Without a change in manufacturing process, AMD is faced with the tough job of increasing performance without ballooning die size. Die size has only gone up by roughly 8%, but both CPU and GPU performance see double-digit increases over Llano. Power consumption is also improved over Llano, making Trinity a win across the board for AMD compared to its predecessor. If you liked Llano, you'll love Trinity.
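
To put the die numbers in context, here is a quick back-of-the-envelope sketch in Python. It only uses the Llano and Trinity figures from the physical comparison table; the helper names (`pct_increase`, `density_m_per_mm2`) are made up for this illustration.

```python
# Die size and transistor counts as quoted in the physical comparison table.
chips = {
    "Llano":   {"die_mm2": 228, "transistors_b": 1.178},
    "Trinity": {"die_mm2": 246, "transistors_b": 1.303},
}

def pct_increase(old, new):
    """Percentage growth going from old to new."""
    return (new - old) / old * 100

def density_m_per_mm2(chip):
    """Millions of transistors per square millimetre of die."""
    return chip["transistors_b"] * 1000 / chip["die_mm2"]

die_growth = pct_increase(chips["Llano"]["die_mm2"], chips["Trinity"]["die_mm2"])
xtor_growth = pct_increase(chips["Llano"]["transistors_b"],
                           chips["Trinity"]["transistors_b"])

print(f"Die area growth:   {die_growth:.1f}%")
print(f"Transistor growth: {xtor_growth:.1f}%")
for name, chip in chips.items():
    print(f"{name}: {density_m_per_mm2(chip):.2f}M transistors per mm^2")
```

The transistor count grows a little faster than the die area, so density on the same 32nm process edges up slightly from Llano to Trinity.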

The problem is what happens when you step outside of AMD's world. Llano had a difficult time competing with Sandy Bridge outside of GPU workloads. AMD's hope with Trinity is that its hardware improvements combined with more available OpenCL accelerated software will improve its standing vs. Ivy Bridge.

Piledriver: Bulldozer Tuned

While Llano featured as many as four 32nm x86 Stars cores, Trinity features up to two Piledriver modules. Given the not-so-great reception of Bulldozer late last year, we were worried about how a Bulldozer derivative would stack up in Trinity. I'm happy to say that Piledriver is a step forward from the CPU cores used in Llano, largely thanks to a bunch of clean up work from the Bulldozer foundation.

Piledriver picks up where Bulldozer left off. Its fundamental architecture remains unchanged, but it has been improved in all areas. Piledriver is very much a second pass on the Bulldozer architecture, tidying everything up, capitalizing on low hanging fruit and significantly improving power efficiency. If you were hoping for an architectural reset with Piledriver, you will be disappointed. AMD is committed to Bulldozer and that's quite obvious if you look at Piledriver's high level block diagram.

Each Piledriver module is the same 2+1 INT/FP combination that we saw in Bulldozer. You get two integer cores, each with their own schedulers, L1 data caches, and execution units. Between the two is a shared floating point core that can handle instructions from one of two threads at a time. The single FP core shares the data caches of the dual integer cores.

Each module appears to the OS as two cores; however, you don't have as many resources as you would from two traditional AMD cores. This table from our Bulldozer review highlights part of the problem when looking at the front end:

Front End Comparison
                                | AMD Phenom II        | AMD FX          | Intel Core i7
Instruction Decode Width        | 3-wide               | 4-wide          | 4-wide
Single Core Peak Decode Rate    | 3 instructions       | 4 instructions  | 4 instructions
Dual Core Peak Decode Rate      | 6 instructions       | 4 instructions  | 8 instructions
Quad Core Peak Decode Rate      | 12 instructions      | 8 instructions  | 16 instructions
Six/Eight Core Peak Decode Rate | 18 instructions (6C) | 16 instructions | 24 instructions (6C)
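
The pattern in the table falls out of one simple rule: peak decode rate is decode width times the number of decoders, and on AMD FX two cores in a module share one decoder. Here's a small Python sketch of that rule. The `FRONT_ENDS` dictionary and `peak_decode_rate` function are illustrative names, not anything from AMD or Intel documentation.

```python
import math

# (decode width, cores sharing one decoder), read off the front end table.
FRONT_ENDS = {
    "AMD Phenom II": (3, 1),
    "AMD FX":        (4, 2),  # one 4-wide decoder per two-core module
    "Intel Core i7": (4, 1),
}

def peak_decode_rate(arch, cores):
    """Peak instructions decoded per clock for a given core count."""
    width, cores_per_decoder = FRONT_ENDS[arch]
    decoders = math.ceil(cores / cores_per_decoder)
    return width * decoders

for cores in (1, 2, 4):
    row = {arch: peak_decode_rate(arch, cores) for arch in FRONT_ENDS}
    print(cores, "core(s):", row)
```

Running it for one, two and four cores reproduces the corresponding rows of the table; the FX numbers only step up with every second core because that's when another module (and another decoder) comes into play.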

It's rare that you get anywhere near peak hardware utilization, so don't be too put off by these deltas, but it is a tradeoff that AMD made throughout Bulldozer. In general, AMD opted for better utilization of fewer resources (partially through increasing some data structures and other elements that feed execution units) vs. simply throwing more transistors at the problem. AMD also opted to reduce the ratio of FP to integer resources within the x86 portion of its architecture, clearly to support a move to the APU world where the GPU can be a provider of a significant amount of FP support. Piledriver doesn't fundamentally change any of these balances. The pipeline depth remains unchanged, as does the focus on pursuing higher frequencies.

Fundamental to Piledriver is a significant switch in the type of flip-flops used throughout the design. Flip-flops, or flops as they are commonly called, are simple pieces of logic that store some form of data or state. In a microprocessor they can be found in many places, including the start and end of a pipeline stage. Work is done prior to a flop and committed at the flop or array of flops. The output of these flops becomes the input to the next array of logic. Normally flops are hard edge elements—data is latched at the rising edge of the clock.

In very high frequency designs however, there can be a considerable amount of variability or jitter in the clock. You either have to spend a lot of time ensuring that your design can account for this jitter, or you can incorporate logic that's more tolerant of jitter. The former requires more effort, while the latter burns more power. Bulldozer opted for the latter.

In order to get Bulldozer to market as quickly as possible, after far too many delays, AMD opted to use soft edge flops quite often in the design. Soft edge flops are the opposite of their harder counterparts; they are designed to allow the clock signal to spill over the clock edge while still functioning. Piledriver on the other hand was the result of a systematic effort to swap in smaller, hard edge flops where there was timing margin in the design. The result is a tangible reduction in power consumption. Across the board there's a 10% reduction in dynamic power consumption compared to Bulldozer, and some workloads are apparently even pushing a 20% reduction in active power. Given Piledriver's role in Trinity, as a mostly mobile-focused product, this power reduction was well worth the effort.
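
As a rough way to picture the "swap in hard edge flops where there was timing margin" pass, here is a toy Python model. Everything in it is an assumption made for illustration: the slack formula is a simplified textbook-style setup check, and the path names, delays, jitter, setup time and guard band are invented numbers, not AMD data.

```python
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    logic_delay_ps: float  # worst-case delay of the logic feeding the flop

def choose_flop(path, period_ps, jitter_ps, setup_ps, guard_ps):
    """Pick 'hard' if the path still meets timing with worst-case jitter plus
    an extra guard band; otherwise keep the jitter-tolerant 'soft' flop."""
    slack_ps = period_ps - jitter_ps - setup_ps - path.logic_delay_ps
    return "hard" if slack_ps >= guard_ps else "soft"

# Hypothetical paths and timing numbers (250 ps corresponds to a 4 GHz clock).
paths = [Path("decode", 150.0), Path("scheduler", 210.0), Path("fp_add", 190.0)]
choices = {
    p.name: choose_flop(p, period_ps=250.0, jitter_ps=15.0,
                        setup_ps=20.0, guard_ps=10.0)
    for p in paths
}
print(choices)
```

In this toy setup the two paths with comfortable slack get hard edge flops, while the tight scheduler path keeps its soft edge flop. That is the general shape of the tradeoff described above, not a claim about which Piledriver paths were actually changed.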

At the front end, AMD put in additional work to improve IPC. The schedulers are now more aggressive about freeing up tokens. Similar to the soft vs. hard flip flop debate, it's always easier to be conservative when you retire an instruction from a queue. It eases verification as you don't have to be as concerned about conditions where you might accidentally overwrite an instruction too early. With the major effort of getting a brand new architecture off of the ground behind them, Piledriver's engineers could focus on greater refinement in the schedulers. The structures didn't get any bigger; AMD just now makes better use of them.

The execution units are also a bit beefier in Piledriver, but not by much. AMD claims significant improvements in floating point and integer divides, calls and returns. For client workloads these gains show minimal (sub 1%) improvements.

Prefetching and branch prediction are both significantly improved with Piledriver. Bulldozer did a simple sequential prefetch, while Piledriver can prefetch variable lengths of data and across page boundaries in the L1 (mainly a server workload benefit). In Bulldozer, if prefetched data wasn't used (incorrectly prefetched) it would clog up the cache as it would come in as the most recently accessed data. However if prefetched data isn't immediately used, it's likely it will never be used. Piledriver now immediately tags unused prefetched data as least-recently-used, allowing the cache controller to quickly evict it if the prefetch was incorrect.
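
To see why the insertion position matters, consider a minimal Python sketch of a single 4-way cache set with LRU replacement. This is a simplified model under stated assumptions, not AMD's implementation: the `CacheSet` class, the `prefetch_as_lru` flag and the access pattern are all hypothetical, and "tagging as least-recently-used" is modeled as inserting the prefetched line at the LRU end of the set.

```python
from collections import OrderedDict

class CacheSet:
    """One set of a set-associative cache; dict order runs LRU -> MRU."""

    def __init__(self, ways, prefetch_as_lru):
        self.ways = ways
        self.prefetch_as_lru = prefetch_as_lru
        self.lines = OrderedDict()  # first key is the next eviction victim

    def _evict_if_full(self):
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # drop the LRU line

    def access(self, addr):
        """Demand access: returns True on a hit and makes the line MRU."""
        hit = addr in self.lines
        if not hit:
            self._evict_if_full()
            self.lines[addr] = True
        self.lines.move_to_end(addr)  # promote to MRU
        return hit

    def prefetch(self, addr):
        """Prefetch fill: lands at MRU, or at LRU if prefetch_as_lru is set."""
        if addr in self.lines:
            return
        self._evict_if_full()
        self.lines[addr] = True  # appended at the MRU end
        if self.prefetch_as_lru:
            self.lines.move_to_end(addr, last=False)  # demote to LRU

def surviving_demand_lines(prefetch_as_lru):
    """Load A, B, C, issue two useless prefetches, then load D."""
    s = CacheSet(ways=4, prefetch_as_lru=prefetch_as_lru)
    for addr in ("A", "B", "C"):
        s.access(addr)
    for addr in ("X", "Y"):  # prefetched but never used
        s.prefetch(addr)
    s.access("D")
    return [a for a in ("A", "B", "C") if a in s.lines]

print("prefetch inserted as MRU:", surviving_demand_lines(False))  # ['C']
print("prefetch inserted as LRU:", surviving_demand_lines(True))   # ['A', 'B', 'C']
```

With MRU insertion the two bad prefetches push useful lines A and B out of the set; with LRU insertion the bad prefetches evict each other and the demand-loaded lines survive. A prefetched line that does get used is promoted to MRU by the normal `access` path.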

Another change is that Piledriver includes a perceptron branch predictor that supplements the primary branch predictor in Bulldozer. The perceptron algorithm is a history based predictor that's better suited for predicting certain branches. It works in parallel with the old predictor and simply tags branches that it is known to be good at predicting. If the old predictor and the perceptron predictor disagree on a tagged branch, the perceptron's path is taken. Improving branch prediction accuracy is a challenge, but it's necessary in highly pipelined designs. These sorts of secondary predictors are a must as there's no one-size-fits-all when it comes to branch prediction.
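
For readers who haven't met a perceptron predictor before, here is a minimal Python sketch of the general idea paired with a simple 2-bit counter as the "old" predictor. It is not AMD's design. The history length, training threshold, the 2-bit counter baseline and especially the tagging heuristic (tag a branch once the perceptron has out-predicted the baseline on it) are all assumptions chosen to keep the example short.

```python
class Perceptron:
    """Perceptron over a global history of +1 (taken) / -1 (not taken)."""

    def __init__(self, history_len):
        self.weights = [0] * (history_len + 1)  # weights[0] is the bias

    def output(self, history):
        return self.weights[0] + sum(
            w * h for w, h in zip(self.weights[1:], history))

    def train(self, history, taken, threshold):
        y = self.output(history)
        t = 1 if taken else -1
        # Train on a misprediction or while confidence is still low.
        if (y >= 0) != taken or abs(y) <= threshold:
            self.weights[0] += t
            for i, h in enumerate(history):
                self.weights[i + 1] += t * h

class HybridPredictor:
    def __init__(self, history_len=8, threshold=20):
        self.history = [-1] * history_len
        self.threshold = threshold
        self.counters = {}     # pc -> 2-bit saturating counter (baseline)
        self.perceptrons = {}  # pc -> Perceptron
        self.score = {}        # pc -> perceptron hits minus baseline hits

    def predict(self, pc):
        """Returns (final prediction, baseline prediction, perceptron prediction)."""
        base = self.counters.get(pc, 1) >= 2
        p = self.perceptrons.get(pc)
        perc = p.output(self.history) >= 0 if p else base
        tagged = self.score.get(pc, 0) > 0  # hypothetical tagging rule
        return (perc if tagged else base), base, perc

    def update(self, pc, taken):
        _, base, perc = self.predict(pc)
        self.score[pc] = self.score.get(pc, 0) + (perc == taken) - (base == taken)
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)
        p = self.perceptrons.setdefault(pc, Perceptron(len(self.history)))
        p.train(self.history, taken, self.threshold)
        self.history = self.history[1:] + [1 if taken else -1]

def run_alternating(n=200):
    """Feed a strictly alternating taken/not-taken branch to the predictor."""
    hp = HybridPredictor()
    hybrid_hits = base_hits = 0
    for i in range(n):
        taken = (i % 2 == 0)
        pred, base, _ = hp.predict(0x40)
        hybrid_hits += (pred == taken)
        base_hits += (base == taken)
        hp.update(0x40, taken)
    return hybrid_hits / n, base_hits / n

hybrid_acc, base_acc = run_alternating()
print(f"hybrid: {hybrid_acc:.2f}, 2-bit counter alone: {base_acc:.2f}")
```

A strictly alternating branch is a worst case for this 2-bit counter, which flips back and forth and never gets it right, while the perceptron quickly learns the correlation with recent history. Once the branch is tagged, the hybrid follows the perceptron and mispredicts only during the first few warm-up iterations.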

Finally, Piledriver also adds new instructions to better align its ISA with Haswell: FMA3 and F16C.

271 Comments

  • texasti89 - Tuesday, May 15, 2012 - link


    A10-4600M's TDP = 35W
    I7-3720QM's TDP = 45W

    I'm pretty sure that Intel's 22nm is more power efficient than any 32nm process available in the industry. The efficiency of Intel GPU architecture is what makes their graphics solution appear to be comparable to AMD fusion parts.
  • Lolimaster - Tuesday, May 15, 2012 - link

    As obviously with the biased reviewers.

    Yeah GJ. Compare a top of the line UBER-expensive IB quad core with the highest TDP and the highest frequency vs A10 Trinity which costs 3 times less (if not more) than that i7 3720QM.

    HD4000 performance is craptastic. Don't fool people with biased comparisons; at medium detail and low res, the cpu takes advantage. For mobile each Mhz towards the 3Ghz and above improves performance.

    BUT WE ARE TALKING ABOUT AN i7 IB 3x times MORE EXPENSIVE than Trinity with WAY HIGHER MHZ. It's not the pathetic HD4000 that is shining is just the cpu, you can put an HD6450M and it will appear "faster" than Trinity if you pair with a high end expensive cpu.

    It's like the moronic reviews with a i7 3770K ($300+) vs A8-3870K ($120).

    Everyone knows that the real competition is the dual core i5 at a similar price.

    And again, medium details when APU's proved to offer high quality in most games.
  • JarredWalton - Tuesday, May 15, 2012 - link

    http://www.anandtech.com/bench/Product/600?vs=580

    I've got Mainstream and Enthusiast performance results in there for the games, but there's not much point in running games at 1600x900 High settings at <30 FPS is there?

    I have a whole section stating why we're including the systems we're including. Are you seriously delusional enough to suggest that we not show HD 4000 performance? There are no other HD 4000 results available for the time being, so either I use the i7-3720QM or I omit Ivy Bridge entirely. For you to imply its inclusion (with the note--italicized even!--that "these two laptops do not target the same market") is somehow biased is in fact far more bias than anything I've shown. And the pricing is twice as high for the ASUS system, not three times -- in fact I'd guess the Trinity laptop would be closer to $800 as configured, since it has Blu-ray and an SSD.

    What's more, throughout the review, I've included dual-core i5-2410M results and discussed how AMD's Trinity stacks up. Judging by Sandy Bridge, dual-core Ivy Bridge will be within 10% of the quad-core scores for gaming--it's not like many games can use more than two CPUs, and so it's really just a matter of the HD 4000 clocks being slightly lower on i5 models. You fail to grasp this fact with your ranting and biased outlook, unfortunately.

    In other words, I think your "moronic reviews" comment reflects your reading comprehension skills--or lack thereof. Better luck next time. You might want to sign up for the remedial math and basic reading classes at the local community college.
  • kyuu - Tuesday, May 15, 2012 - link

    "I've got Mainstream and Enthusiast performance results in there for the games, but there's not much point in running games at 1600x900 High settings at <30 FPS is there?"

    Is that the FPS you get? Did you actually test this or are you just assuming? Also, you can run 1600x900 without automagically turning up the detail settings to High at the same time. I, for one, am interested to see if the performance advantage increases over Llano/HD4000 when you shift more of the burden to the GPU side. At x768, it seems like the CPU would still be handling enough to make the CPU a substantial bottleneck.
  • JarredWalton - Tuesday, May 15, 2012 - link

    Yes, the scores in Mobile Bench are all actually tested -- including the 5 FPS average score of Trinity at 1920x1080 with 4xAA in Battlefield 3. (Yes, watching that made me feel a bit nauseous....) I could test 1600x900 at medium detail, but I don't expect any major changes from what the existing scores show.
  • Denithor - Tuesday, May 15, 2012 - link

    Actually those facts are very interesting to some of us! It lays out what the system can/cannot handle in practical terms. Now, granted, BF3 @ 1080p/4xAA is kinda an obvious fail scenario, but 1080p medium detail might be good to know.

    One real question that I haven't seen mentioned yet - how come there were no Intel cpu + nVidia gpu systems included in this testing? That seemed like a no-brainer to me...
  • JarredWalton - Wednesday, May 16, 2012 - link

    I thought the Acer TimelineU was a good choice. The only other recently tested laptops with Intel + NVIDIA are the Razer Blade (if people complain that N56VM is too expensive, what would they say about a $3500 laptop!?) and the Alienware M17x R3 (completely different class of hardware and again over $2000). The others like Dell XPS 15z came before we changed our game list, so we don't have some of the results for such laptops.
  • vegemeister - Tuesday, May 15, 2012 - link

    CPU speed doesn't become significant at low resolution because the resolution is low, but because the frame rate is high. The CPU must create the scene to be rendered at much higher temporal resolution.
  • bji - Tuesday, May 15, 2012 - link

    I think this was a well written article and that you laid out the facts about as clearly as could be laid out. I agree that Lolimaster has poor reading comprehension and needs some remedial education.
  • raghu78 - Tuesday, May 15, 2012 - link

    OEM laptop pricing is what changes the discussion. Also the sandybridge stock clearing firesale is a crucial factor. Given that core i7 2630qm with nvidia GT 555M is at USD 800 and entry level core i5 laptops at USD 550

    http://www.newegg.com/Product/Product.aspx?Item=N8...
    http://www.newegg.com/Product/Product.aspx?Item=N8...

    The A10 trinity laptops need to come at USD 600 with a max of 650 for the best designs, with the A8 at 500 - 550 and the A6 / A4 at USD 400 - 450. Then they can clearly avoid competing core i7 with discrete GPU configs and be considered good alternatives for the low end Intel core i5, core i3 and pentium / celeron dual cores with crappy intel HD 3000 graphics. Not to forget the GPU drivers advantage which AMD has, very good image quality and a rapidly growing GPU accelerated apps ecosystem.
