Fiji’s Architecture: The Grandest of GCN 1.2

We’ll start off our in-depth look at the R9 Fury X with a look at the Fiji GPU underneath.

Like the Hawaii GPU before it, from a release standpoint Fiji is not really the pathfinder chip for its architecture, but rather it’s the largest version of it. Fiji itself is based on what we unofficially call Graphics Core Next 1.2 (aka GEN3), and ignoring HBM for the moment, Fiji incorporates a few smaller changes but otherwise remaining nearly identical to the previous GCN 1.2 chips. The pathfinder for GCN 1.2 in turn was Tonga, which was released back in September of 2014 as the Radeon R9 285.

So what does GCN 1.2 bring to the table over Hawaii and the other GCN 1.1 chips? Certainly the most well-known and marquee GCN 1.2 feature is AMD’s latest generation delta color compression technology. Tied in to Fiji’s ROPs, delta color compression augments AMD’s existing color compression capabilities with additional compression modes that are based around the patterns of pixels within a tile and the differences between them (i.e. the delta), increasing how frequently and by how much frame buffers (and RTs) can be compressed.

Frame buffer operations are among the most bandwidth intensive in a GPU – it’s a lot of pixels that need to be resolved and written to a buffer – so reducing the amount of memory bandwidth these operations draw on can significantly increase the effective memory bandwidth of a GPU. In AMD’s case, GCN 1.2’s delta color compression improvements are designed to deliver up to a 40% increase in memory bandwidth efficiency, with individual tiles being compressible at up to an 8:1 ratio. Overall, while the lossless nature of this compression means that the exact amount of compression taking place changes frame by frame, tile by tile, it is at the end of the day one of the most significant improvements to GCN 1.2. For Radeon R9 285 it allowed AMD to deliver similar memory performance on a 256-bit memory bus (33% smaller than R9 280’s), and for Fiji it goes hand-in-hand with HBM to give Fiji an immense amount of effective memory bandwidth to play with.

Moving on, AMD has also made some changes under the hood at the ALU/shader level for GCN 1.2. Many of these changes are primarily for AMD’s Carrizo APU, where task scheduling improvements go hand-in-hand with the AMD’s Heterogeneous System Architecture initiative and deliver improvements to allow the CPU and GPU to more easily deliver work to each other. Similarly, 16-bit instructions are intended to save on power consumption in mobile devices that use lower precision math for basic rendering.

More applicable to Fiji and its derivatives are the improvements to data-parallel processing. GCN 1.2 now has the ability for data to be shared between SIMD lanes in a limited fashion, beyond existing swizzling and other data organizations methods. This is one of those low-level tweaks I’m actually a bit surprised AMD even mentioned (though I’m glad they did) as it’s a little tweak that’s going to be very algorithm specific. For non-programmers there’s not much to see, but for programmers – particularly OpenCL programmers – this will enable newer, more efficient algorithms where when the nature of the work requires working with data in adjacent lanes.

But for gamers, perhaps the most significant architectural improvement to GCN 1.2 and thereby Fiji are the changes made to tessellation and geometry processing. There is no single silver bullet here – after going with a 4-wide geometry front-end in Hawaii, AMD hasn’t changed it for Tonga or Fiji – but AMD has put in quite a bit of effort in to improving how geometry data moves around within the chip and how it’s used, on the basis that at this point the limitations aren’t in raw geometry performance, but rather the difficulties in achieving that performance.

Much of this effort has been invested in better handling small geometry, whether it’s large quantities of small batches, or even small quantities of small batches. The inclusion of small instance caching, for example, allows the GPU to better keep small batches of draw calls in cache, allowing them to be referenced and/or reused in the future without having to go to off-cache memory. Similarly, AMD can now store certain cases of vertex inputs for the geometry shader in shared memory, which like small instance caching allows for processing to take place more frequently on-chip, improving performance and cutting down on DRAM traffic.

More specific to Fiji’s incarnation of GCN is how distribution is handled. Load balancing and distribution among the geometry frontends is improved overall, including some low-level optimizations to how primitives generated from tessellation are distributed. Generally speaking distribution is a means to improve performance by removing bottlenecks, however AMD is now catching a specific edge case where small amplification factors don’t generate a lot of primitives, and in those cases they’re now skipping distribution since the gains are minimal, and more likely than not the cost from the bus traffic is greater than the benefits of distribution.

Finally, AMD has also expanded the vertex reuse window on GCN 1.2. As in the general case of reuse windows, the vertex reuse window is a cache of sorts for vertex data, allowing old results to be held in waiting in case they are needed again (as is often the cases in graphics). Though they aren’t telling us just how large the window now is, GCN 1.2 now features a larger window, which increases the hit rate for vertex data and as a result further edges geometry performance up since that data no longer needs to be regenerated.

As with our R9 285 review, I took the time to quickly run TessMark across the x8/x16/x32/x64 tessellation factors just to see how tessellation and geometry performance scales on AMD’s cards as the tessellation factor increases. Keeping in mind that all of the parts here have a 4-wide geometry front-end, the R9 285, R9 290X, and R9 Fury X all have the same geometry throughput on paper, give or take 10% for clockspeeds. What we find is that Fury X shows significant performance improvements at all levels, beating not only the Hawaii based R9 290X, but even the Tonga based R9 285. Tessellation performance is consistently 33% ahead of the R9 290X, while against Tonga it’s anywhere between a 33% lead at high factors to a 130% lead at low tessellation factors, showing the influence of AMD’s changes to how tessellation is handled with low factors.

The AMD Radeon R9 Fury X Review The Fiji GPU: Go Big or Go Home
Comments Locked

458 Comments

View All Comments

  • chizow - Friday, July 3, 2015 - link

    Pretty much, AMD supporters/fans/apologists love to parrot the meme that Intel hasn't innovated since original i7 or whatever, and while development there has certainly slowed, we have a number of 18 core e5-2699v3 servers in my data center at work, Broadwell Iris Pro iGPs that handily beat AMD APU and approach low-end dGPU perf, and ultrabooks and tablets that run on fanless 5W Core M CPUs. Oh, and I've upgraded also managed to find meaningful desktop upgrades every few years for no more than $300 since Core 2 put me back in Intel's camp for the first time in nearly a decade.
  • looncraz - Friday, July 3, 2015 - link

    None of what you stated is innovation, merely minor evolution. The core design is the same, gaining only ~5% or so IPC per generation, same basic layouts, same basic tech. Are you sure you know what "innovation" means?

    Bulldozer modules were an innovative design. A failure, but still very innovative. Pentium Pro and Pentium 4 were both innovative designs, both seeking performance in very different ways.

    Multi-core CPUs were innovative (AMD), HBM is innovative (AMD+Hynix), multi-GPU was innovative (3dfx), SMT was innovative (IBM, Alpha), CPU+GPU was innovative (Cyrix, IIRC)... you get the idea.

    Doing the exact same thing, more or less the exact same way, but slightly better, is not innovation.
  • chizow - Sunday, July 5, 2015 - link

    Huh? So putting Core level performance in a passive design that is as thin as a legal pad and has 10 hours of battery life isn't innovation?

    Increasing iGPU performance to the point it not only provides top-end CPU performance, and close to dGPU performance, while convincingly beating AMD's entire reason for buying ATI, their Fusion APUs isn't innovation?

    And how about the data center where Intel's *18* core CPUs are using the same TDP and sockets, in the same U rack units as their 4 and 6 core equivalents of just a few years ago?

    Intel is still innovating in different ways, that may not directly impact the desktop CPU market but it would be extremely ignorant to claim they aren't addressing their core growth and risk areas with new and innovative products.

    I've bought more Intel products in recent years vs. prior strictly because of these new innovations that are allowing me to have high performance computing in different form factors and use cases, beyond being tethered to my desktop PC.
  • looncraz - Friday, July 3, 2015 - link

    Show me intel CPU innovations since after the pentium 4.

    Mind you, innovations can be failures, they can be great successes, or they can be ho-hum.

    P6->Core->Nehalem->Sandy Bridge->Haswell->Skylake

    The only changes are evolutionary or as a result of process changes (which I don't consider CPU innovations).

    This is not to say that they aren't fantastic products - I'm rocking an i7-2600k for a reason - they just aren't innovative products. Indeed, nVidia's Maxwell is a wonderfully designed and engineered GPU, and products based on it are of the highest quality and performance. That doesn't make them innovative in any way. Nothing technically wrong with that, but I wonder how long before someone else came up with a suitable RAM just for GPUs if AMD hadn't done it?
  • chizow - Sunday, July 5, 2015 - link

    I've listed them above and despite slowing the pace of improvements on the desktop CPU side you are still looking at 30-45% improvement clock for clock between Nehalem and Haswell, along with pretty massive improvements in stock clock speed. Not bad given they've had literally zero pressure from AMD. If anything, Intel dominating in a virtual monopoly has afforded me much cheaper and consistent CPU upgrades, all of which provided significant improvements over the previous platform:

    E6600 $284
    Q6600 $299
    i7 920 $199!
    i7 4770K $229
    i7 5820K $299

    All cheaper than the $450 AMD wanted for their ENTRY level Athlon 64 when they finally got the lead over Intel, which made it an easy choice to go to Intel for the first time in nearly a decade after AMD got Conroe'd in 2006.
  • silverblue - Monday, July 6, 2015 - link

    I could swear that you've posted this before.

    I think the drop in prices were more of an attempt to strangle AMD than anything else. Intel can afford it, after all.
  • chizow - Monday, July 6, 2015 - link

    Of course I've posted it elsewhere because it bears repeating, the nonsensical meme AMD fanboys love to parrot about AMD being necessary for low prices and strong competition is a farce. I've enjoyed unparalleled stability at a similar or higher level of relative performance in the years that AMD has become UNCOMPETITIVE in the CPU market. There is no reason to expect otherwise in the dGPU market.
  • zoglike@yahoo.com - Monday, July 6, 2015 - link

    Really? Intel hasn't innovated? I really hope you are trolling because if you believe that I fear for you.
  • chizow - Thursday, July 2, 2015 - link

    Let's not also discount the fact that's just stock comparisons, once you overclock the cards as many are interested in doing in this $650 bracket, especially with AMD's clams Fury X is an "Overclocker's Dream", we quickly see the 980Ti cannot be touched by Fury X, water cooler or not.

    Fury X wouldn't have been the failure it is today if not for AMD setting unrealistic and ultimately, unattained expectations. 390X WCE at $550-$600 and its a solid alternative. $650 new "Premium" Brand that doesn't OC at all, has only 4GB, has pump whine issues and is slower than Nvidia's same priced $650 980Ti that launched 3 weeks before it just doesn't get the job done after AMD hyped it from the top brass down.
  • andychow - Thursday, July 2, 2015 - link

    Yeah, "Overclocker's dream", only overclocks by 75 MHz. Just by that statement, AMD has totally lost me.

Log in

Don't have an account? Sign up now