Graphics Core Next: Compute for Professionals

As we just discussed, a big part of AMD’s strategy for the FirePro W series relies on Graphics Core Next, their new GPU architecture. Whereas AMD’s previous VLIW architectures were strong at graphics (and hence most traditional professional graphics workloads), they were ill suited for compute tasks and professional graphics workloads that integrated compute. So as part of AMD’s much larger fundamental shift towards GPU computing, AMD has thrown out VLIW for an architecture that is strong in both compute and graphics: GCN.

Since we’ve already covered GCN in-depth when it was announced last year, we’re not going to go through a complete rehash of the architecture. If you wish to know more, please see our full analysis from 2011. In place of a full breakdown we’re going to have a quick refresher, focusing on GCN’s major features and what they mean for FirePro products and professional graphics users.

As we’ve already seen in some depth with the Radeon HD 6970, VLIW architectures are very good for graphics work, but they’re poor for compute work. VLIW designs excel in high instruction level parallelism (ILP) use cases, which graphics falls under quite nicely thanks to the fact that with most operations pixels and the color component channels of pixels are independently addressable data. In fact at the time of the Cayman launch AMD found that the average slot utilization factor for shader programs on their VLIW5 architecture was 3.4 out of 5, reflecting the fact that most shader operations were operating on pixels or other data types that could be scheduled together.
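To make the slot utilization idea concrete, here’s a toy Python model of in-order VLIW bundle packing. To be clear, this is our own illustration and not AMD’s actual compiler: the instruction format, the greedy packing rule, and names like `pack_vliw` are assumptions made purely for the sake of the example.

```python
def pack_vliw(ops, width=5):
    """Toy in-order VLIW scheduler: pack ops into bundles up to `width` wide.
    Start a new bundle when the current one is full, or when the next op
    reads (or rewrites) a register written by an op already in the bundle."""
    bundles = [[]]
    for dest, srcs in ops:
        cur = bundles[-1]
        writes = {d for d, _ in cur}
        if len(cur) == width or writes & set(srcs) or dest in writes:
            bundles.append([])
            cur = bundles[-1]
        cur.append((dest, srcs))
    return bundles

# Per-channel pixel math: four independent multiplies, one per color
# channel, all fit in a single bundle (4 of 5 slots used).
pixel_ops = [("r1", ("r0", "c0")), ("g1", ("g0", "c0")),
             ("b1", ("b0", "c0")), ("a1", ("a0", "c0"))]

# Dependent chain: each op reads the previous result, so each op gets
# its own bundle and 4 of 5 slots sit idle every cycle.
chain_ops = [("t1", ("x", "y")), ("t2", ("t1", "z")), ("t3", ("t2", "w"))]

print(len(pack_vliw(pixel_ops)))  # 1 bundle
print(len(pack_vliw(chain_ops)))  # 3 bundles
```

The independent per-channel operations pack densely, while the dependent chain serializes completely; averaged over real shaders, this kind of packing is where figures like 3.4 out of 5 come from.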

Meanwhile, at a hardware level VLIW is a unique design in that it’s the epitome of the “more is better” philosophy. AMD’s high stream processor counts with VLIW4 and VLIW5 are a result of VLIW being a very thin type of architecture that purposely uses many simple ALUs, as opposed to fewer complex units (e.g. Fermi). Furthermore all of the scheduling for VLIW is done in advance by the compiler, so VLIW designs are in effect very dense collections of simple ALUs and cache.

The hardware traits of VLIW mean that for a VLIW architecture to work, the workloads need to map well to the architecture. Complex operations that the simple ALUs can’t handle are bad for VLIW, as are instructions that aren’t trivial to schedule together due to dependencies or other conflicts. As we’ve seen graphics operations do map well to VLIW, which is why VLIW has been in use since the earliest pixel shader equipped GPUs. Yet even then graphics operations don’t achieve perfect utilization under VLIW, but that’s okay because VLIW designs are so dense that it’s not a big problem if they’re operating at under full efficiency.

When it comes to compute workloads however, the idiosyncrasies of VLIW start to become a problem. “Compute” covers a wide range of workloads and algorithms; graphics algorithms may be rigidly defined, but compute workloads can be virtually anything. On the one hand there are compute workloads such as password hashing that are every bit as embarrassingly parallel as graphics workloads are, meaning these map well to existing VLIW architectures. On the other hand there are tasks like texture decompression which are parallel but not embarrassingly so, which means they map poorly to VLIW architectures. At one extreme you have a highly parallel workload, and at the other you have an almost serial workload.
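The two extremes are easy to see in code. Below is a quick Python sketch of our own: the password hashing half is embarrassingly parallel (every item is independent, so every ALU lane could take one), while the iterated hashing loop stands in for the serial extreme, where no amount of ALU width helps because step i can’t start until step i-1 finishes.

```python
import hashlib

# Embarrassingly parallel: each password hashes independently of the
# others, so the work maps cleanly onto a wide array of simple ALUs.
passwords = ["hunter2", "correcthorse", "letmein"]
digests = [hashlib.sha256(p.encode()).hexdigest() for p in passwords]

# Nearly serial: iterated (stretched) hashing forms a dependency chain.
# Round i consumes round i-1's output, so the loop cannot be spread
# across lanes no matter how many ALUs the hardware has.
state = b"seed"
for _ in range(1000):
    state = hashlib.sha256(state).digest()
```

A VLIW machine is happy with the first pattern and starved by the second; a good compute architecture has to degrade gracefully across the whole spectrum in between.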

So long as you only want to handle the highly parallel workloads VLIW is fine. But using VLIW as the basis of a compute architecture is going to limit what tasks your processor is suitably good at.

As a result of these deficiencies in AMD’s fundamental compute and memory architectures, AMD faced a serious roadblock in improving their products for the professional graphics and compute markets. If you want to handle a wider spectrum of compute workloads you need a more general purpose architecture, and this is what AMD set out to do.

Having established what’s bad about AMD’s VLIW architecture as a compute architecture, let’s discuss what makes a good compute architecture. The most fundamental aspect of compute is that developers want stable and predictable performance, something that VLIW didn’t lend itself to because it was dependency limited. Architectures that can’t work around dependencies will see their performance vary due to those dependencies. Consequently, if you want an architecture with stable performance that’s going to be good for compute workloads then you want an architecture that isn’t impacted by dependencies.

Ultimately dependencies and ILP go hand-in-hand. If your architecture extracts ILP from a workload, then its performance is by definition bursty: it runs fast when ILP is available and slows down when it isn’t. An architecture that can’t extract ILP may not be able to achieve the same level of peak performance, but it will not burst and hence it will be more consistent. This is the guiding principle behind NVIDIA’s Fermi architecture; GF100/GF110 make no attempt to extract ILP, and developers love them for that reason.

So with those design goals in mind, let’s talk GCN.

VLIW is a traditional and well proven design for parallel processing. But it is not the only traditional and well proven design for parallel processing. For GCN AMD will be replacing VLIW with what’s fundamentally a Single Instruction Multiple Data (SIMD) vector architecture (note: technically VLIW is a subset of SIMD, but for the purposes of this refresher we’re considering them to be different).

At the most fundamental level AMD is still using simple ALUs, just like Cayman before it. In GCN these ALUs are organized into SIMD units, the smallest unit of work for GCN. A SIMD is composed of 16 of these ALUs, along with a 64KB register file for the SIMD to keep data in.

Above the individual SIMD we have a Compute Unit, the smallest fully independent functional unit. A CU is composed of 4 SIMD units, a hardware scheduler, a branch unit, L1 cache, a local data share, 4 texture units (each with 4 texture fetch load/store units), and a special scalar unit. The scalar unit is responsible for all of the arithmetic operations the simple ALUs can’t do or won’t do efficiently, such as conditional statements (if/then) and transcendental operations.
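As for how conditionals interact with a wide vector unit at all, the standard SIMD technique is predication: every lane runs both sides of a branch and a per-lane mask picks which result survives. The toy model below is our own illustration of that general technique, not GCN’s actual ISA or the precise division of labor between its scalar and vector hardware.

```python
# Toy model of predicated (masked) SIMD execution for an if/else:
# all lanes execute both sides; a per-lane mask selects which result
# each lane actually keeps.
def simd_if_else(values, cond, then_fn, else_fn):
    mask = [cond(v) for v in values]       # per-lane condition bits
    then_r = [then_fn(v) for v in values]  # every lane runs the "then" side
    else_r = [else_fn(v) for v in values]  # every lane runs the "else" side
    return [t if m else e for m, t, e in zip(mask, then_r, else_r)]

# if v > 0: v * 2, else: -v
out = simd_if_else([1, -2, 3, -4], lambda v: v > 0,
                   lambda v: v * 2, lambda v: -v)
print(out)  # [2, 2, 6, 4]
```

The cost is obvious: divergent lanes burn cycles on work they throw away, which is one reason dedicated hardware for handling control flow matters on these architectures.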

Because the smallest unit of work is the SIMD and a CU has 4 SIMDs, a CU works on 4 different wavefronts at once. As wavefronts are still 64 operations wide, each cycle a SIMD will complete ¼ of the operations on its respective wavefront, and after 4 cycles the current instruction for that wavefront is complete.
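The arithmetic here is simple enough to sketch out. The following Python snippet is just a back-of-the-envelope model of the cadence described above (the function names are ours, not AMD’s terminology):

```python
WAVEFRONT = 64   # work-items per wavefront
SIMD_WIDTH = 16  # ALU lanes per SIMD

def schedule(wavefront=WAVEFRONT, width=SIMD_WIDTH):
    """Return, per cycle, the slice of work-items a 16-lane SIMD
    executes while stepping through one 64-wide instruction."""
    return [(c * width, (c + 1) * width) for c in range(wavefront // width)]

slices = schedule()
print(slices)       # [(0, 16), (16, 32), (32, 48), (48, 64)]
print(len(slices))  # 4 cycles per instruction
# With one wavefront resident per SIMD, a CU's 4 SIMDs retire one
# instruction from each of 4 different wavefronts every 4 cycles.
```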


Wavefront Execution Example: SIMD vs. VLIW. Not To Scale - Wavefront Size 16

Cayman by comparison would attempt to execute multiple instructions from the same wavefront in parallel, rather than executing a single instruction from multiple wavefronts. This is where Cayman got bursty – if the instructions were in any way dependent, Cayman would have to let some of its ALUs go idle. GCN on the other hand does not face this issue. Because each SIMD handles a single instruction from its own wavefront, the SIMDs are in no way attempting to take advantage of ILP, and their performance will be very consistent.

There are other aspects of GCN that influence its performance – the scalar unit plays a huge part – but in comparison to Cayman, this is the single biggest difference. By not taking advantage of ILP, but instead taking advantage of Thread Level Parallelism (TLP) in the form of executing more wavefronts at once, GCN will be able to deliver high compute performance and to do so consistently.

Bringing this all together, to make a complete GPU a number of these GCN CUs will be combined with the rest of the parts we’re accustomed to seeing on a GPU. A frontend is responsible for feeding the GPU, as it contains both the command processors (ACEs) responsible for feeding the CUs and the geometry engines responsible for geometry setup.

The ACEs are going to be particularly interesting to see in practice – we haven’t seen much evidence that AMD’s use of dual ACEs has had much of an impact in the consumer space, but things are a bit different in the professional space. Having two ACEs gives AMD the ability to work around some context switching performance issues, which is of importance for tasks that need steady graphics and compute performance at the same time. NVIDIA has invested in Maximus technology – Quadro + Tesla in a single system – for precisely this reason, and with the Kepler generation has made it near-mandatory as the Quadro K5000 is a compute-weak product. AMD takes a certain degree of pleasure in having a high-end GPU that can do both of these tasks well, and with the ACEs they may be able to deliver a Maximus-like experience with only a single card at a much lower cost.

Meanwhile coming after the CUs will be the ROPs that handle the actual render operations, the L2 cache, the memory controllers, and the various fixed function controllers such as the display controllers, PCIe bus controllers, Universal Video Decoder, and Video Codec Engine.

At the end of the day, because AMD has done their homework, GCN significantly improves AMD’s compute performance relative to VLIW4 while graphics performance should be just as good. Graphics shader operations will execute across the CUs in a much different manner than they did across VLIW, but they should do so at a similar speed. It’s by building out a GPU in this manner that AMD can make an architecture that’s significantly better at compute without sacrificing graphics performance, and this is why the resulting GCN architecture is balanced for both compute and graphics.


