Graphics Core Next: Compute for Professionals

As we just discussed, a big part of AMD’s strategy for the FirePro W series relies on Graphics Core Next, their new GPU architecture. Whereas AMD’s previous VLIW architectures were strong at graphics (and hence most traditional professional graphics workloads), they were ill suited for compute tasks and professional graphics workloads that integrated compute. So as part of AMD’s much larger fundamental shift towards GPU computing, AMD has thrown out VLIW for an architecture that is strong in both compute and graphics: GCN.

Since we’ve already covered GCN in-depth when it was announced last year, we’re not going to go through a complete rehash of the architecture. If you wish to know more, please see our full analysis from 2011. In place of a full breakdown we’re going to have a quick refresher, focusing on GCN’s major features and what they mean for FirePro products and professional graphics users.

As we’ve already seen in some depth with the Radeon HD 6970, VLIW architectures are very good for graphics work, but they’re poor for compute work. VLIW designs excel in high instruction level parallelism (ILP) use cases, which graphics falls under quite nicely thanks to the fact that with most operations, pixels and the color component channels of pixels are independently addressable data. In fact at the time of the Cayman launch AMD found that the average slot utilization factor for shader programs on their VLIW5 architecture was 3.4 out of 5, reflecting the fact that most shader operations were operating on pixels or other data types that could be scheduled together.
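
To make this concrete, consider a simple color modulation shader. The four channels of a pixel are independent of one another, so a VLIW5 compiler can pack them into a single bundle. The C sketch below is a hypothetical illustration; the slot assignments are our own, not AMD compiler output.

    /* Hypothetical: the four color channels are independent, so a VLIW5
       compiler can co-issue them in one bundle. */
    typedef struct { float r, g, b, a; } pixel;

    pixel modulate(pixel p, pixel light) {
        pixel out;
        out.r = p.r * light.r;   /* slot 0 */
        out.g = p.g * light.g;   /* slot 1 */
        out.b = p.b * light.b;   /* slot 2 */
        out.a = p.a * light.a;   /* slot 3 */
        return out;              /* slot 4 (the t-unit) idles: 4/5 utilization */
    }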

Meanwhile, at a hardware level VLIW is a unique design in that it’s the epitome of the “more is better” philosophy. AMD’s high stream processor counts with VLIW4 and VLIW5 are a result of VLIW being a very thin type of architecture that purposely uses many simple ALUs, as opposed to fewer complex units (e.g. Fermi). Furthermore all of the scheduling for VLIW is done in advance by the compiler, so VLIW designs are in effect very dense collections of simple ALUs and cache.

The hardware traits of VLIW mean that for a VLIW architecture to work, the workloads need to map well to the architecture. Complex operations that the simple ALUs can’t handle are bad for VLIW, as are instructions that aren’t trivial to schedule together due to dependencies or other conflicts. As we’ve seen, graphics operations do map well to VLIW, which is why VLIW has been in use since the earliest pixel shader equipped GPUs. Yet even then graphics operations don’t achieve perfect utilization under VLIW; that’s okay, because VLIW designs are so dense that operating at under full efficiency isn’t a big problem.
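
To see what a hard-to-schedule sequence looks like, consider a chain where every operation consumes the previous result. This is a generic sketch of ours, not actual compiler output:

    /* Serially dependent operations: each step needs the last result, so a
       VLIW5 compiler can fill only one of the five slots per bundle. */
    float iterate(float x, float c, int n) {
        for (int i = 0; i < n; i++)
            x = x * x + c;   /* one slot busy, four idle, every iteration */
        return x;
    }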

When it comes to compute workloads however, the idiosyncrasies of VLIW start to become a problem. “Compute” covers a wide range of workloads and algorithms; graphics algorithms may be rigidly defined, but compute workloads can be virtually anything. On the one hand there are compute workloads such as password hashing that are every bit as embarrassingly parallel as graphics workloads are, meaning these map well to existing VLIW architectures. On the other hand there are tasks like texture decompression which are parallel but not embarrassingly so, which means they map poorly to VLIW architectures. At one extreme you have a highly parallel workload, and at the other you have an almost serial workload.
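
Password hashing is the parallel extreme: every candidate is independent, so the work spreads across as many ALUs as are available, in contrast to the dependent chain sketched earlier. Here is a minimal sketch, with a toy FNV-1a hash standing in for a real algorithm:

    /* Embarrassingly parallel: each candidate hashes independently, with no
       communication between iterations of the outer loop. */
    unsigned toy_hash(const char *s) {   /* stand-in for a real hash */
        unsigned h = 2166136261u;        /* FNV-1a offset basis */
        for (; *s; s++)
            h = (h ^ (unsigned char)*s) * 16777619u;   /* FNV-1a prime */
        return h;
    }

    void hash_all(const char **candidates, unsigned *out, int n) {
        for (int i = 0; i < n; i++)      /* every i can run on its own ALU */
            out[i] = toy_hash(candidates[i]);
    }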

So long as you only want to handle the highly parallel workloads, VLIW is fine. But using VLIW as the basis of a compute architecture is going to limit what tasks your processor is sufficiently good at.

As a result of these deficiencies in their fundamental compute and memory architectures, AMD faced a serious roadblock in improving their products for the professional graphics and compute markets. If you want to handle a wider spectrum of compute workloads you need a more general purpose architecture, and that is what AMD set out to build.

Having established what’s bad about AMD’s VLIW architecture as a compute architecture, let’s discuss what makes a good compute architecture. The most fundamental aspect of compute is that developers want stable and predictable performance, something that VLIW didn’t lend itself to because it was dependency limited. Architectures that can’t work around dependencies will see their performance vary due to those dependencies. Consequently, if you want an architecture with stable performance that’s going to be good for compute workloads then you want an architecture that isn’t impacted by dependencies.

Ultimately dependencies and ILP go hand-in-hand. If your architecture relies on extracting ILP from a workload, then its performance is by definition bursty: it rises and falls with how much ILP the workload exposes. An architecture that can’t extract ILP may not be able to achieve the same level of peak performance, but it will not burst, and hence it will be more consistent. This is the guiding principle behind NVIDIA’s Fermi architecture; GF100/GF110 have no ability to extract ILP, and developers love it for that reason.
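
The classic way to see this burstiness is the same reduction written two ways. An ILP-extracting machine runs the first version considerably faster than the second, while a machine that never extracts ILP runs both at the same steady per-instruction rate. (A generic example of ours, not vendor code.)

    /* ILP-rich: four independent accumulators can be co-issued each cycle. */
    float sum_unrolled(const float *x, int n) {   /* assumes n % 4 == 0 */
        float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        for (int i = 0; i < n; i += 4) {
            a0 += x[i]; a1 += x[i + 1]; a2 += x[i + 2]; a3 += x[i + 3];
        }
        return (a0 + a1) + (a2 + a3);
    }

    /* ILP-poor: one chain of dependent adds, one add at a time at best. */
    float sum_serial(const float *x, int n) {
        float s = 0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }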

So with those design goals in mind, let’s talk GCN.

VLIW is a traditional and well proven design for parallel processing. But it is not the only traditional and well proven design for parallel processing. For GCN AMD has replaced VLIW with what’s fundamentally a Single Instruction Multiple Data (SIMD) vector architecture (note: technically VLIW is a subset of SIMD, but for the purposes of this refresher we’re considering them to be different).

At the most fundamental level AMD is still using simple ALUs, just like Cayman before it. In GCN these ALUs are organized into SIMDs, the smallest unit of work for GCN. Each SIMD is composed of 16 of these ALUs, along with a 64KB register file for the SIMD to keep data in.

Above the individual SIMD we have a Compute Unit, the smallest fully independent functional unit. A CU is composed of 4 SIMD units, a hardware scheduler, a branch unit, L1 cache, a local data share, 4 texture units (each with 4 texture fetch load/store units), and a special scalar unit. The scalar unit is responsible for all of the arithmetic operations the simple ALUs can’t do or won’t do efficiently, such as conditional statements (if/then) and transcendental operations.
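
Condensed into a paper model, with the counts and sizes taken from the description above (the types are a mnemonic of ours, nothing like real driver code):

    /* Paper model of a GCN Compute Unit as described in the text. */
    #define SIMDS_PER_CU   4
    #define ALUS_PER_SIMD  16
    #define REGFILE_BYTES  (64 * 1024)   /* per SIMD */

    struct gcn_simd {
        float         lane[ALUS_PER_SIMD];    /* 16 simple ALUs */
        unsigned char regfile[REGFILE_BYTES]; /* 64KB register file */
    };

    struct gcn_cu {
        struct gcn_simd simd[SIMDS_PER_CU];
        /* plus: scheduler, branch unit, scalar unit, L1 cache, local data
           share, and 4 texture units (4 texture fetch load/store units each) */
    };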

Because the smallest unit of work is the SIMD and a CU has 4 SIMDs, a CU works on 4 different wavefronts at once. As wavefronts are still 64 operations wide, each cycle a SIMD will complete ¼ of the operations on its respective wavefront, and after 4 cycles the current instruction for the active wavefront is completed.


[Figure: Wavefront Execution Example, SIMD vs. VLIW. Not to scale; wavefront size 16 shown.]
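
The cadence falls out of simple arithmetic, sketched below with the figures from the text (64-wide wavefronts, 16-lane SIMDs, 4 SIMDs per CU):

    /* Issue cadence implied by the numbers above. */
    #include <stdio.h>

    int main(void) {
        const int wavefront_width = 64;   /* work items per wavefront */
        const int simd_lanes      = 16;   /* ALUs per SIMD */
        const int simds_per_cu    = 4;

        /* 64 items / 16 lanes = 4 cycles per wavefront instruction... */
        printf("cycles per wavefront instruction: %d\n",
               wavefront_width / simd_lanes);
        /* ...and with 4 SIMDs, 4 wavefronts make progress at once. */
        printf("wavefronts in flight per CU: %d\n", simds_per_cu);
        return 0;
    }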

Cayman by comparison would attempt to execute multiple instructions from the same wavefront in parallel, rather than executing a single instruction from multiple wavefronts. This is where Cayman got bursty: if the instructions were in any way dependent, Cayman would have to let some of its ALUs go idle. GCN on the other hand does not face this issue; because each SIMD handles a single instruction from its own wavefront, it is in no way attempting to take advantage of ILP, and its performance will be very consistent.

There are other aspects of GCN that influence its performance – the scalar unit plays a huge part – but in comparison to Cayman, this is the single biggest difference. By not taking advantage of ILP, but instead taking advantage of Thread Level Parallelism (TLP) in the form of executing more wavefronts at once, GCN will be able to deliver high compute performance and do so consistently.

Bringing this all together, to make a complete GPU a number of these GCN CUs are combined with the rest of the parts we’re accustomed to seeing on a GPU. The frontend is responsible for feeding the GPU, as it contains both the command processors – the Asynchronous Compute Engines (ACEs) – responsible for feeding the CUs, and the geometry engines responsible for geometry setup.

The ACEs are going to be particularly interesting to see in practice – we haven’t seen much evidence that AMD’s use of dual ACEs has had much of an impact in the consumer space, but things are a bit different in the professional space. Having two ACEs gives AMD the ability to work around some context switching performance issues, which is of importance for tasks that need steady graphics and compute performance at the same time. NVIDIA has invested in Maximus technology – Quadro + Tesla in a single system – for precisely this reason, and with the Kepler generation has made it near-mandatory as the Quadro K5000 is a compute-weak product. AMD takes a certain degree of pleasure in having a high-end GPU that can do both of these tasks well, and with the ACEs they may be able to deliver a Maximus-like experience with only a single card at a much lower cost.
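
For a rough idea of what this looks like from the developer’s side, an application can create multiple OpenCL command queues on one device and feed them concurrently. Whether two queues actually end up serviced by separate ACEs is a driver implementation detail, so treat this host-side sketch as illustrative only:

    /* Illustrative only: two command queues on a single GPU. */
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id   device;
        cl_int         err;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

        /* One queue for latency-sensitive work, one for bulk compute; the
           host can enqueue to both concurrently. */
        cl_command_queue q_interactive = clCreateCommandQueue(ctx, device, 0, &err);
        cl_command_queue q_batch       = clCreateCommandQueue(ctx, device, 0, &err);

        /* ... build kernels and enqueue work to each queue here ... */

        clReleaseCommandQueue(q_batch);
        clReleaseCommandQueue(q_interactive);
        clReleaseContext(ctx);
        return 0;
    }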

Meanwhile, coming after the CUs will be the ROPs that handle the actual render operations, the L2 cache, the memory controllers, and the various fixed function controllers such as the display controllers, PCIe bus controllers, Universal Video Decoder, and Video Codec Engine.

At the end of the day, because AMD has done their homework, GCN significantly improves AMD’s compute performance relative to VLIW4 while graphics performance should be just as good. Graphics shader operations will execute across the CUs in a much different manner than they did across VLIW, but they should do so at a similar speed. It’s by building out a GPU in this manner that AMD can make an architecture that’s significantly better at compute without sacrificing graphics performance, and this is why the resulting GCN architecture is balanced for both compute and graphics.

Comments

  • cjb110 - Tuesday, August 14, 2012

    No interest in the product unfortunately, but the article was a well written and interesting read.
  • nathanddrews - Tuesday, August 14, 2012

    I certainly miss the days of softmodding consumer cards to pro cards. I think the last card I did it on was either the 8800GT or the 4850. Some of the improvements in rendering quality and drawing speed were astounding - but it certainly nerfed gaming capability. It's a shame (from a consumer perspective) to no longer be able to softmod.
  • augiem - Thursday, August 16, 2012

    I miss those days too, but I have to say I never saw any improvement in Maya sadly over the course of 3 different generations of softmodded cards. And I spent so much time and effort researching the right card models, etc. I think the benefits for Autocad and such must have been more pronounced than Maya.
  • mura - Tuesday, August 14, 2012

    I understand how important it is to validate and bug-fix these cards; it is not the same if a card malfunctions under Battlefield 3 or some kind of engineering software - but is such a premium price necessary?

    I know, this is the market - everybody tries to achieve maximum profit, but seeing these prices, and comparing the specs with consumer cards, which cost a fraction - I don't see the bleeding-edge, I don't see the added value.
  • bhima - Tuesday, August 14, 2012

    From chatting with some of the guys at Autodesk: they use high-end gaming cards. Not sure if they ALL do, but a good portion of them do simply because of the cost of these "professional" cards.
  • wiyosaya - Thursday, August 16, 2012

    Exactly my point. If the developers at a high-end company like Autodesk use gaming cards, that speaks volumes.

    People expect that they will get better service, too, if a bug crops up. Well, even in the consumer market, I have an LG monitor that was seen by nVidia drivers as an HD TV, and kept me from using the 1920x1200 resolution of the monitor. I reported this to nVidia and within days, there was a beta version of the drivers that had fixed the problem.

    As I see it, the reality is that if you have a problem, there is no guarantee that the vendor will fix it no matter how much you paid for the card. Just look at their license agreement. Somewhere in the agreement, it is likely that you will find some clause that says that they do not guarantee a fix to any of the problems that you may report.
  • bwoochowski - Tuesday, August 14, 2012

    No one seems to be asking the hard questions of AMD:

    1) What happened to the 1/2 rate double precision FP performance that we were supposed to see on professional GCN cards?

    2) Now that we're barely starting to see some support for the cl_khr_fp64 extension, when can we expect the compiler to support the full suite of options? When will OpenCL 1.2 be fully supported?

    3) Why mention the FirePro S8000 in press reports and never release it? I have to wonder about how much time and effort was wasted on adding support for the S8000 to the HMPP and other compilers.

    I suppose it's pointless to even ask about any kind of accelerated infiniband features at this point.

    With the impending shift to hybrid clusters in the HPC segment, I find it baffling that AMD would choose to kill off their dedicated compute card now. Since the release of the 4870 they had been attracting developers that were eager to capitalize on the cheap double precision fp performance. Now that these applications are ready to make the jump from a single PC to large clusters, the upgrade path doesn't exist. By this time next year there won't be anyone left developing on AMD APP, they'll all have moved back to CUDA. Brilliant move, AMD.
  • N4g4rok - Tuesday, August 14, 2012

    Providing they don't develop new hardware to meet that need. Keeping older variations of dedicated compute cards wouldn't make any sense for moving into large cluster computing. They could keep that same line, but it would need an overhaul anyway. Why not end it and start something new?
  • boeush - Tuesday, August 14, 2012

    "I find it baffling that AMD would choose to kill off their dedicated compute card now."

    It's not that they won't have a compute card (their graphics card is simply pulling double duty under this new plan). The real issue is, to quote from the article:

    "they may be underpricing NVIDIA’s best Quadro, but right now they’re going to be charging well more than NVIDIA’s best Tesla card. So there’s a real risk right now that FirePro for compute may be a complete non-starter once Tesla K20 arrives at the end of the year."

    I find this approach by AMD baffling indeed. It's as if they just decided to abdicate whatever share they had of the HPC market. A very odd stance to take, particularly if they are as invested in OpenCL as they would like everyone to believe. The more time passes, and the more established code is created around CUDA, the harder it will become for AMD to push OpenCL in the HPC space.
  • CeriseCogburn - Wednesday, August 29, 2012

    LOL - thank you, as the amd epic fail is written all over that.
    Mentally ill self sabotage, what else can it be when you're amd.
    They have their little fanboys yapping opencl now for years on end, and they lack full support for ver 1.2 - LOL
    It's sad - so sad, it's funny.
    Actually that really is sad, I feel sorry for them they are such freaking failures.
