AMD Graphics Core Next: Out With VLIW, In With SIMD

The fundamental issue moving forward is that VLIW designs are great for graphics; they are not so great for computing. However AMD has for all intents and purposes bet the company on GPU computing – their Fusion initiative isn’t just about putting a decent GPU right on die with a CPU, but then utilizing the radically different design attributes of a GPU to do the computational work that the CPU struggles at. So a GPU design that is great at graphics and poor at computing work simply isn’t sustainable for AMD’s future.

With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principal the two are similar – run lots of things in parallel – but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), the difference in the architectures is about what VLIW does poorly for GPU computing purposes, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW is hard to schedule ahead of time and there’s no dynamic scheduling during execution, and as a result the bulk of its weaknesses follow from that. As VLIW5 was a good fit for graphics, it was rather easy to efficiently compile and schedule shaders under those circumstances. With compute this isn’t always the case; there’s simply a wider range of things going on and it’s difficult to figure out what instructions will play nicely with each other. Only a handful of tasks such as brute force hashing thrive under this architecture.

Furthermore as VLIW lives and dies by the compiler, which means not only must the compiler be good, but that every compiler is good. This is an issue when it comes to expanding language support, as even with abstraction through intermediate languages you can still run into issues, including issues with a compiler producing intermediate code that the shader compiler can’t handle well.

Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes it harder to disassemble and to debug, and in turn difficult to predict performance and to find and fix performance critical sections of the code. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.

AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.

VLIW
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

00   ALU_PUSH_BEFORE
       1  x: PREDGT     ____, R0.x,  R1.x
             UPDATE_EXEC_MASK UPDATE PRED
01 JUMP   ADDR(3)
02 ALU
       2  x: SUB        ____, R0.x,  R1.x
       3  x: MUL_e      R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
04 ALU_POP_AFTER
       4  x: SUB        ____, R1.x,  R0.x
       5  x: MUL_e      R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)

 

Non-VLIW SIMD
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

v_cmp_gt_f32       r0,r1        
  //a > b, establish VCC
s_mov_b64    
      s0,exec        //Save current exec mask
s_and_b64    
      exec,vcc,exec  //Do "if"
s_cbranch_vccz 
   label0         //Branch if all lanes fail
v_sub_f32    
      r2,r0,r1       //result = a - b
v_mul_f32    
      r2,r2,r0       //result=result * a


s_andn2_b64    
    exec,s0,exec   //Do "else" (s0 & !exec)
s_cbranch_execz    label1         //Branch if all lanes fail
v_sub_f32    
      r2,r1,r0       //result = b - a
v_mul_f32    
      r2,r2,r1       //result = result * b

s_mov_b64    
      exec,s0        //Restore exec mask

 

VLIW: it’s good for graphics, it’s often not as good for compute.

So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.

Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront.  This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN.

As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not gone into fine detail yet of what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.

 

Prelude: The History of VLIW & Graphics Many SIMDs Make One Compute Unit
Comments Locked

83 Comments

View All Comments

  • EJ257 - Saturday, June 18, 2011 - link

    I can't believe it's been 6 years since the X360 and PS3 release. It seems like this latest generation of consoles stuck around a lot longer than previous versions did. Any speculations on what kind of hardware MS and Sony will throw into the next gen?
  • DanNeely - Sunday, June 19, 2011 - link

    They have. The big console makers, at the gave devs requests, were trying to make the current generation last a decade to allow more time to recover the work expended figuring out how to best program them. The motion capture cameras were supposed to be the thing that kept the platforms from getting too stale. I suspect however, that by planning to launch its new console early Nintendo may have blown those plans out of the water.
  • jabber - Sunday, June 19, 2011 - link

    I'm pretty sure the hardware specs for both the next Xbox and Playstation have been set in stone already.

    I'm still betting on a 2013 release too.

    So right now GPU wise I reckon we're looking at GPUs currently sitting in the $100 range for both boxes. By 2013, the cost of these chips (suitably modified) will be down to $15 -$10 a box.

    I wouldnt have thought anything higher than a 5770 or 450 would be suitable/required.
  • Targon - Monday, June 20, 2011 - link

    It all depends on what you expect. Things feel a bit stagnant on the PC game front because consoles are not evolving, and too many companies want almost exactly the same experience on the PC version as what you have on the console.
  • Stargrazer - Saturday, June 18, 2011 - link

    Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).


    Something doesn't feel right here. In itself, SIMD is about *Data* Level Parallelism, not Thread Level Parallelism. Sure, you could use SIMD units as part of some larger scheme that exploits TLP, but that's not what *SIMD* is about.
  • Loki726 - Saturday, June 18, 2011 - link

    If you use a strict definition of a SIMD programming model, then yes, you are probably right: SIMD is a single sequence of operations executed over multiple data elements.

    However, over time SIMD has been used to refer to both the aforementioned programming model and the hardware used to implement it. The hardware typically consists of a single control unit that broadcasts instructions to multiple functional units. When people say "a SIMD", they typically mean that hardware implementation rather than the computing model.

    If that wasn't confusing enough, in the 1980s GPUs started using that SIMD hardware to execute multiple threads as long as the threads were all executing the same instruction at the same time.

    So the statement about using "a SIMD" to exploit TLP is accurate, if you take "a SIMD" to mean a processor pipeline with a single control unit that broadcasts to multiple functional units, and have some scheme for scheduling threads onto functional units.
  • RedemptionAD - Saturday, June 18, 2011 - link

    It seems like a good thing potentially. I hope that their good intentions are followed with good execution, at least better than Fermi.
  • Targon - Sunday, June 19, 2011 - link

    It should be interesting going forward. Now that AMD is finally into the 32nm process node, standalone GPUs also stand to gain quite a bit. As long as graphics don't become an afterthought to GPGPU, AMD should be in good shape. Radeon 7970(if that is the next generation GPU) may really be a game changer.
  • Navier - Saturday, June 18, 2011 - link

    Will the GCN architecture be able to be virtualized? Can a VMWare/XEN/KVM/HyperV hypervisor create vGPUs accessible by VMs in much the same way as vCPUs are today? With GPUs being integrated within the CPU package it would be a waste of resources if it could not be virtualized.

    This will become a critical feature for enterprise computing beyond HPC applications. One example would be gaming in a cloud computing environment, where a company provides a service that runs a game on their compute and graphics hardware for a game and streams the output to your mobile device for you to enjoy.
  • hechacker1 - Saturday, June 18, 2011 - link

    Yeah I'm also curious about this. Perhaps with the IOMMU and other CPU like features that the GPU now has, it would be much easier to timeshare the GPU.

Log in

Don't have an account? Sign up now