AMD Graphics Core Next: Out With VLIW, In With SIMD

The fundamental issue moving forward is that VLIW designs are great for graphics; they are not so great for computing. However AMD has for all intents and purposes bet the company on GPU computing – their Fusion initiative isn’t just about putting a decent GPU right on die with a CPU, but then utilizing the radically different design attributes of a GPU to do the computational work that the CPU struggles at. So a GPU design that is great at graphics and poor at computing work simply isn’t sustainable for AMD’s future.

With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principal the two are similar – run lots of things in parallel – but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), the difference in the architectures is about what VLIW does poorly for GPU computing purposes, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW is hard to schedule ahead of time and there’s no dynamic scheduling during execution, and as a result the bulk of its weaknesses follow from that. As VLIW5 was a good fit for graphics, it was rather easy to efficiently compile and schedule shaders under those circumstances. With compute this isn’t always the case; there’s simply a wider range of things going on and it’s difficult to figure out what instructions will play nicely with each other. Only a handful of tasks such as brute force hashing thrive under this architecture.

Furthermore as VLIW lives and dies by the compiler, which means not only must the compiler be good, but that every compiler is good. This is an issue when it comes to expanding language support, as even with abstraction through intermediate languages you can still run into issues, including issues with a compiler producing intermediate code that the shader compiler can’t handle well.

Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes it harder to disassemble and to debug, and in turn difficult to predict performance and to find and fix performance critical sections of the code. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.

AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.

VLIW
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

00   ALU_PUSH_BEFORE
       1  x: PREDGT     ____, R0.x,  R1.x
             UPDATE_EXEC_MASK UPDATE PRED
01 JUMP   ADDR(3)
02 ALU
       2  x: SUB        ____, R0.x,  R1.x
       3  x: MUL_e      R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
04 ALU_POP_AFTER
       4  x: SUB        ____, R1.x,  R0.x
       5  x: MUL_e      R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)

 

Non-VLIW SIMD
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

v_cmp_gt_f32       r0,r1        
  //a > b, establish VCC
s_mov_b64    
      s0,exec        //Save current exec mask
s_and_b64    
      exec,vcc,exec  //Do "if"
s_cbranch_vccz 
   label0         //Branch if all lanes fail
v_sub_f32    
      r2,r0,r1       //result = a - b
v_mul_f32    
      r2,r2,r0       //result=result * a


s_andn2_b64    
    exec,s0,exec   //Do "else" (s0 & !exec)
s_cbranch_execz    label1         //Branch if all lanes fail
v_sub_f32    
      r2,r1,r0       //result = b - a
v_mul_f32    
      r2,r2,r1       //result = result * b

s_mov_b64    
      exec,s0        //Restore exec mask

 

VLIW: it’s good for graphics, it’s often not as good for compute.

So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.

Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront.  This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN.

As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not gone into fine detail yet of what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.

 

Prelude: The History of VLIW & Graphics Many SIMDs Make One Compute Unit
Comments Locked

83 Comments

View All Comments

  • hammer256 - Friday, June 17, 2011 - link

    It's good to see AMD more committed to the GPGPU. I use GPGPU for neural network simulations, and currently the default choice has been Nvidia with CUDA. It would be nice to see some competition in this space.
    From the article it sounds like AMD knows to put a lot of emphasis on the software side of things for the developers. Hopefully they'll have a capable programming system that's as good as CUDA, maybe even better.
    Finally, Given AMD's strategies in the past with medium sized GPU chips and multi-GPU for high-end, hopefully they'll put sufficient emphasis into support for easier multi-GPU programming.

    Exciting times indeed.
  • krumme - Friday, June 17, 2011 - link

    What a pleasure to read articles like this. I would gladly pay for it, more directly, so to speak.

    Some animations or video, especially for us less tech savvy, would be highly appriciated too.

    Competition for x86 is comming ! :)
  • mczak - Friday, June 17, 2011 - link

    I wouldn't really call it radical, Cayman already had the same theoretic 1/2 performance for FP64 adds compared to FP32. Muls/FMAs though are now 1/2 too it seems (though it might not extend to all products) whereas it was 1/4 on Cayman. Still, a factor two is not what I'd call a "radical" improvement.
  • ahmedz_1991 - Friday, June 17, 2011 - link

    I really appreciated the letters A M D. Since Athlon, one could feel that AMD is lagging behind Intel more and more, but now with them beingh the first successful CPU\GPU combination (Llano out there now ) now AMD can make their own way and API's even into OS's just like what Intel and NVidia always do. This way I'm more than sure that we'll see titles (apps and games ) with the unified AMD brand instead of those ( meant to be played ) or ( smart solution ) with some stupid stars for Core i3,5 or 7
  • frozentundra123456 - Wednesday, December 21, 2011 - link

    Well, technically Sandy Bridge is also a CPU/GPU combination, and I think I would call it successful. Granted, the graphics are not up to AMD levels, but their CPU performance is much better. And considering the debacle of Bulldozer and the architecture that was not optimized for current software, AMD will have to do a much better job of integrating their hardware with software than they have done so far.
  • haukionkannel - Friday, June 17, 2011 - link

    So maybe not big upgrades in graphic power, but improvement in computing power. Its really good for CPGPU usage. It allso makes it easier to run physic calculations in AMD GPUs.

    Hmm... It allso means that more silicon space is neede for same graphic power...

    Interesting to see how it all sums up.
  • Targon - Saturday, June 18, 2011 - link

    Right now, there has been a shortage of software that really pushes the graphics limits, mostly because you have the substandard Intel graphics out there that still has a significant market share. How many games out there really make you feel that a Radeon 6970 just isn't enough? The polygon count for objects(characters) in games have not been going up as much as more world detail has been going in.

    Now, when developers want to try aiming for 5 million polygon figures in games, THAT is where there will be a bigger demand for more graphics power, and with that level of detail, the CPU power needed to properly animate the objects needs to be higher. This is where all of this work with GPU compute comes in, to handle all the complexities of properly animating these super-high detailed objects.

    I will note that The Witcher 2 is one of the first games I have seen in a long time where CPU power needs to be higher than a Phenom 2 945, and I am waiting for the AMD Bulldozer core CPUs(not APUs) to come out to see how big of an improvement it will make.
  • IlllI - Friday, June 17, 2011 - link

    can someone explain all this to me? lol this is all beyond my understanding
  • tipoo - Saturday, June 18, 2011 - link

    They are making GPU compute much more capable and possible, in a nutshell. This will greatly increase the processing speed of many tasks on computers.
  • khimera2000 - Sunday, June 19, 2011 - link

    AMD has CPU and GPU, but there seperate. They want this to change.

    There combining the CPU and GPU so that they are more able to talk to each other, and do the tasks there best at. this is done by remaking the way they build video cards.

    C++... great for CPU not so great for gpu... they want to change this.

    Out of order operations suck on the GPU. they want to change this, so it can hammer through more work faster.

    There also throwing in a bunch of tools to help tell developers where there messing up in this regard.

    fusion APUs will have a nice trick... they will be able to talk to each other without needing to send information back to memory. Imagion passing letters but having to use fedex, this would be like a move to passing letters in class (no fedex) its quicker :) and your mail isint delayed.

    APU will talk over PCI-E... Im wondering how that will work to 0.o

Log in

Don't have an account? Sign up now