AMD Graphics Core Next: Out With VLIW, In With SIMD

The fundamental issue moving forward is that VLIW designs are great for graphics but not so great for computing. However, AMD has for all intents and purposes bet the company on GPU computing: their Fusion initiative isn’t just about putting a decent GPU on die with a CPU, but about using the radically different design attributes of a GPU to do the computational work that the CPU struggles with. So a GPU design that is great at graphics and poor at computing simply isn’t sustainable for AMD’s future.

With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principle the two are similar (run lots of things in parallel), but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), what matters here is what VLIW does poorly for GPU computing, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW is hard to schedule ahead of time and has no dynamic scheduling during execution; the bulk of its weaknesses follow from that. Because VLIW5 was a good fit for graphics, it was rather easy to compile and schedule shaders efficiently. With compute this isn’t always the case; there’s simply a wider range of things going on, and it’s difficult to figure out which instructions will play nicely with each other. Only a handful of tasks, such as brute force hashing, thrive under this architecture.
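To make the scheduling problem concrete, here is a minimal sketch in Python of a toy static packer. It is not AMD’s compiler: the function names `pack_bundles` and `utilization`, the greedy strategy, and the two example workloads are all assumptions made for illustration. Each operation lists the operations it depends on, and the packer fills bundles of up to `width` slots with operations whose inputs were produced in an earlier bundle.

```python
def pack_bundles(deps, width):
    """Greedily pack ops into VLIW bundles (toy model, not AMD's scheduler).

    deps maps each op name to the list of ops it depends on. An op may be
    issued only after all of its dependencies were issued in an earlier bundle.
    """
    done = set()
    remaining = list(deps)  # keep program order
    bundles = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]  # fill at most `width` slots this cycle
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


def utilization(bundles, width):
    """Fraction of available slots that actually hold an operation."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# Graphics-like work (hypothetical): four independent channels of a vector.
independent = {"x": [], "y": [], "z": [], "w": []}

# Compute-like work (hypothetical): a serial chain, each op needs the last.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

for name, work in (("independent", independent), ("chain", chain)):
    bundles = pack_bundles(work, width=4)
    print(name, bundles, f"{utilization(bundles, 4):.0%}")
```

In this toy model the independent workload fills every slot of a single bundle, while the dependent chain needs four bundles with one slot used in each. That is the kind of gap a static scheduler cannot close when the code offers no ILP.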

Furthermore, VLIW lives and dies by the compiler, which means not only that the compiler must be good, but that every compiler must be good. This is an issue when it comes to expanding language support: even with abstraction through intermediate languages you can still run into problems, including a compiler producing intermediate code that the shader compiler can’t handle well.

Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes code harder to disassemble and debug, which in turn makes it difficult to predict performance and to find and fix performance-critical sections. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.

AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.

VLIW
// Register r0 contains "a", r1 contains "b"
// Value is returned in r2

00   ALU_PUSH_BEFORE
       1  x: PREDGT     ____, R0.x,  R1.x
             UPDATE_EXEC_MASK UPDATE PRED
01 JUMP   ADDR(3)
02 ALU
       2  x: SUB        ____, R0.x,  R1.x
       3  x: MUL_e      R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
04 ALU_POP_AFTER
       4  x: SUB        ____, R1.x,  R0.x
       5  x: MUL_e      R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)

 

Non-VLIW SIMD
// Register r0 contains "a", r1 contains "b"
// Value is returned in r2

v_cmp_gt_f32     r0,r1          //a > b, establish VCC
s_mov_b64        s0,exec        //Save current exec mask
s_and_b64        exec,vcc,exec  //Do "if"
s_cbranch_vccz   label0         //Branch if all lanes fail
v_sub_f32        r2,r0,r1       //result = a - b
v_mul_f32        r2,r2,r0       //result = result * a

label0:
s_andn2_b64      exec,s0,exec   //Do "else" (s0 & !exec)
s_cbranch_execz  label1         //Branch if all lanes fail
v_sub_f32        r2,r1,r0       //result = b - a
v_mul_f32        r2,r2,r1       //result = result * b

label1:
s_mov_b64        exec,s0        //Restore exec mask
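For readers who don’t work in assembly, the GCN listing can be emulated lane by lane. The following is a rough Python sketch, not AMD code: the function name `run_wavefront` is hypothetical, a 4-lane example stands in for a real wavefront, all lanes are assumed active on entry, and float32 rounding is ignored. Comments map each step back to the instruction it imitates.

```python
def run_wavefront(a, b):
    """Emulate the if/else listing with an execution mask (toy model)."""
    lanes = len(a)
    exec_mask = [True] * lanes  # assumption: every lane is active on entry
    r2 = [None] * lanes

    vcc = [a[i] > b[i] for i in range(lanes)]              # v_cmp_gt_f32
    s0 = list(exec_mask)                                   # s_mov_b64 s0,exec
    exec_mask = [v and e for v, e in zip(vcc, exec_mask)]  # s_and_b64 exec,vcc,exec
    if any(vcc):                                           # s_cbranch_vccz skips when no lane passed
        for i in range(lanes):
            if exec_mask[i]:
                r2[i] = a[i] - b[i]                        # v_sub_f32 r2,r0,r1
                r2[i] = r2[i] * a[i]                       # v_mul_f32 r2,r2,r0

    exec_mask = [s and not e for s, e in zip(s0, exec_mask)]  # s_andn2_b64 exec,s0,exec
    if any(exec_mask):                                     # s_cbranch_execz skips when no lane remains
        for i in range(lanes):
            if exec_mask[i]:
                r2[i] = b[i] - a[i]                        # v_sub_f32 r2,r1,r0
                r2[i] = r2[i] * b[i]                       # v_mul_f32 r2,r2,r1

    exec_mask = s0                                         # s_mov_b64 exec,s0
    return r2


a = [3.0, 1.0, 5.0, 2.0]
b = [1.0, 4.0, 5.0, 0.5]
print(run_wavefront(a, b))  # [6.0, 12.0, 0.0, 3.0]
```

Every lane executes both sides of the branch in lockstep, but the mask decides which lanes are allowed to write a result. The two scalar branches only skip a side entirely when no lane needs it.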

 

VLIW: it’s good for graphics, it’s often not as good for compute.

So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.

Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 work-items wide, meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file, and together they compose a single SIMD in GCN.
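The 4-cycle figure follows directly from the two widths given above. A minimal sketch of the arithmetic in Python; the constant and function names are made up for illustration, and the model assumes one pass of the vector unit per clock with no other stalls:

```python
import math

SIMD_WIDTH = 16       # data elements processed per clock by one GCN vector SIMD
WAVEFRONT_SIZE = 64   # elements in one wavefront


def cycles_per_instruction(wavefront_size=WAVEFRONT_SIZE, simd_width=SIMD_WIDTH):
    """Clocks needed to push one instruction through a whole wavefront."""
    return math.ceil(wavefront_size / simd_width)


print(cycles_per_instruction())  # 4
```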

As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not yet gone into fine detail about what those are, but we’re expecting something similar to Cayman, with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.

 


83 Comments


  • haplo602 - Saturday, June 18, 2011 - link

    I hope that AMD delivers. This is exactly what I expected them to do once Llano was announced. GPU as a coprocessor. Actually I hoped that AMD would implement a HTX capable GPU, so I can just plug it into a C32 socket (for example) along with an Opteron.

    The future past Trinity looks interesting.
  • jamescox - Monday, June 20, 2011 - link

    It would be interesting if they produced a form factor with CPU+GPU on a separate card with memory. Ever since AMD moved the memory controller on die, I wondered if we would see CPU + memory on a separate card. It seems to make a lot of sense. A 4 socket motherboard is huge, especially where each socket has 4 to 6 memory slots associated with it. If the CPU and memory were on a separate card, then you could pack them a lot denser, like you can run 4 GPUs off an ATX board now. It might be cheaper than a massive 4 socket board also. I don't know how many HT links you can run through a slot, but you could always use extra cables/connectors like they use for multiple graphics cards.

    With the GPU using the same memory space as the CPU, then why leave the CPU attached to the slow system memory? Just put one of these hybrid chips attached to some high-speed graphics card like memory on a separate board. Move the slow system memory out to the chipset again. The current memory hierarchy is not exactly optimal in my opinion. I am using a slightly older macbook pro, which only supports 3 GB of memory. With all of the stuff I run, it is paging a lot to a super slow laptop hard drive. I have been tempted to get an SSD to speed it up rather than a new laptop.

    Anyway, with the way the memory hierarchy works now, system memory is kind of like a cache for the swap space on disk. System memory has gotten a lot faster, but disks have not, so people are using SSDs to fill the gap. If you directly connect the "graphics memory" to a CPU/GPU combo, then you don't need as much total memory in the system because you would not need multiple copies of the data. You would just pass pointers to data back and forth between the CPU and GPU components.

    Also, it would be nice to switch to something non-volatile for the memory connected to the chipset; just use disk as mass storage only. "System" memory wouldn't need to be that fast, since you would probably have a GB or two of high-speed memory on each processor board. The "system" memory would be used more like the SSD boot/swap drive in a current system. I don't think flash is quite there yet, and the other types of non-volatile memory (magnetic RAM , phase-shift RAM, etc) that promise much better performance and durability seem to still be all talk with no real products.

    With keeping the current form factor, it would be nice if they could put a large amount of memory in with the CPU/GPU package to act as high-speed memory for the GPU and L4 cache for the CPU. This form factor doesn't support scaling up to multiple chips easily (too large of main-board), but it would be very power efficient for laptops and other small form factor systems. It would require very little off-module communication which saves a lot of power. Maybe they could use a low-power, wide-interface dram chip originally meant for mobile devices.

    Hopefully Trinity is more than just a meaningless code name...
  • Quantumboredom - Sunday, June 19, 2011 - link

    On page 4 ("Many SIMDs Make One Compute Unit") there are two figures showing wavefront scheduling on VLIW4 versus GCN. As I read it the figures seem to indicate that in VLIW4, one 4-wide VLIW handles operations from four wavefronts in parallel, but that's not how I've understood AMD's VLIW4. Only a single work-item is executing on a VLIW4-core at any point in time, the occupancy problems of VLIW4 come from ILP within a work-item, not across wavefronts.

    At any one point in time, a Cayman/VLIW4 compute unit is only executing instructions from a single wavefront (though they need at least two wavefronts to switch between on VLIW4). Again at any one point in time only 16 work-items are actually being executed, and it's within those 16 work-items that ILP must be extracted to fill the VLIW4 units. Since each work-item is executing on a VLIW4-processor, a total of 16*4=64 operations can be done in parallel, but that requires ILP within the work-items.

    On GCN this is quite different, where the four 16-wide vector units are actually executing 64 work-items at a time (four times as many as in Cayman). However the point is that each of these work-items are basically executing on a scalar processor, there's no need for ILP anymore. So again we are executing 64 operations in parallel, but now without any need for ILP.

    At least this is how I understood the presentation (I was at AFDS). Basically I agree with how the GCN scheduling is illustrated in this article, but the Cayman part looks wrong to me. A Cayman CU can only execute one wavefront at a time, and it only needs two wavefronts to switch between to be able to fully utilize the hardware, not four like the figures here seem to suggest.

    Now I'm just a programmer, not an architecture guy, so if anyone could clear this up for me it would be greatly appreciated :)
  • Ryan Smith - Monday, June 20, 2011 - link

    Hi Quantum;

    After further consideration you're basically right. I should have made a distinction in the figures between instructions and wholly distinct wavefronts. While there are some ILP considerations to be had, basically the elements Cayman accepts should all be instructions from the same wavefront rather than different wavefronts. Cayman can't really work on multiple wavefronts at once.

    I don't have the original files on me, but we'll get this fixed in the morning to show that Cayman is consuming multiple instructions from the same wavefront.

    -Thanks
    Ryan Smith
  • jamescox - Monday, June 20, 2011 - link


    Would a CPU/GPU integrated chip only be a replacement for integrated graphics, or does it have the possibility to move a little farther up? With multi-threading, 4 to 8 thread CPUs will be common in the mainstream, but that will not be a very big die on smaller processes. Most PC software doesn't make use of more than 4 compute intensive threads, so how much room does that leave for GPU hardware? If they solve the memory speed problem by integrating some high-speed memory into the socket (multi-chip module), or something, then it seems like they could possibly get more mainstream performance out of an integrated chip.

    If the integrated GPU isn't being used for graphics, then I really don't see that much software that would use it for compute in the PC space. One of the main things mentioned was usually video encode/decode, but it seems that the best solution is to include specific media encode/decode hardware like sandy bridge does. It seems to be just as fast and much more power efficient. If AMD doesn't include a media processing engine, then that could still be a reason to go with Intel. What other PC software could use the compute power?

    There is plenty of software that could use it in professional/HPC markets, so it makes sense to make a GPU that can be used for both if it doesn't sacrifice the graphics performance. The newest generations of GPUs have some things in common with Larrabee and Sony's Cell processors, except both of those tried to move too much of the graphics processing abilities into software. AMD didn't make that mistake, but talk of compute abilities for GPUs in the PC/consumer space seems a bit premature without any real applications to take advantage of it.
  • GaMEChld - Monday, June 20, 2011 - link

    Llano already has low level discrete GPU performance, and that's just the tip of the iceberg. You are correct that on smaller processes they will be able to allocate more space to the GPU while maintaining CPU performance. I believe the successor to Trinity (which is the Bulldozer based successor to Llano) is supposed to be on 28nm. If everything goes exactly right, you could potentially have some kind of monster that has i5-2500K CPU performance with Radeon 6800 GPU performance in some mainstream laptop chip a year or two down the road. (Those numbers are all pure speculation)

    I encourage everyone to take a moment and remember the first computer you ever used, just to pay homage to what we are capable of as a species in just a few short years.

    I remember an IBM computer flipped on by a big red toggle that took 2 minutes to boot to a dos prompt...
  • Targon - Monday, June 20, 2011 - link

    I remember the Timex Sinclair, with 2KB of memory standard hooked up to a black and white TV and cassette tapes to save/load programs. Z80 running at 1MHz...the old 5.25 inch floppies were MUCH better, at least you could get a list of what was on the storage medium without having to load it.
  • jabber - Monday, June 20, 2011 - link

    If only our attitudes to each other and other issues had advanced as much as well.
  • GaMEChld - Monday, June 20, 2011 - link

    "Because in the end, aren't all religions the same? They tell us what to eat, when to pray, that this lump of clay called Man can somehow shape himself to resemble the divine. But we can never attain that perfect grace if we have hatred in our hearts. So let us celebrate our commonalities. Some of us don't eat pork. Some of us don't eat shellfish. But we all eat chicken. So spread the word: peace and chicken!"
    ~HOMER SIMPSON

    :-D
  • Cyber.Angel - Saturday, October 15, 2011 - link

    off-topic?

    7th day Adventist don't eat meat, yes, not even chicken
    AND
    in Christian religion it's God who sacrifices, not human
    PLUS
    there is a requirement of TOTAL change according to Jesus
    That is, the "ME" is buried, forgotten and God lives inside of you
    meaning a total change in life

    God bless America - but...where is the change?
