AMD Graphics Core Next: Out With VLIW, In With SIMD

The fundamental issue moving forward is that VLIW designs are great for graphics; they are not so great for computing. However, AMD has for all intents and purposes bet the company on GPU computing. Their Fusion initiative isn’t just about putting a decent GPU right on die with a CPU, but about utilizing the radically different design attributes of a GPU to do the computational work that the CPU struggles with. So a GPU design that is great at graphics and poor at computing work simply isn’t sustainable for AMD’s future.

With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principle the two are similar (run lots of things in parallel), but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), what matters here is what VLIW does poorly for GPU computing purposes, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW is hard to schedule ahead of time and there’s no dynamic scheduling during execution; the bulk of its weaknesses follow from that. As VLIW5 was a good fit for graphics, it was rather easy to efficiently compile and schedule shaders under those circumstances. With compute this isn’t always the case; there’s simply a wider range of things going on and it’s difficult to figure out what instructions will play nicely with each other. Only a handful of tasks such as brute force hashing thrive under this architecture.
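To make the scheduling problem concrete, here is a minimal sketch of ahead-of-time bundle packing. It is a toy greedy list scheduler, not AMD's shader compiler; the function names, the instruction format (a name plus the set of names it depends on), and the example workloads are all assumptions made for illustration.

```python
# Toy model of static VLIW scheduling (hypothetical, not AMD's compiler).
# Each instruction is (name, set_of_names_it_depends_on).

def pack_bundles(instructions, width=5):
    """Greedily pack instructions into bundles of at most `width` slots.

    An instruction can only join a bundle if everything it depends on
    was issued in an EARLIER bundle, since nothing reorders work at run time.
    """
    bundles = []
    done = set()                 # names issued in earlier bundles
    pending = list(instructions)
    while pending:
        bundle, remaining = [], []
        for name, deps in pending:
            if len(bundle) < width and deps <= done:
                bundle.append(name)
            else:
                remaining.append((name, deps))
        if not bundle:
            raise ValueError("unsatisfiable dependency")
        done.update(bundle)
        bundles.append(bundle)
        pending = remaining
    return bundles


def utilization(bundles, width=5):
    """Fraction of VLIW slots that actually hold an instruction."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Graphics-like work: five independent operations (think x, y, z, w plus one more).
graphics = [("x", set()), ("y", set()), ("z", set()), ("w", set()), ("t", set())]

# Compute-like work: a chain where every operation needs the previous result.
chain = [("a", set()), ("b", {"a"}), ("c", {"b"}), ("d", {"c"}), ("e", {"d"})]

print(utilization(pack_bundles(graphics)))  # all five slots filled in one bundle
print(utilization(pack_bundles(chain)))     # one slot filled per bundle
```

With independent operations the packer fills every slot of a VLIW5 bundle; with a dependent chain it issues one instruction per bundle and leaves the other four slots empty, which is the kind of underutilization described above.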

Furthermore, VLIW lives and dies by the compiler, which means not only must the compiler be good, but every compiler must be good. This is an issue when it comes to expanding language support, as even with abstraction through intermediate languages you can still run into issues, including issues with a compiler producing intermediate code that the shader compiler can’t handle well.

Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes it harder to disassemble and to debug, and in turn difficult to predict performance and to find and fix performance critical sections of the code. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.

AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.

VLIW
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

00   ALU_PUSH_BEFORE
       1  x: PREDGT     ____, R0.x,  R1.x
             UPDATE_EXEC_MASK UPDATE PRED
01 JUMP   ADDR(3)
02 ALU
       2  x: SUB        ____, R0.x,  R1.x
       3  x: MUL_e      R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
04 ALU_POP_AFTER
       4  x: SUB        ____, R1.x,  R0.x
       5  x: MUL_e      R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)

 

Non-VLIW SIMD
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2

v_cmp_gt_f32       r0,r1          //a > b, establish VCC
s_mov_b64          s0,exec        //Save current exec mask
s_and_b64          exec,vcc,exec  //Do "if"
s_cbranch_vccz     label0         //Branch if all lanes fail
v_sub_f32          r2,r0,r1       //result = a - b
v_mul_f32          r2,r2,r0       //result = result * a

label0:
s_andn2_b64        exec,s0,exec   //Do "else" (s0 & !exec)
s_cbranch_execz    label1         //Branch if all lanes fail
v_sub_f32          r2,r1,r0       //result = b - a
v_mul_f32          r2,r2,r1       //result = result * b

label1:
s_mov_b64          exec,s0        //Restore exec mask
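For readers who don't work in assembly, the non-VLIW listing can be followed with a rough lane-by-lane emulation. This is only a sketch: `run_wavefront` is a hypothetical helper, Python lists stand in for vector registers and for the 64-bit exec and VCC masks, and it assumes every lane is active on entry.

```python
def run_wavefront(a, b):
    """Emulate the listing per lane: r2 = (a - b) * a if a > b else (b - a) * b."""
    lanes = len(a)
    exec_mask = [True] * lanes      # assumption: all lanes active on entry
    r2 = [0.0] * lanes

    vcc = [a[i] > b[i] for i in range(lanes)]               # v_cmp_gt_f32
    s0 = list(exec_mask)                                    # s_mov_b64 s0,exec
    exec_mask = [v and e for v, e in zip(vcc, exec_mask)]   # s_and_b64 exec,vcc,exec

    if any(vcc):                    # s_cbranch_vccz skips this when no lane passed
        for i in range(lanes):
            if exec_mask[i]:        # only lanes enabled in exec write results
                r2[i] = (a[i] - b[i]) * a[i]

    exec_mask = [s and not e for s, e in zip(s0, exec_mask)]  # s_andn2_b64 exec,s0,exec

    if any(exec_mask):              # s_cbranch_execz skips this when no lane is left
        for i in range(lanes):
            if exec_mask[i]:
                r2[i] = (b[i] - a[i]) * b[i]

    exec_mask = s0                  # s_mov_b64 exec,s0 (restore)
    return r2


print(run_wavefront([3.0, 1.0, 2.0], [1.0, 4.0, 2.0]))  # [6.0, 12.0, 0.0]
```

Each instruction maps to one step: the compare builds a per-lane mask, the scalar instructions swap that mask in and out of exec, and the vector instructions only write lanes that exec enables. Both sides of the branch are issued for the whole wavefront unless every lane goes the same way.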

 

VLIW: it’s good for graphics, it’s often not as good for compute.

So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.

Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 work-items wide, meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file, and together they compose a single SIMD in GCN.
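The 4-cycle figure is simple arithmetic, sketched below. The function names are ours, and the register count at the end rests on an assumption the presentation did not state: that the 64KB file is carved into 32-bit registers that each span all 64 lanes of a wavefront.

```python
def cycles_per_wavefront_instruction(wavefront_size=64, simd_width=16):
    """Cycles for one instruction to cover a whole wavefront (ceiling division)."""
    return -(-wavefront_size // simd_width)


def wavefront_wide_registers(file_bytes=64 * 1024, wavefront_size=64, bytes_per_element=4):
    """Assumption: each register holds one 32-bit value for every lane of a wavefront."""
    return file_bytes // (wavefront_size * bytes_per_element)


print(cycles_per_wavefront_instruction())  # 4
print(wavefront_wide_registers())          # 256
```

Under that assumption the register file holds 256 wavefront-wide registers, to be shared by whatever wavefronts are resident on the SIMD at the time.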

As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not yet gone into fine detail about what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.

 

83 Comments

  • Targon - Saturday, June 18, 2011 - link

    With 80 percent (or higher at this point) of the Windows 7 install base being 64 bit, it will take until late 2013 before we see the majority of the old 32 bit install base being phased out in the home computer market (as people replace their computers at the four-five year mark). Until then, application developers have to expect that they MUST support both 32 and 64 bit platforms. Lowest common denominator for your user base is what developers generally have to compile for.
  • DanNeely - Saturday, June 18, 2011 - link

    I assume you're using the steam hardware survey since they're showing 4:1. Unfortunately steam's not a good source for broad market stats since it excludes the low end boxes bought by non-gamers and corporate boxes. Surveys that capture these numbers only show a 2:1ish ratio for win7 64:32.

    Beyond that, it's the people with the low end 32bit boxes that will keep their old clunkers the longest. You're also underestimating how long support for legacy OSes will continue despite their very small market shares. Firefox 4 still runs on win2k, despite its market share having been negligible for several years and being officially out of support for almost a year.

    Excepting apps that actually can benefit from going 64bit I expect most to stay 32bit for at least the next 5 years.
  • swaaye - Saturday, June 18, 2011 - link

    Indeed. In the non-gamer realm, I know of people happy with 2003 Pentium 4s and Athlon XPs yet. I have no doubt that there are many people with even older hardware. This stuff tends to stick around until the PCs die and the owner is told it's not worth the money to upgrade. Fear of change and the simple lack of a true need to upgrade is the reason.
  • swaaye - Saturday, June 18, 2011 - link

    Oops. I meant that the owner is told it's not worth the money to fix the dead old hardware. But they do also tend to ask about upgrading their ancient box too.
  • Randomblame - Saturday, June 18, 2011 - link

    I was at office max the other day and a guy was screaming at a sales rep because they didn't carry any serial mice that supported his rig. I don't mean ps2 either. He was carrying around a busted up brown serial mouse. He said his rig came with windows 95 but last year he upgraded it to windows 98. Seriously. This is the world we live in.
  • EJ257 - Saturday, June 18, 2011 - link

    I still have my Compaq (that came with Win95 which I upgraded to win98) running on a Pentium 133 with 32MB of EDO RAM and a 2.1GB HDD. It's sitting idle in my basement collecting dust at the moment. :D
  • Operandi - Sunday, June 19, 2011 - link

    But Steam is a good representation of those who could benefit from and will ultimately be using these future technologies, professionals and enthusiasts. Such is always the way of high-end computing.
  • softdrinkviking - Monday, June 20, 2011 - link

    exactly. people still running XP are probably not the target market for developers because if they are so slow on the uptake of new technology, it would follow that they are also relatively uninterested in other new programs.
  • Targon - Sunday, June 19, 2011 - link

    Nope, I am going on what my customers have and are upgrading to. If you BUY a machine with Windows 7 on it, 9 out of 10 have Windows 7 64 bit on them. Those that have 32 bit are either the very low-end machines with only 1GB of RAM(yes, they still sell those), or they are the result of doing an upgrade from Windows Vista 32 bit.

    That is the thing about 64 bit, people don't "go to 64 bit" at this point, they get a new computer that comes with 64 bit Windows on it. The number of people who do an upgrade on an older machine has dropped, since those who would have done the upgrade did that back in 2009 and early 2010 when Windows 7 first came out.

    Now, the real benefit to 64 bit isn't as much about the software as it is about how much RAM the machine comes with. If you get a machine with 4GB of RAM, you want 64 bit, just so you don't lose memory due to the 4GB limit on 32 bit Windows, and hardware mapping below the 4GB mark.

    A part of this is also about the area you live in, and how much money there is going around. I live in an area where it is the norm to pay over $8 per person for lunch at a deli, and as a result, the value of the dollar isn't as high. Spending $20/day just on lunch and minor expenses is the norm, so with that in mind, replacing a computer every 4-5 years, even for the non-technical is NORMAL. The last time I encountered Windows 95 or 98 was around 6 years ago.
  • UrQuan3 - Thursday, June 23, 2011 - link

    There is a little more benefit. A few of us were doing an internal benchmark of our software using VStudio 2010 and all the random hardware we have around. 32bit, 32bit + SSE2, and 64bit + SSE2. We found across the board, 64bit is about 5-10% faster than 32bit + SSE2 and 5-20% faster than basic x86.

    However, a 64bit OS gave no benefit (or penalty) for a 32bit program. The same 32bit software ran the same speed on XP32, XP64, Vista32, Vista64, and 7-64.
