AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute

Author: Ryan Smith

by Ryan Smith on December 21, 2011 9:38 PM EST


AMD Graphics Core Next: Out With VLIW, In With SIMD

The fundamental issue moving forward is that VLIW designs are great for graphics but not so great for computing. However, AMD has for all intents and purposes bet the company on GPU computing: their Fusion initiative isn’t just about putting a decent GPU on die with a CPU, but about using the radically different design attributes of a GPU to do the computational work that the CPU struggles with. So a GPU design that is great at graphics and poor at compute simply isn’t sustainable for AMD’s future.

With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principle the two are similar (both run lots of things in parallel), but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), what matters here is what VLIW does poorly for GPU computing, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW code is hard to schedule ahead of time and there’s no dynamic scheduling during execution; the bulk of its weaknesses follow from that. Because VLIW5 was a good fit for graphics, it was rather easy to efficiently compile and schedule shaders. With compute this isn’t always the case; there’s simply a wider range of things going on and it’s difficult to figure out which instructions will play nicely with each other. Only a handful of tasks, such as brute force hashing, thrive under this architecture.
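To make the scheduling problem concrete, here is a minimal sketch in Python of a toy static scheduler for a 5-wide VLIW unit. It is not AMD's compiler: the `pack_vliw` and `utilization` names, the dependency-list input format, and the greedy packing rule are all assumptions made for illustration. It only shows how slot utilization collapses when instructions depend on each other.

```python
def pack_vliw(deps, width=5):
    """Greedily pack ops into VLIW bundles (toy model, not AMD's compiler).

    deps[i] lists the indices of earlier ops that op i depends on.
    An op can only issue in a bundle after every op it depends on
    has issued in a previous bundle.
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        # Ops whose inputs are all available, limited to the bundle width.
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


def utilization(bundles, width=5):
    """Fraction of issue slots that actually hold an instruction."""
    used = sum(len(bundle) for bundle in bundles)
    return used / (width * len(bundles))


# Graphics-like case: five independent operations fill one bundle.
independent = [[], [], [], [], []]
# Compute-like case: each operation needs the previous result.
chain = [[], [0], [1], [2], [3]]

print(utilization(pack_vliw(independent)))
print(utilization(pack_vliw(chain)))
```

In this toy model the independent case fills every slot of a single bundle, while the dependent chain needs five bundles with one useful slot each. Real shader and compute code falls somewhere in between, which is why the compiler's ability to find independent instructions matters so much under VLIW.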

Furthermore, VLIW lives and dies by the compiler, which means not only must the compiler be good, but every compiler must be good. This is an issue when it comes to expanding language support, as even with abstraction through intermediate languages you can still run into problems, including a compiler producing intermediate code that the shader compiler can’t handle well.

Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes code harder to disassemble and debug, which in turn makes it difficult to predict performance and to find and fix performance-critical sections of the code. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.

AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.

VLIW
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2
00 ALU_PUSH_BEFORE
      1 x: PREDGT ____, R0.x, R1.x
           UPDATE_EXEC_MASK UPDATE PRED
01 JUMP ADDR(3)
02 ALU
      2 x: SUB ____, R0.x, R1.x
      3 x: MUL_e R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
04 ALU_POP_AFTER
      4 x: SUB ____, R1.x, R0.x
      5 x: MUL_e R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)

Non-VLIW SIMD
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2
v_cmp_gt_f32 r0,r1          //a > b, establish VCC
s_mov_b64 s0,exec           //Save current exec mask
s_and_b64 exec,vcc,exec     //Do "if"
s_cbranch_vccz label0       //Branch if all lanes fail
v_sub_f32 r2,r0,r1          //result = a - b
v_mul_f32 r2,r2,r0          //result=result * a
s_andn2_b64 exec,s0,exec    //Do "else" (s0 & !exec)
s_cbranch_execz label1      //Branch if all lanes fail
v_sub_f32 r2,r1,r0          //result = b - a
v_mul_f32 r2,r2,r1          //result = result * b
s_mov_b64 exec,s0           //Restore exec mask

VLIW: it’s good for graphics, it’s often not as good for compute.
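For readers who don't work in assembly, the non-VLIW listing can be emulated in a few lines of Python. This is a rough sketch of the execution-mask idea only, not a model of the hardware: the function name `branch_with_exec_mask`, the list-based lanes, and the 4-lane example are assumptions for illustration. Both listings appear to implement "if a > b, result = (a - b) * a, else result = (b - a) * b" for every lane.

```python
def branch_with_exec_mask(a, b):
    """Emulate the if/else from the non-VLIW listing across all lanes."""
    lanes = len(a)
    exec_mask = [True] * lanes      # assume every lane is active on entry
    result = [0.0] * lanes

    # v_cmp_gt_f32: per-lane compare, result goes to the VCC mask
    vcc = [a[i] > b[i] for i in range(lanes)]
    # s_mov_b64 s0, exec: save the current exec mask
    saved = list(exec_mask)
    # s_and_b64 exec, vcc, exec: keep only lanes that take the "if" side
    exec_mask = [v and e for v, e in zip(vcc, exec_mask)]

    # s_cbranch_vccz: skip the "if" body when no lane takes it
    if any(exec_mask):
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (a[i] - b[i]) * a[i]   # v_sub_f32, v_mul_f32

    # s_andn2_b64 exec, s0, exec: lanes that were active but failed the test
    exec_mask = [s and not e for s, e in zip(saved, exec_mask)]

    # s_cbranch_execz: skip the "else" body when no lane takes it
    if any(exec_mask):
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (b[i] - a[i]) * b[i]   # v_sub_f32, v_mul_f32

    # s_mov_b64 exec, s0: restore the exec mask (shown for completeness)
    exec_mask = saved
    return result


print(branch_with_exec_mask([3, 1, 2, 5], [1, 4, 2, 0]))
```

Each Python statement corresponds to one or two instructions in the listing, which is the point AMD was making: the non-VLIW code reads as a straightforward sequence of vector operations guarded by a mask, rather than a set of clauses and packed slots.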

So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.

Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 work-items wide, meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file, and together they compose a single SIMD in GCN.
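The arithmetic behind those figures is simple enough to check in a few lines of Python. The SIMD width, wavefront size, and register file size come from the text above; the 4-byte register width and the per-work-item split are assumptions made purely to put the 64KB figure in perspective.

```python
SIMD_WIDTH = 16                   # lanes per GCN SIMD
WAVEFRONT_SIZE = 64               # work-items per wavefront
REGISTER_FILE_BYTES = 64 * 1024   # register file per SIMD
REGISTER_BYTES = 4                # assumption: 32-bit registers


def cycles_per_wavefront_instruction(wavefront=WAVEFRONT_SIZE, width=SIMD_WIDTH):
    """Cycles for one SIMD to push a whole wavefront through one instruction."""
    return -(-wavefront // width)   # ceiling division


def registers_per_work_item(file_bytes=REGISTER_FILE_BYTES,
                            wavefront=WAVEFRONT_SIZE,
                            reg_bytes=REGISTER_BYTES):
    """32-bit registers available to each work-item if a single wavefront
    had the SIMD's entire register file to itself (illustrative only)."""
    return file_bytes // (wavefront * reg_bytes)


print(cycles_per_wavefront_instruction())
print(registers_per_work_item())
```

Under these assumptions, 64 work-items over 16 lanes gives the 4 cycles quoted above, and the register file works out to 256 32-bit registers per work-item when only one wavefront is resident; sharing the file among several wavefronts divides that number accordingly.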

As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not yet gone into fine detail about what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.


ClagMaster - Tuesday, June 21, 2011 - link
What is being described is tantamount to the vector processing that was featured on CRAY supercomputers available in the 70's through 90's. In the machines I once programmed (using the CFT77 compiler), a vector was 64 64-bit words that were processed through a pipe.
789427 - Thursday, June 23, 2011 - link
Is it just me, or will we be seeing AMD refresh cycles quadruple for their processors because of on-die graphics?

I sense a prefix/suffix CPU/GPU diversification happening soon - and a bit of confusion with maybe some sideport memory enabled chips coming our way.

2/4/8 cores with
6550, 6750, 6850 level graphics and
512Mb/1Gb sideport
all for $100-$200 and crossfire capable?
Drool now?
cb
Kakkoii - Sunday, August 21, 2011 - link
This pleases me, because this will likely mean that AMD no longer has such a performance per dollar and watt difference from Nvidia. Thus further degrading most arguments AMD fanboys have against Nvidia. I see this being a benefit for Nvidia in the long term. After AMD claiming what Nvidia was doing wasn't right, they basically give up and are doing it themselves now too.
Cyber.Angel - Saturday, October 15, 2011 - link
exactly what I was thinking
AMD/ATI is catching up - in the HPC sector
otherwise they are still a better buy in the consumer market
and in 2012 also in HPC
Nvidia uses too much power

too bad if even Trinity is not using this new GPU design...
Wreckage - Wednesday, December 21, 2011 - link
I'm guessing we won't see product until sometime next year.
tzhu07 - Wednesday, December 21, 2011 - link
Looking forward to buying a 7970 (or possibly a 7950) to go along with my Sandy Bridge build. I'm currently running on Intel HD3000 and it's killing me. But just a few more days now. Hopefully I can hit the refresh button on my browser fast enough to catch one before they sell out.
OwnedKThxBye - Thursday, December 22, 2011 - link
Typo on the last page. At no point has AMD specified when a GPU will appear using GCN will appear, so it’s very much a guessing game.
R3MF - Thursday, December 22, 2011 - link
"We expect AMD to take a page from NVIDIA here and configure lower-end consumer parts to use the slower rates since FP64 is not currently important for consumer uses."

Will AMD be likewise crippling the FP64 support native to the chip, in products that have the resident features, if they are sold in a consumer SKU rather than a more expensive professional SKU?

I refer to nvidia's practice of crippling access to FP64 functionality in Geforce 580 cards that is otherwise available in Tesla 580 products.
zarck - Thursday, December 22, 2011 - link
For the GPGPU GRID, a test with Radeon 7970 and Folding@Home it's possible ?

https://fah-web.stanford.edu/projects/FAHClient/wi...
morricone - Thursday, December 22, 2011 - link
I'm a developer myself and you have to look really hard to find an article as good as this. Keep this stuff up!
