Many SIMDs Make One Compute Unit

When we move up a level we have the Compute Unit, what AMD considers the fundamental unit of computation. Whereas a single SIMD can execute vector operations and nothing more, when it’s combined with a number of other functional units it becomes a complete unit capable of the entire range of compute tasks. In practice the Compute Unit replaces the Cayman SIMD, which was itself a collection of Cayman SPs; however, a GCN Compute Unit is capable of far, far more than a Cayman SIMD.

So what’s in a Compute Unit? Just as a Cayman SIMD was a collection of SPs, a Compute Unit starts with a collection of SIMDs: 4 SIMDs per CU, meaning that like a Cayman SIMD, a GCN CU can work on 4 instructions at once. Also in a Compute Unit is the control hardware and branch unit responsible for fetching, decoding, and scheduling wavefronts and their instructions. This is further augmented with a 64KB Local Data Store and 16KB of L1 data + texture cache. With GCN the data and texture L1 are now one and the same, and texture pressure on the L1 cache has been reduced by the fact that AMD now keeps compressed rather than uncompressed texels in the L1 cache. Rounding out the memory subsystem is access to the L2 cache and beyond. Finally there is a new unit: the scalar unit. We’ll get back to that in a bit.

But before we go any further, let’s stop here for a moment. Now that we know what a CU looks like and what the weaknesses of VLIW are, we can finally get to the meat of the issue: why AMD is dropping VLIW for non-VLIW SIMD. As we mentioned previously, the weakness of VLIW is that it’s statically scheduled ahead of time by the compiler. As a result, if any dependencies crop up while code is being executed, there is no deviation from the schedule and VLIW slots go unused. So the first change is immediate: in a non-VLIW SIMD design, scheduling moves from the compiler to the hardware. It is the CU that now schedules execution within its domain.
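
To make the scheduling difference concrete, here is a minimal, purely illustrative sketch of static VLIW packing; it is not AMD’s compiler or ISA. A toy packer fills fixed 4-wide bundles ahead of time (the 4-slot width is an assumption, in the spirit of Cayman’s VLIW4), so a chain of dependent instructions cannot share a bundle and most slots go to waste, while fully independent instructions fill a bundle completely.

```python
# Illustrative sketch only: a toy static "VLIW packer" showing why dependent
# instructions leave issue slots empty. The instruction names and the 4-slot
# bundle width are hypothetical stand-ins (Cayman-style VLIW4), not AMD's ISA.

SLOTS_PER_BUNDLE = 4  # assumption: a 4-wide VLIW bundle

def pack_bundles(instructions):
    """Greedily pack (name, dependencies) pairs into bundles at 'compile time'.

    An instruction may only join the current bundle if none of its
    dependencies are also in that bundle.
    """
    bundles, current = [], []
    for name, deps in instructions:
        names_in_current = {n for n, _ in current}
        if len(current) == SLOTS_PER_BUNDLE or (deps & names_in_current):
            bundles.append(current)
            current = []
        current.append((name, deps))
    if current:
        bundles.append(current)
    return bundles

# A serial dependency chain: each op needs the previous op's result.
dependent_chain = [("a", set()), ("b", {"a"}), ("c", {"b"}), ("d", {"c"})]
# Four fully independent ops.
independent_ops = [("w", set()), ("x", set()), ("y", set()), ("z", set())]

for label, prog in (("dependent chain", dependent_chain),
                    ("independent ops", independent_ops)):
    bundles = pack_bundles(prog)
    used = sum(len(b) for b in bundles)
    total = len(bundles) * SLOTS_PER_BUNDLE
    print(f"{label}: {len(bundles)} bundle(s), {used}/{total} slots filled")
```

The dependent chain ends up as four bundles with only one of four slots filled in each, while the independent ops fill a single bundle completely, and because the schedule is fixed at compile time the hardware cannot recover those empty slots at run time.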

Now there’s a distinct tradeoff with dynamic hardware scheduling: it can cover up dependencies and other types of stalls, but the hardware scheduler itself takes up die space. The reason AMD’s designs from R300 through Cayman relied on compiler scheduling was that the compiler could do a fine job for graphics workloads, and the die space was better utilized by filling it with additional functional units. Moving scheduling into hardware makes execution more dynamic, but it consumes space previously used for functional units. It’s a tradeoff.

So what can you do with dynamic scheduling and independent SIMDs that you could not do with Cayman’s collection of SPs (SIMDs)? You can work around dependencies and schedule around stalls. The worst case scenario for VLIW is an instruction that is completely dependent on, or otherwise blocks, the instructions before and after it – it must be run on its own. Now GCN is not an out-of-order architecture; within a wavefront the instructions must still be executed in order, so you can’t jump around a pixel shader program, for example, and execute different parts of it at once. However the CU and SIMDs can select a different wavefront to work on; this can be another wavefront spawned by the same task (e.g. a different group of pixels/values) or it can be a wavefront from a different task entirely.

[Diagram: Wavefront Execution Example, SIMD vs. VLIW. Not to scale; wavefront size of 16 shown.]
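
The sketch below plays out the behavior the diagram illustrates: each cycle the SIMD issues the next in-order instruction from whichever resident wavefront is not stalled. The wavefront count, stall lengths, and selection policy are invented for illustration; this is not GCN’s actual scheduling logic.

```python
# Illustrative sketch only: each cycle the SIMD issues from the first resident
# wavefront that is not stalled. Stall lengths and the selection policy are
# hypothetical; this is not GCN's actual scheduler.

from collections import deque

class Wavefront:
    def __init__(self, name, instructions):
        self.name = name
        # Each instruction is (label, stall_cycles_before_next_may_issue).
        self.instructions = deque(instructions)
        self.ready_at = 0  # cycle at which this wavefront may issue again

    def done(self):
        return not self.instructions

def run_simd(wavefronts, max_cycles=32):
    """Issue at most one instruction per cycle, in order within a wavefront."""
    for cycle in range(max_cycles):
        if all(w.done() for w in wavefronts):
            break
        for w in wavefronts:
            if not w.done() and w.ready_at <= cycle:
                label, stall = w.instructions.popleft()
                w.ready_at = cycle + 1 + stall
                print(f"cycle {cycle:2d}: issue {w.name}.{label}")
                break
        else:
            print(f"cycle {cycle:2d}: every wavefront is stalled -- SIMD idles")

# Two wavefronts whose first instructions each stall for 3 cycles on a result.
run_simd([Wavefront("wfA", [("i0", 3), ("i1", 0)]),
          Wavefront("wfB", [("i0", 3), ("i1", 0)])])
```

With only two wavefronts a few cycles still go idle; the more wavefronts a SIMD has resident to choose from, the better its odds of finding issuable work every cycle.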

Cayman had a very limited ability to work on multiple tasks at once. While it could consume multiple wavefronts from the same task with relative ease, its ability to execute concurrent tasks relied on API support, which was limited to an extension to OpenCL. With these hardware changes, GCN can now work on multiple tasks concurrently with relative ease. Each GCN SIMD has 10 wavefronts to choose from, meaning each CU in turn can have up to 40 wavefronts in flight. This in a nutshell is why AMD is moving from VLIW to non-VLIW SIMD for Graphics Core Next: instead of VLIW slots going unused due to dependencies, independent SIMDs can be given entirely different wavefronts to work on.
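
As a quick back-of-the-envelope check on what those figures add up to (the 64 work-item wavefront size is GCN’s standard wavefront width and is assumed here; the rest comes straight from the numbers above):

```python
# Occupancy arithmetic from the figures quoted above. The 64 work-item
# wavefront size is GCN's standard wavefront width (assumed here); the rest
# follows directly from the per-CU numbers in the text.

SIMDS_PER_CU = 4          # SIMDs in one GCN Compute Unit
WAVEFRONTS_PER_SIMD = 10  # wavefronts each SIMD can choose from
WAVEFRONT_SIZE = 64       # work-items per GCN wavefront

wavefronts_per_cu = SIMDS_PER_CU * WAVEFRONTS_PER_SIMD
work_items_per_cu = wavefronts_per_cu * WAVEFRONT_SIZE

print(f"Wavefronts in flight per CU: {wavefronts_per_cu}")  # 40
print(f"Work-items in flight per CU: {work_items_per_cu}")  # 2560
```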

As a consequence, compiling also becomes much easier. With the compiler freed from scheduling duties, compilation behaves in a rather standard manner, since most other architectures are similarly scheduled in hardware. Writing a compiler still isn’t trivial, but when it comes to optimizing the execution of a program the compiler can focus on other matters, making it much easier for other languages to target GCN. In fact, without the need to generate long VLIW instructions or to include scheduling information, the underlying ISA for GCN is also much simpler. This makes debugging much easier, since the generated code reflects the fact that scheduling is now done in hardware – something our earlier assembly code example illustrates.

Now while leaving behind the drawbacks of VLIW is the biggest architectural improvement for compute performance coming from Cayman, the move to non-VLIW SIMDs is not the only benefit. We still have not discussed the final component of the CU: the scalar unit. New to GCN, the scalar unit serves to keep inefficient operations out of the SIMDs, leaving the vector ALUs on the SIMDs to execute instructions en masse. The scalar unit is composed of a single scalar ALU along with an 8KB register file.

So what does the scalar unit do? First and foremost it executes “one-off” mathematical operations. Whole groups of pixels/values go through the vector units together, but independent operations go to the scalar unit so as not to waste valuable SIMD time. This includes everything from simple integer operations to control flow operations such as conditional branches (if/else) and jumps, and in certain cases read-only memory operations from a dedicated scalar L1 cache. Overall the scalar unit can execute one instruction per cycle, which means it can complete 4 instructions in the time it takes a SIMD to execute a single instruction for a full wavefront.
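
To illustrate the split, here is a hypothetical kernel written out from a single work-item’s point of view, in Python purely for readability, with comments marking which values are uniform across a wavefront (candidates for the scalar unit) and which are per-work-item (vector SIMD work). The function, names, and mapping comments are invented; this shows the general principle described above, not the output of AMD’s compiler.

```python
# Hypothetical kernel, one work-item's view. Comments mark which operations are
# uniform across the wavefront (scalar-unit candidates) versus per-work-item
# (vector SIMD work). This illustrates the principle, not real compiler output.

def saxpy_like(work_item_id, a, xs, ys, n):
    # 'a' and 'n' are identical for every work-item in the wavefront, so the
    # comparison and branch below happen once for the whole wavefront:
    if n == 0:                  # uniform compare + branch -> scalar unit
        return 0.0
    i = work_item_id            # per-work-item index      -> vector registers
    x = xs[i]                   # per-work-item load       -> vector memory op
    y = ys[i]                   # per-work-item load       -> vector memory op
    return a * x + y            # per-work-item math       -> vector ALUs (SIMD)

# One "wavefront" of 64 work-items running the same kernel:
a, n = 2.0, 64
xs = [float(i) for i in range(n)]
ys = [1.0] * n
results = [saxpy_like(wid, a, xs, ys, n) for wid in range(n)]
print(results[:4])  # [1.0, 3.0, 5.0, 7.0]
```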

Conceptually this blurs a bit more of the remaining line between a scalar GPU and a vector GPU, but by having both types of units it means that each unit type can work on the operations best suited for it. Besides avoiding feeding SIMDs non-vectorized datasets, this will also improve the latency for control flow operations, where Cayman had a rather nasty 44 cycle latency.

Comments

  • Targon - Saturday, June 18, 2011 - link

    With Windows 7 having an 80 percent (or higher at this point) install base being 64 bit, it will take until late 2013 before we see the majority of the old 32 bit install base being phased out in the home computer market (as people replace their computers at the four-to-five year mark). Until then, application developers have to expect that they MUST support both 32 and 64 bit platforms. Lowest common denominator for your user base is what developers generally have to compile for.
  • DanNeely - Saturday, June 18, 2011 - link

    I assume you're using the steam hardware survey since they're showing 4:1. Unfortunately steam's not a good source for broad market stats since it excludes the low end boxes bought by non-gamers and corporate boxes. Surveys that capture these numbers only show a 2:1ish ratio for win7 64:32.

    Beyond that, it's the people with the low end 32bit boxes that will keep their old clunkers the longest. You're also underestimating how long support for legacy OSes will continue despite their very small market shares. Firefox 4 still runs on win2k, despite its market share having been negligible for several years and the OS being officially out of support for almost a year.

    Excepting apps that can actually benefit from going 64bit, I expect most to stay 32bit for at least the next 5 years.
  • swaaye - Saturday, June 18, 2011 - link

    Indeed. In the non-gamer realm, I know of people happy with 2003 Pentium 4s and Athlon XPs yet. I have no doubt that there are many people with even older hardware. This stuff tends to stick around until the PCs die and the owner is told it's not worth the money to upgrade. Fear of change and the simple lack of a true need to upgrade is the reason.
  • swaaye - Saturday, June 18, 2011 - link

    Oops. I meant that the owner is told it's not worth the money to fix the dead old hardware. But they do also tend to ask about upgrading their ancient box too.
  • Randomblame - Saturday, June 18, 2011 - link

    I was at office max the other day and a guy was screaming at a sales rep because they didn't carry any serial mice that supported his rig. I don't mean ps2 either. He was carrying around a busted up brown serial mouse. He said his rig came with windows 95 but last year he upgraded it to windows 98. Seriously. This is the world we live in.
  • EJ257 - Saturday, June 18, 2011 - link

    I still have my Compaq (that came with Win95, which I upgraded to Win98) running on a Pentium 133 with 32MB of EDO RAM and a 2.1GB HDD. It's sitting idle in my basement collecting dust at the moment. :D
  • Operandi - Sunday, June 19, 2011 - link

    But Steam is a good representation of those who could benefit from, and will ultimately be using, these future technologies: professionals and enthusiasts. Such is always the way of high-end computing.
  • softdrinkviking - Monday, June 20, 2011 - link

    exactly. people still running XP are probably not the target market for developers because if they are so slow on the uptake of new technology, it would follow that they are also relatively uninterested in other new programs.
  • Targon - Sunday, June 19, 2011 - link

    Nope, I am going on what my customers have and are upgrading to. If you BUY a machine with Windows 7 on it, 9 out of 10 have Windows 7 64 bit on them. Those that have 32 bit are either the very low-end machines with only 1GB of RAM (yes, they still sell those), or they are the result of doing an upgrade from Windows Vista 32 bit.

    That is the thing about 64 bit, people don't "go to 64 bit" at this point, they get a new computer that comes with 64 bit Windows on it. The number of people who do an upgrade on an older machine has dropped, since those who would have done the upgrade did that back in 2009 and early 2010 when Windows 7 first came out.

    Now, the real benefit to 64 bit isn't as much about the software as it is about how much RAM the machine comes with. If you get a machine with 4GB of RAM, you want 64 bit, just so you don't lose memory due to the 4GB limit on 32 bit Windows, and hardware mapping below the 4GB mark.

    A part of this is also about the area you live in, and how much money there is going around. I live in an area where it is the norm to pay over $8 per person for lunch at a deli, and as a result, the value of the dollar isn't as high. Spending $20/day just on lunch and minor expenses is the norm, so with that in mind, replacing a computer every 4-5 years, even for the non-technical is NORMAL. The last time I encountered Windows 95 or 98 was around 6 years ago.
  • UrQuan3 - Thursday, June 23, 2011 - link

    There is a little more benefit. A few of us were doing an internal benchmark of our software using VStudio 2010 and all the random hardware we have around: 32bit, 32bit + SSE2, and 64bit + SSE2. We found that across the board, 64bit is about 5-10% faster than 32bit + SSE2 and 5-20% faster than basic x86.

    However, a 64bit OS gave no benefit (or penalty) for a 32bit program. The same 32bit software ran the same speed on XP32, XP64, Vista32, Vista64, and 7-64.
