Rapid Packed Math: Fast FP16 Comes to Consumer Cards (& INT16 Too!)

Arguably AMD’s marquee feature from a compute standpoint for Vega is Rapid Packed Math, AMD’s name for packing two FP16 operations inside of a single FP32 operation in a vec2 style. This is similar to what NVIDIA has done with their high-end Pascal GP100 GPU (and Tegra X1 SoC), and it allows for potentially massive improvements in FP16 throughput. If a pair of instructions is compatible – and by compatible, vendors usually mean instruction-type identical – then those instructions can be packed together on a single FP32 ALU, doubling the number of lower-precision operations that can be performed in a single clock cycle. This is an extension of AMD’s FP16 support in GCN 3 & GCN 4, where the company supported FP16 data types for the memory/register space savings, but FP16 operations themselves were processed no faster than FP32 operations.
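To make the idea concrete, here is a minimal software sketch of vec2-style FP16 packing – my own illustration in Python/NumPy, not AMD’s actual hardware path or ISA; the function names are invented for the example:

```python
import numpy as np

def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in bits 0-15)."""
    lo_bits = np.float16(lo).view(np.uint16)
    hi_bits = np.float16(hi).view(np.uint16)
    return np.uint32(hi_bits) << np.uint32(16) | np.uint32(lo_bits)

def unpack_half2(word):
    """Unpack a 32-bit word back into its two FP16 halves."""
    lo = np.uint16(word & np.uint32(0xFFFF)).view(np.float16)
    hi = np.uint16(word >> np.uint32(16)).view(np.float16)
    return float(lo), float(hi)

def pk_add_f16(a, b):
    """One 'packed' add: both FP16 lanes of the 32-bit word are summed,
    the way a single FP32 ALU would process two half-precision ops at once."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(np.float16(a_lo) + np.float16(b_lo),
                      np.float16(a_hi) + np.float16(b_hi))

x = pack_half2(1.5, 2.25)
y = pack_half2(0.5, 0.75)
print(unpack_half2(pk_add_f16(x, y)))  # (2.0, 3.0)
```

The point of the exercise: two FP16 values fit exactly in the register space of one FP32 value, which is what lets the hardware retire two half-precision operations per FP32 lane per clock.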

The purpose of integrating fast FP16 and INT16 math is all about power efficiency. Processing data at a higher precision than necessary needlessly burns power, as the extra work required for the increased precision accomplishes nothing of value. In this respect fast FP16 math is another step in GPU designs becoming increasingly min-maxed; the ceiling for GPU performance is power consumption, so the more energy efficient a GPU can be, the more performant it can be.

Taking advantage of this feature, in turn, requires several things. It requires API support and it requires compiler support, but above all it requires code that explicitly asks for FP16 data types. The reason why that matters is two-fold: virtually no existing programs use FP16s, and not everything that is FP32 is suitable for FP16. In the compute world especially, precisions are picked for a reason, and compute users can be quite fussy on the matter. Which is why fast FP64-capable GPUs are a whole market unto themselves. That said, there are whole categories of compute tasks where the high precision isn’t necessary; deep learning is the poster child right now, and for Vega Instinct AMD is practically banking on it.
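A quick NumPy experiment makes the precision concern concrete – these are standard properties of IEEE 754 half precision (a ~3 decimal digit significand and a maximum value of 65504), not anything Vega-specific:

```python
import numpy as np

# FP16 cannot represent 0.1 exactly; the stored value is 0.0999755859375.
print(float(np.float16(0.1)))

# At magnitude 2048 the spacing between representable FP16 values is 2,
# so adding 1 is lost entirely to rounding.
print(np.float16(2048) + np.float16(1))  # 2048.0

# Anything past 65504 overflows straight to infinity.
print(np.float16(70000.0))  # inf
```

Workloads that depend on exact accumulation or large dynamic range are exactly the ones compute users refuse to move off of FP32 (or FP64).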

As for gaming, the situation is more complex still. While FP16 operations can be used for games (and in fact are somewhat common in the mobile space), in the PC space they are virtually never used. When PC GPUs made the jump to unified shaders in 2006/2007, the decision was made to do everything at FP32 since that’s what vertex shaders typically required to begin with, and it’s only recently that anyone has bothered to look back. So while there is some long-term potential here for Vega’s fast FP16 math to become relevant for gaming, at the moment it doesn’t do much outside of a couple of benchmarks and some AMD developer relations enhanced software. Vega will, for the present, live and die in the gaming space primarily based on its FP32 performance.

The biggest obstacle for AMD here in the long-term is in fact NVIDIA. NVIDIA also supports native FP16 operations, but unlike AMD, they restrict it to their dedicated compute GPUs (GP100 & GV100). GP104, by comparison, offers a painful 1/64th native FP16 rate, making it just useful enough for compatibility/development purposes, but not fast enough for real-world use. So for AMD there’s a real risk of developers not bothering with FP16 support when some 70% of all GPUs sold don’t support it. It will be an uphill battle, but one that can significantly improve AMD’s performance if they can win it, and even more so if NVIDIA chooses not to budge on their position.

Though overall it’s important to keep in mind here that even in the best case scenario, only some operations in a game are suitable for FP16. So while FP16 execution is twice as fast as FP32 execution on paper for a given calculation, only a fraction of a game’s calculations can actually make the switch. AMD’s own slide deck illustrates this, pointing out that using 16-bit functions makes specific rendering steps of 3DMark Serra 20-25% faster, and those steps are just parts of a whole.
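A back-of-the-envelope Amdahl’s law sketch (my own illustration, not AMD’s figures) shows how quickly the whole-frame gain falls off as the FP16-friendly fraction of the work shrinks:

```python
def frame_speedup(p, fp16_speedup=2.0):
    """Overall speedup when only a fraction p of the frame's shader work
    moves to FP16 and that portion runs fp16_speedup times faster."""
    return 1.0 / ((1.0 - p) + p / fp16_speedup)

# Even at 50% FP16-suitable work, the frame is only ~33% faster overall.
for p in (0.25, 0.50, 0.75):
    print(f"{p:.0%} of work in FP16 -> {frame_speedup(p):.2f}x overall")
```

This is why a doubled FP16 rate on paper translates into per-step gains like the 20-25% AMD cites, and into something smaller still at the whole-frame level.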

Moving on, AMD is also offering limited native 8-bit support via a pair of specific instructions. On Vega, the Quad Sum of Absolute Differences (QSAD) and its masked variant can be executed in a highly packed form using 8-bit integers. SADs are a rather common image processing operation, and they are particularly relevant for AMD’s Instinct efforts since they are used in image recognition (a major deep learning task).
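For reference, here is the underlying SAD math on 8-bit blocks as a plain NumPy sketch – the scalar version of what QSAD performs in packed form, with an illustrative helper rather than AMD’s instruction semantics:

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two 8-bit blocks."""
    a = block_a.astype(np.int16)  # widen so the subtraction can't wrap
    b = block_b.astype(np.int16)
    return int(np.abs(a - b).sum())

a = np.array([[10, 20], [30, 40]], dtype=np.uint8)
b = np.array([[12, 18], [30, 45]], dtype=np.uint8)
print(sad(a, b))  # 2 + 2 + 0 + 5 = 9
```

A small SAD means two image blocks look alike, which is why the operation shows up everywhere from video motion estimation to image recognition.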

Finally, let’s talk about API support for FP16 operations. The situation isn’t crystal-clear across the board, but for certain types of programs, it’s possible to use native FP16 operations right now.

Surprisingly, native FP16 operations are not currently exposed in OpenCL, according to AIDA64. So within a traditional AMD compute context, it doesn’t appear to be possible to use them. This obviously isn’t planned to remain the case, and while AMD wasn’t able to offer more details by press time, I expect that they’ll expose FP16 operations under OpenCL (and ROCm) soon enough.

Meanwhile, Shader Model 5.x in HLSL, which is used in DirectX 11 and 12, does support native FP16 operations, and so does Vulkan, for that matter. So it is possible to use FP16 right now, even in games. Running SiSoftware’s Sandra GPGPU benchmark with a DX compute shader shows a clear performance advantage, albeit not a complete 2x: the switch to FP16 improves compute throughput by 70%.

However based on some other testing, I suspect that native FP16 support may only be enabled/working for compute shaders at this time, and not for pixel shaders. In which case AMD may still have some work to do. But for developers, the message is clear: you can take advantage of fast FP16 performance today.


214 Comments


  • npz - Monday, August 14, 2017 - link

    My point was that since most modern games have received enhancements for PS4 Pro and more will be moving forward -- given it's the engine the devs use -- and that the vast majority are cross platform, then major PC games will already have a built-in fp16 optimization path to be taken advantage of.

    Also don't forget Scorpio's arrival, which will likely feature the same, so there would be even more incentive for using this on PC
    Reply
  • Yojimbo - Tuesday, August 15, 2017 - link

    From what I have heard, Scorpio will not contain double rate fp16.

    And I am not sure about your claim that most modern game engines have been enhanced to take advantage of double rate fp16. I highly doubt that's true. Maybe a few games have cobbled in code to take advantage of low-hanging fp16 fruit.

    As far as AMD's "advantage", don't forget that NVIDIA had double rate FP16 before AMD. They left it out of Pascal to help differentiate their various data center cards (namely the P100 from the P40) in machine learning tasks. But now that the Volta GV100 has tensor cores it's not necessary to restrict double rate FP16 to only the GV100. For all we know double rate FP16 will be in their entire Volta lineup.
    Reply
  • Yojimbo - Tuesday, August 15, 2017 - link

    edit: I meant to say "They left it out of mainstream Pascal..." (as in GP102, GP104, GP106, GP107, GP108) Reply
  • Santoval - Tuesday, August 15, 2017 - link

    I am almost 100% certain that consumer Volta GPUs will have double rate FP16 disabled, and completely certain that they will have tensor cores disabled. Otherwise Nvidia would kiss the super high margins of their professional GPU cards goodbye, and Nvidia is never going to do that. Tensor cores were largely added so that Nvidia can compete with Google's TPU in the AI / deep learning space. Google still does not sell the TPU, but that might change. Unlike Google's TPU, which can be used only for AI inference, Volta's tensor cores will do both inference and training, and that is very important for this market. Reply
  • Yojimbo - Wednesday, August 16, 2017 - link

    Well, my point was that since they have tensor cores they can afford to have double rate FP16, so of course I agree that there will not be tensor cores enabled on consumer Volta cards. If the tensor cores give significantly superior performance to simple double rate FP16 (and NVIDIA's benchmarks show that they do), then why would NVIDIA need to wall off simple double rate FP16 to protect their V100 card? As much as NVIDIA want to protect their margins, they also need to stave off competition. The tensor cores allow them to do both at once. They push forward the capabilities of the ultra high end (V100) while allowing double rate FP16 to trickle down to cheaper cards to stave off competition. I am not saying that I think they definitely will do it, but I see that the opportunity is there. Frankly, I think the reason they wouldn't do it is if they don't think the cost in power budget or dollars to implement it is worth the gain in gaming performance. Also, perhaps they want to create three tiers: the V100 with tensor cores, the Volta Titan X and/or Tesla V40 with double rate FP16, and everything else.

    As far as Google's TPUs, their TPU 2 can do training and inferencing. Their first TPU did only inferencing on 8 bit quantized (integer) networks. The TPU 2 does training and inferencing on FP16-based networks. The advantage NVIDIA's GPUs have are that they are general purpose parallel processors, and not specific to running computations for convolutional neural networks.
    Reply
  • Santoval - Tuesday, August 15, 2017 - link

    Nope, it was explicitly stated by MS that Scorpio's GPU will ship with Rapid Packed Math disabled. Why? I have no idea. Reply
  • Nintendo Maniac 64 - Tuesday, August 15, 2017 - link

    Codemasters apparently doesn't realize that the Tegra X1 used in the Nintendo Switch also supports fp16, so it's not something unique to the PS4 Pro... Reply
  • OrphanageExplosion - Tuesday, August 15, 2017 - link

    There was also FP16 support in the PlayStation 3's RSX GPU. Generally speaking, the PS3 still lagged behind Xbox 360 in platform comparisons.

    The 30% perf improvement for Mass Effect is referring to the checkerboard resolve shader, not the entire rendering pipeline.

    For a more measured view of what FP16 brings to the table, check out this post: http://www.neogaf.com/forum/showpost.php?p=2223481...
    Reply
  • Wise lnvestor - Tuesday, August 15, 2017 - link

    Did you even read the gamingbolt article? And look at the picture? When a dev talks about how much they saved in milliseconds, IT IS THE ENTIRE rendering pipeline. Reply
  • romrunning - Monday, August 14, 2017 - link

    6th para - "seceded" should be "ceded" - AMD basically yielded the high-market to Nvidia, not "withdraw" to Nvidia. :) Reply
