With the launch of Kaveri, some people have been wondering whether the platform is suitable for HPC applications. Peak floating-point performance of the CPU and GPU on both fp32 and fp64 data types is one of the considerations. At launch time, the fp64 performance of Kaveri's GPU was not clear to us, but we now have official confirmation from AMD that it is 1/16th the rate of fp32 (similar to most GCN-based GPUs except the flagships), and we have verified this on our 7850K by running FlopsCL.

I am taking this opportunity to summarize what we know about Kaveri, Trinity, Llano and Intel's competing platforms, Haswell and Ivy Bridge, on both the CPU and GPU side. We provide a per-cycle estimate for each chip as well as the calculated peak in gflops. The estimates are chip-wide, i.e. they already take into account the number of cores or modules. Due to turbo boost, it was difficult to decide what frequency to use for peak calculations. For CPUs we use the base frequency, because in multithreaded and/or heterogeneous scenarios the CPU is less likely to turbo; for GPUs we use the boost frequency. In any case, we believe our readers are smart enough to calculate peaks at any frequency they want, given that we already supply per-cycle peaks :)
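For anyone who wants to redo the peak numbers at a different clock, the arithmetic is just the chip-wide flops per cycle multiplied by the frequency in GHz. A minimal sketch follows; the function name `peak_gflops` is my own, and the 4.0 GHz value is a hypothetical clock picked only for illustration.

```python
def peak_gflops(flops_per_cycle: float, freq_ghz: float) -> float:
    """Peak throughput in gflops: chip-wide flops per cycle times clock in GHz."""
    return flops_per_cycle * freq_ghz


# Kaveri 7850K CPU, AVX FMA fp32: 32 flops/cycle at the 3.7 GHz base clock.
print(round(peak_gflops(32, 3.7), 1))  # 118.4

# Same per-cycle figure at a hypothetical 4.0 GHz clock (illustration only).
print(round(peak_gflops(32, 4.0), 1))  # 128.0
```

The same function works for the GPU figures if you pass the GPU clock in GHz, e.g. 0.72 for 720 MHz.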

The peak CPU performance will depend on the SIMD ISA that your code was written and compiled for. We consider three cases: SSE, AVX (without FMA) and AVX with FMA (either FMA3 or FMA4).

 

CPU floating-point peak performance

Platform                  Kaveri    Trinity   Llano     Haswell   Ivy Bridge
Chip                      7850K     5800K     3870K     4770K     3770K
CPU frequency             3.7 GHz   3.8 GHz   3.0 GHz   3.5 GHz   3.5 GHz
SSE fp32 (/cycle)         16        16        32        32        32
SSE fp64 (/cycle)         8         8         16        16        16
AVX fp32 (/cycle)         16        16        -         64        64
AVX fp64 (/cycle)         8         8         -         32        32
AVX FMA fp32 (/cycle)     32        32        -         128       -
AVX FMA fp64 (/cycle)     16        16        -         64        -
SSE fp32 (gflops)         59.2      60.8      96        112       112
SSE fp64 (gflops)         29.6      30.4      48        56        56
AVX fp32 (gflops)         59.2      60.8      -         224       224
AVX fp64 (gflops)         29.6      30.4      -         112       112
AVX FMA fp32 (gflops)     118.4     121.6     -         448       -
AVX FMA fp64 (gflops)     59.2      60.8      -         224       -

It is no secret that AMD's Bulldozer family cores (Steamroller in Kaveri and Piledriver in Trinity) are no match for recent Intel cores in FP performance due to the shared FP unit in each module. As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller.
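To see where the per-cycle figures come from, here is a rough sketch. It rests on assumptions that are not spelled out in the table: that each Steamroller module has two shared 128-bit FMA pipes, that each Haswell core has two 256-bit FMA pipes, and that an FMA counts as two flops per lane. The function and parameter names are mine.

```python
def flops_per_cycle(units: int, pipes_per_unit: int, pipe_bits: int,
                    element_bits: int, fma: bool) -> int:
    """Chip-wide peak flops per cycle under a simple pipe-counting model.

    units          -- number of cores (Intel) or modules (AMD) on the chip
    pipes_per_unit -- FP pipes per core/module (assumed, see lead-in)
    pipe_bits      -- hardware width of each pipe in bits
    element_bits   -- 32 for fp32, 64 for fp64
    fma            -- True if each lane retires a fused multiply-add (2 flops)
    """
    lanes = pipe_bits // element_bits
    flops_per_lane = 2 if fma else 1
    return units * pipes_per_unit * lanes * flops_per_lane


# Kaveri 7850K: 2 modules, assumed 2 x 128-bit FMA pipes per module.
kaveri_fp32 = flops_per_cycle(2, 2, 128, 32, fma=True)

# Haswell 4770K: 4 cores, assumed 2 x 256-bit FMA pipes per core.
haswell_fp32 = flops_per_cycle(4, 2, 256, 32, fma=True)

# A single Haswell core under the same assumptions.
haswell_one_core = flops_per_cycle(1, 2, 256, 32, fma=True)

print(kaveri_fp32, haswell_fp32, haswell_one_core)  # 32 128 32
```

Under these assumptions a single Haswell core matches the two Steamroller modules in the 7850K, which is the comparison made above.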

Now onto GPU peaks. Here, for Haswell, we chose to include both GT2 and GT3e variants.

GPU floating-point peak performance

Platform                  Kaveri     Trinity    Llano      Haswell GT3e   Haswell GT2   Ivy Bridge
Chip                      7850K      5800K      3870K      4770R          4770K         3770K
GPU frequency             720 MHz    800 MHz    600 MHz    1.3 GHz        1.25 GHz      1.15 GHz
fp32/cycle                1024       768        800        640            320           256
fp64/cycle (OpenCL)       64         48**       0          0              0             0
fp64/cycle (Direct3D)     64         0?         0          160            80            64
fp32 gflops               737.3      614        480        832            400           294.4
fp64 gflops (OpenCL)      46.1       38.4**     0          0              0             0
fp64 gflops (Direct3D)    46.1       0?         0          208            100           73.6

The fp64 support situation is a bit of a mess because some GPUs only support fp64 under some APIs. The fp64 rate of Intel's GPUs does not appear to be published, but David Kanter provides an estimate of 1/4 the fp32 rate. However, Intel enables fp64 only under DirectCompute and not under OpenCL for any of its GPUs.
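The GPU fp64 rows can be reproduced from the fp32 rate and the fp64 ratio. The sketch below uses the 1/16 ratio confirmed for Kaveri and the estimated 1/4 ratio for Intel; the function name and its parameters are my own, and the Intel result is only as good as that estimate.

```python
def gpu_fp64_peak(fp32_per_cycle: int, fp32_to_fp64_divisor: int,
                  freq_ghz: float) -> tuple:
    """Return (fp64 flops per cycle, fp64 peak in gflops) for a GPU."""
    fp64_per_cycle = fp32_per_cycle // fp32_to_fp64_divisor
    return fp64_per_cycle, fp64_per_cycle * freq_ghz


# Kaveri 7850K: 1024 fp32 flops/cycle, fp64 at 1/16 rate, 720 MHz.
kaveri_cycle, kaveri_gflops = gpu_fp64_peak(1024, 16, 0.72)
print(kaveri_cycle, round(kaveri_gflops, 1))  # 64 46.1

# Haswell GT3e (4770R): 640 fp32 flops/cycle, estimated 1/4 rate, 1.3 GHz.
gt3e_cycle, gt3e_gflops = gpu_fp64_peak(640, 4, 1.3)
print(gt3e_cycle, round(gt3e_gflops, 1))  # 160 208.0
```

Keep in mind that these are hardware peaks; whether you can reach them at all depends on the API exposing fp64.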

The situation on AMD's Trinity/Richland is even more complicated. fp64 support under OpenCL is not standards-compliant and depends upon a proprietary extension (cl_amd_fp64). From what I can tell, Trinity/Richland do not support fp64 under DirectCompute (or under Microsoft's C++ AMP implementation). From an API standpoint, Kaveri's GCN GPU should work fine for fp64 under all APIs.
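In practice, OpenCL host code usually has to look at the device's extension string and pick the matching pragma before compiling fp64 kernels. Here is a minimal sketch of that selection logic. The helper name `fp64_pragma` is hypothetical, the extension strings in the example are made up for illustration, and on a real system the string would come from an OpenCL device query such as CL_DEVICE_EXTENSIONS.

```python
from typing import Optional


def fp64_pragma(device_extensions: str) -> Optional[str]:
    """Return the OpenCL C pragma that enables fp64, or None if unsupported.

    device_extensions is the space-separated extension list reported by
    the device.
    """
    names = set(device_extensions.split())
    # Prefer the Khronos-standard extension over AMD's vendor extension.
    for ext in ("cl_khr_fp64", "cl_amd_fp64"):
        if ext in names:
            return f"#pragma OPENCL EXTENSION {ext} : enable"
    return None


# Illustrative extension strings only, not captured from real drivers.
print(fp64_pragma("cl_khr_icd cl_khr_fp64 cl_amd_fp64"))
print(fp64_pragma("cl_khr_icd cl_amd_fp64"))
print(fp64_pragma("cl_khr_icd"))
```

The returned line would be prepended to the kernel source; a result of None means the kernel needs an fp32 fallback or a different device.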

Some of you might be wondering whether Kaveri is good for HPC applications. Applications that have already been ported to discrete GPUs and work well there will continue to run best on discrete GPUs. However, Kaveri and HSA will enable many more applications to be GPU accelerated.

Now we compare Kaveri against Haswell. In applications depending upon fp64 performance, conditions are not generally favorable to Kaveri. Kaveri's fp64 peak including both the CPU and GPU is only about 105 gflops (59.2 from the CPU with FMA plus 46.1 from the GPU), less than half of the 224 gflops available from Haswell's CPU cores alone. You will generally be better off first optimizing your code for AVX and FMA instructions and running on Haswell's CPU cores. If you are using Windows 8, you might also want to explore using Iris Pro through C++ AMP in conjunction with the CPU. Overall I doubt we will see Kaveri being used for fp64 workloads.
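As a quick sanity check on the combined fp64 figure, the sketch below simply adds the CPU (AVX FMA) and GPU (OpenCL) fp64 peaks from the two tables. The dictionary layout and names are mine; the Haswell GPU entry is 0 because the table lists no fp64 under OpenCL for Intel.

```python
# fp64 peaks in gflops, copied from the CPU (AVX FMA) and GPU (OpenCL) tables.
FP64_PEAK_GFLOPS = {
    "Kaveri 7850K": {"cpu": 59.2, "gpu": 46.1},
    "Haswell 4770K": {"cpu": 224.0, "gpu": 0.0},
}


def combined_fp64(chip: str) -> float:
    """Sum of CPU and GPU fp64 peaks for a chip, in gflops."""
    peaks = FP64_PEAK_GFLOPS[chip]
    return peaks["cpu"] + peaks["gpu"]


for chip in FP64_PEAK_GFLOPS:
    print(chip, round(combined_fp64(chip), 1))
```

This is a best-case sum; it assumes the CPU and GPU can both run at peak simultaneously, which real applications rarely achieve.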

For heterogeneous fp32 applications, Kaveri should outperform Haswell GT2 and Ivy Bridge. Haswell GT3e will again be a strong contender on Windows given the extremely capable Haswell CPU cores and Iris Pro graphics. Intel's GPUs do not currently support OpenCL under Linux, but a driver is being worked on. Thus, on Linux, Kaveri will simply win out on fp32 heterogeneous applications. However, even on Windows, Haswell GT3e will get strong competition from Kaveri. While AMD has advantages such as the excellent GCN architecture and the HSA software stack (when ready), enabling many more applications to take advantage of the GPU, Iris Pro has eDRAM to potentially provide much improved bandwidth, plus the backing of strong CPU cores.

I hope I have provided a fair overview of the FP capabilities of each platform. Application performance will of course depend on many more factors. Your questions and comments are welcome.

101 Comments

  • rahulgarg - Thursday, January 23, 2014

    AFAIK, the 256-bit units in Haswell can be used for non-FMA AVX ops as well.
  • kantian - Thursday, January 23, 2014

    True. A non-FMA AVX op will provide one 128 bit vector to one 256-bit unit at a time. But is it possible that it can provide two different 128 bit vectors in parallel, in order to take advantage of the full 256-bit unit potential? AFAIK, it is not.
  • rahulgarg - Thursday, January 23, 2014

    AVX includes 256-bit ops for both FMA and non-FMA. So there is a 256-bit add operation for example.
  • kantian - Thursday, January 23, 2014

    In that case you are right.
  • kantian - Friday, January 24, 2014

    Following our discussion so far, I think, you have errors in your numbers for the CPU floating-point peak performance of Ivy Bridge 3770K processor. The AVX FMA units in Ivy Bridge processors are 128 bit. Only Haswell ones are 256 bit, which gives the 4x multiplier to Steamroller numbers. That means the following numbers in the table are not correct:
    - AVX fp32 (/cycle) - 64, correct 32
    - AVX fp64 (/cycle) - 32, correct 16
    - AVX fp32 (gflops), correct 112
    - AVX fp64 (gflops), correct 56
  • BMNify - Thursday, January 23, 2014

    I don't see your point! It seems AMD were all over the shop while Intel made one change so far.
    https://en.wikipedia.org/wiki/FMA_instruction_set
    "May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.[9]
    October 2011: AMD Bulldozer processor supports FMA4.[10]
    January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.[11]
    May 2012: AMD Piledriver processor supports both FMA3 and FMA4.[10]
    June 2013: Intel Haswell processor supports FMA3.[12]
    It is currently uncertain whether the 3-operand VEX coded form (here called FMA3) or the 4-operand form (FMA4) will be the dominating standard in the future."

    The only thing that really matters, of course, is the fact that different compilers provide different levels of support for FMA4:
    GCC supports FMA4 with -mfma4 since version 4.5.0[13] and FMA3 with -mfma since version 4.7.0

    NASM supports FMA3 instructions since version 2.03 and FMA4 instructions since 2.06.
    YAsm supports FMA3 and FMA4 instructions since version 1.1.0.
  • kantian - Thursday, January 23, 2014

    The non-FMA AVX ops are currently the most widely used vector instructions in the x86 applications. The newer AVX2 ones are not widely adopted, and thus have just tiny share. The non-FMA AVX 128 bit operands are executed using 256 bit FMA units in Haswell, but take no advantage of those 256 bits, as the 256 bit FMA unit can execute only one 128 bit operand at a time. That's why the 256 bit FMA units in Haswell give performance advantage only for FMA AVX 256 bit ops (AVX2), but not for the widely adopted non-FMA AVX ops. That is what I think and can explain in simple terms.
  • milli - Friday, January 24, 2014

    It's because AMD originally planned to support SSE5 with BD.
    http://en.wikipedia.org/wiki/SSE5
  • silverblue - Saturday, January 25, 2014

    Well, AMD drew up SSE5, and instead had to implement it differently in order to offer compatibility with AVX. Has AMD ever created an instruction set that Intel has adopted, besides AMD64?
  • Th-z - Thursday, January 23, 2014

    AMD is shooting itself in the foot if it doesn't have a Kaveri with full GPU FP64 capability similar to 7970. Together with HSA, it should be powerful for a new breed of applications that require FP64. It's a window of opportunity for them to popularize this product in HPC. In gaming, it also requires a "killer app" that utilizes HSA and iGPU to assist new techniques in rendering, e.g renderings that require dependency, compute-based rendering, and interactive GPU physics, and coupled with a dGPU only for rendering.
