Efficiency versus flexibility is one of the fundamental tradeoffs in any engineering discipline, and computer architecture is no exception. For any given task, for example video decode, dedicated hardware is a more power-efficient solution than a software decoder running on a general purpose processor such as a CPU, or even on a GPU's SIMD arrays. Chips designed for a specific purpose are called Application Specific ICs, or ASICs. However, designing and manufacturing ASICs is difficult, and once a chip is deployed, you cannot use the dedicated silicon area for anything else.

FPGAs, or field programmable gate arrays, fall somewhere between general purpose processors such as CPUs and ASICs in the spectrum of programmability and efficiency.  FPGAs consist of a large array of logic blocks and memory cells. The logic blocks are typically small programmable lookup tables that can be used to compute simple logic functions. The connections between the cells are also reconfigurable. Multiple programmable logic blocks and the connections can be configured to create more complex units such as ALUs.
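To make the lookup-table idea concrete, here is a minimal software sketch, not a model of any real Altera logic block. The `LUT` class, the `SUM` and `CARRY` tables and the `ripple_add` helper are all hypothetical names invented for illustration. Each LUT is "programmed" by filling in its truth table, and two 3-input LUTs wired together per bit form a small adder, the kind of building block an ALU would be assembled from.

```python
from itertools import product


class LUT:
    """A k-input lookup table: one stored output bit per input combination."""

    def __init__(self, k, func):
        self.k = k
        # "Programming" the LUT: store the output bit for every input pattern.
        # product() enumerates patterns with the first input as the most
        # significant bit, which matches the index computed in __call__.
        self.bits = [func(*inputs) & 1 for inputs in product((0, 1), repeat=k)]

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.bits[index]


# Two 3-input LUTs configured as the sum and carry-out of a full adder.
SUM = LUT(3, lambda a, b, c: a ^ b ^ c)
CARRY = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))


def ripple_add(x, y, width=8):
    """Add two unsigned integers by chaining full adders, one per bit."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= SUM(a, b, carry) << i
        carry = CARRY(a, b, carry)
    return result, carry
```

Calling `ripple_add(100, 55)` returns the 8-bit sum together with the final carry-out bit. Changing the lambdas reprograms the same tables to compute a different function, which is the software analogue of reconfiguring the fabric.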

You can use the reconfigurability of the FPGA to turn it into a computing device specialized for your application. For example, consider an algorithm that only performs certain types of integer arithmetic. In this case, you can configure the FPGA to act as a large set of integer ALUs supporting just the integer operations your application needs. There is no need to waste any logic cells on floating point logic, and the integer ALUs can be customized for your application instead of being generic units. Thus, for some applications, an FPGA implementation can offer much higher performance/watt than a CPU or GPU implementation of the same algorithm. The efficiency of FPGAs comes partly from the fact that the hardware is reconfigured for your application.

On the other hand, the integer units in your FPGA will likely not be as efficient (in power or area) as an ASIC specifically designed and optimized for your application. However, unlike with an ASIC, if you decide to tweak your algorithm in the future, you can simply reflash your FPGA with the new program rather than going back to the drawing board to design, validate and manufacture new ASICs while throwing out the old ones. Some FPGAs even allow dynamic partial reconfiguration, where one part of the FPGA is reprogrammed while the rest remains active.

However, programming FPGAs has traditionally been difficult and requires expertise in specialized "hardware description languages" (HDLs) such as VHDL or Verilog. Other options, such as SystemC, have remained somewhat niche. There has been considerable interest in easier tools for programming FPGAs, and this is where OpenCL comes in. OpenCL is considerably easier to learn and use than tools like VHDL and Verilog, thus addressing one of the traditional weaknesses of FPGAs. Further, there are already university courses and industrial workshops teaching heterogeneous programming concepts in OpenCL or similar languages such as CUDA or C++ AMP, so the number of programmers familiar with OpenCL concepts is increasing quite rapidly.

While experts will likely continue using HDLs, OpenCL will enable many more programmers to use FPGAs. Even HDL experts may use OpenCL as a quick way to prototype their ideas on an FPGA. Interestingly, Xilinx (currently the biggest FPGA vendor) has also recently announced that it is working to bring OpenCL to its FPGAs, but no timeline has been given. In this article we are looking at Altera's OpenCL offering, which is already available and shipping.

I will add one caveat before proceeding further. My own expertise and experience are primarily in using OpenCL (and similar APIs) on GPUs and CPUs, not in FPGAs or HDLs. You can think of this article as a CPU/GPU programmer's view of the FPGA world. I don't have first-hand experience with Altera's SDK yet. This article is based on my reading of the Altera documentation and whitepapers, as well as various FPGA-related literature around the web. The folks from Altera were also a big help, as they answered many of my questions.

Altera's Products and Roadmap

Altera designs FPGA chips and sells them to partners and clients. CPU and GPU companies typically have multiple products differing in specifications such as number of cores, frequency, features and memory interfaces. Similarly, Altera offers multiple product lines and multiple products within each line. Products are differentiated along specifications such as the number of adaptive logic modules (ALMs), the type and size of on-chip memory, and external I/O bandwidth. FPGA vendors have also started including additional specialized blocks on-chip. For example, some of Altera's Stratix V FPGAs integrate DSP blocks and Cyclone V FPGAs integrate ARM CPU cores. FPGAs may also have high-speed on-chip transceivers to connect to external I/O devices such as video cameras, medical imaging devices, network devices and high-speed storage devices. Networking and high-speed streaming/filtering applications are particularly well suited to such devices.

A block diagram of Altera's Stratix V FPGA (source) showing the core logic fabric with logic blocks and interconnects, on-chip memory (m20k blocks), DSP blocks, transceivers and other I/O interfaces is shown below:

 

Altera partners will design a product, for example a PCIe based board, around the FPGA and may add their own customizations such as I/O interfaces supported on the board, peripherals as well as the size and bandwidth of associated onboard memory (if any).  PCIe based boards are far from the only way to deploy FPGAs and some customers may choose a custom solution. However, we will focus on the PCIe based use case for this article.

Altera's current-generation high-end product line is branded Stratix V and is currently their only product series to support OpenCL. The Stratix V series is built on a 28nm process at TSMC. Interestingly, FPGAs are typically among the first products to adopt new process technologies: Altera's 28nm products started shipping well before the first 28nm GPUs or mobile CPUs. Altera has also announced 20nm products (branded Arria 10). Significantly, Altera has announced a deal with Intel to fabricate its upcoming 14nm FPGAs, branded Stratix 10, in Intel's manufacturing facilities.

Altera introduced a private beta for OpenCL on FPGAs late last year, and the SDK has now been made public. Altera's implementation is built on top of OpenCL 1.0 but offers custom extensions to tap into the unique features of FPGAs. They are also adopting some features, such as pipes, from the OpenCL 2.0 provisional spec. More information can be found on Altera's OpenCL page. From a performance standpoint, Altera has posted whitepapers showing that FPGAs offer much higher performance/watt than CPUs and GPUs on some applications. Typical FPGA board power in Altera's studies is somewhere in the range of 20W, which is much lower than that of high-end discrete GPUs such as the Tesla series, which are often in the 200W range. Altera claims that, on some applications, OpenCL running on an FPGA will either outperform the GPU or match its performance at considerably lower power. Altera does not claim that this will be true for every application, but I do think it is a reasonable claim for some types of applications.
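As a rough back-of-the-envelope illustration of what those power figures imply, the sketch below uses only the approximate 20W and 200W board-power numbers quoted above. The throughput values are hypothetical placeholders rather than measured results, and the function names are made up for this example.

```python
FPGA_WATTS = 20.0   # approximate FPGA board power quoted above
GPU_WATTS = 200.0   # approximate high-end discrete GPU power quoted above


def perf_per_watt(throughput, watts):
    """Performance per watt, in whatever throughput unit the caller uses."""
    return throughput / watts


def breakeven_speedup(fpga_watts=FPGA_WATTS, gpu_watts=GPU_WATTS):
    """How many times faster the GPU must run to match the FPGA's perf/watt."""
    return gpu_watts / fpga_watts


def efficiency_ratio(fpga_throughput, gpu_throughput):
    """FPGA perf/watt divided by GPU perf/watt for the given throughputs."""
    return (perf_per_watt(fpga_throughput, FPGA_WATTS)
            / perf_per_watt(gpu_throughput, GPU_WATTS))


# Hypothetical case: the GPU is 4x faster than the FPGA on some workload.
ratio = efficiency_ratio(fpga_throughput=1.0, gpu_throughput=4.0)
```

Under these assumed numbers, `breakeven_speedup()` is 10.0: the GPU would need roughly ten times the FPGA's throughput just to tie on performance/watt. In the hypothetical 4x case, `ratio` comes out around 2.5 in the FPGA's favor. Real comparisons depend on the workload and on how power is measured.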

We will first go over some OpenCL terminology and how the concepts map to FPGAs. Next, we will look at some details of Altera's OpenCL implementation, and finally I will offer some concluding remarks.

OpenCL Programming Model and Suitability for FPGAs

56 Comments

  • MrSpadge - Wednesday, October 09, 2013

    BTW 2: David, you might want to contact Slicker, the admin of Collatz@Home. His project is fairly simple (and not that useful.. but people like it nevertheless) and has regularly been at the forefront of new technology (CUDA, ATI Stream, OpenCL, Intel GPUs..). Usually he's also very responsive. I could imagine a deal like: you give him access to your hardware, and if he succeeds you could get loads of publicity (attracting buyers and further developers) and quite a few sales.
  • viv32 - Wednesday, October 09, 2013

    Application driven reconfigurable hardware is an exciting idea. I am not sure how dense the FPGA should be to support the complexity of today's GPUs (if they want FPGAs to replace GPU ASICs). We design network processors and our FPGA emulation boards need at least 4 Stratix FPGAs for complete emulation. If the FPGA gate count can match the GPU then can they still be cost effective? My 2c.. please correct me if I'm wrong (I'm no FPGA jockey).
  • rahulgarg - Wednesday, October 09, 2013

    Well, it depends. The objective in this case isn't to emulate the GPU at all. If the GPU is actually already a very good fit for your application, then going to FPGAs won't gain you much. But let us say in an application that does not use the GPU's texture units, you don't really want to generate texture units on an FPGA. The idea isn't to emulate the GPU's units or its pipeline, rather it is to generate a *different* pipeline that is more suitable for your application.
  • wyx087 - Wednesday, October 09, 2013

    The benefit of using hardware description languages such as VHDL is just that: they describe the hardware, forcing you to think in terms of the cells that get placed down. OpenCL is a compute language; its programmers won't take into account something as simple as the fact that multipliers are very expensive in hardware unless done in powers of 2.

    Also, a vast number of university courses do VHDL/Verilog/SystemVerilog as standard. Electronics is the course title. I have no doubt the number of HDL experts is much more than OpenCL experts on this planet.

    The way around "slow compile" is simulation. I see no mention of simulation tools for designing OpenCL on FPGA. Without simulation tools, it is impossible for this to take off. Simulation is the way we verify our design on a functional level.

    The "compile" (it is known as synthesis and implementation or map, place and route) time is indeed in hours for large designs. Remember you are not just generating a binary for a processor, you are generating a binary file that describes the actual hardware. Put simply, you are generating THE processor.

    - Professional VHDL programmer
  • loki1725 - Sunday, October 13, 2013

    This is actually what I was going to say. When I was an undergrad in EE (1997-2001) our embedded electronics course used VHDL. I taught in the EE department of a different university from 2009 to 2013 and we offered several courses that used VHDL. While there may not be more VHDL courses than OpenCL courses, the numbers are probably comparable.

    Still, really cool article, and anything that helps drive the adoption of FPGAs is a good step forward.
  • toyotabedzrock - Wednesday, October 09, 2013

    So the compile time happens beforehand, but how long does it take for the FPGA to configure itself when you run a program?
  • rahulgarg - Wednesday, October 09, 2013

    Well, once you have done the compilation, my understanding is that flashing the binary is actually very fast, so that is not an issue.
  • John32 - Wednesday, October 09, 2013

    Are you saying Altera doesn't provide a simulation stage for testing functionality? That's done before generating the binary file for all designs. Generating the binary file is the last thing you do after verifying everything works functionally.

    You say Altera generates Verilog code, and then I assume that goes through their standard synthesis, place and route tools. I don't see why you can't do a software and "hardware" (i.e. the Verilog code) co-simulation. That's what is normally done during verification. I have C/C++ code that talks to the Verilog code. The C/C++ code is compiled to a binary file and the Verilog code is compiled within an HDL simulator software. Then the entire thing is simulated together. Once that checks out, I generate the binary file and load it into the FPGA. I use the same C/C++ code but now with the actual FPGA.
  • John32 - Wednesday, October 09, 2013

    Also, the whole "will it fit into the FPGA" issue is probably going to be a big problem for the likely target audience for this. You have no idea how the OpenCL code is being translated into hardware (ie. gates, LUTs, flip-flops, etc.). That all depends on your code and Altera's software to hardware algorithm.

    This reminds me of Xilinx's System Generator for MATLAB. It's a nice and easy way to get scientists to test their algorithms in hardware to see a ballpark figure of how fast it can be but it's definitely not the way to go for a final product.
  • John32 - Wednesday, October 09, 2013

    I guess there's also the "will it meet timing" problem. What clock speeds does Altera use? Do they just use whatever clock speed they can achieve (i.e. one design clocks at 400 MHz while another can only go 100 MHz)?
