Conclusions: Altera's Offerings and Competitive Landscape

FPGA vendors have long preached the efficiency of reconfigurable hardware over general-purpose processors. However, many have rejected FPGAs because of the programming challenges associated with them; CPUs, and even GPUs, typically offered a much faster time to market and a larger talent pool of programmers. With OpenCL, FPGA vendors can compete on an equal footing as far as programming is concerned. Previously, choosing an FPGA over another accelerator required a significant resource commitment. OpenCL lowers that risk by letting FPGAs be evaluated as just another option, and that is potentially a game changer.

On a personal note, I am hoping for cheaper OpenCL-capable FPGAs to hit the market. Currently, OpenCL-capable FPGAs run into thousands of dollars. That is likely not an issue for the enterprise market typically targeted by FPGA vendors, but it helps explain why OpenCL on FPGAs has not attracted as much mindshare as GPUs. GPU vendors have the huge advantage that anyone with a cheap laptop can start experimenting with and learning about GPUs, and that easy, cheap access is what allowed GPU computing to take off. Whenever a computing technology has become cheaper or easier to program, creative products have sprung up around it in fields its makers never anticipated. FPGAs have not yet reached that stage. There is already a community of FPGA enthusiasts, and enabling OpenCL on cheaper FPGAs could grow that community many-fold.

Altera's OpenCL offering effectively promises customized hardware for your OpenCL kernels, and the claim is that FPGAs will be more efficient than CPUs or GPUs at many tasks. Applications that are not floating-point heavy, for example those relying on custom integer datatypes, heavy bit manipulation, or fixed-point arithmetic, are an area where FPGAs can shine, because CPU and GPU hardware is not really tailored for them. The high-speed I/O connections available on an FPGA, with external bandwidth far outstripping other accelerators, are another advantage, and streaming/filtering applications are an obvious niche that FPGAs can fill. On the other hand, accelerators such as Nvidia Tesla and Xeon Phi will likely continue to do well in many double-precision floating-point applications, because they are heavily optimized for such use cases. Applications such as image processing or data visualization that can exploit the dedicated graphics hardware on GPUs are also best done on a GPU.
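
To make concrete what a bit-manipulation or fixed-point kernel looks like, here is a minimal OpenCL C sketch; the kernel name, the Q16.16 format, and the bit-reversal step are illustrative choices of mine, not anything from Altera's SDK. Because the kernel uses only narrow integer operations, an FPGA compiler can in principle build exactly the datapath it needs, whereas a CPU or GPU would run it on fixed-width general-purpose ALUs.

    // Hypothetical example: Q16.16 fixed-point multiply followed by a
    // bit-reversal step. Integer-only kernels like this map well onto an
    // FPGA's customizable datapath; names and formats are illustrative.
    __kernel void fixed_point_mac(__global const int *a,
                                  __global const int *b,
                                  __global int *out)
    {
        size_t i = get_global_id(0);

        // Q16.16 multiply: 32x32 -> 64-bit product, then shift back down.
        long prod = (long)a[i] * (long)b[i];
        int scaled = (int)(prod >> 16);

        // Example bit manipulation: reverse the bits of the result.
        uint v = as_uint(scaled);
        v = ((v & 0x55555555u) << 1) | ((v >> 1) & 0x55555555u);
        v = ((v & 0x33333333u) << 2) | ((v >> 2) & 0x33333333u);
        v = ((v & 0x0F0F0F0Fu) << 4) | ((v >> 4) & 0x0F0F0F0Fu);
        v = ((v & 0x00FF00FFu) << 8) | ((v >> 8) & 0x00FF00FFu);
        v = (v << 16) | (v >> 16);

        out[i] = as_int(v);
    }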

Finally, I am cautiously optimistic about the prospect of using OpenCL on FPGAs. The theoretical potential is impressive, but I would like to see third-party studies comparing OpenCL SDKs on FPGAs and general-purpose processors across a variety of tasks, to get a better picture of the performance and power consumption of the different accelerator options. If you are evaluating GPUs or Xeon Phi for your application, you should definitely also evaluate OpenCL on FPGAs and compare it against the other options. OpenCL on FPGAs looks to be gaining steam; this will be an interesting space to watch, and it may well prove to be a turning point for wider FPGA adoption in high-performance application segments.

Comments

  • ET - Tuesday, October 15, 2013

    Thanks. That's enlightening. Doesn't support a lot of stuff. (And the table is on page 22 in the document I found.)
  • jarjarbink - Tuesday, February 25, 2014

    Can you post a link to the document?
  • mike8675309 - Monday, October 14, 2013

    I've been following FPGAs for the past 9 months or so and have also been stymied by the costs. Most recently I have been following the development of the Parallella, a Kickstarter "supercomputer on a chip" development board. While not exactly an FPGA, it does have a supporting Zynq-7000 series dual-core ARM A9 with FPGA logic, and its main chip contains 16 cores that can operate concurrently on shared or different workloads. Not very mature for development yet, but OpenCL is one of the languages being targeted for it.
  • 0xc000005 - Saturday, October 19, 2013

    You didn't mention (afaict) one of the big benefits of using FPGAs - they require much less power (~ 10x less) than GPUs for the same amount of computation. Some readers may find this useful if they want to do gigantic computations under some type of cost constraint. Bitcoin miners moved from GPUs to FPGAs long ago since the cost of electricity is an important factor in Bitcoin mining.

    OpenCL isn't the only game in town for programming FPGAs either. Xilinx (Altera's main competitor) has a nice product called Vivado High-Level Synthesis where you can write your algorithms in C++. Whether this is a benefit or not remains to be seen - it's harder to design parallel algorithms in C++ than in OpenCL. It's important to be aware that there are a lot of algorithms that are not massively parallel, for which GPUs and OpenCL offer no speedup. This is where C++ is useful, since Vivado can take your sequential, easy-to-simulate algorithm and still make it N times faster - the value of N depends on your algorithm of course.

    More about which algorithms can be sped up using OpenCL can be found at http://www.hpcwire.com/hpcwire/2013-10-14/reprisin...
  • chaos215bar2 - Sunday, October 20, 2013

    Use of local memory in OpenCL generally also requires some kind of synchronization between the threads in a workgroup, as the entire point is to share information between these threads. (A basic example is loading some data into local memory so that all threads can operate on it, as in the sketch after this comment. A synchronization point is necessary to ensure that all threads have loaded the data they're responsible for before any others attempt to use it.)

    I'm curious how Altera handles this, since it doesn't map to the pipeline model you described in an obvious way. If, for instance, there's only room for n instances (via vectorization or replication) of the pipeline on the FPGA, does that mean the workgroup size is n? If not, how is the state of one thread saved while waiting for others to reach the synchronization point, and then subsequently restored?
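
To illustrate the pattern described in the comment above, here is a minimal OpenCL C sketch of the load-into-local-memory idiom; the kernel name and the simple three-point averaging are illustrative choices, not code from Altera's SDK. The barrier(CLK_LOCAL_MEM_FENCE) call is the synchronization point in question: no work-item in the group proceeds past it until every work-item has completed its store to local memory.

    // Illustrative kernel: each work-group stages a tile of the input in
    // local memory, synchronizes, then lets every work-item read data
    // loaded by its neighbours. Halo handling is omitted to keep it short.
    __kernel void box_average_1d(__global const float *in,
                                 __global float *out,
                                 __local float *tile)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        // Each work-item loads the element it is responsible for.
        tile[lid] = in[gid];

        // Synchronization point: all stores to local memory must finish
        // before any work-item reads data loaded by another work-item.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Average with neighbours inside the tile (edges reuse lid).
        size_t left  = (lid > 0)       ? lid - 1 : lid;
        size_t right = (lid + 1 < lsz) ? lid + 1 : lid;
        out[gid] = (tile[left] + tile[lid] + tile[right]) / 3.0f;
    }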
