A Look at Altera's OpenCL SDK for FPGAs

Author: Rahul Garg

by Rahul Garg on October 9, 2013 8:00 AM EST

Posted in: OpenCL, FPGA, Altera


Conclusions: Altera's Offerings and Competitive Landscape

FPGA vendors have long preached the efficiency of reconfigurable hardware over general-purpose processors. However, many have rejected FPGAs as an option because of the programming challenges associated with them. CPUs, and even GPUs, have typically offered a much faster time to market and a larger pool of programmers. With OpenCL, FPGA vendors can compete on an equal footing as far as programming is concerned. Previously, choosing an FPGA over another accelerator required a significant resource commitment. OpenCL allows FPGAs to be treated as just another option, which lowers the risk and is potentially a game changer.

On a personal note, I am hoping for cheaper OpenCL-capable FPGAs to hit the market. Currently, OpenCL-capable FPGAs cost thousands of dollars. This is likely not an issue for the enterprise market typically targeted by FPGA vendors. However, OpenCL on FPGAs has not attracted as much mindshare as GPU computing. GPU vendors have a huge advantage in that anyone with a cheap laptop can start experimenting with and learning about GPUs. This easy and cheap access is what enabled GPU computing to take off. Whenever a computing technology has become cheaper or easier to program, it has enabled many creative products in fields the original technology makers never thought of. FPGAs have not yet reached that stage. While there is a community of FPGA enthusiasts, enabling OpenCL on cheaper FPGAs could grow this community many-fold.

Altera's OpenCL offering effectively promises customized hardware for your OpenCL kernels, and the claim is that FPGAs will be more efficient than CPUs or GPUs at many tasks. Applications that are not necessarily floating-point heavy, such as those relying on custom integer datatypes, heavy bit manipulation, or fixed-point calculations, are an area where FPGAs can shine because CPU and GPU hardware is not really tailored for them. The high-speed I/O connections available on an FPGA, with external bandwidth far outstripping that of other accelerators, are another advantage. I think streaming and filtering applications are an obvious niche that FPGAs can fill. On the other hand, accelerators such as Nvidia Tesla and Xeon Phi will likely continue to do well in many double-precision floating-point applications because they are heavily optimized for such use cases. Applications such as image processing or data visualization that can make use of dedicated graphics-related hardware are also best done on a GPU.
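To make the "custom integer datatypes" and "fixed-point" point concrete, here is a minimal software sketch of the kind of streaming filter discussed above. It is not Altera's API and not OpenCL code; it is a plain Python model, and the 12-bit width, the 8 fractional bits, and every function name in it are assumptions chosen purely for illustration. The point is that the arithmetic needs only a 12-bit datapath, a width that CPUs and GPUs do not offer natively but that an FPGA could, in principle, implement exactly.

```python
"""Software model of a streaming FIR filter on a custom 12-bit fixed-point type.

All widths and names here are illustrative assumptions, not part of any SDK.
"""

WIDTH = 12       # hypothetical sample width in bits (not a native CPU/GPU type)
FRAC_BITS = 8    # 1 sign bit, 3 integer bits, 8 fractional bits
MAX_VAL = (1 << (WIDTH - 1)) - 1   # largest representable value: 2047
MIN_VAL = -(1 << (WIDTH - 1))      # smallest representable value: -2048


def saturate(value):
    """Clamp an integer to the signed 12-bit range instead of wrapping."""
    return max(MIN_VAL, min(MAX_VAL, value))


def to_fixed(x):
    """Convert a float to fixed point, rounding to nearest and saturating."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a fixed-point value back to a float for inspection."""
    return q / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed-point values, then shift right to restore the scale."""
    return saturate((a * b) >> FRAC_BITS)


def fir_stream(samples, taps):
    """Yield one filtered output per input sample, as a streaming pipeline would."""
    window = [0] * len(taps)          # most recent sample first
    for sample in samples:
        window = [sample] + window[:-1]
        acc = 0
        for value, tap in zip(window, taps):
            acc += value * tap        # accumulate at full precision
        yield saturate(acc >> FRAC_BITS)
```

For example, with a two-tap averaging filter, `list(fir_stream([to_fixed(1.0), to_fixed(1.0), to_fixed(3.0)], [to_fixed(0.5), to_fixed(0.5)]))` gives `[128, 256, 512]`, which `to_float` maps back to 0.5, 1.0, and 2.0. On a CPU or GPU each of these operations would occupy a full 32-bit or wider unit; the argument in the paragraph above is that custom hardware can size the multipliers and adders to the bits actually needed.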

Finally, I would say I am cautiously optimistic about the prospect of using OpenCL on FPGAs. I am impressed by the theoretical potential. However, I would like to see third-party studies comparing OpenCL SDKs for FPGAs and general-purpose processors on various tasks to get a better understanding of the performance and power consumption of the various accelerator options. If you are evaluating GPUs or Xeon Phi for your application, you should definitely also evaluate OpenCL on FPGAs and compare its performance against the other options. OpenCL on FPGAs looks to be gaining steam. This will be an interesting space to watch in the near future, and it may very well be a turning point for wider adoption of FPGAs in various high-performance application segments.

56 Comments


kishonti - Wednesday, October 9, 2013 - link
We've tested Altera's OpenCL SDK : https://twitter.com/KishontiI/status/3647712482716...
Most of the more complex tests (with multiple kernels) had trouble fitting on the chip. As compilation takes literally hours, the compile/debug cycles are much harder to manage.
rahulgarg - Wednesday, October 9, 2013 - link
Thanks for the extremely informative datapoints!
Todd Thompson - Wednesday, October 9, 2013 - link
kishonti, thanks for posting/tweeting about your benchmark...you mention a long compile time...is this something that you think could be pushed to the cloud and compiled on more robust hardware...if so, is this something that you would actually do? I might be able to help if you are interested...thanks again
kishonti - Wednesday, October 9, 2013 - link
Currently the OpenCL SDK runs within Altera's Quartus toolchain, so I don't think it is possible to run this in the cloud. We used a relatively powerful 2 x 8 core Xeon workstation, but the compile process did not scale much: it used 1 or 2 cores most of the time.
Obviously we tested the code first on GPUs and CPUs (hundreds of them, actually), but it was still a trial-and-error process because only after several hours of number crunching do we learn whether our kernel fits or not. This could still be faster than building a VHDL model from scratch...
Jaybus - Thursday, October 10, 2013 - link
It is a decision problem, similar to the problem of routing traces on a chip such that the length of the longest trace is minimized, also known as the traveling salesman problem. So it belongs to a class of problems known as NP-complete. The NP stands for Non-deterministic Polynomial time. We express the complexity of most algorithms using "Big O" terminology, but we cannot do so for these problems due to their non-deterministic nature. Actually, whether or not it is even possible to solve these problems quickly is one of the principal unsolved problems of computer science. I'm not saying that the compiler doesn't come up with a correct solution, only that it must do so basically by brute-force trial and error. Deterministic problems can be broken into independent parts and processed in parallel. Not so for non-deterministic problems, and so it doesn't scale.

That said, it is possible to break it into parts, calculate in parallel, then check for conflicts. If a conflict is found, throw that one out and repeat until you find one without conflicts. There still is no way to determine if it is optimal, but you can repeat the process until you find N solutions and pick the best one. Currently, trial and error is the best solution. It could even be the only solution. Some very smart people are working on the problem, but nobody has a solution yet.
Alexey.Martin - Friday, November 8, 2013 - link
kishonti, do you have any actual results from Altera's OpenCL testing?
chowyuncat - Wednesday, October 9, 2013 - link
Is it viable to iteratively test on a GPU and only compile once at the end for an FPGA?
dneto - Wednesday, October 9, 2013 - link
Yes. See another of my comments.
tuxfool - Wednesday, October 9, 2013 - link
Not really. The GPU is running software. An FPGA, however, is effectively generating hardware to process a particular algorithm.

The generation of this hardware is subject to a great deal of optimization in terms of clock signals available, availability of logic cells etc.
tuxfool - Wednesday, October 9, 2013 - link
Well, apparently you can. But what happens when your program uses a kernel that is unsynthesizable on the user's FPGA? Any further iteration will need to be done using the FPGA... right?
