OpenCL 1.0: The Road to Pervasive GPU Computing
by Derek Wilson on December 31, 2008 6:40 PM EST
Why is Parallel Computing Hard?
There are plenty of issues with parallel programming. Breaking up the problem is often the most important and complex step, especially when the parallelism is not obvious. As we are rooted in a world of sequential programming, conceptualizing the parallelization of tasks that lend themselves to sequential programming is tough. This can require not only the reworking of code, but redesigning the entire process of solving a problem.
Even in problems that lend themselves to parallelism, exploiting the parallelism can be tough. Even if you know the best and fastest algorithm for solving a data parallel problem, it isn't always possible to translate that into an efficient program. For instance, if I want to multiply two 100k x 100k matrices, I can't just spawn all the threads I would need. If I used POSIX threads to calculate one cell of the result matrix each, I would need ten billion of them, and I would spend more time creating threads and allocating resources than actually doing the computation. I've got to take the resources I have and use them to the best of my ability. Though I can do matrix multiplication in parallel, I have to be careful about how I break up the problem, and I can't exploit all the parallelism possible because of the tools I normally work with.
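To make the trade-off concrete, here is a minimal sketch of the "be careful about how I break up the problem" approach using POSIX threads. Instead of one thread per cell, it creates a small fixed number of workers and hands each one a band of rows. The names `band_t`, `multiply_band`, `matmul_banded` and `MAX_WORKERS` are made up for illustration, the matrices are assumed to be square and stored row-major, and this is just one reasonable way to split the work, not a tuned implementation.

```c
#include <pthread.h>

#define MAX_WORKERS 64  /* illustrative cap on worker threads (assumption) */

/* Work description for one thread: a band of rows of C = A * B. */
typedef struct {
    const float *a;
    const float *b;
    float *c;
    int n;          /* matrices are n x n, row-major */
    int row_begin;  /* first row this worker computes */
    int row_end;    /* one past the last row this worker computes */
} band_t;

/* Thread entry point: computes every cell in the assigned band of rows. */
static void *multiply_band(void *arg)
{
    const band_t *w = arg;
    for (int i = w->row_begin; i < w->row_end; i++) {
        for (int j = 0; j < w->n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < w->n; k++)
                sum += w->a[i * w->n + k] * w->b[k * w->n + j];
            w->c[i * w->n + j] = sum;
        }
    }
    return NULL;
}

/* Multiplies two n x n matrices using at most `workers` threads.
 * Returns 0 on success, -1 on bad arguments or if a thread
 * could not be created. */
int matmul_banded(const float *a, const float *b, float *c,
                  int n, int workers)
{
    pthread_t tid[MAX_WORKERS];
    band_t band[MAX_WORKERS];
    int started = 0;
    int status = 0;

    if (n <= 0 || workers < 1)
        return -1;
    if (workers > MAX_WORKERS)
        workers = MAX_WORKERS;
    if (workers > n)
        workers = n;  /* never more threads than rows */

    for (int t = 0; t < workers; t++) {
        /* Integer division spreads any leftover rows across the bands. */
        band[t] = (band_t){ a, b, c, n,
                            (int)((long)t * n / workers),
                            (int)((long)(t + 1) * n / workers) };
        if (pthread_create(&tid[t], NULL, multiply_band, &band[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    return status;
}
```

Calling `matmul_banded(a, b, c, n, 4)` pays the thread creation cost four times rather than once per cell. The catch is the one described above: the programmer has to pick the worker count and the split by hand, and most of the available parallelism goes unused.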
We are also limited in terms of hardware resources. With only a few processors available for general purpose programming, we couldn't get any speed up from parallelizing beyond a certain point even if the software overhead weren't an issue. This means not only that we can't exploit tons of parallelism even when the algorithm lends itself to it, but also that programmers are discouraged from thinking in terms of parallelism at all.
How Does OpenCL Help?
What if we had a pool of hardware resources hundreds wide that could handle thousands of threads in flight at a time with no software overhead? Well, we do: it's called a GPU. And if we could use the GPU for processing, then we could spawn a bunch of threads and really chew through the matrix multiplication we talked about earlier (or whatever). We might still have to be concerned about how many hardware resources we have in order to best map the problem to the specific device in the system. And we still have the problem of actually spawning, managing and running threads on the GPU hardware.
But what if we could write a special function, called a kernel, that can instantly be spawned hundreds or thousands or millions of times and run on different data, all without needing to create and manage the threads ourselves? And what if we didn't need to worry about how to break up our problem, and could leave the decision of how to allocate threads to the runtime? Well, now we have a solution: that's OpenCL.
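The shape of the kernel idea can be sketched in plain C. To be clear, this is a CPU-side emulation for illustration and not the OpenCL API: the names `matmul_cell_kernel` and `launch_matmul` are hypothetical, and the loop in `launch_matmul` merely stands in for the runtime. In real OpenCL the kernel would be written in OpenCL C, would read its position with `get_global_id()`, and would be queued with `clEnqueueNDRangeKernel()`, with the runtime deciding how the instances map onto the hardware.

```c
#include <stddef.h>

/* The "kernel": computes exactly one cell of C = A * B.
 * gid_row and gid_col play the role of the instance's global ID;
 * every instance runs the same code on different data.
 * Matrices are assumed square (n x n) and row-major. */
static void matmul_cell_kernel(const float *a, const float *b, float *c,
                               size_t n, size_t gid_row, size_t gid_col)
{
    float sum = 0.0f;
    for (size_t k = 0; k < n; k++)
        sum += a[gid_row * n + k] * b[k * n + gid_col];
    c[gid_row * n + gid_col] = sum;
}

/* Stand-in for the runtime: invokes the kernel once for every point
 * in an n x n index space. Here that is a sequential loop on the CPU;
 * a real runtime would spread these instances across the device. */
void launch_matmul(const float *a, const float *b, float *c, size_t n)
{
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            matmul_cell_kernel(a, b, c, n, row, col);
}
```

The point of the split is that the kernel says nothing about threads, bands or worker counts. It only describes the work for a single cell, so scaling from four cells to ten billion changes the size of the index space and nothing else.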
The GPU is the vehicle for exploiting data parallelism. But before now our vehicle has run like a train on a track called real-time 3D graphics acceleration. OpenCL removes the track and the limitations and builds in a steering wheel developers can use to take the GPU (and other parallel devices) anywhere a programmer can imagine.