Earlier this month, the OpenCL specification was released by the Khronos Group. Khronos is a consortium of representatives from companies across the computing industry. The group focuses on creating and managing standards for graphics, multimedia, and parallel computing on everything from mobile devices to desktop and workstation computers. Part of Khronos' charge is OpenGL and all of its relatives with the Open- prefix, so the naming makes sense as well.

The goal of OpenCL is to make certain types of parallel programming easier and to provide vendor agnostic hardware accelerated parallel execution of code. That's a bit of a mouthful, but the bottom line is that OpenCL will give developers a common, easy to use set of tools to take advantage of any device with an OpenCL driver (processors, graphics cards, etc.) for the processing of parallel code.
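
To give a flavor of that vendor agnostic promise, here's a minimal sketch of our own (not code from the spec) showing how a program might ask OpenCL what devices are available; the same couple of calls will list a CPU, a GPU, or anything else with a driver installed:

    /* Sketch of OpenCL device discovery; error handling omitted.
       Requires an OpenCL SDK for CL/cl.h and the OpenCL library. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint i = 0; i < nplat; i++) {
            cl_device_id devices[8];
            cl_uint ndev = 0;
            /* CL_DEVICE_TYPE_ALL matches CPUs, GPUs, and accelerators alike */
            clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            for (cl_uint j = 0; j < ndev; j++) {
                char name[256];
                clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(name), name, NULL);
                printf("platform %u, device %u: %s\n", i, j, name);
            }
        }
        return 0;
    }

Any device whose vendor ships an OpenCL driver shows up through the same interface, which is the whole point.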

While there are already tools available that enable parallel processing, these tools are largely dedicated to task parallel models. The task parallel model is built around the idea that parallelism can be extracted by constructing threads that each have their own goal or task to complete. While most parallel programming is task parallel, there is another form of parallelism that can greatly benefit from a different model.

In contrast to the task parallel model, data parallel programming runs the same block of code on hundreds (or thousands or millions or ...) of data points. Whereas my video game may have threads for handling AI, physics, audio, game state, rendering, and possibly more finely grained tasks if I'm up to the challenge, a data parallel program doing something like image processing may spawn millions of threads, one to process each pixel. The way these threads are actually grouped and handled will depend on both the way the program is written and the hardware the program is running on.

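As a rough illustration (again a sketch of our own, not code from the spec), a data parallel kernel in OpenCL's C-based kernel language might look like the following: the code is written for a single pixel, and the runtime launches it across as many work-items (OpenCL's lightweight threads) as there are pixels:

    /* Sketch of an OpenCL C kernel: one work-item per pixel.
       How work-items are grouped onto hardware is the driver's job. */
    __kernel void grayscale(__global const uchar4 *src,
                            __global uchar4 *dst)
    {
        size_t i = get_global_id(0);   /* which pixel this work-item owns */
        uchar4 p = src[i];
        /* integer approximation of perceptual luminance */
        uchar g = (uchar)((77 * p.x + 151 * p.y + 28 * p.z) >> 8);
        dst[i] = (uchar4)(g, g, g, p.w);
    }

The host side would enqueue this kernel with a global work size equal to the pixel count, and whether those millions of work-items run on a few CPU cores or hundreds of GPU execution units is the driver's problem, not the programmer's.
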
As we've said many times in the past, graphics is almost infinitely parallelizable. Millions of pixels on the screen can all act (mostly) independently of one another. Lightweight threads handle the calculation of everything that has to do with a particular pixel. As pixels get smaller and we pack more onto screens, there is more opportunity for parallel work. Graphics cards are currently the best data parallel processing engines we have available, and once OpenCL drivers are available, developers will have access to all that horsepower for any other data parallel tasks they see fit.

Now, it won't make sense to run a word processor on your graphics card, as there just isn't enough happening at once to take advantage of the hardware. Single threaded performance on a GPU isn't that great, especially compared to a general purpose CPU, and trying to run code that isn't massively parallel just isn't going to be a great idea. But there are plenty of things that can benefit from the GPU. Basically any multimedia processing can benefit, from video and audio decoding, editing, and encoding, to image manipulation, to helping speed up your math homework (brute force computation à la Maple, MATLAB, and Mathematica could certainly benefit from the GPU). There could be some interesting encryption and/or compression techniques born out of the data parallel approach as well.

The best applications of data parallel computing have likely not been seriously considered at this point, as it takes time to get from the availability of tools to the finished product, let alone the conception of ideas that have heretofore been precluded by the realities of parallel programming. But OpenCL isn't a miracle that will make everything speed up. Rather it is a vehicle by which developers will be able to make a small subset of tasks orders of magnitude faster using hardware that is already in most people's computers. Which is certainly nice. But let's take a closer look.

Parallel Computing: Why We Need OpenCL

Comments

  • DerekWilson - Wednesday, December 31, 2008

    That's not possible -- when using the CPU to process data, you need to copy it off the GPU, and when using the GPU to do processing, you need to copy data onto the GPU.

    What you don't need to do is worry about copying data from an OpenGL buffer that resides on the GPU to another buffer in order to work on it with OpenCL.

    In DX11, you can share buffers between the Pixel Shader and the Compute Shader. Both of these are processed on the graphics card, so you can do graphics work and general purpose compute work on the same data ... this is useful for effects physics, visualization of calculations, or complex shaders that might not be possible within the constraints of HLSL.

    With OpenGL + OpenCL, you can do the same thing -- share data between graphics buffers and OpenCL buffers. But these buffers reside on the GPU (there's a sketch of what this looks like at the end of this comment).

    In both DX11 and OpenCL, data must be moved off the GPU to process it with the CPU.

    If OpenGL and OpenCL did not have binary level buffer compatibility, worst case we would need to copy OpenGL buffers off the GPU, convert them, copy them into OpenCL buffers in the correct format, and then re-upload the data to the GPU. Alternately, we could modify the buffer on the GPU, but that would still require processing power and incur a performance penalty.
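
    To make that concrete, here's a rough sketch (illustrative names and setup, not verbatim from the spec) of the OpenGL + OpenCL sharing path via the cl_khr_gl_sharing extension; the CL buffer is created from an existing GL buffer object, so no data ever leaves the GPU:

        /* Sketch: run an OpenCL kernel on an existing OpenGL buffer.
           Assumes `context` was created with GL sharing enabled and
           `vbo` is a valid GL buffer object; error checks omitted. */
        #include <GL/gl.h>
        #include <CL/cl_gl.h>

        void process_gl_buffer(cl_context context, cl_command_queue queue,
                               cl_kernel kernel, GLuint vbo, size_t nelems)
        {
            cl_int err;
            cl_mem buf = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, vbo, &err);

            glFinish();  /* GL must be done with the buffer first */
            clEnqueueAcquireGLObjects(queue, 1, &buf, 0, NULL, NULL);

            /* run compute work in place on the graphics buffer */
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &nelems, NULL, 0, NULL, NULL);

            clEnqueueReleaseGLObjects(queue, 1, &buf, 0, NULL, NULL);
            clFinish(queue);  /* CL must be done before GL reuses it */
            clReleaseMemObject(buf);
        }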
  • Jaybus - Friday, January 2, 2009

    I think kevinkreiser was advocating a shared memory architecture, where CPU and GPU could access the same physical RAM, so that there would be no need to copy buffers. However, I disagree with such an approach, because that is only eliminating the buffer copy overhead by forcing the use of a global mutex or some other method of shared memory arbitration. The bottleneck would then become memory contention, offsetting any performance gained by eliminating the copy.
  • Loki726 - Friday, January 2, 2009

    PCIe latency is incredibly large compared to memory copy latency. For example, a synchronous copy of a single byte from CPU memory to GPU memory on an 8800GT using CUDA takes around 100K CPU cycles to complete, whereas non-cached CPU memory copies of the same size are on the order of 100-1000s of cycles. PCIe transfers only become fast when copying large chunks of data. (A rough way to measure this follows at the end of this comment.)

    You are right that a shared memory architecture would require synchronization via some mechanism (mutex or other), but this would still be much faster than a DMA copy over PCIe for small data sizes if it was implemented correctly. There is no reason it should be any slower than sharing data between two threads in an SMP.

    I think the reason why no one builds systems like this is because low latency access to a shared DRAM would require complex protocols between the GPU and CPU memory controllers to ensure memory consistency and coherence, and no one builds CPUs and GPUs that closely integrated.
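
    For reference, a crude way to measure that small-transfer latency with the CUDA runtime API might look like the sketch below (our illustration; exact numbers will vary with GPU, driver, and chipset):

        /* Sketch: average latency of a 1-byte host-to-device copy,
           timed with CUDA events. Build with nvcc; error checks omitted. */
        #include <stdio.h>
        #include <cuda_runtime.h>

        #define ITERS 1000

        int main(void)
        {
            char host = 0;
            char *dev = NULL;
            cudaMalloc((void **)&dev, 1);

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            /* warm up so one-off driver initialization isn't timed */
            cudaMemcpy(dev, &host, 1, cudaMemcpyHostToDevice);

            cudaEventRecord(start, 0);
            for (int i = 0; i < ITERS; i++)
                cudaMemcpy(dev, &host, 1, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("avg 1-byte H2D copy: %f us\n", ms * 1000.0f / ITERS);

            cudaFree(dev);
            return 0;
        }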
  • DerekWilson - Saturday, January 3, 2009

    Some people built / build systems like this -- they are called game consoles ;-)
  • Loki726 - Saturday, January 3, 2009

    Good point Derek. The Xbox 360 supports tightly integrated CPU-GPU communication:

    "The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache."[1]

    [1] Andrews, J. and Baker, N. 2006. Xbox 360 System Architecture. IEEE Micro 26, 2 (Mar. 2006), 25-37.

  • Wwhat - Monday, January 5, 2009

    Whatever happened to the HyperTransport bus on motherboards and making graphics cards for it? That would nicely cover both issues, and since Intel is also going in that direction, they might agree with AMD at some far point in the future on a universal direct CPU transport bus connector.

    Or perhaps the graphics card makers should consider putting a universal socket on their cards that connects to a dedicated connector on the motherboard, designed for DMA between a memory space shared with the CPU and a cache designed for shared GPU/CPU use. The advantage would be that people would yet again be forced to buy a new motherboard and chipset, and that will keep the money rolling in ;]

    Personally I think they should sit down in some room alone with themselves and think a bit until they realise that everybody doing their own proprietary interfaces and systems is NOT a nice, positive, helpful, or even economical way to go about things, and that making a plan, then talking in a group with the 'opposition' and tweaking it before releasing, isn't such a bad idea and might actually lead to MORE profit and innovation.
  • Loki726 - Monday, January 5, 2009

    The interface you are thinking of is called HTX, and there are some specialized products that use it. HyperTransport may be an open spec, but the memory transfer and coherence protocols used by AMD are not open, so it is not possible for a third party vendor to sit down and implement an HTX card that could work cooperatively with an AMD processor without negotiating a license from AMD. Intel's equivalent, QuickPath, is similar, but not even an open spec. PCIe is not an open spec either, but it is controlled by a consortium that offers third parties pretty much equal opportunity to obtain a license.

    Someone correct me if I'm wrong, but I'm not sure dramatically reduced CPU/GPU memory copy latency would be useful for graphics applications. Games seem to scale just fine over PCIe. Obviously there will be specific cases where it will be useful, but in general, the industry hasn't had a problem getting huge speedups over CPUs without it.
