OpenCL Extending OpenGL

OpenGL 3.0 was a disappointment to game developers who hoped the API would add some key features that ended up being left behind. With the latest release, Khronos relegated OpenGL to professional and workstation applications like CAD/CAM and 3D content creation software, forgoing the wants and desires of game programmers. While not ideal from our perspective (competition is always good), the move is understandable, as OpenGL hasn't been consistently used by any major game engine developer other than id Software for quite some time. DirectX is seen as the graphics API of choice for game programming, and it looks like it will remain that way for the foreseeable future.

But OpenCL does bring an interesting element to the table. One of the major advancements of DirectX 11 will be the addition of a compute shader to the pipeline. This compute shader will be general purpose and capable of operating on diverse data structures that pixel shaders are not geared towards. It will be capable of much the same kind of work as OpenCL, though it will be tuned and geared toward doing so in the context of graphics. It is, after all, still DirectX. In DX11, the pixel shader and compute shader will share data via data structures rather than any sort of formal input/output mechanism. Because of the high level of integration, game developers (and other graphics engine developers) will be capable of tightly combining current techniques with more general purpose code that can handle a broader array of algorithms.

OpenGL doesn't have anything like this in the works, but OpenCL fixes that. OpenCL is capable of sharing data with OpenGL. And we aren't talking about copying data back and forth easily; we are talking about physically sharing data structures and memory locations. This essentially adds a compute shader to OpenGL for those who want it. Why is that the case? Well, offering OpenCL users a means of using OpenGL images and buffers as OpenCL images and buffers means that OpenGL and OpenCL can share data with no copy or conversion overhead. This means that not only are OpenGL and OpenCL able to work on the same data, but that the method by which they communicate is very similar to what DX11 does to allow the passing of data between pixel and geometry shaders.
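
To make the difference between sharing and copying concrete, here is a minimal CPU-side analogy in Python. It is only a sketch: a `bytearray` stands in for a buffer in graphics memory, and the names `gl_view` and `cl_view` are hypothetical stand-ins, not OpenGL or OpenCL calls.

```python
# CPU-side analogy only: a bytearray stands in for a buffer in GPU memory.
# "gl_view" and "cl_view" are hypothetical names, not OpenGL or OpenCL calls.

vertex_buffer = bytearray(4)          # the one physical allocation

# Shared path: both "APIs" get a view onto the same bytes, no copy is made.
gl_view = memoryview(vertex_buffer)
cl_view = memoryview(vertex_buffer)

cl_view[0] = 42                       # the "compute" side writes in place
shared_result = gl_view[0]            # the "graphics" side sees it immediately

# Copy path: the compute side works on a private duplicate instead.
cl_copy = bytearray(vertex_buffer)    # explicit copy of all four bytes
cl_copy[1] = 7                        # this write lands in the duplicate only
stale_result = vertex_buffer[1]       # original is unchanged until copied back

print(shared_result, stale_result)    # prints "42 0"
```

In the shared case there is nothing to synchronize except the order of access; in the copy case every update costs a transfer in each direction before the other side can see it.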

While game developers may be intrigued, the professional app developers may have more of a reason to get excited. Sure, this will allow OpenGL game developers to use a compute-shader-like option, but it gives professional application developers the ability to actually combine the real work of simulation or data manipulation with visualization. With support for double precision in hardware that supports it, this could be useful for applications where a lot of real work needs to be done both on the thing being visualized and the visualization itself. This could speed things up quite a bit and allow fluid realtime visualization and manipulation of much more complicated data sets.

Additionally, this compute shader will work on hardware not specifically designed as DX11 class hardware. DX11, as a strict superset of DX10, will extend some functionality to DX10 hardware; we aren't yet certain about the specifics, but this may include compute shader (CS) functionality. On top of this, OpenCL should get drivers in the first quarter of next year. This puts the combination of OpenGL 3.0 plus OpenCL 1.0, for the first time in a long time, ahead of DirectX in terms of technology and capability. This is by no means a result of the sluggish and non-innovative OpenGL ARB. But maybe this will inspire more use of OpenGL, which maybe will inspire more innovation from the ARB. But I'm not going to hold my breath on that one.

In any case, the fact that OpenGL and OpenCL can share data without requiring a copy or conversion is a key feature. Not only will OpenCL allow developers to use the GPU for general purpose computing, but using OpenCL with OpenGL will help build a bridge between data parallel computing and visualization. Existing solutions like CUDA and Brook+ haven't done very well in this area, and using OpenGL or DirectX for data parallel processing makes it difficult to get work done efficiently. OpenCL + OpenGL solves these problems.

And maybe we'll even see things go the other way as well. Maybe developers doing massive amounts of parallel data processing using OpenCL not formerly interested in "seeing" what's happening will find it easy and beneficial to enable advanced visualization of their data or the processing thereof through integration with OpenGL. However they are used together, OpenCL and OpenGL will definitely both benefit from their symbiotic relationship.


  • DerekWilson - Wednesday, December 31, 2008

    That is not possible -- when using the CPU to process data you need to copy it off the GPU ... when using the GPU to do processing, you need to copy data onto the GPU.

    What you don't need to do is to worry about copying data from an OpenGL buffer that resides on the GPU to another buffer in order to work on it with OpenCL.

    In DX11, you can share buffers between the Pixel Shader and the Compute Shader. Both of these are processed on the graphics card. You can do graphics work and general purpose compute work on the same data ... this is useful for effects physics, visualization of calculations, or complex shaders that might not be possible in the constraints of HLSL.

    With OpenGL + OpenCL, you can do the same thing -- share data between graphics buffers and OpenCL buffers. But these buffers reside on the GPU.

    In both DX11 and OpenCL, data must be moved off the GPU to process it with the CPU.

    If OpenGL and OpenCL did not have binary level buffer compatibility, worst case we would need to copy OpenGL buffers off the GPU, convert them, copy them into OpenCL buffers in the correct format, and then re-upload the data to the GPU. Alternatively, we could modify the buffer on the GPU, but that would still require processing power and incur a performance penalty.
  • Jaybus - Friday, January 02, 2009

    I think kevinkreiser was advocating a shared memory architecture, where CPU and GPU could access the same physical RAM, so that there would be no need to copy buffers. However, I disagree with such an approach, because that is only eliminating the buffer copy overhead by forcing the use of a global mutex or some other method of shared memory arbitration. The bottleneck would then become memory contention, offsetting any performance gained by eliminating the copy.
  • Loki726 - Friday, January 02, 2009

    PCIe latency is incredibly large compared to memory copy latency. For example, a synchronous copy of a single byte from CPU memory to GPU memory on an 8800GT using CUDA takes around 100k CPU cycles to complete, where non-cached CPU memory copies of the same size are on the order of hundreds to thousands of cycles. PCIe transfers only become fast when copying large chunks of data.

    You are right that a shared memory architecture would require synchronization via some mechanism (mutex or other), but this would still be much faster than a DMA copy over PCIe for small data sizes if it was implemented correctly. There is no reason it should be any slower than sharing data between two threads in an SMP.

    I think the reason why no one builds systems like this is because low latency access to a shared DRAM would require complex protocols between the GPU and CPU memory controllers to ensure memory consistency and coherence, and no one builds CPUs and GPUs that closely integrated.
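
The latency argument in this comment can be sketched with a simple linear cost model: a fixed setup latency plus a per-byte cost. This is an illustration under stated assumptions, not a measurement. The fixed latencies follow the rough figures quoted in the comment; the per-byte costs and the function names are made up for the example.

```python
def transfer_cycles(num_bytes, fixed_latency_cycles, cycles_per_byte):
    """Linear cost model: fixed setup latency plus a per-byte cost."""
    return fixed_latency_cycles + num_bytes * cycles_per_byte


def latency_share(num_bytes, fixed_latency_cycles, cycles_per_byte):
    """Fraction of the total transfer time spent on fixed latency."""
    total = transfer_cycles(num_bytes, fixed_latency_cycles, cycles_per_byte)
    return fixed_latency_cycles / total


# Fixed latencies follow the rough figures quoted in the comment.
PCIE_LATENCY = 100_000       # ~100k cycles for a synchronous 1-byte copy
CPU_LATENCY = 1_000          # upper end of the quoted CPU copy range

# Per-byte costs are placeholders chosen only for illustration (assumption).
PCIE_CYCLES_PER_BYTE = 1.0
CPU_CYCLES_PER_BYTE = 1.0

for size in (1, 1_024, 1_048_576, 67_108_864):
    share = latency_share(size, PCIE_LATENCY, PCIE_CYCLES_PER_BYTE)
    print(f"{size:>10} bytes: {share:.1%} of PCIe transfer time is fixed latency")
```

Under these assumptions the fixed latency dominates small transfers and becomes negligible for large ones, which is the sense in which PCIe transfers only become fast for large chunks of data.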
  • DerekWilson - Saturday, January 03, 2009

    Some people built / build systems like this -- they are called game consoles ;-)
  • Loki726 - Saturday, January 03, 2009

    Good point Derek. The Xbox 360 supports tightly integrated CPU-GPU communication:

    "The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache."[1]

    [1] Andrews, J. and Baker, N. 2006. Xbox 360 System Architecture. IEEE Micro 26, 2 (Mar. 2006), 25-37

  • Wwhat - Monday, January 05, 2009

    Whatever happened to the HyperTransport bus on motherboards and making graphics cards for it? That would nicely cover both issues, plus since Intel also is going in that direction they might agree with AMD at some far point in the future on a universal direct CPU transport bus connector.

    Or perhaps the graphics card makers should consider making a universal socket on their graphics cards that connects to the motherboard via a dedicated connector designed for DMA between a shared memory space with the CPU, a cache designed for shared GPU/CPU use. The advantage would be that people would yet again be forced to buy a new motherboard, and chipset, and that will keep the money rolling in ;]

    Personally I think they should sit down in some room alone with themselves and think a bit until they realise that having everybody doing their own proprietary interfaces and systems is NOT a nice and positive and helpful and even economical way to go about things, and that making a plan, then talking in a group with the 'opposition', and then tweaking it before releasing isn't such a bad idea and might actually lead to MORE profit and innovation.
  • Loki726 - Monday, January 05, 2009

    The interface you are thinking of is called HTX and there are some specialized products that use it. HyperTransport may be an open spec, but the memory transfer and coherence protocols used by AMD are not open. So it is not possible for a third party vendor to sit down and implement an HTX card that could work cooperatively with an AMD processor without negotiating a license from AMD. Intel's equivalent, QuickPath, is similar, but not even an open spec. PCIe is not an open spec either, but is controlled by a consortium that offers third parties pretty much equal opportunities to obtain a license.

    Someone correct me if I'm wrong, but I'm not sure if dramatically reduced CPU/GPU memory copy latency would be useful for graphics applications. Games seem to scale just fine with PCIe. Obviously there will be specific cases where it will be useful, but in general, the industry hasn't had a problem getting huge speedups over CPUs without it.
