OpenCL 1.0: The Road to Pervasive GPU Computing

Name: OpenCL 1.0: The Road to Pervasive GPU Computing
Item: OpenCL 1.0: The Road to Pervasive GPU Computing
Author: Derek Wilson

by Derek Wilson on December 31, 2008 6:40 PM EST

Posted in
GPUs

37 Comments | Add A Comment

37 Comments

Open, Closed, Proprietary ... Sorting out the Confusion

Over the past few months, we've seen plenty of confusion over the direction NVIDIA and AMD are taking with respect to GPU computing. This isn't helped by either AMD or NVIDIA who both tend to tout the advantages of their approach and the disadvantages of the other guy's take on it.

AMD and supporters tend to claim that NVIDIA's CUDA is not optimal because it is not an open standard and that AMD supports openness because their solution (Brook+) is open source. But Brook+ isn't an open standard either: it was developed at Stanford University and hasn't been standardized. While the source for the Brook+ compiler is available, it would take a large investment to retool it for NVIDIA hardware. Even then, you'd need to build different versions of a program for AMD and NVIDIA platforms. The original GPGPU based Brook is a different story as it generated OpenGL code to do the GPGPU work, but modifying it to generate CAL code makes it very not interoperable and not very open or standard. At least as those terms are used when talking about languages, APIs and interoperability.

NVIDIA isn't much better though. They tend to act like anything AMD does is to copy them and amounts to nothing because CUDA for C is the gold standard for GPU computing and they don't have it, which just isn't the case. In fact, AMD started demonstrating concerted efforts to advance GPU computing before we saw anything from NVIDIA, and in much more interesting ways.

With R580 AMD (then ATI) actually published part of their ISA and called the initiative CTM (for Close to Metal). Before we had a beta version of CUDA, we had folding@home GPU accelerated on R520 and R580. Beyond that, CUDA for C has done really well in the HPC (high performance computing) space, but it hasn't caught on in the consumer space. Neither AMD nor NVIDIA have a viable consumer oriented solution for GPU computing.

So NVIDIA has the HPC market with CUDA and have gotten some universities to start teaching data parallel programming using CUDA for C. AMD could make an investment in the CUDA for C language and create either their own compiler (nothing is stopping them). But then you still have the same problem of interoperability as if NVIDIA implemented Brook+. If NVIDIA or AMD want to make their solution work with the other guy, they would need to write a wrapper to translate CAL to PTX or PTX to CAL. Or we could go a different direction and work on building an industry standard virtual ISA for data parallel architectures. But I doubt that effort would ever take off.

So the bottom line is that both AMD and NVIDIA support both proprietary (Brook+ and CUDA for C) and open standard (OpenCL) solutions. There are further differences between Brook+ and CUDA, but the important part is that these proprietary solutions are not ever going to be able to produce one binary that runs on both AMD and NVIDIA hardware both because of the approach used and the fact that AMD and NVIDIA aren't going to work closely enough to make something like that work. At least in the foreseeable future.

OpenCL, on the other hand, offers developers the ability to write an application once, compile it once, and expect it to run on all major GPU hardware. Something that could never happen with ether CUDA or Brook+.

Parallel Computing: Why We Need OpenCL Why NVIDIA Thinks CUDA for C and Brook+ Are Viable Alternatives

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

37 Comments

View All Comments

DerekWilson - Wednesday, December 31, 2008 - link
that is not possible -- when using the CPU to process data you need to copy it off the GPU ... when using the GPU to do processing, you need to copy data onto the GPU.

What you don't need to do is to worry about copying data from an OpenGL buffer that resides on the GPU to another buffer in order to work on it with OpenCL.

In DX11, you can share buffers between the Pixel Shader and the Compute Shader. Both of these are processed on the graphics card. You can do graphics work and general purpose compute work on the same data ... this is useful for effects physics, visualization of calculations, or complex shaders that might not be possible in the constraints of HLSL.

With OpenGL + OpenCL, you can do the same thing -- share data between graphics buffers and OpenCL buffers. But these buffers reside on the GPU.

In both DX11 and OpenCL, data must be moved off the GPU to process it with the CPU.

If OpenGL and OpenCL did not have binary level buffer compatibility, worst case we would need to copy OpenGL buffers off the GPU, convert them, copy them into OpenCL buffers in the correct format, and then re-upload the data to the GPU. Alternately, we could modify the buffer on the GPU, but that would still require processing power and incur a performance penalty.
Jaybus - Friday, January 2, 2009 - link
I think kevinkreiser was advocating a shared memory architecture, where CPU and GPU could access the same physical RAM, so that there would be no need to copy buffers. However, I disagree with such an approach, because that is only eliminating the buffer copy overhead by forcing the use of a global mutex or some other method of shared memory arbitration. The bottleneck would then become memory contention, offsetting any performance gained by eliminating the copy.
Loki726 - Friday, January 2, 2009 - link
PCIe latency is incredibly large compared to memory copy latency. For example, a synchronous copy of a single byte from CPU memory to GPU memory on an 8800GT using CUDA takes around 100k cpu cycles to complete, where non-cached CPU memory copies of the same size are in the order of 100-1000s of cycles. PCIe transfers only become fast when copying large chunks of data.

You are right that a shared memory architecture would require synchronization via some mechanism (mutex or other), but this would still be much faster than a DMA copy over PCIe for small data sizes if it was implemented correctly. There is no reason it should be any slower than sharing data between two threads in an SMP.

I think the reason why no one builds systems like this is because low latency access to a shared DRAM would require complex protocols between the GPU and CPU memory controllers to ensure memory consistency and coherence, and no one builds CPUs and GPUs that closely integrated.
DerekWilson - Saturday, January 3, 2009 - link
Some people built / build systems like this -- they are called game consoles ;-)
Loki726 - Saturday, January 3, 2009 - link
Good point Derek. The Xbox 360 supports tightly integrated CPU-GPU communication:

"The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache."[1]

[1] Andrews, J. and Baker, N. 2006. Xbox 360 System Architecture. IEEE Micro 26, 2 (Mar. 2006), 25-37
Wwhat - Monday, January 5, 2009 - link
Whatever happened to the hypertransport bus on motherboards and making graphics card for it? That would nicely cover both issues, plus since intel also is going in that direction they might agree with AMD at some far point in the future on a universal direct CPU transport bus connector.

Or perhaps the graphicscard makers should consider making a universal socket on their graphicscards that connects to the motherboard to a dedicated connector designed for DMA between a shared memory space with the CPU, a cache designed for shared GPU/CPU use, the advantage would be that people would yet again be forced to buy a new motherboard, and chipset, and that will keep the money rolling in ;]

Personally I think they should sit down in some room alone with themselves and think a bit until they realise having everybody doing their own propriety interfaces and systems is NOT a nice and positive and helpful and even economical way to go about thing and that making a plan then talking in a group with the 'opposition' and then tweaking it before releasing isn't such a bad idea and might actually lead to MORE profit and innovation.
Loki726 - Monday, January 5, 2009 - link
The interface you are thinking of is called HTX and there are some specialized products that use it. Hypertransport may be an open spec, but the memory transfer and coherence protocols used by AMD are not open. So it is not possible for a third party vendor to sit down and implement an HTX card that could work cooperatively with an AMD processor without negotiating a license from AMD. Intel's equivalent Quickpath is similar, but not even an open spec. PCIe is not an open spec either, but is controlled by a consortium that offers third parties pretty much equal opportunities to obtain a license.

Someone correct me if I'm wrong, but I'm not sure if dramatically reduced CPU/GPU memory copy latency would be useful for graphics applications. Games seem to scale just fine with the PCIe. Obviously there will be specific cases where it will be useful, but in general, the industry hasn't had a problem getting huge speedups over CPUs without it.

OpenCL 1.0: The Road to Pervasive GPU Computing

Open, Closed, Proprietary ... Sorting out the Confusion

Post Your Comment

37 Comments

View All Comments

DerekWilson - Wednesday, December 31, 2008 - link

Jaybus - Friday, January 2, 2009 - link

Loki726 - Friday, January 2, 2009 - link

DerekWilson - Saturday, January 3, 2009 - link

Loki726 - Saturday, January 3, 2009 - link

Wwhat - Monday, January 5, 2009 - link

Loki726 - Monday, January 5, 2009 - link

Log in

Don't have an account? Sign up now