Last week NVIDIA released their first set of end-user OpenCL drivers. Previously, OpenCL drivers had only been available to developers on the NVIDIA side, and this continues to be the case on the AMD side. With NVIDIA’s driver release, the launch of AMD’s 5800 series, and some recent developments with OpenCL, this is a good time to recap the current state of OpenCL and what has changed since our OpenCL introductory article from last year.

A CPU & GPU Framework

Although we commonly talk about OpenCL alongside GPUs, it’s technically a hardware-agnostic parallel programming framework. Any device implementing OpenCL should be capable of running any OpenCL kernel, so long as developers query the device’s capabilities ahead of time so as not to spawn too many threads at once. And while GPUs (being the parallel beasts that they are) are the primary focus, OpenCL is also intended for use on CPUs and more exotic processors such as the Cell BE and DSPs.

What this means is that when it comes to the use of OpenCL on computers, we have two things to focus on: not only the use of OpenCL on GPUs, but also its use on CPUs. If Khronos has their way, OpenCL will become a commonly used framework on CPUs, both to take better advantage of multi-core chips (8-threaded i7, anyone?) and as a fallback mechanism for when OpenCL isn’t available on a GPU.

This also makes things tricky when it comes to who is responsible for what. AMD, for example, makes both GPUs and CPUs, and so is writing drivers for both. They are currently sampling their CPU driver as part of their latest Stream SDK (even though it is a GPU programming SDK), and their entire CPU+GPU driver set has been submitted to the Khronos Group for certification.

NVIDIA on the other hand is not a CPU manufacturer (Tegra aside), so they are only responsible for having a GPU OpenCL driver, which is what they have been giving to developers for months. They have submitted it to Khronos and it has been certified, and as we mentioned they have released it to the public as of last week. NVIDIA is not responsible for a CPU driver, and as such they are reliant on AMD and Intel for OpenCL CPU drivers. AMD likes to pick at NVIDIA for this, but ultimately it’s not going to matter once everyone finally gets up to speed.

Intel thus far is the laggard; they do not have an OpenCL implementation in any kind of public testing, for either CPUs or GPUs. For AMD GPU users this won’t be an issue, since AMD’s CPU driver will work on Intel CPUs as well. NVIDIA GPU users with Intel CPUs, however, will be waiting on Intel for a CPU driver. Do note that a CPU driver isn’t required to use OpenCL on a GPU, and indeed we expect the first significant OpenCL applications to be intended to run solely on GPUs anyhow. So it’s not a bad situation for NVIDIA, just one that needs to be solved sooner rather than later.

OpenCL ICD: Coming Soon

Unfortunately matters are made particularly complex by the fact that on Windows and Linux, writing an OpenCL program right now requires linking against a vendor-specific OpenCL driver. The code itself is still cross-platform/cross-device, but in terms of compiling and linking OpenCL has not been fully abstracted. It’s not yet at the point where it’s possible to write and run a single Windows/Linux program that will work with any OpenCL device. It would be the equivalent of requiring an OpenGL game (e.g. Quake) to have a different binary for each GPU vendor’s drivers.

The solution to this problem is that OpenCL needs an Installable Client Driver (ICD), just like OpenGL has. With an ICD, developers can link against it, and it will handle the duty of passing calls off to the vendor-specific drivers. However an ICD isn’t ready yet, and in fact we don’t know when it will be ready. NVIDIA - who chairs the OpenCL working group - tells us that the WG is “driving to get an ICD implementation released as quickly as possible”, but with no timetable attached to that. The effort right now appears to be on getting more OpenCL 1.0 implementations certified (NVIDIA is certified, AMD is in progress), with an ICD to follow.

Meanwhile Apple, in the traditional Apple manner, has simply done a runaround on the whole issue. When it comes to drivers, they shipped Snow Leopard with their own OpenCL CPU driver, and they have GPU drivers for both AMD and NVIDIA cards. Their OpenCL framework doesn’t have an ICD per se, but it has features that allow developers to query for devices and use any they like. It effectively accomplishes the same thing, but it’s only of use when writing programs against Apple’s framework. To Apple’s credit, at this moment they have the only complete OpenCL platform, offering CPU+GPU development and execution with a full degree of abstraction.

What GPUs Will Support OpenCL

One final matter is which GPUs will support OpenCL. While OpenCL is built around the capabilities of DirectX 10-class hardware, being DX10 compliant isn’t enough. Even among NVIDIA and AMD products, there is some DX10 hardware that won’t support OpenCL.

NVIDIA: Anything that runs CUDA will run OpenCL. In practice, this means anything in the 8-series or later that has 256MB or more of VRAM. NVIDIA has a full list here.

AMD: AMD will only be supporting OpenCL on the 4000 series and later. Presumably there is some feature in the OpenCL 1.0 specification that AMD didn’t implement until the 4000 series, but which NVIDIA has had since the launch of the 8-series. Given that AMD is giving Brook+ the heave-ho in favor of OpenCL, this means there will continue to be a limited selection of GPGPU applications that work on pre-4000 series cards.

End-User Drivers

Finally, to wrap this up, we have the catalyst of this story: drivers. As we previously mentioned, NVIDIA released their OpenCL-enabled 190.89 drivers to the public last week, which we’re happy to see even if the applications themselves aren’t quite ready. This was a special release outside of NVIDIA’s mainline driver releases, however, and as such it’s already out of date. NVIDIA released their 191.07 WHQL-certified driver set yesterday, and those drivers don’t include OpenCL support. So while NVIDIA is shipping an OpenCL driver to both developers and end-users, it’s going to be a bit longer until it shows up in a regular release.

AMD meanwhile is still in a developer-only beta, which makes sense given that they’re still waiting on certification. The estimate we’ve heard is that the process takes a month, so with AMD having submitted their drivers early last month, they should be certified soon if everything has gone well.



View All Comments

  • dragonsqrrl - Wednesday, October 07, 2009 - link

    Scali sounds like someone who is interested in GPGPU computing, more specifically wide OpenCL support across a mature platform, which opens up the GPU for applications other than gaming; definitely not like the scathing, often misinformed comments of SiliconDoc. And you're making the argument that it doesn't matter who's first to support new technologies, after you criticize Nvidia's "phantom GT300/Fermi" announcement for being late, while at the same time sympathizing with AMD's first-to-DX11 HD5870 launch? Your comment ended up sounding a lot more like SiliconDoc than anything Scali wrote.
  • Titanius - Thursday, October 08, 2009 - link

    You are a bit naive and you obviously cannot read between the lines. I didn't praise anyone; hell, if I use your argument that I praised someone, that would mean I praised AMD for their first-to-market DX11 cards and NVIDIA for their first-to-market OpenCL drivers, and I did neither. I simply mentioned facts, and how irrelevant this argument about AMD being late to the game is compared to NVIDIA also being late in another regard. If you don't understand that, well, you're a lost cause.

    As for mentioning the "phantom Fermi" comment, I am sorry you can't comprehend sarcasm. I'll stop making too much sense in the future.

    But the point of my comment stands, GET OVER IT!

    BTW, regarding Scali and SiliconDoc, they are both using the same type of argument and don't seem to stop arguing when everyone is trying to smarten them up, that is where I saw the similarity. As for me, I know when to stop...unlike some people.
  • Maian - Wednesday, October 07, 2009 - link

    I doubt it. Scali sounds like a developer anxious to develop OpenCL apps for whatever purpose, and possibly doesn't care about DX11 at all (e.g. not developing a game/gfx acceleration, or targeting a non-Windows platform). If I were in his shoes, I too would be annoyed at AMD, since they're "blocking" access to a substantial market share of GPUs with their late drivers. As a developer myself, I don't give a shit about which vendor is better - I care about what features I can play with and how much hardware in the market will support those features.
  • Scali - Tuesday, October 06, 2009 - link

    By the way, the missing feature in pre-4000 series GPUs from AMD is shared memory.
    Pre-4000 series GPUs also won't be able to use DirectCompute, even with CS4.x downlevel. Again nVidia's 8-series and higher will all support DirectCompute CS4.0.
  • bobvodka - Tuesday, October 06, 2009 - link

    Unfortunately, it's the SM5.0 profile which has the more useful things in it, such as the Interlock functions which (as I understand it) don't work on SM4.0 hardware when it comes to DirectCompute.

    From a gaming dev's point of view these are pretty vital for various processes (such as single-pass luminance or single-pass deferred lighting), which imo reduces the usefulness of DirectCompute on anything pre-SM5.0, certainly in the games sphere.
  • Scali - Wednesday, October 07, 2009 - link

    I don't think so, to be honest. Why would you need interlocking for deferred lighting?
    It is unfortunate though that there is no interlocking support in CS4.0, since only the original G80 series doesn't support it. G92 and later do have interlocking and various other additions which aren't exposed through OpenCL or DirectCompute.

    You actually see complaints about DirectCompute in the nVidia GPU Computing SDK, such as:
    // Notice: In CS5.0, we can output up to 8 RWBuffers but in CS4.x only one output buffer is allowed,
    // that way we have to allocate one big buffer and manage the offsets manually. The restriction is
    // not caused by NVIDIA GPUs and does not present on NVIDIA GPUs when using other computing APIs like
    // CUDA and OpenCL.
  • bobvodka - Wednesday, October 07, 2009 - link

    For the single-pass version, interlock functions are used to work out the depth range of a "tile" for processing and to accumulate which lights are affecting that "tile" before performing the lighting resolve.

    Granted, the process could be carried out without such things, it would probably require more passes however and generally be less efficient.

  • Scali - Wednesday, October 07, 2009 - link

    Depth of a tile? Are you now confusing tile-based rendering with deferred rendering, not to mention confusing compute shaders with conventional rendering techniques?
    And even then, I still don't see why interlock functions would be required.
  • bobvodka - Thursday, October 08, 2009 - link

    No, I'm not confusing anything; it was an idea put forward by Johan Andersson of DICE at the Siggraph 09 conference.

    You do the standard deferred rendering step for creating a g-buffer
    Then you dispatch compute shaders in blocks of 16 pixels aka a 'tile'
    Each thread then retrieves the z depth for its pixel from a g-buffer; interlockMin and interlockMax are then used to obtain the min/max z for that block of pixels
    The compute shader then goes on to calculate which lights intersect this 'tile' given the extents and min/max data (processing 16 lights at once)
    The light count is increased using an interlockAdd for each intersecting light and the light index is stored in a buffer
    Finally, the compute shader goes back to pixel processing, where each thread in the group sums the lighting for its pixel and writes out the final data.

    No confusion at all and a good example of how a compute shader can be used to calculate and output graphics.
  • Scali - Thursday, October 08, 2009 - link

    It's just one very specific example. It's a gross over-generalization to say that anything lower than CS5.0 is useless for graphics based on this single example.
    In fact, your entire hypothesis is wrong. You go from "if CS5.0 can do it better, then CS4.x is useless".
    The correct hypothesis would of course be "if CS4.x can do better than using only conventional geometry/vertex/pixel shaders, then CS4.x is useful".
