Last year, NVIDIA introduced its CUDA development package. After existing as a standalone download for a while, CUDA was eventually rolled into the driver itself. Today, AMD is following suit, rolling their own GPU computing package, called ATI Stream, into their Catalyst 8.12 driver. While the package has been available for a while now, AMD is really starting to push forward on the idea of using their hardware as a GPU computing platform. In both market penetration and branding, AMD is way behind NVIDIA and CUDA.

NVIDIA has been pushing forward in the HPC market very well with CUDA, and PhysX on the desktop uses CUDA to implement hardware accelerated physics. While AMD is a bit behind, we don't see them as hugely lagging either. NVIDIA has some good groundwork laid, but the market for GPU computing is still incredibly untapped. Both AMD and NVIDIA are in a good position to take advantage of GPU computing efforts when OpenCL and DirectX 11 come along, and we really do see this effort by both camps to sell GPUs using stream computing as a pitch that is very preliminary.

Software like Adobe's Photoshop performs GPU acceleration using OpenGL. We are at a place where, if a commercial application would significantly benefit from GPU acceleration, software companies still want to develop it once and just have it work. Targeting either NVIDIA through CUDA or AMD through Brook+ is too much of a headache for most software vendors. Having an API (or better yet a choice of APIs) targeted at hardware agnostic GPU computing will kick off the real revolution that both AMD and NVIDIA want their solutions to provide.

But ATI Stream isn't the only thing in Catalyst 8.12 that piqued our interest. To show off the inclusion of the new package, AMD built in a free video transcoder that actually makes use of GPU acceleration. It is somewhat limited (as is the Badaboom package that runs on NVIDIA hardware), but free is always a nice price. We will compare what you get with the Avivo Video Converter to Badaboom, but we must remind our readers that these are distinct applications that approach the problem of video encoding in different ways, and thus a direct comparison isn't as telling as it would be if we could run the same code on both hardware platforms. It's sort of like comparing Unreal Tournament 3 performance on AMD hardware to Enemy Territory: Quake Wars performance on NVIDIA hardware. We've got to look at what is being done and the quality of the output, and we must consider the fact that completely different approaches are likely to have been used by each development team.

The final bit of goodness in the 8.12 driver is a set of performance tweaks in some apps and fixes for performance and CrossFire in Far Cry 2. There were quite a number of unresolved issues we had to deal with, especially on our Core i7 systems, that we needed to run down. We'll let you know what issues remain in the following pages as well. This driver release has been highly anticipated not only by reviewers, but by consumers awaiting the merging of older hot fixes into a WHQL driver. We have high hopes, and we'll see if AMD has delivered something that meets our expectations.

But first, let's go a little deeper into ATI Stream and CUDA and take a look at the competing video transcoding offerings now available.


ATI and NVIDIA both know that ubiquitous GPU computing is the future for all data parallel tasks. Computer graphics is one of the most heavily parallel tasks around. It also happens to be a problem that easily lends itself to parallelization because of how independent the parallel tasks can be. These two factors, plus demand for high quality graphics, are what made the consumer GPU industry explode in its dozen years of existence. While the GPGPU (general purpose use of GPUs) philosophy has been around for a while, the use of the GPU for general data parallel tasks has been slow to catch on in the mainstream. This is not only because there were no specialized tools available (developers needed to shoehorn algorithms into OpenGL or DirectX and "draw" triangles to solve problems), but also because GPU architecture did not lend itself to the implementation of many useful data parallel algorithms.
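
To make the independence point concrete, here is a minimal sketch in plain Python (not GPU code; the names `brighten` and `shade_all` are made up for illustration). Each output pixel depends only on its own input, so the iterations could run in any order or all at once, which is the property graphics hardware exploits.

```python
def brighten(pixel, gain):
    """Per-pixel 'kernel': the result depends only on this one pixel."""
    return min(255, int(pixel * gain))


def shade_all(pixels, gain):
    # No iteration reads another iteration's result, so a GPU could
    # hand each element to a separate thread.
    return [brighten(p, gain) for p in pixels]


frame = [0, 64, 128, 200]  # hypothetical 8-bit pixel values
result = shade_all(frame, 1.5)
```

Processing the same pixels in reverse order gives the same value for each pixel, which is what makes this kind of work so easy to spread across many execution units.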

Hardware designed for graphics, until very recently, has meant floating point only acceleration of completely independent data points with heavy restrictions on reading and writing data. Only a small subset of problems really map to that kind of architecture. With DirectX 9 we only saw inklings of real programmability, and DirectX 10 class hardware has really brought the tools developers need to bear. We now have hardware that can handle not only floating point but integer and bitwise operations as well. This combines nicely with the fact that local and global data stores have been added for sharing data, and there is better support for non-sequential reads and writes. With DirectX 11, the package will be fairly feature complete, adding real support for data structures beyond simple arrays, optional double precision support and a whole host of other minor improvements that really add up for GPU computing (which is no surprise because DX11 will also feature a general purpose Compute Shader that doesn't need to tie work to triangles, vertices, fragments or pixels).
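
A histogram is a small example of a problem that fits poorly on the older model described above: it needs integer arithmetic, and each input element writes to a location chosen by the data itself (a scattered write), with many inputs potentially landing in the same bin. The following is a plain Python sketch of the access pattern only, with a hypothetical function name and made-up inputs; on a real GPU the shared counters would need atomic updates or per-block partial histograms.

```python
def histogram(values, num_bins, max_value):
    """Count how many values fall into each of num_bins equal ranges."""
    bins = [0] * num_bins
    for v in values:
        # Integer math picks the destination: a data-dependent write.
        idx = min(num_bins - 1, v * num_bins // (max_value + 1))
        bins[idx] += 1  # several inputs may collide on one bin
    return bins


counts = histogram([0, 10, 130, 255, 250], num_bins=4, max_value=255)
```

Contrast this with a per-pixel operation, where every output location is known in advance and no two work items ever touch the same memory.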

Both ATI and NVIDIA have been working on GPU computing for quite some time, though NVIDIA has really been pushing it lately. Even before either company got involved, pioneers in GPU computing were using graphics languages to solve their problems. Out of this, at Stanford, grew a project called Brook. Brook was able to take specially written code that looked fairly similar to standard C programming and source-to-source compile it into code that would implement the appropriate graphics API calls and shaders to run the program. ATI took an early interest in this project and sort of latched onto it (perhaps because it showed better performance on ATI hardware than NVIDIA hardware at the time). After releasing CTM (the low level ISA spec for their graphics hardware) and later CAL (their abstracted pseudo instruction set that can span hardware generations), Brook was modified to compile directly to code targeted at ATI GPUs. This modification was called Brook+ and is the major vehicle AMD uses for GPU Computing today.

With the Catalyst 8.12 release, AMD is now including the necessary software to build and run GPU computing applications with Brook+ and CAL in the driver itself. This software is bundled up in a package AMD likes to call the ATI Stream SDK and has been available as a separate download for a while now. NVIDIA also did this with CUDA, first offering it as a separate download and later integrating it into their driver.

With both Brook+ and CUDA there are limitations in what can be done from both a language and a hardware target standpoint. At this point, the documentation for CUDA is more practical, giving better guidance on how to organize things, while the ATI Stream documentation is much lower level and arguably more complete than NVIDIA's. The long and short of it is that you'll be more likely to get up and running quickly with CUDA, but with ATI Stream there is the information to really understand what is going on if you want to go in and tweak the low level code generated by Brook+ (or if you just want to program at an assembly level).

As far as language extensions in general go, I prefer the CUDA approach to data parallel computing in spite of the fact that I still have my qualms about both. The major drawback to either is still the fact that they are locked into a specific hardware target. Good data parallel programming is hard, and there's no reason to make it more difficult than it needs to be by forcing developers to write their code twice in two different languages and very likely in two completely different ways to take advantage of both architectures. It's ridiculous.

Both NVIDIA and AMD like to get on their high horse when talking about their GPU computing efforts. We have AMD talking about openness and standards and NVIDIA talking about their investment in CUDA and the already apparent adoption and market penetration in high performance computing. The problem is that both approaches are lacking and both companies are fully capable of writing compilers to take the current Brook or CUDA C language extensions and target them at their own architecture. Both companies will eventually support OpenCL when it hits and the DirectX 11 Compute Shader as well. But in the meantime they just aren't interested in working together. That may or may not make sense from a business standpoint, but it certainly isn't the best path for the consumer or the industry.

Meanwhile NVIDIA, and now AMD, want to push their proprietary GPU computing technologies as a reason end users should want their hardware. At best, Brook+ and CUDA as language technologies are short term stopgap solutions. Both will fail or fall out of use as standards replace them. Developers know this and will simply not adopt the technology if it doesn't provide the return on investment they need, and 9 times out of 10, in the consumer space, it just won't make sense either to develop a solution for only one IHV's GPUs or to develop the same application twice using two different languages and techniques.

Where proprietary solutions for GPU computing do make sense is where the bleedingest edge performance is absolutely necessary: the HPC market. Large companies with tons of cash for research will save more money than they put in when developing for the types of scientific computing that really benefit from the GPU today. No matter how long it takes a development team to port a solution to CUDA or Brook+, if the application has anything like the order of magnitude speedups we are used to seeing in this space, the project will have more than made up for the investment in no time at all. Realized compute per dollar goes up at a similar rate to application speedup. GPU computing just makes sense here, even with proprietary solutions that only target one hardware platform.
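
As a rough illustration of that compute per dollar argument, here is a back-of-the-envelope sketch in Python. Every figure in it is hypothetical, picked only to show the arithmetic; none comes from AMD, NVIDIA, or any real deployment.

```python
def throughput_per_dollar(jobs_per_day, total_cost):
    """Work completed per day for each dollar spent."""
    return jobs_per_day / total_cost


# All figures below are hypothetical, chosen only to show the arithmetic.
cpu_cost = 100_000       # existing CPU-only system
cpu_jobs_per_day = 10    # work that system completes per day
gpu_hw_cost = 20_000     # GPUs added to the system
port_cost = 80_000       # developer time to port to CUDA or Brook+
speedup = 10             # an "order of magnitude" speedup

before = throughput_per_dollar(cpu_jobs_per_day, cpu_cost)
after = throughput_per_dollar(cpu_jobs_per_day * speedup,
                              cpu_cost + gpu_hw_cost + port_cost)
gain = after / before
```

With these made-up numbers, total spend doubles while throughput rises tenfold, so work per dollar improves about fivefold. The larger the original system is relative to the porting and GPU costs, the closer the gain gets to the raw speedup.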

In the consumer space, the real advantage that CUDA has over ATI Stream is PhysX. But this is barely even a real advantage at this point as PhysX suffers from the same fundamental problem that CUDA does: it only targets one hardware vendor's products. While there are a handful of PhysX titles out there, they aren't that compelling at this point either. We will have to start seeing some real innovation in physics with PhysX before it becomes a selling point. The closest we've got so far is the upcoming Mirror's Edge for the PC, but we must reserve judgement on that one because we haven't had the opportunity to play it yet.

And now we've got AMD's first real effort with the Avivo video converter finally using the GPU to do something (it did not in its original incarnation). This competes with the only real consumer level application available for CUDA: Badaboom. Now that there are video converters available on both sides of the aisle, we have the opportunity to compare something that really still doesn't matter that much: we get to see the relative performance of two applications written by different teams with different goals targeted at different hardware for different markets. Great. Let's get started.




  • plonk420 - Tuesday, December 16, 2008

    quick and dirty tests... (emphasis on dirty; i have Opera with 30+ tabs open, uTorrent, VNC Viewer, handful of other programs open)
    source: Star Wars 4 VTS_02_4, 18 mins
    low res: 320x128 took 310 seconds (±10 sec), processor use was ~30-40%
    high res: 640x272 took 325 seconds (±10 sec), processor use was 60-70%
    bicubic resizing, no cropping to exactly 2.39:1; couldn't be arsed. just used MeGUI's default iPod encoding settings, but set Quant 19. did SubME 2, 4, 5, and one test with 6 (the high res). it didn't seem to change the time it took to encode.

    i'm sure you could up quality, significantly, too with better resizing options (i usually use Lanczos4), other iPod compatible switches, at minimal speed cost. i don't usually encode for low bitrate/PMPs, but settings to do so are a google away.

    but this pq looks decent to me. no issues the hardware encode had. using x264 has a BIT of a learning curve, but can be as fast as these hardware solutions (and possibly exceed PQ with the proper options). recent builds have Penryn, i7, and even Phenom optimizations (that weren't utilized in one of the other site's i7 x264 tests).

    my tests were encoded on a "mere" Phenom 9550 @ 2.2ghz on Vista x64 SP1, drives fragmented to hell.

    options were --qp 19 --level 3 --nf --no-cabac --partitions none --merange 12 --threads auto --thread-input --progress --no-psnr --no-ssim (with --level being 1.3 for low or 3 for high, and --subme being 2, 4, 5, 6). build was 1051 (a few builds out of date; 1055 has better CAVLC PQ according to changelog)
  • mvrx - Monday, December 15, 2008

    I've been grumpy about this for years. Most all the commercial video editing packages have treated managing and encoding 1080p as a pro-only, or "coming next year" feature. 1080p isn't pro, folks. It's consumer level.

    I have to use StaxRip with the x264.exe encoder to do what I really want, as most commercial packages still have issues with 1080p. Everything should be ready for any resolution, including super high def. Don't try to charge customers more just because they are ready for what will be considered common in another year or two.

    I'd also like to see the upconverting technologies that HD DVD/BR players do in real time mature for software converters. I'd like to take my home movies and DVDs and convert them to 1080p, then encode to h.264. I know that doesn't give me true native quality content, I just like the idea of standardizing all my media to 1080p.
  • MadMan007 - Monday, December 15, 2008

    Thanks for keeping up with these articles. GPGPU's killer app for most consumers is video encoding. One thing that's missing from this article as was alluded to by an earlier comment is a reasonable price comparison - I'd like to see how the GPGPU encoders stack up to a dual core CPU since adding a video card is a much easier upgrade and lots of people have dual core CPU systems already.
  • psychobriggsy - Monday, December 15, 2008

    It's interesting comparing this review to the Avivo vs Badaboom article I read elsewhere earlier this month (using the leaked pre-release Catalyst) where they achieved significant performance improvements over using a CPU (i.e., it was actually encoding on the GPU in that review, whereas the conclusion here is that it isn't). They didn't use a Core i7 however. Still, when it comes to using a $200 video card with a $200 CPU, or a $1000 CPU to achieve the same thing, the choice is obvious (well, when the quality is sorted out).

    Regardless AMD really shouldn't have released Avivo in that state, or they could have at least just called it a preview. What's wrong with AMD/ATI's developers? Don't they have any pride in their work?
  • bobsm1 - Tuesday, December 16, 2008

    Looks like Anand is actually reviewing the products instead of taking the marketing crap that is being provided and just posting it. Takes a bit more work but the result is more interesting
  • daletkine - Tuesday, December 16, 2008

    I used my Sapphire Toxic 4870 to convert the MPEG2 recording of a Lost in Space VCR tape to DivX using Avivo Video converter. My size went from 5.10GB to 3.48GB, from 720x480 (12Mbps) to 640x480 (converter doesn't give res options). Speed wasn't dramatic, like around 40min for a 2h10min movie, on an Athlon 64 X2 @ 3.23GHz. CPU was utilized 100% and GPU was utilized (can be seen with RivaTuner) around 8-10%. Video quality was great, nothing to complain about, no artifacts like anand pointed out, looks fine. So, seems to have worked for me, but not what I expected. The bad: WMV displays artifacting just like anand showed, Catalyst 8.12 drivers have an issue with Xvid stuttering playback, so I saw that in WMP when playing my converted file (solved by using VLC player), no compression that I can see taking place (I used max quality setting, but medium only gives savings of 500MB, don't know about low), and it still needs a quad core.
    The good: it actually works for me, free, simple, good quality (when works).
    This on Vista 64, 8GB ram, Athlon 64 X2 5000+ @ 3.23GHz.
  • Jovec - Monday, December 15, 2008

    8.12 Avivo just creates a 0 byte file when trying to transcode my Xvid (AutoGK) videos. Ah well...
  • Etsp - Monday, December 15, 2008

    Man, that's some high level of compression. Careful with that, it may just have made an information black-hole on your hard-drive. :)
  • LtGoonRush - Monday, December 15, 2008

    I noticed that the x264 example image is suffering from artifacts due to nearest-neighbor resizing, it looks like it was encoded at a substantially lower resolution, or at the very least not resampled correctly for playback. You might want to verify that the settings used were correct, as it should look substantially better than the BadaBoom results.
  • Griswold - Monday, December 15, 2008

    I was thinking the same. That just doesn't look right.