GPU Transcoding Throwdown: Elemental's Badaboom vs. AMD's Avivo Video Converterby Anand Lal Shimpi & Derek Wilson on December 15, 2008 3:00 PM EST
- Posted in
Last year, NVIDIA introduced it's CUDA development package. Existing as a stand alone download for a while, eventually CUDA was rolled in to the driver itself. Today, AMD is following suit rolling their own GPU computing package, called ATI Stream, into their Catalyst 8.12 driver. While the package has been available for a while now, AMD is really starting to try and push forward on the idea of using their hardware as a GPU computing platform. In both market penetration and branding, AMD is way behind NVIDIA and CUDA.
NVIDIA has been pushing forward in the HPC market very well with CUDA, and PhysX on the desktop uses CUDA to implement hardware accelerated physics. While AMD is a bit behind, we don't see them as hugely lagging either. NVIDIA has some good ground work laid, but the market for GPU computing is still incredibly untapped. Both AMD and NVIDIA are in a good position to take advantage of GPU computing efforts when OpenCL and DriectX 11 come along, and we really do see this effort by both camps to sell GPUs using stream computing as a pitch that is very preliminary.
Software like the Adobe's photoshop perform GPU acceleration using OpenGL. We are at a place where, if a commercial application would significantly benefit from GPU acceleration, software companies still want to develop it once and just have it work. Targeting either NVIDIA through CUDA or AMD through Brook+ is too much of a headache for most software vendors. Having an API (or better yet a choice of APIs) targeted at hardware agnostic GPU computing will kick off the real revolution that both AMD and NVIDIA want their solutions to provide.
But ATI Stream isn't the only thing in Catalyst 8.12 that piqued our interest. To show off the inclusion of the new package, AMD built-in a free video transcoder that actually makes use of GPU acceleration. It is somewhat limited (as is the Badaboom package that runs on NVIDIA hardware), but free is always a nice price. We will compare what you get with the Avivo Video Converter to Badaboom, but we must remind our readers that these are distinct applications that approach the problem of video encoding in different ways and thus a direct comparison isn't as telling as if we could run the same code on both hardware platforms. It's sort of like comparing Unreal Tournament 3 performance on AMD hardware to Enemy Territory: Quake Wars performance on NVIDIA hardware. We've got to look at what is being done, the quality of the output, and we must consider the fact that completely different approaches are likely to have been used by each development team.
The final bit of goodness in the 8.12 driver are some performance tweaks in some apps and fixes for performance and CrossFire in Far Cry 2. There were quite a number of unresolved issues we had to deal with especially on our Core i7 systems that we needed to run down. We'll let you know what issues remain in the following pages as well. This driver release has been highly anticipated not only by reviewers, but by consumers awaiting the merging of older hot fixes into a WHQL driver. We have high hopes, and we'll see if AMD has delivered something that meets our expectations.
But first, let's go a little deeper into ATI Stream and CUDA and take a look at the competing video transcoding offerings now available.
ATI Stream vs. NVIDIA CUDA
ATI and NVIDIA both know that ubiquitous GPU computing is the future for all data parallel tasks. Computer graphics is one of the most heavily parallel tasks around. It also happens to be a problem that easily lends itself to parallelization because of how independent the parallel tasks can be. These two factors, plus demand for high quality graphics, are what made the consumer GPU industry explode in it's dozen years of existence. While the GPGPU (general purpose use of GPUs) philosophy has been around for a while, the advent of using the GPU for general data parallel tasks has been slow on the uptake in the main stream. This has not only to do with the fact that there were no specialized tools available (developers needed to shoehorn algorithms into OpenGL or DirectX and "draw" triangles to solve problems), but was further complicated by the fact that GPU architecture did not lend itself to the implementation of many useful data parallel algorithms.
Hardware designed for graphics, until very recently, has meant floating point only acceleration of completely independent data points with heavy restrictions on reading and writing data. Only a small subset of problems really map to that kind of architecture. With DirectX 9 we only saw inklings of real programmability, and DirectX 10 class hardware has really brought the tools developers need to bear. We now have hardware that can handle not only floating point but integer and bitwise operations as well. This combines nicely with the fact that local and global data stores have been added for sharing data and there is better support for non-sequential reads and writes out there. With DirectX 11, the package will be fairly feature complete, adding real support for data structures beyond simple arrays, optional double precision support and a whole host of other minor improvements that really add up for GPU computing (which is no surprise because DX11 will also feature a general purpose Compute Shader that doesn't need to tie work to triangles, vertecies, fragments or pixels).
Both ATI and NVIDIA have been working on GPU computing for quite some time, though NVIDIA has really been pushing it lately. Even before either company got involved, pioneers in GPU computing were using graphics languages to solve their problems. Out of this, at Stanford, grew a project called Brook. Brook was able to take specially written code that looked fairly similar to standard C programing and source-to-source compile it into code that would implement the appropriate graphics API calls and shaders to run the program. ATI took an early interest in this project and sort of latched onto it (perhaps because it showed better performance on ATI hardware than NVIDIA hardware at the time). After releasing CTM (the low level ISA spec for their graphics hardware) and later CAL (their abstracted pseudo instruction set that can span hardware generations), Brook was modified to compile directly to code targeted at ATI GPUs. This modification was called Brook+ and is the major vehicle AMD uses for GPU Computing today.
With the Catalyst 8.12 release, AMD is now including the necessary software to build and run GPU computing applications with Brook+ and CAL in the driver itself. This software is bundled up in a package AMD likes to call the ATI Stream SDK and has been available as a separate download for a while now. NVIDIA also did this with CUDA, first offering it as a separate download and later integrating it into their driver.
With both Brook+ and CUDA there are limitations in what can be done from both a language and a hardware target standpoint. At this point, the documentation for CUDA is more practical giving better guidance on how to organize things, while the ATI Stream documentation is much lower level and arguably more complete than NVIDIA's. The long and short of it is that you'll be more likely to get up and running quickly with CUDA, but with ATI Stream there is the information to really understand what is going on if you want to go in and tweak the low level code generated by Brook+ (or if you just want to program at an assembly level).
As far as language extensions in general go, I prefer the CUDA approach to data parallel computing in spite of the fact that I still have my qualms about both. The major draw back to either is still the fact that they are locked in to a specific hardware target. Good data parallel programming is hard, and there's no reason to make it more difficult than it needs to be by forcing developers to write their code twice in two different languages and very likely in two completely different ways to take advantage of both architectures. It's ridiculous.
Both NVIDIA and AMD like to get on their high horse when talking about their GPU computing efforts. We have AMD talking about openness and standards and NVIDIA talking about their investment in CUDA and the already apparent adoption and market penetration in high performance computing. The problem is that both approaches are lacking and both companies are fully capable of writing compilers to take the current Brook or CUDA C language extensions and target them at their own architecture. Both companies will eventually support OpenCL when it hits and the DirectX 11 Compute Shader as well. But in the mean time they just aren't interested in working together. Which may or may not make sense from a business standpoint, but it certainly isn't the best path for the consumer or the industry.
Meanwhile NVIDIA, and now AMD, want to push their proprietary GPU computing technologies as a reason end users should want their hardware. At best, Brook+ and CUDA as language technologies are stop gap short term solutions. Both will fail or fall out of use as standards replace them. Developers know this and will simply not adopt the technology if it doesn't provide the return on investment they need, and 9 times out of 10, in the consumer space, it just won't make sense to develop either a solution for only one IHV's GPUs or to develop the same application twice using two different languages and techniques.
Where proprietary solutions for GPU computing do make sense is where the bleedingest edge performance is absolutely necessary: the HPC market. High Performance Computing for large companies with tons of cash for research will save more money than they put in when developing for the types of scientific computing that really benefit from the GPU today. No matter how long it takes a development team to port a solution to CUDA or Brook+, if the application has anything like the order of magnitude speedups we are used to seeing in this space, the project will have more than made up for the investment no time at all. Realized compute per dollar goes up at a similar rate to application speedup. GPU computing just makes sense here, even with proprietary solutions that only target one hardware platform.
In the consumer space, the real advantage that CUDA has over ATI Stream is PhysX. But this is barely even a real advantage at this point as PhysX suffers from the same fundamental problem that CUDA does: it only targets one hardware vendor's products. While there are a handful of PhysX titles out there, they aren't that compelling at this point either. We will have to start seeing some real innovation in physics with PhysX before it becomes a selling point. The closest we've got so far is the upcoming Mirror's Edge for the PC, but we must reserve judgement on that one because we haven't had the opportunity to play it yet.
And now we've got AMD's first real effort with the Avivo video converter finally using the GPU to do something (it did not in it's original incarnation). This competes with the only real consumer level application available for CUDA: Badaboom. Now that there are video converters available on both sides of the aisle, we have the opportunity to compare something that really still doesn't matter that much: we get to see the relative performance of two applications written by different teams with different goals targeted at different hardware for different markets. Great. Let's get started.