Discrete HTPC GPU Shootout

Author: Ganesh T S

by Ganesh T S on June 12, 2011 10:30 PM EST



LAV CUVID can be benchmarked using GraphStudio's built-in benchmark to check video decoder performance. Unfortunately, GraphStudio can't use madVR in this process. Since our intent was to determine the performance of the GPU with and without madVR enabled, madVR had to be part of the benchmark. The developer of madVR, Mathias Rauen, created a special benchmarking build, which was used to generate the figures in this section.

The picture below shows the madVR benchmark build working in decode-only mode on the GT 430 for a 1080i60 H.264 clip.


LAV CUVID does the actual decoding (not visible in the picture) and sends frames to the madVR filter, but the filter only tracks the decode frame rate and doesn't render the frames. All the driver post processing steps are enabled. The interlaced clip being played back uses around 76% of the VPU. Decoding runs at 91 fps, well above the clip's 60 fps rate. The GPU load is 79% because the deinterlacing is performed on the shaders. This shows there is some headroom available in the GPU for further post processing. Is there enough for madVR? The picture below shows the benchmark build working in the decode + post processing mode.


Note that the frame rate falls below the real time requirement. At 52 fps against a 60 fps source, the renderer drops approximately 8 frames every second. The VPU load falls to 38% because the pipeline is now limited by how fast the processing steps in madVR can execute. GPU-Z shows that madVR pushes the GPU load up to 97%, and this becomes the bottleneck in the chain.
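The dropped-frame estimate is simple arithmetic, sketched below. This is only an illustration and not part of the madVR benchmark build: the function names are hypothetical, and the sketch assumes the renderer simply discards whatever it cannot process in time.

```python
def dropped_frames_per_second(source_fps: float, achieved_fps: float) -> float:
    """Frames per second that cannot be shown when processing lags the source."""
    return max(0.0, source_fps - achieved_fps)


def meets_real_time(source_fps: float, achieved_fps: float) -> bool:
    """True when the pipeline is at least as fast as the clip's frame rate."""
    return achieved_fps >= source_fps


# Figures quoted in the text for the 1080i60 H.264 clip on the GT 430.
SOURCE_FPS = 60.0
for label, fps in [("decode only", 91.0), ("decode + madVR", 52.0)]:
    print(label, meets_real_time(SOURCE_FPS, fps),
          dropped_frames_per_second(SOURCE_FPS, fps))
```

With the quoted figures, decode-only mode clears the 60 fps bar with room to spare, while the decode + post processing mode comes up 8 frames short every second.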

Another interesting aspect to note in the GPU-Z screenshots above is that madVR increases the load on the GPU's memory controller from 23% to 36%. This is to be expected, as madVR makes multiple passes over the frame and needs to move data back and forth between the shaders and the GPU's DRAM.

The extent of drop in the frame rate (and whether it fails to meet real time requirements) is decided by the options enabled in the madVR settings. We ran the benchmarks with various madVR configurations and for various codecs to get an idea of the performance of LAV CUVID, madVR and of course, the GPUs.

Before moving on to the benchmarking results, we have some more notes about the upsampling algorithms in madVR. Human eyes are much less sensitive to chroma resolution than to luma resolution. This is the reason why chroma is stored at a lower resolution with 4:2:0 subsampling. Due to the low chroma resolution, chroma often tends to look blocky with visible aliasing (especially with, e.g., red fonts on a black background). Usually, the best way to upsample chroma is to use a very soft interpolator to remove all the aliasing. However, that comes at the cost of chroma sharpness. A less soft chroma upsampling algorithm retains sharpness but leaves more of the aliasing visible. Basically, one can't have the cake and eat it too. So, it is a matter of taste as to whether one prefers removal of aliasing or wants a sharper picture.
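To make the resolution gap concrete, the sketch below computes the plane dimensions of a 4:2:0 frame. It is illustrative only: the function name `plane_sizes_420` is hypothetical, and rounding odd dimensions up is an assumption made here, not something taken from madVR or any particular decoder.

```python
def plane_sizes_420(width: int, height: int) -> dict:
    """Return (width, height) for the luma and chroma planes of a 4:2:0 frame.

    In 4:2:0, each chroma plane has half the luma resolution in both
    directions. Odd dimensions are rounded up (an assumption of this sketch).
    """
    chroma = ((width + 1) // 2, (height + 1) // 2)
    return {"Y": (width, height), "Cb": chroma, "Cr": chroma}


sizes = plane_sizes_420(1920, 1080)
luma_samples = sizes["Y"][0] * sizes["Y"][1]
chroma_samples = sizes["Cb"][0] * sizes["Cb"][1]
print(sizes)
print(luma_samples // chroma_samples)  # luma samples per chroma sample, per plane
```

For a 1080p frame, each chroma plane is 960x540, so every chroma sample has to cover four luma pixels. That is why the choice of chroma upsampling algorithm shows up on sharp colored edges.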

The default luma algorithm used by madVR is Lanczos. The default chroma algorithm is SoftCubic 100 (which is very soft). It is not recommended to set chroma upsampling to Lanczos or Spline as they are very sharp. The cost in performance is also too big to be worth the gain for chroma. SoftCubic, Bicubic or Mitchell-Netravali are suggested for chroma upsampling as they are all 2-tap and need less GPU resources. In any case, it is hard to spot differences between various chroma algorithms in most real life images.

For luma upsampling the situation is very different. Most people prefer sharp results. The luma algorithm has a much bigger impact on overall image quality than the chroma upsampling algorithm. For luma upscaling, some users prefer the nice, sharp Lanczos 4 or Spline 4. Some prefer SoftCubic 50 because it does a better job of hiding source artifacts. Others prefer Mitchell-Netravali or Bicubic as a more all-round solution. There is no hard recommendation for this.
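The sharp versus soft distinction comes from the shape of the interpolation kernel. The sketch below evaluates the textbook Lanczos kernel and the Mitchell-Netravali cubic (B = C = 1/3). It is not madVR's shader code, and how madVR's tap counts map onto the Lanczos window parameter `a` is not established here; `a = 2` is chosen purely for illustration.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def lanczos(x: float, a: int = 2) -> float:
    """Textbook Lanczos kernel with window parameter a (support |x| < a)."""
    if abs(x) >= a:
        return 0.0
    return sinc(x) * sinc(x / a)


def mitchell_netravali(x: float, b: float = 1 / 3, c: float = 1 / 3) -> float:
    """BC-spline cubic; B = C = 1/3 gives the Mitchell-Netravali filter."""
    x = abs(x)
    if x < 1:
        return ((12 - 9 * b - 6 * c) * x**3
                + (-18 + 12 * b + 6 * c) * x**2
                + (6 - 2 * b)) / 6
    if x < 2:
        return ((-b - 6 * c) * x**3
                + (6 * b + 30 * c) * x**2
                + (-12 * b - 48 * c) * x
                + (8 * b + 24 * c)) / 6
    return 0.0


# Compare the two kernels at a few sample distances from the output pixel.
for d in (0.0, 0.5, 1.0, 1.5):
    print(d, round(lanczos(d), 4), round(mitchell_netravali(d), 4))
```

At a distance of 1.5 samples, both kernels go negative, but the Lanczos lobe is deeper. Deeper negative lobes are generally associated with a sharper result and with more ringing, while the shallower Mitchell-Netravali lobe gives a softer, more forgiving picture.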

The madVR settings used for benchmarking were classified broadly into three categories:

Low Quality : Bilinear luma and chroma scaling
Medium Quality : Bicubic (sharpness 50) luma scaling and Bilinear chroma scaling
High Quality : Lanczos (4-tap) luma scaling and SoftCubic (softness 70) chroma scaling

Scaling is one of the core functions in madVR, but it is not needed if the display resolution matches that of the video. In the 1080p and 1080i videos presented below, luma is not scaled, but chroma still needs to be upsampled. The 'trade quality for performance' madVR options didn't seem to improve performance much, so all of them were left unchecked for benchmarking.
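The three benchmark categories, and the rule that luma scaling only applies when the resolutions differ, can be summarized as data. This is a hypothetical sketch for the reader's benefit: `ScalingPreset`, `PRESETS` and `active_scalers` are names invented here and do not reflect madVR's actual settings format. It also assumes 4:2:0 content, so chroma is always upsampled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPreset:
    luma: str
    chroma: str


# The three categories used for benchmarking, as listed in the text.
PRESETS = {
    "Low Quality": ScalingPreset(luma="Bilinear", chroma="Bilinear"),
    "Medium Quality": ScalingPreset(luma="Bicubic (sharpness 50)", chroma="Bilinear"),
    "High Quality": ScalingPreset(luma="Lanczos (4-tap)", chroma="SoftCubic (softness 70)"),
}


def active_scalers(preset_name, source_res, display_res):
    """List the scaling steps that actually run for a given clip and display.

    Assumes 4:2:0 content: chroma is always upsampled, while luma is
    scaled only when the source and display resolutions differ.
    """
    preset = PRESETS[preset_name]
    steps = [f"chroma: {preset.chroma}"]
    if source_res != display_res:
        steps.insert(0, f"luma: {preset.luma}")
    return steps


print(active_scalers("High Quality", (1920, 1080), (1920, 1080)))
print(active_scalers("High Quality", (1280, 720), (1920, 1080)))
```

For 1080p content on a 1080p display only the chroma scaler is exercised, so the luma setting starts to matter only for lower-resolution sources.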

In the graphs below, 'Full VPP' refers to all the video post processing options as set in the NVIDIA Control Panel. The other entries refer to the madVR settings described above. The top row in each graph indicates the performance of the LAV CUVID decoder. When compared with the benchmarks of the DXVA2 decoders (presented in an earlier section), we see that the LAV CUVID decoder has almost no performance penalty.

In the graphs below, we try to identify what causes the throughput to fall below 60 fps. First, let us take a look at the 1080p H.264 clip.

In the above graph, we see that the GT 520's limited shader resources hold back madVR performance. The madVR steps become the bottleneck in this case. On the GT 430, the VPU remains the bottleneck until the more complicated scaling algorithms (of theoretical interest, and not presented in the graph above) are enabled.

We see the same trends continuing for MPEG-2 and VC-1 also. Now, we move on to get a first glimpse at the extent of hardware acceleration available for MPEG-4 streams.

As expected, we get decent hardware acceleration for MPEG-4 and the post processing impact is the same as that for the other codecs.

Interlaced streams don't seem to alter the trend. The absolute values of the maximum decode frame rate are slightly lower in the high stress cases due to the overhead from deinterlacing. The GT 430's efficiency is now limited by shader power, rather than the VPU.

How do things change when we try to upscale non-1080p content onto a 1080p display? This is probably where madVR's algorithms are needed most. To test this out, we put some non-1080i/p H.264 clips through the same benchmark.

720p H.264

480p H.264

480i H.264

An interesting result in the above benchmark is that the 480i H.264 stream can be processed faster using the GT 430 compared to the GT 520 with madVR disabled. It is quite obvious here that the deinterlacing using the GT 520's shaders is the bottleneck once the VPU hits 300 fps.

In all of the above non-1080i/p benchmarks, the lack of shaders in the GT 520 really hurts it. At 720p60, its High Quality frame rate sits very close to 60 fps, which leaves too little margin for us to recommend that setting. The GT 430 holds up pretty decently in all the cases.

The takeaway from this section is that the GT 520 is not entirely suitable for madVR processing if you deal with a lot of SD material. The GT 430 is quite suitable for madVR processing as long as you keep the settings sane.

madVR is still an advanced HTPC user's tool. However, it should gain further traction with support for integrated hardware decoding and other driver supplied post processing options. We have covered a solution for NVIDIA GPU based HTPCs in this section. Let us see how this plays out for the AMD and Intel GPU platforms in the future.


70 Comments


qwertymac93 - Monday, June 13, 2011 - link
What the heck are you talking about?
velis - Monday, June 13, 2011 - link
A great review. Provides all the answers one could wish for and even gives some further hints.
I sure hope you have something like this lined up for llano.

If I may suggest a couple or three things:
Perhaps you should also mention reclock - it will solve most 23.976 and similar problems... It's not like many will detect that the video is running 1/24000th faster. Plus it's insanely easy to use.
I understand you couldn't just post full blown images for space problems, but those thumbnails require too much work too. Is it possible to display a popup of sorts when one mouse-overs those thumbnails?
Also a vertical line showing 60FPS in those DXVA tests would be great :)
ganeshts - Tuesday, June 14, 2011 - link
I will pass on your request(s) to the person in charge of the graphing engine :)
Salfalot - Monday, June 13, 2011 - link
What might have been a nice option is to see what sound levels the cards produced, even if it was only for the GT430 and the HD6570. I know that the decibels can differ between manufacturers but it would have been nice!
For the rest a very nice detailed review between HTPC cards. I was deciding which card to buy so this helped a great deal! I was only looking between the HD6450 and the HD6570 but the GT430 is a better option than the HD6450.
nevcairiel - Monday, June 13, 2011 - link
HDMI audio is purely digital; there is no difference based on what card you use.

It depends on the audio decoder and on your receiver at the other end of the HDMI link. The HDMI sound card on those cards does not change the audio.
Salfalot - Monday, June 13, 2011 - link
I think I did not use the right word, as I meant the decibel levels that the fans of the cards produce and not the audio sent to and through the speakers.
All reviewed cards have a fan on them and since most of the HTPC setups are in the living room it would have been nice to know which of the cards are most silent.
ganeshts - Monday, June 13, 2011 - link
Though we considered cards with fans in this review, we made it a point to note that the same configuration (GPU model + DRAM bus width + operating frequencies) can be obtained with passive cooling from other vendors.

For example, the 6570 has a passively cooled model from HIS with the same config and Zotac has a passively cooled 430 too. Other vendors have also demonstrated passively cooled models in Computex.
cjs150 - Monday, June 13, 2011 - link
Firstly, a truly informative article. Very high quality.

The fact that none of AMD, Intel and Nvidia can lock onto the correct frame rates is unforgivable. It is not as though these frame rates have changed over the last 6 months. It should not be necessary to be an advanced HTPC user and delve into custom creation of frame rates.

I really hope that the representatives of AMD, Intel and NVidia are hanging their heads in shame at such basic errors - sadly I doubt they care.
Grasso789 - Monday, January 28, 2013 - link
The mistake is rather with Microsoft. Video playback speed should be adapted to the refresh rate of the graphics card. There is a piece of software called Reclock that does this. Then, for example, 23.976 Hz content can be run with a monitor refresh rate of n times 24 Hz. (The same goes for audio, because bit-perfect transmission only works with synchronization.) In the end, and for most sources, the RAMDAC would need only (multiples of) 24, 25 and 30 Hz. In any system, one of its parts should be the clock master, while the other parts serve.
casteve - Monday, June 13, 2011 - link
Excellent review, Ganesh! Your HTPC insight/reviews have been missed.
