LAV CUVID can be benchmarked using GraphStudio's built-in benchmark to check the video decoder's performance. Unfortunately, GraphStudio can't use madVR in this process. Since our intent was to determine the performance of the GPU both with and without madVR enabled, it was essential that madVR be part of the benchmark. The developer of madVR, Mathias Rauen, created a special benchmarking build, which was used to generate the figures in this section.

The picture below shows the madVR benchmark build running in decode-only mode on the GT 430 with a 1080i60 H.264 clip.

[Screenshot: madVR benchmark in decode-only mode, with GPU-Z readings]

LAV CUVID does the actual decoding (not visible in the picture) and sends frames over to the madVR filter, but the filter just keeps track of the decode frame rate and doesn't render them. All the driver post processing steps are enabled. The interlaced clip being played back uses around 76% of the VPU. Decoding is performed at 91 fps, well above the clip's 60 fps rate. The GPU load is 79%, a consequence of the deinterlacing being performed on the shaders. This shows there is some headroom available in the GPU for further post processing. Is there enough for madVR? The picture below shows the benchmark build working in the decode + post processing mode.

[Screenshot: madVR benchmark in decode + post processing mode, with GPU-Z readings]

Note that the frame rate falls below the real-time requirement. At 52 fps, the renderer drops approximately 8 frames every second. The VPU load falls to 38% because the process is now limited by how fast the processing steps in madVR can execute. GPU-Z shows that madVR has pushed the GPU load up to 97%, making it the bottleneck in the chain.
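
To put the arithmetic in code form, here is a minimal sketch of what the benchmark numbers mean; `measure_fps` and `decode_one_frame` are illustrative stand-ins, not actual LAV CUVID or madVR interfaces:

```python
import time

def measure_fps(decode_one_frame, n_frames=600):
    # Throughput in the spirit of the benchmark build: frames pushed
    # through the chain as fast as possible, with no rate limiting.
    start = time.perf_counter()
    for _ in range(n_frames):
        decode_one_frame()
    return n_frames / (time.perf_counter() - start)

# A 60 fps clip sustained at only 52 fps sheds the difference:
clip_fps, achieved_fps = 60.0, 52.0
print(f"~{clip_fps - achieved_fps:.0f} frames dropped per second")  # ~8
```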

Another interesting aspect to note in the GPU-Z screenshots above is that madVR increases the load on the GPU's memory controller from 23% to 36%. This is to be expected, as madVR makes multiple passes over the frame and needs to move data back and forth between the shaders and the GPU's DRAM.
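
A back-of-the-envelope calculation shows why extra render passes show up at the memory controller. The surface format, frame rate and pass count below are assumptions chosen for illustration, not measured madVR internals:

```python
# Each full-frame render pass reads and writes the whole frame once.
width, height = 1920, 1080
bytes_per_pixel = 8            # assuming 16-bit-per-channel RGBA intermediates
fps = 60

frame_bytes = width * height * bytes_per_pixel            # ~16.6 MB
per_pass_gb_s = 2 * frame_bytes * fps / 1e9               # read + write
print(f"{per_pass_gb_s:.1f} GB/s per pass")               # ~2.0 GB/s

# Chroma upsampling, luma scaling, dithering, etc. each add a pass:
print(f"{4 * per_pass_gb_s:.1f} GB/s over four passes")   # ~8.0 GB/s
```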

The extent of the frame rate drop (and whether playback fails to meet real-time requirements) depends on the options enabled in the madVR settings. We ran the benchmarks with various madVR configurations and various codecs to get an idea of the performance of LAV CUVID, madVR and, of course, the GPUs.

Before moving on to the benchmarking results, we have some more notes about the upsampling algorithms in madVR. Human eyes are much less sensitive to chroma resolution than to luma resolution, which is why chroma is stored at a lower resolution in 4:2:0 content. Due to the low chroma resolution, chroma often tends to look blocky, with visible aliasing (especially noticeable with, say, red text on a black background). Usually, the best way to upsample chroma is to use a very soft interpolator to remove all the aliasing. However, that comes at the cost of chroma sharpness; a less soft chroma upsampling algorithm retains sharpness but leaves the aliasing behind. Basically, one can't have one's cake and eat it too, so it is a matter of taste whether one prefers removal of aliasing or a sharper picture.
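
A one-dimensional sketch of this trade-off, with made-up sample values standing in for a hard red/black chroma edge in a 4:2:0 source:

```python
import numpy as np

# 4:2:0 stores one chroma sample per 2x2 block of luma pixels, so a hard
# boundary in chroma reaches the upsampler as a coarse step.
chroma = np.array([16.0, 16.0, 240.0, 240.0])   # low-resolution samples
src_x = np.arange(4) * 2.0                      # their full-resolution positions
dst_x = np.arange(8, dtype=float)               # full-resolution output positions

# Sharp (sample duplication): preserves the hard edge, but looks blocky.
sharp = np.repeat(chroma, 2)

# Soft (linear interpolation): ramps across the edge, trading sharpness
# for reduced blockiness -- the matter of taste described above.
soft = np.interp(dst_x, src_x, chroma)

print(sharp)  # [ 16.  16.  16.  16. 240. 240. 240. 240.]
print(soft)   # [ 16.  16.  16. 128. 240. 240. 240. 240.]
```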

The default luma algorithm used by madVR is Lanczos. The default chroma algorithm is SoftCubic 100 (which is very soft). Setting chroma upsampling to Lanczos or Spline is not recommended, as they are very sharp and their performance cost is too big to be worth the gain for chroma. SoftCubic, Bicubic or Mitchell-Netravali are suggested for chroma upsampling, as they are computationally lighter on the GPU. In any case, it is hard to spot differences between the various chroma algorithms in most real-life images.
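
For the curious, here is a sketch of the windowed-sinc math behind the Lanczos options; mapping the tap-count setting to a window half-width of taps/2 is our assumption, not documented madVR behavior:

```python
import numpy as np

def lanczos(x, taps=4):
    # Windowed-sinc reconstruction kernel. A 4-tap filter reads 4 source
    # samples per output pixel per axis (16 for a separable 2-D resize),
    # which is why higher tap counts cost more GPU time.
    a = taps / 2.0
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)

# Weights for an output pixel sitting halfway between source samples;
# the negative outer lobes are what give Lanczos its sharp look:
print(lanczos(np.array([-1.5, -0.5, 0.5, 1.5])))
```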

For luma upsampling, the situation is very different. Most people prefer sharp results, and the luma algorithm has a much bigger impact on overall image quality than the chroma upsampling algorithm. For luma upscaling, some users prefer the nice, sharp Lanczos 4 or Spline 4. Some prefer SoftCubic 50 because it does a better job of hiding source artifacts. Others prefer Mitchell-Netravali or Bicubic as a more balanced all-around solution. There is no hard recommendation here.

The madVR settings used for benchmarking were classified broadly into three categories:

  1. Low Quality: Bilinear luma and chroma scaling
  2. Medium Quality: Bicubic (sharpness 50) luma scaling and Bilinear chroma scaling
  3. High Quality: Lanczos (4-tap) luma scaling and SoftCubic (softness 70) chroma scaling

Scaling is one of the core functions in madVR, but it is not needed if the display resolution matches that of the video. In the 1080p and 1080i videos presented below, there is no scaling of luma, but chroma still needs to be upsampled. The 'trade quality for performance' madVR options didn't seem to improve performance much, and all of them were kept unchecked for benchmarking.
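
In other words, even at a native 1080p output only the chroma path exercises the scaler. A trivial check, assuming 4:2:0 content:

```python
# Luma matches the display, so no luma scaling pass is needed; the
# 4:2:0 chroma planes are half resolution and always need upsampling.
display = (1920, 1080)
luma    = (1920, 1080)
chroma  = (1920 // 2, 1080 // 2)   # 960 x 540

print("luma scaled:  ", luma != display)     # False -> scaler skipped
print("chroma scaled:", chroma != display)   # True  -> upsampler always runs
```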

In the graphs below, 'Full VPP' refers to all the video post processing options as set in the NVIDIA Control Panel. The other entries refer to the madVR settings described above. The top row in each graph indicates the performance of the LAV CUVID decoder. When compared with the benchmarks of the DXVA2 decoders (presented in an earlier section), we see that the LAV CUVID decoder has almost no performance penalty.

In the graphs below, we try to identify what causes the throughput to fall below 60 fps. First, let us take a look at the 1080p H.264 clip.

1080p H.264

In the above graph, we see that the GT 520's lack of shader power hurts madVR performance; the madVR steps become the bottleneck in this case. On the GT 430, the VPU remains the bottleneck until the more complicated scaling algorithms (of mostly theoretical interest, and not presented in the graph above) are enabled.

1080p MPEG-2

1080p VC-1

We see the same trends continuing for MPEG-2 and VC-1 as well. Now, we move on to a first glimpse at the extent of hardware acceleration available for MPEG-4 streams.

1080p MPEG-4 [DivX]

1080p MPEG-4 [Xvid]

As expected, we get decent hardware acceleration for MPEG-4 and the post processing impact is the same as that for the other codecs.

Interlaced streams don't alter the trend. The maximum decode frame rates are slightly lower in the high stress cases due to the overhead of deinterlacing, and the GT 430 is now limited by shader power rather than by the VPU.

1080i H.264

1080i MPEG-2

1080i VC-1

How do things change when we try to upscale the non-1080p content onto a 1080p display? This is probably where madVR's algorithms are needed most. To test this out, we put some non-1080i/p H.264 clips through the same benchmark.

720p H.264

480p H.264

480i H.264

An interesting result in the above benchmark is that, with madVR disabled, the 480i H.264 stream is processed faster on the GT 430 than on the GT 520. It is quite clear that deinterlacing on the GT 520's shaders becomes the bottleneck once the VPU can decode at around 300 fps.
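
A toy model makes the reasoning concrete: the chain runs at the speed of its slowest stage. The per-stage caps below are hypothetical numbers chosen only to illustrate the shader-bound versus decode-bound cases, not measurements:

```python
# The slowest stage of a decode -> deinterlace -> render chain sets the pace.
def pipeline_fps(*stage_caps):
    return min(stage_caps)

# Strong VPU, weak shaders (a GT 520-like scenario, assumed numbers):
print(pipeline_fps(300, 180))   # 180 -> deinterlacing, not decoding, limits
# Weaker VPU, stronger shaders (a GT 430-like scenario, assumed numbers):
print(pipeline_fps(260, 400))   # 260 -> decode-bound
```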

In all of the above non-1080i/p benchmarks, the GT 520's lack of shader power really hurts it. At 720p60, its High Quality frame rate comes uncomfortably close to the 60 fps requirement, leaving no headroom, so that configuration can't be recommended. The GT 430 holds up pretty decently in all cases.

The takeaway from this section is that the GT 520 is not entirely suitable for madVR processing if you deal with a lot of SD material. The GT 430 is quite suitable for madVR processing as long as you keep the settings sane.

madVR is still an advanced HTPC user's tool. However, it should gain further traction with support for integrated hardware decoding and other driver-supplied post processing options. We have covered a solution for NVIDIA GPU-based HTPCs in this section. Let us see how this plays out for the AMD and Intel GPU platforms in the future.

Comments

  • enki - Monday, June 13, 2011

    How about a short conclusion section for those who just use a Windows 7 box with a Ceton tuner card to watch HDTV in Windows Media Center? (i.e., it will just be playing back WTV files recorded directly on the box)

    What provides the best quality output?

    What can stream better than stereo over HDMI? On my old 3400 ATI card, it either streams the Dolby Digital directly (the computer doesn't do any processing of the audio) or outputs stereo (it doesn't think there can be more than 2 speakers connected)

    Thanks
  • BernardP - Monday, June 13, 2011

    The inability to create and scale custom resolutions within AMD graphics drivers is, for me, a deal-breaker that keeps me from even considering AMD graphics. It will also keep me from Llano, Trinity and future AMD Fusion APUs. I'll stay with NVidia as long as they keep allowing for custom resolutions.

    My older eyes are grateful for the custom 1536 X 960 desktop resolution on my 24 inch 16:10 monitor. I couldn't create this resolution with AMD graphics drivers.
  • bobbozzo - Tuesday, June 14, 2011

    In your case, you should just increase the size of the fonts and widgets instead of lowering the screen res.
  • Assimilator87 - Tuesday, June 14, 2011

    I wish there was a section dedicated to the silent stream bug. I have a GTX 470 hooked up to an Onkyo TX-SR805 and this issue is driving me insane. For instance, does this issue only plague certain cards, or do all nVidia cards suffer from it? I was hoping the latest WHQL driver (275.33) would fix this, but sadly, no. Otherwise, the article was amazing and I'll definitely have to check out LAV Splitter.
  • ganeshts - Tuesday, June 14, 2011

    The problem with the silent stream bug is that one driver version has it, the next one doesn't and then the next release brings it back. It is hard to pinpoint where the issue is.

    Amongst our candidates, even with the same driver release, the GT 520 had the bug, but the GT 430 didn't. I am quite confident that the GT 520 issue will get resolved in a future update, but then, I can just hope that it doesn't break the GT 430.
  • JoeHH - Tuesday, June 14, 2011

    This is simply one of the best articles I have ever seen about HTPC. Congrats Ganesh and thank you. Very informative and useful.
  • bobbozzo - Tuesday, June 14, 2011

    Hi, can you please compare hardware de-interlacing, etc., vs. software?

    e.g. many players/codecs can do de-interlacing, de-noise, etc. in software, using the CPU.

    How does this compare with a hardware implementation?

    thanks
  • ganeshts - Tuesday, June 14, 2011

    This is a good suggestion. Let me try that out in the next HTPC / GPU piece.
  • CiNcH - Wednesday, June 15, 2011

    Hey guys,

    here is how I understand the refresh rate issue. It does not matter whether it is 0.005 Hz off. You can't calculate frame drops/repeats from that. In DirectShow, frames are scheduled with the graph reference clock. So the real problem is how much the clock which the VSync is based on and the reference clock in the DirectShow graph drift from each other. And here is where ReClock comes into play. It derives the DirectShow graph clock from the VSync, i.e. synchronizes the two. So it does not matter whether your VSync is off as long as playback speed is adjusted accordingly. A problem here is synchronizing audio, which is not too easy if you bitstream it...
  • NikosD - Thursday, June 16, 2011

    Nice guide but you missed something.
    It's called PotPlayer, it's free and has built-in almost everything.
    CPU & DXVA (partial, full) codecs and splitters for almost every container and every video file out there.
    The same is true for audio, too.
    It even has pass-through (S/PDIF, HDMI) for AC3, TrueHD, DTS and DTS-HD. Only EAC3 is not working.
    It has also support for madVR and a unique DXVA-renderless mode which combines DXVA & madVR!
    I think it's close to perfect!
    BTW, the article says that there is no free audio decoder for DTS and DTS-HD.
    That's not correct.
    FFDShow is capable of decoding and passing through (S/PDIF, HDMI) both DTS and DTS-HD.
    And PotPlayer of course!
