Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. This is transcoding rather than encoding because the source is almost always already stored in some compressed format, the most common these days being H.264/AVC.

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically have many of the same elements, video compression algorithms look at data that’s repeated temporally as well as spatially.
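To make the temporal side of that concrete, here is a minimal sketch in Python. The 4x4 "frames" and their pixel values are made up for illustration, and the helper names (`frame_residual`, `nonzero_count`) are hypothetical; real codecs work on motion compensated blocks rather than raw frame differences. The point is only that when two frames are nearly identical, the difference between them carries far less data than the second frame itself.

```python
# Illustrative only: tiny grayscale frames with made-up pixel values.

def frame_residual(prev, curr):
    """Per-pixel difference between two frames of equal size."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def nonzero_count(frame):
    """Number of values that would still need to be stored."""
    return sum(1 for row in frame for v in row if v != 0)

frame_a = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]

# The next frame shows the same scene with a single pixel changed.
frame_b = [row[:] for row in frame_a]
frame_b[0][3] = 12

residual = frame_residual(frame_a, frame_b)

print(nonzero_count(frame_b))   # 16: every pixel of the raw frame is nonzero
print(nonzero_count(residual))  # 1: only the changed pixel survives
```

Storing the first frame plus the residual is enough to rebuild the second frame exactly, which is the basic idea behind exploiting temporal redundancy.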

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. The speed of the decode engine has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
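The first point can be put as simple arithmetic. The sketch below assumes a two stage pipeline whose sustained rate is capped by its slower stage; the function names and all of the frame rates are hypothetical and are not measurements from this review.

```python
def transcode_fps(decode_fps, encode_fps):
    """Sustained rate of a decode -> encode pipeline: the slower stage wins."""
    return min(decode_fps, encode_fps)

def transcode_seconds(frame_count, decode_fps, encode_fps):
    """Time needed to push frame_count frames through the pipeline."""
    return frame_count / transcode_fps(decode_fps, encode_fps)

# Hypothetical clip: ten minutes at a rounded 24 fps.
frames = 24 * 60 * 10

# Hypothetical stage speeds, in frames per second.
slow_decoder = transcode_seconds(frames, decode_fps=60, encode_fps=200)
fast_decoder = transcode_seconds(frames, decode_fps=200, encode_fps=200)

print(f"decode-bound: {slow_decoder:.0f} s")
print(f"balanced:     {fast_decoder:.0f} s")
```

With these made-up numbers, an encoder that could run at 200 fps is held to 60 fps by the decoder, so speeding up decode alone shortens the job.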

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is efficient in both performance and power. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

                                      Intel Core i5-2500K   NVIDIA GeForce GTX 460   AMD Radeon HD 6870
Number of Parallel 1080p HP Streams   5 streams             3 streams                1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile, 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge however takes the cake, as a single Core i5-2500K can decode five streams simultaneously.
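Those stream counts translate into a rough lower bound on sustained decode throughput. The sketch below just multiplies the stream count by the test clip's frame rate and bitrate; the stream counts, 23.976 fps and 15Mbps come from the test above, while the `aggregate` helper is a hypothetical name. Because the test only checks whether every stream holds full frame rate, the real ceiling for each part may sit somewhat above these figures.

```python
CLIP_FPS = 23.976       # frame rate of the test clip
CLIP_BITRATE_MBPS = 15  # bitrate of the test clip

# Maximum simultaneous streams observed in the test above.
max_streams = {
    "Intel Core i5-2500K": 5,
    "NVIDIA GeForce GTX 460": 3,
    "AMD Radeon HD 6870": 1,
}

def aggregate(stream_count, fps=CLIP_FPS, bitrate_mbps=CLIP_BITRATE_MBPS):
    """Total frames per second and Mbps decoded across all streams."""
    return stream_count * fps, stream_count * bitrate_mbps

for name, count in max_streams.items():
    total_fps, total_mbps = aggregate(count)
    print(f"{name}: at least {total_fps:.1f} fps, {total_mbps} Mbps")
```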

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at each location) to the frequency domain (how quickly pixel values change across a block); this is done with a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
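For readers who want to see what that domain transform does, here is a textbook 2D DCT-II written in plain Python. It is a sketch for illustration, not a description of Intel's hardware: the 8x8 block size, the orthonormal scaling and the `dct_2d` name are assumptions, and H.264 itself uses an integer approximation of the DCT rather than this floating point form.

```python
import math

def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block (direct, unoptimized form)."""
    n = len(block)

    def alpha(k):
        # Scaling that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out

# A perfectly flat block: every pixel has the same made-up value.
flat_block = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat_block)

# Count coefficients that are meaningfully different from zero.
significant = sum(1 for row in coeffs for c in row if abs(c) > 1e-6)
print(round(coeffs[0][0], 3), significant)
```

A flat block has no variation from pixel to pixel, so all of its energy lands in the single top-left (DC) coefficient and the other 63 come out at effectively zero. That concentration is what makes the frequency domain representation easier to compress than the raw pixels.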

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
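To give a feel for why motion search is so expensive, here is a minimal exhaustive block matching sketch in Python. It is not how the EU array implements it; the sum of absolute differences (SAD) cost, the search radius, the tiny 8x8 frames and every function name are assumptions chosen for illustration.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a ref block and a cur block."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))

def motion_search(ref, cur, cx, cy, size, radius):
    """Try every offset within radius; return (cost, dx, dy) of the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

def make_frame(px, py, size=8):
    """Black frame with a bright 2x2 patch whose top-left corner is (px, py)."""
    frame = [[0] * size for _ in range(size)]
    for j in range(2):
        for i in range(2):
            frame[py + j][px + i] = 200
    return frame

ref_frame = make_frame(2, 2)   # patch position in the previous frame
cur_frame = make_frame(3, 3)   # patch has moved one pixel right and down

best = motion_search(ref_frame, cur_frame, cx=3, cy=3, size=2, radius=2)
print(best)  # (0, -1, -1): a perfect match one pixel up and to the left
```

Even this toy search evaluates 25 candidate positions for a single 2x2 block. A 1080p frame contains thousands of much larger blocks, each searched over a wider window and often against several reference frames, which is why this stage dominates the compute cost.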

283 Comments

  • auhgnist - Monday, January 17, 2011

    For example, between i3-2100 and i7-2600?
  • timminata - Wednesday, January 19, 2011

    I was wondering, does the integrated GPU provide any benefit if you're using it with a dedicated graphics card anyway (GTX470) or would it just be idle?
  • James5mith - Friday, January 21, 2011

    Just thought I would comment with my experience. I am unable to get bluray playback, or even CableCard TV playback with the Intel integrated graphics on my new I5-2500K w/ Asus Motherboard. Why you ask? The same problem Intel has always had, it doesn't handle the EDID's correctly when there is a receiver in the path between it and the display.

    To be fair, I have an older Westinghouse Monitor, and an Onkyo TX-SR606. But the fact that all I had to do was reinstall my HD5450 (which I wanted to get rid of when I did the update to SandyBridge) and all my problems were gone kind of points to the fact that Intel still hasn't gotten it right when it comes to EDID's, HDCP handshakes, etc.

    So sad too, because otherwise I love the upgraded platform for my HTPC. Just wish I didn't have to add-in the discrete graphics.
  • palenholik - Wednesday, January 26, 2011

    As I understand from the article, you used just this one piece of software for all of these tests, and I understand why. But is that enough to conclude that CUDA causes bad or low picture quality?

    I am very interested in, and do research on, H.264 and x264 encoding and decoding performance, especially on GPUs. I have tested Xilisoft Video Converter 6, which supports CUDA, and I didn't have problems with low picture quality when using CUDA. I ran these tests on an nVidia 8600 GT for the TV station that I work for. I was researching a way to compress video for sending over the internet with little or no quality loss.

    So, could it be that Arcsoft Media Converter cooperates badly with CUDA technology?

    And I must note here how well the AMD Phenom II x6 performs, comparable to the nVidia GTX 460. This means that one could buy a motherboard with integrated graphics and an AMD Phenom II x6 and get very good encoding performance in terms of speed and quality. Intel is the winner here, no doubt, but jumping from socket to socket and changing the whole platform troubles me.

    Nice and very useful article.
  • ellarpc - Wednesday, January 26, 2011

    I'm curious why bad company 2 gets left out of Anand's CPU benchmarks. It seems to be a CPU dependent game. When I play it all four cores are nearly maxed out while my GPU barely reaches 60% usage. Where most other games seem to be the opposite.
  • Kidster3001 - Friday, January 28, 2011

    Nice article. It cleared up much about the new chips I had questions on.

    A suggestion. I have worked in the chip making business. Perhaps you could run an article on how bin-splits and features are affected by yields and defects. Many here seem to believe that all features work on all chips (but the company chooses to disable them) when that is not true. Some features, such as virtualization, are excluded from SKU's for a business reason. These are indeed disabled by the manufacturer inside certain chips (they usually use chips where that feature is defective anyway, but can disable other chips if the market is large enough to sell more). Other features, such as less cache or lower speeds are missing from some SKU's because those chips have a defect which causes that feature to not work or not to run as fast in those chips. Rather than throwing those chips away, companies can sell them at a cheaper price. i.e. Celeron -> 1/2 the cache in the chip doesn't work right so it's disabled.

    It works both ways though. Some of the low end chips must come from better chips that have been down-binned, otherwise there wouldn't be enough low-end chips to go around.
  • katleo123 - Tuesday, February 1, 2011

    It is not expected to compete Core i7 processors to take its place.
    Sandy bridge uses fixed function processing to produce better graphics using the same power consumption as Core i series.
    visit http://www.techreign.com/2010/12/intels-sandy-brid...
  • jmascarenhas - Friday, February 4, 2011

    Problem is we need to choose between using integrated GPU where we have to choose a H67 board or do some over clocking with a P67. I wonder why we have to make this option... this just means that if we dont do gaming and the 3000 is fine we have to go for the H67 and therefore cant OC the processor.....
  • jmascarenhas - Monday, February 7, 2011

    and what about those who want to OC and dont need a dedicated Graphic board??? I understand Intel wanting to get money out of early adopters, but dont count on me.
  • fackamato - Sunday, February 13, 2011

    Get the K version anyway? The internal GPU gets disabled when you use an external GPU AFAIK.
