Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. The reason this is transcoding and not encoding is that the source is almost always already encoded in some sort of compressed format, the most common these days being H.264/AVC.

Transcoding is a particularly CPU intensive task because of the three-dimensional nature of the compression. Each individual frame within a video can be compressed on its own; however, since sequential frames typically share many of the same elements, video compression algorithms exploit data that's repeated temporally as well as spatially.
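To see why temporal redundancy matters, here is a small Python sketch (using synthetic byte-array "frames" and generic zlib compression, not a real video codec) that compresses two nearly identical frames independently and then compresses only the difference between them:

```python
import zlib

# Two synthetic 64x64 grayscale "frames": frame2 is frame1 with a
# handful of pixels nudged, mimicking slight motion between frames.
frame1 = bytes((x * y) % 256 for y in range(64) for x in range(64))
frame2 = bytearray(frame1)
for i in range(64):
    frame2[i * 64] = (frame2[i * 64] + 1) % 256

# Spatial-only: compress each frame on its own.
independent = len(zlib.compress(frame1)) + len(zlib.compress(bytes(frame2)))

# Temporal: compress frame1 plus only the per-pixel delta to frame2.
delta = bytes((b - a) % 256 for a, b in zip(frame1, frame2))
temporal = len(zlib.compress(frame1)) + len(zlib.compress(delta))

print(independent, temporal)  # the delta stream compresses far smaller
```

The delta is almost entirely zeros, so it compresses to nearly nothing; real codecs push the same idea much further with motion-compensated prediction between frames.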

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. How fast a decode engine is has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient way to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The graph below shows the maximum number of streams I could run in parallel:

Maximum Number of Parallel 1080p High Profile Streams:

  Intel Core i5-2500K:     5 streams
  NVIDIA GeForce GTX 460:  3 streams
  AMD Radeon HD 6870:      1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in tandem.

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are actually fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (location of each pixel) to the frequency domain (how often pixels of a certain color appear); this is done by the use of a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; well, that same hardware is useful in the domain transform needed when transcoding.
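As a rough sketch of that domain transform (pure Python for clarity; this is not how the fixed function hardware or EU array actually implements it), here is the 2D DCT-II applied to an 8x8 block, followed by the inverse transform (iDCT) to recover the original pixels:

```python
import math

N = 8  # standard 8x8 transform block size

def alpha(k):
    # Normalization factor for the orthonormal DCT-II.
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct_1d(vec, inverse=False):
    out = []
    for k in range(N):
        if inverse:
            # iDCT (DCT-III): weighted sum over frequency coefficients.
            s = sum(alpha(n) * vec[n] * math.cos(math.pi * (2 * k + 1) * n / (2 * N))
                    for n in range(N))
        else:
            # DCT-II: project the samples onto cosine basis functions.
            s = alpha(k) * sum(vec[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                               for n in range(N))
        out.append(s)
    return out

def dct_2d(block, inverse=False):
    # Separable transform: 1D DCT on every row, then on every column.
    rows = [dct_1d(row, inverse) for row in block]
    cols = [dct_1d(col, inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# A simple 8x8 "pixel" block with a smooth gradient.
block = [[x + y for x in range(N)] for y in range(N)]
coeffs = dct_2d(block)                    # spatial -> frequency domain
restored = dct_2d(coeffs, inverse=True)   # frequency -> spatial (iDCT)
```

For smooth content like this gradient, the energy concentrates in the low-frequency (top-left) coefficients, which is what makes the frequency domain representation so compressible; the inverse transform recovers the original block, which is why decode-side iDCT hardware shares so much with the encode-side DCT.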

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that make up Intel's Quick Sync engine.
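Motion search at its simplest is block matching: for each block of the current frame, find the best-matching block in the previous frame within a search window, typically by minimizing the sum of absolute differences (SAD). A toy pure-Python version with made-up frame data and a brute-force search (real encoders, and Quick Sync's EU-based implementation, use far smarter search strategies):

```python
# Toy full-search block matching: find the motion vector minimizing the
# sum of absolute differences (SAD) between an 8x8 block of the current
# frame and candidate blocks in the reference (previous) frame.

W, H, B, R = 32, 32, 8, 4  # frame width/height, block size, search radius

def sad(ref, cur, rx, ry, cx, cy):
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(B) for i in range(B))

def motion_search(ref, cur, cx, cy):
    best = (None, float("inf"))
    for dy in range(-R, R + 1):          # exhaustive search window
        for dx in range(-R, R + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= W - B and 0 <= ry <= H - B:
                cost = sad(ref, cur, rx, ry, cx, cy)
                if cost < best[1]:
                    best = ((dx, dy), cost)
    return best

# Reference frame with a textured pattern; the current frame is the
# same content shifted right by 2 pixels and down by 1 pixel.
ref = [[(3 * x + 7 * y) % 256 for x in range(W)] for y in range(H)]
cur = [[ref[max(y - 1, 0)][max(x - 2, 0)] for x in range(W)] for y in range(H)]

mv, cost = motion_search(ref, cur, 16, 16)
print(mv, cost)  # the best vector points back at the shifted source block
```

Even this tiny example does B*B comparisons for every candidate position of every block, which is why motion search dominates encode time and is the part worth offloading to parallel hardware.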

Comments (283)

  • GeorgeH - Monday, January 3, 2011 - link

    With the unlocked multipliers, the only substantive difference between the 2500K and the 2600K is hyperthreading. Looking at the benchmarks here, it appears that at equivalent clockspeeds the 2600K might actually perform worse on average than the 2500K, especially if gaming is a high priority.

    A short article running both the 2500K and the 2600K at equal speeds (say "stock" @3.4GHz and overclocked @4.4GHz) might be very interesting, especially as a possible point of comparison for AMD's SMT approach with Bulldozer.

    Right now it looks like if you're not careful you could end up paying ~$100 more for a 2600K instead of a 2500K and end up with worse performance.
  • Gothmoth - Monday, January 3, 2011 - link

And what benchmarks are you speaking of?

    As Anand wrote, HT has no negative influence on performance.
  • GeorgeH - Monday, January 3, 2011 - link

The 2500K is faster in Crysis, Dragon Age, World of Warcraft and Starcraft II, despite being clocked slower than a 2600K. If it weren't for that clockspeed deficiency, it looks like it also might be faster in Left 4 Dead, Far Cry 2, and Dawn of War II. Just about the only games that look like a "win" for HT are Civ5 and Fallout 3.

    The 2500K also wins the x264 HD 3.03 1st Pass benchmark, and comes pretty close to the 2600K in a few others, again despite a clockspeed deficiency.

Intel's new "no overclocking unless you get a K" policy looks like it might be a double-edged sword. Ignoring the IGP stuff, the only difference between a 2500K and a 2600K is HT; if you're spending extra for a K you're going to be overclocking, making the 2500K's base clockspeed deficiency irrelevant. That means HT's deficiencies won't be able to hide behind lower clockspeeds and locked multipliers (as with the i5-7xx and i7-8xx).

    In the past HT was a no-brainer; it might have hurt performance in some cases but it also came with higher clocks that compensated for HT's shortcomings. Now that Intel has cut enthusiasts down to two choices, HT isn't as clear cut, especially if those enthusiasts are gamers - and most of them are.
  • Shorel - Monday, January 3, 2011 - link

    I don't ever watch soap operas (why somebody can enjoy such crap is beyond me) but I game a lot. All my free time is spent gaming.

High frame rate reminds me of good video cards (or games that are not cutting edge) and the so-called film 24p reminds me of the Michael Bay movies where stuff happens fast but you can't see anything, like in Transformers.

Please don't assume that your readers know or enjoy soap operas. Standard TV is for old people, and movies look amazing at 120Hz when almost all you do is gaming.
  • mmcc575 - Monday, January 3, 2011 - link

    Just want to say thanks for such a great opening article on desktop SNB. The VS2008 benchmark was also a welcome addition!

    SNB launch and CES together must mean a very busy time for you, but it would be great to get some clarification/more in depth articles on a couple of areas.

    1. To clarify, if the LGA-2011 CPUs won't have an on-chip GPU, does this mean they will forego arguably the best feature in Quick Sync?

    2. Would be great to have some more info on the Overclocking of both the CPU and GPU, such as the process, how far you got on stock voltage, the effect on Quick Sync and some OC'd CPU benchmarks.

    3. A look at the PQ of the on-chip GPU when decoding video compared to discrete low-end rivals from nVidia and AMD, as it is likely that the main market for this will be those wanting to decode video as opposed to play games. If you're feeling generous, maybe a run through the HQV benchmark? :P

    Thanks for reading, and congrats again for having the best launch-day content on the web.
  • ajp_anton - Monday, January 3, 2011 - link

    In the Quantum of Solace comparison, x86 and Radeon screens are the same.

    I dug up a ~15Mbit 1080p clip with some action and transcoded it to 4Mbit 720p using x264. So entirely software-based. My i7 920 does 140fps, which isn't too far away from Quick Sync. I'd love to see some quality comparisons between x264 on fastest settings and QS.
  • ajp_anton - Monday, January 3, 2011 - link

    Also, in the Dark Knight comparison, it looks like the Radeon used the wrong levels (so not the encoder's fault). You should recheck the settings used both in the encoder and when you took the screenshot.
  • testmeplz - Monday, January 3, 2011 - link

Thanks for the great review! I believe the colors in the legend of the graphs on the Graphics overclocking page are mixed up.

    Thanks,
    Chris
  • politbureau - Monday, January 3, 2011 - link

    Very concise. Cheers.

One thing I miss is clock-for-clock benchmarks to highlight the effect of architectural changes. Though not perhaps within the scope of this review, it would nonetheless be interesting to see how SNB fares against Bloomfield and Lynnfield at similar clock speeds.

    Cheerio
  • René André Poeltl - Monday, January 3, 2011 - link

Good performance at a bargain price - that was AMD's terrain.

Now Sandy Bridge at ~$200 targets AMD's clientele. A Core i5-2500K for $216 - that's a bargain (a $40-value GPU is even included). And the overclocking ability!

If I understood it correctly, the Intel Core i7 2600K @ 4.4GHz drawing 111W under load is quite efficient. At 3.4GHz it draws 86W; 4.4GHz is ~30% more clock for ~30% more performance at ~30% more power, which would mean it scales roughly 1:1 in power consumption vs. performance.

Many people need more performance per core, but not more cores. At 111W under load this would be the product they want - e.g. people who make music with PCs, not just playing MP3s but mixing and producing music.

But for more cores the X6 Thuban is the better choice on a budget. For building a server on a budget, for example, Intel has no product to rival it. The same goes for developers, who may want as many cores as they can get to test their apps' multithreading performance.
    AMD also scores with its more conservative approach to upgrades, e.g. motherboards. People don't like to buy a new motherboard every time they upgrade the CPU.
