Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. The reason this is transcoding and not encoding is that the source is almost always already in some compressed format, the most common these days being H.264/AVC.

Transcoding is a particularly CPU-intensive task because of the three-dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically share many of the same elements, video compression algorithms look for data that's repeated temporally as well as spatially.
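To make the temporal side of that concrete, here's a minimal sketch (illustrative only; no real codec, and certainly not Quick Sync, works this simply) of why encoding the difference between consecutive frames is so much cheaper than encoding each frame from scratch:

```python
# Illustrative sketch: consecutive video frames usually differ only slightly,
# so encoding the *difference* from the previous frame takes far fewer
# symbols than encoding every pixel of every frame.

def frame_delta(prev, curr):
    """Per-pixel difference between two frames (flat lists of 0-255 values)."""
    return [c - p for p, c in zip(prev, curr)]

def nonzero_cost(values):
    """Crude proxy for coding cost: count of nonzero values to encode."""
    return sum(1 for v in values if v != 0)

# A tiny 4x4 "frame" and the next frame, where only one pixel changed.
frame0 = [10, 10, 10, 10,
          10, 200, 10, 10,
          10, 10, 10, 10,
          10, 10, 10, 10]
frame1 = list(frame0)
frame1[5] = 210  # the bright pixel got slightly brighter

delta = frame_delta(frame0, frame1)
print(nonzero_cost(frame1))  # 16: every pixel needs encoding from scratch
print(nonzero_cost(delta))   # 1: only the changed pixel needs encoding
```

Real codecs go much further - motion compensation, transforms, entropy coding - but the frame-to-frame redundancy this exploits is the same.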

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the transcode process is decoding the source video. Since transcoding takes video that's already in a compressed format and encodes it in a new one, hardware accelerated video decode is key. How fast the decode engine runs has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
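Back-of-the-envelope arithmetic shows why decode throughput bounds transcode time. The numbers below are assumptions chosen for illustration (a 120 fps decode rate is not a measured Quick Sync figure):

```python
# Why decode speed matters: during playback the decoder only has to keep up
# with the video's frame rate, but during a transcode it can run flat out,
# so (total frames) / (decode throughput) bounds how fast the job finishes.
# The throughput figure below is an illustrative assumption, not a benchmark.

PLAYBACK_FPS = 23.976          # the video's native frame rate
DECODE_THROUGHPUT_FPS = 120.0  # assumed hardware decode rate, running flat out

movie_seconds = 2 * 60 * 60                    # a two-hour movie
total_frames = movie_seconds * PLAYBACK_FPS    # ~172,627 frames

realtime_decode = total_frames / PLAYBACK_FPS  # paced like playback
flat_out_decode = total_frames / DECODE_THROUGHPUT_FPS

print(f"{realtime_decode / 3600:.1f} h")   # 2.0 h: no faster than watching it
print(f"{flat_out_decode / 60:.1f} min")   # 24.0 min: bounded by decode speed
```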

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient way to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA now have fixed function video decode hardware in their GPUs; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The graph below shows the maximum number of streams I could run in parallel:

Maximum number of parallel 1080p High Profile streams:
  Intel Core i5-2500K: 5 streams
  NVIDIA GeForce GTX 460: 3 streams
  AMD Radeon HD 6870: 1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in tandem.

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the location and intensity of each pixel) to the frequency domain (how quickly pixel values change across the frame) using a discrete cosine transform (DCT). You may remember that inverse DCT hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
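For the curious, here is a minimal 1-D DCT sketch in plain Python (the hardware works on 2-D blocks, but the idea is the same): a smooth run of pixels concentrates nearly all of its energy in a few low-frequency coefficients, which is what makes the frequency domain so amenable to compression.

```python
# A toy orthonormal DCT-II and its inverse, showing the spatial -> frequency
# transform described above. Illustrative only; real codecs use fast 2-D
# integer transforms in fixed function hardware.
import math

def dct(block):
    """Orthonormal DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(block))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(coeffs):
    """Inverse (DCT-III) of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1 / n)
        s += sum(coeffs[k] * math.sqrt(2 / n) *
                 math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out

# A smooth 8-pixel gradient: nearly all energy lands in coefficients 0 and 1.
pixels = [100, 110, 120, 130, 140, 150, 160, 170]
coeffs = dct(pixels)
low_energy = sum(c * c for c in coeffs[:2])
total_energy = sum(c * c for c in coeffs)
print(low_energy / total_energy)  # > 0.99 for this smooth ramp

# Round trip: the inverse transform recovers the original pixels.
restored = idct(coeffs)
print(all(abs(a - b) < 1e-6 for a, b in zip(pixels, restored)))  # True
```

Encoders then quantize away the small high-frequency coefficients, which is where most of the compression comes from.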

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that make up Intel's Quick Sync engine.
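As a rough illustration of what motion search does (a toy exhaustive block-matching search, not Intel's fixed function implementation): for each block in the current frame, find the offset into the previous frame that minimizes the sum of absolute differences (SAD).

```python
# Toy exhaustive block-matching motion search. Illustrative only: real
# encoders use hierarchical/fast search strategies, sub-pixel refinement,
# and dedicated hardware, but the cost metric (SAD) is the same idea.

def sad(a, b):
    """Sum of absolute differences between two equal-size blocks."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]

def motion_search(prev, curr, top, left, size, radius):
    """Search a (2*radius+1)^2 window in prev; return best (dy, dx, sad)."""
    target = get_block(curr, top, left, size)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > len(prev) or x + size > len(prev[0]):
                continue  # candidate block falls outside the frame
            cost = sad(get_block(prev, y, x, size), target)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best

# Previous frame: a bright 2x2 patch at (2, 2); current frame: the same
# patch shifted one pixel to the right.
prev = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (2, 3):
        prev[y][x] = 255
curr = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (3, 4):
        curr[y][x] = 255

print(motion_search(prev, curr, 2, 3, 2, 2))  # (0, -1, 0): exact match found
```

The encoder then stores just the motion vector plus the (often near-zero) residual, instead of the block itself.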


283 Comments


  • Kevin G - Monday, January 3, 2011 - link

    There is the Z67 chipset which will allow both overclocking and integrated video. However, this chipset won't arrive until Q2.
  • Tanel - Monday, January 3, 2011 - link

    Well, yes, but one wonders who came up with this scheme in the first place. Q2 could be half a year from now.
  • teohhanhui - Monday, January 3, 2011 - link

    I've been thinking the same thing while reading this article... It makes no sense at all. Bad move, Intel.
  • micksh - Monday, January 3, 2011 - link

    Exactly my thoughts. No Quick Sync for enthusiasts right now - that's a disappointment. I think it should be stated more clearly in the review.
    Another disappointment - missing 23.976 fps video playback.
  • has407 - Monday, January 3, 2011 - link

    Yeah, OK, lack of support for VT-d ostensibly sucks on the K parts, but as previously posted, I think there may be good reasons for it. But let's look at it objectively...

    1. Do you have an IO-intensive VM workload that requires VT-d?
    2. Is the inefficiency/time incurred by the lack of VT-d support egregious?
    3. Does your hypervisor, BIOS and chipset support VT-d?

    IF you answered "NO" or "I don't know" to any of those questions, THEN what does it matter? ELSE IF you answered "YES" to all of those questions, THEN IMHO SB isn't the solution you're looking for. END IF. Simple as that.

    So because you--who want that feature and the ability to OC--which is likely 0.001% of the customers who are too cheap to spend the $300-400 for a real solution--the vendor should spend 10-100X to support that capability--which will thus *significantly* increase the cost to the other 99.999% of the customers. And that makes sense how and to whom (other than you and the other 0.0001%)?

    IMHO you demand a solution at no extra cost to a potential problem you do not have (or have not articulated); or you demand a solution at no extra cost to a problem you have and for which the market is not yet prepared to offer at a cost you find acceptable (regardless of vendor).
  • Tanel - Tuesday, January 4, 2011 - link

    General best practice is not to feed the trolls - but in this case your arguments are so weak I will go ahead anyway.

    First off, I like how you - without having any insight into my usage profile - question my need for VT-d and choose to call it "let's look at it objectively".

    VT-d is excellent when...
    a) developing hardware drivers and trying to validate functionality on different platforms.
    b) fooling around with GPU passthrough, something I did indeed hope to deploy with SB.

    So yes, I am in need of VT-d - "Simple as that".

    Secondly, _all_ the figures you've presented are pulled out of your ass. I'll be honest, I had a hard time following your argument as much of what you said makes no sense.

    So I should spend more money to get an equivalent retail SKU? Well then Sir, please go ahead and show me where I can get a retail SB SKU clocked at >4.4GHz. Not only that, you're in essence implying that people only overclock because they're cheap. In case you've missed it, it's the enthusiasts buying high-end components that enable much of the next-gen research and development.

    The rest - especially the part with 10-100X cost implication for vendors - is the biggest pile of manure I've come across on Anandtech. What we're seeing here is a vendor stripping off already existing functionality from a cheaper unit while at the same time asking for a premium price.

    If I were to make a car analogy, it'd be the same as if Ferrari sold the 458 in two versions. One with a standard engine, and one with a more powerful engine that lacks headlights. By your reasoning - as my usage profile is in need of headlights - I'd just have to settle for the tame version. Not only would Ferrari lose the added money they'd get from selling a premium version, they would lose a sale, as I'd rather wait until they present a version that fits my needs. I sure hope you're not running a business.

    There is no other way to put it, Intel fucked up. I'd be jumping on the SB-bandwagon right now if it wasn't for this. Instead, I'll be waiting.
  • has407 - Tuesday, January 4, 2011 - link

    Apologies, didn't mean to come across as a troll or in-your-face idjit (although I admittedly did--lesson learned). Everyone has different requirements/demands, and I presumed and assumed too much when I should not have, and should have been more measured in my response.

    You're entirely correct to call me on the fact that I know little or nothing about your requirements. Mea culpa. That said, I think SB is not for the likes of you (or I). While it is a "mainstream" part, it has a few too many warts.

    Does that mean Intel "fucked up"? IMHO no--they made a conscious decision to serve a specific market and not serve others. And no, that "10-100X" is not hot air but based on costing from several large scale deployments. Frickin amazing what a few outliers can do to your cost/budget.
  • Akv - Monday, January 3, 2011 - link

    I didn't have time to read all reviews, and furthermore I am not sure I will be able to express what I mean with the right nuances, since English is not my first language.

    For the moment I am a bit disappointed. To account for my relative coldness, it is important to explain where I start from:

    1) For gaming, I already have more than I need with a quad core 775 and a recent ATI 6000 series graphics card.

    2) For office work, I already have more than I need with an i3 clarkdale.

    Therefore since I am already equipped, I am of course much colder than those who need to buy a new rig just now.

    Also, the joy of trying out a new processor must be tempered with several considerations:

    1) With Sandy Bridge, you have to add the price of a new mobo to the price of the processor. That makes it much more expensive. And you are not even sure that socket 1155 will be kept for Ivy Bridge. That is annoying.

    2) There are always other valuable things that you can buy for a rig, apart from sheer processor horsepower: more storage, a better monitor...

    3) The performance improvement that comes with Sandy Bridge is what I call a normal improvement for a new generation of processors. It is certainly not a quantum leap in the nature of processors.

    Now, there are two things I really dislike:

    1) If you want to use P67 with a graphics card, you still have that piece of hardware, the IGP, that you actually bought and cannot use. That seems to me extremely inelegant compared to the 775 generation of processors. It is not an elegant architecture.

    2) If you want to use H67 and the Intel IGP for office work and movies, the improvement compared to Clarkdale is not sufficient to justify buying a new processor and a new mobo. With H67 you will be able to do office work fluently and watch video almost perfectly, but with Clarkdale you already could.

    The one thing that I like is the improvement in power consumption. Otherwise it all seems to me a bit awkward.
  • sviola - Monday, January 3, 2011 - link

    Well, the IGP not being removable is like having on-board sound but also using a dedicated sound card. Not much of a deal, since you can't buy a motherboard without integrated sound nowadays...
  • Shadowmaster625 - Monday, January 3, 2011 - link

    You say you want Intel to provide a $70 gpu. Well, here's a math problem for you: if the gpu on a 2600K is about 22% of the die, and the chip costs $317 retail, then how much are you paying for the gpu? If you guessed $70, you win! Congrats, you just paid $70 for a crap gpu. The question is... why? There is no tock here... only ridiculously high margins for Intel.
