Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. This is transcoding rather than encoding because the source is almost always already in a compressed format, these days most commonly H.264/AVC.

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically have many of the same elements, video compression algorithms look at data that’s repeated temporally as well as spatially.
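To make the temporal redundancy concrete, here is a minimal Python sketch. It is not how any real codec stores frames; the frame size, pixel values, and the helper names `frame_delta` and `count_nonzero` are all made up for illustration. Two tiny grayscale frames differ only in a small region, so the difference between them carries far fewer nonzero values than the second frame does on its own.

```python
def frame_delta(prev, curr):
    """Per-pixel difference between two equally sized grayscale frames."""
    return [[c - p for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]


def count_nonzero(frame):
    """Number of nonzero values in a 2D list."""
    return sum(1 for row in frame for v in row if v != 0)


# Hypothetical 4x6 frames: a static background (value 50), with a bright
# 2x2 object that appears only in the second frame.
frame0 = [[50] * 6 for _ in range(4)]
frame1 = [row[:] for row in frame0]
for y in (1, 2):
    for x in (2, 3):
        frame1[y][x] = 200

delta = frame_delta(frame0, frame1)
print(count_nonzero(frame1), "nonzero values in the full second frame")
print(count_nonzero(delta), "nonzero values in the frame-to-frame delta")
```

Real encoders go further than a plain difference (they search for where each block moved), but the payoff is the same: most of a frame can be described by pointing at the previous one.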

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. The speed of the decode engine has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
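The first point can be sketched with a simplified throughput model. This assumes decode and encode run as perfectly pipelined stages, so the slower stage sets the pace; the frame count and the fps figures below are hypothetical and are not measurements from this article.

```python
def playback_ok(decode_fps, video_fps):
    """Playback only needs the decoder to keep up with the video's frame rate."""
    return decode_fps >= video_fps


def transcode_fps(decode_fps, encode_fps):
    """In a pipelined transcode, the slowest stage bounds overall throughput."""
    return min(decode_fps, encode_fps)


def transcode_seconds(num_frames, decode_fps, encode_fps):
    """Time to push num_frames through the decode -> encode pipeline."""
    return num_frames / transcode_fps(decode_fps, encode_fps)


# Hypothetical clip of 14,400 frames and a hypothetical 200 fps encoder.
frames = 14400
print(playback_ok(60, 23.976))              # a 60 fps decoder is plenty for playback
print(transcode_seconds(frames, 60, 200))   # but it caps the transcode at 60 fps
print(transcode_seconds(frames, 300, 200))  # a faster decoder lets the encoder set the pace
```

Under this model a decoder that is merely fast enough for playback becomes the bottleneck as soon as the encode stage is quicker than real time.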

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

                                     Intel Core i5-2500K   NVIDIA GeForce GTX 460   AMD Radeon HD 6870
Number of Parallel 1080p HP Streams  5 streams             3 streams                1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile, 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in parallel.
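The stream counts above can be turned into aggregate numbers with simple arithmetic. The sketch below multiplies each result by the test clip's stated 23.976 fps and 15Mbps. Because the test only checks that every stream sustains full frame rate, these figures should be read as lower bounds on decode throughput, not peak rates.

```python
STREAM_FPS = 23.976   # frame rate of the test clip
STREAM_MBPS = 15      # bitrate of the test clip

# Maximum parallel streams reported in the test.
max_streams = {
    "Intel Core i5-2500K": 5,
    "NVIDIA GeForce GTX 460": 3,
    "AMD Radeon HD 6870": 1,
}


def aggregate(streams):
    """Return (frames per second, Mbps) decoded across all parallel streams."""
    return streams * STREAM_FPS, streams * STREAM_MBPS


for name, streams in max_streams.items():
    fps, mbps = aggregate(streams)
    print(f"{name}: at least {fps:.2f} fps of 1080p decode, {mbps} Mbps of input")
```

Five simultaneous streams work out to roughly 120 decoded 1080p frames per second, about five times what playback of a single stream requires.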

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at each location) to the frequency domain (how quickly pixel values vary across a block); this is done with a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; well, that same hardware is useful in the domain transform needed when transcoding.
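For readers who want to see the domain transform in action, here is a small pure-Python sketch of a textbook orthonormal DCT-II applied to an 8x8 block. It illustrates the math only and makes no claim about how Intel's hardware implements it; the block contents and the function names `dct_1d` and `dct_2d` are assumptions for the example.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a sequence of samples."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(
            s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, s in enumerate(samples)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """2D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]   # cols[u][v]
    return [list(r) for r in zip(*cols)]         # back to coeffs[v][u]


# A perfectly flat 8x8 block: no variation, so all energy lands in the
# top-left (DC) coefficient.
flat = [[100] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat)

# A block with a vertical edge: values change only left to right, so only
# the first row of coefficients (horizontal frequencies) is nonzero.
edge = [[0, 0, 0, 0, 255, 255, 255, 255] for _ in range(8)]
edge_coeffs = dct_2d(edge)

print(round(flat_coeffs[0][0], 3))
print([round(c, 1) for c in edge_coeffs[0]])
```

Smooth regions collapse into a handful of coefficients, which is what makes the frequency domain a convenient place to discard detail during compression.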

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
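To give a feel for why motion search is so compute intensive, here is a generic exhaustive block-matching sketch using the sum of absolute differences (SAD). It is a common textbook approach and not a description of Intel's actual algorithm; the frame contents, block size, search range, and function names are all assumptions for illustration.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a block in ref and a block in cur."""
    return sum(
        abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
        for j in range(size)
        for i in range(size)
    )


def motion_search(ref, cur, cx, cy, size, search_range):
    """Try every offset within search_range; return (best_cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def make_frame(obj_x, obj_y, width=16, height=16, size=4):
    """Hypothetical frame: black background with one textured 4x4 object."""
    frame = [[0] * width for _ in range(height)]
    for j in range(size):
        for i in range(size):
            frame[obj_y + j][obj_x + i] = 100 + 10 * j + i
    return frame


ref = make_frame(4, 4)   # object at (4, 4) in the previous frame
cur = make_frame(6, 5)   # object moved 2 right and 1 down in the current frame

# Where did the block now at (6, 5) come from in the reference frame?
print(motion_search(ref, cur, 6, 5, 4, 3))
```

Even this toy search evaluates up to 49 candidate positions of 16 pixels each for a single 4x4 block. Scaling that to every block of every 1080p frame explains why the work is handed to a parallel array of execution units.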

283 Comments

  • aviat72 - Tuesday, January 4, 2011 - link

    Though SB will be great for some applications, there are still rough edges in terms of the overall platform. I think it will be best to wait for SNB-E or at least the Z68. SNB-E seems to be the best future-proofing bet.

    I also wonder how a part rated for 95W TDP was drawing 111W in the 4.4GHz OC (the Power Consumption Page). SB's power budget controller must be really smart to allow the higher performance without throttling down, assuming your cooling system can manage the thermals.
  • marraco - Tuesday, January 4, 2011 - link

    I wish to know more about this Sandy Bridge "feature":

    http://www.theinquirer.net/inquirer/news/1934536/i...
  • PeterO - Tuesday, January 4, 2011 - link

    Anand, Thanks for the great schooling and deep test results -- something surely representing an enormous amount of time to write, produce, and massage within Intel's bumped-forward official announcement date.

    Here's a crazy work-around question:

    Can I have my Quick Sync cake and eat my Single-monitor-with-Discrete-Graphics-card too if I, say:

    1). set my discrete card output to mirror Sandy Bridge's IGP display output;

    2). and, (should something exist), add some kind of signal loopback adapter to the IGP port to spoof the presence of a monitor? A null modem, of sorts?

    -- I have absolutely no mobo/video signaling background, so my idea may be laugh in my face funny to anybody who does but I figure it's worth a post, if only for your entertainment. :)
  • Hrel - Wednesday, January 5, 2011 - link

    It makes me SO angry when Intel does stupid shit like disable HT on most of their CPU's even though the damn CPU already has it on it, they already paid for. It literally wouldn't cost them ANYTHING to turn HT on those CPU's yet the greedy bastards don't do it.
  • Moizy - Wednesday, January 5, 2011 - link

    The HD Graphics 3000 performance is pretty impressive, but won't be utilized by most. Most who utilize Intel desktop graphics will be using the HD Graphics 2000, which is okay, but I ran back to the AMD Brazos performance review to get some comparisons.

    In Modern Warfare 2, at 1024 x 768, the new Intel HD Graphics 2000 in the Core i3 2100 barely bests the E-350. Hmm--that's when it's coupled with a full-powered, hyper-threaded desktop compute core that would run circles around the compute side of the Brazos E-350, an 18w, ultra-thin chip.

    This either makes Intel's graphics less impressive, or AMD's more impressive. For me, I'm more impressed with the graphics power in the 18w Brazos chip, and I'm very excited by what mainstream Llano desktop chips (65w - 95w) will bring, graphics-wise. Should be the perfect HTPC solution, all on the CPU (ahem, APU, I mean).

    I'm very impressed with Intel's video transcoding, however. Makes CUDA seem...less impressive, like a bunch of whoop-la. Scary what Intel can do when it decides that it cares about doing it.
  • andywuwei - Wednesday, January 5, 2011 - link

    not sure if anybody else noticed. CPU temp of the i5@3.2GHz is ~140 degrees. any idea why it is so high?
  • SantaAna12 - Wednesday, January 5, 2011 - link

    Did I miss the part where you tell us about the DRM built into this chip?
  • Cb422 - Wednesday, January 5, 2011 - link

    When will Sandy Bridge be available on Newegg or Amazon for me to purchase?
  • DesktopMan - Thursday, January 6, 2011 - link

    Very disappointed in the lack of vt-d and txt on k-variants. They are after all the high end products. I also find the fact that only the k-variants having the faster GPU very peculiar, as those are the CPUs most likely to be paired with a discrete GPU.
  • RagingDragon - Thursday, January 6, 2011 - link

    Agreed. I find the exclusion of VT-d particularly irritating: many of the overclockers and enthusiasts to whom the K chips are marketed also use virtualization. Though I don't expect many enthusiasts, if any, to miss TXT (it's more for locked down corporate systems, media appliances, game consoles, etc.).

    With the Z68 chipset coming in the indeterminate near future, the faster GPU on K chips would have made sense if the K chips came with every other feature enabled (i.e. if they were the "do everything chips").

    Also, I'd like to have the Sandy Bridge video encode/decode features separate from the GPU functionality - i.e. I'd like to choose between Intel and Nvidia/AMD video decode/encode when using a discrete GPU.
