Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. It’s transcoding rather than encoding because the source is almost always already encoded in some sort of compressed format, these days most commonly H.264/AVC.
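
As a concrete illustration of that decode-then-re-encode flow, here is a minimal software transcode loop sketched with the PyAV library (the article doesn't use this; the file names, resolution, and bitrate are placeholder assumptions):

    import av  # PyAV: Python bindings to FFmpeg's libraries

    src = av.open("input.mkv")               # e.g. a 15 Mbps 1080p H.264 source
    dst = av.open("output.mp4", mode="w")

    # Target a smaller, more device-friendly stream.
    out = dst.add_stream("h264", rate=24)
    out.width, out.height = 1280, 720
    out.bit_rate = 4_000_000                 # ~4 Mbps

    for frame in src.decode(video=0):        # step 1: decode the source
        frame = frame.reformat(width=1280, height=720)
        for packet in out.encode(frame):     # step 2: re-encode the raw frame
            dst.mux(packet)

    for packet in out.encode():              # flush any buffered frames
        dst.mux(packet)
    dst.close()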

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed on its own; however, since sequential frames of video typically share many of the same elements, video compression algorithms look for data that’s repeated temporally as well as spatially.
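
A toy NumPy sketch of the temporal half of that idea: rather than storing each frame whole, an encoder can store the difference from the previous frame, which is mostly zeros (the frame contents below are made-up stand-ins):

    import numpy as np

    # Two consecutive 1080p luma frames; only a small region "moves".
    prev = np.random.randint(0, 256, (1080, 1920)).astype(np.int16)
    curr = prev.copy()
    curr[500:540, 900:964] += 10    # the only change between the frames

    # The temporal residual is zero almost everywhere, so it takes far
    # fewer bits to encode than the full frame would.
    residual = curr - prev
    print(np.count_nonzero(residual), "of", residual.size, "pixels changed")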

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. How fast the decode engine is has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario, where you only need to decode as fast as the video’s frame rate, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
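
A rough sketch of the first point: in a transcode, the decoder is a producer running flat out, so a faster decoder keeps the encoder fed (decode_stream and encode_frame below are toy stand-ins, not a real API):

    import queue
    import threading
    import time

    def decode_stream(path):
        """Stand-in decoder: yields 'frames' as fast as it can."""
        for i in range(240):
            yield i                  # a real decoder would yield pixel data

    def encode_frame(frame):
        time.sleep(0.001)            # stand-in for per-frame encode work

    frames = queue.Queue(maxsize=32) # decoded frames waiting for the encoder

    def producer():
        # Unlike playback, nothing throttles this loop to 23.976 fps;
        # it runs as fast as the decode hardware allows.
        for frame in decode_stream("source.mkv"):
            frames.put(frame)
        frames.put(None)             # end-of-stream sentinel

    threading.Thread(target=producer, daemon=True).start()
    while (frame := frames.get()) is not None:
        encode_frame(frame)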

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient way to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA now have fixed function video decode hardware in their GPUs; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine, I launched multiple instances of a 15 Mbps, 1080p, high profile H.264 video running at 23.976 fps, and kept launching instances until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

    Decoder                   Number of Parallel 1080p HP Streams
    Intel Core i5-2500K       5 streams
    NVIDIA GeForce GTX 460    3 streams
    AMD Radeon HD 6870        1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in tandem.
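
For the curious, here is a hypothetical reconstruction of that saturation test. The article launched actual playback instances; this sketch approximates the idea with decode-only ffmpeg runs, and the clip name and frame count are assumptions:

    import subprocess
    import time

    CLIP = "bd_1080p_hp.mkv"   # assumed 15 Mbps high profile 1080p test clip
    CLIP_FRAMES = 1439         # assumed ~60 s of video at 23.976 fps
    TARGET_FPS = 23.976

    def holds_full_rate(n):
        """Decode n copies of the clip at once; True if each kept pace."""
        t0 = time.time()
        procs = [subprocess.Popen(
            ["ffmpeg", "-hwaccel", "auto", "-i", CLIP, "-f", "null", "-"],
            stderr=subprocess.DEVNULL) for _ in range(n)]
        for p in procs:
            p.wait()
        return CLIP_FRAMES / (time.time() - t0) >= TARGET_FPS

    n = 1
    while holds_full_rate(n + 1):
        n += 1
    print("max simultaneous full-rate streams:", n)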

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the location of each pixel) to the frequency domain (how often pixels of a certain color appear); this is done using a discrete cosine transform (DCT). You may remember that inverse DCT hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
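
The spatial-to-frequency transform described here is the same DCT you can run on any 8x8 block of pixels; a quick SciPy sketch (the block values are random stand-ins):

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.randint(0, 256, (8, 8)).astype(float)  # spatial domain

    coeffs = dctn(block, norm="ortho")      # forward DCT: spatial -> frequency
    restored = idctn(coeffs, norm="ortho")  # inverse DCT: frequency -> spatial

    # The transform itself is lossless; codecs get their compression by
    # quantizing away the high-frequency coefficients afterwards.
    assert np.allclose(block, restored)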

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
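
To give a sense of why motion search dominates the workload, here is a naive exhaustive block-matching sketch; real encoders (and Quick Sync's EU kernels) use far smarter search patterns than this brute-force version:

    import numpy as np

    def motion_search(ref, cur, by, bx, block=16, radius=8):
        """Find the offset (dy, dx) in the reference frame whose block
        best predicts the current block, by minimum sum of absolute
        differences (SAD)."""
        target = cur[by:by + block, bx:bx + block].astype(np.int32)
        best_sad, best_mv = None, (0, 0)
        # Every macroblock tests (2*radius + 1)^2 candidate positions,
        # each costing a full block comparison -- hence the huge compute cost.
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = by + dy, bx + dx
                if (y < 0 or x < 0 or
                        y + block > ref.shape[0] or x + block > ref.shape[1]):
                    continue
                cand = ref[y:y + block, x:x + block].astype(np.int32)
                sad = np.abs(cand - target).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        return best_mv  # motion vector; only the small residual is encoded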

283 Comments

  • Taft12 - Tuesday, January 4, 2011 - link

    You first.
  • ReaM - Tuesday, January 4, 2011 - link

    The six-core 980X still owns them in all tests where all cores are used.

    I don't know, 22k in Cinebench is really not a reason to buy the new i7; I reach 24k on air with an i7 860, and my i5 runs at 20k on air.

    Short-term performance is really good, but I don't care whether I wait 7 seconds or 8 for a package to unpack, and for long-term work like rendering there's no reason to upgrade either.

    I recommend you get an older 1156 setup off eBay and save a ton of money.

    I have the i5 in a hackintosh; I'm wondering if 1155 will be hackintoshable.
  • Spivonious - Tuesday, January 4, 2011 - link

    I have to disagree with Anand; I feel the QuickSync image is the best of the four in all cases. Yes, there is some edge-softening going on, so you lose some of the finer detail that ATi and SNB give you, but when viewing on a small screen such as one on an iPhone/iPod, I'd rather have the smoothed-out shapes than pixel-perfect detail.
  • wutsurstyle - Tuesday, January 4, 2011 - link

    I started my computing days with Intel, but I'm so put off by the way Intel is marketing their new toys. Get this, but you can't have that... buy that, but your purchase must include other things. And even after I throw my wallet at Intel, I still would not have an OC'd Sandy Bridge with a useful IGP and Quick Sync. But wait, throw more money at a Z68 a little later. Oh... and there's a shiny new LGA 2011 in the works. Anyone worried that they started naming sockets after the year they come out? Yay for spending!

    AMD..please save us!
  • MrCrispy - Tuesday, January 4, 2011 - link

    Why the bloody hell don't the K parts support VT-d?! I can only imagine it will be introduced at a price premium in a later part.
  • slick121 - Tuesday, January 4, 2011 - link

    Wow I just realized this. I really hate this type of market segmentation.
  • Navier - Tuesday, January 4, 2011 - link

    I'm a little confused why Quick Sync needs to have a monitor connected to the MB to work. I'm trying to understand why having a monitor connected is so important for video transcoding, vs. playback etc.

    Is this a software limitation? Either in the UEFI (BIOS) or drivers? Or something more systemic in the hardware.

    What happens on a P67 motherboard? Does the P67 board disable the on-die GPU, effectively disabling Quick Sync support? This seems a very unfortunate oversight for such a promising feature. Will a future driver/firmware update resolve this limitation?

    Thanks
  • NUSNA_moebius - Tuesday, January 4, 2011 - link

    Intel HD 3000 - ~115 Million transistors
    AMD Radeon HD 3450 - 181 Million transistors - 8 SIMDs
    AMD Radeon HD 4550 - 242 Million transistors - 16 SIMDs
    AMD Radeon HD 5450 - 292 Million transistors - 16 SIMDs
    AMD Xenos (Xbox 360 GPU) - 232 Million transistors + 105 Million (eDRAM daughter die) = 337 Million transistors - 48 SIMDs

    Xenos, I think, is in the end still a good two to two and a half times more powerful than the Radeon 5450. Xenos does not have to be OpenCL, DirectCompute, DX11, or even fully DX10 compliant (a 50 million transistor jump from the 4550, going from DX10.1 to DX11), nor does it contain hardware video decode or integrated HDMI output with a 5.1 audio controller (even the old Radeon 3200 clocks in at 150+ million transistors). What I would like some clarification on is whether the transistor count for Xenos includes northbridge functions.

    Clearly PC GPUs have insane transistor counts in order to be highly compatible. It is commendable how well the Intel HD 3000 does with only 115 Million, but it's important to note that older products like the X1900 had 384 Million transistors back when DX9.0c was the aim, and in pure throughput it should match or closely trail Xenos at 500 MHz. Going from the 3450 to the 4550, we go up another 60 million transistors for 8 more SIMDs of a similar DX10.1-compatible nature, as well as the probable increases for hardware video decode, etc. So to come into a similar order as Xenos in terms of SIMD count (of which Xenos has 48 of its own type, I must emphasize), we would need 60 million transistors per 8 SIMDs, which puts us at about 360 million transistors for a 48 SIMD (240 SP) AMD part that is DX10.1 compatible and not equipped with anything unrelated to graphics processing.

    Yes, it's a most basic comparison (and probably fundamentally wrong in some regards), but I think it sheds some light on the idea that the Radeon HD 5450 really still pales in comparison to the Xenos. We have much better GPUs like Redwood that are twice as powerful with their higher clock speeds + 400 SPs (627 Million transistors total) and consume less energy than Xenos ever did. Of course, this isn't taking memory bandwidth or framebuffer size into account, nor the added benefits of console optimization.
  • frankanderson - Tuesday, January 4, 2011 - link

    I'm still rocking my Q6600 + Gigabyte X38 DS5 board, upgraded to a GTX 580, and have been waiting for Sandy; definitely looking forward to this once the dust settles...

    Thanks Anand...
  • Spivonious - Wednesday, January 5, 2011 - link

    I'm still on E6600 + P965 board. Honestly, I would upgrade my video card (HD3850) before doing a complete system upgrade, even with Sandy Bridge being so much faster than my old Conroe. I have yet to run a game that wasn't playable at full detail. Maybe my standards are just lower than others.
