Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. The reason this is transcoding and not encoding is that the source is almost always already encoded in some sort of compressed format; the most common these days is H.264/AVC.

Transcoding is a particularly CPU intensive task because of the three-dimensional nature of the compression. Each individual frame within a video can be compressed on its own; however, since sequential frames of video typically share many of the same elements, video compression algorithms exploit data that's repeated temporally as well as spatially.
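To make the temporal side of this concrete, here's a minimal sketch in Python (an illustration of the general idea, not anything resembling a real codec): when most pixels repeat from one frame to the next, encoding only the frame-to-frame residual leaves mostly zeros, which compress far better than the raw frame.

```python
import numpy as np

# Two synthetic "frames": the second is the first with a small 8x8 region
# changed, mimicking typical video where most pixels repeat frame to frame.
rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(64, 64), dtype=np.int16)
frame2 = frame1.copy()
frame2[10:18, 20:28] += 5           # only a small block actually changed

residual = frame2 - frame1          # temporal prediction: encode only the delta

# Almost all of the residual is zero, so it compresses far better than
# frame2 itself would.
nonzero = np.count_nonzero(residual)
print(nonzero, residual.size)       # 64 changed pixels out of 4096
```

Real codecs go much further (motion-compensated prediction rather than a plain difference), but the payoff is the same: the residual carries far less information than the frame.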

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding takes a video already in a compressed format and encodes it in a new one, hardware accelerated video decode is key. How fast the decode engine is has a tremendous impact on how fast a hardware accelerated video encode can run, for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither rely on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The graph below shows the maximum number of streams I could run in parallel:

  Number of Parallel 1080p High Profile Streams
  Intel Core i5-2500K       5 streams
  NVIDIA GeForce GTX 460    3 streams
  AMD Radeon HD 6870        1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in tandem.

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how rapidly pixel values vary across the frame) via a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; well, that same hardware is useful in the domain transform needed when transcoding.
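The spatial-to-frequency transform can be sketched in a few lines of Python using SciPy's DCT (this is the generic 2D DCT-II that block-based codecs build on, not Intel's hardware implementation). A flat block is the extreme case: all of its energy lands in the single DC coefficient, which is why smooth image regions compress so well.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2D DCT-II: transform rows, then columns, as block-based codecs do."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

# A flat 8x8 block (every pixel the same value): all of its energy ends up
# in the single DC coefficient; the 63 AC coefficients are zero.
flat = np.full((8, 8), 128.0)
coeffs = dct2(flat)
print(coeffs[0, 0])                                   # 1024.0 (128 * 8, orthonormal scaling)
print(np.count_nonzero(np.round(coeffs, 6)))          # 1 nonzero coefficient
```

The inverse transform (IDCT) is the same machinery run backwards, which is why decode-side iDCT hardware is reusable on the encode side.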

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that make up Intel's Quick Sync engine.
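To illustrate why motion search dominates the compute budget, here is a minimal exhaustive block-matching sketch in Python (a textbook SAD search, not Quick Sync's algorithm; the function name and parameters are made up for illustration). Even this tiny example evaluates every candidate offset in a window for every block, and real encoders do this across thousands of blocks per frame.

```python
import numpy as np

def motion_search(ref, cur_block, top, left, radius=4):
    """Exhaustive block matching: find the (dy, dx) within +/-radius that
    minimizes the sum of absolute differences (SAD) against the reference frame."""
    h, w = cur_block.shape
    best, best_sad = (0, 0), float('inf')
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y+h, x:x+w].astype(int) - cur_block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
# The current frame's 8x8 block is reference content shifted by (2, 3).
cur_block = ref[10:18, 7:15]
print(motion_search(ref, cur_block, top=8, left=4))   # ((2, 3), 0): perfect match
```

Because every candidate offset is an independent SAD computation, the workload parallelizes well, which is exactly why it maps onto the EU array.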

283 Comments

  • dgingeri - Monday, January 3, 2011 - link

    I have a really good reason for X58: I/O

    I have 2x GTX 470 video cards and a 3Ware PCIe x4 RAID controller. None of the P67 motherboards I've seen would handle all that hardware, even with cutting the video cards' I/O in half.

    This chip fails in that one very important spot. If they had put a decent PCIe controller in it, with 36 PCIe lanes instead of 16, then I'd be much happier.
  • Exodite - Monday, January 3, 2011 - link

    That's exactly why this is the mainstream platform, while X58 is the enthusiast one, though. Your requirements aren't exactly mainstream; indeed, they're beyond what even most enthusiasts need.
  • sviola - Monday, January 3, 2011 - link

    You may want to look at the Gigabyte GA-P67A-UD5 and GA-P67A-UD7 as they can run your configuration.
  • Nihility - Monday, January 3, 2011 - link

    Considering the K versions of the CPUs don't have it.

    If I'm a developer and use VMs a lot, how important will VT-d be within the 3-4 years that I would own such a chip?

    I know that it basically allows direct access to hardware and I don't want to get stuck without it, if it becomes hugely important (Like how you need VT-x to run 64 bit guests).

    Any thoughts?
  • code65536 - Monday, January 3, 2011 - link

    My question is whether or not that chart is even right. I'm having a hard time believing that Intel would disable a feature in an "enthusiast" chip. Disabling features in lower-end cheaper chips, sure, but in "enthusiast" chips?! Unless they are afraid of those K series (but not the non-K, apparently?) cannibalizing their Xeon sales?
  • has407 - Monday, January 3, 2011 - link

    Relatively unimportant IMHO if you're doing development. If you're running a VM/IO-intensive production workload (which isn't likely with one of these), then it's more important.

    Remember, you need several things for VT-d to work:
    1. CPU support (aka "IOMMU").
    2. Chip-set/PCH support (e.g., Q57 has it, P57 does not).
    3. BIOS support (a number of vendor implementations are broken).
    4. Hypervisor support.

    Any of 1-3 might result in "No" for the K parts. Even though it *should* apply only to the CPU's capabilities, Intel may simply be saying it is not supported. (Hard to tell as the detailed info isn't up on Intel's ark site yet, and it would otherwise require examining the CPU capability registers to determine.)

    However, it's likely to be an intentional omission on Intel's part as, e.g., the i7-875K doesn't support VT-d either. As to why that might be, there are several possible reasons, many justifiable IMHO. Specifically, the K parts are targeted at people who are likely to OC, and OC'ing (even a wee bit, especially when using VT-d) may result in instability such as to make the system unusable.

    If VT-d is potentially important to you, then I suggest you work back through steps 4-1 above; all other things equal, 4-2 are likely to be far more important. If you're running VM/IO-intensive workloads where performance and VT-d capability are a priority, then IMHO whether you can OC the part will be 0 or -1 on the list of priorities.

    And while VT-d can make direct access to hardware a more effective option (again, assuming hypervisor support), its primary purpose is to make all IO more efficient in a virtualized environment (e.g., IOMMU and interrupt mapping). It's less a matter of "Do I have to have it to get to first base?" than "How much inefficiency am I willing to tolerate?" And again, unless you're running IO-intensive VM workloads in a production environment, the answer is probably "The difference is unlikely to be noticeable for the work [development] I do."

    p.s. code65536 -- I doubt Intel is concerned with OC'd SB parts cannibalizing Xeon sales. (I'd guess the count of potentially lost Xeon sales could be counted on two hands with fingers to spare. :) Stability is far more important than pure speed for anyone I know running VM-intensive loads and, e.g., the lack of ECC support on these parts is a deal killer for me. YMMV.
  • DanNeely - Tuesday, January 4, 2011 - link

    For as long as MS dev tools take to install, I'd really like to be able to do all my dev work in a VM backed up to the corporate lan to ease the pain of a new laptop and to make a loaner actually useful. Unfortunately the combination of lousy performance with MS VPC, and the inability of VPC to run two virtual monitors of different sizes mean I don't have a choice about running visual studio in my main OS install.
  • mino - Wednesday, January 5, 2011 - link

    VMware Workstation is what you need. VPC is for sadists.

    Even if your budget is 0 (zero), and VPC is free, KVM/QEMU might be a better idea.

    Also, Hyper-V, locally and via RDP, is pretty reasonable.
  • cactusdog - Monday, January 3, 2011 - link

    If we can't overclock the chipset, how do we get high memory speeds of 2000MHz+? Is there still a QPI/DRAM voltage setting?

  • Tanel - Monday, January 3, 2011 - link

    No VT-d on K-series? FFFFUUUU!

    So just because I want to use VT-d I'll also be limited to 6 EUs and have no possibility to overclock?

    Then there's the chipset issue. Even if I got the enthusiast-targeted K-series I would still need to get the:
    a) ...H67 chipset to be able to use the HD unit and QS capability - yet not be able to overclock.
    b) ...P67 chipset to be able to overclock - yet lose the QS capability and the point of having 6 extra EUs, as the HD unit can't be used at all.

    What the hell Intel, what the hell! This makes me furious.
