Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. This is transcoding rather than encoding because the source is almost always already stored in some sort of compressed format, most commonly H.264/AVC these days.

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically have many of the same elements, video compression algorithms look at data that’s repeated temporally as well as spatially.
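To make the temporal part of that concrete, here is a minimal sketch, assuming a toy grayscale video in which a small bright object moves one pixel to the right between two frames. The frame size, pixel values, and helper names are all made up for illustration; real codecs work on motion-compensated blocks rather than raw frame differences.

```python
def make_frame(obj_x, width=6, height=4):
    """Build a toy frame: flat background (10) with a 2x2 bright object (200)."""
    frame = [[10] * width for _ in range(height)]
    for y in (1, 2):
        for x in (obj_x, obj_x + 1):
            frame[y][x] = 200
    return frame


def frame_residual(prev, curr):
    """Per-pixel difference between two frames of the same size."""
    return [[c - p for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]


def count_nonzero(frame):
    """Number of values that would actually need to be stored."""
    return sum(1 for row in frame for v in row if v != 0)


frame0 = make_frame(obj_x=1)
frame1 = make_frame(obj_x=2)          # same scene, object shifted right by one pixel
residual = frame_residual(frame0, frame1)

total_pixels = len(frame1) * len(frame1[0])
print(f"{count_nonzero(residual)} of {total_pixels} residual values are nonzero")
```

Only the pixels along the leading and trailing edges of the moving object change, so most of the second frame can be described as "same as before". Finding and exploiting that repetition is where much of the compute goes.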

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. The speed of the decode engine has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both fast and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

Number of Parallel 1080p HP Streams

  Intel Core i5-2500K       5 streams
  NVIDIA GeForce GTX 460    3 streams
  AMD Radeon HD 6870        1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile, 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge however takes the cake, as a single Core i5-2500K can decode five streams simultaneously.
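As a rough worked example, the stream counts above can be turned into sustained decode throughput. This is only a sketch: it assumes decode work scales linearly with the number of streams, and because the test counts whole streams, each figure is a lower bound rather than a measured peak. The function name is hypothetical.

```python
# Figures taken from the test described above.
STREAM_FPS = 23.976   # frame rate of each 1080p stream
STREAM_MBPS = 15      # bitrate of each stream, in Mbps

# Maximum number of parallel streams measured per device.
max_streams = {
    "Intel Core i5-2500K": 5,
    "NVIDIA GeForce GTX 460": 3,
    "AMD Radeon HD 6870": 1,
}


def sustained_decode(streams):
    """Return (frames per second, Mbps) decoded across all parallel streams."""
    return streams * STREAM_FPS, streams * STREAM_MBPS


for device, streams in max_streams.items():
    fps, mbps = sustained_decode(streams)
    print(f"{device}: at least {fps:.1f} fps, {mbps} Mbps of H.264")
```

Five streams works out to roughly 120 decoded frames per second, or about five times real time for a single 23.976 fps source if decode were the only bottleneck in a transcode.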

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how rapidly pixel values change across a block); this is done with a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
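For readers who want to see what that domain transform does, here is a minimal sketch of an orthonormal 2D DCT-II on an 8x8 block in plain Python. It is a textbook floating point formulation, not Intel's implementation (H.264 itself uses a small integer approximation of the DCT), and the function names and sample blocks are assumptions made for illustration.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """2D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [horizontal][vertical]; transpose back to [vertical][horizontal].
    return [list(r) for r in zip(*cols)]


# A perfectly flat block: every pixel has the same value.
flat = [[100] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat)

# A block that only changes from left to right (a horizontal ramp).
ramp = [[10 * x for x in range(8)] for _ in range(8)]
ramp_coeffs = dct_2d(ramp)

print(f"flat block DC coefficient: {flat_coeffs[0][0]:.1f}")
print("ramp block, first row of coefficients:",
      [round(c, 1) for c in ramp_coeffs[0]])
```

The flat block ends up with all of its energy in the single DC coefficient, and the ramp only populates the first row of coefficients because nothing changes vertically. Smooth image regions therefore collapse into a handful of significant values, which is what makes the frequency domain a convenient place to compress.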

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
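To give a sense of why motion search is so expensive, here is a minimal sketch of exhaustive block matching using the sum of absolute differences (SAD). The article does not describe the algorithm Intel runs on the EU array, so treat this as a generic textbook illustration; the frame contents, block size, search range, and function names are all assumptions.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a block in ref and a block in cur."""
    return sum(
        abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
        for dy in range(size)
        for dx in range(size)
    )


def motion_search(ref, cur, cx, cy, size, search_range):
    """Try every offset within +/- search_range and return (cost, mx, my)."""
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    return best


# Toy 12x12 reference frame with a simple texture.
SIZE = 12
ref_frame = [[7 * x + 13 * y for x in range(SIZE)] for y in range(SIZE)]

# Current frame: the same content shifted two pixels to the right.
cur_frame = [
    [ref_frame[y][x - 2] if x >= 2 else 0 for x in range(SIZE)]
    for y in range(SIZE)
]

best_match = motion_search(ref_frame, cur_frame, cx=4, cy=4, size=4, search_range=3)
print("best (cost, mx, my):", best_match)
```

Even this tiny example evaluates 49 candidate positions of 16 pixels each for a single 4x4 block. A 1080p frame contains thousands of blocks, and real encoders also search multiple reference frames and sub-pixel positions, which is why the work benefits from a wide parallel array.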

283 Comments

  • hmcindie - Monday, January 3, 2011

    Why is it that Quick Sync has better scaling? Very evident in the Dark Knight police car image as all the other versions have definite scaling artifacts on the car.

    Scaling is something that should be very easy. Why is there so big a difference? Are these programs just made to market new stuff and no-one really uses them because they suck? So big scaling differences between codepaths make no sense.
  • JarredWalton - Monday, January 3, 2011

    It looks to me like some of the encodes have a sharpening effect applied, which is either good (makes text legible) or bad (aliasing effects) depending on your perspective. I'm quite happy overall with the slightly blurrier QS encodes, especially considering the speed.
  • xxxxxl - Monday, January 3, 2011

    I've been so looking forward to SB...only to hear that H67 can't overclock CPU?!?!?!?!
    Disappointed.
  • digarda - Monday, January 3, 2011

    Who needs the IGP for a tuned-up desktop PC anyway? Some for sure, but I see the main advantages of the SB GPU for business laptop users. As the charts show, for desktop PC enthusiasts, the GPU is still woefully slow, being blown away even by the (low-end) Radeon 5570. For this reason, I can't help feeling that the vast majority of overclockers will still want to have discrete graphics.

    I would have preferred the dual core (4-thread) models to have (say) 32 shaders, instead of the 6 or 12 being currently offered. At 32nm, there's probably enough silicon real estate to do it. I guess Intel simply didn't want the quad core processors to have a lower graphics performance than the dual core ones (sigh).

    Pity that the socket 2011 processors (without a GPU) are apparently not going to arrive for nearly a year (Q4 2011). I had previously thought the schedule was Q3 2011. Hopefully, AMD's Bulldozer-based CPUs will be around (or at least imminent) by then, forcing Intel to lower the prices for its high-end parts. On the other hand, time to go - looks like I'm starting to dream again...
  • Exodite - Monday, January 3, 2011

    Using myself as an example showing the drawback of limiting overclocking on H67 would be the lack of a good selection of overclocking-friendly micro-ATX boards due to most, if not all, of those being H67.

    Granted, that's not Intel's fault.

    It's just that I have no need for more than one PCIe x16 slot and 3 SATA (DVD, HDD, SSD). I don't need PCI, FDD, PS2, SER, PAR or floppy connectors at all.

    Which ideally means I'd prefer a rather basic P67 design in micro-ATX format but those are, currently, in short supply.

    The perfect motherboard, for me, would probably be a P67 micro-ATX design with the mandatory x8/x8 Crossfire support, one x1 and one x4 slot, front panel connector for USB 3, dual gigabit LAN and the base audio and SATA port options.

    Gigabyte?

    Anyone? :)
  • geofelt - Monday, January 3, 2011

    The only P67 based micro-ATX motherboard I have found to date is the
    Asus P8P67-M pro. (or evo?)

    Any others?
  • Rick83 - Monday, January 3, 2011

    There's also a non-pro P8P67-M.

    Keep in mind though, that the over-clocking issue may not be as bad as pointed out. There are H67 boards being marketed for over-clocking ability and manuals showing how to adjust the multiplier for CPUs... I'm not yet convinced over-clocking will be disabled on H67.
  • smilingcrow - Monday, January 3, 2011

    Major bummer as I was going to order a Gigabyte H67 board and an i5-2500K but am put off now. They seem to over-clock so well and with low power consumption that it seemed the perfect platform for me…
    I don’t mind paying the small premium for the K editions but being forced to use a P67 and lose the graphics and have difficulty finding a mATX P67 board seems crazy!

    I wonder if this limit is set in the chipset or it can be changed with a BIOS update?
  • DanNeely - Monday, January 3, 2011

    Quick Sync only works if the IGP is in use (may be fixable via drivers later); for anyone who cares about video encoding performance that makes the IGP a major feature.
  • mariush - Monday, January 3, 2011

    On the Dark Knight test...

    Looking at the Intel software encoding and the AMD encoding, it looks like the AMD is more washed out overall, which makes me think there's actually something related to colorspaces or color space conversion involved....

    Are you guys sure there's no PC/TV mixup there with the luminance or ATI using the color matrix for SD content on HD content or something like that?
