Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. This is transcoding rather than encoding because the source is almost always already stored in some sort of compressed format, most commonly H.264/AVC these days.

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically have many of the same elements, video compression algorithms look at data that’s repeated temporally as well as spatially.
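To make the spatial versus temporal distinction concrete, here is a minimal toy sketch. It is not how any real codec works: the 4x4 "frames", the left-neighbour predictor, and all function names are assumptions made up for illustration. It simply counts how many values are left to encode when a frame is predicted from within itself versus from the previous frame.

```python
# Toy illustration only: frames are tiny grids of 8-bit luma values.

def spatial_residual(frame):
    """Predict each pixel from its left neighbour (0 for the first column)."""
    return [[row[x] - (row[x - 1] if x > 0 else 0) for x in range(len(row))]
            for row in frame]

def temporal_residual(prev_frame, cur_frame):
    """Predict each pixel from the same position in the previous frame."""
    return [[c - p for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(prev_frame, cur_frame)]

def nonzero_count(residual):
    """Number of residual values that would still need to be coded."""
    return sum(1 for row in residual for v in row if v != 0)

frame0 = [[10, 10, 10, 10],
          [10, 50, 50, 10],
          [10, 50, 50, 10],
          [10, 10, 10, 10]]

# The next frame is identical except for a single changed pixel.
frame1 = [row[:] for row in frame0]
frame1[0][3] = 12

print("spatial only:", nonzero_count(spatial_residual(frame1)))
print("temporal:    ", nonzero_count(temporal_residual(frame0, frame1)))
```

When consecutive frames are nearly identical, the temporal prediction leaves far fewer nonzero values than the spatial one, which is why encoders search across frames as well as within them.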

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. How fast a decode engine is has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is efficient in both performance and power. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

                                      Intel Core i5-2500K   NVIDIA GeForce GTX 460   AMD Radeon HD 6870
Number of Parallel 1080p HP Streams   5 streams             3 streams                1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile, 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake, as a single Core i5-2500K can decode five streams at once.
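The stream counts translate directly into a lower bound on decode throughput, since every stream has to be kept at the clip's full 23.976 fps. The short calculation below is only back-of-the-envelope arithmetic on the figures above; the function name is made up for illustration, and the real peak rate of each decoder could be anywhere between this bound and the rate needed for one more stream.

```python
FPS = 23.976  # frame rate of the 1080p test clip

# Maximum simultaneous streams sustained at full frame rate (from the table).
max_streams = {
    "Intel Core i5-2500K": 5,
    "NVIDIA GeForce GTX 460": 3,
    "AMD Radeon HD 6870": 1,
}

def sustained_decode_fps(num_streams, fps=FPS):
    """Aggregate frames per second the decoder demonstrably kept up with."""
    return num_streams * fps

for name, n in max_streams.items():
    print(f"{name}: at least {sustained_decode_fps(n):.1f} frames/s")
```

For a transcode, where nothing limits the decoder to playback speed, this aggregate rate is the figure that matters.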

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are actually fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how rapidly pixel values change across a block); this is done by the use of a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; well, that same hardware is useful in the domain transform needed when transcoding.
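For readers who want to see the domain transform itself, here is a minimal software sketch of a forward and inverse 2D DCT on one block. It assumes an 8x8 block and the textbook orthonormal floating point DCT-II; H.264 itself uses an integer approximation of this transform, and nothing here reflects how Intel's hardware is actually built. All function names are illustrative.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct_1d(c):
    """Inverse of dct_1d (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            s += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def transpose(m):
    return [list(r) for r in zip(*m)]

def dct_2d(block):
    """Apply the 1D transform to every row, then to every column."""
    rows = [dct_1d(r) for r in block]
    return transpose([dct_1d(c) for c in transpose(rows)])

def idct_2d(coeffs):
    """Undo dct_2d: inverse transform on rows, then on columns."""
    rows = [idct_1d(r) for r in coeffs]
    return transpose([idct_1d(c) for c in transpose(rows)])

# A perfectly flat block has no variation, so only the DC term is nonzero.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 3))
```

The forward and inverse transforms share the same cosine structure, which is the reason decode hardware is a useful starting point for an encoder.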

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
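To give a feel for why motion search is so expensive, here is a minimal sketch of exhaustive block matching using the sum of absolute differences (SAD). This is a generic textbook approach and an assumption on my part, not a description of the algorithm Intel runs on the EU array; the block size, search range, frame contents, and function names are all made up for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in the current frame
    and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def motion_search(cur, ref, bx, by, size=4, search=2):
    """Try every offset within +/- search pixels; return (cost, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= by + dy and by + dy + size <= h
                    and 0 <= bx + dx and bx + dx + size <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

def make_frame(sq_x, sq_y, w=8, h=8):
    """Dark 8x8 frame with a bright 2x2 square at (sq_x, sq_y)."""
    return [[200 if sq_x <= x < sq_x + 2 and sq_y <= y < sq_y + 2 else 16
             for x in range(w)] for y in range(h)]

ref = make_frame(3, 3)   # reference frame
cur = make_frame(4, 3)   # same scene, square moved one pixel to the right

print(motion_search(cur, ref, bx=2, by=2))
```

Even this toy search evaluates 25 candidate offsets of 16 pixels each for a single block. A 1080p frame has vastly more blocks and real encoders use much wider search ranges, so the work is a natural fit for a parallel array of execution units.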

283 Comments

  • nuudles - Monday, January 3, 2011 - link

    Anand, I'm not the biggest Intel fan (due to their past grey area dealings) but I don't think the naming is that confusing. As I understand it they will move to the 3x00 series with Ivy Bridge, basically the higher the second number the faster the chip.

    It would be nice if there was something in the name to easily tell consumers the number of cores and threads, but the majority of consumers just want the fastest chip for their money and don't care how many cores or threads it has.

    The ix part tells enthusiasts the number of cores/threads/turbo with the i3 having 2/4/no, the i5 having 4/4/yes and i7 4/8/yes. I find this much simpler than the 2010 chips which had some dual and some quad core i5 chips for example.

    I think AMD's GPUs have a sensible naming convention (except for the 68/6900 renaming) without the additional i3/i5/i7 modifier by using the second number as the tier indicator while maintaining the rule of thumb of "a higher number within a generation means faster". If Intel adopted something similar it would have been better.

    That said I wish they stick with a naming convention for at least 3 or 4 generations...
  • nimsaw - Monday, January 3, 2011 - link

    ",,but until then you either have to use the integrated GPU alone or run a multimonitor setup with one monitor connected to Intel’s GPU in order to use Quick Sync"

    So have you tested the Transcoding with QS by using an H67 chipset based motherboard? The Test Rig never mentions any H67 motherboard. I am somehow not able to follow how you got the scores for the Transcode test. How do you select the codepath if switching graphics on a desktop motherboard is not possible? Please throw some light on it as i am a bit confused here. You say that QS gives a better quality output than GTX 460, so does that mean, i need not invest in a discrete GPU if i am not gaming. Moreover, why should i be forced to use the discrete GPU in a P67 board when according to your tests, the Intel QS is giving a better output.
  • Anand Lal Shimpi - Monday, January 3, 2011 - link

    I need to update the test table. All of the Quick Sync tests were run on Intel's H67 motherboard. Presently if you want to use Quick Sync you'll need to have an H67 motherboard. Hopefully Z68 + switchable graphics will fix this in Q2.

    Take care,
    Anand
  • 7Enigma - Monday, January 3, 2011 - link

    I think this needs to be a front page comment because it is a serious deficiency that all of your reviews fail to properly describe. I read them all and it wasn't until the comments came out that this was brought to light. Seriously SNB is a fantastic chip but this CPU/mobo issue is not insignificant for a lot of people.
  • Wurmer - Monday, January 3, 2011 - link

    I haven't read through all the comments and sorry if it's been said but I find it weird that the most ''enthusiast'' chip K, comes with the better IGP when most people buying this chip will for the most part end up buying a discrete GPU.
  • Akv - Monday, January 3, 2011 - link

    It's being said in reviews from China to France to Brazil, etc.
  • nimsaw - Monday, January 3, 2011 - link

    Strangely enough i also have the same query. what is the point of better Integrated graphics when you cannot use them on a P67 mobo?
    also i came across this screen shot

    http://news.softpedia.com/newsImage/Intel-Sandy-Br...

    where on the right hand corner you have a Drop Down menu which has selected Intel Quick Sync. Will you see a discrete GPU if you expand it? Does it not mean switching between graphics solutions. In the review its mentioned that switchable graphics is still to find its way in desktop mobos.
  • sticks435 - Tuesday, January 4, 2011 - link

    It looks like that drop down is dithered, which means it's only displaying the QS system at the moment, but has a possibility to select multiple options in the future or maybe if you had 2 graphics cards etc.
  • HangFire - Monday, January 3, 2011 - link

    You are comparing video and not chipsets, right?

    I also take issue with the statement that the 890GX (really HD 4290) is the current onboard video cream of the crop. Test after test (on other sites) show it to be a bit slower than the HD4250, even though it has higher specs.

    I also think Intel is going to have a problem with folks comparing their onboard HD3000 to AMD's HD 4290, it just sounds older and slower.

    No word on Linux video drivers for the new HD2000 and HD3000? Considering what a mess KMS has made of the old i810 drivers, we may be entering an era where accelerated onboard Intel video is no longer supported on Linux.
  • mino - Wednesday, January 5, 2011 - link

    Actually, 890GX is just a re-badged 780G from 2008 with sideport memory.

    And no, HD4250 is NOT faster. While some specific implementation of 890GX without sideport _might_ be slower, it would also be cheaper and not really a "proper" representative.
    (890GX without sideport is like saying i3 with dual channel RAM is "faster" in games than i5 with single channel RAM ...)
