Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and compressing it further to better match the storage and decoding abilities of a target device. The reason this is transcoding and not simply encoding is that the source is almost always already stored in some compressed format, the most common these days being H.264/AVC.
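
As a rough, modern illustration of that decode-and-re-encode step (using ffmpeg with a software x264 encode rather than Quick Sync; the file names, resolution, and bitrates here are made up), taking a Blu-ray quality H.264 rip down to something a tablet can store and decode might look like this:

```python
import subprocess

# Illustrative software transcode: decode the existing high bitrate H.264
# stream, downscale it, and re-encode it at a far lower bitrate.
subprocess.run([
    "ffmpeg", "-i", "movie_bluray_1080p.mkv",  # source: full bitrate 1080p H.264
    "-vf", "scale=1280:720",                   # downscale to a 720p frame
    "-c:v", "libx264", "-b:v", "4M",           # re-encode the video at ~4Mbps
    "-c:a", "aac", "-b:a", "160k",             # re-encode the audio as well
    "movie_tablet_720p.mp4",
], check=True)
```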

Transcoding is a particularly CPU intensive task because of the three-dimensional nature of the compression. Each individual frame within a video can be compressed on its own; however, since sequential frames of video typically share many of the same elements, video compression algorithms look for data that is repeated temporally as well as spatially.
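
To make that distinction concrete, here is a minimal Python/NumPy sketch (the frame size and the changed region are arbitrary) of how little unique data a new frame typically carries once the previous frame is taken into account:

```python
import numpy as np

# Model two consecutive 1080p frames as 8-bit luma planes.
prev_frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[500:580, 900:1020] = 128   # only a small moving object changes

# Spatial redundancy: within one frame, neighboring pixels tend to be
# similar, so intra-frame coding predicts each block from its neighbors.
# Temporal redundancy: most of curr_frame is identical to prev_frame, so
# an inter-coded frame only needs to describe the residual (plus motion).
residual = curr_frame.astype(np.int16) - prev_frame.astype(np.int16)
print("pixels that actually changed:", np.count_nonzero(residual),
      "out of", residual.size)
```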

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the video transcode process is decoding the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. How fast the decode engine runs has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario, where the decoder only needs to keep pace with the frame rate of the video, during a transcode the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware needed to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
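
A minimal sketch of that first point, with decode_next_frame standing in for a hypothetical hardware decode call:

```python
import time

FRAME_RATE = 23.976

def decode_next_frame(stream):
    """Placeholder for a hardware accelerated decode call (hypothetical)."""
    return None

# Playback: decoding is paced to the display rate, so the decoder only has
# to be "fast enough" to keep up with ~24 frames per second.
def play(stream, seconds):
    deadline = time.perf_counter()
    for _ in range(int(seconds * FRAME_RATE)):
        frame = decode_next_frame(stream)
        deadline += 1.0 / FRAME_RATE
        time.sleep(max(0.0, deadline - time.perf_counter()))

# Transcoding: no pacing at all; every frame goes to the encoder as soon as
# the decoder produces it, so a faster decoder directly shortens the total
# transcode time.
def transcode(stream, encoder):
    while (frame := decode_next_frame(stream)) is not None:
        encoder.submit(frame)
```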

With video transcoding now a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This isn’t ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient way to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA now have fixed function video decode hardware in their GPUs; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p High Profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel, followed by a rough sketch of how such a test could be scripted:

Maximum number of parallel 1080p High Profile streams at full frame rate:

  Intel Core i5-2500K: 5 streams
  NVIDIA GeForce GTX 460: 3 streams
  AMD Radeon HD 6870: 1 stream
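
A rough sketch of that saturation test (hwdecode_bench is a hypothetical placeholder for whatever hardware decode benchmark is available, not a real utility):

```python
import subprocess

SOURCE = "sample_1080p_hp_15mbps.mp4"   # hypothetical 15Mbps High Profile clip
TARGET_FPS = 23.976

def min_parallel_fps(n_streams):
    """Launch n_streams hardware accelerated decodes of the same clip and
    return the slowest per-stream frame rate achieved."""
    procs = [subprocess.Popen(["hwdecode_bench", SOURCE], stdout=subprocess.PIPE)
             for _ in range(n_streams)]
    return min(float(p.communicate()[0].decode()) for p in procs)

# Keep adding streams until at least one of them drops below full frame rate.
n = 1
while min_parallel_fps(n) >= TARGET_FPS:
    n += 1
print("maximum full-rate parallel streams:", n - 1)
```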

AMD’s Radeon HD 6000 series GPUs can only manage a single High Profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake: a single Core i5-2500K can decode five streams in tandem.

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how quickly pixel values change across the frame); this is done using a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
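
A small Python/SciPy sketch of that forward and inverse transform on a single 8x8 block (the block size and sample data are arbitrary):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """Forward 2D DCT: spatial domain -> frequency domain (encode side)."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(coeffs):
    """Inverse 2D DCT: frequency domain -> spatial domain (decode side)."""
    return idct(idct(coeffs, axis=0, norm="ortho"), axis=1, norm="ortho")

block = np.random.randint(0, 256, (8, 8)).astype(float)   # one 8x8 pixel block
coeffs = dct2(block)             # what an encoder quantizes and entropy-codes
assert np.allclose(idct2(coeffs), block)  # the decoder's iDCT recovers the block
```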

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It’s the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel’s Quick Sync engine.
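
For a sense of why motion search is so expensive, here is a minimal exhaustive block-matching sketch in Python (the 16x16 block size and small search radius are arbitrary, and real encoders use far smarter search strategies than this brute force version):

```python
import numpy as np

def motion_search(ref, cur, bx, by, block=16, radius=8):
    """Find the offset (dx, dy) into the reference frame whose block best
    matches the current block, scored by sum of absolute differences."""
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue  # candidate block falls outside the frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(cand - target).sum())
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad  # motion vector and its matching cost
```

Repeating a search like this for every 16x16 block of every frame is what makes motion estimation such a natural fit for the parallel EU array rather than the CPU cores.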

Comments

  • mosu - Monday, January 3, 2011

    If I wanted to spend a lot of money every year on something I'd sell on eBay at half price a few months later, and if I liked crappy quality images on my monitor, then I would buy Sandy Bridge... but sorry, I'm no no-brainer for Intel.
  • nitrousoxide - Monday, January 3, 2011

    It really impressed me, as I do a lot of video transcoding and it's extremely slow on my triple-core Phenom II X3 720, even though I overclocked it to 4GHz. But there is one question: the acceleration needs the EUs in the GPU, and the GPU is disabled with the P67 chipset. Does that mean that if I pair my SNB with a P67 motherboard, I won't be able to use the transcoding accelerator?
  • nitrousoxide - Monday, January 3, 2011

    Not talking about SNB-E this time, I know it will be the performance king again. But I wonder if Bulldozer can at least gain some performance advantage over SNB, because it makes no sense that 8 cores running at a stunning 4.0GHz won't outrun 4 cores below 3.5GHz, no matter what architectural differences there are between the two chips. SNB is only the new generation's mid-range part; it will be outperformed by high-end Bulldozers. AMD will hold the low end, just as it does now; as long as Bulldozer regains some of the ground the Phenoms lost in the mainstream and performance markets, things will be much better for it. The enthusiast market is not AMD's cup of tea, just like in GPUs: let NVIDIA take the performance crown and strike from the lower performance niches.
  • strikeback03 - Tuesday, January 4, 2011

    I don't think we'll know until AMD releases Bulldozer and Intel counters (if they do). It seems the SNB chips can run significantly faster than they do right now, so if necessary Intel could release new models (or a firmware update) that allow turbo modes up past 4GHz.
  • smashr - Monday, January 3, 2011

    This review and others around the web refer to the CPUs as 'launching today', but I do not see them on NewEgg or other e-tailer sites.

    When can we expect these babies at retail?
  • JumpingJack - Monday, January 3, 2011

    They are already selling in Malaysia, but if you don't live in Malaysia then you are SOL :) ... I see rumors around that the NDA was supposed to expire on the 5th with retail availability on the 9th... I was thinking about making the leap, but I think I will hold off for more info on BD and Sk2011 SB.
  • slickr - Monday, January 3, 2011

    Intel has essentially shot itself in the foot this time. Between the letter-based model restrictions, the new chipsets, and the crazy differentiation between the P and H chipsets, it's a mess.
    Not to mention the lack of USB 3.0, the inability to overclock on a motherboard with integrated graphics, and the stupid turbo boost restrictions.

    I'll go even further and say that the Core i3 is pure crap, and while it's better than the old Core i3, they are essentially leaving the biggest market, the one up to $200, wide open to AMD.

    Those who purchase CPUs at $200 and higher are in luck with the 2500 and 2600 variants, but for the majority of us who purchase CPUs below $200 it's crap.

    Essentially, if you want gaming performance you buy the i3-2100, but if you want better overall performance you go for a Phenom II.

    Hopefully AMD comes up with some great CPUs below $200 with 4 cores, unrestricted turbo boost and unlocked multipliers.
  • Arakageeta - Tuesday, January 4, 2011

    It seems that these benchmarks test the CPU cores and GPU parts of Sandy Bridge separately. I'd like to know more about the effects of the CPU and GPU (usually data intensive) sharing the L3 cache.

    One advantage of a system with a discrete GPU is that the GPU and CPUs can happily work simultaneously without greatly affecting one another. This is no longer the case with Sandy Bridge.

    A test I would like to see is a graphics intensive application running while another application performs some multi-threaded ATLAS-tuned LAPACK computations. Does either the GPU or the CPUs swamp the L3 cache? Are there any instances of starvation? What happens to the performance of each application? What happens to frame rates? What happens to execution times?
  • morpheusmc - Tuesday, January 4, 2011

    To me it seems that marketing, rather than engineering, is now defining Intel's processors. That has always been the case to some extent, but I think now it is more evident than ever.

    Essentially, if you want the features that the new architecture brings, you have to shell out for the higher-end models.
    My ideal processor would be an i5-2520M for the desktop: reasonable clocks, good turbo speeds (they could be higher on the desktop since the TDP is not as limited), HT, good graphics, etc. The combination of 2 cores and HT provides a good balance between power consumption and performance for most users.

    Its desktop equivalent price-wise is the 2500, which has no HT and a much higher TDP because of the four cores. Alternatively, maybe the 2500S, 2400S or 2390T could be considered if they aren't too overpriced.

    Intel has introduced too much differentiation in this generation, and in an Apple-like fashion, i.e. they force you to pay more for stuff you don't need, just to get an extra feature (e.g. VT support, good graphics, etc.) that practically costs nothing since the silicon is already there. Bottom line, if you want the full functionality of the silicon you get, you have to pay for the higher-end models.
    Moreover, having features for specific functions (AES, transcoding, etc.) and good graphics makes more sense in lower-end models where CPU power is limited.

    This is becoming like the software market, where you have to pay extra for licenses for specific functionality.
    I wouldn't be surprised if Intel starts selling "upgrade licenses" sometime in the future that simply unlock features.

    I strongly prefer AMD's approach, where all the features are available on all models.

    I am also a bit annoyed that there is very little discussion of this problem in the review. I agree that technologically Sandy Bridge is impressive, but the artificial limiting of functionality is anti-technological.
  • ac2 - Tuesday, January 4, 2011

    Agreed, but apart from the K-series/higher-IGP/motherboard mess-up (which I think should be cleared up shortly), all the rest of it is just smart product marketing...

    It irritates readers of AnandTech, but for most people who buy off-the-shelf it's all good, with integrators patching up any shortcomings in the core/chipset.

    The focus does seem to be mobile, low power and video transcode, almost a recipe for a MacBook!!
