Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. The reason this is transcoding and not encoding is that the source is almost always already encoded in some sort of compressed format, most commonly H.264/AVC these days.
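To make the process concrete, here is a minimal sketch of a transcode driven from Python via the ffmpeg command line; the file names, target resolution, and bitrate are hypothetical placeholders, not settings from the article:

```python
import subprocess

# Minimal transcode sketch: decode a compressed H.264 source, downscale it,
# and re-encode it at a lower bitrate for a handheld device. Assumes ffmpeg
# built with libx264 is on the PATH; file names and settings are placeholders.
subprocess.run([
    "ffmpeg",
    "-i", "source_1080p.mkv",   # compressed source (the decode step)
    "-vf", "scale=1280:720",    # downscale to the target device's resolution
    "-c:v", "libx264",          # re-encode with H.264 (the encode step)
    "-b:v", "4M",               # much lower bitrate than the source
    "target_720p.mp4",
], check=True)
```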

Transcoding is a particularly CPU intensive task because of the three-dimensional nature of the compression. Each individual frame within a video can be compressed on its own; however, since sequential frames typically share many of the same elements, video compression algorithms exploit data that's repeated temporally as well as spatially.
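A toy illustration of why the temporal dimension matters: the difference between two consecutive frames is mostly zeros, so encoding that residual (plus a motion description) is far cheaper than encoding each frame from scratch. This sketch uses synthetic frames, not real video data:

```python
import numpy as np

# Two consecutive "frames" that differ only in a small moving region.
rng = np.random.default_rng(0)
frame0 = rng.integers(0, 256, size=(1080, 1920), dtype=np.uint8)
frame1 = frame0.copy()
frame1[500:540, 900:980] = 255 - frame1[500:540, 900:980]  # region changed

# The inter-frame residual is almost entirely zero and compresses far
# better than either full frame would on its own.
residual = frame1.astype(np.int16) - frame0.astype(np.int16)
print("pixels in a frame:          ", frame1.size)
print("nonzero pixels in residual: ", np.count_nonzero(residual))
```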

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. How fast the decode engine runs has a tremendous impact on how fast a hardware accelerated video encode can run, and this is true for two reasons.

First, unlike in a playback scenario, where you only need to decode as fast as the video's frame rate, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).
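The first point can be sketched as a simple two-stage pipeline: frames queue up between the decoder and the encoder, and overall throughput is bounded by whichever stage is slower. This is only an illustration of the principle, not how Quick Sync is actually scheduled:

```python
import queue
import threading

# Decoded frames waiting for the encoder; during a transcode the decoder
# is not throttled to the video's frame rate and runs as fast as it can.
frames = queue.Queue(maxsize=8)

def decoder():
    for i in range(1000):
        frames.put(f"frame {i}")   # decode flat out, not at 23.976 fps
    frames.put(None)               # end-of-stream sentinel

def encoder():
    while (frame := frames.get()) is not None:
        pass                       # compress the frame for the target device

threading.Thread(target=decoder).start()
encoder()  # finishes only as fast as the slower of the two stages
```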

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient method to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both performance and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

Number of Parallel 1080p High Profile Streams

  Intel Core i5-2500K:      5 streams
  NVIDIA GeForce GTX 460:   3 streams
  AMD Radeon HD 6870:       1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better: it could handle three simultaneous streams. Sandy Bridge, however, takes the cake, as a single Core i5-2500K can decode five streams in tandem.
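For readers who want to approximate this test, the sketch below launches an increasing number of simultaneous decodes of the same clip and checks whether they all finish within the clip's running time. The clip name and 60-second duration are placeholders, and a plain ffmpeg run like this decodes in software rather than through the GPU's hardware decoder, so it mirrors only the methodology, not the hardware path used here:

```python
import subprocess
import time

CLIP, DURATION = "test_1080p_hp.mkv", 60.0   # hypothetical 15Mbps HP clip

def holds_realtime(n):
    """Decode n copies of CLIP at once; True if all held full frame rate."""
    start = time.time()
    procs = [subprocess.Popen(
        ["ffmpeg", "-i", CLIP, "-f", "null", "-"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(n)]
    for p in procs:
        p.wait()
    # If n concurrent decodes fit inside the clip's duration, none of them
    # dropped below full frame rate.
    return time.time() - start <= DURATION

n = 1
while holds_realtime(n):
    n += 1
print("maximum parallel real-time streams:", n - 1)
```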

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how quickly pixel values vary across the block); this is done using a discrete cosine transform (DCT). You may remember that inverse discrete cosine transform (iDCT) hardware is necessary to decode video; that same hardware is useful in the domain transform needed when transcoding.
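The round trip is easy to demonstrate: the forward DCT used on the encode side is the exact inverse of the iDCT used on the decode side, which is why the decoder's transform hardware is reusable. A small sketch using SciPy's DCT routines on a stand-in 8x8 block:

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.arange(64, dtype=np.float64).reshape(8, 8)  # stand-in pixel block

coeffs = dctn(block, norm="ortho")      # spatial -> frequency (encode side)
restored = idctn(coeffs, norm="ortho")  # frequency -> spatial (decode side)

# The transforms are exact inverses of each other, so the block survives
# the round trip; real codecs lose information only when quantizing coeffs.
assert np.allclose(block, restored)
print("DC (average) coefficient:", coeffs[0, 0])
```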

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
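To see why motion search dominates the compute budget, consider a naive software version: for every 16x16 block, an exhaustive search scans every candidate offset in a window of the previous frame, scoring each by sum of absolute differences (SAD). This is an illustrative sketch, not Intel's implementation:

```python
import numpy as np

def motion_search(prev, cur, bx, by, block=16, radius=8):
    """Full search for the best match of one block; returns (dx, dy), SAD."""
    target = cur[by:by+block, bx:bx+block].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):           # every candidate offset
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] \
                    or x + block > prev.shape[1]:
                continue
            cand = prev[y:y+block, x:x+block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())  # score the candidate
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad

# Shift a synthetic frame by a known amount and recover the motion vector.
prev = np.random.default_rng(1).integers(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(prev, shift=(2, 3), axis=(0, 1))   # move content down 2, right 3
print(motion_search(prev, cur, bx=24, by=24))    # -> ((-3, -2), 0)
```

Even this single block costs (2×8+1)² = 289 SAD evaluations, repeated for every block of every frame, which is exactly the kind of embarrassingly parallel workload suited to the EU array.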

Comments

  • DanNeely - Monday, January 3, 2011 - link

    The increased power efficiency might allow Apple to squeeze a GPU onto their smaller laptop boards without losing runtime due to the smaller battery.
  • yuhong - Monday, January 3, 2011 - link

    "Unlike P55, you can set your SATA controller to compatible/legacy IDE mode. This is something you could do on X58 but not on P55. It’s useful for running HDDERASE to secure erase your SSD for example"
    Or running old OSes.
  • DominionSeraph - Monday, January 3, 2011 - link

    "taking the original Casino Royale Blu-ray, stripping it of its DRM"

    Whoa, that's illegal.
  • RussianSensation - Monday, January 3, 2011 - link

    It would have been nice to include 1st generation Core i7 processors such as the 860/870/920-975 in the Starcraft 2 bench, as it seems to be very CPU intensive.

    Also, perhaps a section with overclocking which shows us how far 2500k/2600k can go on air cooling with safe voltage limits (say 1.35V) would have been much appreciated.
  • Hrel - Monday, January 3, 2011 - link

    Sounds like this is SO high end it should be the server market. I mean, why make yet ANOTHER socket for servers that use basically the same CPUs? Everything's converging and I'd just really like to see server mobos converge into "High End Desktop" mobos. I mean seriously, my E8400 OC'd with a GTX460 is more power than I need. A quad would help with the video editing I do in HD but it works fine now, and with GPU accelerated rendering the rendering times are totally reasonable. I just can't imagine anyone NEEDING a home computer more powerful than the LGA 1155 socket can provide. Hell, 80-90% of people are probably fine with the power Sandy Bridge gives in laptops now.
  • mtoma - Monday, January 3, 2011 - link

    Perhaps it is like you say, however it's always good for buyers to decide if they want server-like features in a PC. I don't like manufacturers dictating to me only one way to do it (like Intel does now with the odd combination of HD 3000 graphics and the Intel H67 chipset). Let us not forget that for a long time, all we had were 4 slots for RAM and 4-6 SATA connections (like you probably have). Intel X58 changed all that: suddenly we had the option of having 6 slots for RAM, 6-8 SATA connections and enough PCI-Express lanes.
    I only hope that LGA 2011 brings back those features, because like you said: it's not only the performance we need, but also the features.
    And, remember that the software doesn't stay still; it usually requires multiple processor cores (video transcoding, antivirus scanning, HDD defragmenting, modern OSes, and so on...).
    All this aside, the main issue remains: Intel must be persuaded to stop looting users' money and implement only one socket at a time. I usually support Intel, but in this regard, AMD deserves congratulations!
  • DanNeely - Monday, January 3, 2011 - link

    LGA 2011 is a high end desktop/server convergence socket. Intel started doing this in 2008, with all but the highest end server parts sharing LGA 1366 with top end desktop systems. The exceptions were quad/octo-socket CPUs and those using enormous amounts of RAM, which used LGA 1567.

    The main reason why LGA 1155 isn't suitable for really high end machines is that it doesn't have the memory bandwidth to feed hex and octo core CPUs. It's also limited to 16 PCIe 2.0 lanes on the CPU vs 36 PCIe 3.0 lanes on LGA 2011. For most consumer systems that won't matter, but 3/4 GPU card systems will start losing a bit of performance when running in a 4x slot (only a few percent, but people who spend $1000-2000 on GPUs want every last frame they can get), and high end servers with multiple 10Gb Ethernet cards and PCIe SSD devices also begin running into bottlenecks.

    Not spending an extra dollar or five per LGA 1155 system on QPI connections that are only used in multi-socket systems also adds up to major savings across the hundreds of millions of systems Intel is planning to sell.
  • Hrel - Monday, January 3, 2011 - link

    I'm confused by the upset over playing video at 23.976Hz. "It makes movies look like, well, movies instead of tv shows"? What? Wouldn't recording at a lower frame rate just mean there's missed detail, especially in fast action scenes? Isn't that why HD runs at 60fps instead of 30fps? Isn't more FPS good as long as it's played back at the appropriate speed, i.e. whatever it's filmed at? I don't understand the complaint.

    On a related note, Hollywood and the world need to just agree that everything gets recorded and played back at 60fps at 1920x1080. No variation AT ALL! That way everything would just work. Or better yet 120fps, with the ability to turn 3D on and off as you see fit. Whatever FPS is best. I've always been told higher is better.
  • chokran - Monday, January 3, 2011 - link

    You are right about having more detail when filming with higher FPS, but this isn't about it being good or bad, it's more a matter of tradition and visual style.
    The look movies have these days, the one we got accustomed to, is mainly achieved by filming in 24p, or 23.976 to be precise. The look you get when filming at higher FPS just doesn't look like cinema anymore but TV. At least to me. A good article on this:
    http://www.videopia.org/index.php/read/shorts-main...
    The problem with movies looking like TV can be tested at home if you have a TV with some kind of motion interpolation, e.g. Sony's MotionFlow or Panasonic's Intelligent Frame Creation. When it's turned on, you can see the soap opera effect created by the added frames. There are people who don't see it and some who do and like it, but I have to turn it off since it doesn't look "natural" to me.
  • CyberAngel - Thursday, January 6, 2011 - link

    http://en.wikipedia.org/wiki/Showscan
