Intel’s Quick Sync Technology

In recent years video transcoding has become one of the most widespread consumers of CPU power. The popularity of YouTube alone has turned nearly everyone with a webcam into a producer, and every PC into a video editing station. The mobile revolution hasn’t slowed things down either. No smartphone can play full bitrate/resolution 1080p content from a Blu-ray disc, so if you want to carry your best quality movies and TV shows with you, you’ll have to transcode to a more compressed format. The same goes for the new wave of tablets.

At a high level, video transcoding involves taking a compressed video stream and further compressing it to better match the storage and decoding abilities of a target device. This is transcoding rather than encoding because the source is almost always already stored in some sort of compressed format, most commonly H.264/AVC these days.

Transcoding is a particularly CPU intensive task because of the three dimensional nature of the compression. Each individual frame within a video can be compressed; however, since sequential frames of video typically have many of the same elements, video compression algorithms look at data that’s repeated temporally as well as spatially.
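To make the temporal part concrete, here is a minimal sketch of why consecutive frames compress well against each other. The two 4x4 grayscale frames are made up for illustration, and plain frame differencing stands in for the far more elaborate prediction a real H.264 encoder performs.

```python
def frame_residual(prev, curr):
    """Per-pixel difference between two equally sized frames."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def count_nonzero(frame):
    """Number of samples that would actually need to be coded."""
    return sum(1 for row in frame for value in row if value != 0)

# Hypothetical frames: a bright square on a dim background,
# with a single pixel changing between frame0 and frame1.
frame0 = [[10, 10, 10, 10],
          [10, 50, 50, 10],
          [10, 50, 50, 10],
          [10, 10, 10, 10]]
frame1 = [row[:] for row in frame0]
frame1[0][0] = 12

residual = frame_residual(frame0, frame1)
print(count_nonzero(frame1))    # 16 samples if the frame is coded on its own
print(count_nonzero(residual))  # 1 sample if only the change is coded
```

Coding the second frame as a difference from the first leaves one non-zero value instead of sixteen, which is the redundancy the encoder has to search for.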

I remember sitting in a hotel room in Times Square while Godfrey Cheng and Matthew Witheiler of ATI explained to me the challenges of decoding HD-DVD and Blu-ray content. ATI was about to unveil hardware acceleration for some of the stages of the H.264 decoding pipeline. Full hardware decode acceleration wouldn’t come for another year at that point.

The advent of fixed function video decode in modern GPUs is important because it helped enable GPU accelerated transcoding. The first step of the transcode process is to decode the source video. Since transcoding involves taking a video already in a compressed format and encoding it in a new format, hardware accelerated video decode is key. The speed of the decode engine has a tremendous impact on how fast a hardware accelerated video encode can run. This is true for two reasons.

First, unlike in a playback scenario where you only need to decode faster than the frame rate of the video, when transcoding the video decode engine can run as fast as possible. The faster frames can be decoded, the faster they can be fed to the transcode engine. The second and less obvious point is that some of the hardware you need to accelerate video encoding is already present in a video decode engine (e.g. iDCT/DCT hardware).

With video transcoding as a feature of Sandy Bridge’s GPU, Intel beefed up the video decode engine from what it had in Clarkdale. In the first generation Core series processors, video decode acceleration was split between fixed function decode hardware and the GPU’s EU array. With Sandy Bridge and the second generation Core CPUs, video decoding is done entirely in fixed function hardware. This is not ideal from a flexibility standpoint (e.g. newer video codecs can’t be fully hardware accelerated on existing hardware), but it is the most efficient way to build a video decoder from a power and performance standpoint. Both AMD and NVIDIA have fixed function video decode hardware in their GPUs now; neither relies on the shader cores to accelerate video decode.

The resulting hardware is both fast and power efficient. To test the performance of the decode engine I launched multiple instances of a 15Mbps 1080p high profile H.264 video running at 23.976 fps. I kept launching instances of the video until the system could no longer maintain full frame rate in all of the simultaneous streams. The table below shows the maximum number of streams I could run in parallel:

                                      Intel Core i5-2500K   NVIDIA GeForce GTX 460   AMD Radeon HD 6870
Number of Parallel 1080p HP Streams   5 streams             3 streams                1 stream

AMD’s Radeon HD 6000 series GPUs can only manage a single high profile, 1080p H.264 stream, which is perfectly sufficient for video playback. NVIDIA’s GeForce GTX 460 does much better; it could handle three simultaneous streams. Sandy Bridge, however, takes the cake, as a single Core i5-2500K can decode five streams in parallel.
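As a rough worked example, the stream counts can be turned into aggregate decode figures using the 23.976 fps and 15Mbps numbers from the test clip. This is only a lower bound: the test stops at the last stream count that still held full frame rate, so each decoder may have some headroom left. The function name is made up for illustration.

```python
FPS = 23.976        # frame rate of the test clip
BITRATE_MBPS = 15   # bitrate of the test clip

# Maximum simultaneous streams reported in the test above.
MAX_STREAMS = {
    "Intel Core i5-2500K": 5,
    "NVIDIA GeForce GTX 460": 3,
    "AMD Radeon HD 6870": 1,
}

def aggregate_decode_load(streams, fps=FPS, bitrate_mbps=BITRATE_MBPS):
    """Return (frames per second, Mbps) sustained across all streams."""
    return streams * fps, streams * bitrate_mbps

for name, streams in MAX_STREAMS.items():
    frames, mbps = aggregate_decode_load(streams)
    print(f"{name}: at least {frames:.1f} frames/s, {mbps} Mbps of H.264")
```

For the Core i5-2500K that works out to roughly 120 decoded 1080p frames per second, about five times what playback alone requires, which is the headroom a transcode can exploit.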

The Sandy Bridge decoder is likely helped by the very large (and high bandwidth) L3 cache connected to it. This is the first advantage Intel has in what it calls its Quick Sync technology: a very fast decode engine.

The decode engine is also reused during the actual encode phase. Once frames of the source video are decoded, they are fed to the programmable EU array to be split apart and prepared for transcoding. The data in each frame is transformed from the spatial domain (the value of each pixel at its location) to the frequency domain (how rapidly pixel values change across a block of the frame); this is done by the use of a discrete cosine transform. You may remember that inverse discrete cosine transform hardware is necessary to decode video; well, that same hardware is useful in the domain transform needed when transcoding.
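A small sketch shows why the forward and inverse transforms can share hardware: both are built from the same cosine basis. This is a textbook orthonormal one-dimensional DCT-II and its inverse in plain Python, not the integer transform H.264 actually specifies and not Intel's implementation; the sample rows are made up.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for coefficient k of an n-point DCT."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def dct_ii(samples):
    """Forward DCT-II: spatial samples -> frequency coefficients."""
    n = len(samples)
    return [
        _scale(k, n) * sum(samples[i] * math.cos(math.pi * (i + 0.5) * k / n)
                           for i in range(n))
        for k in range(n)
    ]

def idct_ii(coeffs):
    """Inverse transform: frequency coefficients -> spatial samples."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n))
        for i in range(n)
    ]

flat = [100] * 8                             # a row of identical pixels
ramp = [10, 20, 30, 40, 50, 60, 70, 80]      # a smooth gradient

flat_coeffs = dct_ii(flat)
ramp_coeffs = dct_ii(ramp)
print([round(c, 2) for c in flat_coeffs])    # only the first (DC) term is non-zero
print([round(c, 2) for c in ramp_coeffs])    # energy sits in the low-frequency terms
print([round(v, 6) for v in idct_ii(ramp_coeffs)])  # recovers the original ramp
```

Smooth image regions end up with most coefficients at or near zero, which is what makes the transformed data cheap to store after quantization.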

Motion search, the most compute intensive part of the transcode process, is done in the EU array. It's the combination of the fast decoder, the EU array, and fixed function hardware that makes up Intel's Quick Sync engine.
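To give a sense of why motion search is so expensive, here is a minimal sketch of exhaustive block matching using the sum of absolute differences (SAD). This is a generic textbook approach, not a description of how Intel's EU kernels work; the frame contents, block size, and search radius are assumptions chosen for illustration.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a block in ref and a block in cur."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))

def motion_search(ref, cur, cx, cy, size, radius):
    """Try every offset within +/- radius; return (best_cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidate blocks that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(ref, cur, rx, ry, cx, cy, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

def make_frame(bx, by, width=8, height=8):
    """Hypothetical frame: black background with a 2x2 bright block at (bx, by)."""
    frame = [[0] * width for _ in range(height)]
    for j in range(2):
        for i in range(2):
            frame[by + j][bx + i] = 200
    return frame

reference = make_frame(2, 2)   # previous frame
current = make_frame(3, 2)     # the block has moved one pixel to the right

print(motion_search(reference, current, cx=3, cy=2, size=2, radius=2))
# (0, -1, 0): a perfect match one pixel to the left in the reference frame
```

Even this toy search evaluates up to 25 candidate positions for a single 2x2 block. Scaling that to every block of a 1080p frame, with larger blocks and wider search windows, shows why this stage benefits from a parallel array of execution units.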

283 Comments

  • Exodite - Monday, January 3, 2011 - link

    I'm of two minds about that really.

    I had really set my mind on the 2500K as it offers unparalleled bang-for-buck and real-world testing has shown that Hyper-threading makes little difference in games.

    With the compile tests it's clear there's a distinct benefit to going with the 2600K for me though, which means this'll end up more expensive than I had planned! :)
  • Lazlo Panaflex - Monday, January 3, 2011 - link

    FYI, the 1100T is missing from several of the gaming benchmarks.....
  • Melted Rabbit - Monday, January 3, 2011 - link

    It wouldn't surprise me if that was intentional. I would hope that Anandtech reviewers were not letting companies dictate how their products were to be reviewed lest AT be denied future prerelease hardware. I can't tell from where I sit and there appears to be no statement denying such interference.

    In addition, real world benchmarks aside from games look to be absent. Seriously, I don't use my computer for offline 3D rendering and I suspect that very few other readers do to any significant degree.

    Also, isn't SYSMark 2007 a broken, misleading benchmark? It was compiled on Intel's compiler, you know the broken one that degrades performance on AMD and VIA processors unnecessarily. Also there is this bit that Intel has to include with its comparisons that use BAPco(Intel) benchmarks that include Intel's processors with comparisons to AMD or VIA processors:

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchase, including the performance of that product when combined with other products.

    It isn't perfect, but that is what the FTC and Intel agreed to, and until new benchmarks are released by BAPco that do not inflict poor performance on non-Intel processors, the results are not reliable. I don't see any problem if the graph did not contain AMD processors, but that isn't what we have here. If you are curious, for better or for worse, BAPco is a non-profit organization controlled by Intel.
  • Anand Lal Shimpi - Monday, January 3, 2011 - link

    Hardware vendors have no input into how we test, nor do they stipulate that we must test a certain way in order to receive future pre-release hardware. I should also add that should a vendor "cut us off" (it has happened in the past), we have many ways around getting supplied by them directly. In many cases, we'd actually be able to bring you content sooner as we wouldn't be held by NDAs but it just makes things messier overall.

    Either way, see my response above for why the 1100T is absent from some tests. It's the same reason that the Core i7 950 is absent from some tests, maintaining Bench and adding a bunch of new benchmarks meant that not every test is fully populated with every configuration.

    As far as your request for more real world benchmarks, we include a lot of video encoding, file compression/decompression, 3D rendering and even now a compiler test. But I'm always looking for more, if there's a test out there you'd like included let me know! Users kept on asking for compiler benchmarks which is how the VS2008 test got in there, the same applies to other types of tests.

    Take care,
    Anand
  • Melted Rabbit - Tuesday, January 4, 2011 - link

    Thanks for replying to my comment. I understand why the review was missing some benchmarks for processors like the 1100T. I was also a bit hasty in my accusations with respect to interference from manufacturers, which I apologize for.

    I still have trouble with including benchmarks compiled on the Intel compiler without a warning or explanation of what they mean. It really isn't a benchmark with meaningful results if the 1100T used x87 code and the Core i7-2600K used SSE2/SSE3 code. I would have no problem with reporting results for benchmarks compiled with Intel's defective compiler, like SYSmark 2007 and Cinebench R10, assuming they did not include results for AMD or VIA processors, along with an explanation of why they were not applicable to AMD and VIA processors. However, I find it problematic that such results are given without context.
  • DanNeely - Monday, January 3, 2011 - link

    Sysmark2k7 is like the various 3dmark benches. Mostly useless but with a large enough fanbase that running it is less hassle than dealing with all the whining fanbois.
  • Anand Lal Shimpi - Monday, January 3, 2011 - link

    There are a few holes in the data we produce for Bench, I hope to fill them after I get back from CES next week :) You'll notice there are some cases where there's some Intel hardware missing from benchmarks as well (e.g. Civ V).

    Take care,
    Anand
  • Lazlo Panaflex - Monday, January 3, 2011 - link

    Thanks Anand :-)
  • MeSh1 - Monday, January 3, 2011 - link

    Seems Intel did everything right for these to fit snugly into next gen Macs. Everything nicely integrated into one chip and the encode/transcode speed boost is icing on the cake (if supported of course) being that Apple is content focused. Nice addition if you're a Mac user.
  • Doormat - Monday, January 3, 2011 - link

    Except for the whole thing about not knowing if the GPU is going to support OpenCL. I've heard Intel is writing OpenCL drivers for possibly a GPU/CPU hybrid, or utilizing the new AVX instructions for CPU-only OpenCL.

    Other than that, the AT mobile SNB review included a last-gen Apple MBP 13" and the HD3000 graphics could keep up with the Nvidia 320M - it was equal to or ahead in low-detail settings and equal or slightly behind in medium detail settings. Considering Nvidia isn't going to rev the 320M again, Apple may as well switch over to the HD3000 now and then when Ivy Bridge hits next year, hopefully Intel can deliver a 50% perf gain in hardware alone from going to 18 EUs (and maybe their driver team can kick in some performance there too).
