At IDF in San Francisco last week, Intel provided us with lots of insights into Skylake, the microarchitecture behind the 6th generation Core series processors. Skylake marks the introduction of the Gen9 Intel HD Graphics technology. In advance of our full Skylake architecture analysis (coming soon), I wanted to get a head start and explain the media side (including Quick Sync and the image processing pipeline) of Skylake in a separate piece.

Media Capabilities and Quick Sync in Intel HD Graphics - A Brief History

Quick Sync has evolved over the last five years, starting with limited hardware acceleration and usage of the programmable EU array in Sandy Bridge. The second generation engine in Ivy Bridge moved to a hybrid hardware / software solution, with rate control, motion estimation, intra estimation and mode decision happening in the programmable EU array. Usage of the EU array enabled tuning of the algorithms. Motion compensation, intra prediction, forward quantization and entropy coding were done in hardware in the MFX (multi-format codec engine). Haswell added JPEG / MJPEG decode to the MFX, a dedicated VQE (video quality engine) for low power video processing and a faster media sampler.

Around the time Broadwell was introduced, major transitions were taking place on the video codec front - HEVC adoption was picking up, and VP8 / VP9 were also gaining support. In order to tackle these aspects and build on consumer feedback, Intel made major updates to the media block / Quick Sync engine late last year.

Broadwell was also the first microarchitecture to support two BSDs (bit stream decoders) in the GT3 variants. Each BSD accepts a set of commands to decode one video stream.

Broadwell's updates (when compared to Haswell) are summarized in the slide below.

The detailed discussion of Broadwell's media capabilities above is relevant to the improvements made in Skylake.

Skylake's Gen9 Graphics

The Gen9 graphics engine comes in multiple sizes for different power budgets. There are three main variants, GT2, GT3/GT3e and GT4e. In the slide below, the important aspect to note is that the media processing hardware (Media FF - Media Fixed Function) resides in the 'Unslice'. While the GT2 comes with the minimum possible Media FF logic, the GT3 and GT3e come with additional hardware capabilities. This strategy is similar to what was adopted in Broadwell.

The Unslice can operate at a different voltage and frequency compared to the Slices. This is especially important for video decoding / processing where the Media FF can run at higher clocks for better performance while ensuring minimal power consumption. From the viewpoint of tools such as GPU-Z and HWiNFO, it will be interesting to see if real-time statistics on voltage and clocks can be gathered for both the Unslice and the Slices. For additional power saving, power gating can be used at the Slices level or the EU group level.

Amongst the media improvements made in Skylake, we have:

  • An additional fixed function video encoder in the Quick Sync engine
  • Additional codec support (both decode and encode): HEVC, VP8, MJPEG
  • RAW imaging capabilities

Quick Sync in Skylake

Intel classifies the Quick Sync modes in Broadwell and previous generations as 'PG-Mode' (Processor Graphics). It is optimized for faster than real-time encoding and flexibility. The new mode, 'FF-Mode' (Fixed Function), is optimized for real-time H.264 encoding, with a focus on lowering latency and reducing power consumption. Except for programmable rate control, all other aspects of the encoding algorithm are handled in the MFX itself. Since rate control is in the hands of the application software, it is possible to do a 2-pass adaptive mode even with the FF hardware.
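To make the 2-pass idea concrete, here is a minimal sketch of what application-side rate control could look like. It is not Intel's algorithm or any Quick Sync API: the function name `allocate_bits` and the proportional allocation rule are assumptions chosen purely for illustration. The first pass records how many bits each frame needed at a fixed quality setting, and the second pass splits a target bit budget in proportion to those sizes.

```python
def allocate_bits(first_pass_sizes, target_total_bits):
    """Split a bit budget across frames in proportion to first-pass sizes.

    Hypothetical helper: first_pass_sizes holds the encoded size (in bits)
    of each frame from a trial encode at fixed quality. Frames that were
    expensive in the first pass get a larger share in the second pass.
    """
    total = sum(first_pass_sizes)
    if total <= 0:
        raise ValueError("first pass produced no data")
    return [target_total_bits * size / total for size in first_pass_sizes]


# Four frames from an imaginary first pass, re-budgeted to 2000 bits.
budget = allocate_bits([100, 300, 100, 500], 2000)
print(budget)
```

A real encoder would also have to respect buffer constraints and clamp per-frame targets, which this sketch ignores.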

The new mode could enable a better user experience with features such as WiDi and screen recording. Note that Skylake offers developers the flexibility to use either PG-Mode or FF-Mode in their applications. PG-Mode still retains the TUx (Target Usage level) discussed in one of the above slides.

Skylake's MFX engine adds HEVC Main profile decode support (4Kp60 at up to 240 Mbps). Main10 decoding can be done with GPU acceleration. The Quick Sync PG Mode supports HEVC encoding (again, Main profile only, with support for up to 4Kp60 streams).
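For a sense of scale, the quoted 4Kp60 / 240 Mbps ceiling can be turned into a per-pixel bit budget. The sketch below assumes "4K" means 3840x2160 and that Mbps means 10^6 bits per second; neither is spelled out in Intel's material. The 12 bits per pixel figure for uncompressed video assumes 8-bit 4:2:0 sampling.

```python
def bits_per_pixel(bitrate_bps, width, height, fps):
    """Average number of compressed bits available per pixel."""
    bits_per_frame = bitrate_bps / fps
    return bits_per_frame / (width * height)


# Assumed parameters: UHD frame size, 60 fps, 240 * 10^6 bits per second.
bpp = bits_per_pixel(240_000_000, 3840, 2160, 60)

# Uncompressed 8-bit 4:2:0 video carries 12 bits per pixel.
raw_bpp = 12
print(f"{bpp:.3f} bits per pixel, about {raw_bpp / bpp:.1f}:1 versus raw 4:2:0")
```

Even at the maximum supported bitrate, the stream averages under half a bit per pixel.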

The DXVA Checker screenshot (taken on an i7-6700K, a part with Intel HD Graphics 530 / GT2) for Skylake with driver version 10.18.15.4248 is reproduced below. HEVC_VLD_Main10 has a DXVA profile, but the decode is only partially done in the GPU (as specified in the slide above). The VP8 DXVA profile doesn't seem to be activated yet. There are new DXVA profiles (enabled) for the SVC (scalable video coding) extension to H.264.

Video Post Processing & Miscellaneous Aspects

Additional improvements include a scaler and format converter (SFC) that can work with the MFX and VQE (without using the EUs or the media sampler). This enables power-efficient rotation and color space conversion during media playback.

Yet another power-saving trick introduced in Skylake is media memory bandwidth compression. The compression is lossless and managed at the driver level.

Skylake's VQE also brings about new features with RAW image processing support (16-bit image pipeline), spatial denoising and local adaptive contrast enhancement (LACE). Power efficiency is also improved, with claims of the VQE consuming less than 50mW during operation.

The new fixed function hardware in the performance-sensitive stages enables even low power mobile Skylake parts to support 4Kp60 RAW video processing. LACE support is not available for 4K resolution on the Y-series Skylake parts, though.

Display Capabilities

In terms of display support, Skylake can drive up to three simultaneous displays. The supported resolutions are provided in the table below. At IDF, Intel was showing off the Skylake platform driving three 4K monitors simultaneously.

One of the disappointing aspects is the absence of a native HDMI 2.0 port with HDCP 2.2 support. Intel's solution is to add an LSPCon (Level Shifter - Protocol Converter) in the DP 1.2 path. Various solutions such as the MegaChips MCDP28 family of products exist for this purpose. According to one of the leaked Intel slides from earlier this year, the Alpine Ridge Thunderbolt 3 controller can also act as an LSPCon and provide an HDMI 2.0 output. At IDF, Intel indicated that we could see Alpine Ridge supporting HDMI 2.0 towards the end of the year (something corroborated unofficially by a few motherboard manufacturers).

The display sub-system also provides hardware support for Multi-plane Overlay (MPO), which allows alpha blending of multiple layers. This saves power by selectively disabling planes that are not needed. Use cases include certain video playback scenarios and HUD (heads-up display) gaming. The table below lists the updated support for MPO as one moves from Broadwell to Skylake. The NV12 feature is particularly interesting from a media playback perspective - it is a video format that avoids conversion as video data moves between the decoder, post processing and the display blocks. With Skylake, decoded NV12 content can also be provided directly to an MPO display plane, and there is no need for the video post processor to do an NV12 to RGB conversion.
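A quick buffer size comparison shows one reason skipping the NV12 to RGB step is attractive. NV12 stores a full-resolution 8-bit luma plane followed by a single interleaved chroma plane subsampled by two in each direction. The RGB side of the comparison assumes a 32-bit (4 bytes per pixel) surface, and the helper names are made up for this sketch.

```python
def nv12_bytes(width, height):
    """Size of one 8-bit NV12 frame: Y plane plus interleaved UV plane."""
    luma = width * height
    # One U and one V sample for every 2x2 block of pixels.
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return luma + chroma


def rgb32_bytes(width, height):
    """Size of one frame in an assumed 4-bytes-per-pixel RGB layout."""
    return width * height * 4


nv12 = nv12_bytes(1920, 1080)
rgb = rgb32_bytes(1920, 1080)
print(nv12, rgb, nv12 / rgb)
```

For a 1080p frame, the NV12 surface is 37.5% of the size of the 32-bit RGB one, so a plane that scans out NV12 directly also reads less memory per refresh.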

Intel indicated that the new Skylake MPO feature could save as much as 1.1W when playing back 1080p24 video on a 1440p panel - which is a substantial amount when mobile devices are considered. Power savings are also achieved by altering the core display clock based on the display configuration, number of displays and the resolution of each display.

Systems utilizing eDP with Windows 8.1 or later can also take advantage of hardware support for reducing refresh rate based on video content frame rate (for example, 24 fps video streams can be played after reducing the panel refresh rate to 48 Hz - eliminating 3:2 pull-down issues while also providing power savings). Obviously, the panel and TCON should support this.
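The cadence arithmetic behind this is easy to check. The sketch below counts how many refresh cycles each source frame stays on screen for a given content frame rate and panel refresh rate; `repeat_cadence` is an illustrative helper, not part of any driver interface.

```python
from fractions import Fraction


def repeat_cadence(content_fps, refresh_hz, frames=4):
    """Refresh cycles spent on each of the first `frames` source frames."""
    ratio = Fraction(refresh_hz, content_fps)  # refreshes per source frame
    counts = []
    shown = 0
    for i in range(1, frames + 1):
        # Frame i is replaced at the last whole refresh tick before i * ratio.
        end = (i * ratio.numerator) // ratio.denominator
        counts.append(end - shown)
        shown = end
    return counts


print(repeat_cadence(24, 60))  # uneven 2, 3, 2, 3 pattern
print(repeat_cadence(24, 48))  # every frame shown exactly twice
```

At 60 Hz, frames alternate between two and three refreshes, which is the source of pull-down judder. At 48 Hz every frame is held for exactly two refreshes.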

Additional power saving can also be achieved on supported panels using Panel Self Refresh Media Buffer Optimization (PSR MBO). It is an Intel-developed optimization on top of the Panel Self Refresh feature of eDP 1.3.

Concluding Remarks

The media-related changes in Skylake's Gen9 GPU are best summarized by the slide below.

Skylake brings a lot of benefits to content creators - particularly in terms of improvements to Quick Sync and additional image processing options (including real-time 4Kp60 RAW import). However, it is a mixed bag for HTPC users. While the additional video post processing options (such as LACE for adaptive contrast enhancement) can improve the quality of video playback, and the increase in graphics prowess can possibly translate to better madVR capabilities, two glaring aspects prove to be dampeners. The first one is the absence of full hardware acceleration for HEVC Main10 decode. Netflix has opted to go with HEVC Main10 for its 4K streams. When Netflix finally enables 4K streaming on PCs, Skylake, unfortunately, is not going to be as power efficient a platform as it could have been. The second is the absence of a native HDMI 2.0 / HDCP 2.2 video output. Even though an LSPCon solution is suggested by Intel, it undoubtedly increases the system cost. Sinks supporting HDMI 2.0 / HDCP 2.2 have become quite affordable. For less than $600, one can get a 4K Hisense TV with this capability. Unfortunately, Skylake is not going to deliver the most cost-effective platform to utilize the full capabilities of such a display.


37 Comments


  • icrf - Wednesday, August 26, 2015 - link

    Damn, I was hoping for a Skylake powered HP Stream Mini to hook up to my HDMI 2.0 / HDCP 2.2 Vizio M70-C3. Reply
  • drzzz - Wednesday, August 26, 2015 - link

    While I know this is analyzing media performance I find it sad how AT did not find it news worthy to mention that the GT4e parts will have 72 execution units. Considering this is the first article specifically on Intel Gen9 GPU's. Yes GT3/GT3e moving up to 48 from 40 is in one of the slides but no where was the EU count for GT4e included in this article. I sure hope AT is going to release an article that discusses GT4e specifically and how much performance we can expect from it. Reply
  • witeken - Wednesday, August 26, 2015 - link

    GT4e will be discussed when they review GT4e... Reply
  • nathanddrews - Wednesday, August 26, 2015 - link

    Yes, that should really get its own review. Reply
  • jeffkibuule - Wednesday, August 26, 2015 - link

    Commenters these days, no respect. Reply
  • MrSpadge - Wednesday, August 26, 2015 - link

    To be fair, Intel did not even specify which SKUs are going to get which GPU. In fact, officially they didn't say anything specific about the SKUs apart from the 2 released K CPUs. So it would be very hard to judge GT4e performance, as it will depend significantly on TDP. Just look at how much power Broadwell GT3e already consumes under GPU load. Reply
  • Kjella - Wednesday, August 26, 2015 - link

    How much power compared to what? The i7-5775C (65W CPU+GPU) benchmarks fairly similar to a R7 250 (65W GPU) or GTX 750 (55W GPU), I couldn't find any benchmarks for the i7-5950HQ (47W CPU+GPU) but it seems to me Intel's performance/watt is pretty good. It's their performance/$ that sucks, but that's because they totally own the high end laptop CPU market. A full GT4e is probably intended to compete with the lower end of the nVidia 9xxM series that use 50-120w by themselves. Reply
  • rtho782 - Thursday, August 27, 2015 - link

    Bear in mind that in the case of the Intel CPUs you have not factored in RAM, whereas the graphics cards you have.

    Of course, the CPU part not present in the dedicated GPUs probably uses more power than the VRAM not present in the CPU, but it just shows how hard it is to make a fair comparison.
    Reply
  • MrSpadge - Thursday, August 27, 2015 - link

    Well, compared to CPUs with GT2 GPUs, Broadwell with GT3e seems to consume 30-40W more under GPU load. Now add 50% more shaders and it's easy to see how this chip might run into power limits, depending on the SKU. The performance between a chip with 95 W TDP and 37 W TDP will differ wildly. Reply
  • Kutark - Wednesday, September 02, 2015 - link

    Here's my only thing with integrated graphics. Even if it is as fast as say a hypothetical 950/940/935m (whichever happens to be the case), and even if it does have similar TDP's. The difference is that with a "discrete" GPU, the manufacturer can move the GPU to another area where it will make cooling it significantly easier. I wonder how much of an issue cooling the cpu/gpu package being on the same die/socket etc is going to cause with these higher end parts.

    The other thing i wonder about is where they will come into use in the desktop market where the cooling factor isnt really an issue due to lack of space limitations. Once you get into that performance area, people are starting to be more of proper "gamers" and are much more likely to invest in a higher cost discrete GPU to get more gaming power.

    I guess the AMD integrated gpu's are a good example of where that market might stand, and frankly i don't know anything about the sales figures on those or market share etc. I can't imagine its high. But hey, im just spitballing here.
    Reply
