Broadwell GPU Architecture

While Intel’s tick products are relatively conservative on the CPU side, the same cannot be said for the GPU side. Although the same general tick-tock rules apply to GPUs just as they do to CPUs – the bigger architectural changes come with the tock – the embarrassingly parallel nature of graphics coupled with the density improvements of newer process nodes means that even on a tick Intel’s GPU improvements are going to be substantial. And Broadwell will be no exception.

At a high level, Broadwell’s GPU is a continuation of the Intel Gen7 architecture first introduced in Ivy Bridge and further refined into Gen7.5 for Haswell. While there are some important underlying changes that we’ll get to in a moment, at a fundamental level this is still the same GPU architecture that we’ve seen from Intel for the last two generations, just with more features, more polish, and more optimizations than ever before.

In terms of functionality Broadwell’s GPU has been upgraded to support the latest and greatest graphics APIs, an important milestone for Intel as this means their iGPU is now at feature parity with iGPUs and dGPUs from AMD and NVIDIA. With support for Direct3D feature level 11_2 and Intel’s previous commitment to Direct3D 12, Intel no longer trails AMD and NVIDIA in base features; in fact with FL 11_2 support they’re even technically ahead of NVIDIA’s FL 11_0 Kepler and Maxwell architectures. FL 11_2 is a rather minor update in the long run, but support for it means that Intel now supports tiled resources and pre-compiled shader headers.
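
Tiled resources support is something applications can query for themselves through the Direct3D 11.2 API. The sketch below is purely illustrative of that query path and is not specific to Intel’s driver; the device creation parameters and output are assumptions for the example, not Intel specifics.

```cpp
// Minimal sketch: query a D3D11.2 device for tiled resources support.
// Illustrative only -- error handling is reduced to HRESULT checks, and the
// reported tier depends entirely on the installed driver.
#include <d3d11_2.h>
#include <cstdio>
#pragma comment(lib, "d3d11.lib")

int main() {
    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* context = nullptr;
    D3D_FEATURE_LEVEL obtained;
    const D3D_FEATURE_LEVEL requested[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0 };

    HRESULT hr = D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                   requested, 2, D3D11_SDK_VERSION,
                                   &device, &obtained, &context);
    if (FAILED(hr)) return 1;

    // D3D11_FEATURE_DATA_D3D11_OPTIONS1 (added with D3D 11.2) reports the
    // tiled resources tier, if any, that the driver exposes.
    D3D11_FEATURE_DATA_D3D11_OPTIONS1 opts1 = {};
    hr = device->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS1, &opts1, sizeof(opts1));
    if (SUCCEEDED(hr))
        printf("Feature level: 0x%x, tiled resources tier: %d\n", obtained, opts1.TiledResourcesTier);

    context->Release();
    device->Release();
    return 0;
}
```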

Meanwhile on the compute front, Intel has confirmed that Broadwell’s GPU will offer support for OpenCL 2.0, including OpenCL’s shared virtual memory (SVM). OpenCL 2.0 brings with it several improvements that allow GPUs to be more robust compute devices, and though Intel doesn’t have a programming paradigm comparable to AMD’s HSA, SVM nonetheless affords Intel and OpenCL programmers the chance to better leverage Broadwell’s CPU and GPU together by directly sharing complex data structures rather than copying them around.
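
To give an idea of what shared virtual memory means for programmers, here is a minimal sketch of coarse-grained SVM using the standard OpenCL 2.0 C API. The kernel, buffer size, and device selection are illustrative assumptions, and error handling is omitted for brevity.

```cpp
// Minimal sketch of OpenCL 2.0 coarse-grained shared virtual memory (SVM):
// the host allocates a buffer with clSVMAlloc and passes the *pointer*
// directly to a kernel, instead of copying data into a cl_mem object.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    // Hypothetical kernel that increments every element of an int array.
    const char* src = "kernel void inc(global int* data) { data[get_global_id(0)] += 1; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, "-cl-std=CL2.0", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "inc", nullptr);

    // SVM allocation: visible to both CPU and GPU without explicit copies.
    const size_t n = 1024;
    int* data = (int*)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(int), 0);
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(int), 0, nullptr, nullptr);
    for (size_t i = 0; i < n; ++i) data[i] = (int)i;    // CPU writes directly
    clEnqueueSVMUnmap(queue, data, 0, nullptr, nullptr);

    clSetKernelArgSVMPointer(kernel, 0, data);           // pass the pointer, not a cl_mem
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);

    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, data, n * sizeof(int), 0, nullptr, nullptr);
    printf("data[0] = %d\n", data[0]);                   // CPU reads the GPU's result
    clEnqueueSVMUnmap(queue, data, 0, nullptr, nullptr);
    clFinish(queue);

    clSVMFree(ctx, data);
    return 0;
}
```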

Digging deeper, however, quickly reveals that Intel hasn’t left their GPU architecture entirely alone. Broadwell-Y, like Haswell-Y before it, implements a single slice configuration of Intel’s GPU architecture. However the composition of a slice will be changing for Broadwell, and this will have a significant impact on the balance between the GPU’s various execution resources.

Low Level Architecture Comparison
                                 AMD GCN              NVIDIA Maxwell       Intel Gen7.5 Graphics   Intel Gen8 Graphics
Building Block                   GCN Compute Unit     Maxwell SMM          Sub-Slice               Sub-Slice
Shader Building Block            16-wide Vector SIMD  32-wide Vector SIMD  2 x 4-wide Vector SIMD  2 x 4-wide Vector SIMD
Smallest Implementation          4 SIMDs              4 SIMDs              10 EUs                  8 EUs
Smallest Implementation (ALUs)   64                   128                  80                      64

In Haswell-Y Intel used a GT2 configuration, which was composed of a single slice that in turn contained 2 sub-slices. In Intel’s GPU architecture the sub-slice is the smallest functional building block of the GPU, containing the EUs (shaders) along with caches and texture/data/media samplers. Each EU in turn is composed of two 4-wide vector SIMDs, and Haswell packed 10 EUs into each sub-slice.

For Broadwell Intel is not changing the fundamental GPU architecture, but they are rebalancing the number of EUs per sub-slice and increasing the number of sub-slices overall. As compared to Haswell, Broadwell’s sub-slices will contain 8 EUs per sub-slice, with a complete slice now containing 3 sub-slices. Taken altogether this means that whereas Haswell-Y was a 2x10EU GPU, Broadwell-Y will be a 3x8EU GPU.

The ramifications of this are that not only is the total number of EUs increased by 20% from 20 to 24, but Intel has also greatly increased the ratio of L1 cache and samplers relative to EUs. There is now 25% more sampling throughput per EU, with a total increase in sampler throughput (at identical clockspeeds) of 50%. By PC GPU standards an increase in the ratio of samplers to EUs is very rare, with most designs decreasing that ratio over the years. The fact that Intel is increasing this ratio is a strong sign that Haswell’s balance may have been suboptimal for modern workloads, lacking enough sampler throughput to keep its shaders fed.
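
For those keeping score, the math behind those percentages works out as follows. This quick sketch assumes one texture sampler per sub-slice, consistent with Intel’s Gen7.5/Gen8 slice diagrams:

```cpp
// Back-of-the-envelope arithmetic behind the figures above (one texture
// sampler per sub-slice is assumed).
#include <cstdio>

int main() {
    // Haswell-Y GT2: 2 sub-slices x 10 EUs; Broadwell-Y: 3 sub-slices x 8 EUs
    const int hsw_eus = 2 * 10, bdw_eus = 3 * 8;
    const int hsw_samplers = 2, bdw_samplers = 3;

    printf("EU increase:              %+.0f%%\n", 100.0 * (bdw_eus - hsw_eus) / hsw_eus);                 // +20%
    printf("Sampler increase:         %+.0f%%\n", 100.0 * (bdw_samplers - hsw_samplers) / hsw_samplers);  // +50%
    printf("Samplers per EU increase: %+.0f%%\n",
           100.0 * ((double)bdw_samplers / bdw_eus - (double)hsw_samplers / hsw_eus)
                 / ((double)hsw_samplers / hsw_eus));                                                     // +25%
    return 0;
}
```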

Moving on, along with the sub-slices, the front end and common slice are also receiving their own improvements. The common slice – responsible for housing the ROPs, rasterizer, and a port to the L3 cache – is receiving some microarchitectural improvements to further increase pixel and Z fill rates. Meanwhile the front end’s geometry units are also being beefed up to increase geometry throughput.

As with overall CPU performance, Intel isn’t talking about specific GPU performance figures at this time. Between the 20% increase in shading resources and 50% increase in sampling resources Broadwell’s GPU should deliver some strong performance gains, though it seems unlikely that they will be on the order of a full generational gain (e.g. catching up to Haswell GT3). What Intel is doing however is reiterating the benefits of their 14nm process, noting that because 14nm significantly reduces GPU power consumption it will allow for more thermal headroom, which should further improve both burst and sustained GPU performance in TDP-limited scenarios relative to Haswell.

14nm isn’t the only technique Intel has to optimize power consumption on Broadwell’s GPU, which brings us to Broadwell’s final GPU technology improvement: Duty Cycle Control. While Intel has been able to clamp down on GPU idle power consumption over the years, they are increasingly fighting the laws of physics in extracting more idle power gains. At this point Intel can significantly scale down the frequency and operating voltage of their GPU, but past a point this offers diminishing returns. Transistors require a minimum voltage to operate – the threshold voltage – which means that after a certain point Intel can no longer scale down their voltage (and hence idle power consumption) further.

Intel’s solution to this problem is both a bit brute force and a bit genius, and is definitely unlike anything else we’ve seen on PC GPUs thus far. Since Intel can’t reduce their idle voltage any further, they are going to start outright turning off the GPU instead, a process known as duty cycling. By putting the GPU on a duty cycle Intel can run the GPU for just a fraction of the time – down to 12.5% of the time – which gets around the threshold voltage issue entirely.
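
A simplified model helps illustrate why this works. The power figures below are made-up placeholders rather than Intel data, but they show how average power under duty cycling scales roughly with the fraction of time the GPU is actually on:

```cpp
// Simplified, illustrative model of why duty cycling helps at the bottom of
// the DVFS range. The power numbers are hypothetical placeholders.
#include <cstdio>

int main() {
    const double p_on_floor = 1.00;  // hypothetical GPU power (W) at minimum voltage/frequency
    const double p_gated    = 0.05;  // hypothetical residual power (W) while the GPU is power-gated

    // Voltage can't drop below the threshold, so p_on_floor is the floor for a
    // conventionally idling GPU. Duty cycling turns the GPU off for most of the
    // period instead, so average power scales roughly with the on-time fraction.
    for (double duty = 1.0; duty >= 0.125; duty /= 2.0) {
        double avg = duty * p_on_floor + (1.0 - duty) * p_gated;
        printf("duty cycle %5.1f%% -> average GPU power ~%.3f W\n", duty * 100.0, avg);
    }
    return 0;
}
```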

This duty cycling is transparent to applications and the end user, with the display controller decoupled from the GPU clock domain and always staying online so that attached displays are always being fed regardless of what the GPU itself is doing. Control of the duty cycle is then handled through a combination of the GPU hardware and Intel’s graphics drivers, so both components will play a part in establishing the cycle.

Because today’s preview is Broadwell-Y centric, it’s unclear whether GPU duty cycle control is just a Broadwell-Y feature or whether it will be enabled in additional Broadwell products. Like many of Intel’s announced optimizations for Broadwell, duty cycle control is especially important for the TDP and battery life constrained Y SKU, but ultimately all mobile SKUs would stand to benefit from this feature. So it will be interesting to see just how widely it is enabled.

Moving on, last but not least in our GPU discussion, Intel is also upgrading their GPU’s media capabilities for Broadwell. The aforementioned increase in sub-slices and the resulting increase in samplers will have a direct impact on the GPU’s video processing capabilities – the Video Quality Engine and QuickSync – further increasing the throughput of each of them, up to 2x in the case of the video engine. Intel is also promising quality improvements in QuickSync, though they haven’t specified whether this is from technical improvements to the encoder or having more GPU resources to work with.

Broadwell’s video decode capabilities will also be improving compared to Haswell. On top of Intel’s existing codec support, Broadwell will implement a hybrid H.265 decoder, allowing Broadwell to decode the next-generation video codec in hardware, though not with the same degree of power efficiency as H.264 today. In this hybrid setup Intel will utilize portions of their fixed function video decoder while executing the remaining decoding steps on their shaders in order to offer complete H.265 decoding. Using the shaders for part of the decode process is less power efficient than doing everything in fixed function hardware, but it is still better than falling back to the even less efficient CPU.
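
From an application’s perspective the hybrid nature of the decoder is largely invisible; software simply asks the driver whether HEVC decode is advertised. As a rough sketch, on Linux that query might look like the following with VA-API (this assumes libva 1.5 or newer, which added VAProfileHEVCMain, and a DRM render node path that may differ per system):

```cpp
// Sketch: ask a VA-API driver whether it advertises HEVC (H.265) decode.
// Whether the decode path is fully fixed function or hybrid (partly on the
// shaders) is not visible at this level.
#include <va/va.h>
#include <va/va_drm.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    int fd = open("/dev/dri/renderD128", O_RDWR);   // assumed render node path
    if (fd < 0) return 1;

    VADisplay dpy = vaGetDisplayDRM(fd);
    int major = 0, minor = 0;
    if (vaInitialize(dpy, &major, &minor) != VA_STATUS_SUCCESS) { close(fd); return 1; }

    std::vector<VAProfile> profiles(vaMaxNumProfiles(dpy));
    int count = 0;
    if (vaQueryConfigProfiles(dpy, profiles.data(), &count) == VA_STATUS_SUCCESS) {
        bool hevc = false;
        for (int i = 0; i < count; ++i)
            if (profiles[i] == VAProfileHEVCMain) hevc = true;
        printf("HEVC Main decode advertised: %s\n", hevc ? "yes" : "no");
    }

    vaTerminate(dpy);
    close(fd);
    return 0;
}
```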

The use of a hybrid approach is essentially a stop-gap solution to the problem – the lead time on the finalization of H.265 would leave little time to develop a fixed function decoder for anyone with a long product cycle like Intel – and we expect that future generation products will have a full fixed function decoder. In the meantime Intel will be in the company of other GPU manufacturers such as NVIDIA, who is using a similar hybrid approach for H.265 on their Maxwell architecture.

Finally, Broadwell’s display controller will be receiving an update of its own. Broadwell is too soon for HDMI 2.0 or DisplayPort 1.3 – it will support HDMI 1.4 and DP 1.2/eDP 1.3a respectively – but the Y SKU in particular is getting native support for 4K. This is admittedly something of a backport since Haswell already supports 4K displays, but in Haswell’s case that feature was not available on Haswell-Y, so this is the first time native 4K support has come to a Y series SKU. This means that Broadwell-Y will be able to drive 4K displays, whether that means a 4K display in the device itself or a 4K display hooked up externally (with an overall limit of 2 displays on Broadwell-Y). Don’t expect Broadwell-Y to have the performance necessary to do intensive rendering at this resolution, but for desktop work and video playback this should be enough.

Comments

  • wurizen - Monday, August 11, 2014 - link

    well, an fx-8350 is toe-to-toe with an i7-2600k, which is no slouch until today. and comparing fx-8350 with today's i7-4770k would be a little unfair since the 4770k is 22nm while the 8350 is at 32nm. and we're not even considering software optimizations from OS and/or programs that are probably bent towards intel chips due to its ubiquity.

    so, i think, you're wrong that the fx-8350 doesn't provide good enough. i have both i7-3770k oc'd to 4.1 ghz and an fx-8320 at stock and the amd is fine. it's more than good enough. i've ripped movies using handbrake on both systems and to me, both systems are fast. am i counting milliseconds? no. does it matter to me if the fx-8320 w/ lets say amd r9-290 has 85 fps for so and so game and an i7-4770k w/ the same gpu has a higher fps of 95, let's just say? i don't think so. that extra 10 fps cost that intel dude $100 more. and 10 extra frames with avg frames of 85-95 is undecipherable. it's only when the frames drop down below 60 does one notice it since most monitors are at 60 hz.

    so what makes the fx not good enough for you again? are you like a brag queen? a rich man?
  • frostyfiredude - Monday, August 11, 2014 - link

    Not fair to compare against a 22nm from Intel? Bogus, I can go to the store and buy a 22nm Intel so it should be compared against AMDs greatest. An i5-4670K matches or exceeds the performance of even the FX-9590 in all but the most embarrassingly threaded tasks while costing 50$ more. Cost to operate the machine through the power bill makes up for that price difference at a fairly standard 12c per KWh when used heavily 2 hours per day for 4 years or idling 8 hours per day for the same 4 years.

    Your argument for gaming with the 8350 being good enough is weak too when the 10$ cheaper i3-4430 keeps up. Or spent 125$ less to get a Pentium G3258 AE, then mildly overclock it to again have the same good enough gaming performance if >60FPS is all that matters. The i3 and pentiums are ~70$ cheaper yet when power use is counted again.
  • wurizen - Tuesday, August 12, 2014 - link

    well, if a pentium g3258 is good enuff for gaming, then so is an fx-8350. whaaaaaat? omg we know intel is king. i acknowledge and understand that. intel rules. but, amd is not bad. not bad at all is all im trying to make.

    /omg
  • wetwareinterface - Monday, August 11, 2014 - link

    wow...

    first off you are assuming a lot and not bothering to check any published benchmarks out there so,

    1. 8350 isn't even equal to 2500 i5 let alone 2600 i7.
    2. 32nm vs. 22nm means nothing at all when comparing raw performance in a desktop. it will limit the thermal ceiling so in a laptop the higher nm chip will run hotter therefore be unable to hit higher clocks but in a desktop it means nil.
    3. handbrake ripping relies on speed of dvd/blu-ray drive, handbrake transcoding relies on cpu performance and the 8350 gets spanked there by a dual core i3 not by miliseconds but tens of seconds. i5 it gets to the level of minutes i7 more so.
    4. let's say you're pulling framerates for an r9-290 out of somewhere other than the ether... reality is an i5 is faster than the 8350 in almost any benchmark i've ever seen by roughly 15% overall. in certan games with lots of ai you get crazy framerate advantages with i5 over 8350, things like rome total war and starcraft 2 and diablo 3 etc...

    i'll just say fx8350 isn't good enough for me and i'm certainly not a rich man. system build cost for what i have vs. what the 8350 system would have run was a whopping $65 difference
  • wurizen - Tuesday, August 12, 2014 - link

    #3 is B.S. a dual-core i3 can't rip faster than an fx-8350 in handbrake.

    #4 the r-290 was an example to pair a fairly high end gpu with an fx-8350. a fairly high end gpu helps in games. thus, pairing it with an fx-8350 will give you a good combo that is more than good enough for gaming.

    #2 22nm vs. 32nm does matter in desktops. the fx-8350 is 32nm. if it goes to 22nm, the die shrink would enable the chip to either go higher in clockspeed or lower it's tdp.

    u sound like a benchmark queen or a publicity fatso.
  • wurizen - Tuesday, August 12, 2014 - link

    oh and #1--i am not saying the fx 8350 is better than the i7-2600k. i said "toe-to-toe." the i5-2500k can also beat the fx-8350 b/c of intel's IPC speed advantage. but, i think the reasons for that are programs not made to be multithreaded and make use of fx-8350 8-cores to its potential. since amd trails intel in IPC performance by a lot--this means that a 4-core i5-2500k can match it or sometimes even beat it in games. in a multithreaded environment, the 8-core fx-8350 will always beat the i5-2500k. although it might still trail the 4-core + 4 fake cores i7-2600k. just kidding. lol.

    i said toe to toe with 2600k which means its "competitive" to an i7-2600k even though the AMD is handicapped with slower IPC speed and most programs/OS not optimize for multithreading. so, to be 10-20% behind in most benchmarks against an i7-2600k is not bad considering how programs take advantage of intel's higher IPC performance.

    do u understand what im trying to say?
  • Andrew Lin - Tuesday, August 26, 2014 - link

    i'm sorry, is your argument here that the FX-8350 is better because it's inferior? because that's all i'm getting out of this. Of course a benchmark is going to take advantage of higher IPC performance. That's the point of a benchmark: to distinguish higher performance. The way you talk about benchmarks it's as if you think benchmarks only give higher numbers because they're biased. That's not how it works. The benchmarks give the i7-2600k higher scores because it is a higher performance part in real life, which is what anyone buying a CPU actually care about. Not to mention the significantly higher efficiency, which is just an added benefit.
    Also, it's really hard to take you seriously when your posts make me think they're written by a teenage girl.
  • wurizen - Tuesday, August 12, 2014 - link

    also, if the fps disparity is so huge btwn fx-8350 and say i5-2500k in games u mention like starcraft 2, then something is wrong with that game. and not the fx-8350. i actually have sc2 and i have access to a pc w/ an fx-8320. so i am going to do a test later tonight. my own pc is an i7-3770k. so i could directly compare 2 different systems. the only thing is that the amd pc has an hd5850 gpu, which should be good enuff for sc2 and my pc has a gtx680 so it's not going to be a direct comparison. but, it should still give a good idea, right?
  • wurizen - Tuesday, August 12, 2014 - link

    i just played starcraft 2 on a pc with fx-8320 (stock clockspeed), 8GB 1600Mhz RAM, 7200rpm HDD and an old AMD HD5850 w/ 1GB VRAM. the experience was smooth. the settings were 1080P, all things at ultra or high and antialiasing set to ON. i wasn't looking at FPS since i don't know how to do it with starcraft 2, but, the gameplay was smooth. it didn't deter my experience.

    i also play this game on my own pc which is an i7-3770k OC'd to 4.1, 16GB 1600 Mhz RAM, 7200rpmHDD and an Nvidia GTX680 FTW w/ 2GB VRAM and i couldn't tell the difference as far as the smoothness of the gameplay is concerned. there is some graphical differences between the AMD GPU and the Nvidia GPU but that is another story. my point is that my experience were seamless playing on an FX chip pc to my own pc with 3700k.

    to make another point, i also have this game on my macbook pro and that is where the experience of playing this game goes down. even in low settings. the MBP just can't handle it. at least the one i have with the older gt330m dGpu and dual-core w/ hyperthreading i7 mobile cpu.

    so.... there.... no numbers or stats. just the experience, to me, which is what counts did not change with the pc that had the amd fx cpu.
  • wurizen - Tuesday, August 12, 2014 - link

    well, i should point out that my macbook pro (mid-2010 model) can handle starcraft 2. but, it's not a "fun" experience. or as smooth.
