Broadwell GPU Improvements

Intel's new integrated graphics for Broadwell, in the form of HD 5500, HD 6000 and Iris 6100, open up some important functionality. DirectX 11.2, OpenGL 4.3 and OpenCL 2.0 are big scoring points, especially OpenCL 2.0 when it comes to GPU compute. OpenCL 2.0 support includes features such as shared memory coherency between CPU and GPU, which allows pointer-rich memory structures to be passed between the two and ultimately marks the beginnings of heterogeneous CPU + GPU compute.

‘DirectX 12 Ready’ is something of an interesting case, as it depends on Microsoft’s final specifications. At this point Intel states that it conforms to as much of the DirectX 12 feature set as it can get its hands on through its relationship with Microsoft and, barring any major addition or change, should be compatible. UHD-4K is also now supported through the HDMI 1.4b interface, albeit limited to 24 Hz.

Support for HEVC is also present as a combined (hybrid) hardware and software solution, just as it was on Core-M. Intel stated that because parts of the H.264 algorithm are nearly, if not exactly, identical to those of H.265, a small tweak to the hardware allows it to be used for both. This is still not all-encompassing hardware acceleration, but it does target some parts of the codec. I would speculate that if a section of the silicon were dedicated to complete hardware acceleration, it might eat into certain power budgets.

GPU Slices

Although mentioned in our Broadwell architecture overview, for Broadwell-U we should cover the basics of the GPU layout. Below is an image of an HD 6000 implementation featuring two slices of 24 EUs each, with each slice having three sub-slices of 8 EUs. Each EU can handle 7 threads at once when in a 128xSIMD8 32-bit configuration. When the data is laid out appropriately, this offers a good amount of compute power, especially for coalesced 16-bit computation.
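As a rough sketch of the layout arithmetic above, the snippet below multiplies out slices, sub-slices, EUs and threads for the HD 6000 configuration. Reading "128xSIMD8 32-bit" as 128 registers per thread, each holding eight 32-bit lanes, is an assumption made for illustration, and the variable names are hypothetical.

```python
# Gen8 HD 6000 layout figures taken from the text.
SLICES = 2
SUBSLICES_PER_SLICE = 3
EUS_PER_SUBSLICE = 8
THREADS_PER_EU = 7

# Assumption: "128xSIMD8 32-bit" means 128 registers per thread,
# each 8 lanes wide with 4 bytes (32 bits) per lane.
REGISTERS_PER_THREAD = 128
LANES_PER_REGISTER = 8
BYTES_PER_LANE = 4

eus_per_slice = SUBSLICES_PER_SLICE * EUS_PER_SUBSLICE
total_eus = SLICES * eus_per_slice
total_threads = total_eus * THREADS_PER_EU

register_bytes_per_thread = (
    REGISTERS_PER_THREAD * LANES_PER_REGISTER * BYTES_PER_LANE
)
register_bytes_per_eu = register_bytes_per_thread * THREADS_PER_EU

print(f"EUs per slice:          {eus_per_slice}")
print(f"Total EUs:              {total_eus}")
print(f"Threads in flight:      {total_threads}")
print(f"Register file / thread: {register_bytes_per_thread // 1024} KB")
print(f"Register file / EU:     {register_bytes_per_eu // 1024} KB")
```

Under that reading, each thread has a 4 KB register file, which is why keeping data in registers matters so much for occupancy.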

For Broadwell, Intel has reduced the number of EUs in a sub-slice from 10 down to 8. This is a big change, as it eases the load on the Thread Dispatch, Data Port and L1/L2 samplers in each sub-slice, which now have fewer EUs to serve. By this logic, given two otherwise equivalent designs, the one with 8 EUs per sub-slice should be able to get through more data thanks to lower bandwidth pressure and less competition for each sub-slice’s shared local memory.

Each slice has an L3 Data Cache, which for Gen8 is improved from 384 KB to 576 KB per slice. This ends up being split anyway: 64 KB per sub-slice as shared local memory (192 KB across the three sub-slices) and 384 KB for inter-slice L3 communication. More slices mean more caching overall, and as an aid to the last level cache outside the GPU, the Graphics Technology Interface has had its write bandwidth doubled.
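The split can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above; the helper `gpu_l3_total_kb` is a hypothetical name introduced for illustration.

```python
# Per-slice L3 figures quoted in the text (all in KB).
L3_PER_SLICE_KB = 576
SUBSLICES_PER_SLICE = 3
SLM_PER_SUBSLICE_KB = 64

# Shared local memory carved out of each slice's L3.
slm_per_slice_kb = SUBSLICES_PER_SLICE * SLM_PER_SUBSLICE_KB

# What is left over for general L3 data use.
l3_remaining_kb = L3_PER_SLICE_KB - slm_per_slice_kb


def gpu_l3_total_kb(slices: int) -> int:
    """Total L3 allocation across the GPU for a given slice count."""
    return slices * L3_PER_SLICE_KB


print(f"SLM per slice:      {slm_per_slice_kb} KB")
print(f"Remaining L3:       {l3_remaining_kb} KB")
print(f"Two-slice total L3: {gpu_l3_total_kb(2)} KB")
```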

The principle behind caches is to provide a small amount of memory that can be accessed faster, at the expense of size. The algorithms that predict which data is needed next (or preprogrammed data fetching) help increase the overall speed of any CPU function, while increasing the size of a cache reduces reliance on later caches by cutting the number of misses that require trips further out. GPUs are historically bandwidth starved, both for gaming and for compute, so having four named caches for the GPU to use, and enlarging at least the third, should contribute to the large performance gains Intel is claiming over the previous generation.

In terms of throughput, Intel gives the following numbers for GT2 / 24 EU configurations:

32b FP:  384 FLOP/cycle = 24 EUs * (2 * SIMD-4 FPU) * (MUL + ADD)
64b DP:   96 FLOP/cycle = 24 EUs * SIMD-4 FPU * (MUL + ADD) * 0.5 throughput
32b INT: 192 IOP/cycle  = 24 EUs * (2 * SIMD-4 FPU) * ADD
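The three formulas share the same shape, so they can be expressed as one small function. This is a minimal sketch; `ops_per_cycle` and its parameter names are hypothetical and simply mirror the terms in the formulas above.

```python
def ops_per_cycle(eus, fpus_per_eu, simd_width, ops_per_lane, rate=1.0):
    """Peak operations per clock for a given EU count.

    ops_per_lane is 2 for MUL + ADD and 1 for a plain ADD;
    rate scales for reduced-throughput modes such as 64-bit DP.
    """
    return eus * fpus_per_eu * simd_width * ops_per_lane * rate


EUS = 24  # GT2 configuration from the text

fp32 = ops_per_cycle(EUS, fpus_per_eu=2, simd_width=4, ops_per_lane=2)
fp64 = ops_per_cycle(EUS, fpus_per_eu=1, simd_width=4, ops_per_lane=2, rate=0.5)
int32 = ops_per_cycle(EUS, fpus_per_eu=2, simd_width=4, ops_per_lane=1)

print(f"32b FP:  {fp32:.0f} FLOP/cycle")
print(f"64b DP:  {fp64:.0f} FLOP/cycle")
print(f"32b INT: {int32:.0f} IOP/cycle")
```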

The Gen8 EUs will support 16-bit floats (half-floats) natively, as well as 16-bit integers.

Multiplying these per-cycle figures by each part's EU count and maximum GPU frequency gives the peak theoretical throughput:

Intel Gen8 Graphics Throughput

Processor                            32b FP GFLOPS   64b DP GFLOPS   32b INT GIOPS
i7-5557U (48 EUs at 1100 MHz)        844.8           211.2           422.4
i5-5257U (48 EUs at 1050 MHz)        806.4           201.6           403.2
i3-5157U (48 EUs at 1000 MHz)        768             192             384
i7-5250U (48 EUs at 950 MHz)         729.6           182.4           364.8
i7-5600U (24 EUs at 950 MHz)         364.8           91.2            182.4
i5-5300U (24 EUs at 900 MHz)         345.6           86.4            172.8
i3-5010U (23 EUs at 900 MHz)         331.2           82.8            165.6
i3-5005U (23 EUs at 850 MHz)         312.8           78.2            156.4
Pentium 3805U (12 EUs at 800 MHz)    153.6           38.4            76.8
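The table values follow directly from EU count and clock speed. The sketch below reproduces a few rows; the per-EU rates are derived from the GT2 per-cycle figures (384, 96 and 192 divided by 24 EUs), and `peak_gops` is a hypothetical helper name.

```python
# Per-EU, per-cycle peak rates implied by the 24 EU figures
# (384 / 24, 96 / 24 and 192 / 24 respectively).
FP32_PER_EU = 16
FP64_PER_EU = 4
INT32_PER_EU = 8


def peak_gops(eus, mhz):
    """Return (FP32 GFLOPS, FP64 GFLOPS, INT32 GIOPS) at the given clock."""
    return tuple(
        eus * per_eu * mhz / 1000
        for per_eu in (FP32_PER_EU, FP64_PER_EU, INT32_PER_EU)
    )


# A few rows from the table: (name, EU count, max GPU clock in MHz).
PARTS = [
    ("i7-5557U", 48, 1100),
    ("i7-5600U", 24, 950),
    ("i3-5005U", 23, 850),
    ("Pentium 3805U", 12, 800),
]

for name, eus, mhz in PARTS:
    fp32, fp64, int32 = peak_gops(eus, mhz)
    print(f"{name:<14} {fp32:7.1f} {fp64:7.1f} {int32:7.1f}")
```

These are theoretical peaks at maximum GPU frequency; sustained clocks under a power limit will be lower.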

Doubling the EU count doubles the peak throughput, and the extra bit of frequency on the higher-end parts helps as well.

As we mentioned back in August, the sub-slice also gets some minor improvements to increase pixel and Z-fill rates, and the front end’s geometry units are also beefed up. Further reading can be found on Intel’s PDF on the Gen 8 Graphics Overview from IDF.

85 Comments

  • KaarlisK - Monday, January 05, 2015

    "It might come across as somewhat surprising that a 15W CPU like the i7-5650U has a 2.2 GHz base frequency but then a 3.2 GHz to 3.1 GHz operating window, and yet the i7-5557U has a 3.1 GHz base with 3.4 GHz operating for almost double the TDP. Apart from the slight increase in CPU and GPU frequency, it is hard to account for such a jump without point at the i7-5650U and saying that ultimately it is the more efficient bin of the CPUs."
    This is not surprising. This is used to increase the GPU performance. 28W CPUs have Iris 6100, 15W CPUs have HD 6000.
    In no way does TDP tell us anything about efficiency.
  • aratosm - Monday, January 05, 2015

    Iris 6100 vs HD 6000 are almost identical. The only difference is a slightly faster clock speed. I think the problem is, HD6000 will throttle more to stay in that power envelope.
  • Topinio - Monday, January 05, 2015

    Looks to me like the 28W ones (i.e. those with the 6100 graphics) will be the only ones to be capable of being near the max turbo clocks for long.

    Would also be interesting to know the AVX base and turbo clocks for these chips, to compare the possible 64b DP GFLOPS from the CPU cores to those listed on page 2 from the GPUs. Top bin is likely somewhere < 102 (vs 211 from GPU), but how much lower?
  • MrSpadge - Monday, January 05, 2015

    For the big Xeons the AVX base clock is typically 200 MHz below the regular base clock. They operate in a similar frequency & voltage range as the mobile chips (and are as power-limited as they are), so expect the same to apply here.
  • hansmuff - Thursday, January 08, 2015

    First time I read about AVX clocks, then found another mention in a previous Xeon CPU article. Is this a thing for Xeon only, or do the Haswell desktop chips throttle the clock with heavy AVX as well?
  • naloj - Monday, January 05, 2015

    A good example of this is in the throttling of the HD5000 in the 15W NUC i5-4250. You can get 40% better performance by changing the TDP settings from 25W short burst / 15W steady to 35W short burst / 31W steady.
  • MrSpadge - Monday, January 05, 2015

    "In no way does TDP tell us anything about efficiency."

    Agreed - TDP is far too crude for this. Intel Desktop CPUs often operate far below TDP, whereas mobile chips are throttled by it. How much? Depends on the laptop, environment temperature etc.

    So even though the 15 W CPUs quoted above are allowed to top out at 3+ GHz, they won't run at anywhere close to this frequency under sustained heavy load. The 28 W chips should have no trouble sustaining the speed, given adequate cooling.
  • zepi - Monday, January 05, 2015

    Iris 6100 + edram or at least DDR4 bandwidth increases would have made a terrific difference to "retina" and high-dpi ultrabooks / laptops, but now this upgrade is watered to irrelevancy.

    Nothing to see here...
  • HungryTurkey - Wednesday, January 14, 2015

    In a retina/hdpi environment, few applications would come close to saturating the bus. The EUs (even with the 6100) would bottleneck long before LPDDR3/DDR3 would.
  • fokka - Thursday, January 08, 2015

    as i see it the given tdp only ensures operation at base clocks and without a substantial graphics load. operation at turbo clocks requires to overstep the tdp until power draw and or temps are too high and the clock returns to the base frequency.

    if you look at it like this it's not surprising a 15w sku has a base clock of 2.2ghz and a 28w sku 3.1ghz. that said the 28w tdp still looks "too high" for the frequency you get out of it, but i guess that this extra power/heat-budget is there for the sole reason so the 28w sku can operate at turbo clocks for longer without throttling down again, plus there is more headroom for a graphics load at the same time. this ensures, even with similar hardware and turbo-clocks, the 28w sku is allowed to produce more heat and in turn get more work done in the same time.

    that's the same reason core-m has very high turbo speeds, but can only turbo for a couple seconds until it's too hot and it "throttles" down to base clocks.