07:38PM EDT - Another talk from Hot Chips, this time on Intel's Knights Mill (KNM). The Intel Knights family stems from their Xeon Phi product line, although KNM is a bit different, with machine learning specific changes. It's not a completely new Xeon Phi design, but Intel wants to go after the machine learning market. Today's talk will go into some of those changes. (We're battling some wifi here, so pictures may come later).

07:41PM EDT - Still fighting WiFi from this morning, but we're seated and Intel's KNM is the next talk :)

07:42PM EDT - Jesus Corbal to the stage, one of the Primary Architects for KNL and Lead Architect for KNM. Part of the team that created AVX512 extensions

07:43PM EDT - 'Machine Learning' is a wide umbrella

07:43PM EDT - 'We need to put in the smarts to the algorithms'

07:44PM EDT - 'Neural Networks are not new - we learned about them in the 60s'

07:44PM EDT - 'The blessing and the curse is the curated data and self-training'

07:45PM EDT - 'A lot of focus on image recognition'

07:46PM EDT - 'We have solutions, from Xeon to Xeon Phi, to FPGA, to Deep Learning in the Crest Family'

07:46PM EDT - 'It's a mix from all-purpose to dedicated acceleration'

07:47PM EDT - 'So why is Xeon Phi, a HPC product, now doing Deep Learning?'

07:47PM EDT - 'Xeon Phi allows scale and configuration'

07:48PM EDT - 'Announcing Knights Mill, building on top of Knights Landing'

07:48PM EDT - 'To be launched in Q4'

07:48PM EDT - '4x Deep Learning perf over Knights Landing'

07:49PM EDT - 'Builds directly on top of KNL'

07:49PM EDT - 'It's all about integration of different components'

07:49PM EDT - 'Exploiting a new form of parallelism'

07:49PM EDT - 'We want the cake and eat it too: so we have embedded memory and DDR4'

07:50PM EDT - 16GB of MCDRAM

07:50PM EDT - It's all about the smart location of data for capacity and bandwidth

07:50PM EDT - Support binaries from Broadwell and below

07:50PM EDT - 2-way OoO, 4-way SMT, AVX-512 with VNNI, new Quad FMA

07:51PM EDT - TLP, ILP, DLP and PLP

07:51PM EDT - Quad FMA is new, VNNI is new for KNM

07:52PM EDT - PLP = Pipeline level parallelism via Quad FMA

07:52PM EDT - Based on KNL, up to 6 channels of DDR4, 36 lanes of PCIe

07:53PM EDT - Same core config of KNL: 2 cores sharing 1MB of L2, one VPU per core

07:54PM EDT - Using the Mesh interconnect

07:54PM EDT - Number of cores withheld for today (although that slide says 36 tiles)

07:54PM EDT - Quad FMA performs an FMA and funnels the result into the next FMA, accumulating into a single new result

07:54PM EDT - Building more FMA entities one after the other vertically

07:55PM EDT - Adds latency, need enough ILP to hide latency

07:55PM EDT - A single target for the vector accumulator

07:55PM EDT - uses source block of 4 zmm sources, memory operand packing of 4 scalars
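
A minimal sketch of how the quad FMA described above might behave, assuming each of the four packed scalars is multiplied against one of the four source vectors and everything accumulates into a single target. The function names and the 4-lane vectors are hypothetical (a real zmm register holds 16 FP32 lanes), and plain Python arithmetic rounds after the multiply, which a fused operation would not.

```python
# Illustrative model only: four chained FMA stages feeding one accumulator.
# Vectors are 4 lanes wide here for readability; names are hypothetical.

def fma(acc, vec, scalar):
    """One FMA stage: acc + vec * scalar, lane by lane."""
    return [a + v * scalar for a, v in zip(acc, vec)]

def quad_fma(acc, sources, scalars):
    """Chain four FMA stages into a single accumulator.

    sources: block of four vectors (standing in for four zmm registers)
    scalars: four scalars packed together in one memory operand
    """
    assert len(sources) == 4 and len(scalars) == 4
    for vec, scalar in zip(sources, scalars):
        acc = fma(acc, vec, scalar)  # each stage funnels into the next
    return acc

acc = [0.0, 0.0, 0.0, 0.0]
sources = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
    [2.0, 0.0, 2.0, 0.0],
]
scalars = [2.0, 3.0, 4.0, 1.0]
result = quad_fma(acc, sources, scalars)
print(result)  # [9.0, 9.0, 13.0, 13.0]
```

One instruction here stands in for four multiply-accumulate steps, which is the sense in which a narrow front end can still keep the vector unit fed.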

07:57PM EDT - Multiplying A into B to give C

07:57PM EDT - Pack together 12 aligned sources in DRAM to give QFMA

07:58PM EDT - Assuming 3 cycles of latency per FMA
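
A back-of-the-envelope sketch of why the extra latency needs ILP to hide it. It takes the 3-cycle figure above and assumes, purely for illustration, that the four stages run strictly back to back and that a new quad FMA can issue every cycle; neither assumption was confirmed in the talk.

```python
import math

FMA_LATENCY_CYCLES = 3  # per-FMA figure quoted in the talk
CHAINED_STAGES = 4      # a quad FMA chains four FMAs

def chained_latency(per_stage=FMA_LATENCY_CYCLES, stages=CHAINED_STAGES):
    """Latency of one quad FMA if the stages run strictly in series (assumption)."""
    return per_stage * stages

def independent_ops_needed(latency_cycles, issue_interval_cycles):
    """Independent operations that must be in flight to keep the unit busy."""
    return math.ceil(latency_cycles / issue_interval_cycles)

latency = chained_latency()
in_flight = independent_ops_needed(latency, issue_interval_cycles=1)  # assumed issue rate
print(latency, in_flight)  # 12 12
```

Under these assumptions a dependent chain of quad FMAs would stall for most of its latency, so the code (or the four SMT threads) has to supply independent accumulators.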

07:58PM EDT - Now VNNI

07:58PM EDT - Variable precision via 16-bit INT inputs and 32-bit INT output

07:59PM EDT - Horizontal dot product
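
A rough sketch of a horizontal dot product with 16-bit integer inputs and a 32-bit integer accumulator, as described above. It assumes two adjacent 16-bit elements feed each 32-bit lane and that overflow wraps rather than saturates; both are assumptions for illustration, and the function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def to_int32(x):
    """Wrap an integer to signed 32-bit (assumed non-saturating behaviour)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def dot_accumulate(acc, a, b):
    """Per 32-bit lane: acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1].

    acc: list of int32 lanes
    a, b: lists of int16 values, two per accumulator lane
    """
    assert len(a) == len(b) == 2 * len(acc)
    assert all(INT16_MIN <= v <= INT16_MAX for v in a + b)
    out = []
    for i, c in enumerate(acc):
        pair_sum = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]
        out.append(to_int32(c + pair_sum))
    return out

lanes = dot_accumulate([10, -5], [1, 2, 3, 4], [5, 6, 7, 8])
print(lanes)  # [27, 48]
```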

07:59PM EDT - Uses 31 bits of INT precision vs 24 bits of Mantissa in FP32
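
To make the 31-bit versus 24-bit comparison concrete, here is a small sketch showing where an FP32 accumulator stops representing every integer exactly while a 32-bit integer accumulator does not. The helper name is hypothetical; it uses the standard `struct` module to round a value to FP32.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = 2 ** 24                           # 16777216, the edge of FP32's 24-bit mantissa
int_sum = big + 1                       # a 32-bit integer holds this exactly
fp32_sum = to_fp32(float(big) + 1.0)    # FP32 cannot represent 2**24 + 1

print(int_sum)                  # 16777217
print(fp32_sum)                 # 16777216.0
print(int_sum < 2 ** 31 - 1)    # True: well inside the signed 32-bit range
```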

08:01PM EDT - Now for the core - an enhanced KNL, 2way OoO, 4way SMT, 1MB L2, 64-byte / cycle

08:03PM EDT - Even though it's 2-way in the front end, it's like 4-way in the back end

08:03PM EDT - We can send the same uop to two clusters: send it to the L/S and the VPU at the same time, where it is interpreted differently

08:04PM EDT - Compensate a narrow front end by packing more operations in a single instruction

08:04PM EDT - In KNL, two units do SP and DP

08:06PM EDT - In KNM, remove one DP port to give space for four SP VNNI units

08:06PM EDT - So 0.5x DP, 2x SP, 4x VNNI
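
A small sketch of how the scaling factors above could be applied to a KNL baseline. The baseline numbers in the example are placeholders, not measured or announced figures, and treating the VNNI factor as relative to KNL's single-precision rate is an assumption.

```python
# Scaling factors quoted in the talk, relative to KNL.
KNM_SCALING = {"DP": 0.5, "SP": 2.0, "VNNI": 4.0}

def knm_estimate(knl_dp_rate, knl_sp_rate):
    """Scale a KNL baseline (any consistent unit) by the quoted factors.

    Assumption: the VNNI factor is taken relative to KNL single precision.
    """
    return {
        "DP": knl_dp_rate * KNM_SCALING["DP"],
        "SP": knl_sp_rate * KNM_SCALING["SP"],
        "VNNI": knl_sp_rate * KNM_SCALING["VNNI"],
    }

# Placeholder baseline: 50 units of DP and 100 units of SP throughput.
estimate = knm_estimate(knl_dp_rate=50.0, knl_sp_rate=100.0)
print(estimate)  # {'DP': 25.0, 'SP': 200.0, 'VNNI': 400.0}
```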

08:06PM EDT - Pitching KNM for DL but with tradeoffs, same generation as KNL

08:06PM EDT - KNL to provide time to train and scale up - solve the problem by adding nodes. You can also use it for other things

08:08PM EDT - Now Q&A

08:09PM EDT - 'Why use INT for VNNI rather than FP'

08:10PM EDT - 'FP has failures: it's actually complex to adhere to IEEE and very few advantages. INT is easier and has a similar level of accuracy'

08:11PM EDT - 'Q: Framework performance?'

08:12PM EDT - 'A: we supply libraries, such as MKL, and an open source one called MKL-DNN'

08:13PM EDT - That looks about it. Shame they didn't state cores (even though the slide says 36 tiles), or frequencies.



View All Comments

  • Ian Cutress - Monday, August 21, 2017

    There might be frequency or power benefits, depending on what process it's going to be made on. I don't think they've announced that yet?
  • Rig - Monday, August 21, 2017

    AVX512BW?
  • Ian Cutress - Monday, August 21, 2017

    Not in KNM. Check the Venn diagram here

  • tipoo - Tuesday, August 22, 2017

    Before I followed the link I was thinking "a Venn diagram is too simple for Intel products", and yup lol, quad circle diagram.
  • Ro_Ja - Tuesday, August 22, 2017

    Oh...chips are hot alright.
  • Santoval - Tuesday, August 22, 2017

    I wonder if it will be better than KNL in everything but DP performance, or whether there will be additional drawbacks.
  • p1esk - Tuesday, August 22, 2017

    Why would I want to buy this instead of an Nvidia card to do DL?
  • mode_13h - Tuesday, August 22, 2017

    This is supposedly faster than P100, and probably much cheaper than V100. Still a tough sell, but better than KNL at least.
  • mode_13h - Tuesday, August 22, 2017

    I guess a better answer would be that maybe you're building a cluster for mixed-use hosting of both conventional HPC applications and deep learning. That's the only way that x86 works out to be an advantage.

    Otherwise, if they just wanted to beat GPUs at their own game, they'd have been better off using their HD Graphics architecture as a foundation and then bolting on 512-bit vector units + MCDRAM. However, we could be moving into an era when even GPUs are surpassed at deep learning by purpose-built chips like Google's TPU2.