09:03PM EDT - The final presentation of Hot Chips 31 is from Microsoft, who will be lifting the lid of the silicon behind its HoloLens 2.0 product.

09:17PM EDT - Here we go

09:17PM EDT - HPU 2.0

09:18PM EDT - Holographic processor

09:18PM EDT - Custom silicon, obviously

09:19PM EDT - This speaker has been trained. There are purposeful pauses when she lists stuff

09:20PM EDT - Application processor runs the app, and the HPU modifies the rendered image and sends to the display

09:20PM EDT - HPU works on specific workloads

09:21PM EDT - Takes the visual cues and allows the HPU to track where the hands are at all times

09:21PM EDT - 79mm2 on TSMC 16FF+

09:21PM EDT - 123M gates, 2B transistors

09:22PM EDT - 2016 Tapeout

09:22PM EDT - 125 Mb of SRAM

09:22PM EDT - First prototype

09:22PM EDT - First prototype headset*

09:23PM EDT - HPU 2 is dedicated to only Microsoft workloads

09:23PM EDT - Targets a single Microsoft RTOS

09:23PM EDT - No MMUs, simple interrupts

09:23PM EDT - Frees up the hardware

09:23PM EDT - Works with the software team to configure caches and memory

09:24PM EDT - Balance between dedicated HW compute and flexibility / programmability

09:24PM EDT - SIMD Fixed Point at top

09:24PM EDT - Does 2D processing

09:24PM EDT - FVP, Floating Vector Processor on bottom, does 3D

09:24PM EDT - 2 Tensiilica processors per node

09:24PM EDT - Trade off area for latency - low latency was key

09:24PM EDT - DMA channel per core

09:25PM EDT - New depth based algorithms

09:25PM EDT - 13 statically assigned compute cores

09:25PM EDT - >1 TOP of programmable compute

09:25PM EDT - 100s of customized instructions

09:26PM EDT - Algorithm profiling to turn 10s of ops into a single instruction

09:27PM EDT - Example, boxavg_2x16x8 is a single cycle instruction

09:27PM EDT - instruction is applied to every pixel, saving 10k+ cycles per frame

09:27PM EDT - Hardened compute on ToF sensor

09:27PM EDT - JBL filter

09:28PM EDT - Uses 3 sensors and applies filter

09:28PM EDT - But didn't fit on the node. But adjusted a C model into RTL, for hardware. Reduces power to 1/3, and 1/30th latency

09:29PM EDT - Now thermals

09:30PM EDT - Power gating, clock gating, removing ULV cells

09:31PM EDT - Most digital logic at 250 MHz, compute at 500 MHz

09:31PM EDT - Reduced voltage, Vmin

09:31PM EDT - DVFS per chip

09:32PM EDT - Could take the guard bands off

09:32PM EDT - Can reduce the power by 20%

09:32PM EDT - at Vmin

09:32PM EDT - Now system integration

09:33PM EDT - HPU in front, App processor in back

09:33PM EDT - PCIe 2.0 x1 at 100 MB/s comms between front and back

09:34PM EDT - Rendered images sent back via MIPI to HPU

09:34PM EDT - MIPI QoS rates

09:35PM EDT - 6.8 GB/s needed to sync into two lanes of LPDDR4

09:35PM EDT - Custom DRAM scheduler on HPU 2.0

09:36PM EDT - Hologram stability

09:37PM EDT - Multiple pose updates per frame

09:37PM EDT - Hardened block on HPU decouples the render resolution to display resolution

09:37PM EDT - Gives more thermal headroom to GPU

09:38PM EDT - HPU Timestamps the sensor data as it comes in

09:38PM EDT - Hardened neural network

09:39PM EDT - Q&A

09:40PM EDT - Q: Comment on depth camera? A: Custom ToF, there's a lot of literature out.

09:40PM EDT - Q: Scheduler? A: Statically assigned algorithms to the compute units

09:41PM EDT - Q: Transfer data between different VFPs? A: Small amount of bandwidth between VFPs, but mostly between memory.

09:43PM EDT - That's a wrap! Thank you for staying with us through all the Hot Chips coverage!

Comments Locked

8 Comments

View All Comments

Log in

Don't have an account? Sign up now