Hot Chips 2020 Live Blog: Alibaba's Hanguang 800 NPU (5:00pm PT)
by Dr. Ian Cutress on August 18, 2020 7:55 PM EST
08:05PM EDT - FP19 support
08:04PM EDT - on EW2 stage
08:04PM EDT - Convert data to FP and push down the pipe
08:03PM EDT - Use sliding window to minimize access
08:02PM EDT - minimize data movement
08:02PM EDT - data reuse and fused ops
08:02PM EDT - This is the tensor engine throughput
08:02PM EDT - Each core has three engines: Tensor, Pooling, Memory
08:01PM EDT - PCIe 4.0 x16
08:01PM EDT - Command processor above all four cores
08:01PM EDT - 192 MB local memory, distributed shared, no DDR
08:01PM EDT - 4 cores with ring bus
08:00PM EDT - Flexible to support future activation functions
08:00PM EDT - Optimization for GEMM as well
08:00PM EDT - Lots of Alibaba workloads are convolution-related
08:00PM EDT - Aim to achieve a high-throughput, low-latency, power-efficient design
08:00PM EDT - Lots of business on inferencing
07:59PM EDT - Development in early 2018
07:58PM EDT - Former Huawei GPU architect
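
The "sliding window to minimize access" and "data reuse" notes above can be illustrated with a small sketch. This is not Alibaba's implementation; it is a generic 1D convolution in Python, and the function names and the load counter are hypothetical, added only to show how keeping the current window in a local buffer cuts memory reads compared with re-reading every window from scratch.

```python
from collections import deque


def conv1d_naive(x, w):
    """Re-read the full window from memory for every output element."""
    k = len(w)
    loads = 0  # hypothetical counter: number of reads of x
    out = []
    for i in range(len(x) - k + 1):
        acc = 0
        for j in range(k):
            acc += x[i + j] * w[j]
            loads += 1
        out.append(acc)
    return out, loads


def conv1d_sliding(x, w):
    """Read each input once and keep the last k values in a local buffer."""
    k = len(w)
    window = deque(maxlen=k)  # stands in for on-chip local storage
    loads = 0
    out = []
    for value in x:
        window.append(value)  # one read of x per input element
        loads += 1
        if len(window) == k:
            out.append(sum(a * b for a, b in zip(window, w)))
    return out, loads


x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]
naive_out, naive_loads = conv1d_naive(x, w)
sliding_out, sliding_loads = conv1d_sliding(x, w)
print(naive_out == sliding_out, naive_loads, sliding_loads)
```

Both versions produce the same outputs; the sliding-window version touches each input once, while the naive version reads each input up to once per window that overlaps it.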
