Hot Chips 2020 Live Blog: NVIDIA A100 Performance (5:00pm PT)
by Dr. Ian Cutress on August 17, 2020 7:50 PM EST
08:13PM EDT - A100 couldn't just scale up V100 - L2 memory bandwidth wouldn't keep up
08:12PM EDT - Continually streams data, improving utilization
08:11PM EDT - 2x efficiency
08:11PM EDT - 3x L1 bandwidth, 2x in-flight capacity
08:11PM EDT - New load-global-store-shared copy bypassing the register file
08:11PM EDT - A100 uses 32-thread tensor cores to reduce instructions required
08:10PM EDT - Improved speeds and feeds, and efficiency
08:10PM EDT - 6K bytes per clock per SM for sparse
08:10PM EDT - A100 data bandwidth increases based on algorithm requirements
08:09PM EDT - FP32 now uses TF32 OPs, supports 20x improvement for sparse data
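For readers unfamiliar with TF32: it keeps FP32's 8-bit exponent but only 10 explicit mantissa bits. The sketch below is not from the talk; it is a rough software emulation of that precision loss. The function name `to_tf32` is hypothetical, and round-to-nearest-even is an assumption about rounding that may not match what the hardware does.

```python
import math
import struct


def to_tf32(x: float) -> float:
    """Round x to FP32, then keep only 10 explicit mantissa bits (TF32 precision)."""
    if math.isnan(x) or math.isinf(x):
        return x  # special values pass through unchanged
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Assumption: round-to-nearest-even on the 13 mantissa bits being dropped.
    bits += 0x0FFF + ((bits >> 13) & 1)
    bits &= 0xFFFFE000  # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 0.1, 3.14159265):
        print(value, "->", to_tf32(value))
```

The exponent range is untouched, so values that fit in FP32 still fit in TF32; only the number of significant bits shrinks.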
08:09PM EDT - Tensor core supports more data types
08:08PM EDT - Fixed-size networks
08:08PM EDT - A100 targeted strong scaling
08:08PM EDT - Each layer is parallelised - A100 is 2.5x for dense FP16
08:07PM EDT - DL strong scaling
08:07PM EDT - Strong scaling
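Strong scaling, as used in the entries above, means the problem size stays fixed while more compute is thrown at it. A common textbook way to reason about it is Amdahl's law, sketched below. This is not from the talk: the function name and the parallel fractions are hypothetical illustration values.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup on a fixed-size problem when only part of the work parallelises."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical fractions: even 5% serial work caps the achievable speedup.
    for fraction in (0.90, 0.95, 0.99):
        for workers in (2, 8, 64):
            print(fraction, workers, round(amdahl_speedup(fraction, workers), 2))
```

The serial term is what limits strong scaling, which is why faster data movement and higher utilization per unit matter as much as raw throughput.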
08:06PM EDT - Even wins against unreleased chips
08:06PM EDT - A100 dominates in per-chip performance as well
08:06PM EDT - Records on MLPerf with A100 Pods
08:06PM EDT - IEEE for FP64 MatMul
08:05PM EDT - Performance uplift against V100
08:05PM EDT - Increased L1, async data movement
08:05PM EDT - More efficient, improves perf with sparsity
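The sparsity support NVIDIA has documented for A100 is a 2:4 structured pattern: in every group of four weights, at most two are non-zero. The live blog entry above only says "sparsity", so the sketch below is an illustration rather than something shown in the talk. The function names are hypothetical, and picking the two largest-magnitude values per group is an assumed pruning rule.

```python
def compress_2_4(row):
    """Keep the two largest-magnitude values in each group of four.

    Returns (values, indices): the kept values and their positions (0-3)
    inside each group, which is all a sparse multiply needs to store.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Assumption: magnitude-based pruning; ties keep the earlier element.
        ranked = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)
        for j in sorted(ranked[:2]):
            values.append(group[j])
            indices.append(j)
    return values, indices


def sparse_dot(values, indices, activations):
    """Dot product using only the kept weights: half the multiplies of dense."""
    total = 0.0
    for k, (value, j) in enumerate(zip(values, indices)):
        group_start = (k // 2) * 4  # two kept values per group of four
        total += value * activations[group_start + j]
    return total


if __name__ == "__main__":
    weights = [0.1, -0.9, 0.4, 0.05, 0.7, 0.0, -0.2, 0.3]
    vals, idx = compress_2_4(weights)
    print(vals, idx)
    print(sparse_dot(vals, idx, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```

Storing two values plus two small indices per group is what lets the hardware skip half the multiplies while keeping the data layout regular.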
08:05PM EDT - Next-Gen Tensor Core
08:04PM EDT - 2x-7x improvements over V100 overall
08:04PM EDT - Elastic GPU, scale out with 3rd Gen NVLink
08:04PM EDT - 1.6 TB/sec HBM2 bandwidth
08:03PM EDT - 6912 CUDA Cores
08:03PM EDT - A100: 54-56B transistors
08:03PM EDT - Jack Choquette from NV
08:02PM EDT - Intel's John Sell, ex-Microsoft, is the chair for the session
08:00PM EDT - Open question if they'll talk about Ampere for environments other than HPC, but this session is also about 'Gaming', so you never know
07:58PM EDT - First talk of the GPU session is from NVIDIA, on the A100 performance and the Ampere architecture
