Hot Chips 2021 Live Blog: Machine Learning (Esperanto, Enflame, Qualcomm)
by Dr. Ian Cutress on August 24, 2021 11:05 AM EST
12:16PM EDT - AIC and OAM
12:15PM EDT - Layer cables to racks without DMA
12:15PM EDT - custom protocol with sub-microsecond latency
12:15PM EDT - 200 GB/s bi-directional IO per card
12:14PM EDT - Supports dimension reshape
12:14PM EDT - 4D tensors
12:13PM EDT - Async data flow and compute pipeline
12:13PM EDT - L0 cache with 10 TB/s bandwidth
12:12PM EDT - have to have it on a power of two boundary
12:12PM EDT - Support various tensor shapes
12:11PM EDT - 256 kernels support convolution operations
12:10PM EDT - hardware can add padding elements to get best efficiency combined with zero power instruction detection
12:09PM EDT - Vector and Scalar support sum and pooling
12:09PM EDT - 2 kbit per cycle for store, 1 kbit per cycle for load
12:08PM EDT - can fully skip instructions if zero power instruction detected
12:08PM EDT - Introduce sparsity for power
12:07PM EDT - Each kernel supports 1x-32bit MAC or 4x16-bit/8-bit MAC. All kernels do all precisions
12:07PM EDT - 256 Tensor compute Kernels
12:07PM EDT - GCU-CARE 1.0
12:06PM EDT - DMA engine with 1 KB interface
12:06PM EDT - 256 KB of L1-Data
12:06PM EDT - 1024-bit bus with
12:06PM EDT - VLIW programmable
12:06PM EDT - 40 data transfer engines
12:05PM EDT - 4 clusters of 8 tensor units
12:05PM EDT - on chip network
12:05PM EDT - 32 AI compute cores
12:05PM EDT - 2 HBM2 at 512 GB/s
12:04PM EDT - 300W
12:04PM EDT - 16 lanes PCIe 4.0
12:03PM EDT - 80 TF of BF16, 12nm FinFET, 14.1 billion transistors, 200 GB/s interconnect
12:03PM EDT - DTU 1.0
12:02PM EDT - Designed 2018, launched 2019
12:02PM EDT - First Gen
12:02PM EDT - Next talk is Enflame
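A quick back-of-the-envelope check on the Enflame numbers above, as a minimal Python sketch. It only uses figures quoted in the entries (256 tensor kernels, 1x 32-bit or 4x 16-bit/8-bit MACs per kernel, 4 clusters of 8 units, 2 kbit store and 1 kbit load per cycle). The variable and function names are made up for illustration, and treating "kbit" as 1024 bits is an assumption, not something stated in the talk.

```python
# Figures quoted in the Enflame DTU 1.0 entries of this live blog.
TENSOR_KERNELS = 256
MACS_PER_KERNEL = {"32-bit": 1, "16-bit": 4, "8-bit": 4}
CLUSTERS = 4
UNITS_PER_CLUSTER = 8

# Assumption: "kbit" is read as 1024 bits.
STORE_BITS_PER_CYCLE = 2 * 1024
LOAD_BITS_PER_CYCLE = 1 * 1024


def macs_per_cycle(precision: str) -> int:
    """MACs per cycle across one group of 256 tensor kernels."""
    return TENSOR_KERNELS * MACS_PER_KERNEL[precision]


def bits_to_bytes(bits: int) -> int:
    """Convert a per-cycle bit width into bytes."""
    return bits // 8


if __name__ == "__main__":
    for precision in MACS_PER_KERNEL:
        print(precision, "MACs per cycle:", macs_per_cycle(precision))
    print("Units across clusters:", CLUSTERS * UNITS_PER_CLUSTER)
    print("Store bytes per cycle:", bits_to_bytes(STORE_BITS_PER_CYCLE))
    print("Load bytes per cycle:", bits_to_bytes(LOAD_BITS_PER_CYCLE))
```

The 4 x 8 product lines up with the 32 AI compute cores quoted earlier. Whether the 256 kernels are per core or per chip is not stated in the entries, so the sketch does not try to derive a clock speed from the 80 TF figure.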
12:01PM EDT - Q: Data cache size for general purpose A: With area of 1000 cores, shift L1/L2 to multi-level is important. Special circuits - keep very robust voltage, need to use large SRAM for low voltage. 4 KB L1 gave a good hit rate with the L2 for performance
12:00PM EDT - Q: Why not BF16? A: Natively it does, but BF16 would be expanded to FP32 for compute and converted back to BF16 for storage. Because we do inference - customer wants inference, doesn't need BF16
11:58AM EDT - Q: Do external memory and IO power add on top of the 20W? A: IOs are included. 20W includes DRAM and other components
11:56AM EDT - Q&A time
11:55AM EDT - Early Access for qualified customers later in 2021
11:55AM EDT - Highest performance commercial RISC-V chip to date
11:55AM EDT - A0 silicon in test
11:54AM EDT - First silicon in bring up
11:54AM EDT - 24 billion transistors, 570mm2, 89 mask layers
11:54AM EDT - Full RV64GC ISA
11:54AM EDT - Four high-performance ET-Maxions
11:52AM EDT - Esperanto projected performance
11:51AM EDT - Software through many interfaces
11:50AM EDT - 6 chips have a single heatspreader
11:50AM EDT - How to deploy at scale
11:50AM EDT - OCP versions
11:49AM EDT - 822 GB/s total memory bandwidth per PCIe card
11:49AM EDT - 192 GB of accelerator memory
11:49AM EDT - Six chips and 24 LPDDR4 chips on a PCIe card with a PCIe switch
11:49AM EDT - 256-bit wide LPDDR4X
11:48AM EDT - 16 LPDDR4X controllers
11:48AM EDT - Meshes run over the cores
11:48AM EDT - SRAM banks could be partitioned as private L2 or shared L3
11:48AM EDT - mesh interconnect on each shire
11:47AM EDT - with 4 MB of shared SRAM
11:47AM EDT - 4 neighborhoods makes a shire
11:47AM EDT - custom instructions
11:47AM EDT - cooperative loads
11:46AM EDT - far more efficient than having each core with its own I-cache
11:46AM EDT - 8 minions share a single large instruction cache
11:46AM EDT - before wide length became a problem
11:46AM EDT - 8 cores on a chip form a neighborhood
11:45AM EDT - 512-bit wide integer per cycle, 256-bit wide FP per cycle, per core
11:45AM EDT - 64k ops
11:45AM EDT - can do 64 ops on one tensor instruction
11:45AM EDT - 300 MHz to 2 GHz
11:44AM EDT - SMT2
11:44AM EDT - in order pipeline
11:44AM EDT - 64-bit RISC-V processor, software configurable L1 data cache
11:43AM EDT - Most efficient point is at 8.5 W - 2.5x better perf than at 0.9 volts
11:42AM EDT - 0.75 volts is 164W per chip
11:42AM EDT - One chip could use 275W at peak
11:42AM EDT - Inferences per second per watt
11:41AM EDT - Efficiency vs voltage - 0.34 V is best
11:40AM EDT - C dynamic is hard
11:40AM EDT - drive down voltage per core
11:40AM EDT - TSMC 7nm FinFET
11:38AM EDT - Highest recommendation performance inside 120W in six chips
11:38AM EDT - allows for lower voltage, increasing efficiency
11:38AM EDT - Esperanto splits it across chips
11:38AM EDT - Large chips have large power
11:38AM EDT - 1000s of RISC-V cores in Esperanto
11:38AM EDT - limited parallelism with single big chips
11:37AM EDT - thousands of threads
11:36AM EDT - Fixed function hardware can quickly become obsolete
11:35AM EDT - reduce off-die memory references
11:34AM EDT - be programmable
11:34AM EDT - dense and sparse workloads
11:34AM EDT - Multiple data type support
11:34AM EDT - Low power budget per card
11:34AM EDT - these servers need add-in cards
11:34AM EDT - traditionally run on x86
11:33AM EDT - focus on recommendation models
11:33AM EDT - Under 20 watts for inference
11:33AM EDT - Up to 200 Tera-Ops
11:33AM EDT - PCIe x8 Gen 4
11:33AM EDT - 160 million bytes of SRAM onboard
11:32AM EDT - ET-Minion with tensor units
11:32AM EDT - 1088 RISC-V cores
11:31AM EDT - AI Accelerator - 1000 RISC-V cores on a chip
11:30AM EDT - First up is a talk from Esperanto Technologies
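For anyone wanting to sanity-check the Esperanto card-level numbers above, here is a minimal Python sketch that splits the quoted per-card figures across the six chips and counts shires per chip. All inputs come from the entries in this blog. The names are illustrative only, and the even split across chips is an assumption rather than something Esperanto stated.

```python
# Figures quoted in the Esperanto entries of this live blog.
CHIPS_PER_CARD = 6
CORES_PER_CHIP = 1088
CARD_MEMORY_GB = 192
CARD_BANDWIDTH_GB_S = 822
CARD_POWER_W = 120
CORES_PER_NEIGHBORHOOD = 8
NEIGHBORHOODS_PER_SHIRE = 4


def per_chip(card_total: float) -> float:
    """Split a per-card figure evenly across the chips (an assumption)."""
    return card_total / CHIPS_PER_CARD


cores_per_shire = CORES_PER_NEIGHBORHOOD * NEIGHBORHOODS_PER_SHIRE
# Integer division; 1088 happens to divide evenly by 32.
shires_per_chip = CORES_PER_CHIP // cores_per_shire

summary = {
    "power_w_per_chip": per_chip(CARD_POWER_W),
    "memory_gb_per_chip": per_chip(CARD_MEMORY_GB),
    "bandwidth_gb_s_per_chip": per_chip(CARD_BANDWIDTH_GB_S),
    "cores_per_card": CORES_PER_CHIP * CHIPS_PER_CARD,
    "cores_per_shire": cores_per_shire,
    "shires_per_chip": shires_per_chip,
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, "=", value)
```

The 120 W card budget split six ways gives 20 W per chip, which matches the "Under 20 watts for inference" entry.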
11:25AM EDT - Starting here in about 5 minutes
11:08AM EDT - Event starts at 8:30am PT, so in about 22 minutes
11:08AM EDT - Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned during Monday and Tuesday for our regular AnandTech Live Blogs.
