12:16PM EDT - AIC and OAM

12:15PM EDT - Layer cables to racks without DMA

12:15PM EDT - custom protocol with sub-microsecond latency

12:15PM EDT - 200 GB/s bi-directional IO per card

12:14PM EDT - Supports dimension reshape

12:14PM EDT - 4D tensors

12:13PM EDT - Async data flow and compute pipeline

12:13PM EDT - L0 cache with 10 TB/s bandwidth

12:12PM EDT - have to have it on a power of two boundary

12:12PM EDT - Support various tensor shapes

12:11PM EDT - 256 kernels support convolution operations

12:10PM EDT - hardware can add padding elements to get best efficiency combined with zero power instruction detection

12:09PM EDT - Vector and Scalar support sum and pooling

12:09PM EDT - 2 kbit per cycle for store, 1 kbit per cycle for load

12:08PM EDT - can fully skip instructions if zero power instruction detected

12:08PM EDT - Introduce sparsity for power
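
The notes do not say how the zero detection is implemented in hardware. As a rough software analogy only, the sketch below skips any multiply-accumulate whose operand is zero and counts how much work was avoided. The function name `sparse_dot` and the sample values are hypothetical.

```python
def sparse_dot(weights, activations):
    """Dot product that skips every MAC with a zero operand.

    Software analogy only: real hardware would detect the zero operand
    and avoid issuing the instruction, saving power rather than time.
    """
    total = 0.0
    executed = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            skipped += 1      # product is zero, so the result is unchanged
            continue
        total += w * a
        executed += 1
    return total, executed, skipped


# Hypothetical sample data: two of the four products involve a zero.
result = sparse_dot([0.5, 0.0, -1.0, 2.0], [4.0, 3.0, 0.0, 1.5])
print(result)
```

The more zeros a workload contains, the larger the skipped share, which is why sparsity is framed here as a power feature.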

12:07PM EDT - Each kernel supports 1x-32bit MAC or 4x16-bit/8-bit MAC. All kernels do all precisions

12:07PM EDT - 256 Tensor compute Kernels
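
As a back-of-the-envelope check, the kernel and MAC counts can be tied to the 80 TF BF16 figure quoted for the chip. This sketch assumes the 256 tensor kernels are per compute core, that BF16 uses the 4x 16-bit MAC path, and that one MAC counts as two operations. None of these assumptions is confirmed in the notes, and the resulting clock is implied, not quoted.

```python
CORES = 32                  # "32 AI compute cores"
KERNELS_PER_CORE = 256      # assumption: the 256 tensor kernels are per core
MACS_PER_KERNEL_16BIT = 4   # "4x16-bit/8-bit MAC" per kernel
OPS_PER_MAC = 2             # assumption: one multiply plus one add
PEAK_BF16_FLOPS = 80e12     # "80 TF of BF16"


def implied_clock_hz(peak_flops):
    """Clock needed to reach peak_flops if every MAC is busy every cycle."""
    ops_per_cycle = (CORES * KERNELS_PER_CORE
                     * MACS_PER_KERNEL_16BIT * OPS_PER_MAC)
    return peak_flops / ops_per_cycle


clock = implied_clock_hz(PEAK_BF16_FLOPS)
print(f"implied clock: {clock / 1e9:.2f} GHz")  # about 1.22 GHz
```

Under those assumptions the headline number works out to a clock a little above 1.2 GHz.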

12:07PM EDT - GCU-CARE 1.0

12:06PM EDT - DMA engine with 1 KB interface

12:06PM EDT - 256 KB of L1-Data

12:06PM EDT - 1024-bit bus with

12:06PM EDT - VLIW programmable

12:06PM EDT - on-chip network

12:06PM EDT - 40 data transfer engines

12:05PM EDT - 4 clusters of 8 tensor units

12:05PM EDT - 32 AI compute cores

12:05PM EDT - 2 HBM2 at 512 GB/s

12:04PM EDT - 300W

12:04PM EDT - 16 lanes PCIe 4.0

12:03PM EDT - 80 TF of BF16, 12nm FinFET, 14.1 billion transistors, 200 GB/s interconnect

12:03PM EDT - DTU 1.0

12:02PM EDT - Designed 2018, launched 2019

12:02PM EDT - First Gen

12:02PM EDT - Next talk is Enflame

12:01PM EDT - Q: Data cache size for general purpose? A: With the area of 1000 cores, shifting L1/L2 to multi-level is important. Special circuits keep voltage very robust, and need to use large SRAM for low voltage. 4 KB L1 gave a good hit rate with the L2 for performance

12:00PM EDT - Q: Why not BF16? A: Natively it does, but BF16 would be expanded to FP32 for compute and converted back to BF16 for storage. Because we do inference - customer wants inference, doesn't need BF16
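
The expand-and-store-back pattern in that answer relies on BF16 being the top 16 bits of an FP32 value. A minimal sketch of the two conversions follows. The function names are hypothetical, round-to-nearest-even is an assumption (the answer does not name a rounding mode), and NaN handling is omitted.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round an FP32 value to BF16 and return the 16-bit pattern.

    Assumes round-to-nearest-even. NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b):
    """Expand a BF16 bit pattern to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


stored = fp32_to_bf16_bits(3.14159)       # storage format
expanded = bf16_bits_to_fp32(stored)      # compute format
print(hex(stored), expanded)              # 0x4049 3.140625
```

Expansion is exact, while the trip back to storage loses the low 16 mantissa bits, which is the cost the answer weighs against what inference customers need.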

11:58AM EDT - Q: Do external memory and IO power add on top of the 20W? A: IOs are included. 20W includes DRAM and other components

11:56AM EDT - Q&A time

11:55AM EDT - Early Access for qualified customers later in 2021

11:55AM EDT - Highest performance commercial RISC-V chip to date

11:55AM EDT - A0 silicon in test

11:54AM EDT - First silicon in bring up

11:54AM EDT - 24 billion transistors, 570mm2, 89 mask layers

11:54AM EDT - Full RV64GC ISA

11:54AM EDT - Four high-performance ET-Maxions

11:52AM EDT - Esperanto projected performance

11:51AM EDT - Software through many interfaces

11:50AM EDT - 6 chips have a single heatspreader

11:50AM EDT - How to deploy at scale

11:50AM EDT - OCP versions

11:49AM EDT - 822 GB/s total memory bandwidth per PCIe card

11:49AM EDT - 192 GB of accelerator memory

11:49AM EDT - Six chips and 24 LPDDR4 chips on a PCIe card with a PCIe switch

11:49AM EDT - 256-bit wide LPDDR4X

11:48AM EDT - 16 LPDDR4X controllers
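
The card-level memory figures can be broken down per chip with simple arithmetic. The sketch below uses only numbers from these notes, and assumes memory and bandwidth are split evenly across the six chips. The per-pin transfer rate it derives is implied, not quoted.

```python
CHIPS_PER_CARD = 6
DRAM_PACKAGES_PER_CARD = 24
CARD_MEMORY_GB = 192
CARD_BANDWIDTH_GB_S = 822
BUS_WIDTH_BITS = 256        # per chip
CONTROLLERS_PER_CHIP = 16

# Assumption: memory and bandwidth are divided evenly across chips.
memory_per_chip_gb = CARD_MEMORY_GB / CHIPS_PER_CARD
bandwidth_per_chip = CARD_BANDWIDTH_GB_S / CHIPS_PER_CARD
packages_per_chip = DRAM_PACKAGES_PER_CARD // CHIPS_PER_CARD
bits_per_controller = BUS_WIDTH_BITS // CONTROLLERS_PER_CHIP

# Implied transfer rate: bytes per second divided by bytes per transfer.
implied_gt_s = bandwidth_per_chip / (BUS_WIDTH_BITS / 8)

print(memory_per_chip_gb, "GB per chip")
print(bandwidth_per_chip, "GB/s per chip")
print(packages_per_chip, "DRAM packages per chip")
print(bits_per_controller, "bits per controller")
print(f"{implied_gt_s:.2f} GT/s per pin")
```

The implied rate of roughly 4.28 GT/s sits close to the 4266 MT/s top LPDDR4X speed grade, so the quoted totals look self-consistent.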

11:48AM EDT - Meshes run over the cores

11:48AM EDT - SRAM banks could be partitioned as private L2 or shared L3

11:48AM EDT - mesh interconnect on each shire

11:47AM EDT - with 4 MB of shared SRAM

11:47AM EDT - 4 neighborhoods makes a shire

11:47AM EDT - custom instructions

11:47AM EDT - cooperative loads

11:46AM EDT - far more efficient than having each core with its own I-cache

11:46AM EDT - 8 minions share a single large instruction cache

11:46AM EDT - before wire length became a problem

11:46AM EDT - 8 cores on a chip form a neighborhood
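
The core hierarchy described in these entries can be checked against the 1088-core total. This sketch assumes every core sits in an identical shire, which the notes do not state outright.

```python
CORES_PER_NEIGHBORHOOD = 8     # "8 cores on a chip form a neighborhood"
NEIGHBORHOODS_PER_SHIRE = 4    # "4 neighborhoods makes a shire"
SHIRE_SRAM_MB = 4              # "with 4 MB of shared SRAM"
TOTAL_CORES = 1088             # "1088 RISC-V cores"

cores_per_shire = CORES_PER_NEIGHBORHOOD * NEIGHBORHOODS_PER_SHIRE

# Assumption: all cores live in identical shires.
shires, leftover = divmod(TOTAL_CORES, cores_per_shire)
shared_sram_mb = shires * SHIRE_SRAM_MB

print(cores_per_shire, "cores per shire")
print(shires, "shires, leftover cores:", leftover)
print(shared_sram_mb, "MB of shared shire SRAM")
```

The count divides evenly into 34 shires. The 136 MB of shared shire SRAM falls short of the 160 million bytes quoted for the chip, so the remainder presumably sits in other structures.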

11:45AM EDT - 512-bit wide integer per cycle, 256-bit wide FP per cycle, per core

11:45AM EDT - 64k ops

11:45AM EDT - can do 64 ops on one tensor instruction

11:45AM EDT - 300 MHz to 2 GHz
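
The datapath widths translate into lane counts, and from there into a rough peak-throughput check against the 200 Tera-Ops figure. The sketch assumes 8-bit integer elements, one multiply-accumulate per lane per cycle counted as two operations, and all 1088 cores busy. These assumptions are not taken from the talk.

```python
INT_DATAPATH_BITS = 512    # "512-bit wide integer per cycle"
FP_DATAPATH_BITS = 256     # "256-bit wide FP per cycle"
CORES = 1088               # "1088 RISC-V cores"
PEAK_OPS = 200e12          # "Up to 200 Tera-Ops"
OPS_PER_MAC = 2            # assumption: multiply plus add

int8_lanes = INT_DATAPATH_BITS // 8
fp16_lanes = FP_DATAPATH_BITS // 16
fp32_lanes = FP_DATAPATH_BITS // 32

# Assumption: one INT8 MAC per lane per cycle on every core.
chip_ops_per_cycle = CORES * int8_lanes * OPS_PER_MAC
implied_clock_hz = PEAK_OPS / chip_ops_per_cycle

print(int8_lanes, fp16_lanes, fp32_lanes)
print(f"implied clock: {implied_clock_hz / 1e9:.2f} GHz")  # about 1.44 GHz
```

The implied clock lands inside the quoted 300 MHz to 2 GHz range, so the peak figure is plausible under these assumptions.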

11:44AM EDT - SMT2

11:44AM EDT - in order pipeline

11:44AM EDT - 64-bit RISC-V processor, software-configurable L1 data cache

11:43AM EDT - Best efficiency point is at 8.5 W - 2.5x better perf than at 0.9 volts

11:42AM EDT - 0.75 volts is 164W per chip

11:42AM EDT - One chip could use 275W at peak

11:42AM EDT - Inferences per second per watt

11:41AM EDT - Efficiency vs voltage - 0.34 V is best

11:40AM EDT - C dynamic is hard

11:40AM EDT - drive down voltage per core
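
The reason for driving voltage down can be illustrated with the textbook dynamic power relation P = C * V^2 * f. The sketch assumes the 275 W peak figure belongs to the 0.9 V point, that power is purely dynamic with no leakage, and that effective capacitance is constant. None of that is confirmed by the talk, so treat the derived frequency ratio as illustrative only.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Textbook dynamic power model: P = C * V^2 * f (leakage ignored)."""
    return c_eff * voltage ** 2 * freq_hz


# Figures from the notes. Pairing 275 W with 0.9 V is an assumption.
P_HIGH_W, V_HIGH = 275.0, 0.90
P_LOW_W, V_LOW = 164.0, 0.75

power_ratio = P_LOW_W / P_HIGH_W
voltage_sq_ratio = (V_LOW / V_HIGH) ** 2

# Under the model, whatever V^2 does not explain must come from frequency.
implied_freq_ratio = power_ratio / voltage_sq_ratio

print(f"power ratio:        {power_ratio:.3f}")
print(f"V^2 ratio:          {voltage_sq_ratio:.3f}")
print(f"implied freq ratio: {implied_freq_ratio:.3f}")
```

Because voltage enters squared, and lower voltage usually forces a lower clock as well, power falls much faster than performance, which is the trade that spreading work across many low-voltage cores exploits.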

11:40AM EDT - TSMC 7nm FinFET

11:38AM EDT - Highest recommendation performance inside 120W in six chips

11:38AM EDT - allows for lower voltage, increasing efficiency

11:38AM EDT - Esperanto splits it across chips

11:38AM EDT - Large chips have large power

11:38AM EDT - 1000s of RISC-V cores in Esperanto

11:38AM EDT - limited parallelism with single big chips

11:37AM EDT - thousands of threads

11:36AM EDT - Fixed function hardware can quickly become obsolete

11:35AM EDT - reduce off-die memory references

11:34AM EDT - be programmable

11:34AM EDT - dense and sparse workloads

11:34AM EDT - Multiple data type support

11:34AM EDT - Low power budget per card

11:34AM EDT - these servers need add-in cards

11:34AM EDT - traditionally run on x86

11:33AM EDT - focus on recommendation models

11:33AM EDT - Under 20 watts for inference

11:33AM EDT - Up to 200 Tera-Ops

11:33AM EDT - PCIe x8 Gen 4

11:33AM EDT - 160 million bytes of SRAM onboard

11:32AM EDT - ET-Minion with tensor units

11:32AM EDT - 1088 RISC-V cores

11:31AM EDT - AI Accelerator - 1000 RISC-V cores on a chip

11:30AM EDT - First up is a talk from Esperanto Technologies

11:25AM EDT - Starting here in about 5 minutes

11:08AM EDT - Event starts at 8:30am PT, so in about 22 minutes

