07:05PM EDT - A slightly different talk from Hot Chips this time: here's a neural network processor aimed at IoT devices based in 16nm. The slides look interesting, so we're covering this talk too.

07:05PM EDT - Designing NN algorithms that require less multiplications and less storage

07:05PM EDT - For IoT

07:05PM EDT - MobileNet for reducing workloads

07:06PM EDT - Optimize the hardware to reduce compute

07:06PM EDT - Efficiency vs flexibility

07:06PM EDT - Accelerators optimized for a specific workload might not be flexible

07:07PM EDT - In embedded, still care about die area

07:07PM EDT - DNNs have risk of premature obsolescence

07:07PM EDT - Few DNNs have converged

07:07PM EDT - No silver bullet

07:07PM EDT - IoT spend a lot of time in low power

07:08PM EDT - Heavily optimized hardware is preferential in low power scenarios, but lose flexibility

07:08PM EDT - Don't just care about performance, it's efficiency with minimum perf

07:08PM EDT - This led to SMIV

07:09PM EDT - Built at Harvard, academic chip

07:09PM EDT - 25mm2, 500m transistors, Cortex A-class CPUs

07:09PM EDT - 672-pin BGA package

07:09PM EDT - Near Threshold always on cluster

07:10PM EDT - Rest of system will spend most of the time off

07:10PM EDT - Dual ARM Cortex A53

07:11PM EDT - embedded FPGA core from FlexLogic

07:11PM EDT - Standalone accelerator for CPU

07:11PM EDT - 4MB of SRAM for Neural Network workloads

07:11PM EDT - off-chip link to FPGA

07:12PM EDT - Cortex M0 always on core

07:12PM EDT - Using regularly

07:12PM EDT - Also includes DNN engine classifier

07:12PM EDT - AON cluster has 1 MB of SRAM. Doesn't use off-chip memory access

07:12PM EDT - 150nJ per inference for small tasks

07:13PM EDT - A53 has ACP - Accelerator Coherency Port

07:13PM EDT - efficient way to interface accelerators

07:13PM EDT - Low hardware cost for the accelerator, advantages with coherent memory access

07:14PM EDT - Gives the accelerators access to L2 cache

07:14PM EDT - Fine grained datapath acceleration

07:14PM EDT - Cache Coherent Datapath Acceleators

07:15PM EDT - Embedded FPGA in 2x2 array

07:15PM EDT - Standard cell approach, IP is scalable and flexible

07:15PM EDT - Logic tile is 2500 LUT6, DSP tile has 40x 22b MACs

07:16PM EDT - 2 Logic tiles + 2 DSP tiles in 2x2 array

07:16PM EDT - Use embedded FPGA as standalone accelerator for efficiency vs flexibility

07:16PM EDT - Chips already back from TSMC

07:16PM EDT - All pre-silicon tests passed.

07:17PM EDT - Using for research

07:17PM EDT - eFPGA as first class citizen, incremental wakeup from AON sub-system

07:18PM EDT - AON has best Energy / Area efficiency - but heavily impacted by SRAM

07:18PM EDT - Perf is obviously different

07:18PM EDT - eFPGA is 4.5x efficiency of CPUs

07:18PM EDT - Graph is in GOPS/Watt

07:19PM EDT - eFPGA has area overhead for reconfigurability

07:20PM EDT - Made under DARPA program

07:20PM EDT - No custom layout - standard cells and SRAMs

07:20PM EDT - High-level synthesis using NVIDIA methodology

07:20PM EDT - Scripted methodologies for generating memory mapped registers, IO pad-ring, clock domains

07:21PM EDT - Makes validation on completed silicon easier

07:23PM EDT - Arm Designstart helped

07:23PM EDT - Used academic licences

07:24PM EDT - Driving new IoT use cases with DNN in IoT SoCs in 16nm

07:25PM EDT - Q: What acceleration can you do on eFPGA?

07:26PM EDT - A: MatMul for Convolutional Neural Networks. Smart DMA engines.

07:27PM EDT - Q: Why is the efficiency of the dedicated accelerator less than the Always On?

07:27PM EDT - A: Our dedicated accelerator was mainly focused on the connectivity, but can be scaled

07:28PM EDT - That's a wrap. Next talk in 30 minutes on Xavier

POST A COMMENT

4 Comments

View All Comments

Log in

Don't have an account? Sign up now