Sparsity in Deep Learning: Pruning + growth for efficient inference and training in neural networks
Torsten Hoefler presents an overview of sparsity in deep learning. Use the chapter markers below to jump to the individual parts of the talk.
arXiv: https://arxiv.org/abs/2102.00554
Chapters:
0:00 Introduction to deep learning
7:29 Introduction to hardware scaling and locality
11:25 Overview of model compression and optimization
12:47 Introduction to sparsification
18:17 Overparameterization, SGD dynamics, Occam's hill, and generalization
26:00 Sparse storage formats and representational overheads
31:10 Overview of sparsification techniques - model and ephemeral sparsity
35:13 Sparsification schedules - when to sparsify
41:36 Fully sparse training
47:53 Retraining example
50:35 How to sparsify - picking elements for removal
54:50 Data-free pruning - magnitude
56:52 Data-driven pruning - sensitivity, activity, and correlation
59:42 Training-aware pruning - Taylor expansions of the loss and regularization
1:09:19 Learnable gating functions (approximations)
1:12:49 Structured sparsification
1:15:34 Variational removal methods
1:18:08 Parameter budgets between layers and literature statistics
1:22:10 Re-growing elements in fully-sparse training
1:24:53 Ephemeral sparsity - activations, gradients, dynamic networks
1:33:07 Putting everything together - case studies with CNNs
1:36:39 Parameter efficiency and slack
1:41:03 Compute efficiency and sparse transformers
1:43:41 Acceleration for sparse deep learning
1:50:04 Lottery tickets and sparse subnetworks
1:53:37 Best practices for sparse deep learning
1:56:06 Open research questions and summary
Abstract: The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
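To make the idea concrete, here is a minimal sketch of data-free magnitude pruning (the technique discussed around 54:50), assuming NumPy; the function name and the strict-threshold mask are illustrative choices, not the survey's reference implementation:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights.

    `sparsity` is the fraction of elements to remove, e.g. 0.9
    keeps only the largest 10% of weights by absolute value.
    """
    k = int(weights.size * sparsity)  # number of weights to drop
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep strictly larger values
    return weights * mask

# Example: prune a random weight matrix to ~90% sparsity.
w = np.random.randn(64, 64)
w_sparse = magnitude_prune(w, sparsity=0.9)
print(np.mean(w_sparse == 0))  # fraction of zeroed weights, close to 0.9
```

As the talk notes, magnitude pruning needs no data at all; more accurate criteria (sensitivity, Taylor expansions of the loss) trade this simplicity for data- or training-aware importance scores.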