Learn Before
Concept

Model Compression and Acceleration Method Categories

  • Parameter pruning and sharing: Remove parameters that are not critical to a deep neural network's performance. Sub-categories include model quantization, model binarization, structural matrices, and parameter sharing.
  • Low-rank factorization: Use matrix and tensor decomposition to identify and remove redundant parameters.
  • Transferred compact convolutional filters: Transfer or compress convolutional filters to remove unnecessary parameters.
  • Knowledge Distillation (KD): Distill knowledge from a large deep neural network (the "teacher") into a smaller network (the "student").
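Two of the categories above can be sketched on a single weight matrix with NumPy. This is a minimal illustration, not a full compression pipeline: magnitude-based thresholding stands in for parameter pruning, and a truncated SVD stands in for low-rank factorization. The matrix size, sparsity level, and rank are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # a hypothetical dense weight matrix

# Parameter pruning: zero out weights whose magnitude is below a threshold.
threshold = np.quantile(np.abs(W), 0.9)  # keep only the largest 10% of weights
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)
sparsity = (W_pruned == 0).mean()  # fraction of parameters removed, ~0.9

# Low-rank factorization: approximate W with a rank-r product of two factors,
# storing 2 * 64 * r values instead of 64 * 64.
r = 8
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_lowrank = (U[:, :r] * s[:r]) @ Vt[:r]
```

In a real network the pruned or factorized matrices replace the original layers and the model is usually fine-tuned afterward to recover accuracy.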


Updated 2022-10-21

Tags

Deep Learning (in Machine learning)

Data Science