Ways to compress pre-trained models (PTMs)

  • Model Pruning: Pruning removes parts of a neural network (e.g., individual weights, neurons, layers, channels, or attention heads), thereby reducing model size and speeding up inference (see the magnitude-pruning sketch after this list).

  • Quantization: Quantization compresses parameters stored at higher numerical precision (e.g., 32-bit floats) into lower-precision representations (e.g., 8-bit integers), as in the sketch below.

  • Knowledge Distillation: A compression technique in which a small model, the student, is trained to reproduce the behavior of a large model, the teacher (a loss sketch follows the list).

  • Module Replacing: It reduces model size by replacing large modules of the original PTM with more compact substitutes (a structural sketch follows the list).

  • Early Exit: It allows the model to exit at an intermediate off-ramp instead of passing through the entire model; the number of layers executed is conditioned on the input (see the sketch below).

  • Parameter Sharing: Reduces model size by reusing the same set of parameters across multiple layers, which also saves memory during inference (see the final sketch below).
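
A minimal sketch of unstructured magnitude pruning, assuming PyTorch; the layer size, the sparsity level, and the helper name magnitude_prune are illustrative, not taken from the source.

    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)

    def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
        k = int(weight.numel() * sparsity)
        threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
        mask = (weight.abs() > threshold).float()
        return weight * mask

    with torch.no_grad():
        layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))

    print(f"zeroed fraction: {(layer.weight == 0).float().mean():.2f}")  # roughly 0.50

Structured pruning would instead drop whole neurons, channels, or attention heads, which translates more directly into speedups on standard hardware.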
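
A minimal sketch of symmetric 8-bit quantization of a single tensor, assuming PyTorch; practical schemes (per-channel scales, zero points, quantization-aware training) are more involved.

    import torch

    def quantize_int8(w: torch.Tensor):
        scale = w.abs().max() / 127.0                      # map the largest |w| to the int8 range
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    w = torch.randn(4, 4)
    q, scale = quantize_int8(w)
    print((w - dequantize(q, scale)).abs().max())          # small round-off error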
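
A minimal sketch of the standard distillation loss, assuming PyTorch; the batch size, class count, temperature T, and mixing weight alpha are illustrative.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's temperature-softened distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student_logits = torch.randn(8, 10, requires_grad=True)  # from the small student
    teacher_logits = torch.randn(8, 10)                      # from the frozen large teacher
    labels = torch.randint(0, 10, (8,))
    distillation_loss(student_logits, teacher_logits, labels).backward()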
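
A minimal structural sketch of module replacing, assuming PyTorch; the toy blocks stand in for transformer layers, and in practice the compact substitutes are trained to mimic the modules they replace (as in BERT-of-Theseus).

    import torch.nn as nn

    class BigBlock(nn.Module):                 # stands in for an original, heavy module
        def __init__(self, d=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        def forward(self, x):
            return x + self.ff(x)

    class SmallBlock(nn.Module):               # compact substitute with a narrower hidden size
        def __init__(self, d=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        def forward(self, x):
            return x + self.ff(x)

    original = nn.Sequential(*[BigBlock() for _ in range(6)])
    compact = nn.Sequential(*[SmallBlock() for _ in range(3)])  # two big blocks -> one small block
    print(sum(p.numel() for p in original.parameters()),
          sum(p.numel() for p in compact.parameters()))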
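
A minimal sketch of input-conditioned early exit, assuming PyTorch and one example per call; the off-ramp design and the confidence threshold are illustrative.

    import torch
    import torch.nn as nn

    class EarlyExitModel(nn.Module):
        def __init__(self, d=256, num_classes=10, num_layers=6):
            super().__init__()
            self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(num_layers))
            self.ramps = nn.ModuleList(nn.Linear(d, num_classes) for _ in range(num_layers))

        def forward(self, x, threshold=0.9):
            for layer, ramp in zip(self.layers, self.ramps):
                x = torch.relu(layer(x))
                probs = torch.softmax(ramp(x), dim=-1)
                if probs.max() >= threshold:   # confident enough: take this off-ramp
                    return probs
            return probs                       # otherwise fall through to the last off-ramp

    model = EarlyExitModel()
    out = model(torch.randn(1, 256))           # easy inputs exit after fewer layers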
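
A minimal sketch of cross-layer parameter sharing in the style of ALBERT, assuming PyTorch; one block's weights are reused at every depth, so adding layers adds no parameters.

    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        def __init__(self, d=256, num_layers=12):
            super().__init__()
            self.shared = nn.Linear(d, d)      # a single parameter set shared by all layers
            self.num_layers = num_layers

        def forward(self, x):
            for _ in range(self.num_layers):
                x = torch.relu(self.shared(x))  # same weights applied at every depth
            return x

    enc = SharedLayerEncoder()
    print(sum(p.numel() for p in enc.parameters()))  # parameters of one layer, not twelve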
