Ways to compress pre-trained models (PTMs)

  • Model Pruning: Pruning removes parts of a neural network (e.g., individual weights, neurons, layers, channels, or attention heads), thereby reducing model size and speeding up inference (see the magnitude-pruning sketch after this list).

  • Quantization: Quantization compresses parameters stored at higher numerical precision (e.g., 32-bit floats) into lower-precision representations (e.g., 8-bit integers), as in the sketch below.

  • Knowledge Distillation: A compression technique in which a small model, the student, is trained to reproduce the behavior of a large model, the teacher (a loss sketch follows the list).

  • Module Replacing: It reduces model size by replacing large modules of the original PTM with more compact substitutes (a structural sketch follows the list).

  • Early Exit: It allows the model to exit at an intermediate off-ramp instead of passing through the entire model; the number of layers executed is conditioned on the input (see the sketch below).

  • Parameter Sharing: Reduces model size by reusing the same set of parameters across multiple layers, which also saves memory during inference (see the final sketch below).
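
A minimal sketch of unstructured magnitude pruning, assuming PyTorch; the layer size, the sparsity level, and the helper name magnitude_prune are illustrative, not taken from the source.

    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)

    def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
        k = int(weight.numel() * sparsity)
        threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
        mask = (weight.abs() > threshold).float()
        return weight * mask

    with torch.no_grad():
        layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))

    print(f"zeroed fraction: {(layer.weight == 0).float().mean():.2f}")  # roughly 0.50

Structured pruning would instead drop whole neurons, channels, or attention heads, which translates more directly into speedups on standard hardware.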
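
A minimal sketch of symmetric 8-bit quantization of a single tensor, assuming PyTorch; practical schemes (per-channel scales, zero points, quantization-aware training) are more involved.

    import torch

    def quantize_int8(w: torch.Tensor):
        scale = w.abs().max() / 127.0                      # map the largest |w| to the int8 range
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    w = torch.randn(4, 4)
    q, scale = quantize_int8(w)
    print((w - dequantize(q, scale)).abs().max())          # small round-off error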
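
A minimal sketch of the standard distillation loss, assuming PyTorch; the batch size, class count, temperature T, and mixing weight alpha are illustrative.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's temperature-softened distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student_logits = torch.randn(8, 10, requires_grad=True)  # from the small student
    teacher_logits = torch.randn(8, 10)                      # from the frozen large teacher
    labels = torch.randint(0, 10, (8,))
    distillation_loss(student_logits, teacher_logits, labels).backward()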
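
A minimal structural sketch of module replacing, assuming PyTorch; the toy blocks stand in for transformer layers, and in practice the compact substitutes are trained to mimic the modules they replace (as in BERT-of-Theseus).

    import torch.nn as nn

    class BigBlock(nn.Module):                 # stands in for an original, heavy module
        def __init__(self, d=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        def forward(self, x):
            return x + self.ff(x)

    class SmallBlock(nn.Module):               # compact substitute with a narrower hidden size
        def __init__(self, d=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        def forward(self, x):
            return x + self.ff(x)

    original = nn.Sequential(*[BigBlock() for _ in range(6)])
    compact = nn.Sequential(*[SmallBlock() for _ in range(3)])  # two big blocks -> one small block
    print(sum(p.numel() for p in original.parameters()),
          sum(p.numel() for p in compact.parameters()))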
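
A minimal sketch of input-conditioned early exit, assuming PyTorch and one example per call; the off-ramp design and the confidence threshold are illustrative.

    import torch
    import torch.nn as nn

    class EarlyExitModel(nn.Module):
        def __init__(self, d=256, num_classes=10, num_layers=6):
            super().__init__()
            self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(num_layers))
            self.ramps = nn.ModuleList(nn.Linear(d, num_classes) for _ in range(num_layers))

        def forward(self, x, threshold=0.9):
            for layer, ramp in zip(self.layers, self.ramps):
                x = torch.relu(layer(x))
                probs = torch.softmax(ramp(x), dim=-1)
                if probs.max() >= threshold:   # confident enough: take this off-ramp
                    return probs
            return probs                       # otherwise fall through to the last off-ramp

    model = EarlyExitModel()
    out = model(torch.randn(1, 256))           # easy inputs exit after fewer layers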
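
A minimal sketch of cross-layer parameter sharing in the style of ALBERT, assuming PyTorch; one block's weights are reused at every depth, so adding layers adds no parameters.

    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        def __init__(self, d=256, num_layers=12):
            super().__init__()
            self.shared = nn.Linear(d, d)      # a single parameter set shared by all layers
            self.num_layers = num_layers

        def forward(self, x):
            for _ in range(self.num_layers):
                x = torch.relu(self.shared(x))  # same weights applied at every depth
            return x

    enc = SharedLayerEncoder()
    print(sum(p.numel() for p in enc.parameters()))  # parameters of one layer, not twelve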
