Conventional Model Compression for BERT

Standard model compression techniques can be used effectively to create more compact versions of BERT. Two primary methods are pruning and quantization. Pruning removes components from the Transformer encoder, such as entire layers, a fraction of the model's parameters, or specific attention heads; notably, pruning attention heads can speed up inference with minimal loss in accuracy. Quantization reduces model size by converting the model's parameters to low-precision numerical formats. Although quantization is a general technique not exclusive to BERT, it is particularly well suited to large Transformer-based models.
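To make this concrete, below is a minimal sketch of both techniques, assuming PyTorch and the Hugging Face Transformers library with the `bert-base-uncased` checkpoint (the text itself does not prescribe a toolkit). The specific heads pruned are illustrative; in practice, heads would first be ranked by an importance measure.

```python
import torch
from transformers import BertModel

# Load a pre-trained BERT encoder (assumes the Hugging Face
# Transformers library and the "bert-base-uncased" checkpoint).
model = BertModel.from_pretrained("bert-base-uncased")

# --- Pruning: remove specific attention heads ---
# prune_heads takes {layer_index: [head indices]}. Here we drop
# heads 0 and 2 from layer 0 and head 5 from layer 11; these
# indices are arbitrary placeholders for illustration.
model.prune_heads({0: [0, 2], 11: [5]})

# --- Quantization: convert parameters to low precision ---
# Dynamic quantization replaces the fp32 weights of Linear layers
# with int8, shrinking the model and speeding up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The two steps compose: head pruning shrinks the attention computation itself, while dynamic quantization compresses whatever weights remain, so they are often applied together.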
