Learn Before
Pruning for BERT Compression
Pruning is a technique for compressing BERT by strategically removing parts of its Transformer network. It can be applied at several granularities: dropping entire Transformer layers, zeroing out a fraction of individual weights (typically those with the smallest magnitudes), or removing specific attention heads. Because the pruned model has less computation to perform, inference speeds up significantly, often with only a minor drop in accuracy.
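Below is a minimal sketch of two of these granularities, structured head pruning and unstructured weight pruning, using PyTorch's torch.nn.utils.prune utilities and the Hugging Face transformers library. The checkpoint name, the choice of heads, and the 30% sparsity level are illustrative assumptions, not recommended settings.

import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Structured pruning: drop whole attention heads.
# `prune_heads` maps layer index -> head indices; the picks here are arbitrary.
model.prune_heads({0: [0, 1], 11: [2]})

# Unstructured pruning: zero the 30% smallest-magnitude weights in every
# linear layer, then make the zeroing permanent by removing the mask.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Measure the resulting overall sparsity across all parameters.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Fraction of zero parameters: {zeros / total:.1%}")

Note the practical difference between the two styles: removing heads (or layers) shrinks the actual matrix multiplications and so speeds up inference on standard hardware, while unstructured zeroing mainly reduces the effective parameter count and only pays off with sparse-aware kernels or compressed storage.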
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Quantization for BERT Compression
A development team is working to optimize a large, pre-trained language model for a real-time translation application. The model's current inference speed is too slow. They are considering two strategies: (1) removing a specific number of attention heads from each layer, or (2) representing all model parameters with lower-precision numbers. Which statement best distinguishes the primary impact of these two compression techniques in this context? A sketch contrasting the two techniques follows the Related list below.
BERT Compression Strategy for Mobile Deployment
An engineering team needs to make a large language model more efficient for deployment. They are considering two distinct compression methods. Match each method with its corresponding description.
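For contrast with the pruning sketch above, and as background for the questions comparing the two methods, here is a minimal sketch of post-training dynamic quantization in PyTorch: the network keeps every layer and head, but Linear weights are stored as int8 and dequantized on the fly, cutting memory and often CPU latency. The checkpoint name is an illustrative assumption.

import io
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Dynamic quantization: swap Linear layers for int8-weight versions.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    # Approximate serialized size in megabytes.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.0f} MB, int8: {size_mb(quantized):.0f} MB")

Unlike head pruning, this leaves the shape of the computation unchanged, so accuracy typically drops very little; the trade-off is that the speedup depends on the hardware's support for low-precision arithmetic.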
Learn After
Evaluating Model Compression Strategies
A development team needs to accelerate the inference speed of a large, pre-trained language model for a task requiring a deep understanding of long-range dependencies and complex sentence structures. Which of the following strategies for reducing the model's size is most likely to severely degrade performance on this specific task?
A machine learning team is exploring different methods to reduce the size and inference time of a large language model based on the Transformer architecture. Match each pruning technique with its most likely description or primary impact.