Learn Before
Ways to compress PTMs
- Model Pruning: Model pruning refers to removing parts of a neural network (e.g., weights, neurons, layers, channels, or attention heads), thereby reducing the model size and speeding up inference (sketched below).
- Quantization: It refers to compressing higher-precision parameters into a lower-precision representation (sketched below).
- Knowledge Distillation: This is a compression technique in which a small model, called the student model, is trained to reproduce the behavior of a large model, called the teacher model (sketched below).
- Module Replacing: It reduces the model size by replacing the large modules of the original PTM with more compact substitutes.
- Early Exit: It allows the model to exit early at an off-ramp instead of passing through the entire model; the number of layers executed is conditioned on the input.
- Parameter Sharing: It reduces the model size by using the same set of parameters across multiple layers, which also saves memory during inference (sketched below).
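As a rough illustration of model pruning, here is a minimal sketch of unstructured magnitude pruning in Python: the smallest-magnitude weights are zeroed out, shrinking the effective model. The function name `magnitude_prune` and the 50% sparsity level are illustrative assumptions, not taken from the course material.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the smallest-magnitude weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight.
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.random.randn(512, 512).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)   # zero out roughly half the weights
print((w_pruned == 0).mean())                 # fraction of pruned weights, ~0.5
```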
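Quantization can be sketched in a few lines: 32-bit floating-point weights are mapped to 8-bit integers with a single symmetric scale factor, cutting the memory footprint to a quarter at the cost of a small reconstruction error. The function names `quantize_int8` and `dequantize` are hypothetical, made up for this example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric uniform quantization of FP32 weights to INT8."""
    # The scale maps the largest absolute weight onto the INT8 range [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(w.nbytes, q.nbytes)         # 262144 bytes vs. 65536 bytes (4x smaller)
print(np.abs(w - w_hat).max())    # small quantization error
```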
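For knowledge distillation, a common recipe is to train the student on a blend of a soft loss (matching the teacher's temperature-softened output distribution) and a hard loss (matching the ground-truth labels). The sketch below is a minimal PyTorch version; the function name `distillation_loss` and the default temperature and mixing weight are illustrative choices, not prescribed by the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft loss (match the teacher) with a hard loss (match the labels)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits for a batch of 4 examples and 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```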
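Parameter sharing can be illustrated with an ALBERT-style encoder in which one Transformer layer's parameters are stored once and applied at every depth, so the parameter count stays that of a single layer. The class name `SharedLayerEncoder` and the layer sizes are assumptions for the sketch, not part of the original material.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style parameter sharing: one Transformer layer reused at every depth."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, depth: int = 6):
        super().__init__()
        # A single layer's parameters are stored once...
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.depth = depth

    def forward(self, x):
        # ...and applied repeatedly, so the effective depth is `depth`
        # while the parameter count stays that of one layer.
        for _ in range(self.depth):
            x = self.layer(x)
        return x

model = SharedLayerEncoder()
x = torch.randn(2, 16, 256)                        # (batch, sequence, features)
print(model(x).shape)                              # torch.Size([2, 16, 256])
print(sum(p.numel() for p in model.parameters()))  # parameters of one layer only
```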
Tags
Data Science
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Ways to compress PTMs
A development team has created a large, high-performance language model for a new smartphone application that provides real-time text summarization. During user testing, they observe that while the summaries are highly accurate, the application is slow to respond and causes the phone's battery to drain rapidly. Which of the following strategies would be the most appropriate first step to address these specific performance issues on the device?
Deployment Strategy for a New AI Assistant
Deployment Challenges of Large Models
For any real-world application, applying compression techniques to a large pre-trained model is the optimal deployment strategy because it reduces model size and improves computation efficiency without compromising the model's performance.
Learn After
A research team is tasked with deploying a large language model on edge devices with limited memory and processing power. Their primary goal is to reduce the model's memory footprint. They achieve this by converting the model's 32-bit floating-point weights and activations into 8-bit integers. While this significantly reduces the model's size, they observe a minor drop in performance. Which model compression technique does this scenario describe?
Selecting a Model Compression Strategy
Match each model compression technique with its corresponding description.