Analyzing Memory Constraints in Different LLM Applications
A company is deploying two large language models on hardware with a fixed memory capacity. Model A serves a real-time chatbot application that typically handles short user queries; Model B analyzes lengthy legal documents. Assuming both models use a standard Transformer architecture for generating responses, analyze the primary memory-related challenge that will disproportionately affect Model B compared to Model A during inference. In your analysis, explain how the mechanism for storing attention states contributes to this challenge, and discuss the resulting trade-off between the context length the model can handle and the number of users it can serve concurrently.
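For intuition, the sketch below estimates the per-request footprint of cached attention states for each model under a standard decoder-only Transformer. All hyperparameters (layer count, head count, head dimension, the two context lengths, and the memory budget) are illustrative assumptions, not values given in the question.

```python
# Minimal sketch with illustrative hyperparameters (roughly a 7B-class
# model); none of these values come from the question itself.

def kv_cache_bytes(context_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Per-request KV-cache size for a decoder-only Transformer.

    The factor of 2 counts both the key and the value tensors; one K/V
    pair is cached per layer for every token in the context, so the
    cache grows linearly with context length.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

GIB = 1024 ** 3
memory_budget = 40 * GIB  # hypothetical memory left over for KV caches

for name, ctx_len in [("Model A (chatbot, short queries)", 512),
                      ("Model B (legal documents)", 32_000)]:
    per_request = kv_cache_bytes(ctx_len)
    max_concurrent = memory_budget // per_request
    print(f"{name}: {per_request / GIB:.2f} GiB per request; "
          f"~{max_concurrent} concurrent requests fit in the budget")
```

Under these assumptions the cache grows linearly with context length, so Model B consumes roughly 60x more memory per request than Model A (about 15.6 GiB versus 0.25 GiB) and can serve correspondingly fewer concurrent users within the same budget.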
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Architectural Adaptation of LLMs for Long Sequences
Linear Attention
Classification of Memory Models in LLMs
Memory Models in LLMs as Context Encoders
PagedAttention for KV Cache Memory Optimization
Strategies for Mitigating KV Cache Memory Usage
A machine learning engineer is deploying a large language model and finds that the system frequently runs out of memory during inference. They are investigating two specific high-load scenarios, both of which involve processing a total of 16,000 tokens:
- Scenario X: Processing a batch of 32 user requests simultaneously, where each request has a context length of 500 tokens.
- Scenario Y: Processing a single user request that involves summarizing a very long document with a context length of 16,000 tokens.
Based on how attention states (keys and values) are managed during inference, which statement best analyzes the memory consumption issue? (A quick footprint comparison for these two scenarios appears after this list.)
Architectural Shift in LLMs due to Long-Sequence Limitations
Diagnosing Inference Failures with Long Documents
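The two scenarios in the related question above can be compared with the same back-of-the-envelope accounting. A minimal sketch, reusing the illustrative 7B-class hyperparameters from the earlier example (assumptions, not values from the question):

```python
# Both scenarios cache the same 16,000 tokens in total, so under a
# simple per-token accounting their aggregate KV footprints are equal.

def kv_cache_bytes(context_len: int, batch_size: int = 1,
                   n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Keys and values (factor 2) for every layer and every cached token.
    return (2 * n_layers * n_kv_heads * head_dim * dtype_bytes
            * context_len * batch_size)

GIB = 1024 ** 3
x = kv_cache_bytes(context_len=500, batch_size=32)    # Scenario X
y = kv_cache_bytes(context_len=16_000, batch_size=1)  # Scenario Y
print(f"Scenario X: {x / GIB:.2f} GiB  |  Scenario Y: {y / GIB:.2f} GiB")
```

Both come to about 7.8 GiB under these assumptions. The practical difference lies elsewhere: Scenario Y concentrates the cache in one very long sequence, which naive preallocated caches handle poorly (a problem that paged schemes such as PagedAttention address), and in naive implementations the attention score matrix itself grows quadratically with sequence length.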