Learn Before
A machine learning team is processing an extremely long input sequence and wants to parallelize the self-attention computation across 4 GPUs using sequence parallelism. For a single attention head, which of the following strategies correctly describes how the Key (K) and Value (V) matrices should be partitioned and distributed?
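The crux of the question is which tensors can be sharded along the sequence axis and which cannot: every query position must attend over all key/value positions, so Q can be split across devices while K and V must be visible in full on each device (replicated, or assembled via all-gather or ring-style passing). Below is a minimal NumPy sketch of this for one head; the shapes, the 4-way split, and all names are illustrative assumptions, not any specific framework's API, and real GPUs are only simulated.

```python
# Minimal single-head sketch of sequence parallelism (simulated, no real GPUs).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 16, 8                     # toy sequence length and head dimension
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

# Reference: full attention computed on a single device.
full_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sequence parallelism: split Q into 4 chunks along the sequence axis,
# one chunk per simulated GPU. Each query still attends to EVERY position,
# so each device needs the complete K and V (replicated here; in practice
# gathered via all-gather or passed around in a ring).
num_gpus = 4
q_chunks = np.split(Q, num_gpus, axis=0)
partial_outs = [softmax(q @ K.T / np.sqrt(d)) @ V for q in q_chunks]

# Concatenating the per-device outputs recovers the full attention output.
assert np.allclose(np.concatenate(partial_outs, axis=0), full_out)
```

Note the asymmetry the sketch makes explicit: the attention-weight rows are independent across query positions, which is why Q shards cleanly, whereas splitting K or V by sequence position would leave each device unable to normalize the softmax over the full sequence without extra communication.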
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Resolving Memory Bottlenecks in Attention Mechanisms
Flawed Parallel Attention Implementation
Computing Attention Weights in Sequence Parallelism
Motivation for Sequence Parallelism