Learn Before
A machine learning team is processing an extremely long input sequence and wants to parallelize the self-attention computation across 4 GPUs using sequence parallelism. For a single attention head, which of the following strategies correctly describes how the Key (K) and Value (V) matrices should be partitioned and distributed?
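The crux of the question is which tensors can be sharded along the sequence axis and which cannot: every query position must attend over all key/value positions, so Q can be split across devices while K and V must be visible in full on each device (replicated, or assembled via all-gather or ring-style passing). Below is a minimal NumPy sketch of this for one head; the shapes, the 4-way split, and all names are illustrative assumptions, not any specific framework's API, and real GPUs are only simulated.

```python
# Minimal single-head sketch of sequence parallelism (simulated, no real GPUs).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 16, 8                     # toy sequence length and head dimension
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

# Reference: full attention computed on a single device.
full_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sequence parallelism: split Q into 4 chunks along the sequence axis,
# one chunk per simulated GPU. Each query still attends to EVERY position,
# so each device needs the complete K and V (replicated here; in practice
# gathered via all-gather or passed around in a ring).
num_gpus = 4
q_chunks = np.split(Q, num_gpus, axis=0)
partial_outs = [softmax(q @ K.T / np.sqrt(d)) @ V for q in q_chunks]

# Concatenating the per-device outputs recovers the full attention output.
assert np.allclose(np.concatenate(partial_outs, axis=0), full_out)
```

Note the asymmetry the sketch makes explicit: the attention-weight rows are independent across query positions, which is why Q shards cleanly, whereas splitting K or V by sequence position would leave each device unable to normalize the softmax over the full sequence without extra communication.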
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Resolving Memory Bottlenecks in Attention Mechanisms
Flawed Parallel Attention Implementation
Computing Attention Weights in Sequence Parallelism
Motivation for Sequence Parallelism