A team is parallelizing a large matrix multiplication across a cluster of GPUs. They successfully decompose the matrix so that sub-problems fit onto each GPU, avoiding out-of-memory errors. However, profiling reveals that within each GPU, the computational cores are frequently idle, leading to poor overall performance. This suggests a bottleneck where the cores are waiting for data to be fetched from memory. Which component of a two-level, tile-based parallelization strategy is most likely misconfigured or inefficiently implemented?
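The symptom described (cores idle while waiting on memory) points at the low-level, tile-based execution inside each GPU rather than the high-level decomposition across GPUs. As a minimal illustrative sketch, and not the team's actual implementation: the blocked multiplication below stages a `tile × tile` block at a time, standing in for the data that would be copied into a GPU's fast on-chip (shared) memory before compute. The function name `tiled_matmul` and the `tile` parameter are hypothetical names chosen for this example.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    # Sketch of the "low level" of a two-level strategy: each (tile x tile)
    # block plays the role of data staged into fast on-chip memory.
    # If `tile` is badly chosen, the compute cores stall on memory fetches:
    # too small, and the arithmetic per fetch is low; too large, and the
    # tile no longer fits in fast memory at all.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Stage the two input tiles (the data-movement step),
                # then accumulate their product into the output tile.
                a = A[i:i+tile, p:p+tile]
                b = B[p:p+tile, j:j+tile]
                C[i:i+tile, j:j+tile] += a @ b
    return C
```

On real hardware the staging step is explicit (e.g. copies into shared memory) and its tile size, layout, and overlap with compute are exactly the knobs that, when misconfigured, leave cores idle even though the high-level decomposition across GPUs is correct.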
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
High-Level Decomposition in Tensor Parallelism
Low-Level Tile-Based Execution in Tensor Parallelism
A team is implementing a large matrix multiplication using a two-level, tile-based approach for parallel processing on multiple hardware units. Match each of the following implementation goals to the level at which it is primarily addressed.
Critique of a Parallelization Strategy