A machine learning team is training a large model with a distributed framework. They upgrade their hardware from 'GPU Architecture X' to 'GPU Architecture Y', which has significantly more raw computational power. To their surprise, the individual pre-decomposed sub-matrix multiplication tasks running on each GPU now execute more slowly. Assuming no issues with networking or cooling, what is the most likely cause of this performance degradation?
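The scenario points at algorithm and hardware co-optimization: a sub-task granularity tuned for Architecture X may be too small to saturate Architecture Y, so fixed per-task costs (kernel launch, scheduling, tile sizes matched to X's memory hierarchy) dominate the per-task time. Below is a minimal sketch, assuming PyTorch and a CUDA-capable GPU, for measuring this effect; the 1024-wide tile is a hypothetical stand-in for the original decomposition, not taken from the question:

# Hypothetical sketch: time one pre-decomposed sub-matrix multiplication
# task to check whether fixed per-task overheads, rather than raw FLOPs,
# dominate on the new hardware.
import torch

def time_subtask(n=1024, iters=100):
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.mm(a, b)                     # warm-up launch
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)                 # one sub-task per kernel launch
    end.record()
    torch.cuda.synchronize()
    ms_per_task = start.elapsed_time(end) / iters
    flops = 2 * n ** 3                 # multiply-adds in an n x n matmul
    print(f"{ms_per_task:.3f} ms/task, "
          f"{flops / (ms_per_task * 1e-3) / 1e12:.2f} effective TFLOP/s")

if __name__ == "__main__":
    time_subtask()

If the measured effective TFLOP/s sits far below the device's peak, the task is too small to fill the GPU, and per-task latency is set by launch and scheduling overhead rather than compute throughput. This is why a chip with more raw power can run each fixed-size sub-task no faster, or even slower, when the decomposition and kernels were tuned for the old architecture.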
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Framework Design for Parallel Computation
Algorithm and Hardware Co-optimization