Essay

Analyzing Scalability Trade-offs in Distributed Training

A team is training a large language model on a cluster with high-speed interconnects, but training throughput does not scale linearly with the number of nodes. Two competing proposals have been put forward to improve performance:

Proposal A: Implement a sophisticated software pipeline that aggressively overlaps the data transfer of the next micro-batch with the computation of the current one.

Proposal B: Redesign the workload distribution algorithm to ensure perfectly even load balancing across all devices, even if it requires more frequent, smaller synchronization steps.

Analyze the potential trade-offs of implementing Proposal A versus Proposal B. In your analysis, discuss the specific performance bottlenecks each proposal is designed to address, and describe a scenario where one proposal would be clearly superior to the other.
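As a starting point for the analysis, the two bottlenecks can be framed with a toy cost model. The sketch below is illustrative only: the function names, the double-buffering assumption, and all timing values are hypothetical and do not come from any real cluster or framework. It assumes Proposal A hides each transfer behind the previous micro-batch's compute, and that Proposal B removes straggler wait time at the price of extra synchronization steps.

```python
def serial_time(compute, transfer):
    """No overlap: every micro-batch pays its transfer, then its compute."""
    return sum(c + t for c, t in zip(compute, transfer))


def overlapped_time(compute, transfer):
    """Assumed double buffering: the transfer of micro-batch i+1 runs
    while micro-batch i computes, so each stage costs the larger of the two."""
    total = transfer[0]  # the first transfer has nothing to hide behind
    for i, c in enumerate(compute):
        next_transfer = transfer[i + 1] if i + 1 < len(transfer) else 0.0
        total += max(c, next_transfer)
    return total


def synchronous_step_time(per_device_compute, sync_cost, num_syncs=1):
    """Synchronous training: every device waits for the slowest one,
    and each synchronization step adds a fixed (hypothetical) cost."""
    return max(per_device_compute) + num_syncs * sync_cost


# Proposal A view: transfer is short relative to compute, so it hides well.
compute = [4.0, 4.0, 4.0]
transfer = [1.0, 1.0, 1.0]
a_before = serial_time(compute, transfer)
a_after = overlapped_time(compute, transfer)

# Proposal B view: same total work (28 units), imbalanced vs. balanced.
b_before = synchronous_step_time([10.0, 6.0, 6.0, 6.0], sync_cost=0.5, num_syncs=1)
b_after = synchronous_step_time([7.0, 7.0, 7.0, 7.0], sync_cost=0.5, num_syncs=4)

print(a_before, a_after, b_before, b_after)
```

Varying `transfer` relative to `compute`, or raising `sync_cost` and `num_syncs`, shows where each proposal stops paying off under this simplified model: overlap cannot help once transfer time exceeds compute time, and balancing cannot help once the added synchronization cost outweighs the straggler delay it removes.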

Updated 2025-10-06

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science