Essay

Strategic Data Selection for LLM Development

A startup is developing a general-purpose AI assistant whose primary goal is to excel at complex, multi-step logical reasoning tasks. Due to budget constraints, it must choose between two training data strategies:

  1. A 10-terabyte dataset composed exclusively of diverse natural language texts (e.g., books, articles, web content).
  2. A 9-terabyte dataset that combines the same types of natural language with a large corpus of well-structured programming code.

Which strategy should the startup choose to best achieve its primary goal? Justify your recommendation by explaining the specific cognitive benefits that one of these data types imparts to a language model's reasoning capabilities.

Updated 2025-10-07

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science