Learn Before
Pre-training Data Strategy for a Command-Following Model
Imagine you are part of a team building a new large language model from scratch. The primary goal is for the model to understand and execute user commands (e.g., 'Summarize this text,' 'Translate this sentence') immediately after its initial pre-training, without a separate fine-tuning stage. Besides a vast collection of general text like books and web pages, what specific category of data must be included in the pre-training dataset to achieve this goal? Explain why this data is essential.
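One way to meet this requirement is to serialize instruction-response pairs into the same plain-text stream as the general corpus, so next-token prediction also teaches the command-to-response pattern. The sketch below is illustrative only: the example documents, the "Instruction:/Response:" markers, and the 9:1 mixing ratio are assumptions, not any particular model's actual recipe.

```python
# Minimal sketch of a mixed pre-training corpus that folds instruction-
# formatted data in with general text. All data, markers, and ratios
# below are illustrative placeholders.

import random

# General text: books, web pages, encyclopedias (placeholders).
general_docs = [
    "The mitochondrion is the powerhouse of the cell.",
    "In 1969, the first humans landed on the Moon.",
]

# Instruction-response pairs: the specific data category the question
# asks about. They are serialized into the same plain-text stream the
# model trains on, so next-token prediction also rewards producing a
# response after a command.
instruction_pairs = [
    ("Summarize this text: The meeting covered budget cuts and hiring.",
     "The meeting was about budget cuts and hiring."),
    ("Translate 'hello' into French.", "Bonjour."),
]

def format_instruction(prompt: str, response: str) -> str:
    """Render one pair as a single training document with simple markers."""
    return f"Instruction: {prompt}\nResponse: {response}"

# Combine and shuffle; the 9:1 ratio is a stand-in for a tuned data mix.
corpus = general_docs * 9 + [format_instruction(p, r) for p, r in instruction_pairs]
random.shuffle(corpus)

for doc in corpus[:3]:
    print(doc.splitlines()[0])
```

Without documents of the second kind, nothing in the training objective rewards treating "Translate ..." as a command rather than as a sentence fragment to continue.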
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Enabling Zero-Shot Learning through Instruction Understanding
Computational Expense of Training LLMs from Scratch
Difficulty in Collecting Labeled Data for Instruction Pre-training
A research lab develops a new large language model by training it on a massive dataset consisting solely of digitized books and encyclopedias. The model becomes exceptionally proficient at generating coherent, factual paragraphs. However, when users give it a direct command, such as "Translate 'hello' into French," the model often responds with a continuation like "is a common English greeting" instead of "Bonjour."
Which of the following best explains the most likely cause of this specific failure?
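This failure can be reproduced with a toy next-word model. The sketch below uses a hand-rolled bigram predictor over made-up encyclopedia-style sentences (not the lab's actual model or data) to show why pure continuation training extends a command instead of executing it.

```python
# Toy reproduction of the failure: a bigram next-word model trained only
# on encyclopedia-style prose. The sentences are made-up stand-ins for
# the lab's corpus; the model is a crude illustration, not an LLM.

from collections import defaultdict

encyclopedia_text = (
    "hello is a common english greeting . "
    "hi is a common english greeting . "
    "bonjour is a common french greeting . "
)

# Count word -> next-word transitions (a crude next-token predictor).
transitions = defaultdict(list)
words = encyclopedia_text.split()
for w, nxt in zip(words, words[1:]):
    transitions[w].append(nxt)

def continue_text(prompt: str, n_words: int = 6) -> str:
    """Greedily extend the prompt with the most frequent next word."""
    out = prompt.split()
    for _ in range(n_words):
        candidates = transitions.get(out[-1])
        if not candidates:
            break
        out.append(max(set(candidates), key=candidates.count))
    return " ".join(out)

# The model treats the command as prose to be continued, not executed:
print(continue_text("translate hello"))
# -> "translate hello is a common english greeting ."
```

Nothing in the training data pairs a command with its execution, so the highest-probability continuation is simply the prose that usually follows those words.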
Pre-training a Specialized Code Assistant