Learn Before
Difficulty in Collecting Labeled Data for Instruction Pre-training
For pre-training with instruction-following data to be effective, a vast quantity of such data is necessary. However, collecting large-scale, high-quality labeled data that covers the full range of potential tasks remains a significant challenge.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Enabling Zero-Shot Learning through Instruction Understanding
Computational Expense of Training LLMs from Scratch
Difficulty in Collecting Labeled Data for Instruction Pre-training
A research lab develops a new large language model by training it on a massive dataset consisting solely of digitized books and encyclopedias. The model becomes exceptionally proficient at generating coherent, factual paragraphs. However, when users give it a direct command, such as "Translate 'hello' into French," the model often responds with a continuation like "is a common English greeting," instead of "Bonjour."
Which of the following best analyzes the most likely reason for this specific failure?
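The failure mode in this scenario can be sketched with a toy mock. No real model is involved; both functions below are hypothetical stand-ins with canned outputs, used only to contrast continuation behaviour with instruction-following behaviour:

```python
# Toy illustration (not a real model): a model trained purely on next-token
# prediction over books tends to *continue* a prompt as text, rather than
# treat it as a command to carry out.

def base_model_complete(prompt: str) -> str:
    # A base model predicts the most statistically likely continuation of
    # the prompt, as if it appeared mid-sentence in an encyclopedia.
    continuations = {
        "Translate 'hello' into French": " is a common English greeting",
    }
    return continuations.get(prompt, " ...")

def instruction_model_respond(prompt: str) -> str:
    # An instruction-tuned model treats the same prompt as a task to perform.
    responses = {
        "Translate 'hello' into French": "Bonjour",
    }
    return responses.get(prompt, "...")

prompt = "Translate 'hello' into French"
print(prompt + base_model_complete(prompt))  # continuation behaviour
print(instruction_model_respond(prompt))     # instruction-following behaviour
```

The mock makes the diagnosis concrete: the base model's training objective rewards plausible continuations of encyclopedia-style text, so "is a common English greeting" is the expected output, not a malfunction.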
Pre-training Data Strategy for a Command-Following Model
Pre-training a Specialized Code Assistant
Learn After
A development team is pre-training a new language model to follow a wide range of instructions. They recognize that manually creating a massive, diverse, and high-quality dataset of human-written instructions and responses is prohibitively expensive and time-consuming. As a solution, they propose using an existing powerful model to synthetically generate millions of training examples. Which statement best evaluates the most significant risk of this strategy?
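The strategy described in this scenario can be sketched as a two-step generation loop. The `query_teacher` function below is a hypothetical stand-in, mocked with canned replies so the sketch is self-contained; a real pipeline would call a deployed LLM instead:

```python
# Hedged sketch, assuming a "teacher" model that can be queried with a prompt:
# bootstrap an instruction dataset by (1) asking the teacher to invent an
# instruction, then (2) asking it to answer that instruction.

def query_teacher(prompt: str) -> str:
    # Placeholder for a real LLM call; returns canned replies for the demo.
    canned = {
        "Write one instruction a user might give an assistant.":
            "Summarize the following article in two sentences.",
        "Respond to this instruction: "
        "Summarize the following article in two sentences.":
            "The article's main claim is X; its key evidence is Y.",
    }
    return canned.get(prompt, "")

def generate_synthetic_example() -> dict:
    # Step 1: the teacher invents an instruction.
    instruction = query_teacher(
        "Write one instruction a user might give an assistant.")
    # Step 2: the teacher answers its own instruction.
    response = query_teacher(f"Respond to this instruction: {instruction}")
    # The risk the question asks about: any errors or biases in the
    # teacher's outputs are replicated into every generated pair at scale.
    return {"instruction": instruction, "response": response}

print(generate_synthetic_example())
```

Because both steps depend entirely on the teacher, the generated dataset can be no more accurate or diverse than the teacher itself, which is the core trade-off the question asks you to evaluate.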
Evaluating a Data Collection Strategy
Evaluating Data Collection Strategies for Instruction Pre-training