Learn Before
Pre-training a Specialized Code Assistant
Based on the company's primary goal, what specific type of data should the company prioritize adding to its existing code dataset during the pre-training phase? Explain why this addition is essential for achieving the desired instruction-following behavior.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Enabling Zero-Shot Learning through Instruction Understanding
Computational Expense of Training LLMs from Scratch
Difficulty in Collecting Labeled Data for Instruction Pre-training
A research lab develops a new large language model by training it on a massive dataset consisting solely of digitized books and encyclopedias. The model becomes exceptionally proficient at generating coherent, factual paragraphs. However, when users give it a direct command, such as "Translate 'hello' into French," the model often responds with a continuation like "is a common English greeting," instead of "Bonjour."
Which of the following best analyzes the most likely reason for this specific failure?
Pre-training Data Strategy for a Command-Following Model
Pre-training a Specialized Code Assistant