Analyzing a Data Preparation Pipeline
A research team is developing a new large-scale language model and has amassed a vast dataset of raw text scraped from the web. Their data preparation process consists of two steps: first, they remove all HTML markup from the documents; second, they tokenize the cleaned text. They then immediately begin training on this tokenized data. Analyze this pipeline and identify its most critical oversight. Explain why this oversight is likely to harm both the training process and the model's final performance.
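For concreteness, the two-step pipeline the question describes can be sketched alongside one way a curation stage (deduplication and quality filtering) might slot in between cleaning and tokenization. This is an illustrative sketch only: the whitespace tokenizer stands in for a real subword tokenizer, and the minimum-word-count heuristic is a made-up example of a quality filter, not any team's actual method.

```python
import re
from hashlib import md5

def strip_html(doc: str) -> str:
    """Step 1 in the described pipeline: crudely remove HTML markup."""
    return re.sub(r"<[^>]+>", " ", doc)

def tokenize(text: str) -> list[str]:
    """Step 2: tokenize. A whitespace split stands in for a real subword tokenizer."""
    return text.lower().split()

# The stage the described pipeline skips: curation between cleaning and tokenizing.
def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text."""
    seen, unique = set(), []
    for d in docs:
        h = md5(d.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

def quality_filter(docs: list[str], min_words: int = 5) -> list[str]:
    """Illustrative heuristic: drop very short, boilerplate-like documents."""
    return [d for d in docs if len(d.split()) >= min_words]

raw = [
    "<p>The quick brown fox jumps over the lazy dog.</p>",
    "<p>The quick brown fox jumps over the lazy dog.</p>",  # exact duplicate
    "<div>Click here</div>",                                # low-quality boilerplate
]
cleaned = [strip_html(d) for d in raw]
curated = quality_filter(deduplicate(cleaned))
tokenized = [tokenize(d) for d in curated]
print(len(tokenized))  # only one clean, unique document survives curation
```

Running the sketch shows why the ordering matters: duplicates and junk documents are cheap to detect on cleaned text but, once tokenized and mixed into training batches, they silently skew the token distribution the model learns from.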
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Data Quality as a Key Issue in LLM Training
Analyzing a Data Preparation Pipeline
A team is preparing a massive text dataset for training a new large language model. Arrange the following key data preparation stages into the most logical and efficient sequence.
A research team is preparing a massive, diverse dataset scraped from the web to train a large language model. They are primarily concerned with two potential issues: training instability and the model learning undesirable social biases from the raw data. Which of the following data preparation strategies would most directly and effectively address both of these concerns?