Case Study

Analyzing a Data Preparation Pipeline

A research team is developing a new large-scale language model. They have amassed a vast dataset of raw text scraped from the web. Their data preparation process consists of two main steps: first, they remove all HTML markup from the documents, and second, they tokenize the cleaned text. They then immediately begin the training process on this tokenized data. Analyze this pipeline and identify the most critical oversight. Explain why this oversight is likely to negatively impact the model's training process and final performance.
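To make the pipeline concrete, here is a minimal sketch of the two steps exactly as described. The function names (`strip_html`, `tokenize`, `prepare`) are hypothetical, and the whitespace tokenizer is an assumed stand-in, since the case study does not say which tokenizer the team uses.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_html(raw_html):
    # Step 1: remove all HTML markup, keeping the text content.
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Collapse the whitespace left behind by removed tags.
    return " ".join(" ".join(extractor.parts).split())


def tokenize(text):
    # Step 2: placeholder whitespace tokenizer (an assumption; the
    # team's actual tokenizer is not specified in the case study).
    return text.split()


def prepare(raw_documents):
    # The pipeline as described: strip markup, tokenize, nothing else.
    return [tokenize(strip_html(doc)) for doc in raw_documents]
```

For example, `prepare(["<p>Hello <b>world</b></p>"])` yields one token list per input document. When analyzing the case, it may help to trace what kinds of raw web documents pass through `prepare` and reach training unchanged apart from markup removal.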


Updated 2025-10-01


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models