Interpreting Cross-Entropy for Data Curation
A data curation team uses a small language model to screen a large text corpus for training data. The model assigns a cross-entropy score to each document. They find two documents with the following scores:
- Document A: Cross-entropy = 1.8
- Document B: Cross-entropy = 9.5
Based on the goal of creating a high-quality, coherent training set, which document is more likely to be included, and why? Explain the relationship between the cross-entropy score and how well a document aligns with the small model's learned patterns.
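The scoring-and-filtering setup in the question can be sketched with a toy unigram "small model." Everything below is illustrative: the `unigram_model` and `cross_entropy` helpers, the reference corpus, and the threshold of 5.0 bits per token are assumptions, not part of the original question; a real pipeline would use a neural language model's per-token log-probabilities instead.

```python
import math
from collections import Counter

def unigram_model(corpus_tokens):
    """Token probabilities under a simple unigram 'small model'."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def cross_entropy(doc_tokens, model_probs, floor=1e-6):
    """Average bits per token the model needs to encode the document.
    Unseen tokens get a tiny floor probability, so out-of-domain
    text is penalized heavily."""
    return sum(-math.log2(model_probs.get(t, floor))
               for t in doc_tokens) / len(doc_tokens)

# Hypothetical reference corpus the small model was fit on.
model = unigram_model("the cat sat on the mat the dog sat".split())

doc_a = "the cat sat".split()    # in-distribution, coherent text
doc_b = "zqx flurb wug".split()  # out-of-domain noise

ce_a = cross_entropy(doc_a, model)  # low: matches learned patterns
ce_b = cross_entropy(doc_b, model)  # high: surprising to the model

# Keep only documents the small model finds predictable
# (assumed threshold: 5.0 bits per token).
kept = [name for name, ce in [("A", ce_a), ("B", ce_b)] if ce < 5.0]
```

A low cross-entropy means the document is easy for the small model to predict, i.e. it matches the patterns the model learned from its training data, which is why the filter above keeps only document A.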
Tags
Ch.4 Alignment - Foundations of Large Language Models