Interpreting Cross-Entropy for Data Curation
A data curation team uses a small language model to screen a large text corpus for training data. The model assigns a cross-entropy score to each document. They find two documents with the following scores:
- Document A: Cross-entropy = 1.8
- Document B: Cross-entropy = 9.5
Based on the goal of creating a high-quality, coherent training set, which document is more likely to be included, and why? Explain the relationship between the cross-entropy score and how well a document aligns with the small model's learned patterns.
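The scoring-and-filtering setup in the question can be sketched with a toy unigram "small model." Everything below is illustrative: the `unigram_model` and `cross_entropy` helpers, the reference corpus, and the threshold of 5.0 bits per token are assumptions, not part of the original question; a real pipeline would use a neural language model's per-token log-probabilities instead.

```python
import math
from collections import Counter

def unigram_model(corpus_tokens):
    """Token probabilities under a simple unigram 'small model'."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def cross_entropy(doc_tokens, model_probs, floor=1e-6):
    """Average bits per token the model needs to encode the document.
    Unseen tokens get a tiny floor probability, so out-of-domain
    text is penalized heavily."""
    return sum(-math.log2(model_probs.get(t, floor))
               for t in doc_tokens) / len(doc_tokens)

# Hypothetical reference corpus the small model was fit on.
model = unigram_model("the cat sat on the mat the dog sat".split())

doc_a = "the cat sat".split()    # in-distribution, coherent text
doc_b = "zqx flurb wug".split()  # out-of-domain noise

ce_a = cross_entropy(doc_a, model)  # low: matches learned patterns
ce_b = cross_entropy(doc_b, model)  # high: surprising to the model

# Keep only documents the small model finds predictable
# (assumed threshold: 5.0 bits per token).
kept = [name for name, ce in [("A", ce_a), ("B", ce_b)] if ce < 5.0]
```

A low cross-entropy means the document is easy for the small model to predict, i.e. it matches the patterns the model learned from its training data, which is why the filter above keeps only document A.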
Tags
Ch.4 Alignment - Foundations of Large Language Models