Concept

Flat Corpus Representation

In some text datasets, the corpus of token indices is stored as a single flat list rather than a nested list of per-line or per-sentence token lists. This design is appropriate when the source text lacks meaningful sentence or paragraph boundaries, for example when a line of the original file is not guaranteed to be a complete sentence. Flattening all tokens into one continuous sequence treats the entire text as a single stream of tokens (characters or words), which simplifies downstream operations such as generating fixed-length training subsequences for language models.
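The idea can be illustrated with a minimal sketch. The helper names below (`build_vocab`, `flatten_corpus`, `fixed_length_subsequences`) and the toy data are hypothetical and chosen for illustration only; they are not taken from any particular library, and the frequency-ordered vocabulary is just one plausible way to assign indices.

```python
from collections import Counter


def build_vocab(lines):
    """Map each token to an integer index (hypothetical helper).

    Tokens are ordered by descending frequency, with ties broken
    alphabetically so the mapping is deterministic.
    """
    counts = Counter(tok for line in lines for tok in line)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: idx for idx, (tok, _) in enumerate(ordered)}


def flatten_corpus(lines, vocab):
    """Convert nested per-line token lists into one flat list of indices."""
    return [vocab[tok] for line in lines for tok in line]


def fixed_length_subsequences(corpus, num_steps):
    """Split the flat corpus into non-overlapping chunks of num_steps tokens.

    Trailing tokens that do not fill a whole chunk are dropped.
    """
    last_start = len(corpus) - num_steps
    return [corpus[i:i + num_steps] for i in range(0, last_start + 1, num_steps)]


# Toy input: line breaks here do not correspond to sentence boundaries.
lines = [["the", "time"], ["machine", "by"], ["the", "end"]]
vocab = build_vocab(lines)
corpus = flatten_corpus(lines, vocab)
chunks = fixed_length_subsequences(corpus, num_steps=2)
```

Because the corpus is flat, a chunk may span what was originally a line break. That is the intended behavior when line boundaries carry no meaning, and it would be harder to achieve with a nested list of per-line tokens.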

Updated 2026-05-13

Tags

D2L

Dive into Deep Learning @ D2L