Penn Tree Bank (PTB) Dataset
The Penn Tree Bank (PTB) is a widely used corpus in natural language processing, sampled from Wall Street Journal articles. It is typically divided into training, validation, and test sets. In the text files used for training word embedding models, each line holds one sentence with words separated by spaces, so splitting a line on whitespace yields its individual tokens. Notably, these files already contain <unk> tokens, which stand in for rare or unknown words.
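The one-sentence-per-line format can be illustrated with a minimal Python sketch. The sample text and the helper name tokenize_lines are assumptions made up for illustration, not real PTB data or part of any library. With the actual corpus, the text would be read from the training file instead.

```python
from collections import Counter

# Hypothetical sample in the format described above: one sentence per line,
# words separated by spaces, rare words already replaced by <unk>.
# These sentences are invented for illustration, not taken from PTB.
SAMPLE = (
    "the <unk> of N shares rose\n"
    "analysts said the <unk> was expected\n"
)


def tokenize_lines(text):
    """Split each non-empty line into a list of whitespace-separated tokens."""
    return [line.split() for line in text.splitlines() if line.strip()]


sentences = tokenize_lines(SAMPLE)

# Count token frequencies across all sentences; <unk> is counted like any
# other token because it appears literally in the text.
counts = Counter(token for sentence in sentences for token in sentence)
```

Here sentences is a list of token lists, one per line, and counts maps each token to its frequency, which is a common starting point for building a vocabulary.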
Updated 2026-05-25
Tags
D2L
Dive into Deep Learning @ D2L