Learn Before
  • Tokenization

Different standards for tokenization

  • Word tokenization: Penn Treebank tokenization; NLTK tokenizers
  • Character tokenization
  • Subword tokenization: byte-pair encoding (BPE); WordPiece algorithm with MaxMatch decoding; SentencePiece
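
The three granularities listed above can be compared with a minimal sketch. It is illustrative only: `word_tokenize` is a simplified regex stand-in (not the Penn Treebank rules or the NLTK implementation), and `learn_bpe_merges` is a toy version of BPE merge learning that omits end-of-word markers and other details of production tokenizers. All function names here are hypothetical.

```python
import re
from collections import Counter

def word_tokenize(text):
    # Simplified stand-in for a word tokenizer:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Character tokenization: every character is its own token.
    return list(text)

def learn_bpe_merges(words, num_merges):
    # Toy BPE: each word starts as a tuple of characters, and in every
    # round the most frequent adjacent pair of symbols is merged.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

sentence = "The researcher is studying neuroplasticity."
print(word_tokenize(sentence))
print(char_tokenize("token"))
print(learn_bpe_merges(["aa", "aa", "ab"], 2))
```

Word tokens keep the vocabulary readable but cannot represent unseen words, character tokens can represent anything at the cost of long sequences, and learned subword merges sit between the two.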

Updated 2026-05-25

Contributors are:

Claude Opus
🏆 2
Jing Cao
✔️ 1

Who are from:

Claude
🏆 2
University of Michigan - Ann Arbor
✔️ 1

References


  • Speech and Language Processing (3rd ed. draft)

  • Dive into Deep Learning

Tags

Data Science

D2L

Dive into Deep Learning @ D2L

Related
  • Different standards for tokenization

  • Inference Process with a Fine-Tuned Model

  • Example of Tokenization into Words and Punctuation

  • Example of Word and Punctuation Tokenization

  • Methods of Tokenization

  • A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.

    Method A: ['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
    Method B: ['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']

    Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?

  • Tokenization Strategies

  • Evaluating Tokenization for a Specialized Chatbot
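
The Method B segmentation in the 'neuroplasticity' question above can be reproduced with a greedy longest-match (MaxMatch) sketch. The vocabulary below is a hypothetical one chosen for illustration, and the function skips the `##` continuation prefix that WordPiece implementations typically attach to non-initial pieces. Falling back to a single unknown token for the whole word is also an assumption of this sketch.

```python
def max_match(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position take the longest
    # substring found in the vocabulary, then continue after it.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry starts here; give up on the whole word.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical subword vocabulary: the model has seen 'neuro' and
# 'plasticity' as pieces, but never 'neuroplasticity' as a whole.
vocab = {"neuro", "plasticity", "science", "study", "ing"}

print(max_match("neuroplasticity", vocab))  # ['neuro', 'plasticity']
print(max_match("studying", vocab))         # ['study', 'ing']
print(max_match("xyz", vocab))              # ['[UNK]']
```

A word-level tokenizer in the same situation would map the unseen word to one unknown token, while the subword pieces here are units the model already has representations for.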

Learn After
  • Character-Level Tokenization

  • Byte Pair Encoding
