Learn Before
  • Tokenization


Different standards for tokenization

  • Word tokenization: Penn Treebank tokenization; NLTK
  • Character tokenization
  • Subword tokenization: byte-pair encoding (BPE); the WordPiece algorithm with MaxMatch decoding; SentencePiece
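The three standards above can be sketched in plain Python. This is a toy illustration under simplifying assumptions: the regex word tokenizer only approximates Treebank-style behavior (it does not implement the actual Penn Treebank rules), and `bpe_learn` is a minimal BPE merge loop in the spirit of Sennrich et al., without the end-of-word marker or other details of a production implementation.

```python
import re
from collections import Counter

def word_tokenize(text):
    """Toy word tokenizer: splits punctuation into separate tokens.
    A rough approximation of Treebank-style tokenization, not the real rules."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Character tokenization: every character becomes a token."""
    return list(text)

def bpe_learn(word_freqs, num_merges):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. Words are tuples of symbols."""
    vocab = Counter(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

print(word_tokenize("Don't panic!"))  # ['Don', "'", 't', 'panic', '!']
print(char_tokenize("hi!"))           # ['h', 'i', '!']

# Tiny corpus as {word-as-symbol-tuple: frequency}; two merges suffice
# to show "l o w" collapsing into a single "low" subword.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
merges, vocab = bpe_learn(corpus, 2)
print(merges)
```

Note how subword merging lets frequent fragments like "low" become single tokens while rarer words stay decomposed, which is exactly why BPE-style vocabularies handle unseen words more gracefully than a fixed word-level vocabulary.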


Updated 2020-07-16

Contributors:

Jing Cao

From:

University of Michigan - Ann Arbor

References


  • Speech and Language Processing (3rd ed. draft)

Tags

Data Science

Related
  • Different standards for tokenization

  • Inference Process with a Fine-Tuned Model

  • Example of Tokenization into Words and Punctuation

  • Example of Word and Punctuation Tokenization

  • Methods of Tokenization

  • A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.

    Method A: ['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
    Method B: ['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']

    Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?

  • Tokenization Strategies

  • Evaluating Tokenization for a Specialized Chatbot
