Learn Before
Concept
Word segmentation and normalization
Word segmentation, or word tokenization, means segmenting running text into separative words. In practice, since word segmentation is usually the first step in language processing, it needs to be very fast.
Word normalization basically means changing words into standard formats, which are used to stand for other forms with the same meaning.
Word tokenization and normalization are generally done by regular expression or finite automata.
0
1
Updated 2021-09-12
Tags
Natural language processing
Data Science