Learn Before
Concept

Word segmentation and normalization

Word segmentation, or word tokenization, means segmenting running text into separative words. In practice, since word segmentation is usually the first step in language processing, it needs to be very fast.

Word normalization basically means changing words into standard formats, which are used to stand for other forms with the same meaning.

Word tokenization and normalization are generally done by regular expression or finite automata.

0

1

Updated 2021-09-12

Tags

Natural language processing

Data Science

Related
Learn After