1Cademy - Simple Unix Tokenization Commands

Simple Unix Tokenization Commands

Commands for tokenizing words, each building on the previous one:

1) tr -sc 'A-Za-z' '\n' < sh.txt
2) tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
3) tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
4) tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r

What each step adds:

1) Replaces every run of non-alphabetic characters with a single newline, so each word lands on its own line
2) Sorts the words alphabetically and collapses duplicates, prefixing each distinct word with its count
3) Maps uppercase letters to lowercase before counting, so "The" and "the" are counted together
4) Re-sorts the counted words numerically in descending order, so the most frequent words come first
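The full pipeline can be tried on a small input before running it on a whole corpus. The sketch below is only an illustration: the sample sentence, the temporary file standing in for sh.txt, and the variable names are assumptions, not part of the original commands.

```shell
# Hypothetical sample input standing in for sh.txt
sample=$(mktemp)
printf 'The cat saw the dog. The dog ran!\n' > "$sample"

# Step 4 pipeline: tokenize, lowercase, count, sort by frequency.
# head keeps only the two most frequent words.
top=$(tr -sc 'A-Za-z' '\n' < "$sample" \
  | tr A-Z a-z \
  | sort \
  | uniq -c \
  | sort -n -r \
  | head -n 2)

printf '%s\n' "$top"
rm -f "$sample"
```

With this input the first line reports a count of 3 for "the" and the second a count of 2 for "dog"; the exact padding in front of the counts depends on the local uniq implementation.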

Updated 2021-09-19

Contributors are:

Len Morelos-Zaragoza

Who are from:

University of California, Santa Cruz

References

Speech and Language Processing (3rd ed. draft)

Tags

Data Science
