Token-based models often fail to recognize intentionally obfuscated hate speech, such as the phrase "ki11 yrslef a$$hole". Because word-level tokenizers treat each disguised spelling as an unseen token, character-level n-grams are needed as features to detect disguised offensive language.
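As a minimal sketch of why character n-grams help, the snippet below compares the example phrase with its plain spelling at the word level and at the character-bigram level. The helper names `char_ngrams` and `jaccard`, the `_` padding, and the choice of n = 2 are illustrative assumptions, not something prescribed by the survey.

```python
def char_ngrams(text, n=2):
    """Collect character n-grams from each whitespace token (hypothetical helper).

    Each token is padded with '_' so that word boundaries become part of the n-grams.
    """
    grams = set()
    for token in text.lower().split():
        padded = f"_{token}_"
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


plain = "kill yourself asshole"
obfuscated = "ki11 yrslef a$$hole"

# Word-level view: no token of the obfuscated phrase matches the plain one.
word_overlap = jaccard(set(plain.split()), set(obfuscated.split()))

# Character-level view: several bigrams (e.g. "ki", "ho", "ol") survive the obfuscation.
char_overlap = jaccard(char_ngrams(plain), char_ngrams(obfuscated))

print(f"word overlap: {word_overlap:.2f}, char-bigram overlap: {char_overlap:.2f}")
```

The word-level overlap is zero, while the character-level overlap is not, so a classifier using character n-grams still sees features it encountered during training.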

San Jose State University

Google

Surface-level features used in text classification tasks include:

 - Bag of words
 - Unigrams (word level)
 - Larger n-grams (word level)
 - N-grams combined with other features
 - Character n-grams
 - Frequency of URL mentions and punctuation
 - Comment and token length
 - Capitalization
 - Words not found in English dictionaries
 - Number of non-alphanumeric characters present in tokens
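Several of the features listed above can be computed with a few lines of standard-library code. The following is a rough sketch, not a reference implementation: the function name `surface_features`, the URL pattern, the feature names, and the tiny `dictionary` argument are all assumptions made for illustration.

```python
import re
import string

# Simplistic URL pattern (an assumption; real URL matching is more involved).
URL_RE = re.compile(r"https?://\S+")


def surface_features(comment, dictionary=frozenset()):
    """Compute a handful of surface-level features for one comment."""
    urls = URL_RE.findall(comment)
    text = URL_RE.sub(" ", comment)          # analyse the text without the URLs
    tokens = text.split()
    letters = [c for c in text if c.isalpha()]
    # Strip surrounding punctuation before the dictionary lookup.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    words = [w for w in words if w]
    return {
        "comment_length": len(comment),
        "num_tokens": len(tokens),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "num_urls": len(urls),
        "num_punctuation": sum(c in string.punctuation for c in text),
        "uppercase_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "num_non_alnum_in_tokens": sum(not c.isalnum() for t in tokens for c in t),
        "num_out_of_dictionary": sum(w not in dictionary for w in words),
    }


feats = surface_features(
    "YOU are a$$hole!! see http://example.com",
    dictionary={"you", "are", "see"},
)
for name, value in feats.items():
    print(name, value)
```

In practice the dictionary would be a full English word list, and these counts would be combined with n-gram features rather than used alone.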


Simple Surface Features Needed for Text Classification Tasks

A Survey on Hate Speech Detection using Natural Language Processing by Anna Schmidt and Michael Wiegand

https://aclanthology.org/W17-1101.pdf

A Survey on Hate Speech Detection using Natural Language Processing

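 - Text classification models built on features such as bag of words achieve high accuracy only when the predictive words appear in both the training and the test data.
 - Data sparsity: when predictive words in the test data were not seen during training (for example, because of spelling variation), word-level features carry no signal for them.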
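The sparsity problem can be illustrated with a toy bag-of-words model. This is a small sketch under assumed data: the two training sentences, the helper names `build_vocab` and `bow_vector`, and the test inputs are invented for the example.

```python
from collections import Counter


def build_vocab(docs):
    """Sorted list of all lowercase whitespace tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def bow_vector(doc, vocab):
    """Bag-of-words count vector over a fixed vocabulary; unseen tokens are dropped."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]


train = ["you are an idiot", "have a nice day"]   # toy training data (assumption)
vocab = build_vocab(train)

seen = bow_vector("you idiot", vocab)     # both tokens occur in the training data
unseen = bow_vector("u r 1d10t", vocab)   # respelled variants, none in the vocabulary

print(seen)
print(unseen)
```

The second vector is all zeros: the model receives no evidence at all for the respelled comment, even though a human reads it the same way as the first.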

