
Chunking in NLTK

Chunking groups words into phrases based on their part-of-speech (POS) tags. The tagger works on individual tokens, so the text must first be split into words. Start by importing word_tokenize.

from nltk.tokenize import word_tokenize

Let text be the body of text to chunk.

import nltk

# word_tokenize relies on NLTK's "punkt" tokenizer data; download it once if it is missing
tokenized_text = word_tokenize(text)  # list of word and punctuation tokens, each a string
nltk.download("averaged_perceptron_tagger")  # tagger model; only needs to be downloaded once
POS_tags = nltk.pos_tag(tokenized_text)  # list of (word, POS tag) tuples

The next step is to write a grammar rule that describes which sequences of POS tags should be grouped, or "chunked," into a phrase.

grammar = "NP: {<DT>?<JJ>*<NN>}"

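This rule defines a noun phrase (NP): an optional determiner (DT), followed by any number of adjectives (JJ), followed by a singular noun (NN). The ? marks the determiner as optional and the * allows zero or more adjectives. Because the rule names NN exactly, other noun tags such as NNS (plural noun) are not matched by it.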
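To make the rule concrete, here is a minimal sketch that mimics it with Python's standard re module instead of NLTK. It is only meant to show how the ? and * quantifiers apply to whole tags. The helper name find_noun_phrases and the hand-tagged sample sentence are assumptions made for this illustration, not part of NLTK.

```python
import re

# Hand-tagged sample (hypothetical data): (word, POS tag) pairs.
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"), ("a", "DT"), ("dog", "NN")]

# Same shape as the rule: optional <DT>, any number of <JJ>, then one <NN>.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def find_noun_phrases(tagged_words):
    """Return the word groups whose tag sequence matches NP_PATTERN."""
    # Encode the tags as one string, e.g. "<DT><JJ><NN>", and remember
    # where each tag starts so matches can be mapped back to words.
    starts, pieces, pos = [], [], 0
    for _, tag in tagged_words:
        starts.append(pos)
        piece = f"<{tag}>"
        pieces.append(piece)
        pos += len(piece)
    tag_string = "".join(pieces)

    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        first = starts.index(match.start())   # index of the first word in the match
        count = match.group().count("<")      # number of tags covered by the match
        phrases.append([word for word, _ in tagged_words[first:first + count]])
    return phrases

print(find_noun_phrases(tagged))
# → [['the', 'quick', 'brown', 'fox'], ['a', 'dog']]
```

The verb and preposition fall outside every match, so they are left unchunked, which is the same behavior the grammar rule describes.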

Then create a chunk parser with this grammar, parse the tagged tokens, and draw the resulting tree.

chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(POS_tags)
tree.draw()
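Since tree.draw() opens a graphical window, it can be handy to inspect the result as text instead. The following is a small sketch under a few assumptions: the tokens are tagged by hand (a hypothetical sample, so no tagger download is needed), and the variable name noun_phrases is introduced here only for illustration.

```python
import nltk

# Hand-tagged tokens (hypothetical sample) in the same format pos_tag returns.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(tagged)

# Print the tree as bracketed text rather than drawing it.
print(tree)

# Collect the words of every NP subtree found by the parser.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Words that do not belong to any noun phrase, such as "barked" and "at", stay in the tree as plain (word, tag) leaves directly under the root.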


Updated 2022-11-03


Tags

Python Programming Language

Data Science