Introduction to NLTK: Tokenization, Stemming, Lemmatization, POS Tagging - GeeksforGeeks (2024)

Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks. From rudimentary tasks such as text pre-processing to tasks like vectorized representation of text – NLTK’s API has covered everything. In this article, we will accustom ourselves to the basics of NLTK and perform some crucial NLP tasks: Tokenization, Stemming, Lemmatization, and POS Tagging.

Table of Content

  • What is the Natural Language Toolkit (NLTK)?
  • Tokenization
  • Stemming and Lemmatization
  • Stemming
  • Lemmatization
  • Part of Speech Tagging

What is the Natural Language Toolkit (NLTK)?

As discussed earlier, NLTK is Python's API library for performing an array of tasks in human language. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.

Installation:
NLTK can be installed using pip by running the following command.

! pip install nltk

Accessing Additional Resources:
To use additional resources, such as resources for languages other than English, you can run the following in a Python script. This has to be done only once, the first time you run NLTK on your system.

Python
import nltk
nltk.download('all')

Now, having installed NLTK successfully in our system, let’s perform some basic operations on text data using NLTK.

Tokenization

Tokenization refers to breaking text down into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:

Word Tokenization

It involves breaking down the text into words.

 "I study Machine Learning on GeeksforGeeks." will be word-tokenized as
['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization

It involves breaking down the text into individual sentences.

Example:
"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP"
will be sentence-tokenized as
['I study Machine Learning on GeeksforGeeks.', 'Currently, I'm studying NLP.']

In Python, both these tokenizations can be implemented in NLTK as follows:

Python
# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize

sent = "GeeksforGeeks is a great learning platform. \
It is one of the best for Computer Science students."

print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:

['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.',
'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']
['GeeksforGeeks is a great learning platform.',
'It is one of the best for Computer Science students.']

Stemming and Lemmatization

When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all to their base form i.e. ‘play’.

Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

Stemming

Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words. Stemmers are faster and computationally less expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer – one of the most widely used stemmers:

Python
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:

play
play
play
play

We can see that all the variations of the word ‘play’ have been reduced to the same word – ‘play’. In this case, the output is a meaningful word, ‘play’. However, this is not always the case. Let us take an example.

Python
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("Communication"))

Output:

commun

The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.

Lemmatization

Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, which will be meaningful in nature. The base form here is called the lemma. Please note that these groups are stored in the lemmatizer; there is no rule-based removal of affixes as in the case of a stemmer.

Lemmatizers are slower and computationally more expensive than stemmers.

Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can be implemented in NLTK as follows:

Python
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play
play
play
play

Please note that in lemmatizers, we need to pass the Part of Speech of the word along with the word as a function argument.
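If the POS argument is omitted, WordNetLemmatizer defaults to treating the word as a noun, so verb inflections may come back unchanged. A minimal sketch of that difference (the outputs in the comments are the usual results, assuming the WordNet data has been downloaded):

Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the default is 'n' (noun), so the verb form is left as-is
print(lemmatizer.lemmatize("playing"))       # playing
print(lemmatizer.lemmatize("playing", 'v'))  # play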

Also, lemmatizers always result in meaningful base words. Let us take the same example as we took in the case for stemmers.

Python
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("Communication", 'v'))

Output:

Communication

Part of Speech Tagging

Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech. It is significant as it helps to give a better syntactic overview of a sentence.

Example:
"GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence.

In Python, POS tagging can be implemented in NLTK as follows:

Python
from nltk import pos_tag
from nltk import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)

tags = pos_tag(tokenized_text)
print(tags)

Output:

[('GeeksforGeeks', 'NNP'),
('is', 'VBZ'),
('a', 'DT'),
('Computer', 'NNP'),
('Science', 'NNP'),
('platform', 'NN'),
('.', '.')]

Conclusion

In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.




FAQs

What is stemming and lemmatization with NLTK?

While both lemmatization and stemming involve reducing words to their base forms, lemmatization considers the context and morphological analysis to return a valid word, whereas stemming applies simpler rules to chop off prefixes or suffixes, often resulting in non-dictionary words.
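As a rough illustration of that difference, the sketch below runs the same words through NLTK's PorterStemmer and WordNetLemmatizer (the outputs in the comments are typical results, assuming the WordNet data has been downloaded):

Python
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips suffixes by rule and may produce non-dictionary words
print(porter.stem("studies"))               # studi
print(porter.stem("better"))                # better

# Lemmatization looks words up in WordNet and returns valid words
print(lemmatizer.lemmatize("studies"))      # study
print(lemmatizer.lemmatize("better", 'a'))  # good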

What is tokenization, lemmatization, and stemming?

Tokenization is the process of converting text into individual words or tokens, while lemmatization is the process of converting words to their base or root forms. Stemming also reduces words to a base form, but by stripping affixes with simple rules, so the result may not be a valid dictionary word.

What is POS tagging in NLTK?

POS tagging (Parts of Speech tagging) in NLTK is the practice of marking up each word in a text with its part of speech according to its definition and context. It interprets a language's text and associates each word with a specific tag. Grammar tagging is another term for it.

What is tokenization and POS tagging?

Tokenization and part-of-speech (PoS) tagging are two fundamental NLP tasks. Tokenization aims at detecting word and sentence boundaries in text while PoS tagging uses the recognized words and assigns each word its syntactical category.

Which is better, lemmatization vs. stemming?

Lemmatization takes more time than stemming because it finds a meaningful word representation, whereas stemming just needs to chop the word down to a base form and therefore takes less time. Stemming is common in applications like sentiment analysis, while lemmatization is preferred where valid words matter, such as chatbots and question answering.

What is an example of lemmatization?

Lemmatization takes a word and breaks it down to its lemma. For example, the verb "walk" might appear as "walking," "walks" or "walked." Inflectional endings such as "s," "ed" and "ing" are removed. Lemmatization groups these words as its lemma, "walk."
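A minimal sketch of this grouping with NLTK's WordNetLemmatizer, passing 'v' so all the verb forms map to the lemma "walk":

Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Every inflected verb form maps back to the lemma "walk"
for word in ["walking", "walks", "walked"]:
    print(lemmatizer.lemmatize(word, 'v'))  # walk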

What is lemmatization using PoS?

Lemmatization obtains the lemmas of the different words in a text. PoS tagging obtains the grammatical category of each word, and that category can be passed to the lemmatizer so it knows which kind of lemma to return (as with the POS argument of WordNetLemmatizer shown above).

What is an example of stemming?

Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program.” In other words, “program” can be used as a stand-in for the prior three inflected words.
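A quick sketch of this with NLTK's PorterStemmer; note that a rule-based stemmer does not always land on exactly the same stem for every form (for instance, "programmer" typically comes out as "programm" rather than "program"):

Python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Rule-based suffix stripping; the stems are not guaranteed to be dictionary words
for word in ["programming", "programmer", "programs"]:
    print(word, "->", porter.stem(word))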

What is tokenization in NLP?

Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.

What is the purpose of POS tagging?

POS tagging helps identify the correct meaning of a word based on context, using tagsets that define the possible tags for each word type and their contexts. POS tags also provide valuable information about the relationships between words, which is useful for building statistical models of language.

What is an example of a POS tagger?

Consider the sentence: “The quick brown fox jumps over the lazy dog.” After performing POS tagging, “The” is tagged as a determiner (DT), “quick” is tagged as an adjective (JJ), and so on.
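A minimal sketch that tags this sentence with NLTK (the exact tags can vary with the tagger model shipped with NLTK, but "The" and "quick" typically come back as DT and JJ):

Python
from nltk import pos_tag, word_tokenize

sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize first, then tag each token with its part of speech
print(pos_tag(word_tokenize(sentence)))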

What is the difference between POS tagging and parsing?

They are two distinct procedures:

  • POS tagging: each token gets assigned a label which reflects its word class.
  • Parsing: each sentence gets assigned a structure (often a tree) which reflects how its components are related to each other.
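To illustrate the distinction, the sketch below first POS-tags a sentence and then builds a shallow parse (noun-phrase chunks) with NLTK's RegexpParser; the chunk grammar is an assumed toy pattern for illustration, not a full parser:

Python
from nltk import pos_tag, word_tokenize, RegexpParser

sentence = "GeeksforGeeks is a Computer Science platform."

# POS tagging: each token gets a word-class label
tagged = pos_tag(word_tokenize(sentence))
print(tagged)

# Parsing (here, shallow chunking): the tagged tokens get a tree structure
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # assumed noun-phrase pattern
chunker = RegexpParser(grammar)
print(chunker.parse(tagged))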


What is the difference between a token and tokenization?

A token is a collection of characters that has semantic meaning for a model. Tokenization is the process of converting the words in your text into such tokens.

What is the difference between stemming and lemmatization?

Lemmatization and stemming are both techniques to reduce words to a base form; however, they differ in the level of linguistic analysis. Lemmatization uses morphological analysis and ensures that the result is a valid word, whereas stemming applies surface-level rules and does not.

What is the difference between spaCy lemmatization and NLTK?

In general, spaCy is faster than NLTK and provides a more streamlined implementation, but NLTK remains widely used for tasks like tokenization, stemming, lemmatization, POS tagging, and text matching.

What is the purpose of stemming in NLP?

What is stemming? Stemming is a text preprocessing technique used in natural language processing (NLP) to reduce words to their root or base form. The goal of stemming is to simplify and standardize words, which helps improve the performance of information retrieval, text classification, and other NLP tasks.

What is stemming or lemmatization for topic modelling?

Lemmatization does the same task as stemming in that it reduces a word to a shorter base form. The slight difference is that lemmatization cuts the word down to its lemma, so it produces a much more meaningful form than stemming does, which helps when normalising a vocabulary for topic modelling.
