Text Preprocessing with NLTK

Contents

  1. What is Natural Language Processing?
  2. What is NLTK?
  3. Initial Steps
  4. Preliminary Statistics
  5. Stemming and Lemmatization with NLTK
  6. How are Stemming and Lemmatization Different?
  7. Conclusion

What is Natural Language Processing?

Natural Language Processing, or NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable. To this end, many different models, libraries, and methods have been used to train machines to process text, understand it, make predictions based on it, and even generate new text. The first step in training a model is to obtain and preprocess the data. In this article, I will go through some of the most common steps to be followed with almost any dataset before you can pass it as input to a model.


What is NLTK?

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language. It provides implementations of common algorithms such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition, some of which we will make use of in this article.

Initial Steps

First, we import the NLTK toolkit.

# Importing modules
import nltk

Now we import the required dataset, which can be stored and accessed locally or online through a web URL. We can also make use of one of the corpus datasets provided by NLTK itself. In this article, we will be using a sample corpus dataset provided by NLTK.
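For reference, here is a minimal sketch of the first two options; the file path and URL below are hypothetical placeholders:

# Option 1: read a corpus stored locally (the path is a placeholder).
with open("my_corpus.txt", encoding="utf-8") as f:
    corpus = f.read()

# Option 2: fetch raw text through a web URL (the URL is a placeholder).
from urllib.request import urlopen
corpus = urlopen("https://example.com/my_corpus.txt").read().decode("utf-8")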

# Sample corpus.
nltk.download('inaugural')
# Fetches the corpus if it is not already installed locally.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
print(corpus)

We print the corpus so that we can take a look at the text, study it, and make note of special characters and other changes that might need to be made before training a model based on it.

Preliminary Statistics

We now look at how to extract some statistics from the corpus, such as the number of sentences, using tokenization. These statistics can later be used to set some parameters while training a model. Tokenization is the process by which large quantities of text are divided into smaller parts called tokens. It is crucial to understand the patterns in the text in order to perform various NLP tasks, and these tokens are very useful for finding such patterns. NLTK has a very important module, tokenize, which comprises two key functions -

  1. word_tokenize
  2. sent_tokenize

nltk.download('punkt')
# The 'punkt' models are needed by word_tokenize and sent_tokenize.
from nltk.tokenize import word_tokenize, sent_tokenize
sents = sent_tokenize(corpus)
print("The number of sentences is", len(sents))
words = word_tokenize(corpus)
print("The number of tokens is", len(words))
average_tokens = round(len(words)/len(sents))
print("The average number of tokens per sentence is", average_tokens)
unique_tokens = set(words)
print("The number of unique tokens is", len(unique_tokens))

# Remove stopwords: common function words that carry little meaning on their own.
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
final_tokens = []
for each in words:
    if each not in stop_words:
        final_tokens.append(each)
print("The number of tokens after removing stopwords is", len(final_tokens))

Now that we have some numerical descriptors of the dataset, we can take a look at stemming and lemmatization.

Stemming and Lemmatization with NLTK

What is Stemming?
Stemming is a kind of normalization for words. It is a technique that chops the endings off inflected or derived words, so that variants which share the same meaning but differ with context or sentence are reduced to a common base form. Stemming is hence a way to find the root word from variations of the word.

NLTK provides many inbuilt stemmers such as the Porter Stemmer, Snowball Stemmer and Lancaster Stemmer. We will look at the differences between the Porter Stemmer and the Snowball Stemmer.

from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
# The Snowball Stemmer takes a language as a parameter.
words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# Create instances of both stemmers, and stem the words using them.
stemmer_ps = PorterStemmer()
#an instance of Porter Stemmer
stemmed_words_ps = [stemmer_ps.stem(word) for word in words]
print("Porter stemmed words: ", stemmed_words_ps)
stemmer_ss = SnowballStemmer("english")
#an instance of Snowball Stemmer
stemmed_words_ss = [stemmer_ss.stem(word) for word in words]
print("Snowball stemmed words: ", stemmed_words_ss)

Once we have created instances of the stemmers, we write a function which takes each sentence of a corpus as input and returns its stemmed version.

# A function which takes a sentence/corpus and returns its stemmed version.
def stemSentence(sentence):
    token_words = word_tokenize(sentence)
    # We need to tokenize the sentence, or else stemming will return the entire sentence as is.
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(stemmer_ps.stem(word))
        stem_sentence.append(" ")
        # Adding a space so that we can join all the words at the end to form the sentence again.
    return "".join(stem_sentence)

stemmed_sentence = stemSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The Porter stemmed sentence is: ", stemmed_sentence)

We observe that the two stemmers produce nearly identical output, except in the case of some adverbs, where the Snowball Stemmer gives an output closer to the root word.

Some differences between the Porter Stemmer and Snowball Stemmer are -

  • Snowball Stemmer is more aggressive than Porter Stemmer.
  • Some issues in Porter Stemmer are fixed in Snowball Stemmer.
  • Words like ‘fairly’ and ‘sportingly’ are stemmed to ‘fair’ and ‘sport’ by the Snowball Stemmer but to ‘fairli’ and ‘sportingli’ by the Porter Stemmer.

As a general rule of thumb, the Snowball Stemmer produces more accurate stems.
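To see the adverb difference concretely, here is a small check that reuses the stemmer_ps and stemmer_ss instances created above:

# Stem the two adverbs from the comparison above with both stemmers.
for word in ["fairly", "sportingly"]:
    print(word, "->", stemmer_ps.stem(word), "(Porter),", stemmer_ss.stem(word), "(Snowball)")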

What is Lemmatization?
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. It usually refers to the morphological analysis of words, which aims to remove inflectional endings and return the base or dictionary form of a word, known as the lemma.

The NLTK Lemmatization method is based on WordNet’s built-in morph function.

We write some code to import the WordNet Lemmatizer.

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# Since Lemmatization is based on WordNet's built-in morph function.

Now that we have downloaded WordNet, we can go ahead with lemmatization. Lemmatization can be done with or without a POS tag. A POS, or part-of-speech, tag assigns a tag to each word and hence increases the accuracy of the lemma in the context of the dataset. For example, the word ‘leaves’ without a POS tag would get lemmatized to ‘leaf’, but with a verb tag its lemma becomes ‘leave’.

words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

lemmatizer = WordNetLemmatizer()
#an instance of Word Net Lemmatizer
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("The lemmatized words: ", lemmatized_words)
#prints the lemmatized words
lemmatized_words_pos = [lemmatizer.lemmatize(word, pos = "v") for word in words]
print("The lemmatized words using a POS tag: ", lemmatized_words_pos)
#prints POS tagged lemmatized words

Now that we have created an instance of the lemmatizer, we write a function which takes each sentence of the corpus as input and returns its lemmatized version.

# A function which takes a sentence/corpus and returns its lemmatized version.
def lemmatizeSentence(sentence):
    token_words = word_tokenize(sentence)
    # We need to tokenize the sentence, or else lemmatizing will return the entire sentence as is.
    lemma_sentence = []
    for word in token_words:
        lemma_sentence.append(lemmatizer.lemmatize(word))
        lemma_sentence.append(" ")
    return "".join(lemma_sentence)

lemma_sentence = lemmatizeSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The lemmatized sentence is: ", lemma_sentence)

In order to get results more in accordance with the context of the dataset, POS tags can be used with the lemmatizer.
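The earlier example passed a fixed pos = "v" for every word. A more context-aware approach is to tag each token first with nltk.pos_tag and map the resulting Penn Treebank tag to a WordNet POS constant. Below is a minimal sketch of that idea; the helper name get_wordnet_pos is my own, not part of NLTK:

nltk.download('averaged_perceptron_tagger')
# The tagger model used by nltk.pos_tag.
from nltk import pos_tag
from nltk.corpus import wordnet

# Map a Penn Treebank tag to a WordNet POS constant.
# Defaulting to NOUN mirrors lemmatize()'s own default.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tokens = word_tokenize("He leaves the house while the leaves are falling.")
lemmas = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos_tag(tokens)]
print(lemmas)

With this mapping, the verb ‘leaves’ lemmatizes to ‘leave’ while the noun ‘leaves’ lemmatizes to ‘leaf’, matching the example given above.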

How are Stemming and Lemmatization Different?

  1. Stemming reduces word-forms to stems in order to reduce size, whereas lemmatization reduces the word-forms to linguistically valid lemmas. For example, the stem of the word ‘happy’ is ‘happi’, but its lemma is ‘happy’, which is linguistically valid.
  2. Lemmatization is usually more sophisticated and requires some sort of lexicon. Stemming, on the other hand, can be achieved with simple rule-based approaches.
  3. A stemmer operates on a single word without knowledge of the context, and cannot discriminate between words which have similar or different meanings depending on the part of speech. For example, the word ‘better’ has ‘good’ as its lemma; this link is missed by stemming. The sketch below illustrates the contrast.
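As a quick check of these differences, here is a minimal sketch reusing the stemmer and lemmatizer instances from earlier; note that ‘better’ needs the adjective POS tag for WordNet to link it to ‘good’:

# Stem vs. lemma for the examples above.
print(stemmer_ps.stem("happy"))                 # 'happi': a stem, not a dictionary word
print(lemmatizer.lemmatize("happy"))            # 'happy': a linguistically valid lemma
print(lemmatizer.lemmatize("better", pos="a"))  # 'good': WordNet links the comparative to its base
print(stemmer_ps.stem("better"))                # stemming cannot recover 'good'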

Conclusion

I hope this article was a good introduction to text preprocessing using stemming and lemmatization, and the associated differences between the two. Apart from these, there are many other tasks to be done before the corpus can be fed into a model to train, such as removal of newlines, special characters, conversion to lower case, etc. These will be covered in future articles. The full code used in this article can be accessed here.
