Python Sentiment Analysis With the NLTK Library [With Examples]

Sentiment analysis is a technique for extracting emotions and opinions from textual data. You can use it to determine what people actually believe, think, and feel about specific subjects or products.

Python’s popularity as a programming language has resulted in a wide range of sentiment analysis applications. The Natural Language Toolkit (NLTK) is a common library for sentiment analysis.

In this tutorial, you will learn the fundamentals to perform sentiment analysis using Python’s NLTK library.

How to Install and Import the NLTK Library in Python

Before you can begin sentiment analysis with NLTK, you first need to install the library and import it into your Python environment.

Pip is the default Python package installer, which you can use to install NLTK. Enter the following command into your command prompt:

pip install nltk

Once the installation is complete, you can import NLTK into your Python environment as shown below:

import nltk

Now you are good to go with NLTK sentiment analysis with Python.
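If you want to quickly confirm that the installation was successful, you can print the version of the library (the nltk package exposes a __version__ attribute):

import nltk

# Print the installed NLTK version to confirm the library is available
print(nltk.__version__)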

Tokenization and Stop Words Removal with NLTK

We must first preprocess our text input before doing sentiment analysis.

The text must be modified, with stop words removed and words stemmed. NLTK offers several functions to achieve these objectives.

Let’s have a look at a few of these functions:

How Do You Tokenize Text in Python?

Tokenization is the process of splitting text into discrete words or phrases. To do this, NLTK provides the word_tokenize() tokenizer, part of the nltk.tokenize package.

A tokenizer converts a piece of text into a list of tokens, allowing you to identify the words and punctuation in the string.

The code snippet below uses a word tokenizer available in the NLTK library to split the given text into words.

from nltk.tokenize import word_tokenize

text = "Hello, today we will learn about Python Sentiment Analysis with NLTK."
tokens = word_tokenize(text)
print(tokens)

When you execute this code you might see the following error:

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

The reason why you see the error “Resource punkt not found” is that this tokenizer needs Punkt sentence tokenization models to be installed on your machine.

To install Punkt sentence tokenization models run the following commands after opening the Python shell:

>>> import nltk
>>> nltk.download('punkt')

Here is what it will look like when you do this on your computer:

>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/codefather/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True

Now, execute the initial Python program to tokenize the string of text and confirm you see the following list of strings.

['Hello', ',', 'today', 'we', 'will', 'learn', 'about', 'Python', 'Sentiment', 'Analysis', 'with', 'NLTK', '.']

As you can see, we have split the text into words and punctuation.

To recap, in the code above, we have:

  • imported the word_tokenize() tokenizer from the nltk.tokenize package.
  • passed the text to analyze to word_tokenize().
  • received back a list of words and punctuation tokens.

As we will see later in this tutorial, NLTK also provides a sentence tokenizer to split text into sentences instead of words.

How to Remove Stopwords in Python

The text you analyze may contain words like “the”, “is”, and “an” that add little meaning to the analysis. These are called stop words, and you can remove them with the NLTK library.

We can use NLTK’s stopwords.words(), which provides a list of stop words, to exclude them from our analysis.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Hello, today we will learn about Python Sentiment Analysis with NLTK."
tokens = word_tokenize(text)
print("Including stop words: ", tokens)

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print("Excluding stop words: ", filtered_tokens)

When you execute this code you might see the following error:

LookupError:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

In the same way as we have done before, open the Python shell and execute the following two Python statements.

>>> import nltk
>>> nltk.download('stopwords')

Then execute the Python program and confirm you see the following output:

Including stop words:  ['Hello', ',', 'today', 'we', 'will', 'learn', 'about', 'Python', 'Sentiment', 'Analysis', 'with', 'NLTK', '.']
Excluding stop words:  ['Hello', ',', 'today', 'learn', 'Python', 'Sentiment', 'Analysis', 'NLTK', '.']

In the code snippet above, we are extending the previous code to remove all the stop words present in our text. In this case, the stop words are:

  • we
  • will
  • about
  • with

We have imported all the English stop words from the nltk.corpus package and then used a Python list comprehension to loop through the tokens and filter out the stop words.
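One caveat to keep in mind for sentiment analysis: the default English stop word list includes negations like “not” and “no”, and removing them can flip the meaning of a sentence (think of “not good”). Here is a minimal sketch of how you could keep negations by removing them from the stop word set before filtering:

from nltk.corpus import stopwords

# Start from the default English stop word list
stop_words = set(stopwords.words('english'))

# Keep negations, since they matter for sentiment (e.g. "not good")
for negation in ("not", "no", "nor"):
    stop_words.discard(negation)

tokens = ["This", "product", "is", "not", "good"]
filtered = [token for token in tokens if token.lower() not in stop_words]
print(filtered)  # ['product', 'not', 'good']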

What Is the Meaning of Stemming in Python?

Stemming is the process of reducing a word to its root form. For example, the words “programming” and “programs” are both reduced to the stem “program”.

As a result, the vocabulary shrinks, which can make sentiment analysis simpler and more consistent. The NLTK library provides several stemming algorithms, including PorterStemmer and SnowballStemmer.

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = "Python programming is becoming very popular."
tokens = word_tokenize(text)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)

Here is what we get back when we use NLTK to perform the stemming process.

['python', 'program', 'is', 'becom', 'veri', 'popular', '.']

In the code above, we import PorterStemmer from nltk.stem, tokenize the text, and then pass each token to the stemmer.

This reduces each word to its root form. For example, programming is stemmed to the simpler form program.

When I executed this code, I wondered why the stemmed form of very is veri. The reason is that stemming simply truncates words according to a set of rules to simplify the analysis, so the shorter version is not guaranteed to be an actual word.
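NLTK also ships the SnowballStemmer mentioned earlier. If you are curious how it compares to PorterStemmer, here is a small sketch that runs both stemmers on the same words (note that SnowballStemmer takes the language as an argument):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # SnowballStemmer requires a language

# Compare the two stemmers on the same words
for word in ["programming", "becoming", "very", "popular"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))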

Let’s now see a different process that, given a word as input, returns a shorter form that has a meaning.

What Is Lemmatization in Python?

Let’s have a look at the process called lemmatization, which also brings words to their basic form. The difference from stemming is that lemmatization produces words that have a meaning.

Is lemmatization better than stemming? Let’s find out!

As an example, we will use the same phrase we stemmed previously, but instead of using a stemmer we will use a lemmatizer.

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

text = "Python programming is becoming very popular."
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)

To execute this code you will have to download the WordNet lexical database using the commands below in the Python shell:

>>> import nltk
>>> nltk.download('wordnet')

Here is the output of the lemmatization process:

['Python', 'programming', 'is', 'becoming', 'very', 'popular', '.']

Hmmm…the lemmatized version is identical to the original phrase. This is because, by default, the lemmatize() method treats every token as a noun.

The lemmatize() method also accepts a second argument that represents the Part of Speech (POS) tag. For example, in this case we can pass “v”, which stands for “verb”.

Update the following piece of code and rerun the program:

lemmatized_tokens = [lemmatizer.lemmatize(token, "v") for token in tokens]

Here is the output:

['Python', 'program', 'be', 'become', 'very', 'popular', '.']

The verb “programming” has been lemmatized to “program” and the verb “becoming” to “become”.
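Passing the POS tag by hand does not scale to longer text. A common approach is to let NLTK’s pos_tag() function guess the part of speech and map its Treebank tags to WordNet tags. The sketch below assumes you have downloaded the tagger model with nltk.download('averaged_perceptron_tagger'); the treebank_to_wordnet() helper is just an illustrative name:

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Run nltk.download('averaged_perceptron_tagger') once before using pos_tag()

def treebank_to_wordnet(tag):
    # Map Treebank tag prefixes to WordNet POS constants (noun as fallback)
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("Python programming is becoming very popular.")
lemmatized_tokens = [lemmatizer.lemmatize(token, treebank_to_wordnet(tag))
                     for token, tag in pos_tag(tokens)]
print(lemmatized_tokens)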

We have seen some of the functions in the NLTK library, and you should now have a better understanding of what NLTK is and how it works.

Now let’s jump into some sentiment analysis using NLTK.

How Does NLTK Do Sentiment Analysis?

In this example, we will use a sentiment analysis classifier called VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER comes pre-trained in NLTK and categorizes the sentiment intensity of sentences as positive, negative, or neutral.

This pre-trained sentiment analysis classifier was built using a large dataset of social media text and achieves a high degree of accuracy on this type of content. Given some text, it’s able to provide a sentiment score.

Let’s look at how you can use this classifier for sentiment analysis:

from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "I love this product! It's amazing."
scores = analyzer.polarity_scores(text)
print(scores)
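Note: just like the tokenizer models, the VADER lexicon must be downloaded once. If the code above raises a LookupError mentioning vader_lexicon, run the following commands in the Python shell:

>>> import nltk
>>> nltk.download('vader_lexicon')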

In the code above, we import SentimentIntensityAnalyzer to determine whether a sentence is positive, negative, or neutral.

We pass some text to the classifier and then check its polarity scores, which tell us whether the sentence has a positive or negative sentiment. In our case, the sentence is about 73% positive.

{'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}

But…

What does compound mean in the output of the sentiment intensity analyzer?

The compound score is a single score, normalized to the range from -1 to +1, that represents the overall sentiment of the text. A compound score close to +1 indicates a strongly positive sentiment, whereas a score close to -1 indicates a strongly negative sentiment.

You can use these scores to categorize the sentiment of your text data.
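For example, a minimal way to turn the compound score into a label is to apply a threshold around zero. The sketch below uses the ±0.05 cutoffs commonly used with VADER; the classify_sentiment() helper is just an illustrative name:

from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify_sentiment(text, threshold=0.05):
    # Label text as positive, negative, or neutral based on the compound score
    compound = analyzer.polarity_scores(text)['compound']
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this product! It's amazing."))  # positive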

Here is how we can compare the sentiment intensity of two sentences to understand which one is more positive than the other.

from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

review1 = "I love this product! It's amazing."
review2 = "This product is terrible. I hate it."

review1_score = analyzer.polarity_scores(review1)
print("Score for Review #1: {}".format(review1_score))
review2_score = analyzer.polarity_scores(review2)
print("Score for Review #2: {}".format(review2_score))

if review1_score['compound'] > review2_score['compound']:
    print("The review that has a more positive sentiment is Review #1: \"{}\"".format(review1))
else:
    print("The review that has a more positive sentiment is Review #2: \"{}\"".format(review2))

The code above is self-explanatory: we classify two reviews as positive or negative based on the compound score calculated by the classifier.

Here are the scores of the two reviews:

Score for Review #1: {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}
Score for Review #2: {'neg': 0.63, 'neu': 0.37, 'pos': 0.0, 'compound': -0.7783}

Based on the compound scores, you can see that review1 is a positive review and review2 is a negative review.

The classifier therefore confirms that the first review has a more positive sentiment than the second one:

The review that has a more positive sentiment is Review #1: "I love this product! It's amazing."

You can play with the code and pass different sentences to see different results.

How Do You Identify Sentences in Text Using Python NLTK?

In the first example of this tutorial, we used a word tokenizer when analyzing the tokens in a sentence.

The NLTK library also provides a sentence tokenizer that allows you to identify the sentences within a text string instead of the individual words.

Let’s see how the same text can be split based on words or sentences.

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

text = "NLTK stands for Natural Language Toolkit. Today we will learn about Python Sentiment Analysis with NLTK."
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words: ", words)
print("Sentences: ", sentences)

Here is the output:

Words:  ['NLTK', 'stands', 'for', 'Natural', 'Language', 'Toolkit', '.', 'Today', 'we', 'will', 'learn', 'about', 'Python', 'Sentiment', 'Analysis', 'with', 'NLTK', '.']
Sentences:  ['NLTK stands for Natural Language Toolkit.', 'Today we will learn about Python Sentiment Analysis with NLTK.']

To get the words for each sentence you can apply word_tokenize to every sentence using a list comprehension.

words_in_sentences = [word_tokenize(sentence) for sentence in sentences]
print("Words in sentences: ", words_in_sentences)

The result is a list of lists: the words in the first list belong to the first sentence and the words in the second list belong to the second sentence.

Words in sentences:  [['NLTK', 'stands', 'for', 'Natural', 'Language', 'Toolkit', '.'], ['Today', 'we', 'will', 'learn', 'about', 'Python', 'Sentiment', 'Analysis', 'with', 'NLTK', '.']]
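Sentence tokenization pairs naturally with the sentiment classifier we used earlier: instead of scoring a whole text, you can score each sentence individually. Here is a small sketch that combines sent_tokenize() with SentimentIntensityAnalyzer:

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

analyzer = SentimentIntensityAnalyzer()
text = "I love this product. The delivery was terrible though."

# Score every sentence separately to see how sentiment varies within the text
for sentence in sent_tokenize(text):
    compound = analyzer.polarity_scores(sentence)['compound']
    print("{:+.4f}  {}".format(compound, sentence))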

How Do You Extract the Frequency of Words With NLTK?

What else can we do with text analysis in Python?

Imagine you want to identify the frequency distribution of each word in a document.

To identify the frequency distribution of tokens in a text you can use nltk.probability.FreqDist available in the NLTK library.

Before creating a frequency distribution you have to extract tokens from the text in the way you have learned in this tutorial.

Then you can pass the list of tokens to FreqDist (which accepts any iterable of tokens).

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text = "NLTK stands for Natural Language Toolkit. Today we will learn about Python Sentiment Analysis with NLTK."
words = word_tokenize(text)
frequency_distribution = FreqDist(words)
print(frequency_distribution)

If you execute this program you get the following:

<FreqDist with 16 samples and 18 outcomes>
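Continuing from the program above, a FreqDist behaves like a dictionary of counts, so you can look up individual tokens or ask for the most frequent ones with its most_common() method:

# Show the three most frequent tokens and their counts
print(frequency_distribution.most_common(3))

# Look up the count of a single token
print(frequency_distribution['NLTK'])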

What else can we do with the data returned by FreqDist?

One useful thing we can do is plot the frequency distribution of each token in the text using FreqDist.plot(). Note that plot() relies on the matplotlib library, which you can install with pip install matplotlib.

frequency_distribution.plot()

You will see the following graph.

[Plot: frequency distribution of each token in the text]

Conclusion

In this post, we covered the fundamentals of sentiment analysis using Python with NLTK.

We learned how to install and import Python’s Natural Language Toolkit (NLTK), as well as how to preprocess text with NLTK capabilities like word tokenization, stop word removal, stemming, and lemmatization.

In addition, we classified the sentiment of our text data using NLTK’s pre-trained sentiment analysis classifier.

Businesses may use sentiment analysis to learn what their customers think and feel about the goods and services they provide.


Claudio Sabato

Claudio Sabato is an IT expert with over 15 years of professional experience in Python programming, Linux Systems Administration, Bash programming, and IT Systems Design. He is a professional certified by the Linux Professional Institute.

With a Master’s degree in Computer Science, he has a strong foundation in Software Engineering and a passion for robotics with Raspberry Pi.


