How To Perform Review or Sentiment Analysis in Python?
Table of Contents
Introduction About NLP
Importing Necessary Libraries and Dataset
Removing Noise from dataset
Model Preparation
Checking accuracy of model
Conclusion
Introduction About NLP
A large amount of the data generated today is unstructured and requires processing to extract insights from it. Some examples are news articles, social media posts, product reviews on e-commerce sites, and so on.
The process of analyzing this data and deriving insights from it falls under the field of NLP.
Now, you may be wondering what NLP is. NLP stands for Natural Language Processing, the branch of AI that studies how machines interact with human language. Examples of NLP applications are chatbots, spell-checkers, language translators, sentiment analysis, and so on.
In this article we work with a dataset of sentences, each labelled with a positive or negative sentiment.
Importing Necessary Libraries and Dataset
We are going to use the NLTK package in Python for all the NLP tasks in this article. If you have not installed the nltk package yet, run the command below in your command prompt to install it.
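pip install nltk
The pandas, numpy and scikit-learn packages used in the imports below can be installed the same way if they are not already available.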
# import required packages
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to C:\Users\Imurgence\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to C:\Users\Imurgence\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\wordnet.zip.
# Download the data sets by clicking on the links
# amazon reviews data set
# imdb review data set
# yelp reviews data set
# Importing Dataset
am = pd.read_excel('amazon.xlsx', header=None)
im = pd.read_excel('imdb.xlsx',header=None)
ye = pd.read_excel('yelp.xlsx',header=None)
print("Shape of \namazon : {} \nimdb : {} \nyelp : {}".format(am.shape,im.shape,ye.shape))
Shape of
amazon : (1000, 2)
imdb : (748, 2)
yelp : (1000, 2)
# Combining the datasets
df = pd.concat([am,im,ye],axis=0, ignore_index=True)
df.columns = ['text','sentiment']
df.head(5)
text sentiment
0 So there is no way for me to plug it in here i... 0
1 Good case, Excellent value. 1
2 Great for the jawbone. 1
3 Tied to charger for conversations lasting more... 0
4 The mic is great. 1
df.info()
RangeIndex: 2748 entries, 0 to 2747
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 2744 non-null object
1 sentiment 2748 non-null int64
dtypes: int64(1), object(1)
memory usage: 43.1+ KB
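Before cleaning the text, it is worth checking how balanced the two sentiment classes are. A quick check (output not shown in the original article) could look like this:
# count how many reviews carry each label (0 = negative, 1 = positive)
print(df['sentiment'].value_counts())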
Removing Noise from Dataset
In this step we will remove noise from the dataset. Noise is any part of the text that does not add meaning or information that helps to predict the sentiment. The most common words in a language are called stop words; some examples are “is”, “am”, “are”, “a”, and so on. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.
Let’s see what the NLTK stop word list contains:
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
The list also contains the word “not”, which is essential for review or sentiment analysis because it flips the meaning of a sentence. So we first remove this word from the stop word list.
stopword = stopwords.words('english')
stopword.remove('not')
Now let’s build a corpus that contains only the essential, cleaned words for the model.
corpus = []
wordnet = WordNetLemmatizer()
for i in range(0, len(df)):
    # keep letters only; str() guards against the few missing text values
    review = re.sub('[^a-zA-Z]', ' ', str(df['text'][i]))
    review = review.lower()
    review = review.split()
    # lemmatize each word and drop the stop words (except 'not')
    review = [wordnet.lemmatize(word) for word in review if word not in stopword]
    review = " ".join(review)
    corpus.append(review)
Let’s have a look at what the cleaned corpus looks like:
corpus[0:4]
['way plug u unless go converter',
'good case excellent value',
'great jawbone',
'tied charger conversation lasting minute major problem']
corpus[2743:2747]
['think food flavor texture lacking',
'appetite instantly gone',
'overall not impressed would not go back',
'whole experience underwhelming think go ninja sushi next time']
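Notice the odd token “u” in the first cleaned sentence: it comes from the lemmatizer, whose default noun handling strips the trailing “s” from the lowercased “US”. A quick check on a couple of words (not part of the original article) makes this behaviour visible:
print(WordNetLemmatizer().lemmatize('conversations'))   # 'conversation'
print(WordNetLemmatizer().lemmatize('us'))               # 'u' -- the source of the token above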
Model Preparation
There are different ways to represent text for sentiment analysis, such as Bag of Words, TF-IDF, and Word2Vec. Word2Vec is typically used when we have a large dataset, while Bag of Words and TF-IDF work well for small datasets. In this article we are going to use the TF-IDF representation for sentiment analysis.
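To make the difference concrete, here is a minimal sketch, on a made-up two-sentence corpus (not part of the article's data), comparing the raw Bag of Words counts with the TF-IDF weighting used below:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
toy = ["good phone good battery", "bad battery"]
# Bag of Words: raw term counts; columns follow the sorted vocabulary
bow = CountVectorizer()
print(bow.fit_transform(toy).toarray())
print(bow.get_feature_names_out())   # ['bad' 'battery' 'good' 'phone'] (get_feature_names() on older scikit-learn)
# TF-IDF: the same counts re-weighted, so a word shared by every sentence
# (here 'battery') contributes less than the distinctive words
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(toy).toarray())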
# Creating TF-IDF Model
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:,-1]
# Splitting Dataset into Train and Test data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=42)
# Training Model with Naive Bayes
classifier = MultinomialNB()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
Checking accuracy of model
cm = confusion_matrix(y_test,y_pred)
acc = accuracy_score(y_test,y_pred)
print(" cm :\n {} \nacc : {}".format(cm,acc))
cm :
[[225 66]
[ 35 224]]
acc : 0.8163636363636364
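In this confusion matrix the rows are the true labels (negative first, then positive) and the columns are the predictions, so 225 negative and 224 positive reviews were classified correctly. If per-class precision and recall are also of interest, scikit-learn's classification_report (not used in the original article, shown here only as an optional check) can be printed from the same predictions:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))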
We got an accuracy of 81.63 %, which is a pretty good score. Since the combined data comes from three sources (product, movie and location reviews), building a separate model for each source could give an even better score; a rough sketch of that per-dataset approach is given after the prediction examples below. First, let’s check the combined model on some new sentences:
def review(text):
    # vectorize a single new sentence with the fitted TF-IDF vectorizer and predict its sentiment
    df1 = [text]
    X1 = cv.transform(df1).toarray()
    prediction = classifier.predict(X1)
    return prediction
# Create text labels to be returned
outlabel=["Negative Sentiment","Positive Sentiment"]
outlabel[int(review("This product is awesome"))]
'Positive Sentiment'
outlabel[int(review("Location is not good"))]
'Negative Sentiment'
outlabel[int(review("Movie was boring"))]
'Negative Sentiment'
outlabel[int(review("Great ambience"))]
'Positive Sentiment'
outlabel[int(review("Location is not good but moview was amazing"))]
'Positive Sentiment'
outlabel[int(review("moview was amazing but Location is not good"))]
'Positive Sentiment'
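Note that the last two mixed sentences get the same prediction regardless of word order: TF-IDF ignores ordering, so both sentences produce identical vectors. As mentioned above, building one model per review source may score higher than the combined model. The rows of df keep their original order (amazon first, then imdb, then yelp), so a rough sketch of training on a single source, for example the amazon reviews, could look like this (variable names are illustrative; whether it actually beats the combined model would need to be checked):
# amazon reviews occupy the first len(am) rows of the combined frame
am_corpus = corpus[:len(am)]
am_y = df['sentiment'].iloc[:len(am)]
# vectorize, split and train exactly as before, but on one source only
am_cv = TfidfVectorizer()
am_X = am_cv.fit_transform(am_corpus).toarray()
am_X_train, am_X_test, am_y_train, am_y_test = train_test_split(am_X, am_y, test_size=0.2, random_state=42)
am_model = MultinomialNB()
am_model.fit(am_X_train, am_y_train)
print(accuracy_score(am_y_test, am_model.predict(am_X_test)))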
Conclusion
This article built a basic sentiment analysis model using the NLTK and scikit-learn libraries. First, we imported the necessary libraries and removed the noise from the data. Then we built a model that associates reviews of products, movies and places with a sentiment, and checked that it works by giving the model new inputs. Do also read the article on time series using LSTM in Python.
About the Authors:
Sachin Kumar Gupta
Sachin is a Mechanical Engineer and a data science enthusiast. He loves to find trends in data and extract useful information from them. He has executed projects on Machine Learning and Deep Learning using Python.
Mohan Rai
Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years of experience in the field of Analytics and has worked as an Analytics SME in domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.