For the past couple of weeks I have been reading and learning Natural Language Processing (NLP) basics from Dr. Christopher Potts' (Stanford University, Department of Linguistics) online tutorial. Kaggle's knowledge-based competition, Sentiment Analysis on Movie Reviews, motivated me to learn the basics of NLP, which is a pretty interesting area of research.
I will be using Python (IPython notebook) to analyze the data and scikit-learn (a machine learning library for Python) to predict sentiment labels. The analysis and prediction done here are based on the scikit-learn "Working with Text Data" tutorial. The movie reviews are from the Rotten Tomatoes dataset. The sentiment labels are as follows:
0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
##########################################
# View files in the directory
ls
Out:
RottenTomatoes.ipynb train.tsv test.tsv
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Reading training and testing data (*.tsv: Tab-Separated Values)
train = pd.read_csv('train.tsv', sep = '\t')
test = pd.read_csv('test.tsv', sep = '\t')
# View training data (top 5 instances)
train.head()
Out:
# View testing data (top 5 instances)
test.head()
Out:
# Unique sentiment labels
train['Sentiment'].unique()
Out:
array([1, 2, 3, 4, 0])
# Type of data frame
type(train)
Out:
pandas.core.frame.DataFrame
# Summary of data (covers numerical columns only)
train.describe()
Out:
Extracting features from the text, i.e. converting text content into numerical feature vectors.
Explanations of Bag of Words, tokenizing text, Term Frequency, Term Frequency times Inverse Document Frequency, and the Naive Bayes classifier are still in progress...
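Until those explanations are written up, here is a minimal sketch of the bag-of-words idea on three made-up phrases. The phrases and the `toy_` names are assumptions for illustration and are not part of the Rotten Tomatoes data. Each phrase becomes a row of token counts, with one column per distinct token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases (hypothetical, not taken from the competition data)
toy_phrases = ["good movie", "bad movie", "good good acting"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_phrases)

# vocabulary_ maps each token to its column index
toy_vocab = dict(sorted(toy_vectorizer.vocabulary_.items()))
print(toy_vocab)

# Dense view of the sparse count matrix: one row per phrase, one column per token
toy_dense = toy_counts.toarray()
print(toy_dense)
```

The word "good" occurs twice in the third phrase, so its column holds a 2 in that row. Word order is discarded, which is why the representation is called a "bag" of words.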
# Tokenizing text with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
x_train_counts = count_vector.fit_transform(train['Phrase'])
# Dimensions of the training data count vector
x_train_counts.shape
Out:
(156060, 15240)
# Get index of some common words/n-grams/consecutive characters
# For example: 'movie'
count_vector.vocabulary_.get(u'movie')
Out:
8791
# Get feature names (scikit-learn 1.0+ uses get_feature_names_out(); older releases used get_feature_names())
count_vector.get_feature_names_out()
Out:
# Converting occurrences to frequencies
from sklearn.feature_extraction.text import TfidfTransformer
## Term Frequencies (tf)
# Use fit() method to fit estimator to the data
tf_transformer = TfidfTransformer(use_idf = False).fit(x_train_counts)
# Use transform() method to transform count-matrix to 'tf' representation
x_train_tf = tf_transformer.transform(x_train_counts)
## Term Frequency times Inverse Document Frequency (tf-idf)
tfidf_transformer = TfidfTransformer()
# Use fit_transform() method to fit the estimator and transform count-matrix to 'tf-idf' representation
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
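To make the tf-idf weighting concrete, the sketch below recomputes it by hand on a toy corpus and compares the result with `TfidfTransformer`. It assumes the transformer's default settings (smoothed idf and L2 row normalization). The phrases and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy phrases (hypothetical, not taken from the competition data)
toy_phrases = ["good movie", "bad movie", "good good acting"]
toy_counts = CountVectorizer().fit_transform(toy_phrases)

# tf-idf as computed by scikit-learn with default settings
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

# The same computation by hand
counts = toy_counts.toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # number of phrases containing each token
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
manual = counts * idf                              # term frequency times idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(np.allclose(toy_tfidf, manual))
```

Tokens that appear in many phrases (here "good" and "movie") get a smaller idf than tokens that appear in only one, so they contribute less to the final feature vector.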
## Training a classifier to predict sentiment label of a phrase
# Naive Bayes Classifier (Multinomial)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train_tfidf, train['Sentiment'])
## Prediction on test data
# Tokenizing test phrase
x_test_counts = count_vector.transform(test['Phrase'])
# Use transform() method to transform test count-matrix to 'tf-idf' representation
x_test_tfidf = tfidf_transformer.transform(x_test_counts)
# Prediction
predicted = clf.predict(x_test_tfidf)
# View predictions (j is already the predicted label for phrase i)
for i, j in zip(test['PhraseId'], predicted):
    print(i, j)
Out:
# Writing *.csv file for Kaggle submission
with open('Rotten_Sentiment.csv', 'w') as csvfile:
    # Header row, then one 'PhraseId,Sentiment' row per test phrase
    csvfile.write('PhraseId,Sentiment\n')
    for i, j in zip(test['PhraseId'], predicted):
        csvfile.write('{},{}\n'.format(i, j))
Out:
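As an alternative to writing the rows by hand, the same two-column file can be built with pandas. This is only a sketch: the ids and labels below are made-up stand-ins for `test['PhraseId']` and `predicted`, and the CSV is returned as a string rather than written to disk.

```python
import pandas as pd

# Hypothetical stand-ins for test['PhraseId'] and predicted
toy_ids = [156061, 156062, 156063]
toy_predicted = [2, 3, 1]

submission = pd.DataFrame({"PhraseId": toy_ids, "Sentiment": toy_predicted})

# index=False keeps the row index out of the file; with no path given,
# to_csv() returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'Rotten_Sentiment.csv' as the first argument to `to_csv()` would write the file instead.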
Finally, I submitted the sentiment label predictions for the test data through the competition's submission link and got a score of 0.58289. Below is a screenshot of the leaderboard standings.
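Since the test labels are not public, one way to get a rough idea of the score before submitting is to hold out part of the training data. The sketch below does this on a tiny made-up dataset, assuming the competition score is plain classification accuracy. The phrases, labels, and step names are hypothetical. It chains the same three steps used above in a scikit-learn `Pipeline`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled phrases (hypothetical stand-ins for train['Phrase'] / train['Sentiment'])
toy_phrases = [
    "a wonderful film", "truly wonderful acting", "great fun", "great story",
    "a dull film", "truly dull acting", "awful mess", "awful story",
]
toy_labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Hold out 25% of the labelled phrases for validation, keeping class balance
fit_phrases, val_phrases, fit_labels, val_labels = train_test_split(
    toy_phrases, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Counting, tf-idf weighting, and Naive Bayes chained into one estimator
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
text_clf.fit(fit_phrases, fit_labels)

# Fraction of held-out phrases whose label was predicted correctly
val_accuracy = accuracy_score(val_labels, text_clf.predict(val_phrases))
print(val_accuracy)
```

With the real data, `train['Phrase']` and `train['Sentiment']` would take the place of the toy lists. The pipeline also ensures the vectorizer and tf-idf weights are fitted only on the training split.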