Friday, December 5, 2014

Sentiment Analysis on Rotten Tomatoes Movie Reviews

For the past couple of weeks I have been learning the basics of Natural Language Processing (NLP) from the online tutorial by Dr. Christopher Potts (Stanford University, Department of Linguistics). Kaggle's knowledge-based competition, Sentiment Analysis on Movie Reviews, motivated me to learn the basics of NLP (a pretty interesting area of research).

I will be using Python (IPython notebook) to analyze the data and scikit-learn (a machine learning library for Python) to predict sentiment labels. The analysis and prediction done here are based on the scikit-learn Working with Text Data tutorial. The movie reviews are from the Rotten Tomatoes dataset. The sentiment labels are as follows:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
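
To make predictions easier to read later on, the five labels above can be kept in a small lookup table. This is only a convenience sketch; the names SENTIMENT_NAMES and label_name are my own and are not part of the dataset or of scikit-learn.

```python
# Hypothetical helper (names are assumptions, not from the dataset):
# map each numeric sentiment label to its description.
SENTIMENT_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def label_name(label):
    """Return the description for a numeric sentiment label."""
    # int() also accepts NumPy integer types, e.g. values from a prediction array
    return SENTIMENT_NAMES[int(label)]


print(label_name(3))
```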

##########################################
# View files in the directory
ls

Out:
RottenTomatoes.ipynb    train.tsv    test.tsv

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading training and testing data (*tsv : Tab-Separated Values)
train = pd.read_csv('train.tsv', sep = '\t')
test = pd.read_csv('test.tsv', sep = '\t')

# View training data (top 5 instances)
train.head()

Out:


# View testing data (top 5 instances)
test.head()

Out:


# Unique sentiment labels
train['Sentiment'].unique()

Out:
array([1, 2, 3, 4, 0])

# Type of data frame
type(train)

Out:
pandas.core.frame.DataFrame

# Summary of data (Works only for numerical data)
train.describe()

Out:

Extracting features from the text, i.e. converting text content into numerical feature vectors.

Working on explanations of Bag of Words, Tokenizing text, Term Frequency, Term Frequency times Inverse Document Frequency, Naive Bayes Classifier...
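
As a rough illustration of the bag-of-words idea: every distinct token gets a column, and each phrase becomes a row of token counts. The three phrases below are made up for the example (they are not from the Rotten Tomatoes data), and the toy_ names are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases, invented for illustration only
toy_phrases = [
    "good movie",
    "bad movie",
    "good good plot",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_phrases)

# vocabulary_ maps each token to its column index
toy_vocab = toy_vectorizer.vocabulary_
# Dense view of the sparse count matrix: one row per phrase, one column per token
toy_dense = toy_counts.toarray()

print(sorted(toy_vocab.items(), key=lambda item: item[1]))
print(toy_dense)
```

Word order is discarded: "good good plot" is stored only as two occurrences of "good" and one of "plot".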

# Tokenizing text with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
x_train_counts = count_vector.fit_transform(train['Phrase'])

# Dimensions of the training data count vector
x_train_counts.shape

Out:
(156060, 15240)


# Get index of some common words/n-grams/consecutive characters
# For example: 'movie'
count_vector.vocabulary_.get(u'movie')

Out:
8791

# Get feature names (get_feature_names_out() in scikit-learn 1.0+;
# older releases used get_feature_names())
count_vector.get_feature_names_out()

Out:

# Converting occurrences to frequencies
from sklearn.feature_extraction.text import TfidfTransformer

## Term Frequencies (tf)
# Use fit() method to fit estimator to the data
tf_transformer = TfidfTransformer(use_idf = False).fit(x_train_counts)
# Use transform() method to transform count-matrix to 'tf' representation
x_train_tf = tf_transformer.transform(x_train_counts)

## Term Frequency times Inverse Document Frequency (tf-idf)
tfidf_transformer = TfidfTransformer()
# Use transform() method to transform count-matrix to 'tf-idf' representation
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
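
To see what the two transformers above do, here is a small check on a made-up count matrix. It assumes the TfidfTransformer defaults (norm='l2', smooth_idf=True), under which the inverse document frequency is ln((1 + n_docs) / (1 + doc_freq)) + 1 and every row is then scaled to unit length. The matrix and variable names are placeholders for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms
dense_counts = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [2, 0, 0],
])
toy_counts = csr_matrix(dense_counts)

# use_idf=False: each row of counts is only scaled to unit Euclidean (L2) length
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()

# Defaults: counts are weighted by idf, then each row is L2-normalized
toy_tfidf_model = TfidfTransformer()
toy_tfidf = toy_tfidf_model.fit_transform(toy_counts).toarray()

# Manual version of the same computation, for comparison
n_docs = dense_counts.shape[0]
doc_freq = (dense_counts > 0).sum(axis=0)  # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual_tfidf = dense_counts * manual_idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

print(np.allclose(toy_tfidf, manual_tfidf))
```

The first term appears in every document, so it gets the smallest idf weight; rarer terms are weighted up.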


## Training a classifier to predict sentiment label of a phrase
# Naive Bayes Classifier (Multinomial)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train_tfidf, train['Sentiment'])
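
The counting, tf-idf, and classifier steps above can also be chained into a single scikit-learn Pipeline, so that raw phrases go in and labels come out. The sketch below uses a handful of invented phrases instead of train.tsv; the step names 'vect', 'tfidf', and 'clf' are arbitrary labels.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy phrases and labels, invented for illustration (0 = negative, 4 = positive)
toy_phrases = [
    "a wonderful film",
    "wonderful acting",
    "a dull film",
    "dull and boring",
]
toy_labels = [4, 4, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),     # phrases -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", MultinomialNB()),        # tf-idf weights -> sentiment label
])
text_clf.fit(toy_phrases, toy_labels)

# The fitted pipeline applies the same vocabulary and idf weights to new text
toy_predicted = text_clf.predict(["wonderful", "boring"])
print(toy_predicted)
```

With the real data, the same object would be fitted with text_clf.fit(train['Phrase'], train['Sentiment']).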

## Prediction on test data
# Tokenizing test phrase
x_test_counts = count_vector.transform(test['Phrase'])
# Use transform() method to transform test count-matrix to 'tf-idf' representation
x_test_tfidf = tfidf_transformer.transform(x_test_counts)

# Prediction
predicted = clf.predict(x_test_tfidf)
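
The test file has no Sentiment column, so these predictions cannot be scored locally. One way to get a rough accuracy estimate before submitting is to hold out part of the labelled training data. This is a sketch on invented phrases, not what I ran for the submission; the split size and random_state are arbitrary choices.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in for train['Phrase'] and train['Sentiment'] (invented data)
toy_phrases = [
    "great film", "great cast", "great fun",
    "awful film", "awful cast", "awful mess",
] * 5
toy_labels = [4, 4, 4, 0, 0, 0] * 5

# Keep 20% of the labelled phrases aside; stratify keeps the class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_phrases, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(X_fit, y_fit)

# Fraction of held-out phrases whose label was predicted correctly
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(val_accuracy)
```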

# View predictions
for i, j in zip(test['PhraseId'], predicted):
    print(i, j)

Out:

# Writing *csv file for Kaggle submission
with open('Rotten_Sentiment.csv', 'w') as csvfile:
    # Header row, then one "PhraseId,Sentiment" line per test phrase
    csvfile.write('PhraseId,Sentiment\n')
    for i, j in zip(test['PhraseId'], predicted):
        csvfile.write('{},{}\n'.format(i, j))

Out:


Finally, I submitted the sentiment label predictions for the test data through the competition's submission link and got a score of 0.58289. Below is a screenshot of the leaderboard standings.




