Friday, December 19, 2014

Image Classification: Dogs Vs Cats

I wanted to learn how machine learning is used to classify images (image recognition). While browsing Kaggle's past competitions I found the Dogs Vs Cats image classification competition (here one needs to classify whether an image contains a dog or a cat). A Google search helped me get started. Here are some of the references that I found quite useful: Yhat's Image Classification in Python and the scikit-image Tutorial. The data is available here. I am using the first 501 dog images and the first 501 cat images from the train data folder. For testing, I selected the first 100 images from the test data folder and manually labeled them for verification.

##########################################
# View files in the directory
ls

Out:
Image_Classification.ipynb    data/

# View files in the data directory
ls data

Out:
data/    test/    train/

# Import necessary libraries
import pandas as pd
import numpy as np

from skimage import io
from matplotlib import pyplot as plt


# Define location of data
import os
train_directory = "./data/train/"
test_directory = "./data/test/"

# Define a function to return a list containing the names of the files in a directory given by path
def images(image_directory):
    return [image_directory+image for image in os.listdir(image_directory)]

images(train_directory)

Out:

In the training directory, the image filename indicates the label (cat or dog), so the labels need to be extracted from the filenames.

## Extracting training image labels
train_image_names = images(train_directory)

# Function to extract labels
def extract_labels(file_names):
    '''Create labels from file names: Cat = 0 and Dog = 1'''
    
    # Create empty vector of length = no. of files, filled with zeros 
    n = len(file_names)
    y = np.zeros(n, dtype = np.int32)
    
    # Enumerate gives index
    for i, filename in enumerate(file_names):
        
        # If 'cat' string is in file name assign '0'
        if 'cat' in str(filename):
            y[i] = 0
        else:
            y[i] = 1
    return y

extract_labels(train_image_names)

Out:
array([0, 0, 0, ..., 1, 1, 1], dtype=int32)

# Save labels
y = extract_labels(train_image_names)

# Save labels: np.save(file or string, array)
np.save('y', y)
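
As a quick check, the saved labels can be reloaded later with np.load (np.save appends the '.npy' extension automatically):

# Reload the saved labels from disk
y_loaded = np.load('y.npy')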

# Images in test directory
images(test_directory)

Out:


## View image: Dog
# from skimage import io # (imported earlier)
temp = io.imread('./data/train/dog.20.jpg') 
plt.imshow(temp)

Out:
## View image: Cat
# from skimage import io # (imported earlier)
temp = io.imread('./data/train/cat.4.jpg') 
plt.imshow(temp)

Out:

Sorting the image folder, I found that the images are of different sizes (max size = cat.835.jpg, min size = cat.4821.jpg). We need a standard size for the analysis.

# Get size of images (Ref: stackoverflow)
from PIL import Image

image_size = []

for i in train_image_names: # images(file_directory)
    im = Image.open(i)
    image_size.append(im.size) # A list with tuples: [(x, y), …]

# Get mean of image size (Ref: stackoverflow)
# Loop variable named 'dim' so it does not clobber the label vector 'y'
[sum(dim) / len(dim) for dim in zip(*image_size)]

Out: [403, 358]
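
The size range can also be confirmed programmatically rather than by sorting the folder; a quick sketch using the image_size list built above:

# Find the smallest and largest image dimensions by pixel count
min(image_size, key = lambda s: s[0] * s[1])
max(image_size, key = lambda s: s[0] * s[1])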

Transforming the image: Standard size = (400, 350)

## Transforming the image
# Set up a standard image size based on approximate mean size

STANDARD_SIZE = (400, 350)

Code below copied from: Yhat's Image Classification in Python

# Function to read image, change image size and transform image to matrix
def img_to_matrix(filename, verbose=False):
    '''
    takes a filename and turns it into a numpy array of RGB pixels
    '''
    img = Image.open(filename)
    if verbose:
        print "Changing size from %s to %s" % (str(img.size), str(STANDARD_SIZE))
    img = img.resize(STANDARD_SIZE)
    img = list(img.getdata())
    img = map(list, img)
    img = np.array(img)
    return img

# Function to flatten numpy array
def flatten_image(img):
    
    '''
    takes in an (m, n) numpy array and flattens it 
    into an array of shape (1, m * n)
    '''
    s = img.shape[0] * img.shape[1]
    img_wide = img.reshape(1, s)
    return img_wide[0]
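
For reference, the same resize-and-flatten pipeline could be written more compactly with NumPy; a minimal sketch (not used below, and it assumes RGB images like the two functions above):

# Compact alternative: resize and flatten in one step
def img_to_vector(filename):
    '''resize an image and return its RGB pixels as a flat 1-D numpy array'''
    img = Image.open(filename).resize(STANDARD_SIZE)
    return np.asarray(img).flatten()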

## Prepare training data
data = []
for i in images(train_directory):
    img = img_to_matrix(i)
    img = flatten_image(img)
    data.append(img)
    
data = np.array(data)
data.shape

Out: (1002, 420000)

data[1].shape

Out: (420000, )

That comes to 420,000 features per image (400 × 350 pixels × 3 RGB channels), which is a lot for many algorithms to deal with, so the number of dimensions should be reduced. For this we can use an unsupervised learning technique called Principal Component Analysis (PCA), which identifies patterns in the data to reduce its dimensionality with minimum loss of information.

# Import PCA
from sklearn.decomposition import PCA

# PCA on training data
pca = PCA(n_components = 2)
X = pca.fit_transform(data)
X.size

Out: 2004

X[:, 0].size

Out: 1002

X[:, 1].size

Out: 1002

# Create a dataframe
df = pd.DataFrame({"x-1": X[:, 0], "x-2": X[:, 1], "label" : np.where(y == 1, "Dog", "Cat")})
df

Out: 



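Plotting the two components gives a quick visual check of how separable the two classes are; a minimal matplotlib sketch using the dataframe above:

# Scatter plot of the two principal components, colored by label
colors = {"Dog": "red", "Cat": "blue"}
for label in ["Dog", "Cat"]:
    subset = df[df["label"] == label]
    plt.scatter(subset["x-1"], subset["x-2"], c = colors[label], label = label)
plt.legend()
plt.show()
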
# Variance explained by the first two principal components
np.sum(pca.explained_variance_ratio_)

Out: 0.6461222455062432

Here the 2-dimensional PCA captures 64.6% of the variation.
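
To see how much more variance additional components would capture, one could fit a larger PCA and inspect the cumulative ratio; an exploratory sketch:

# Cumulative variance explained vs. number of components (exploratory)
pca_10 = PCA(n_components = 10)
pca_10.fit(data)
np.cumsum(pca_10.explained_variance_ratio_)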

## Prepare testing data: PCA
test_images = images(test_directory)

test = []
for i in test_images:
    img = img_to_matrix(i)
    img = flatten_image(img)
    test.append(img)

test = np.array(test)
test.shape

Out: (100, 420000)

# Transforming test data
# Use transform() here (not fit_transform), so the test images are projected
# onto the principal components learned from the training data
testX = pca.transform(test)
testX.shape[1]

Out: 2

## Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
logreg = clf.fit(X, y)

# Predict using Logistic Regression
y_predict_logreg = logreg.predict(testX)
y_predict_logreg
Out:


## Logistic Regression: Accuracy

# Load 'Actual' labels for test data
actual = pd.read_csv('ActualLabels.csv')
actual['Labels'].head()

Out:

logreg_accuracy = np.where(y_predict_logreg == actual['Labels'], 1, 0).sum() / float(len(actual))

logreg_accuracy

Out: 0.54

54% of the images were correctly classified using logistic regression (2D PCA)
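
The same number can be computed with scikit-learn's metrics module; an equivalent one-liner:

# Equivalent accuracy computation using scikit-learn
from sklearn.metrics import accuracy_score
accuracy_score(actual['Labels'], y_predict_logreg)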


## KNN classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X, y)

Out: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')


# Predict using KNN classifier
y_predict_knn = knn.predict(testX)
y_predict_knn
Out:

## KNN: Accuracy
knn_accuracy = np.where(y_predict_knn == actual['Labels'], 1, 0).sum()/float(len(actual))
knn_accuracy

Out: 0.52

52% of the images were correctly classified using KNN (2D PCA)

More sophisticated approaches, for example Support Vector Machines or Neural Networks, might classify images with higher accuracy.
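
As an illustration, a Support Vector Machine could be dropped into the same pipeline; a minimal, untuned sketch (so no accuracy claim here):

# Support Vector Machine on the same 2D PCA features (untuned sketch)
from sklearn.svm import SVC
svm = SVC(kernel = 'linear')
svm.fit(X, y)
y_predict_svm = svm.predict(testX)
svm_accuracy = np.where(y_predict_svm == actual['Labels'], 1, 0).sum() / float(len(actual))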

Friday, December 5, 2014

Sentiment Analysis on Rotten Tomatoes Movie Reviews

For the past couple of weeks I have been reading and learning Natural Language Processing (NLP) basics from Dr. Christopher Potts' (Stanford University, Department of Linguistics) online tutorial. Kaggle's knowledge-based competition, Sentiment Analysis on Movie Reviews, motivated me to learn the basics of NLP (a pretty interesting area of research).

I will be using Python (IPython notebook) to analyze the data and scikit-learn (a machine learning library for Python) to predict sentiment labels. The analysis and prediction done here are based on the scikit-learn Working with Text Data tutorial. The movie reviews are from the Rotten Tomatoes dataset. The sentiment labels are as follows:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

##########################################
# View files in the directory
ls

Out:
RottenTomatoes.ipynb    train.tsv    test.tsv

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading training and testing data (*.tsv : Tab-Separated Values)
train = pd.read_csv('train.tsv', sep = '\t')
test = pd.read_csv('test.tsv', sep = '\t')

# View training data (top 5 instances)
train.head()

Out:


# View testing data (top 5 instances)
test.head()

Out:


# Unique sentiment labels
train['Sentiment'].unique()

Out:
array([1, 2, 3, 4, 0])

# Type of data frame
type(train)

Out:
pandas.core.frame.DataFrame

# Summary of data (Works only for numerical data)
train.describe()

Out:

Extracting features from the text, i.e. converting the text content into numerical feature vectors.

The steps below work through Bag of Words, tokenizing text, term frequency (tf), term frequency times inverse document frequency (tf-idf), and a Naive Bayes classifier.
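
To make the bag-of-words idea concrete, here is a toy example on two made-up phrases (the phrases are only for illustration):

# Toy bag-of-words example (made-up phrases)
from sklearn.feature_extraction.text import CountVectorizer
toy = CountVectorizer()
toy_counts = toy.fit_transform(['the movie was good', 'the movie was bad'])
toy.get_feature_names()  # ['bad', 'good', 'movie', 'the', 'was']
toy_counts.toarray()     # [[0 1 1 1 1],
                         #  [1 0 1 1 1]] - word counts per phrase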

# Tokenizing text with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
x_train_counts = count_vector.fit_transform(train['Phrase'])

# Dimensions of the training data count vector
x_train_counts.shape

Out:
(156060, 15240)


# Get the vocabulary index (column number) of a word
# For example: 'movie'
count_vector.vocabulary_.get(u'movie')

Out:
8791

# Get feature names
count_vector.get_feature_names()

Out:

# Converting occurrences to frequencies
from sklearn.feature_extraction.text import TfidfTransformer

## Term Frequencies (tf)
# Use fit() method to fit estimator to the data
tf_transformer = TfidfTransformer(use_idf = False).fit(x_train_counts)
# Use transform() method to transform count-matrix to 'tf' representation
x_train_tf = tf_transformer.transform(x_train_counts)

## Term Frequency times Inverse Document Frequency (tf-idf)
tfidf_transformer = TfidfTransformer()
# Use fit_transform() method to fit and transform the count-matrix to 'tf-idf' representation
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
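
As an aside, scikit-learn's TfidfVectorizer combines the counting and tf-idf steps into a single object; an equivalent alternative:

# One-step alternative: CountVectorizer + TfidfTransformer combined
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer()
x_train_tfidf_alt = tfidf_vector.fit_transform(train['Phrase'])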


## Training a classifier to predict sentiment label of a phrase
# Naive Bayes Classifier (Multinomial)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train_tfidf, train['Sentiment'])
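
Before predicting on the test data, cross-validation gives a rough accuracy estimate on the training set; a sketch (note: the import path is sklearn.cross_validation in scikit-learn versions of this era, sklearn.model_selection in later ones):

# Rough accuracy estimate via 5-fold cross-validation
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(MultinomialNB(), x_train_tfidf, train['Sentiment'], cv = 5)
scores.mean()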

## Prediction on test data
# Tokenizing test phrase
x_test_counts = count_vector.transform(test['Phrase'])
# Use transform() method to transform test count-matrix to 'tf-idf' representation
x_test_tfidf = tfidf_transformer.transform(x_test_counts)

# Prediction
predicted = clf.predict(x_test_tfidf)

# View predictions
for i, j in zip(test['PhraseId'], predicted):
    print(i, j)

Out:

# Writing *csv file for Kaggle submission
with open('Rotten_Sentiment.csv', 'w') as csvfile:
    csvfile.write('PhraseId,Sentiment\n')
    for i, j in zip(test['PhraseId'], predicted):
        csvfile.write('{},{}\n'.format(i, j))

Out:

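The same submission file can be written with pandas; an equivalent sketch:

# Equivalent submission file using pandas
submission = pd.DataFrame({'PhraseId': test['PhraseId'], 'Sentiment': predicted})
submission.to_csv('Rotten_Sentiment.csv', index = False)
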

Finally, I submitted the test data sentiment label predictions via the competition's submission page and got a score of 0.58289. Below is a screenshot of the leaderboard standings.