Data Science (Side Projects): November 2014

After reading Why is Python a language of choice for data scientists?, Is Python Becoming the King of the Data Science Forest? and other related blogs, I decided to brush up and improve my Python programming skills (after a couple of years of hiatus). At UCI I had intensively used Matlab, and I had learned Python from YouTube videos for some of my research calculations.

Last week I studied and practiced Python programming with Codecademy's online Python course. It is a really nice, easy-to-follow, interactive course. The estimated course time is 13 hours, but it took me nearly 26 hours to finish. After finishing the course, I decided to analyze data using Python to familiarize myself with Python's data analysis library (pandas), scientific computing libraries (NumPy, SciPy), plotting library (matplotlib; IMO the ggplot2 package in R produces much better-looking plots than matplotlib), and scikit-learn for machine learning in Python.

For the analysis I am using the Titanic: Machine Learning from Disaster data from Kaggle's knowledge-based competition. A major reason for choosing this data is that many online Python tutorials and blogs use it, which makes learning and understanding easier.

Note: This is not a tutorial. The data analysis done here is based on various online Titanic Data related Python tutorials/blogs.

#############################################################
### Kaggle Competition: Titanic Machine Learning from Disaster
# Import important libraries and modules
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab as p
import sklearn as sol

# Reading Titanic (training) data
train = pd.read_csv("/Users/Ankoor/Desktop/Python/Kaggle/Titanic/train.csv")

# View dataframe
train

# View first 'n' rows [Similar to R's head()]
train.head(5)

Out:

# View last 'n' rows [Similar to R's tail()]
train.tail(3)

# Get column names (features / attributes) in data frame [Similar to R's names()]
list(train)

Out:

train.columns # another command to get feature names

# What kind of data array is 'train'?
type(train)

Out: pandas.core.frame.DataFrame

# Data types in 'train'
train.dtypes

Out:

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

# Information about data i.e. How to find out missing values?
train.info()

Out:

There are 891 observations. Features 'Age' (714 non-missing values), 'Cabin' (204 non-missing values) and 'Embarked' (889 non-missing values) have missing data.

# Checking missing values in the data: Age and Cabin
sum(train['Age'].isnull())

Out: 177

177 'Age' observations missing

sum(train['Cabin'].isnull())

Out: 687

687 'Cabin' observations missing

Note: .isnull() is a pandas method, so it is not available on a plain Python 'str'; use pd.isnull(value) to test a single value.
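
The per-column checks above can be collapsed into a single call. Here is a minimal sketch on a small made-up frame (the rows in `toy` are hypothetical stand-ins, not the Kaggle data); it also shows `pd.isnull()` applied to single values, including a plain string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for a few Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Missing values per column in one call
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A plain Python str has no .isnull() method; pd.isnull() accepts scalars
first_cabin_missing = pd.isnull(toy['Cabin'][0])
name_missing = pd.isnull('Dahlberg')
```

On the real data, `train.isnull().sum()` should report the same 177 and 687 found above for 'Age' and 'Cabin'.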

# Describe data: Count, Mean, STD, Min, Max [Similar to R's summary()]
train.describe()

Out:

# Referencing in Pandas
train['Age'][0:10]

Out:

# Referencing: Other method (Kind of similar to R when you replace '.' with '$')
train.Age[0:10]

# Type of referenced data?
type(train['Age'])
type(train.Age) # another command to get type of referenced data

## Some basic statistics like mean and median
# Mean Age (Ignoring missing values)
train.Age.mean()
train['Age'].mean()

Out: 29.69911764705882

# Median Fare (Ignoring missing values)
train.Fare.median()
train['Fare'].median()

Out: 14.4542
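
The "ignoring missing values" behaviour comes from pandas skipping NaN by default. A small sketch with a made-up Series (`age` is hypothetical) shows the difference between the pandas default, the strict variant, and NumPy's mean on the raw array:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
age = pd.Series([20.0, np.nan, 40.0])

pandas_mean = age.mean()               # skipna=True by default, NaN is ignored
strict_mean = age.mean(skipna=False)   # NaN propagates into the result
numpy_mean = np.mean(age.to_numpy())   # plain NumPy mean also propagates NaN

print(pandas_mean, strict_mean, numpy_mean)
```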

# Unique values
train.Sex.unique()

Out: array(['male', 'female'], dtype=object)

train['Embarked'].unique()

Out: array(['S', 'C', 'Q', nan], dtype=object)

train['Pclass'].unique()

Out: array([3, 1, 2])

3 Passenger classes
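
`unique()` only lists the distinct values. When the number of rows per value matters as well, `value_counts()` is a natural companion. A sketch on a made-up column (the values in `embarked` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing port of embarkation
embarked = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'S'])

# dropna=False keeps the missing value as its own category
counts = embarked.value_counts(dropna=False)
print(counts)
```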

# Getting subsets of dataframe
train[['Sex', 'Pclass', 'Age']]

Out:

# Filtering data with 'Age' > 60 years
train[train['Age'] > 60]

Out:

# Filtering and sub-setting data with 'Age' > 60 years
train[train['Age'] > 60][['Sex', 'Pclass', 'Age', 'Survived']]

Out:

# Filtering and sub-setting data with missing values
train[train['Age'].isnull()][['Sex', 'Pclass', 'Age', 'Survived']]

# Counting # of males in each passenger class
for i in range(1, 4):
    print('%d %d' % (i, len(train[(train['Sex'] == 'male') & (train['Pclass'] == i)])))

Out:

1 122
2 108
3 347

# Counting # of females in each passenger class
for i in range(1, 4):
    print('%d %d' % (i, len(train[(train['Sex'] == 'female') & (train['Pclass'] == i)])))

Out:

1 94
2 76
3 144

Passenger class 3 has more male and female passengers compared to passenger class 1 and 2
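
The two counting loops above can also be written without an explicit loop. This is a sketch on a made-up frame (the rows in `toy` are hypothetical), using `groupby(...).size()` and `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy rows: three males and three females across classes 1 and 3
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'male', 'female', 'female'],
    'Pclass': [1, 3, 3, 3, 1, 3],
})

# One entry per (Sex, Pclass) pair, replacing the two loops
counts = toy.groupby(['Sex', 'Pclass']).size()
print(counts)

# The same numbers laid out as a Sex x Pclass table
table = pd.crosstab(toy['Sex'], toy['Pclass'])
print(table)
```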

# Simple Histogram of Age
train['Age'].hist()
p.show()

Out:

# Histogram of Age (after dropping missing values); alpha sets the bar transparency (0 = fully transparent, 1 = opaque)

train['Age'].dropna().hist(bins = 16, range = (0, 80), alpha = 0.5)

Out:

## Cleaning data: Transforming 'String values'

# 1. Adding a new column and filling it with a number

train['Gender'] = 4

# 2. Populating the new column 'Gender' with M or F

train['Gender'] = train['Sex'].map(lambda x: x[0].upper())

# 3. Populating the new column with binary integers

train['Gender'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)
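
One thing to watch with the dictionary mapping: `.map()` returns NaN for any value that is not a key in the dictionary, and `.astype(int)` then fails on the NaN. A hedged sketch with made-up values (`sex` and `codes` are illustrative names) that checks before casting:

```python
import pandas as pd

# Hypothetical toy column
sex = pd.Series(['male', 'female', 'female', 'male'])
codes = {'female': 0, 'male': 1}

mapped = sex.map(codes)

# Values missing from the dict would come back as NaN, so check before casting
assert mapped.notnull().all(), 'unmapped label in Sex column'
gender = mapped.astype(int)
print(gender.tolist())
```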

## Feature Engineering: Name attribute has honorific titles like Mr., Mrs., etc

# Extracting salutation from Name (Format: Last Name, Title. First Name)

# Name example: Dahlberg, Miss. Gerda Ulrika

def title(name):
    temp_1 = name.split(',') # Split by (,)
    temp_2 = temp_1[1].split('.')[0] # Split by (.)
    temp_3 = temp_2.strip() # Remove white space
    return temp_3

train['Title'] = train['Name'].apply(title) # Apply function 'title' to 'Name'

train[['PassengerId', 'Survived', 'Sex', 'Pclass', 'Age', 'Gender', 'Title']]

Out:
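
The same title extraction can be done in one vectorized step with a regular expression. This is only a sketch of an alternative to the title() function above, assuming every name follows the "Last Name, Title. First Name" layout; the names below are illustrative examples in that layout.

```python
import pandas as pd

# Illustrative names in the 'Last Name, Title. First Name' layout
names = pd.Series([
    'Dahlberg, Miss. Gerda Ulrika',
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture the text between the first comma and the next period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Names that do not match the pattern come back as NaN, so it is worth checking `titles.isnull().sum()` before relying on the result.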

## How to count passenger by Title

# Grouping by Title

temp_4 = train.groupby('Title')

# Counting passengers by Title

temp_5 = temp_4.PassengerId.count()

print(temp_5)

Out:

# Barplot: Passenger count by title

temp_5.plot(kind = 'bar')

Out:

The majority of passengers had one of four honorific titles: Mr, Mrs, Miss, and Master. I will rename (1) honorific titles like Capt, Col, Don, Dr, Jonkheer, Major, Rev and Sir to Mr; (2) honorific titles like Lady, Mme, Ms, and the Countess to Mrs; (3) honorific titles like Mlle to Miss.

# How many males and females are Doctors?

train[train['Title'] == 'Dr']

Out:

6 male doctors and 1 female doctor (Dr. Alice (Farnham) Leader)

## Create a 'Temp' column in train and fill it with concatenated 'Sex' and 'Title' string values

train['Temp'] = train['Sex'] + train['Title']

## Replace concatenated value 'femaleDr' value with 'Mrs'

train.loc[train['Temp'] == 'femaleDr', 'Title'] = 'Mrs'

# Drop 'Temp' column

train = train.drop(['Temp'], axis = 1)
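
The helper column is not strictly needed: both conditions can be combined in one boolean mask. A minimal sketch on made-up rows (`toy` and `is_female_dr` are illustrative names):

```python
import pandas as pd

# Hypothetical toy rows: one male doctor, one female doctor, one Miss
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Title': ['Dr', 'Dr', 'Miss'],
})

# Combine both conditions in one boolean mask; no helper column needed
is_female_dr = (toy['Sex'] == 'female') & (toy['Title'] == 'Dr')
toy.loc[is_female_dr, 'Title'] = 'Mrs'
print(toy)
```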

## There are 4 main titles: Mr, Mrs, Master and Miss, and some other titles

# Taking care of other titles

def new_title(title):
    if title in ('Mr', 'Capt', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir', 'Col'):
        return 'Mr'
    elif title in ('Mrs', 'Lady', 'Mme', 'Ms', 'the Countess'):
        return 'Mrs'
    elif title in ('Miss', 'Mlle'):
        return 'Miss'
    else:
        return 'Master'

train['NewTitle'] = train['Title'].apply(new_title)

# Drop 'Title' attribute

train = train.drop(['Title'], axis = 1)
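
The renaming rules can also be kept as data in a dictionary and applied with `Series.replace`. This is a sketch of an alternative, not the approach used above, and it behaves differently in one respect: a title that is not listed stays as it is instead of falling through to 'Master'. `TITLE_MAP` is an illustrative name, and 'Dona' stands in for a hypothetical title the rules do not cover.

```python
import pandas as pd

# Rare title -> common title, following the renaming rules in the text
TITLE_MAP = {
    'Capt': 'Mr', 'Col': 'Mr', 'Don': 'Mr', 'Dr': 'Mr', 'Jonkheer': 'Mr',
    'Major': 'Mr', 'Rev': 'Mr', 'Sir': 'Mr',
    'Lady': 'Mrs', 'Mme': 'Mrs', 'Ms': 'Mrs', 'the Countess': 'Mrs',
    'Mlle': 'Miss',
}

# Hypothetical sample of titles, including one that is not in the map
titles = pd.Series(['Mr', 'Rev', 'Mlle', 'Master', 'the Countess', 'Dona'])

# Titles found in the map are renamed; all others are left unchanged
new_titles = titles.replace(TITLE_MAP)
print(new_titles.tolist())
```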

# Grouping by Title

temp_6 = train.groupby('NewTitle')

# Counting passengers by Title

temp_7 = temp_6.PassengerId.count()

print(temp_7)
temp_7.plot(kind = 'bar')

Out:

Now all the passenger honorific titles have been updated.

## Now descriptive statistics plots to understand data and survival chance

train.boxplot(column = 'Age', by = 'NewTitle')

Out:

Outliers: Miss with Age around 60? I used train[(train['Age'] > 30) & (train['NewTitle'] == 'Miss')] and found that some females older than 30 years have 'Miss' in their title (maybe they were unmarried, or there was some other reason).

train.boxplot(column = 'Fare', by = 'Pclass')

Out:

Outliers: Some passengers in First Class paid more than $200 for tickets; maybe they paid for their whole family.

# Passenger distribution by Passenger Class and Survival Chance

group_1 = train.groupby('Pclass').PassengerId.count()

group_1.plot(kind = 'bar')

Out:

More than half of the passengers (491 of 891, about 55%) were 3rd class passengers.

group_2 = train.groupby('Pclass').Survived.sum()

Pclass_Survival_Prob = group_2/group_1

Pclass_Survival_Prob.plot(kind = 'bar', color = 'pink', alpha = 0.65)

Out:

However, a higher proportion of First Class and Second Class passengers survived compared to passengers in Third Class (maybe better access to lifeboats/life jackets, or easier access to the upper decks?).
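
Because 'Survived' holds only 0 and 1, the ratio of sum to count computed above is simply the group mean. A small sketch on made-up rows (the values in `toy` are hypothetical) confirms that both routes give the same survival rate per class:

```python
import pandas as pd

# Hypothetical toy rows: class 1 all survive, class 2 half, class 3 a quarter
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('Pclass')['Survived']

# Survived is 0/1, so its mean per class equals sum / count
rate = grouped.mean()
same = grouped.sum() / grouped.count()
print(rate)
```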

# Passenger distribution by Passenger Class, Gender and Survival Chance

# Barplot using Cross-tabulation

group_3 = pd.crosstab([train.Pclass, train.Sex], train.Survived)

group_3.plot(kind = 'bar', stacked = True, color = ['black', 'yellow'])

Out:

Compared to males more females survived the disaster.

# Some other related plots

group_4 = pd.crosstab([train.Pclass, train.Sex, train.Embarked], train.Survived)
group_4.plot(kind = 'bar', stacked = True, color = ['black', 'yellow'], alpha = 0.5)

Out:

group_5 = pd.crosstab([train.Pclass, train.NewTitle], train.Survived)
group_5.plot(kind = 'bar', stacked = True, color = ['black', 'yellow'])

Out:

group_6 = pd.crosstab([train.Embarked, train.NewTitle], train.Survived)
group_6.plot(kind = 'bar', stacked = True, color = ['black', 'yellow'])

Out:

# Feature Engineering: Family size
train['Family'] = train['SibSp'] + train['Parch']
group_7 = pd.crosstab([train.Pclass, train.Family], train.Survived)
group_7.plot(kind = 'bar', stacked = True, color = ['black', 'yellow'], alpha = 0.25)

## Imputing missing values in attribute 'Age'. I found the code used below at this blog.

# View dataframe: 'Age' = NaN

train[train['Age'].isnull()].head()

table = train.pivot_table(values = 'Age', index = ['NewTitle'], columns = ['Pclass', 'Sex'], aggfunc = np.mean)

def ageFunc(x):
    return table[x['Pclass']][x['Sex']][x['NewTitle']]

train['Age'] = train['Age'].fillna(train[train['Age'].isnull()].apply(ageFunc, axis = 1))

train['Age'] = train['Age'].astype(int)
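
A similar group-wise imputation can be written without the pivot table by using `groupby(...).transform('mean')`. This is a different technique from the lookup function above, shown here only as a sketch on made-up rows (`toy` and `group_mean` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: two groups, each with one missing age
toy = pd.DataFrame({
    'NewTitle': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Pclass':   [3, 3, 3, 1, 1, 1],
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age':      [20.0, 30.0, np.nan, 10.0, np.nan, 20.0],
})

# Mean age of each (NewTitle, Pclass, Sex) group, aligned row by row
group_mean = toy.groupby(['NewTitle', 'Pclass', 'Sex'])['Age'].transform('mean')

# Fill only the missing ages; observed ages are left untouched
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy['Age'].tolist())
```

If a whole group has no observed ages, its mean is NaN and those rows stay missing, so a check such as `toy['Age'].isnull().sum()` is useful before casting to int.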

#Some more plots

#Specifying Plot Parameters
# figsize = (x inches, y inches), dpi = n dots per inch
fig = plt.figure(figsize = (11, 8), dpi = 1600)

# Plot: 1
ax1 = fig.add_subplot(221) # .add_subplot(rcp): r = row, c = col, p = position
female_hiclass = train['Survived'][train['Sex'] == 'female'][train['Pclass'] != 3].value_counts()
female_hiclass.plot(kind = 'bar', label = 'Female High Class', color = 'deeppink', alpha = 0.25)
ax1.set_xticklabels(['Survived', 'Dead'], rotation = 0)
ax1.set_xlim(-1, len(female_hiclass))
ax1.set_ylim(0, 400)
plt.legend(loc = 'best')

# Plot: 2
ax2 = fig.add_subplot(222) # .add_subplot(rcp): r = row, c = col, p = position
female_loclass = train['Survived'][train['Sex'] == 'female'][train['Pclass'] == 3].value_counts()
female_loclass.plot(kind = 'bar', label = 'Female Low Class', color = 'pink', alpha = 0.25)
ax2.set_xticklabels(['Survived', 'Dead'], rotation = 0)
ax2.set_xlim([-1, len(female_loclass)])
ax2.set_ylim(0, 400)
plt.legend(loc = 'best')

# Plot: 3
ax3 = fig.add_subplot(223) # .add_subplot(rcp): r = row, c = col, p = position
male_hiclass = train['Survived'][train['Sex'] == 'male'][train['Pclass'] != 3].value_counts()
male_hiclass.plot(kind = 'bar', label = 'Male High Class', color = 'teal', alpha = 0.25)
ax3.set_xticklabels(['Dead', 'Survived'], rotation = 0)
ax3.set_xlim(-1, len(male_hiclass))
ax3.set_ylim(0, 400)
plt.legend(loc = 'best')

# Plot: 4
ax4 = fig.add_subplot(224) # .add_subplot(rcp): r = row, c = col, p = position
male_loclass = train['Survived'][train['Sex'] == 'male'][train['Pclass'] == 3].value_counts()
male_loclass.plot(kind = 'bar', label = 'Male Low Class', color = 'green', alpha = 0.25)
ax4.set_xticklabels(['Dead', 'Survived'], rotation = 0)
ax4.set_xlim(-1, len(male_loclass))
ax4.set_ylim(0, 400)
plt.legend(loc = 'best')

Out:

Females in the higher classes had a better chance of survival than females in the lower class. Irrespective of class, more male passengers perished than female passengers.
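
A caveat about the tick labels in the four plots: `value_counts()` sorts by frequency, so whether the first bar is 'Dead' or 'Survived' depends on the data in each subset. One way to make the order predictable is to reindex the counts before plotting. A sketch with made-up outcomes (`survived` is a hypothetical Series):

```python
import pandas as pd

# Hypothetical 0/1 outcomes where survivors outnumber the dead
survived = pd.Series([1, 1, 0, 1])

# value_counts() sorts by frequency, so the most common outcome comes first
by_frequency = survived.value_counts()

# Reindexing fixes the order to 0 (dead), 1 (survived); absent outcomes become 0
fixed_order = survived.value_counts().reindex([0, 1], fill_value=0)
labels = ['Dead', 'Survived']
labelled = dict(zip(labels, fixed_order.tolist()))
print(labelled)
```

With the order fixed this way, the same label list ['Dead', 'Survived'] could be used for every subplot.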

Thursday, November 13, 2014

"Titanic: Machine Learning from Disaster" Data Analysis using Python
