Thursday, October 30, 2014

Lending Club Data - A Simple Logistic Regression Approach

This post is a continuation of the Lending Club Data Analysis (Linear Regression Approach). I was going to start a new project, but I found a source that uses the Lending Club data to teach how to use IPython to develop a simple logistic regression model. I will be using R to develop a simple logistic regression model. The first step is to clean and understand the data (data exploration).


Let's assume Miss X, who is a computer scientist and a bike enthusiast, earns $6,500 a month and is interested in purchasing a performance bike that costs $15,000. She has a FICO score of 750. She wants to know if she can borrow $15,000 from Lending Club at an interest rate of 10% or less.

In the previous post I had already found significant variables and I will be using those variables to develop a simple logistic regression model:

logit(P(Approval.Indicator = 1)) = b0 + b1 * FICO.Mean + b2 * Amount.Requested + b3 * Monthly.Income

Since "loan approval with interest rate 10%" or less is not provided, I will "approval indicator" variable. 

# First add an indicator variable for whether the interest rate is <= 10%
loanData$Indicator <- loanData$Interest.Rate <= 10
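# A quick sanity check: how many loans fall in each class of the new indicator?
table(loanData$Indicator)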
head(loanData)

summary(loanData)



# Standard deviation of the numeric loan attributes
sapply(loanData[sapply(loanData, is.numeric)], sd)

# Fit a logit model using glm
logitModel <- glm(Indicator ~ FICO.Mean + Amount.Requested + Monthly.Income, data = loanData, family = "binomial")
summary(logitModel)
confint(logitModel)



For every one-unit increase in FICO.Mean, the log odds of loan approval at an interest rate of 10% or less increase by 0.07224, holding the other predictors constant.
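On the odds scale, this coefficient corresponds to multiplying the odds by exp(0.07224). A quick check in R, using the fitted model from above, matches the odds ratio reported below:

exp(coef(logitModel)["FICO.Mean"])  # exp(0.07224), roughly 1.0749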

par(mfrow = c(2, 2))
plot(logitModel)



# Odds ratio and 95% CI
exp(cbind(OR = coef(logitModel), confint(logitModel)))


For a one-unit increase in FICO.Mean, the odds of loan approval at an interest rate of 10% or less increase by a factor of 1.0749.

boxplot(predict(logitModel, type = "response") ~ loanData$Indicator, col = "blue")


# Choosing a cutoff(re-substitution)
temp <- seq(0, 1, length = 20)
err <- rep(NA, 20)

for (i in 1:length(temp)){
        err[i] <- sum((predict(logitModel, type = "response") > temp[i]) != loanData$Indicator)
}

plot(temp, err, pch = 19, col = "red", xlab = "Cutoff", ylab = "Error")


The error is at its minimum when the cutoff is approximately 0.4, thus:

# Simple cutoff: Prob > 0.40 means loan approved, otherwise loan not approved.

Checking Model Performance
Performance <- predict(logitModel, type = "response") > 0.4
table(loanData$Indicator, Performance)
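The confusion table above can also be summarized as a single overall accuracy figure. A minimal check, reusing the 0.4 cutoff chosen earlier:

# Proportion of loans classified correctly at the 0.4 cutoff
mean((predict(logitModel, type = "response") > 0.4) == loanData$Indicator)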





Now let's calculate the probability that Miss X's request for a $15,000 loan from Lending Club at an interest rate of 10% or less will be approved, given her FICO score of 750 and monthly income of $6,500.

missX <- data.frame(FICO.Mean = 750, Amount.Requested = 15000, Monthly.Income = 6500)

predict(logitModel, newdata = missX, type = "response")


The resulting probability is 0.6464171 > 0.4, which means that Miss X's request for a $15,000 loan from Lending Club at an interest rate of 10% or less is predicted to be approved!
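As a sanity check, the same probability can be computed by hand from the fitted coefficients. This is a minimal sketch; it relies on the coefficient order following the model formula (Intercept, FICO.Mean, Amount.Requested, Monthly.Income):

# Linear predictor for Miss X, then the logistic transform
eta <- sum(coef(logitModel) * c(1, 750, 15000, 6500))
plogis(eta)  # plogis() is the logistic CDF; this should match the predict() result above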


Wednesday, October 29, 2014

Lending Club Data - A Simple Linear Regression Approach To Predict Loan Interest Rate

I started this project yesterday just for fun and to find out how someone's FICO score affects their loan interest rate. Some time back, Lending Club made its loan data available to the public (the data is, of course, anonymized). The data is available here. I am using R to clean up the data and to develop a simple linear regression model. The data has 2500 observations and 14 loan attributes. The attributes are mostly self-explanatory, and Google is always there for the definitions. The attributes are:


  1. Amount.Requested
  2. Amount.Funded.By.Investors
  3. Interest.Rate
  4. Loan.Length
  5. Loan.Purpose
  6. Debt.To.Income.Ratio
  7. State
  8. Home.Ownership
  9. Monthly.Income
  10. FICO.Range
  11. Open.CREDIT.Lines
  12. Revolving.CREDIT.Balance
  13. Inquiries.in.the.Last.6.Months
  14. Employment.Length
The first step is to read the data and browse it before the data cleaning step.

##############################################
# Session info
sessionInfo()

# Set working directory
setwd("/Users/Ankoor/Desktop/ML with R")

# Get file from the internet
fileUrl <- "https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv"
download.file(fileUrl, destfile = "loansData.csv", method = "curl")

# Date of download
dateDownloaded <- date()

# Load data to R
loanData <- read.csv("loansData.csv")

# Structure of Data
str(loanData)


Notice that the variables Interest.Rate and Debt.To.Income.Ratio are factors with a "%" sign, and Loan.Length is a factor with the text "months". The data needs to be cleaned by removing the "%" sign. I will leave the text "months" as it is for now.

# Display data: column names and data type
sapply(loanData, class)

# Display first few rows of data, last rows -> tail()
head(loanData)

Now I need to find out whether any observations have missing attributes, and then deal with the missing data.

## Check for missing data
# Identifies total no. of missing values
sum(is.na(loanData)) 

The result indicates that there are 7 missing values. Now I identify the loan attributes with missing data:

# Identifies column names with missing data
names(loanData[, !complete.cases(t(loanData))]) # t() -> transpose
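An equivalent and perhaps more direct way to locate the NA's is to count them column by column, using only base R functions:

# Number of missing values in each column; names of columns with a non-zero count
colSums(is.na(loanData))
names(loanData)[colSums(is.na(loanData)) > 0]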

The loan attributes with missing data are: "Monthly.Income", "Open.CREDIT.Lines", "Revolving.CREDIT.Balance", and "Inquiries.in.the.Last.6.Months". Now I remove the rows with missing data (NA's) and clean the data.

# Data cleanup
# Remove rows with NA's
loanData <- loanData[complete.cases(loanData), ]

# Trick to replace or impute missing value (NA's) with mean values
# loanData$Monthly.Income[is.na(loanData$Monthly.Income)] <- mean(loanData$Monthly.Income)


# Convert columns with "xx%" to numeric variables by stripping the "%" sign
loanData$Interest.Rate <- as.numeric(sub("%", "", as.character(loanData$Interest.Rate)))

loanData$Debt.To.Income.Ratio <- as.numeric(sub("%", "", as.character(loanData$Debt.To.Income.Ratio)))

# Convert "xx months" to numeric (left as text for now)
# loanData$Loan.Length <- as.numeric(sub(" months", "", as.character(loanData$Loan.Length)))

The FICO.Range attribute is also a factor; each value contains a "-" between the low and high ends of the range, which needs to be handled.

# Mean of FICO Range (simple approach - not using here)
# loanData$FICO.Mean <- as.numeric(substring(loanData$FICO.Range, 1, 3)) + 2

# Mean of FICO Range (another approach): average the low and high ends of the range
meanFICO <- function(x) (as.numeric(substr(x, 1, 3)) + as.numeric(substr(x, 5, 7))) / 2
loanData$FICO.Mean <- sapply(as.character(loanData$FICO.Range), meanFICO, USE.NAMES = FALSE)

# Remove monthly income outliers since some monthly incomes are greater than $50,000
loanData <- loanData[which(loanData$Monthly.Income < 50000), ]

# Data exploration

# Set plot display to 1 row and 2 graphs
par(mfrow = c(1, 2))

# Plot interest rate and check for normal distribution
hist(loanData$Interest.Rate, col = "blue", xlab = "Interest Rate", main = "Histogram")
qqnorm(loanData$Interest.Rate)
qqline(loanData$Interest.Rate, col = "red", lwd = 1.5)



# Box Plots of Interest Rate Vs other loan attributes

par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Loan.Purpose, col = "green", varwidth = TRUE, xlab = "Loan Purpose", ylab = "Interest Rate")

boxplot(loanData$Interest.Rate ~ loanData$Home.Ownership, col = "orange", varwidth = TRUE, xlab = "Home Ownership", ylab = "Interest Rate")




par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Employment.Length, col = "blue", varwidth = TRUE, xlab = "Employment Length", ylab = "Interest Rate")

boxplot(loanData$Interest.Rate ~ loanData$Inquiries.in.the.Last.6.Months, xlab = "# of Inquiries in last 6 months",col = "red", varwidth = TRUE, ylab = "Interest Rate")




par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Open.CREDIT.Lines, col = "purple", varwidth = TRUE, xlab = "# of Open Credit Lines", ylab = "Interest Rate")




# Interest rate and State
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$State, col = "green", xlab = "State", ylab = "Interest Rate")




# Interest rate and Mean FICO score
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$FICO.Mean, col = "blue", varwidth = TRUE, xlab = "Mean FICO Score", ylab = "Interest Rate")





# Interest rate and FICO Range
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$FICO.Range, col = "yellow", varwidth = TRUE, xlab = "FICO Range", ylab = "Interest Rate")


It is obvious from the above plots that the loan interest rate decreases as the mean FICO score increases (i.e., improves).
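A quick numerical check of this relationship (the correlation should be clearly negative; the exact value depends on the cleaned data):

# Correlation between mean FICO score and interest rate
cor(loanData$FICO.Mean, loanData$Interest.Rate)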

Development of Linear Regression Model

# Fitting a simple linear regression model to predict interest rate based on mean fico score
meanFicoLM <- lm(loanData$Interest.Rate ~ loanData$FICO.Mean)
summary(meanFicoLM)

Here is the output; R-squared = 0.5029.



# Fitting all independent variables to find significant loan attributes (p-values close to 0)
testLM <- lm(Interest.Rate ~ ., data = loanData)
summary(testLM)

# Fitting model with significant independent variables: "Amount.Requested",
# "Amount.Funded.By.Investors", "Loan.Length", "Monthly.Income", "Open.CREDIT.Lines", 
# "Inquiries.in.the.Last.6.Months", "FICO.Mean"

linearModel <- lm(Interest.Rate ~ Amount.Requested + Loan.Length + Amount.Funded.By.Investors +
                          Monthly.Income + Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months +
                          FICO.Mean, data = loanData)
summary(linearModel)

Here is the output; adjusted R-squared = 0.7599.




# Display 95% Confidence Interval
confint(linearModel)

# Plot residuals to check for fitting problems
par(mfrow = c(1,2))
hist(linearModel$residuals, col = "azure", xlab = "Residuals") # Colors: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
qqnorm(linearModel$residuals)
qqline(linearModel$residuals, col = "blue", lwd = 2) # lwd -> line width


par(mfrow = c(2, 2))
plot(linearModel)





Machine Learning: Linear Regression

## Training and Testing Data Sets: Training = 80%, Testing = 20%
# No Cross-Validation here

# Setting seed to reproduce partition
set.seed(15)
sampleSize <- ceiling(nrow(loanData)*0.8)
position <- sample(nrow(loanData), sampleSize)
train <- loanData[position, ]
test <- loanData[-position, ]

# Removing dependent variable from test data
trueValues <- test$Interest.Rate
test$Interest.Rate <- NULL 

# Fit significant factors in training data
trainLM <- lm(Interest.Rate ~ Amount.Requested + Loan.Length + Amount.Funded.By.Investors +
                      Monthly.Income + Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months +
                      FICO.Mean, data = train)

# Predict interest rate on the test data
interestRateHat <- predict(trainLM, newdata = test)

# Calculate Root Mean Squared Error
RMSE <- sqrt(mean((trueValues - interestRateHat)^2))
RMSEpercent <- RMSE / mean(trueValues) * 100


Root Mean Squared Error = 0.02456184
Root Mean Squared Error Percentage = 0.1875415%

Tuesday, October 28, 2014

Reshaping 3 Column R Data Frame to a Square Matrix

Suppose you have an R data frame that looks like this:

(Origin airport, destination airport, and imaginary flight time in minutes)
If you want to create an origin-destination matrix, that is, reshape this data frame into a square matrix, one possible way to do it in R is shown below.

##############################################
## Read data (csv) 
temp <- read.csv("AirportData.csv", stringsAsFactors = TRUE)

## Order temp data in ascending order of origin airport and then destination airport

temp <- temp[order(temp$Origin.Airport, temp$Destination.Airport), ]

## Ordered data looks like this:





## Populating square matrix in R

oriAirport <- temp[, 1]    # origin airports (factor)
desAirport <- temp[, 2]    # destination airports (factor)
flightTime <- temp[, 3]    # flight times in minutes

# With the data sorted by origin and then destination, the first destination airport
# reappears at the start of every origin block, so the gap between its first two
# occurrences equals the number of destination airports (assuming every pair is present)
Index <- which(desAirport == desAirport[1])
desAirportRange <- Index[2] - 1
oriAirportRange <- length(flightTime) / desAirportRange

# Fill the matrix row by row: row i = i-th origin airport, column j = j-th destination airport
distMatrix <- matrix(0, nrow = oriAirportRange, ncol = desAirportRange)
for (i in 1:oriAirportRange){
        for (j in 1:desAirportRange){
                distMatrix[i, j] <- flightTime[(i - 1) * desAirportRange + j]
        }
}

dist <- data.frame(distMatrix)
names(dist) <- c(levels(desAirport))        
row.names(dist) <- c(levels(oriAirport))
View(dist)

##############################################
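As an aside, base R's xtabs() can build the same origin-destination matrix in one line. This is only a sketch: it assumes the flight-time column in temp is named Flight.Time, which may not match the actual column name in AirportData.csv.

# Hypothetical column name Flight.Time assumed for the third column of temp
odMatrix <- xtabs(Flight.Time ~ Origin.Airport + Destination.Airport, data = temp)
odMatrix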

The resulting square matrix looks like this:







Monday, October 27, 2014

NYC Citi Bike System Data Analysis

NYC Citi Bike is a public bicycle sharing system that serves parts of Manhattan and Brooklyn, two of the most populous boroughs of New York City. The New York Citi Bike system data is publicly available here. The data is available for the past 12 months from the current date.

Citi Bike Trip data include:
  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Latitude/Longitude
  • Bike ID
  • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
  • Gender (0 = unknown; 1 = male; 2 = female)
  • Year of Birth
Since the location of each Citi Bike station is an important variable and it was not available here, I used Google to find the locations of the Citi Bike stations in NYC. The Citi Bike station location data is available here.

I am using R to analyze 3 months (December 2013 to February 2014) of Citi Bike trip data. The R code is shown below.

#############################################################################################
# Set working directory
setwd("/Users/Ankoor/Desktop/NYCBS")

# Read data
dec <- read.csv("2013-12 - Citi Bike trip data.csv", stringsAsFactors = FALSE)
jan <- read.csv("2014-01 - Citi Bike trip data.csv", stringsAsFactors = FALSE)
feb <- read.csv("2014-02 - Citi Bike trip data.csv", stringsAsFactors = FALSE)


# Merge data (Adding rows)
trip <- rbind(dec, jan, feb)
rm(dec, jan, feb)

# Bike station data
bikeStn <- read.csv("citibike.csv", stringsAsFactors = FALSE)

# Drop unnecessary columns from bikeStn
names(bikeStn) # Get column names in bikeStn
drop <- c("name", "streetAddress", "streetAddress.address2", "latitude", "longitude",
          "loc", "entityTitle", "X.context", "X.type", "X.id")
bikeStn <- bikeStn[, !(names(bikeStn) %in% drop)]

# Change column names in bikeStn for the start-station merge
names(bikeStn) <- c("start.station.id", "start.totDocks", "start.hood", "start.zip")

# Merge data for start station (many to one)
trip <- merge(trip, bikeStn, by = c("start.station.id"))

# Change column names in bikeStn for the end-station merge
names(bikeStn) <- c("end.station.id", "end.totDocks", "end.hood", "end.zip")

# Merge data for end station (many to one)
trip <- merge(trip, bikeStn, by = c("end.station.id"))
rm(bikeStn, drop)

# Plot - 1: NYC Citi Bike Route Popularity 

hood.trips <- table(trip$start.hood, trip$end.hood)

temp <- data.frame(hood.trips)   # columns: starting hood, ending hood, trip count
names(temp) <- c("startHood", "endHood", "Popularity")
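# Aside (a quick sketch): list the ten most popular start/end neighborhood pairs,
# using only the columns created above
head(temp[order(-temp$Popularity), ], 10)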

# ggplot2 library
library(ggplot2)

# Plot
pdf("NYC Citi Bike Route Popularity.pdf", width = 11, height = 11)
ggplot(temp, aes(startHood, endHood)) + geom_tile(aes(fill = Popularity), color = "black") + 
        scale_fill_gradient(low = 'white', high = 'blue') + theme(axis.text.x = element_text(
                angle = 90)) + xlab("Starting Neighborhood") + ylab("Ending Neighborhood") +
        ggtitle("NYC Citi Bike Route Popularity")
dev.off()

#############################################################################################

Here is a screenshot of the plot produced by the above R code: