Wednesday, October 29, 2014

Lending Club Data - A Simple Linear Regression Approach To Predict Loan Interest Rate

I started this project yesterday just for fun, to find out how someone's FICO score affects their loan interest rate. Some time ago Lending Club made data on loans available to the public (anonymized, of course). The data is available here. I am using R to clean up the data and to develop a simple linear regression model. The data has 2500 observations and 14 loan attributes. The attribute names are mostly self-explanatory, and their definitions are easy to look up. The attributes are:


  1. Amount.Requested
  2. Amount.Funded.By.Investors
  3. Interest.Rate
  4. Loan.Length
  5. Loan.Purpose
  6. Debt.To.Income.Ratio
  7. State
  8. Home.Ownership
  9. Monthly.Income
  10. FICO.Range
  11. Open.CREDIT.Lines
  12. Revolving.CREDIT.Balance
  13. Inquiries.in.the.Last.6.Months
  14. Employment.Length
The first step is to read the data and browse it before the data cleaning step.

##############################################
# Session info
sessionInfo()

# Set working directory
setwd("/Users/Ankoor/Desktop/ML with R")

# Get file from the internet
fileUrl <- "https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv"
download.file(fileUrl, destfile = "loansData.csv", method = "curl")

# Date of download
dateDownloaded <- date()

# Load data to R
loanData <- read.csv("loansData.csv")

# Structure of Data
str(loanData)


Notice that the variables Interest.Rate and Debt.To.Income.Ratio are factors with a % sign, and Loan.Length is a factor with the text "months". The data needs to be cleaned by removing the % sign. I will leave the text "months" as it is for now.

# Display data: column names and data type
sapply(loanData, class)

# Display first few rows of data, last rows -> tail()
head(loanData)

Now I need to find out whether any observations have missing attributes, and then deal with the missing data.

## Check for missing data
# Identifies total no. of missing values
sum(is.na(loanData)) 

The result indicates that there are 7 missing values. Now to identify the loan attributes with missing data:

# Identifies column names with missing data
names(loanData[, !complete.cases(t(loanData))]) # t() -> transpose

The loan attributes with missing data are "Monthly.Income", "Open.CREDIT.Lines", "Revolving.CREDIT.Balance", and "Inquiries.in.the.Last.6.Months". Now to remove the missing data (NA's) and clean the data.

# Data cleanup
# Remove rows with NA's
loanData <- loanData[complete.cases(loanData), ]

# Trick to replace or impute missing values (NA's) with the column mean
# (na.rm = TRUE is needed, otherwise mean() returns NA when NA's are present)
# loanData$Monthly.Income[is.na(loanData$Monthly.Income)] <- mean(loanData$Monthly.Income, na.rm = TRUE)
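
To make the missing-data steps concrete, here is a minimal sketch on a tiny made-up data frame (the values and the `toy` / `imputed` names are my own illustration, not the Lending Club data). It counts NA's per column, lists the affected columns, and shows mean imputation with `na.rm = TRUE`.

```r
# Made-up data frame with a few missing values (not the Lending Club data)
toy <- data.frame(
  Monthly.Income    = c(4000, NA, 5500, 6000),
  Open.CREDIT.Lines = c(10, 7, NA, 12),
  State             = c("CA", "NY", "TX", "WA")
)

# Number of NA's in each column
naPerColumn <- colSums(is.na(toy))
print(naPerColumn)

# Names of the columns that contain at least one NA
naColumns <- names(naPerColumn)[naPerColumn > 0]
print(naColumns)

# Mean imputation: na.rm = TRUE is required, otherwise mean() returns NA
imputed <- toy
incomeMean <- mean(imputed$Monthly.Income, na.rm = TRUE)
imputed$Monthly.Income[is.na(imputed$Monthly.Income)] <- incomeMean
print(imputed$Monthly.Income)
```

Dropping rows with `complete.cases()` is the simpler choice when only a handful of rows are affected; imputation keeps the rows but pulls the filled-in values toward the mean.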


# Convert columns with "xx%" values to numeric variables
loanData$Interest.Rate <- as.numeric(sub("%", "", as.character(loanData$Interest.Rate)))

loanData$Debt.To.Income.Ratio <- as.numeric(sub("%", "", as.character(loanData$Debt.To.Income.Ratio)))

# Convert column with "xx months" values to a numeric variable (not using here)
# loanData$Loan.Length <- as.numeric(sub(" months", "", as.character(loanData$Loan.Length)))
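
As a quick check of the conversion idea, here is a small sketch on made-up values in the same text formats (the vectors `rateText` and `lengthText` are hypothetical examples, not rows from the data). A factor has to go through `as.character()` first, because `as.numeric()` on a factor returns its internal integer codes rather than the printed values.

```r
# Made-up values in the same "xx.xx%" format as Interest.Rate
rateText <- factor(c("8.90%", "12.12%", "21.98%"))

# Convert to character first, then drop the "%" and convert to numeric
rate <- as.numeric(sub("%", "", as.character(rateText), fixed = TRUE))
print(rate)

# Same idea for values such as "36 months" / "60 months"
lengthText <- factor(c("36 months", "60 months", "36 months"))
loanMonths <- as.numeric(sub(" months", "", as.character(lengthText), fixed = TRUE))
print(loanMonths)
```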

The FICO.Range attribute is also a factor, with the low and high ends of the range separated by a "-" sign. It needs to be converted to a single number.

# Mean of FICO Range (simple approach - not using here)
# loanData$FICO.Range <- as.numeric(substring(loanData$FICO.Range, 1, 3)) + 2

# Mean of FICO Range (another approach): average the low and high ends of the range
meanFICO <- function(x) (as.numeric(substr(x, 1, 3)) + as.numeric(substr(x, 5, 7)))/2
loanData$FICO.Mean <- meanFICO(as.character(loanData$FICO.Range))
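
Here is a small sketch of the same midpoint calculation on made-up ranges (the `ficoRange` values are illustrative, and splitting on "-" is just an alternative I am assuming here to fixed character positions). It does not depend on every score having exactly three digits.

```r
# Made-up FICO ranges in "low-high" format
ficoRange <- factor(c("735-739", "670-674", "800-804"))

# Split each range at "-" and average the two ends
bounds <- strsplit(as.character(ficoRange), "-", fixed = TRUE)
ficoMean <- sapply(bounds, function(b) mean(as.numeric(b)))
print(ficoMean)
```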

# Remove monthly income outliers (some monthly incomes are greater than $50000)
loanData <- loanData[which(loanData$Monthly.Income < 50000), ]

# Data exploration

# Set plot display to 1 row with 2 graphs
par(mfrow = c(1, 2))

# Plot interest rate and check for normal distribution
hist(loanData$Interest.Rate, col = "blue", xlab = "Interest Rate", main = "Histogram")
qqnorm(loanData$Interest.Rate)
qqline(loanData$Interest.Rate, col = "red", lwd = 1.5)



# Box Plots of Interest Rate Vs other loan attributes

par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Loan.Purpose, col = "green", varwidth = TRUE, xlab = "Loan Purpose", ylab = "Interest Rate")

boxplot(loanData$Interest.Rate ~ loanData$Home.Ownership, col = "orange", varwidth = TRUE, xlab = "Home Ownership", ylab = "Interest Rate")




par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Employment.Length, col = "blue", varwidth = TRUE, xlab = "Employment Length", ylab = "Interest Rate")

boxplot(loanData$Interest.Rate ~ loanData$Inquiries.in.the.Last.6.Months, xlab = "# of Inquiries in last 6 months",col = "red", varwidth = TRUE, ylab = "Interest Rate")




par(mfrow = c(1, 2))
boxplot(loanData$Interest.Rate ~ loanData$Open.CREDIT.Lines, col = "purple", varwidth = TRUE, xlab = "# of Open Credit Lines", ylab = "Interest Rate")




# Interest rate and State
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$State, col = "green", xlab = "State", ylab = "Interest Rate")




# Interest rate and Mean FICO score
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$FICO.Mean, col = "blue", varwidth = TRUE, xlab = "Mean FICO Score", ylab = "Interest Rate")





# Interest rate and FICO Range
par(mfrow = c(1,1))
boxplot(loanData$Interest.Rate ~ loanData$FICO.Range, col = "yellow", varwidth = TRUE, xlab = "FICO Range", ylab = "Interest Rate")


The plot above shows that the loan interest rate decreases as the mean FICO score increases (improves).

Development of Linear Regression Model

# Fitting a simple linear regression model to predict interest rate based on mean fico score
meanFicoLM <- lm(loanData$Interest.Rate ~ loanData$FICO.Mean)
summary(meanFicoLM)

Here is the output: R-squared = 0.5029
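
To show what that R-squared figure measures, here is a minimal sketch on simulated data (the numbers and the `toy` data frame are made up for illustration and have nothing to do with the actual loan data). R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and for a single predictor it equals the squared correlation.

```r
# Made-up data: interest rate falls as FICO score rises, plus noise
set.seed(1)
fico <- seq(660, 800, by = 5)
rate <- 70 - 0.08 * fico + rnorm(length(fico), sd = 1.5)
toy <- data.frame(fico = fico, rate = rate)

fit <- lm(rate ~ fico, data = toy)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ssResidual <- sum(residuals(fit)^2)
ssTotal    <- sum((toy$rate - mean(toy$rate))^2)
rSquared   <- 1 - ssResidual / ssTotal

# Compare with the value reported by summary()
print(c(byHand = rSquared, fromSummary = summary(fit)$r.squared))
```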



# Fitting all independent variables to find significant loan attributes with p-value close to 0
testLM <- lm(Interest.Rate ~ ., data = loanData)
summary(testLM)

# Fitting model with significant independent variables: "Amount.Requested",
# "Amount.Funded.By.Investors", "Loan.Length", "Monthly.Income", "Open.CREDIT.Lines", 
# "Inquiries.in.the.Last.6.Months", "FICO.Mean"

linearModel <- lm(Interest.Rate ~ Amount.Requested + Loan.Length +
                          Amount.Funded.By.Investors + Monthly.Income +
                          Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months +
                          FICO.Mean, data = loanData)
summary(linearModel)

Here is the output: Adjusted R-squared = 0.7599




# Display 95% Confidence Interval
confint(linearModel)
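
For reference, here is a sketch of what `confint()` computes for a linear model, using simulated data (the `toy` data frame and its coefficients are made up). Each 95% interval is the estimate plus or minus the t critical value times the standard error.

```r
# Made-up data with a known linear relationship plus noise
set.seed(2)
toy <- data.frame(x = 1:30)
toy$y <- 3 + 0.5 * toy$x + rnorm(30)
fit <- lm(y ~ x, data = toy)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# t critical value for a two-sided 95% interval
tCrit <- qt(0.975, df = df.residual(fit))

byHand <- cbind(lower = est - tCrit * se, upper = est + tCrit * se)
print(byHand)
print(confint(fit))
```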

# Plot residuals to check for fitting problems
# Color names: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
par(mfrow = c(1,2))
hist(linearModel$residuals, col = "azure", xlab = "Residuals")
qqnorm(linearModel$residuals)
qqline(linearModel$residuals, col = "blue", lwd = 2) # lwd -> line width


par(mfrow = c(2, 2))
plot(linearModel)





Machine Learning: Linear Regression

## Training and Testing Data Sets: Training = 80%, Testing = 20%, 
# No Cross-Validation here

# Setting seed to reproduce partition
set.seed(15)
sampleSize <- ceiling(nrow(loanData)*0.8)
position <- sample(nrow(loanData), sampleSize)
train <- loanData[position, ]
test <- loanData[-position, ]

# Removing dependent variable from test data
trueValues <- test$Interest.Rate
test$Interest.Rate <- NULL 
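
Here is a small sketch of the same 80/20 split on a made-up 10-row data frame (the `toy` data and its `id` column are illustrative only), to confirm that each row ends up in exactly one of the two sets.

```r
# Made-up data frame with an id column so rows can be tracked
set.seed(15)
toy <- data.frame(id = 1:10, y = rnorm(10))

# Sample 80% of the row positions for training; the rest is for testing
sampleSize <- ceiling(nrow(toy) * 0.8)
position <- sample(nrow(toy), sampleSize)
train <- toy[position, ]
test  <- toy[-position, ]

# Sizes of the two sets and any ids that appear in both (should be none)
print(nrow(train))
print(nrow(test))
print(intersect(train$id, test$id))
```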

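# Fit significant factors in training data
trainLM <- lm(Interest.Rate ~ Amount.Requested + Loan.Length + Amount.Funded.By.Investors +
                      Monthly.Income + Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months +
                      FICO.Mean, data = train)

# Predict interest rate for the test data
interestRateHat <- predict(trainLM, newdata = test)

# Calculate Root Mean Squared Error (square each error before averaging)
RMSE <- sqrt(mean((trueValues - interestRateHat)^2))
RMSEpercent <- RMSE/mean(trueValues) * 100


Absolute mean error, sqrt(mean(trueValues - interestRateHat)^2) = 0.02456184
Absolute mean error as a percentage of the mean interest rate = 0.1875415%

These two figures square the mean of the errors rather than averaging the squared errors, so they measure the average bias of the predictions, not the RMSE. The RMSE computed by the code above is always at least as large as the absolute mean error.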
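
Since the placement of the square matters a lot here, this is a minimal worked example on made-up numbers (the `actual` and `predicted` vectors are purely illustrative). RMSE squares each error before averaging; squaring the mean of the errors instead lets positive and negative errors cancel.

```r
# Made-up actual and predicted values
actual    <- c(10, 12, 14, 16)
predicted <- c(11, 11, 16, 14)
errors    <- actual - predicted   # -1, 1, -2, 2

# Absolute mean error: positive and negative errors cancel out
absMeanError <- sqrt(mean(errors)^2)

# Root Mean Squared Error: each error is squared before averaging
rmse <- sqrt(mean(errors^2))

print(c(absMeanError = absMeanError, rmse = rmse))
```

In this example the errors average to zero, so the first measure reports no error at all even though every prediction is off by 1 or 2.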


















