This post is a continuation of the Lending Club Data Analysis (Linear Regression Approach). I was going to start a new project, but I found a source that uses Lending Club data to teach how to use IPython to develop a simple logistic regression model. I will be using R to develop a simple logistic regression model instead. The first step is to clean and understand the data (data exploration).
The result in brief: for the Miss X example worked through below, the predicted probability is 0.6464171, which is above the 0.4 cutoff, so the model classifies her request for $15,000 from Lending Club at an interest rate of 10% or less as approved.
Let's assume Miss X, a computer scientist and bike enthusiast, earns $6,500 a month and is interested in purchasing a performance bike that costs $15,000. She has a FICO score of 750. She wants to know if she can borrow $15,000 from Lending Club at an interest rate of 10% or less.
In the previous post I identified the significant variables, and I will use them here to develop a simple logistic regression model:
logit(P(Approval.Indicator = 1)) = b0 + b1 * FICO.Mean + b2 * Amount.Requested + b3 * Monthly.Income
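Because the model is logistic, the right-hand side is the log odds of approval, and the approval probability follows by inverting the logit. With p denoting that probability (a symbol introduced here for illustration, not taken from the data set), the standard form is:

```latex
\[
\log\frac{p}{1-p} = b_0 + b_1\,\mathrm{FICO.Mean} + b_2\,\mathrm{Amount.Requested} + b_3\,\mathrm{Monthly.Income}
\]
\[
p = \frac{1}{1 + \exp\bigl(-(b_0 + b_1\,\mathrm{FICO.Mean} + b_2\,\mathrm{Amount.Requested} + b_3\,\mathrm{Monthly.Income})\bigr)}
\]
```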
Since the data set has no "loan approved at an interest rate of 10% or less" variable, I will create an "approval indicator" variable from the interest rate.
# First add an indicator variable which indicates whether interest rate is <= 10
loanData$Indicator <- loanData$Interest.Rate <= 10
head(loanData)
summary(loanData)
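# Standard deviation of each numeric column (sd() fails or returns NA on factor/character columns)
sapply(loanData[sapply(loanData, is.numeric)], sd)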
# Fit a logit model using glm
logitModel <- glm(Indicator ~ FICO.Mean + Amount.Requested + Monthly.Income, data = loanData, family = "binomial")
summary(logitModel)
confint(logitModel)
For every one-unit increase in FICO.Mean, the log odds of loan approval at an interest rate of 10% or less increase by 0.07224, holding the other predictors fixed.
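The log-odds reading can be checked numerically. The sketch below fits the same formula to simulated data (the `toy` data frame and the coefficients used to generate it are made up, not the Lending Club data) and compares two hypothetical applicants who differ by a single FICO point. It is only meant to illustrate what the coefficient measures.

```r
# Simulated stand-in for loanData; every value here is made up for illustration.
set.seed(1)
n <- 500
toy <- data.frame(
  FICO.Mean        = runif(n, 640, 830),
  Amount.Requested = runif(n, 1000, 35000),
  Monthly.Income   = runif(n, 1500, 15000)
)
# Hypothetical "true" relationship, used only to generate the labels
p <- plogis(-50 + 0.07 * toy$FICO.Mean - 0.0001 * toy$Amount.Requested)
toy$Indicator <- runif(n) < p

fit <- glm(Indicator ~ FICO.Mean + Amount.Requested + Monthly.Income,
           data = toy, family = "binomial")

# Two hypothetical applicants who differ only by one FICO point
a <- data.frame(FICO.Mean = 700, Amount.Requested = 15000, Monthly.Income = 6500)
b <- transform(a, FICO.Mean = FICO.Mean + 1)

# type = "link" returns the log odds instead of the probability
logodds_diff <- predict(fit, newdata = b, type = "link") -
  predict(fit, newdata = a, type = "link")

# The difference in log odds equals the fitted FICO.Mean coefficient
all.equal(unname(logodds_diff), unname(coef(fit)["FICO.Mean"]))
```

The same check works for any predictor: changing it by one unit while holding the others fixed shifts the log odds by exactly its coefficient.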
par(mfrow = c(2, 2))
plot(logitModel)
# Odds ratio and 95% CI
exp(cbind(OR = coef(logitModel), confint(logitModel)))
For a one-unit increase in FICO.Mean, the odds of loan approval at an interest rate of 10% or less increase by a factor of 1.0749 (about 7.5%), holding the other predictors fixed.
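The odds ratio is just the exponential of the log-odds coefficient, so the two numbers quoted in this post can be tied together with a line of arithmetic. The 10-point step below is a hypothetical choice for illustration; it shows how the effect compounds over larger FICO differences.

```r
b_fico <- 0.07224          # log-odds coefficient for FICO.Mean quoted in this post

or_1  <- exp(b_fico)       # odds multiplier for a 1-point FICO increase
or_10 <- exp(10 * b_fico)  # odds multiplier for a 10-point increase (hypothetical step)

round(or_1, 4)   # 1.0749
or_10            # a little over 2, i.e. the odds roughly double
```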
boxplot(predict(logitModel, type = "response") ~ loanData$Indicator, col = "blue")
# Choosing a cutoff (re-substitution): count misclassified training cases at each candidate cutoff
temp <- seq(0, 1, length.out = 20)
probs <- predict(logitModel, type = "response")
err <- rep(NA, length(temp))
for (i in seq_along(temp)) {
  err[i] <- sum((probs > temp[i]) != loanData$Indicator)
}
plot(temp, err, pch = 19, col = "red", xlab = "Cutoff", ylab = "Error")
The error is smallest when the cutoff is approximately 0.4, so:
# Simple cutoff: Prob > 0.40 means loan approved, otherwise loan not approved.
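Rather than reading the cutoff off the plot, the minimum can also be picked programmatically. This is a minimal sketch on made-up probabilities and outcomes (the vectors `probs` and `actual` are illustrative, not model output), using `which.min` to take the first cutoff with the fewest misclassifications.

```r
# Made-up predicted probabilities and true outcomes, for illustration only
probs  <- c(0.03, 0.22, 0.37, 0.46, 0.58, 0.71, 0.86, 0.97)
actual <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

cutoffs <- seq(0, 1, length.out = 21)   # candidate cutoffs 0.00, 0.05, ..., 1.00

# Number of misclassified cases at each cutoff
err <- sapply(cutoffs, function(k) sum((probs > k) != actual))

# First cutoff that attains the smallest error
best <- cutoffs[which.min(err)]
best       # 0.4 for this toy data
min(err)   # 1 misclassified case
```

Since the error is counted on the same data the model was fitted to, the chosen cutoff is tuned to that sample and the error is likely to look better than it would on new loans.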
Checking Model Performance
Performance <- predict(logitModel, type = "response") > 0.4
table(loanData$Indicator, Performance)
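The table above is a confusion matrix, and a few summary rates can be read from it. The sketch below uses made-up `actual` and `predicted` vectors (not the Lending Club results) to show one way of computing accuracy, sensitivity, and specificity from such a table.

```r
# Made-up outcomes and predictions, for illustration only
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)

cm <- table(Actual = actual, Predicted = predicted)
cm

# Share of cases on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / sum(cm)

# Of the cases that really were approved, the share the model flagged as approved
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])

# Of the cases that really were not approved, the share the model flagged as not approved
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

This assumes both classes appear among the predictions; if the model predicts only one class, the table is no longer 2 x 2 and the indexing needs adjusting.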
Now let's calculate the probability that Miss X's request for a $15,000 loan from Lending Club at an interest rate of 10% or less will be approved, given her FICO score of 750 and monthly income of $6,500:
missX <- data.frame(FICO.Mean = 750, Amount.Requested = 15000, Monthly.Income = 6500)
predict(logitModel, newdata = missX, type = "response")
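To see what `predict(..., type = "response")` does under the hood, the probability can be rebuilt by hand from the coefficients. The sketch below uses a model fitted to simulated data (the `toy` data frame and its generating coefficients are made up, so the resulting probability is not the one from the Lending Club model) and checks that the manual calculation matches `predict`.

```r
# Simulated stand-in for loanData; the numbers are made up for illustration.
set.seed(42)
n <- 500
toy <- data.frame(
  FICO.Mean        = runif(n, 640, 830),
  Amount.Requested = runif(n, 1000, 35000),
  Monthly.Income   = runif(n, 1500, 15000)
)
toy$Indicator <- runif(n) <
  plogis(-50 + 0.07 * toy$FICO.Mean - 0.0001 * toy$Amount.Requested)

fit <- glm(Indicator ~ FICO.Mean + Amount.Requested + Monthly.Income,
           data = toy, family = "binomial")

applicant <- data.frame(FICO.Mean = 750, Amount.Requested = 15000, Monthly.Income = 6500)

# Linear predictor: b0 + b1 * FICO.Mean + b2 * Amount.Requested + b3 * Monthly.Income
b   <- coef(fit)
eta <- b["(Intercept)"] +
  b["FICO.Mean"]        * applicant$FICO.Mean +
  b["Amount.Requested"] * applicant$Amount.Requested +
  b["Monthly.Income"]   * applicant$Monthly.Income

p_manual  <- plogis(unname(eta))   # inverse logit: 1 / (1 + exp(-eta))
p_predict <- unname(predict(fit, newdata = applicant, type = "response"))

all.equal(p_manual, p_predict)
```

The predicted probability is then compared against the chosen cutoff to turn it into an approved or not-approved call.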