Classification Models (Logistic Regression)

July 22, 2019

9 minute read

Logistic regression is the last method I shared with the Army unit. I believe logistic regression is an excellent classification model. You are limited to two outcomes, but it is interpretable.

Logistic Regression

This example uses the titanic survival data to walk through a logistic regression example. The question is, what increases your probability of survival.

titanic = titanic::titanic_train

Exploring the Data

Figures 1, 2, and 3 graphically depict the relationship between possible explanatory variables and the response variable (surviving). We are looking for a easily identifiable split between surviving and not surviving. Normally you will not be lucky enough to have a clean split, but some are more separated than others.

titanic%>%
  ggplot(aes(x = Age, y = Survived))+
  geom_jitter(height = .1, width = .1)+
  scale_y_continuous(limits = c(-0.1,1.1), breaks = c(0,1))+
  labs(x = "Age (years)", y = "Survived (1 = yes, 0 = no)")

## Warning: Removed 177 rows containing missing values (geom_point).

Figure 1: Age versus Surviving

titanic%>%
  ggplot(aes(x = Pclass, y = Survived))+
  geom_jitter(height = 0.1, width = 0.1)+
  scale_y_continuous(limits = c(-0.1,1.1), breaks = c(0,1))+
  labs(x = "Passenger Class", y = "Survived (1 = yes, 0 = no)")

Figure 2: Passenger Class versus Surviving

titanic%>%
  ggplot(aes(x = Sex, y = Survived))+
  geom_jitter(height = 0.1, width = 0.1)+
  scale_y_continuous(limits = c(-0.1,1.1), breaks = c(0,1))+
  labs(x = "Passenger Class", y = "Survived (1 = yes, 0 = no)")

Figure 3: Gender versus Surviving

Making a logistic model using gender as the explanatory varaible

A logistic model returns a probability of having a success (surviving). The coefficients in the model of very little meaning. We are often concerned with an odds ratio. An odds ratio has a real interpretation. Example a team is 2 times more likely to win. We will use gender to explain how to interpret the logistic regression model.

gender_model = glm(Survived ~ Sex, data = titanic, family = binomial)

summary(gender_model)

## 
## Call:
## glm(formula = Survived ~ Sex, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6462  -0.6471  -0.6471   0.7725   1.8256  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0566     0.1290   8.191 2.58e-16 ***
## Sexmale      -2.5137     0.1672 -15.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  917.8  on 889  degrees of freedom
## AIC: 921.8
## 
## Number of Fisher Scoring iterations: 4

If I were to write out the model mathematically it would look like this.

\[p(x) = 1.0566 - 2.5137*Sex\] \[\text{sex}=1\text{ if male}, 0\text{ if female}\] \[p(x) = \text{ The probability of surviving}\]

The value -2.5137 is the log-odds which has no common meaning. If we manipulate it by \(e^{\beta_1}\) we get the odds ratio which has a commonly understood meaning. In this case we get the following odds ratio

exp(gender_model$coefficients[2])

##    Sexmale 
## 0.08096732

The odds ratio is 0.081. Because the explanatory variable is categorical, if you are male you are 0.081 times more likely to survive than a female. This is still a little weird to explain, but it does mean you are much more likely to die if you are male. We can switch the survival coding to make it a little more interpretable.

gender_model_flip = glm(Survived ~ Sex, data = titanic%>%
                          mutate(Survived = ifelse(Survived==1,0,1)), family = binomial)

summary(gender_model_flip)

## 
## Call:
## glm(formula = Survived ~ Sex, family = binomial, data = titanic %>% 
##     mutate(Survived = ifelse(Survived == 1, 0, 1)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8256  -0.7725   0.6471   0.6471   1.6462  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0566     0.1290  -8.191 2.58e-16 ***
## Sexmale       2.5137     0.1672  15.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  917.8  on 889  degrees of freedom
## AIC: 921.8
## 
## Number of Fisher Scoring iterations: 4

exp(gender_model_flip$coefficients[2])

##  Sexmale 
## 12.35066

In this case, we would say that you are 12.35 times more likely to die if you are male than female.

How well does our model do?

titanic_pred = titanic%>%
  mutate(prediction_probs = predict(gender_model, titanic, type = "response"))%>%
  mutate(pred_survival = ifelse(prediction_probs>=.5,1,0))

table(titanic_pred$Survived, titanic_pred$pred_survival)

##    
##       0   1
##   0 468  81
##   1 109 233

With this simple model we correctly identified 78.7% of the outcomes, but is this good. A naive approach would be that I state everyone survived. In this data set that would result in being correct 61.6% of the time.

Good I do better with more predictors?

fitControl = trainControl(
  method = "cv",
  number = 10,
  savePredictions = TRUE)

titanic_model = train(as.factor(Survived) ~ Sex,
                      data = titanic,
                      method = "glm",
                      family = binomial(),
                      trControl = fitControl,
                      na.action = na.pass)

titanic_model

## Generalized Linear Model 
## 
## 891 samples
##   1 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 801, 801, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7867348  0.5421422

summary(titanic_model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6462  -0.6471  -0.6471   0.7725   1.8256  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0566     0.1290   8.191 2.58e-16 ***
## Sexmale      -2.5137     0.1672 -15.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  917.8  on 889  degrees of freedom
## AIC: 921.8
## 
## Number of Fisher Scoring iterations: 4

titanic_model = train(as.factor(Survived) ~ Sex + Age,
                      data = titanic,
                      method = "glm",
                      family = binomial(),
                      trControl = fitControl,
                      na.action = na.pass)

titanic_model

## Generalized Linear Model 
## 
## 891 samples
##   2 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 803, 801, 801, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7794228  0.5356102

summary(titanic_model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7405  -0.6885  -0.6558   0.7533   1.8989  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.277273   0.230169   5.549 2.87e-08 ***
## Sexmale     -2.465920   0.185384 -13.302  < 2e-16 ***
## Age         -0.005426   0.006310  -0.860     0.39    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 749.96  on 711  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 755.96
## 
## Number of Fisher Scoring iterations: 4

titanic_model = train(as.factor(Survived) ~ Sex + Age + Pclass,
                      data = titanic,
                      method = "glm",
                      family = binomial(),
                      trControl = fitControl,
                      na.action = na.pass)

titanic_model

## Generalized Linear Model 
## 
## 891 samples
##   3 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 803, 801, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.79171   0.5641345

summary(titanic_model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7270  -0.6799  -0.3947   0.6483   2.4668  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.056006   0.502128  10.069  < 2e-16 ***
## Sexmale     -2.522131   0.207283 -12.168  < 2e-16 ***
## Age         -0.036929   0.007628  -4.841 1.29e-06 ***
## Pclass      -1.288545   0.139259  -9.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.29  on 710  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 655.29
## 
## Number of Fisher Scoring iterations: 5

I can now compare the accuracy measures for each of the models and decide on the optimal model.

confusionMatrix(titanic_model)

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 50.0 11.5
##          1  9.4 29.1
##                             
##  Accuracy (average) : 0.7913

Is your model just guessing

ROC can be seen as a grade. A 1 is an A+, a .5 is an F, or really your model is just guessing.

Here are two ways to look at it, one visual and one with the numeric calculation.

titanic_predictions = predict(gender_model, titanic,
                              type = "response")

colAUC(titanic_predictions,titanic$Survived, plotROC = T)

##              [,1]
## 0 vs. 1 0.7668728

fitControl = trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE)

titanic_model = train(make.names(Survived) ~ Sex,
                      data = titanic,
                      method = "glm",
                      trControl = fitControl,
                      na.action = na.pass)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.

## + Fold01: parameter=none 
## - Fold01: parameter=none 
## + Fold02: parameter=none 
## - Fold02: parameter=none 
## + Fold03: parameter=none 
## - Fold03: parameter=none 
## + Fold04: parameter=none 
## - Fold04: parameter=none 
## + Fold05: parameter=none 
## - Fold05: parameter=none 
## + Fold06: parameter=none 
## - Fold06: parameter=none 
## + Fold07: parameter=none 
## - Fold07: parameter=none 
## + Fold08: parameter=none 
## - Fold08: parameter=none 
## + Fold09: parameter=none 
## - Fold09: parameter=none 
## + Fold10: parameter=none 
## - Fold10: parameter=none 
## Aggregating results
## Fitting final model on full training set

titanic_model

## Generalized Linear Model 
## 
## 891 samples
##   1 predictor
##   2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 801, 802, 802, 802, 802, 802, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7670356  0.8523906  0.6816807

Age is more continuous which will give a better visualization of the ROC curve

age_model = glm(Survived ~ Age, data = titanic, family = binomial)

summary(age_model)

## 
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1488  -1.0361  -0.9544   1.3159   1.5908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.05672    0.17358  -0.327   0.7438  
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.23  on 712  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 964.23
## 
## Number of Fisher Scoring iterations: 4

titanic_predictions = predict(age_model, titanic,
                              type = "response")

colAUC(titanic_predictions,titanic$Survived, plotROC = T)

## Warning in matrix(rep(idx, nL), nR, nL): data length [1428] is not a sub-
## multiple or multiple of the number of rows [891]

##              [,1]
## 0 vs. 1 0.5130549

Bryan Adams' Blog

Classification Models (Logistic Regression)

Logistic Regression

Exploring the Data

Making a logistic model using gender as the explanatory varaible

How well does our model do?

Good I do better with more predictors?

Is your model just guessing

Recent posts

Using purrr - map a function

CNN with keras

Using RSelenium for Literature Reivew

Classification Models (Logistic Regression)

Classification Models (LDA)

Categories

About