This notebook shows the strengths of linear discriminant analysis (LDA) in a simple study of gender prediction. First, LDA is illustrated. Later on, it is compared with quadratic discriminant analysis (QDA) and logistic regression.

Read the data

The dataset holds data on weight, height and gender of 10,000 individuals and was taken from https://www.kaggle.com/mustafaali96/weight-height.

# Load the packages used throughout the notebook
library(tidyverse)   # dplyr, ggplot2, ...
library(MASS)        # lda(), qda()
# (my_accuracy() and decisionplot() are custom helper functions, not defined in the code shown here)

life_data <- read.csv("weight-height.csv")

# height in inches, weight in pounds, transform to cm and kg
life_data <- life_data %>%
  mutate(Height = Height * 2.54,
         Weight = Weight * 0.454,
         Gender = as.factor(Gender))

glimpse(life_data)
## Rows: 10,000
## Columns: 3
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Mal…
## $ Height <dbl> 187.5714, 174.7060, 188.2397, 182.1967, 177.4998, 170.8227, 174…
## $ Weight <dbl> 109.81968, 73.68895, 96.58435, 99.89928, 93.68281, 69.10432, 83…

Analyse height only

First, consider only height. Let us look at its relationship with gender.

# visualize the data
life_data %>% ggplot(aes(x=Gender, y=Height, fill=Gender)) + 
  geom_boxplot()

life_data %>% ggplot(aes(x=Height, fill=Gender)) +
  geom_density(alpha=0.6)

Linear discriminant analysis (LDA)

The previous graphs confirm the well-known relationship between height and gender. Additionally, the density plot supports the assumptions of LDA: height appears to be normally distributed within each class, and the variance seems to be approximately equal across both groups. Now, let us run LDA and examine the results:
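
A quick numerical check of these assumptions (a small sketch using dplyr, which is already loaded) compares the per-group means and standard deviations:

# Group statistics of Height; similar standard deviations support the equal-variance assumption
life_data %>%
  group_by(Gender) %>%
  summarise(mean_height = mean(Height), sd_height = sd(Height))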

# run LDA
lda_res <- lda(Gender ~ Height, life_data)
lda_res
## Call:
## lda(Gender ~ Height, data = life_data)
## 
## Prior probabilities of groups:
## Female   Male 
##    0.5    0.5 
## 
## Group means:
##          Height
## Female 161.8203
## Male   175.3269
## 
## Coefficients of linear discriminants:
##              LD1
## Height 0.1415641
# Take a look at the predicted posterior probabilities for each gender
gender_pred <- predict(lda_res, life_data)
# View(gender_pred$posterior)

# Check the confusion matrix and the accuracy of predictions
# Note: this is the train accuracy; however, there is no big risk of overfitting since the sample size is large and the model has only 5 parameters
my_accuracy(gender_pred$class, 
            life_data$Gender,
            print_confmat = TRUE)
## 
## Confusion matrix:
##         reference
## preds    Female Male
##   Female   4189  861
##   Male      811 4139
## 
##  Accuracy: 0.8328
# Can you guess the threshold value by which we (sensibly) discriminate between the genders?
# You can use this DF
gender_pred_df <- data.frame(
  Height = life_data$Height,
  Gender_pred = gender_pred$class
)

# Un-comment to view the predictions
# View(gender_pred_df)

lda_decisionBoundary <- gender_pred_df %>% 
  filter(Gender_pred == "Female") %>% 
  dplyr::select(Height) %>% 
  max

paste("The LDA decision boundary in height:", round(lda_decisionBoundary,2))
## [1] "The LDA decision boundary in height: 168.57"

In this case, the method simply thresholds the samples: those with a height below 168.57 cm are classified as women, while those with a height of 168.57 cm or above are classified as men. Despite its simplicity, the method reaches a train accuracy of 83%.
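
As a sanity check, the fitted LDA labels can be reproduced by thresholding height directly (a small sketch using the objects created above):

# Fraction of samples where simple thresholding agrees with the LDA prediction (should be 1)
manual_pred <- ifelse(life_data$Height > lda_decisionBoundary, "Male", "Female")
mean(manual_pred == as.character(gender_pred$class))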

Understand LDA by arriving at the same results yourself

The core of LDA is Bayes' theorem: \[Pr(Male|Height) = \frac{Pr(Height|Male)} {Pr(Height|Male) + Pr(Height|Female)}\] Note that the above expression is a simplified form of the theorem since the prior probabilities of the genders are equal, i.e., \(P(Male)=P(Female)\). LDA further assumes the distribution of heights within each gender to be normal, \(Height|Male \sim \mathcal{N}(\mu_{Male},\sigma^2)\), with a variance shared by both classes. All you need to do is estimate the parameters of the normal distributions and use them in the Bayes formula.
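
The results below were produced by code that is left for you to write; one possible sketch (assuming the object names mu_f, mu_m and sd_all, which the plotting code further down also uses) could look as follows:

# Estimate the parameters of the class-conditional normal distributions
mu_f <- mean(life_data$Height[life_data$Gender == "Female"])
mu_m <- mean(life_data$Height[life_data$Gender == "Male"])
# LDA assumes a common variance; here the group standard deviations are simply averaged
sd_all <- (sd(life_data$Height[life_data$Gender == "Female"]) +
           sd(life_data$Height[life_data$Gender == "Male"])) / 2

# Posterior probability of Male from the simplified Bayes formula (equal priors)
postM <- dnorm(life_data$Height, mu_m, sd_all) /
  (dnorm(life_data$Height, mu_m, sd_all) + dnorm(life_data$Height, mu_f, sd_all))
head(postM)

# Classify by the larger posterior, check the accuracy and the decision boundary
my_accuracy(factor(ifelse(postM > 0.5, "Male", "Female")), life_data$Gender)
(mu_f + mu_m) / 2   # with equal priors and variances, the boundary lies halfway between the means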

## [1] 0.9942165 0.8404279 0.9951696 0.9756555 0.9182121 0.6477859
##        Female      Male
## 1 0.005810296 0.9941897
## 2 0.159773209 0.8402268
## 3 0.004853568 0.9951464
## 4 0.024423757 0.9755762
## 5 0.081951918 0.9180481
## 6 0.352339590 0.6476604
## 
##  Accuracy: 0.8328
## [1] 168.5736
## [1] 168.5736

Both the decision boundary and accuracy are the same as in the lda() run.

Plot the LDA decision boundary

# Create normal approximations of our data (i.e., the "ideal data") that LDA internally works with
height_seq <- seq(min(life_data$Height), max(life_data$Height), length.out=nrow(life_data)/2)

ideal_life_data <- data.frame(
  height=height_seq, 
  # Densities of heights under the normality assumption
  Female=dnorm(height_seq, mean=mu_f, sd=sd_all), 
  Male = dnorm(height_seq, mean=mu_m, sd=sd_all)
) 

# Reshape for plotting purposes
ideal_life_data <- ideal_life_data %>%
  reshape2::melt(id.vars="height", variable.name = "Gender")

# Plot the distributions, their normal approximations and the decision boundary (or point in this 1D case)
life_data %>% ggplot(aes(x=Height, fill=Gender)) +
  geom_density(alpha=0.6) +
  geom_line(data = ideal_life_data, aes(x=height, y=value), linetype = 2)  +
  # Add the decision boundary
  geom_vline(xintercept = (mu_f + mu_m) / 2, size=1) +
  ggtitle("Decision boundary\n (dotted lines indicate normal approximations)") +
  theme_minimal()

# Because the standard deviations are averaged, the decision boundary does not pass exactly through the intersection of the empirical class densities

Work with both height and weight

Now, let us include both features in our LDA model.

# First, observe how the joint distribution of Height and Weight looks for each gender
life_data %>%
  ggplot(aes(x=Height, y=Weight, col=Gender)) +
  geom_point(alpha=0.1)

# run LDA function
lda_hw <- lda(Gender ~ ., life_data)
lda_hw
## Call:
## lda(Gender ~ ., data = life_data)
## 
## Prior probabilities of groups:
## Female   Male 
##    0.5    0.5 
## 
## Group means:
##          Height   Weight
## Female 161.8203 61.68048
## Male   175.3269 84.90736
## 
## Coefficients of linear discriminants:
##                LD1
## Height -0.06766107
## Weight  0.15646959
gender_pred <- predict(lda_hw, life_data)$class

# Accuracy, about 9 percentage points better than with Height only
my_accuracy(gender_pred, life_data$Gender)
## 
##  Accuracy: 0.9193
decisionplot(lda_hw, life_data, class = "Gender", main = "LDA", resolution=200)

# Question: Again, why is the decision boundary linear?
# Short answer: Using the formula for the normal distribution density and setting P(man|X)=P(woman|X) we can 
# derive a term for the boundary, which will be linear in X. 
# For details refer to "Introduction to statistical learning in R", page 142

With weight, the accuracy improves to nearly 92%. The decision boundary is linear. The coefficients of linear discriminants returned by the lda() function represent the weights used to transform the predictor variables into a new space where the classes are best separated. Additionally, they represent the normal vector to the decision boundary, i.e., the vector that is perpendicular to this boundary.
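
The separation achieved in this new space can be inspected directly: predict() returns the LD1 score of each sample, which can be plotted by class (a small sketch):

# Density of the LD1 scores per gender; the overlap corresponds to the misclassified samples
data.frame(LD1 = predict(lda_hw, life_data)$x[, "LD1"],
           Gender = life_data$Gender) %>%
  ggplot(aes(x = LD1, fill = Gender)) +
  geom_density(alpha = 0.6)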

LDA with two predictors performed with the Bayes formula

With two features, the Bayes formula extends to: \[Pr(Male|Height,Weight) = \frac{Pr(Height,Weight|Male)} {Pr(Height,Weight|Male) + Pr(Height,Weight|Female)}\]

# estimate parameters of distributions, substitute them into the Bayes formula
library(mvtnorm)
mu_f <- colMeans(life_data[life_data$Gender=="Female",c(2,3)])
mu_m <- colMeans(life_data[life_data$Gender=="Male",c(2,3)])
cov_f <- cov(life_data[life_data$Gender=="Female",c(2,3)])
cov_m <- cov(life_data[life_data$Gender=="Male",c(2,3)])
cov_avg <- (cov_f + cov_m)/2

# Again, compute the posterior for the samples (simplified by the fact that P(Male)=P(Female))
postM <- dmvnorm(life_data[c(2,3)], mu_m, cov_avg) / 
  (dmvnorm(life_data[c(2,3)], mu_m, cov_avg) + dmvnorm(life_data[c(2,3)], mu_f, cov_avg))
head(postM)
##          1          2          3          4          5          6 
## 0.99999416 0.27675255 0.99815610 0.99985189 0.99912096 0.09998952
gender_pred <- postM > 0.5

my_accuracy(gender_pred, life_data$Gender)
## 
##  Accuracy: 0.9193
# For comparison: the train accuracy of logistic regression (fitted later in this notebook) is 0.9194

Quadratic discriminant analysis (QDA)

Although it looks like the LDA assumptions are met, we will run QDA and thus allow each class to have its own covariance matrix. This means the spread and correlation of the predictors can vary between classes. We will deal with more model parameters and obtain a more flexible model.
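
To be concrete about the added flexibility: with \(p=2\) predictors and two classes, LDA estimates 4 group means plus the 3 free entries of a single pooled covariance matrix, while QDA estimates a separate covariance matrix per class, i.e., 4 means plus \(2 \times 3 = 6\) covariance entries.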

qda_res <- qda(Gender ~ .,life_data)
qda_res
## Call:
## qda(Gender ~ ., data = life_data)
## 
## Prior probabilities of groups:
## Female   Male 
##    0.5    0.5 
## 
## Group means:
##          Height   Weight
## Female 161.8203 61.68048
## Male   175.3269 84.90736
gender_pred <- predict(qda_res, life_data)$class

# Evaluate accuracy
# almost the same train accuracy as LDA, more freedom does not help here
my_accuracy(gender_pred, life_data$Gender)
## 
##  Accuracy: 0.9192
decisionplot(qda_res, life_data, class = "Gender", main = "QDA", resolution=200) # decision boundary close to linear

qda_res$scaling # both the genders scale nearly equally (scaling transforms the observations so that within-group covariance matrices are spherical)
## , , Female
## 
##                1          2
## Height 0.1460161 -0.2352167
## Weight 0.0000000  0.2195462
## 
## , , Male
## 
##                1          2
## Height -0.137496 -0.2348510
## Weight  0.000000  0.2203915
var(as.matrix(subset(life_data, Gender=="Male",2:3)) %*% qda_res$scaling[,,"Male"]) # after scaling, the group variances equal 1 and covariances 0
##               1             2
## 1  1.000000e+00 -1.795308e-15
## 2 -1.795308e-15  1.000000e+00

QDA reaches approximately the same train accuracy as LDA; it seems that the additional complexity does not help much here. The decision boundary in QDA can take the form of parabolas, ellipses, or hyperbolas in 2D space, or their higher-dimensional equivalents. In our case it is close to linear, as the class covariance matrices do not differ much.
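
The last point can be verified with the class covariance matrices estimated earlier in the Bayes-formula section (cov_f and cov_m):

# The two class covariance matrices are similar, leaving QDA little room to improve on LDA
cov_f
cov_m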

Logistic regression

Finally, we will solve both of the previous classification tasks with logistic regression. We will start with the simple height-only model.

# run Logistic regression - start with a simple Height model
lr <- glm(Gender ~ Height, life_data, family=binomial)
# Look at the model - does the coefficient for Height have any (meaningful) interpretation?
# Maybe revise what quantity Y = B0+B1x1+... represents in logistic regression
lr
## 
## Call:  glm(formula = Gender ~ Height, family = binomial, data = life_data)
## 
## Coefficients:
## (Intercept)       Height  
##    -45.3001       0.2689  
## 
## Degrees of Freedom: 9999 Total (i.e. Null);  9998 Residual
## Null Deviance:       13860 
## Residual Deviance: 7542  AIC: 7546
# Obtain the probability of being a Man for train data
lr_prob <- predict(lr, life_data, type="response")
gender_pred <- lr_prob > 0.5

# Compare the 2 calls below. Can you guess how they differ and how to get from one to the other?
# HINT: Recall, how to get from Y = B0+B1x1+... to the probability
predict(lr, life_data, type="response") %>% head
##         1         2         3         4         5         6 
## 0.9941160 0.8416663 0.9950788 0.9755074 0.9184738 0.6517203
predict(lr, life_data)                  %>% head
##         1         2         3         4         5         6 
## 5.1296161 1.6706788 5.3092777 3.6845852 2.4217888 0.6266095
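# The second call returns the linear predictor B0 + B1*Height, i.e. the log-odds; applying
# the logistic (sigmoid) function recovers the probabilities of the first call:
predict(lr, life_data) %>% plogis %>% head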
# Evaluate accuracy of Logistic regression
my_accuracy(gender_pred, life_data$Gender, print_confmat = TRUE)
## 
## Confusion matrix:
##        reference
## preds   Female Male
##   FALSE   4171  845
##   TRUE     829 4155
## 
##  Accuracy: 0.8326
# Prepare data to plot the sigmoid
sigmoid_data <- life_data %>%
  mutate(lr_prob = lr_prob,
         gender_num = as.numeric(Gender) - 1)

# Using the sigmoid data, estimate the decision boundary of the logistic regression
# You will see a better way to analytically find the decision boundary
lr_decisionBoundary <- sigmoid_data %>% filter(lr_prob > 0.5) %>% pull(Height) %>% min() %>% round(digits=3)
# compared to lda_decisionBoundary, a tiny difference exists
sprintf("LR decision boundary: %s", lr_decisionBoundary)
## [1] "LR decision boundary: 168.498"
# Plot the sigmoid
sigmoid_data %>%   
  ggplot(aes(x=Height, y=lr_prob)) +
  geom_point(col="red") +
  geom_point(aes(x=Height, y=gender_num)) +
  # Logistic regression decision boundary
  geom_vline(xintercept = lr_decisionBoundary, col="blue") +
  geom_hline(yintercept = 0.5, linetype="dotted") +
  ylab("P(Gender=Male|Height)")

The logistic regression decision boundary as well as the train accuracy nearly perfectly match the outcomes obtained by LDA. Both methods have enough data to learn from, the problem is simple, and the conceptual differences between the methods do not matter in this task.

Fit LR with both Height and Weight

lr_hw <- glm(Gender ~ ., life_data, family=binomial)
lr_hw
## 
## Call:  glm(formula = Gender ~ ., family = binomial, data = life_data)
## 
## Coefficients:
## (Intercept)       Height       Weight  
##      0.6925      -0.1939       0.4369  
## 
## Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
## Null Deviance:       13860 
## Residual Deviance: 4183  AIC: 4189
# Predict the gender and assess the accuracy
gender_pred <- predict(lr_hw, life_data, type = "response") > 0.5
my_accuracy(gender_pred, life_data$Gender)
## 
##  Accuracy: 0.9194
# Plotting decision boundary
class(lr_hw) <- c("lr", class(lr_hw)) # a bit of adjusting for plotting the decision boundary
# the decision boundary is linear!
predict.lr <- function(object, newdata, ...)
  predict.glm(object, newdata, type = "response") > .5
decisionplot(lr_hw, life_data, class = "Gender", main = "Logistic regression", resolution=200) 

# Question: Why is the logistic regression decision boundary linear? Can you derive the equation?
# HINT: Refer again to the equation of LR. What probability do the points lying on the decision boundary have?

Again, the decision boundary and train accuracy nearly perfectly match the outcomes obtained by LDA. Let us compare the slope of the decision boundary via the coefficients. In LDA, the normalized normal vector to the boundary is (-0.3969039, 0.9178602); in LR the same vector equals (-0.4057524, 0.913983). Both vectors can be obtained from the coefficients returned by the respective methods.
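
One way to obtain those normalized vectors from the fitted models (a short sketch using lda_hw and lr_hw from above):

# LDA: normalize the coefficients of the linear discriminant
w_lda <- lda_hw$scaling[, "LD1"]
w_lda / sqrt(sum(w_lda^2))

# Logistic regression: normalize the Height and Weight coefficients (the intercept only shifts the boundary)
w_lr <- coef(lr_hw)[c("Height", "Weight")]
w_lr / sqrt(sum(w_lr^2))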

Independent work

  1. Which method is expected to work best on test data in this task (LDA, QDA or LR)? Answer without testing first, using your knowledge of the individual methods' assumptions.
  2. Experimentally verify your answer. Note that you may need to work with different (and sometimes very small) train sets to see any difference … you have already seen that with 10,000 samples the differences can be negligible.