This script stems from the lab accompanying Chapter 3 of Introduction to Statistical Learning and follows on from the previous script analyzing Boston data. This time we will work with multiple linear regression and will try to identify the model of right size.
help(Boston) # learn the structure of the dataset, 506 areas described by 14 variables, medv will serve as the target variable
# employ all the available independent variables
lm.fit.all <- lm(medv ~ ., data=Boston)
summary(lm.fit.all) # model improves again, only some of the variables seem to be unimportant, we may want to exclude them
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# use correlation coefficients, keep only variables that significantly correlate with medv
sort(apply(Boston,2,function(x) cor.test(x,Boston$medv)$p.value)) # recommendation: keep all
## medv lstat rm ptratio indus tax
## 0.000000e+00 5.081103e-88 2.487229e-74 1.609509e-34 4.900260e-31 5.637734e-29
## nox crim rad age zn black
## 7.065042e-24 1.173987e-19 5.465933e-19 1.569982e-18 5.713584e-17 1.318113e-14
## dis chas
## 1.206612e-08 7.390623e-05
# use the coefficient p-values to decide whether to include variables
lm.fit.exclude <- lm(medv ~ . -age -indus, data=Boston) # truly exclude age and indus
summary(lm.fit.exclude) # the simpler model seems to maintain the performance of the previous model
##
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
anova(lm.fit.exclude,lm.fit.all) # no difference, the simpler model preferred
## Analysis of Variance Table
##
## Model 1: medv ~ (crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat) - age - indus
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 494 11081
## 2 492 11079 2 2.5794 0.0573 0.9443
## use stepwise regression for feature selection
step <- stepAIC(lm.fit.all, direction="both") # stepwise regression, based on AIC criterion, taken from MASS library
## Start: AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1587.7
## - indus 1 2.52 11081 1587.8
## <none> 11079 1589.6
## - chas 1 218.97 11298 1597.5
## - tax 1 242.26 11321 1598.6
## - crim 1 243.22 11322 1598.6
## - zn 1 257.49 11336 1599.3
## - black 1 270.63 11349 1599.8
## - rad 1 479.15 11558 1609.1
## - nox 1 487.16 11566 1609.4
## - ptratio 1 1194.23 12273 1639.4
## - dis 1 1232.41 12311 1641.0
## - rm 1 1871.32 12950 1666.6
## - lstat 1 2410.84 13490 1687.3
##
## Step: AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1585.8
## <none> 11079 1587.7
## + age 1 0.06 11079 1589.6
## - chas 1 219.91 11299 1595.6
## - tax 1 242.24 11321 1596.6
## - crim 1 243.20 11322 1596.6
## - zn 1 260.32 11339 1597.4
## - black 1 272.26 11351 1597.9
## - rad 1 481.09 11560 1607.2
## - nox 1 520.87 11600 1608.9
## - ptratio 1 1200.23 12279 1637.7
## - dis 1 1352.26 12431 1643.9
## - rm 1 1959.55 13038 1668.0
## - lstat 1 2718.88 13798 1696.7
##
## Step: AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## + indus 1 2.52 11079 1587.7
## + age 1 0.06 11081 1587.8
## - chas 1 227.21 11309 1594.0
## - crim 1 245.37 11327 1594.8
## - zn 1 257.82 11339 1595.4
## - black 1 270.82 11352 1596.0
## - tax 1 273.62 11355 1596.1
## - rad 1 500.92 11582 1606.1
## - nox 1 541.91 11623 1607.9
## - ptratio 1 1206.45 12288 1636.0
## - dis 1 1448.94 12530 1645.9
## - rm 1 1963.66 13045 1666.3
## - lstat 1 2723.48 13805 1695.0
step$anova # display results, actually does the same as we did manually, removes age and indus
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Final Model:
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 492 11078.78 1589.643
## 2 - age 1 0.06183435 493 11078.85 1587.646
## 3 - indus 1 2.51754013 494 11081.36 1585.761
stepAIC(lm(medv~1,Boston), direction="forward",scope=list(upper=lm.fit.all))
## Start: AIC=2246.51
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 23243.9 19472 1851.0
## + rm 1 20654.4 22062 1914.2
## + ptratio 1 11014.3 31702 2097.6
## + indus 1 9995.2 32721 2113.6
## + tax 1 9377.3 33339 2123.1
## + nox 1 7800.1 34916 2146.5
## + crim 1 6440.8 36276 2165.8
## + rad 1 6221.1 36495 2168.9
## + age 1 6069.8 36647 2171.0
## + zn 1 5549.7 37167 2178.1
## + black 1 4749.9 37966 2188.9
## + dis 1 2668.2 40048 2215.9
## + chas 1 1312.1 41404 2232.7
## <none> 42716 2246.5
##
## Step: AIC=1851.01
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 4033.1 15439 1735.6
## + ptratio 1 2670.1 16802 1778.4
## + chas 1 786.3 18686 1832.2
## + dis 1 772.4 18700 1832.5
## + age 1 304.3 19168 1845.0
## + tax 1 274.4 19198 1845.8
## + black 1 198.3 19274 1847.8
## + zn 1 160.3 19312 1848.8
## + crim 1 146.9 19325 1849.2
## + indus 1 98.7 19374 1850.4
## <none> 19472 1851.0
## + rad 1 25.1 19447 1852.4
## + nox 1 4.8 19468 1852.9
##
## Step: AIC=1735.58
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1711.32 13728 1678.1
## + chas 1 548.53 14891 1719.3
## + black 1 512.31 14927 1720.5
## + tax 1 425.16 15014 1723.5
## + dis 1 351.15 15088 1725.9
## + crim 1 311.42 15128 1727.3
## + rad 1 180.45 15259 1731.6
## + indus 1 61.09 15378 1735.6
## <none> 15439 1735.6
## + zn 1 56.56 15383 1735.7
## + age 1 20.18 15419 1736.9
## + nox 1 14.90 15424 1737.1
##
## Step: AIC=1678.13
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 499.08 13229 1661.4
## + black 1 389.68 13338 1665.6
## + chas 1 377.96 13350 1666.0
## + crim 1 122.52 13606 1675.6
## + age 1 66.24 13662 1677.7
## <none> 13728 1678.1
## + tax 1 44.36 13684 1678.5
## + nox 1 24.81 13703 1679.2
## + zn 1 14.96 13713 1679.6
## + rad 1 6.07 13722 1679.9
## + indus 1 0.83 13727 1680.1
##
## Step: AIC=1661.39
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 759.56 12469 1633.5
## + black 1 502.64 12726 1643.8
## + chas 1 267.43 12962 1653.1
## + indus 1 242.65 12986 1654.0
## + tax 1 240.34 12989 1654.1
## + crim 1 233.54 12995 1654.4
## + zn 1 144.81 13084 1657.8
## + age 1 61.36 13168 1661.0
## <none> 13229 1661.4
## + rad 1 22.40 13206 1662.5
##
## Step: AIC=1633.47
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 328.27 12141 1622.0
## + black 1 311.83 12158 1622.7
## + zn 1 151.71 12318 1629.3
## + crim 1 141.43 12328 1629.7
## + rad 1 53.48 12416 1633.3
## <none> 12469 1633.5
## + indus 1 17.10 12452 1634.8
## + tax 1 10.50 12459 1635.0
## + age 1 0.25 12469 1635.5
##
## Step: AIC=1621.97
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + black 1 272.837 11868 1612.5
## + zn 1 164.406 11977 1617.1
## + crim 1 116.330 12025 1619.1
## + rad 1 58.556 12082 1621.5
## <none> 12141 1622.0
## + indus 1 26.274 12115 1622.9
## + tax 1 4.187 12137 1623.8
## + age 1 2.331 12139 1623.9
##
## Step: AIC=1612.47
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
##
## Df Sum of Sq RSS AIC
## + zn 1 189.936 11678 1606.3
## + rad 1 144.320 11724 1608.3
## + crim 1 55.633 11813 1612.1
## <none> 11868 1612.5
## + indus 1 15.584 11853 1613.8
## + age 1 9.446 11859 1614.1
## + tax 1 2.703 11866 1614.4
##
## Step: AIC=1606.31
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn
##
## Df Sum of Sq RSS AIC
## + crim 1 94.712 11584 1604.2
## + rad 1 93.614 11585 1604.2
## <none> 11678 1606.3
## + indus 1 16.048 11662 1607.6
## + tax 1 3.952 11674 1608.1
## + age 1 1.491 11677 1608.2
##
## Step: AIC=1604.19
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## crim
##
## Df Sum of Sq RSS AIC
## + rad 1 228.604 11355 1596.1
## <none> 11584 1604.2
## + indus 1 15.773 11568 1605.5
## + age 1 2.470 11581 1606.1
## + tax 1 1.305 11582 1606.1
##
## Step: AIC=1596.1
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## crim + rad
##
## Df Sum of Sq RSS AIC
## + tax 1 273.619 11081 1585.8
## <none> 11355 1596.1
## + indus 1 33.894 11321 1596.6
## + age 1 0.096 11355 1598.1
##
## Step: AIC=1585.76
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## crim + rad + tax
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## + indus 1 2.51754 11079 1587.7
## + age 1 0.06271 11081 1587.8
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = Boston)
##
## Coefficients:
## (Intercept) lstat rm ptratio dis nox
## 36.341145 -0.522553 3.801579 -0.946525 -1.492711 -17.376023
## chas black zn crim rad tax
## 2.718716 0.009291 0.045845 -0.108413 0.299608 -0.011778
stepAIC(lm.fit.all, direction="backward")
## Start: AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1587.7
## - indus 1 2.52 11081 1587.8
## <none> 11079 1589.6
## - chas 1 218.97 11298 1597.5
## - tax 1 242.26 11321 1598.6
## - crim 1 243.22 11322 1598.6
## - zn 1 257.49 11336 1599.3
## - black 1 270.63 11349 1599.8
## - rad 1 479.15 11558 1609.1
## - nox 1 487.16 11566 1609.4
## - ptratio 1 1194.23 12273 1639.4
## - dis 1 1232.41 12311 1641.0
## - rm 1 1871.32 12950 1666.6
## - lstat 1 2410.84 13490 1687.3
##
## Step: AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1585.8
## <none> 11079 1587.7
## - chas 1 219.91 11299 1595.6
## - tax 1 242.24 11321 1596.6
## - crim 1 243.20 11322 1596.6
## - zn 1 260.32 11339 1597.4
## - black 1 272.26 11351 1597.9
## - rad 1 481.09 11560 1607.2
## - nox 1 520.87 11600 1608.9
## - ptratio 1 1200.23 12279 1637.7
## - dis 1 1352.26 12431 1643.9
## - rm 1 1959.55 13038 1668.0
## - lstat 1 2718.88 13798 1696.7
##
## Step: AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## - chas 1 227.21 11309 1594.0
## - crim 1 245.37 11327 1594.8
## - zn 1 257.82 11339 1595.4
## - black 1 270.82 11352 1596.0
## - tax 1 273.62 11355 1596.1
## - rad 1 500.92 11582 1606.1
## - nox 1 541.91 11623 1607.9
## - ptratio 1 1206.45 12288 1636.0
## - dis 1 1448.94 12530 1645.9
## - rm 1 1963.66 13045 1666.3
## - lstat 1 2723.48 13805 1695.0
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Coefficients:
## (Intercept) crim zn chas nox rm
## 36.341145 -0.108413 0.045845 2.718716 -17.376023 3.801579
## dis rad tax ptratio black lstat
## -1.492711 0.299608 -0.011778 -0.946525 0.009291 -0.522553
# Fit LASSO to remove unimportant predictors
lambda_grid <- 10^ seq(10 , -3 , length =200)
lasso.model <- glmnet(as.matrix(Boston[,-ncol(Boston)]),Boston$medv,alpha=1, lambda=lambda_grid, standardize=TRUE)
lasso.cv.out <- cv.glmnet(as.matrix(Boston[,-ncol(Boston)]),Boston$medv,alpha=1)
plot(lasso.cv.out) # small lambda values preferred, shrinkage effect is small
lasso.lambda <- lasso.cv.out$lambda.min
lasso.coefficients <- predict(lasso.model, type="coefficients", s=lasso.lambda)
# Display the coefficients and selected variables
print("LASSO coefficients:")
## [1] "LASSO coefficients:"
print(as.matrix(lasso.coefficients)) # the absolute coefficent values influenced by the scale of the individual variables, nox has a small scale
## s1
## (Intercept) 35.008345160
## crim -0.101374833
## zn 0.042789603
## indus 0.000000000
## chas 2.696230567
## nox -16.635622703
## rm 3.847363449
## age 0.000000000
## dis -1.425786444
## rad 0.266778132
## tax -0.010413817
## ptratio -0.934983573
## black 0.009107027
## lstat -0.522521851
print(as.matrix(lasso.coefficients)[seq(2,length(Boston)-1),] != 0) # removes indus and age too
## crim zn indus chas nox rm age dis rad tax
## TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## ptratio black
## TRUE TRUE
To avoid overfitting and increase simplicity and understandability of the model, only relevant features should be used. We have shown four different approaches to feature selection: pairwise correlations, p-values, stepwise regression and lasso. The method based on pairwise correlations recommends to keep all the variables, but it fails to control correlation between predictors. All of the rest came to the same conclusion: two features should be removed from the model. Using coefficient p-values for variable selection worked well in our case, however, it is not considered a standard or recommended practice. Significance of a variable in this context does not necessarily imply practical importance. A small p-value may indicate statistical significance, but the effect size might be trivial in a real-world context and vice versa. Multiple comparisons may lead to false positives (Type I errors) due to the problem of multiple comparisons (you might mistakenly include variables that have a low p-value by chance alone). Similarly, stepwise variable selection has to be used judiciously as it often suffers from overfitting. Stepwise procedures are data-driven and can lead to the selection of variables that happen to perform well on the specific dataset used for modeling but do not generalize well to new, unseen data. Lasso with the properly tuned regularization strength using techniques like cross-validation generally achieves an optimal balance between model complexity and performance.