Multiple linear regression

This script stems from the lab accompanying Chapter 3 of An Introduction to Statistical Learning and follows on from the previous script analyzing the Boston data. This time we work with multiple linear regression and try to identify a model of the right size.
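The calls below rely on packages that are never loaded anywhere in the script; a minimal setup sketch (package names as on CRAN):

```r
library(MASS)    # Boston data and stepAIC()
library(glmnet)  # LASSO via glmnet() and cv.glmnet()
```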

help(Boston) # learn the structure of the dataset, 506 areas described by 14 variables, medv will serve as the target variable

# employ all the available independent variables
lm.fit.all <- lm(medv ~ ., data=Boston)
summary(lm.fit.all) # the model improves again; only age and indus appear unimportant, so we may want to exclude them
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
# use correlation coefficients, keep only variables that significantly correlate with medv
sort(apply(Boston,2,function(x) cor.test(x,Boston$medv)$p.value)) # every variable correlates significantly with medv, so this screen would keep all of them
##         medv        lstat           rm      ptratio        indus          tax 
## 0.000000e+00 5.081103e-88 2.487229e-74 1.609509e-34 4.900260e-31 5.637734e-29 
##          nox         crim          rad          age           zn        black 
## 7.065042e-24 1.173987e-19 5.465933e-19 1.569982e-18 5.713584e-17 1.318113e-14 
##          dis         chas 
## 1.206612e-08 7.390623e-05
# use the coefficient p-values to decide whether to include variables
lm.fit.exclude <- lm(medv ~ . -rad -age -indus, data=Boston) # unlike the correlation screen, exclude the insignificant age and indus (and, more aggressively, rad)
summary(lm.fit.exclude) # the simpler model largely maintains the performance of the full model (adjusted R-squared 0.723 vs 0.734)
## 
## Call:
## lm(formula = medv ~ . - rad - age - indus, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8588  -2.8451  -0.5955   1.4641  27.6777 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.974e+01  4.975e+00   5.978 4.33e-09 ***
## crim        -6.356e-02  3.204e-02  -1.984 0.047853 *  
## zn           4.131e-02  1.378e-02   2.999 0.002846 ** 
## chas         3.037e+00  8.697e-01   3.492 0.000522 ***
## nox         -1.646e+01  3.605e+00  -4.567 6.26e-06 ***
## rm           4.147e+00  4.082e-01  10.160  < 2e-16 ***
## dis         -1.429e+00  1.892e-01  -7.552 2.08e-13 ***
## tax          5.175e-04  2.191e-03   0.236 0.813396    
## ptratio     -8.519e-01  1.302e-01  -6.542 1.52e-10 ***
## black        8.392e-03  2.724e-03   3.081 0.002180 ** 
## lstat       -5.254e-01  4.843e-02 -10.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.837 on 495 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7234 
## F-statistic: 133.1 on 10 and 495 DF,  p-value: < 2.2e-16
anova(lm.fit.exclude,lm.fit.all) # the F-test is significant: the variables we dropped (notably rad) still explain some variance, so the exclusion was slightly too aggressive
## Analysis of Variance Table
## 
## Model 1: medv ~ (crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat) - rad - age - indus
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
## 1    495 11582                                 
## 2    492 11079  3     503.5 7.4533 6.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## use stepwise regression for feature selection
step <- stepAIC(lm.fit.all, direction="both") # stepwise regression based on the AIC criterion; stepAIC comes from the MASS library; use trace=0 to suppress the output
## Start:  AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - age      1      0.06 11079 1587.7
## - indus    1      2.52 11081 1587.8
## <none>                 11079 1589.6
## - chas     1    218.97 11298 1597.5
## - tax      1    242.26 11321 1598.6
## - crim     1    243.22 11322 1598.6
## - zn       1    257.49 11336 1599.3
## - black    1    270.63 11349 1599.8
## - rad      1    479.15 11558 1609.1
## - nox      1    487.16 11566 1609.4
## - ptratio  1   1194.23 12273 1639.4
## - dis      1   1232.41 12311 1641.0
## - rm       1   1871.32 12950 1666.6
## - lstat    1   2410.84 13490 1687.3
## 
## Step:  AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - indus    1      2.52 11081 1585.8
## <none>                 11079 1587.7
## + age      1      0.06 11079 1589.6
## - chas     1    219.91 11299 1595.6
## - tax      1    242.24 11321 1596.6
## - crim     1    243.20 11322 1596.6
## - zn       1    260.32 11339 1597.4
## - black    1    272.26 11351 1597.9
## - rad      1    481.09 11560 1607.2
## - nox      1    520.87 11600 1608.9
## - ptratio  1   1200.23 12279 1637.7
## - dis      1   1352.26 12431 1643.9
## - rm       1   1959.55 13038 1668.0
## - lstat    1   2718.88 13798 1696.7
## 
## Step:  AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11081 1585.8
## + indus    1      2.52 11079 1587.7
## + age      1      0.06 11081 1587.8
## - chas     1    227.21 11309 1594.0
## - crim     1    245.37 11327 1594.8
## - zn       1    257.82 11339 1595.4
## - black    1    270.82 11352 1596.0
## - tax      1    273.62 11355 1596.1
## - rad      1    500.92 11582 1606.1
## - nox      1    541.91 11623 1607.9
## - ptratio  1   1206.45 12288 1636.0
## - dis      1   1448.94 12530 1645.9
## - rm       1   1963.66 13045 1666.3
## - lstat    1   2723.48 13805 1695.0
step$anova # display the selection path; stepwise removes only age and indus, keeping rad (unlike our manual exclusion)
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
## Final Model:
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
## 
##      Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                             492   11078.78 1589.643
## 2   - age  1 0.06183435       493   11078.85 1587.646
## 3 - indus  1 2.51754013       494   11081.36 1585.761
stepAIC(lm(medv~1,Boston), direction="forward", scope=list(upper=lm.fit.all)) # forward selection from the intercept-only model reaches the same set of predictors
## Start:  AIC=2246.51
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   23243.9 19472 1851.0
## + rm       1   20654.4 22062 1914.2
## + ptratio  1   11014.3 31702 2097.6
## + indus    1    9995.2 32721 2113.6
## + tax      1    9377.3 33339 2123.1
## + nox      1    7800.1 34916 2146.5
## + crim     1    6440.8 36276 2165.8
## + rad      1    6221.1 36495 2168.9
## + age      1    6069.8 36647 2171.0
## + zn       1    5549.7 37167 2178.1
## + black    1    4749.9 37966 2188.9
## + dis      1    2668.2 40048 2215.9
## + chas     1    1312.1 41404 2232.7
## <none>                 42716 2246.5
## 
## Step:  AIC=1851.01
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    4033.1 15439 1735.6
## + ptratio  1    2670.1 16802 1778.4
## + chas     1     786.3 18686 1832.2
## + dis      1     772.4 18700 1832.5
## + age      1     304.3 19168 1845.0
## + tax      1     274.4 19198 1845.8
## + black    1     198.3 19274 1847.8
## + zn       1     160.3 19312 1848.8
## + crim     1     146.9 19325 1849.2
## + indus    1      98.7 19374 1850.4
## <none>                 19472 1851.0
## + rad      1      25.1 19447 1852.4
## + nox      1       4.8 19468 1852.9
## 
## Step:  AIC=1735.58
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1   1711.32 13728 1678.1
## + chas     1    548.53 14891 1719.3
## + black    1    512.31 14927 1720.5
## + tax      1    425.16 15014 1723.5
## + dis      1    351.15 15088 1725.9
## + crim     1    311.42 15128 1727.3
## + rad      1    180.45 15259 1731.6
## + indus    1     61.09 15378 1735.6
## <none>                 15439 1735.6
## + zn       1     56.56 15383 1735.7
## + age      1     20.18 15419 1736.9
## + nox      1     14.90 15424 1737.1
## 
## Step:  AIC=1678.13
## medv ~ lstat + rm + ptratio
## 
##         Df Sum of Sq   RSS    AIC
## + dis    1    499.08 13229 1661.4
## + black  1    389.68 13338 1665.6
## + chas   1    377.96 13350 1666.0
## + crim   1    122.52 13606 1675.6
## + age    1     66.24 13662 1677.7
## <none>               13728 1678.1
## + tax    1     44.36 13684 1678.5
## + nox    1     24.81 13703 1679.2
## + zn     1     14.96 13713 1679.6
## + rad    1      6.07 13722 1679.9
## + indus  1      0.83 13727 1680.1
## 
## Step:  AIC=1661.39
## medv ~ lstat + rm + ptratio + dis
## 
##         Df Sum of Sq   RSS    AIC
## + nox    1    759.56 12469 1633.5
## + black  1    502.64 12726 1643.8
## + chas   1    267.43 12962 1653.1
## + indus  1    242.65 12986 1654.0
## + tax    1    240.34 12989 1654.1
## + crim   1    233.54 12995 1654.4
## + zn     1    144.81 13084 1657.8
## + age    1     61.36 13168 1661.0
## <none>               13229 1661.4
## + rad    1     22.40 13206 1662.5
## 
## Step:  AIC=1633.47
## medv ~ lstat + rm + ptratio + dis + nox
## 
##         Df Sum of Sq   RSS    AIC
## + chas   1    328.27 12141 1622.0
## + black  1    311.83 12158 1622.7
## + zn     1    151.71 12318 1629.3
## + crim   1    141.43 12328 1629.7
## + rad    1     53.48 12416 1633.3
## <none>               12469 1633.5
## + indus  1     17.10 12452 1634.8
## + tax    1     10.50 12459 1635.0
## + age    1      0.25 12469 1635.5
## 
## Step:  AIC=1621.97
## medv ~ lstat + rm + ptratio + dis + nox + chas
## 
##         Df Sum of Sq   RSS    AIC
## + black  1   272.837 11868 1612.5
## + zn     1   164.406 11977 1617.1
## + crim   1   116.330 12025 1619.1
## + rad    1    58.556 12082 1621.5
## <none>               12141 1622.0
## + indus  1    26.274 12115 1622.9
## + tax    1     4.187 12137 1623.8
## + age    1     2.331 12139 1623.9
## 
## Step:  AIC=1612.47
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
## 
##         Df Sum of Sq   RSS    AIC
## + zn     1   189.936 11678 1606.3
## + rad    1   144.320 11724 1608.3
## + crim   1    55.633 11813 1612.1
## <none>               11868 1612.5
## + indus  1    15.584 11853 1613.8
## + age    1     9.446 11859 1614.1
## + tax    1     2.703 11866 1614.4
## 
## Step:  AIC=1606.31
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn
## 
##         Df Sum of Sq   RSS    AIC
## + crim   1    94.712 11584 1604.2
## + rad    1    93.614 11585 1604.2
## <none>               11678 1606.3
## + indus  1    16.048 11662 1607.6
## + tax    1     3.952 11674 1608.1
## + age    1     1.491 11677 1608.2
## 
## Step:  AIC=1604.19
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     crim
## 
##         Df Sum of Sq   RSS    AIC
## + rad    1   228.604 11355 1596.1
## <none>               11584 1604.2
## + indus  1    15.773 11568 1605.5
## + age    1     2.470 11581 1606.1
## + tax    1     1.305 11582 1606.1
## 
## Step:  AIC=1596.1
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     crim + rad
## 
##         Df Sum of Sq   RSS    AIC
## + tax    1   273.619 11081 1585.8
## <none>               11355 1596.1
## + indus  1    33.894 11321 1596.6
## + age    1     0.096 11355 1598.1
## 
## Step:  AIC=1585.76
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     crim + rad + tax
## 
##         Df Sum of Sq   RSS    AIC
## <none>               11081 1585.8
## + indus  1   2.51754 11079 1587.7
## + age    1   0.06271 11081 1587.8
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat           rm      ptratio          dis          nox  
##   36.341145    -0.522553     3.801579    -0.946525    -1.492711   -17.376023  
##        chas        black           zn         crim          rad          tax  
##    2.718716     0.009291     0.045845    -0.108413     0.299608    -0.011778
stepAIC(lm.fit.all, direction="backward") # backward elimination also removes just age and indus
## Start:  AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - age      1      0.06 11079 1587.7
## - indus    1      2.52 11081 1587.8
## <none>                 11079 1589.6
## - chas     1    218.97 11298 1597.5
## - tax      1    242.26 11321 1598.6
## - crim     1    243.22 11322 1598.6
## - zn       1    257.49 11336 1599.3
## - black    1    270.63 11349 1599.8
## - rad      1    479.15 11558 1609.1
## - nox      1    487.16 11566 1609.4
## - ptratio  1   1194.23 12273 1639.4
## - dis      1   1232.41 12311 1641.0
## - rm       1   1871.32 12950 1666.6
## - lstat    1   2410.84 13490 1687.3
## 
## Step:  AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - indus    1      2.52 11081 1585.8
## <none>                 11079 1587.7
## - chas     1    219.91 11299 1595.6
## - tax      1    242.24 11321 1596.6
## - crim     1    243.20 11322 1596.6
## - zn       1    260.32 11339 1597.4
## - black    1    272.26 11351 1597.9
## - rad      1    481.09 11560 1607.2
## - nox      1    520.87 11600 1608.9
## - ptratio  1   1200.23 12279 1637.7
## - dis      1   1352.26 12431 1643.9
## - rm       1   1959.55 13038 1668.0
## - lstat    1   2718.88 13798 1696.7
## 
## Step:  AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11081 1585.8
## - chas     1    227.21 11309 1594.0
## - crim     1    245.37 11327 1594.8
## - zn       1    257.82 11339 1595.4
## - black    1    270.82 11352 1596.0
## - tax      1    273.62 11355 1596.1
## - rad      1    500.92 11582 1606.1
## - nox      1    541.91 11623 1607.9
## - ptratio  1   1206.45 12288 1636.0
## - dis      1   1448.94 12530 1645.9
## - rm       1   1963.66 13045 1666.3
## - lstat    1   2723.48 13805 1695.0
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston)
## 
## Coefficients:
## (Intercept)         crim           zn         chas          nox           rm  
##   36.341145    -0.108413     0.045845     2.718716   -17.376023     3.801579  
##         dis          rad          tax      ptratio        black        lstat  
##   -1.492711     0.299608    -0.011778    -0.946525     0.009291    -0.522553
# Fit LASSO to remove unimportant predictors
lambda_grid <- 10^seq(10, -3, length = 200)
lasso.model <- glmnet(as.matrix(Boston[,-ncol(Boston)]),Boston$medv,alpha=1, lambda=lambda_grid, standardize=TRUE)
lasso.cv.out <- cv.glmnet(as.matrix(Boston[,-ncol(Boston)]),Boston$medv,alpha=1)
plot(lasso.cv.out) # small lambda values preferred, shrinkage effect is small

## lambda.min is the value of the LASSO penalty that minimizes cross-validated error, giving the most predictive but usually less sparse model.
lasso.lambda.min <- lasso.cv.out$lambda.min
lasso.coefficients <- predict(lasso.model, type="coefficients", s=lasso.lambda.min)

# Display the coefficients and selected variables
print("LASSO coefficients:")
## [1] "LASSO coefficients:"
print(as.matrix(lasso.coefficients)) # the absolute coefficient values are influenced by the scale of the individual variables; nox has a small scale
##                        s1
## (Intercept)  34.598619889
## crim         -0.099247576
## zn            0.041849211
## indus         0.000000000
## chas          2.688042253
## nox         -16.403110480
## rm            3.861006279
## age           0.000000000
## dis          -1.404850441
## rad           0.256873202
## tax          -0.010001484
## ptratio      -0.931414361
## black         0.009049078
## lstat        -0.522507756
print(as.matrix(lasso.coefficients)[seq(2,length(Boston)),] != 0) # LASSO removes indus and age too
##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##    TRUE    TRUE   FALSE    TRUE    TRUE    TRUE   FALSE    TRUE    TRUE    TRUE 
## ptratio   black   lstat 
##    TRUE    TRUE    TRUE

Further thoughts and fine-tuning

The best linear model balances predictive accuracy and parsimony. Above, we used statistical significance as one way to assess this balance. However, the Boston dataset has a relatively high ratio of samples to predictors. Consequently, even very small changes in predictive accuracy — potentially negligible in practice — can appear statistically significant. In LASSO, small differences in predictive accuracy may lead to substantially larger selected models. Below, we demonstrate that models with fewer features can achieve very competitive performance as well.
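As a toy illustration of this point (simulated data, not the Boston set): with enough samples, a predictor that explains only a sliver of the variance still earns a tiny p-value.

```r
# With large n, a practically negligible effect is still statistically significant.
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)            # the effect explains roughly 0.25% of the variance
fit <- summary(lm(y ~ x))
fit$coefficients["x", "Pr(>|t|)"]   # far below 0.001: statistically significant
fit$r.squared                       # yet close to zero: practically negligible
```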

## above, we used lambda.min, which minimizes cross-validated error, giving the most predictive but usually less sparse model
## lambda.1se is the largest lambda within one standard error of the minimum, producing a simpler, sparser model with only a slight increase in error
lasso.lambda.1se <- lasso.cv.out$lambda.1se
lasso.coefficients <- predict(lasso.model, type="coefficients", s=lasso.lambda.1se)
print(as.matrix(lasso.coefficients))
##                        s1
## (Intercept) 18.5651526330
## crim        -0.0241614962
## zn           0.0004583721
## indus        0.0000000000
## chas         1.9948750237
## nox         -4.4871514084
## rm           4.2705018262
## age          0.0000000000
## dis         -0.3933640745
## rad          0.0000000000
## tax          0.0000000000
## ptratio     -0.8004356619
## black        0.0066973241
## lstat       -0.5185491480
## compare cross-validated MSE for both the options
mse_min <- lasso.cv.out$cvm[lasso.cv.out$lambda == lasso.lambda.min]
mse_1se <- lasso.cv.out$cvm[lasso.cv.out$lambda == lasso.lambda.1se]
cat("Comparison of cross-validated MSE with 2 versus 4 removed features: ", mse_min, mse_1se)
## Comparison of cross-validated MSE with 2 versus 4 removed features:  23.26698 25.74933

LASSO with \(\lambda_{\text{min}}\) selects the model that achieves the best predictive performance on unseen data. LASSO with \(\lambda_{\text{1se}}\) reflects a trade-off between predictive accuracy and interpretability, favoring interpretability by retaining the most informative features while removing those with marginal contributions. In our example, this results in a roughly 10% increase in MSE in exchange for the removal of two additional features (rad and tax).
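The roughly 10% figure follows directly from the two cross-validated MSE values reported above; a quick sanity check:

```r
# cross-validated MSE values reported by cv.glmnet above
mse_min <- 23.26698                            # lambda.min
mse_1se <- 25.74933                            # lambda.1se
round(100 * (mse_1se - mse_min) / mse_min, 1)  # relative increase in percent, about 10.7
```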

Our model can be further fine-tuned, for example, by including interactions.

## consider only two most significant predictors and their interaction
## this model seems to perform better than the previous lm.fit.exclude
lm.fit.int <- lm(medv ~ lstat+rm+lstat:rm, data=Boston)
summary(lm.fit.int)
## 
## Call:
## lm(formula = medv ~ lstat + rm + lstat:rm, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2349  -2.6897  -0.6158   1.9663  31.6141 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.12452    3.34250  -8.713   <2e-16 ***
## lstat         2.19398    0.20570  10.666   <2e-16 ***
## rm            9.70126    0.50023  19.393   <2e-16 ***
## lstat:rm     -0.48494    0.03459 -14.018   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.701 on 502 degrees of freedom
## Multiple R-squared:  0.7402, Adjusted R-squared:  0.7387 
## F-statistic: 476.9 on 3 and 502 DF,  p-value: < 2.2e-16
## confirm with a hold out test
set.seed(1)
sample <- sample(c(TRUE, FALSE), nrow(Boston), replace=TRUE, prob=c(0.7, 0.3))
 
train  <- Boston[sample, ]
test   <- Boston[!sample, ]
 
# evaluate a linear model on unseen data: R2, adjusted R2 and MSE
evalLM <- function(m, d) {
  predictions <- predict(m, newdata = d)
  rss <- sum((predictions - d$medv)^2)
  tss <- sum((mean(d$medv) - d$medv)^2)
  rsquared <- 1 - rss/tss
  p <- length(coef(m)) - 1  # number of predictors
  rsquared_adj <- 1 - (rss/(nrow(d) - p - 1))/(tss/(nrow(d) - 1))
  mse <- mean((d$medv - predictions)^2)
  return(signif(c(rsquared, rsquared_adj, mse), 4))
}

lm.train.exclude <- lm(formula=medv~. -age -indus, data = train)
paste("The model without age and indus (11 predictors) and its R2, adjR2 and MSE:", paste(evalLM(lm.train.exclude,test),collapse = ", "))
## [1] "The model without age and indus (11 predictors) and its R2, adjR2 and MSE: 0.6887, 0.6632, 26.12"
lm.train.int <- lm(formula=medv~lstat+rm+lstat:rm, data = train) 
paste("The interaction model (3 predictors) and its R2, adjR2 and MSE:", paste(evalLM(lm.train.int,test),collapse = ", "))
## [1] "The interaction model (3 predictors) and its R2, adjR2 and MSE: 0.706, 0.6998, 24.67"
## brute-force alternative: generate all pairwise interactions and let the LASSO filter them
library(Matrix)
X_sparse <- as(Matrix(model.matrix(medv ~ .^2, data=Boston)), "CsparseMatrix")
lasso.int.model <- glmnet(X_sparse, Boston$medv, alpha=1, lambda=lambda_grid, standardize=TRUE)
lasso.int.cv <- cv.glmnet(X_sparse, Boston$medv, alpha = 1)
lasso.lambda.1se <- lasso.int.cv$lambda.1se
lasso.coefficients <- predict(lasso.int.cv, type="coefficients", s=lasso.lambda.1se)
active_features <- rownames(lasso.coefficients)[which(lasso.coefficients != 0)]
active_features <- setdiff(active_features, "(Intercept)")
cat("Cross-validated MSE with", length(active_features), "features including pairwise interactions: ", lasso.int.cv$cvm[lasso.int.cv$lambda == lasso.lambda.1se])
## Cross-validated MSE with 60 features including pairwise interactions:  23.30535
## use the selected features and their interactions in a linear model
formula_int <- as.formula(paste("medv ~", paste(active_features, collapse = " + ")))
lm.int.lasso <- lm(formula_int, data = Boston)
summary(lm.int.lasso)
## 
## Call:
## lm(formula = formula_int, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1649 -1.5226 -0.1455  1.2533 19.2389 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.070e+02  1.390e+01  -7.695 9.18e-14 ***
## zn            -2.017e-01  1.066e-01  -1.891 0.059259 .  
## indus          8.014e-01  6.350e-01   1.262 0.207590    
## chas           3.326e+01  1.208e+01   2.753 0.006143 ** 
## rm             2.560e+01  2.871e+00   8.914  < 2e-16 ***
## age            3.855e-01  1.269e-01   3.037 0.002527 ** 
## dis           -1.643e+00  1.373e+00  -1.197 0.232120    
## rad            4.924e-01  1.250e+00   0.394 0.693767    
## black          6.791e-02  3.403e-02   1.996 0.046587 *  
## lstat          2.389e+00  4.332e-01   5.515 5.91e-08 ***
## zn:crim        9.662e-02  1.298e-01   0.745 0.456957    
## chas:crim      2.679e+00  5.632e-01   4.756 2.68e-06 ***
## crim:nox      -1.086e+00  7.894e-01  -1.376 0.169610    
## rm:crim        2.204e-01  4.549e-02   4.845 1.75e-06 ***
## dis:crim      -1.752e-01  7.942e-02  -2.206 0.027888 *  
## rad:crim      -3.951e-02  1.902e-02  -2.077 0.038415 *  
## black:crim    -3.708e-04  1.553e-04  -2.388 0.017362 *  
## lstat:crim     2.657e-02  6.396e-03   4.155 3.91e-05 ***
## zn:indus       6.994e-04  4.333e-03   0.161 0.871833    
## zn:chas       -1.360e-02  5.581e-02  -0.244 0.807651    
## zn:age         4.252e-04  6.234e-04   0.682 0.495585    
## zn:dis         1.014e-02  5.539e-03   1.831 0.067777 .  
## zn:rad        -4.315e-03  5.329e-03  -0.810 0.418601    
## zn:tax         4.239e-04  1.397e-04   3.035 0.002545 ** 
## zn:ptratio     3.870e-03  5.573e-03   0.695 0.487711    
## zn:lstat      -1.057e-02  3.240e-03  -3.263 0.001186 ** 
## indus:chas     1.127e-01  2.528e-01   0.446 0.655967    
## indus:age      2.773e-03  2.770e-03   1.001 0.317369    
## indus:dis     -1.000e-01  4.243e-02  -2.357 0.018834 *  
## indus:tax      1.973e-04  3.738e-04   0.528 0.597898    
## indus:ptratio -3.691e-02  2.702e-02  -1.366 0.172624    
## indus:black    3.392e-04  7.664e-04   0.443 0.658239    
## indus:lstat   -1.640e-02  9.810e-03  -1.671 0.095383 .  
## chas:nox      -2.810e+01  9.648e+00  -2.913 0.003764 ** 
## chas:rm       -3.776e+00  1.050e+00  -3.597 0.000358 ***
## chas:age       5.826e-02  5.332e-02   1.093 0.275125    
## chas:dis       5.856e-01  1.295e+00   0.452 0.651233    
## chas:rad      -3.406e-01  1.665e-01  -2.045 0.041421 *  
## chas:black     9.579e-03  1.469e-02   0.652 0.514711    
## chas:lstat    -3.511e-01  1.620e-01  -2.168 0.030703 *  
## rm:nox        -2.449e+00  2.862e+00  -0.856 0.392553    
## rad:nox       -6.940e-01  4.877e-01  -1.423 0.155455    
## nox:ptratio    1.875e-01  1.186e+00   0.158 0.874478    
## lstat:nox      4.681e-01  3.828e-01   1.223 0.222062    
## rm:age        -3.429e-02  1.440e-02  -2.381 0.017695 *  
## rm:rad        -6.609e-02  7.988e-02  -0.827 0.408431    
## rm:tax        -1.491e-02  3.460e-03  -4.310 2.01e-05 ***
## rm:ptratio    -3.754e-01  1.130e-01  -3.321 0.000972 ***
## rm:lstat      -2.973e-01  3.877e-02  -7.669 1.10e-13 ***
## age:rad        2.555e-03  1.961e-03   1.303 0.193284    
## age:black     -5.816e-04  1.745e-04  -3.333 0.000931 ***
## age:lstat     -5.206e-03  1.646e-03  -3.162 0.001672 ** 
## dis:tax       -2.486e-03  1.688e-03  -1.473 0.141559    
## dis:ptratio    5.374e-02  6.746e-02   0.797 0.426080    
## dis:lstat      8.722e-02  3.257e-02   2.678 0.007689 ** 
## rad:ptratio    3.850e-02  5.925e-02   0.650 0.516159    
## rad:lstat     -1.724e-02  1.052e-02  -1.638 0.102115    
## tax:ptratio    5.229e-03  1.416e-03   3.693 0.000249 ***
## lstat:tax     -1.361e-03  6.094e-04  -2.233 0.026012 *  
## black:ptratio -2.620e-04  1.316e-03  -0.199 0.842317    
## black:lstat   -5.808e-04  3.523e-04  -1.648 0.099978 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.936 on 445 degrees of freedom
## Multiple R-squared:  0.9102, Adjusted R-squared:  0.8981 
## F-statistic: 75.15 on 60 and 445 DF,  p-value: < 2.2e-16

First, we demonstrated that the interaction between the two strongest predictors contributes more to the model than the remaining predictors. The model including just these two features and their interaction outperforms the model obtained through feature selection. At the same time, increasing the number of interactions—and consequently the number of predictors—retained by LASSO leads to only a practically negligible improvement in predictive performance. Ultimately, the user must decide whether an approximately 3.5% reduction in MSE justifies the addition of 57 extra predictors.

Summary

To avoid overfitting and improve the simplicity and interpretability of a model, only relevant features should be used. We have shown four different approaches to feature selection: pairwise correlations, coefficient p-values, stepwise regression, and the LASSO.

The method based on pairwise correlations recommends keeping all variables but fails to account for multicollinearity between predictors. The other methods agreed that age and indus should be removed from the model. Using coefficient p-values for variable selection worked well in our case; however, it is not considered a standard or recommended practice. Significance of a variable does not necessarily imply practical importance: a small p-value may indicate statistical significance while the effect size remains trivial in a real-world context, and vice versa. Testing many coefficients at once can also produce false positives (Type I errors) due to the multiple-comparisons problem: variables may earn a low p-value by chance alone.
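One standard safeguard against such false positives is to adjust the coefficient p-values for multiple testing, for example with Holm's method via base R's p.adjust; a sketch reusing lm.fit.all from above:

```r
# adjust the full model's coefficient p-values for multiple comparisons
pvals <- summary(lm.fit.all)$coefficients[, "Pr(>|t|)"]
pvals_adj <- p.adjust(pvals, method = "holm")  # Holm correction controls the family-wise error rate
sort(pvals_adj)                                # age and indus remain clearly insignificant
```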

Similarly, stepwise selection should be used judiciously, as it is prone to overfitting: the procedure is data-driven and may select variables that happen to perform well on the specific dataset used for modeling but do not generalize to new, unseen data. The LASSO, when tuned properly using techniques like cross-validation, generally achieves a better balance between model complexity and predictive performance.

Statistical and practical significance may lead to different model choices. With sufficiently large datasets, even negligible improvements can become statistically significant. The final inclusion of feature interactions demonstrates that feature selection should not be considered in isolation but rather in conjunction with the other modeling techniques discussed in this course. The very simple model medv ~ lstat + rm + lstat:rm proves to be a highly effective solution to the problem of house price prediction.