Let us work with a noisy trigonometric sine function. It is obvious that it is non-linear and we will try to fit it with models of various complexity. In the beginning, we will explain how far the linear model works. Then, we will propose a suitable alternative model. Before you start with this script, think about the best model yourself.
x <- seq(0, pi * 2, 0.1)
#y <- sin(x)
y <- sin(x) + rnorm(n = length(x), mean = 0, sd = sd(sin(x) / 2))
sin.data <- data.frame(y,x)
plot(sin.data$x,sin.data$y)
lines(sin.data$x,sin(sin.data$x),col="green")
How far the linear model works?
lm.sin <- lm(y ~ x, data = sin.data) # learn the linear model
lm.sin.pred <- predict(lm.sin, interval="confidence") # make predictions at testing points
summary(lm.sin) # the model works somehow, explains more than 40% of the variance, much better than averaging
##
## Call:
## lm(formula = y ~ x, data = sin.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16751 -0.52940 0.00129 0.55398 1.22514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00043 0.15511 6.450 2.02e-08 ***
## x -0.30823 0.04316 -7.142 1.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6229 on 61 degrees of freedom
## Multiple R-squared: 0.4554, Adjusted R-squared: 0.4465
## F-statistic: 51.01 on 1 and 61 DF, p-value: 1.321e-09
myRMSE(predict(lm.sin,sin.data),sin.data$y) # RMSE on train data
## [1] 0.6129324
myRMSE(0,sin.data$y) # RMSE of the zero model on train data
## [1] 0.83178
plot(sin.data$x,sin.data$y)
abline(lm.sin,col="red")
lines(sin.data$x,lm.sin.pred[,"lwr"],col="blue",lty=2)
lines(sin.data$x,lm.sin.pred[,"upr"],col="blue",lty=2)
lines(sin.data$x,sin(sin.data$x),col="green")
plot(lm.sin,which=1) # however, the assumptions violated, unreliable confidence intervals, the model underfits the data
sin.test <- data.frame(x=seq(0, pi * 2, 0.1),y=sin(x) + rnorm(n = length(x), mean = 0, sd = sd(sin(x) / 2)))
myRMSE(predict(lm.sin,sin.test),sin.test$y) # RMSE on test data
## [1] 0.6366656
myRMSE(0,sin.test$y) # RMSE of the zero model on test data
## [1] 0.836926
The linear model underfits the data. It is unable to capture the relationship between the dependent and independent variable, it has a high error rate both on training as well as testing data. For one sine period, the linear model overcomes the zero model, but we definitely need a more complex model. The cubic polynomial could be the first choice, its S shape with two inflection points resembles sine, although it will never fit the sine function perfectly for shape differences.
lm.sin.cub <- lm(y ~ poly(x,3), data = sin.data) # learn the linear model
lm.sin.pred <- predict(lm.sin.cub, interval="confidence") # make predictions at testing points
summary(lm.sin.cub) # the model works even better, explains nearly 80% of the variance, much better than linear model
##
## Call:
## lm(formula = y ~ poly(x, 3), data = sin.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1609 -0.2285 -0.0395 0.2603 1.1248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04492 0.05341 0.841 0.404
## poly(x, 3)1 -4.44879 0.42392 -10.494 4.09e-15 ***
## poly(x, 3)2 0.05068 0.42392 0.120 0.905
## poly(x, 3)3 3.61424 0.42392 8.526 7.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4239 on 59 degrees of freedom
## Multiple R-squared: 0.756, Adjusted R-squared: 0.7436
## F-statistic: 60.94 on 3 and 59 DF, p-value: < 2.2e-16
myRMSE(predict(lm.sin.cub,sin.data),sin.data$y) # RMSE on train data
## [1] 0.4102442
myRMSE(0,sin.data$y) # RMSE of the zero model on train data
## [1] 0.83178
plot(sin.data$x,sin.data$y)
lines(sin.data$x,lm.sin.pred[,"fit"],col="blue")
lines(sin.data$x,lm.sin.pred[,"lwr"],col="blue",lty=2)
lines(sin.data$x,lm.sin.pred[,"upr"],col="blue",lty=2)
lines(sin.data$x,sin(sin.data$x),col="green")
plot(lm.sin.cub,which=1) # much better residuals
myRMSE(predict(lm.sin.cub,sin.test),sin.test$y) # RMSE on test data
## [1] 0.3750632
myRMSE(0,sin.test$y) # RMSE of the zero model on test data
## [1] 0.836926
In order to reach a better fit, we will demonstrate the application of simple splines. The quadratic spline with one knot is the simplest choice with the desired S shape with two inflection points, we will also play with the cubic spline and a large-degree spline.
gam.lin<-gam(y~x,data=sin.data) # simple gam, matches lm
summary(gam.lin)
##
## Call: gam(formula = y ~ x, data = sin.data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.16751 -0.52940 0.00129 0.55398 1.22514
##
## (Dispersion Parameter for gaussian family taken to be 0.388)
##
## Null Deviance: 43.4599 on 62 degrees of freedom
## Residual Deviance: 23.6682 on 61 degrees of freedom
## AIC: 123.1092
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 19.792 19.792 51.009 1.321e-09 ***
## Residuals 61 23.668 0.388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(gam.lin) # the same as in lm.sin
## (Intercept) x
## 1.0004350 -0.3082309
plot(sin.data$x,sin.data$y)
lines(sin.data$x,sin(sin.data$x),col="green") # plot the true model
legend("topright",legend=c("sine","linear GAM"),lty=c("solid"),col=c("green","red"),bty="n")
abline(gam.lin,col="red")
gam.quad<-gam(y~bs(x,degree=2,knots=c(3)),data=sin.data) # gam with a quadratic spline with one knot
summary(gam.quad)
##
## Call: gam(formula = y ~ bs(x, degree = 2, knots = c(3)), data = sin.data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.22402 -0.21593 -0.04054 0.25238 1.21764
##
## (Dispersion Parameter for gaussian family taken to be 0.1811)
##
## Null Deviance: 43.4599 on 62 degrees of freedom
## Residual Deviance: 10.6851 on 59 degrees of freedom
## AIC: 77.0065
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## bs(x, degree = 2, knots = c(3)) 3 32.775 10.9249 60.324 < 2.2e-16 ***
## Residuals 59 10.685 0.1811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(gam.quad)
## (Intercept) bs(x, degree = 2, knots = c(3))1
## 0.04762010 1.98276773
## bs(x, degree = 2, knots = c(3))2 bs(x, degree = 2, knots = c(3))3
## -1.97160543 -0.03581281
gam.quad.pred <- predict(gam.quad, se.fit=T) # make predictions at testing points
plot(sin.data$x,sin.data$y) # plot the original data
lines(sin.data$x,sin(sin.data$x),col="green") # plot the true model
lines(sin.data$x,gam.quad.pred$fit,col="blue")
gam.cub<-gam(y~bs(x,degree=3,knots=c(3)),data=sin.data) # increase the degree, cubic spline with one knot
lines(sin.data$x,predict(gam.cub),col="red")
gam.overfit<-gam(y~bs(x,degree=10,knots=quantile(sin.data$x,c(0.25,0.5,0.75))),data=sin.data) # overfit the data
lines(sin.data$x,predict(gam.overfit),col="orange")
legend("topright",legend=c("sine","quadratic spline","cubic spline","degree-10 spline"),lty=c("solid"),col=c("green","blue","red","orange"),bty="n")
How well does it compare with smoothing splines?
gam.smooth<-gam(y~s(x),data=sin.data) # gam with a smoothing spline, the default flexibility is df=4 (df=1 corresponds to linear fit)
#gam.smooth<-gam(y~s(x,df=nrow(sin.data)),data=sin.data) # smoothing spline, exact fit
summary(gam.smooth)
##
## Call: gam(formula = y ~ s(x), data = sin.data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.08825 -0.23161 -0.06488 0.29400 1.10030
##
## (Dispersion Parameter for gaussian family taken to be 0.1891)
##
## Null Deviance: 43.4599 on 62 degrees of freedom
## Residual Deviance: 10.9688 on 57.9998 degrees of freedom
## AIC: 80.6576
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(x) 1 19.792 19.7917 104.65 1.327e-14 ***
## Residuals 58 10.969 0.1891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(x) 3 22.382 9.421e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(gam.smooth)
## (Intercept) s(x)
## 1.0004350 -0.3082309
gam.smooth.pred <- predict(gam.smooth, se.fit=T) # make predictions at testing points
plot(sin.data$x,sin.data$y) # plot the smoothing spline
lines(sin.data$x,sin(sin.data$x),col="green") # plot the true model
lines(sin.data$x,gam.smooth.pred$fit,col="blue")
lines(sin.data$x,gam.smooth.pred$fit+2*gam.smooth.pred$se.fit[,1],col="blue",lty=2)
lines(sin.data$x,gam.smooth.pred$fit-2*gam.smooth.pred$se.fit[,1],col="blue",lty=2)
legend("topright",legend=c("sine","degree-4 smoothing"),lty=c("solid"),col=c("green","blue"),bty="n")
# the above plot suggests that the smoothing spline has about the right flexibility, however learn the best df value with CV
gam.smooth.opt<-smooth.spline(sin.data$x, sin.data$y, cv = TRUE) # the best df seems to be a bit larger, around 6
plot(sin.data$x,sin.data$y) # plot the smoothing spline
lines(sin.data$x,sin(sin.data$x),col="green") # plot the true model
lines(sin.data$x,gam.smooth.pred$fit,col="blue")
lines(sin.data$x,predict(gam.smooth.opt)$y,col="red")
legend("topright",legend=c("sine","degree-4 smoothing","degree-opt smoothing"),lty=c("solid"),col=c("green","blue","red"),bty="n")
Summary:
Let us compute the errors of the individual models in one place. We will start with testing (noisy) data:
The model | linear | cubic | quadratic spline | cubic spline | overfitted spline | smoothing spline | opt smoothing |
---|---|---|---|---|---|---|---|
RMSE | 0.6366656 | 0.3750632 | 0.3743123 | 0.3744638 | 0.37477 | 0.3898912 | 0.37477 |
Note, that there is a certain irreducible error in this data: 0.3514698. The errors shown above should only be larger than this margin.
Let us also see the training errors. If they are significantly smaller than the figures above, the given model overfits the training data:
The model | linear | cubic | quadratic spline | cubic spline | overfitted spline | smoothing spline | opt smoothing |
---|---|---|---|---|---|---|---|
RMSE | 0.6129324 | 0.4102442 | 0.4118315 | 0.4101581 | 0.3925662 | 0.4172523 | 0.3925662 |
Now, we will compute the root mean square error of the individual models against the ideal sine function to see how well they reconstruct the original function:
The model | linear | cubic | quadratic spline | cubic spline | overfitted spline | smoothing spline | opt smoothing |
---|---|---|---|---|---|---|---|
RMSE | 0.4475817 | 0.0902386 | 0.0685612 | 0.0896494 | 0.1334859 | 0.0974597 | 0.0707227 |
The quadratic spline with single knot worked pretty well in our scenario. This spline has a sufficient complexity to fit the sine function and does not overfit. The other simple non-linear functions work well as well (the degree-3 polynomial, cubic spline, smoothing spline). More complex models may overfit the data. The amount of noise has also to be considered when comparing models. Our default setting shows relatively large amount of noise which encourages overfitting. If learning from less noisy data, more complex models can be more advantageous.