We re-analyze the `Wage` data considered in the examples throughout this chapter. We begin by loading the `ISLR2` library, which contains the data.

```{r}
library(ISLR2)
attach(Wage)
```

# Polynomial regression

The following syntax fits a linear model, using the `lm()` function, in order to predict `wage` using a fourth-degree polynomial in `age`: `poly(age, 4)`. The function returns a matrix whose columns are a basis of orthogonal polynomials, which essentially means that each column is a linear combination of the variables `age`, `age^2`, `age^3` and `age^4`.

```{r}
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
```

However, we can also use `poly()` to obtain `age`, `age^2`, `age^3` and `age^4` directly, if we prefer, by passing the `raw = TRUE` argument to the `poly()` function. This changes the basis, and hence the coefficient estimates, but it does not affect the fitted values.

```{r}
fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(fit2))
```

We create a grid of values for `age` at which we want predictions, and then call the generic `predict()` function, specifying `se = TRUE` to obtain standard errors as well.

```{r}
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
```

In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests. We now fit models ranging from linear to a degree-5 polynomial and seek to determine the simplest model that is sufficient to explain the relationship between `wage` and `age`. We use the `anova()` function, which performs an F-test of the null hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the data against the alternative hypothesis that a more complex model $\mathcal{M}_2$ is required. In order to use the `anova()` function, $\mathcal{M}_1$ and $\mathcal{M}_2$ must be nested models: the predictors in $\mathcal{M}_1$ must be a subset of the predictors in $\mathcal{M}_2$. In this case, we fit five different models and sequentially compare each simpler model to the next more complex model.

```{r}
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```

The p-value comparing the linear `Model 1` to the quadratic `Model 2` is essentially zero ($<10^{-15}$), indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic `Model 2` to the cubic `Model 3` is very low ($0.0017$), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, `Model 3` and `Model 4`, is approximately $5\,\%$, while the degree-5 polynomial `Model 5` seems unnecessary because its p-value is $0.37$. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified. The ANOVA method also works when we have other terms in the model as well.

```{r}
fit.1 <- lm(wage ~ education + age, data = Wage)
fit.2 <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)
```
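The `preds` object computed above contains both fitted values (`preds$fit`) and standard errors (`preds$se.fit`), so the chosen degree-4 fit is easy to visualize. Below is a minimal sketch that plots the fit along with bands at roughly two standard errors; the name `se.bands` is our own choice.

```{r}
# Approximate 95% confidence bands: fit +/- 2 standard errors
se.bands <- cbind(preds$fit + 2 * preds$se.fit,
                  preds$fit - 2 * preds$se.fit)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
```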
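As an alternative to hypothesis tests, the polynomial degree could also be chosen by cross-validation. The sketch below uses `cv.glm()` from the `boot` package; the `cv.errors` vector and the choice of $K = 10$ folds are our own illustration, not part of the analysis above.

```{r}
library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  # glm() with the default gaussian family fits the same model as lm(),
  # but returns an object that cv.glm() can work with
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.errors  # pick the degree with the smallest estimated test error
```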
# Splines

In order to fit regression splines in `R`, we use the `splines` library. Regression splines can be fit by constructing an appropriate matrix of basis functions.

```{r}
library(splines)
```

The `bs()` function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced. We then use the generic `predict()` function, again specifying that we want standard errors.

```{r}
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")
```

Here we have prespecified knots at ages $25$, $40$, and $60$. We could also use the `df` option to produce a spline with knots at uniform quantiles of the data. In this case `R` chooses knots at ages $33.8$, $42.0$, and $51.0$, which correspond to the 25th, 50th, and 75th percentiles of `age`.

```{r}
dim(bs(age, knots = c(25, 40, 60)))
dim(bs(age, df = 6))
attr(bs(age, df = 6), "knots")
```

In order to instead fit a natural spline, we use the `ns()` function. Here we fit a natural spline with four degrees of freedom. As with the `bs()` function, we could instead specify the knots directly using the `knots` option.

```{r}
fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
pred2 <- predict(fit2, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred2$fit, col = "red", lwd = 2)
```

In order to fit a smoothing spline, we use the `smooth.spline()` function. Notice that in the first call to `smooth.spline()` below, we specify `df = 16`. The function then determines which value of $\lambda$ leads to $16$ degrees of freedom. In the second call to `smooth.spline()`, we select the smoothness level by cross-validation; this results in a value of $\lambda$ that yields $6.8$ degrees of freedom.

```{r}
plot(age, wage, col = "darkgrey")
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
```

# Generalized Additive Models

We now fit a GAM to predict `wage` using natural spline functions of `year` and `age`, treating `education` as a qualitative predictor. Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the `lm()` function.

```{r}
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
```

In order to fit more general sorts of GAMs, using smoothing splines or other components that cannot be expressed in terms of basis functions and then fit using least squares regression, we will need to use the `gam` library in `R`. The `s()` function, which is part of the `gam` library, is used to indicate that we would like to use a smoothing spline. We specify that the function of `year` should have $4$ degrees of freedom, and that the function of `age` should have $5$ degrees of freedom. Since `education` is qualitative, we leave it as is, and it is converted into four dummy variables.

```{r}
library(gam)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
plot(gam.m3, se = TRUE, col = "blue")
```
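Even though `gam1` was fit with `lm()` rather than `gam()`, the `gam` library's plotting machinery can still be applied to it. Depending on the version of the `gam` package installed, the function is exported as `plot.Gam()` (older versions used `plot.gam()`); a minimal sketch, assuming a recent version:

```{r}
plot.Gam(gam1, se = TRUE, col = "red")
```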
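We can also use `anova()` to compare a sequence of nested GAMs, for example to decide whether `year` should be excluded, included linearly, or included as a smoothing spline. The model names `gam.m1` and `gam.m2` below are our own; this is a sketch of that comparison, with F-tests indicating which of the nested models the data support.

```{r}
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
```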
The `summary()` function produces a summary of the GAM fit. The "Anova for Parametric Effects" p-values clearly demonstrate that `year`, `age`, and `education` are all highly statistically significant, even when only assuming a linear relationship. Alternatively, the "Anova for Nonparametric Effects" p-values for `year` and `age` correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for `year` shows that a linear function is adequate for this term. However, there is very clear evidence that a non-linear term is required for `age`.

```{r}
summary(gam.m3)
```
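Finally, the generic `predict()` method works for objects fit with `gam()` just as it does for `lm` objects. A minimal sketch, predicting on the training set (the name `preds.gam` is our own, and in practice one would typically predict on a held-out set instead):

```{r}
preds.gam <- predict(gam.m3, newdata = Wage)
head(preds.gam)
```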