We use the `set.seed()` function to set a seed for `R`'s random number generator, so that the results are reproducible.

```{r}
library(ISLR2)
library(boot)
set.seed(1)
```

# Cross validation

The LOOCV estimate can be computed automatically for any generalized linear model using the `glm()` and `cv.glm()` functions. If we use `glm()` to fit a model without passing in the `family` argument, it performs linear regression, just like the `lm()` function, and the fit can then be passed to `cv.glm()`, which is part of the `boot` library.

The `cv.glm()` function produces a list with several components. The two numbers in the `delta` vector contain the LOOCV estimates of the test error; the second is a bias-corrected version of the first. In the case of LOOCV, the two numbers should be very close to each other.

```{r}
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
```

We can repeat this procedure for increasingly complex polynomial fits, using a `for` loop and storing the estimates in the vector `cv.error`.

```{r}
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
```

We see a sharp drop in the estimated test MSE between the linear and quadratic fits, but no clear improvement from using higher-order polynomials.

The `cv.glm()` function can also be used to implement $k$-fold CV.

```{r}
set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
```

We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit. Note that the two numbers associated with `delta` are essentially the same when LOOCV is performed, but differ slightly when we instead perform $k$-fold CV.

# Bootstrap

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use it to assess the variability of the estimates of $\beta_0$ and $\beta_1$, the intercept and slope terms of the linear regression model that uses `horsepower` to predict `mpg` in the `Auto` data set.

We first create a simple function, `boot.fn()`, which takes in the `Auto` data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model. We then apply this function to the full set of $392$ observations in order to compute the estimates of $\beta_0$ and $\beta_1$ on the entire data set.

```{r}
boot.fn <- function(data, index)
  coef(lm(mpg ~ horsepower, data = data, subset = index))
boot.fn(Auto, 1:392)
```

Next, we use the `boot()` function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms.

```{r}
set.seed(1)
boot(Auto, boot.fn, 1000)
```

This indicates that the bootstrap estimate of ${\rm SE}(\hat{\beta}_0)$ is $0.84$, and that the bootstrap estimate of ${\rm SE}(\hat{\beta}_1)$ is $0.0073$. We can compare these with the estimates obtained using the standard formulas.

```{r}
summary(lm(mpg ~ horsepower, data = Auto))$coef
```

The standard error estimates obtained from the standard formulas are $0.717$ for the intercept and $0.0064$ for the slope. Interestingly, these are somewhat different from the estimates obtained using the bootstrap.
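It can be instructive to see where the bootstrap standard errors in the `boot()` output come from. As a minimal sketch (the object name `boot.out` is ours, not part of the code above), we store the bootstrap output and compute the standard deviation of the 1,000 coefficient replicates held in its `t` component; because the same seed is reused, these standard deviations reproduce the standard errors printed above.

```{r}
# Keep the object returned by boot() and compute the standard deviation of
# the 1,000 bootstrap replicates for each coefficient "by hand".
set.seed(1)
boot.out <- boot(Auto, boot.fn, 1000)
apply(boot.out$t, 2, sd)
```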
Indeed, bootstrap standard errors are automatically heteroskedasticity-robust, whereas the standard formulas assume a constant error variance. To obtain White heteroskedasticity-robust standard errors directly from the fitted model, we can use the `sandwich` package. We construct the robust variance-covariance matrix `rvcv`; the robust standard errors are the square roots of its diagonal elements.

```{r}
library(sandwich)
lm.fit <- lm(mpg ~ horsepower, data = Auto)
rvcv <- vcovHC(lm.fit, type = "HC")
sqrt(diag(rvcv))
```

To get *t*-statistics and *p*-values:

```{r}
library(lmtest)
coeftest(lm.fit, vcov = rvcv)
```
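As a final cross-check, we can place the three sets of standard errors side by side. This is only an illustrative sketch: it reuses the `boot.out` object created above, and the name `se.tab` is our own. If the bootstrap and the White correction are picking up the same heteroskedasticity, the last two columns should be broadly similar.

```{r}
# Compare the formula-based, bootstrap, and robust standard errors.
se.tab <- cbind(
  formula   = summary(lm.fit)$coef[, "Std. Error"],
  bootstrap = apply(boot.out$t, 2, sd),
  robust    = sqrt(diag(rvcv))
)
se.tab
```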