Here we apply the best subset selection approach to the `Hitters` data. We wish to predict a baseball player's `Salary` on the basis of various statistics associated with performance in the previous year. First of all, we note that the `Salary` variable is missing for some of the players. The `is.na()` function can be used to identify the missing observations: it returns a vector of the same length as the input vector, with a `TRUE` for any elements that are missing and a `FALSE` for non-missing elements. The `sum()` function can then be used to count all of the missing elements.

```{r}
library(ISLR2)
names(Hitters)
dim(Hitters)
sum(is.na(Hitters$Salary))
```

The `na.omit()` function removes all of the rows that have missing values in any variable.

```{r}
Hitters <- na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))
```

# Best subset selection and stepwise selection

The `regsubsets()` function (part of the `leaps` library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS.

```{r}
library(leaps)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
```

In the output, an asterisk indicates that a given variable is included in the corresponding model. For instance, the output indicates that the best two-variable model contains only `Hits` and `CRBI`.

By default, `regsubsets()` only reports results up to the best eight-variable model. The `nvmax` option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.

```{r}
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)
```

The `summary()` function also returns $R^2$, RSS, adjusted $R^2$, $C_p$, and BIC. We can examine these to try to select the best overall model.

```{r}
names(reg.summary)
```

The $R^2$ statistic increases monotonically as more variables are included:

```{r}
reg.summary$rsq
```

For $C_p$ and BIC, by contrast, we look for the model size that minimizes the statistic. The $C_p$ statistics:

```{r}
reg.summary$cp
which.min(reg.summary$cp)
```

The BIC:

```{r}
reg.summary$bic
which.min(reg.summary$bic)
```
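These criteria are easier to compare visually. As a minimal sketch using base R graphics, we can plot RSS, adjusted $R^2$, $C_p$, and BIC against model size, marking the best model under each criterion in red (adjusted $R^2$ is maximized, while $C_p$ and BIC are minimized):

```{r}
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
# Mark the model size that maximizes adjusted R^2
points(which.max(reg.summary$adjr2), max(reg.summary$adjr2), col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
# Mark the model size that minimizes Cp
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
# Mark the model size that minimizes BIC
points(which.min(reg.summary$bic), min(reg.summary$bic), col = "red", cex = 2, pch = 20)
```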
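The `leaps` package also provides a `plot()` method for `regsubsets` objects, which displays the variables included in the best model for each value of a chosen criterion; the top row corresponds to the best model under that criterion. A brief sketch:

```{r}
# Rows ordered by each criterion; black squares mark included variables
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
```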
We can also use the `regsubsets()` function to perform forward stepwise or backward stepwise selection, using the argument `method = "forward"` or `method = "backward"`.

```{r}
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
    method = "forward")
summary(regfit.fwd)
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
    method = "backward")
summary(regfit.bwd)
```

For these data, the best one-variable through six-variable models are identical for best subset and forward stepwise selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.

```{r}
coef(regfit.full, 7)
coef(regfit.fwd, 7)
coef(regfit.bwd, 7)
```

We just saw that it is possible to choose among a set of models of different sizes using $C_p$, BIC, and adjusted $R^2$. We will now consider how to do this using the validation set and cross-validation approaches. In order to use the validation set approach, we begin by splitting the observations into a training set and a test set.

```{r}
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test <- (!train)
```

We apply `regsubsets()` to the training set in order to perform best subset selection.

```{r}
regfit.best <- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19)
```

We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data. The `model.matrix()` function is used in many regression packages for building an "X" matrix from data.

```{r}
test.mat <- model.matrix(Salary ~ ., data = Hitters[test, ])
```

Now we run a loop, and for each size `i`, we extract the coefficients from `regfit.best` for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.

```{r}
val.errors <- rep(NA, 19)
for (i in 1:19) {
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}
```

We find that the best model is the one that contains seven variables.

```{r}
val.errors
which.min(val.errors)
coef(regfit.best, 7)
```

Finally, we perform best subset selection on the full data set and select the best seven-variable model. We use the full data set, rather than simply the variables that were obtained from the training set, because the best seven-variable model on the full data set may differ from the corresponding model on the training set.

```{r}
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 7)
```

Since `regsubsets()` does not supply a `predict()` method, we write one ourselves; we will use it again for cross-validation below.

```{r}
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])  # recover the formula used in the fit
  mat <- model.matrix(form, newdata)    # build the model matrix for the new data
  coefi <- coef(object, id = id)        # coefficients of the best id-variable model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                # predictions
}
```
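Because the function is named `predict.regsubsets()`, R's S3 dispatch allows us to call it simply as `predict()` on a `regsubsets` fit. As a quick illustration (not part of the original analysis), we can generate fitted values from the best seven-variable model fit above on the full data set:

```{r}
# S3 dispatch: predict() calls predict.regsubsets() for a regsubsets object.
# First few fitted values from the best seven-variable model:
head(predict(regfit.best, Hitters, id = 7))
```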
We now try to choose among the models of different sizes using cross-validation. We must perform best subset selection within each of the $k$ training sets. First, we create a vector that allocates each observation to one of $k=10$ folds, and we create a matrix in which we will store the results.

```{r}
k <- 10
n <- nrow(Hitters)
set.seed(1)
folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19)
```

In the $j$th fold, the elements of `folds` that equal `j` are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new `predict()` method), compute the test errors on the appropriate subset, and store them in the matrix `cv.errors`.

```{r}
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19)
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred)^2)
  }
}
```

This gives us a $10 \times 19$ matrix, of which the $(j,i)$th element corresponds to the test MSE for the $j$th cross-validation fold for the best $i$-variable model. We use the `apply()` function to average over the columns of this matrix in order to obtain a vector for which the $i$th element is the cross-validation error for the $i$-variable model. We see that cross-validation selects a 10-variable model.

```{r}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
which.min(mean.cv.errors)
```

We now perform best subset selection on the full data set in order to obtain the 10-variable model.

```{r}
reg.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(reg.best, 10)
```

# Ridge regression and the lasso

We will now perform ridge regression and the lasso in order to predict `Salary` on the `Hitters` data, using the `glmnet()` function from the `glmnet` package. This function takes an `x` matrix and a `y` vector rather than a formula. The `model.matrix()` function is useful for creating `x`: not only does it produce a matrix corresponding to the $19$ predictors, it also automatically transforms any qualitative variables into dummy variables. The latter property is important because `glmnet()` can only take numerical, quantitative inputs. The `[, -1]` removes the intercept column.

```{r}
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
```

The `glmnet()` function has an `alpha` argument that determines what type of model is fit: if `alpha = 0` then a ridge regression model is fit, and if `alpha = 1` then a lasso model is fit. Here we fit a ridge regression model over a grid of values ranging from $\lambda=10^{10}$ to $\lambda=10^{-2}$, covering the full range of scenarios from the null model containing only the intercept to the least squares fit. By default, the `glmnet()` function standardizes the variables so that they are on the same scale.

```{r}
library(glmnet)
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
```

Associated with each value of $\lambda$ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by `coef()`. In this case, it is a $20 \times 100$ matrix, with $20$ rows (one for each predictor, plus an intercept) and $100$ columns (one for each value of $\lambda$).

```{r}
dim(coef(ridge.mod))
```

We can use the `predict()` function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of $\lambda$, say $50$:

```{r}
predict(ridge.mod, s = 50, type = "coefficients")[1:20, ]
```

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. We randomly choose a subset of numbers between $1$ and $n$ as the indices for the training observations.

```{r}
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
```

We use cross-validation to choose the tuning parameter $\lambda$. We can do this using the built-in cross-validation function, `cv.glmnet()`. By default, the function performs ten-fold cross-validation, though this can be changed using the argument `nfolds`.

```{r}
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
```

The test MSE associated with this value of $\lambda$:

```{r}
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)
```

Finally, we refit on the full data set and examine the coefficient estimates at the chosen value of $\lambda$. None of the coefficients are zero: ridge regression does not perform variable selection.

```{r}
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20, ]
```

In order to fit a lasso model, we once again use the `glmnet()` function, now with `alpha = 1`. We perform cross-validation and compute the associated test error.

```{r}
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)
```

The test MSE is very similar to that of ridge regression. However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse.

```{r}
out <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef
```
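Many of the 19 estimates above are exactly zero. As a small addition to the code above, we can subset the coefficient vector so that only the variables retained by the lasso are displayed:

```{r}
# Show only the predictors with nonzero lasso coefficients
lasso.coef[lasso.coef != 0]
```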