Conceptual


Ex.1

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

Answer: Best subset. For each \(k\) it examines all \(\binom{p}{k}\) possible models and keeps the one with the smallest training RSS, so it can never do worse on training RSS than the two stepwise methods.

(b) Which of the three models with k predictors has the smallest test RSS?

Answer: There is no way to tell in general. Best subset attains the smallest training RSS, but all three methods choose models using the training data, so the model with the smallest test RSS could come from any of them; best subset may simply overfit harder.
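A quick simulation can illustrate this (a minimal sketch, assuming the leaps package; the seed, dimensions, and coefficients below are arbitrary):

library(leaps)

# For a fixed k, best subset is guaranteed to win on *training* RSS,
# but the *test* RSS ranking of the three methods can go either way.
set.seed(42)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
dat <- data.frame(y, X)
train <- sample(n, n / 2)

k <- 3
test_rss <- sapply(c("exhaustive", "forward", "backward"), function(m) {
  fit <- regsubsets(y ~ ., dat[train, ], nvmax = p, method = m)
  cf <- coef(fit, k)
  pred <- cbind(1, as.matrix(dat[-train, names(cf)[-1]])) %*% cf
  sum((dat$y[-train] - pred)^2)
})
test_rss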

(c) True or False:

i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

Answer: True. Forward stepwise builds the (k+1)-variable model by adding exactly one predictor to the k-variable model.

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.

Answer: True. Backward stepwise obtains the k-variable model by deleting exactly one predictor from the (k+1)-variable model.

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by forward stepwise selection.

Answer: False. Forward and backward stepwise explore different sequences of models, so there is no such relationship between them in general.

iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

Answer: False, for the same reason as iii.

v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

Answer: False. Best subset re-optimizes over all \(\binom{p}{k}\) models at each size, so the chosen variable sets need not be nested.
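One can check the nesting claims empirically (a sketch; mtcars is used purely for illustration, and whether best subset's nesting actually breaks depends on the data set):

library(leaps)

# logical matrix of selected variables: rows = model size, cols = predictors
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 9)
sel <- summary(fit)$which

# is every k-variable model contained in the (k+1)-variable model?
for (k in 1:(nrow(sel) - 1)) {
  nested <- all(colnames(sel)[sel[k, ]] %in% colnames(sel)[sel[k + 1, ]])
  cat("k =", k, "nested in k + 1:", nested, "\n")
}

With method = "forward" or method = "backward" this always prints TRUE by construction; for the default exhaustive search it is not guaranteed.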

Ex.2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Answer: Incorrect. The lasso is less flexible than least squares: the \(\ell_1\) penalty shrinks the coefficients (some exactly to zero), which helps prevent overfitting.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: Incorrect for the same reason as i.: the lasso is less flexible, not more.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Answer: Correct. The lasso improves prediction accuracy when the variance it removes outweighs the bias it introduces.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: Incorrect; this reverses the bias–variance trade-off stated in iii.

(b) Repeat (a) for ridge regression relative to least squares.

Answer: iii., as in (a). Like the lasso, ridge regression is less flexible than least squares and improves prediction accuracy when the decrease in variance exceeds the increase in bias.

(c) Repeat (a) for non-linear methods relative to least squares.

Answer: Non-linear methods are more flexible and will give improved prediction accuracy when their increase in variance is smaller than their decrease in bias, hence ii. is correct.
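The trade-off in (a) and (b) is easy to see in a small simulation (a sketch; the seed and dimensions are arbitrary): with many predictors and few observations, least squares overfits, and the less flexible lasso can win on test MSE because the variance it removes outweighs the bias it adds.

library(glmnet)

set.seed(2)
n <- 60; p <- 50
X <- matrix(rnorm(2 * n * p), ncol = p)
beta <- c(rep(1.5, 5), rep(0, p - 5))
y <- as.vector(X %*% beta) + rnorm(2 * n)
train <- 1:n

# least squares on all 50 predictors
ols <- lm(y ~ ., data = data.frame(y = y[train], X[train, ]))
ols_pred <- predict(ols, newdata = data.frame(X[-train, ]))

# lasso with lambda chosen by cross-validation
cvfit <- cv.glmnet(X[train, ], y[train], alpha = 1)
lasso_pred <- predict(cvfit, X[-train, ], s = "lambda.min")

c(ols = mean((y[-train] - ols_pred)^2),
  lasso = mean((y[-train] - lasso_pred)^2))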

Ex.3

Suppose we estimate the regression coefficients in a linear regression model by minimizing

\[\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\lvert\beta_j\rvert \le s\]

for a particular value of \(s\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a) As we increase \(s\) from 0, the training RSS will:

Answer: iv. is correct: as \(s\) increases the constraint relaxes, the model can fit the training data better and better, and the training RSS steadily decreases.

(b) Repeat (a) for test RSS.

Answer: ii. is correct: test RSS decreases initially, as the added flexibility reduces bias, but eventually starts increasing once the model begins to overfit, tracing a U-shape.

(c) Repeat (a) for variance.

Answer: iii. is correct: variance steadily increases as the constraint relaxes and the model becomes more flexible.

(d) Repeat (a) for (squared) bias.

Answer: iv. is correct: (squared) bias steadily decreases as the constraint relaxes.

(e) Repeat (a) for the irreducible error.

Answer: v. is correct: the irreducible error is a property of the data, not the model, so it remains constant.

Ex.4

Suppose we estimate the regression coefficients in a linear regression model by minimizing

\[\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\]

for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a) As we increase \(\lambda\) from 0, the training RSS will:

Answer: iii. is correct: training RSS steadily increases. As \(\lambda\) grows the \(\beta\) estimates are shrunk further and further toward zero, so the fit to the training data deteriorates.

(b) Repeat (a) for test RSS.

Answer: ii. is correct: test RSS decreases initially and then eventually starts increasing, in a U-shape. The initial decrease comes from the penalty curbing overfitting; the eventual increase comes from the coefficients being shrunk so far toward zero that the model underfits.

(c) Repeat (a) for variance.

Answer: iv. is correct: variance steadily decreases. Increasing \(\lambda\) makes the model less and less flexible.

(d) Repeat (a) for (squared) bias.

Answer: iii. is correct: (squared) bias steadily increases; the loss of flexibility is paid for in bias.

(e) Repeat (a) for the irreducible error.

Answer: v. is correct: the irreducible error remains constant by definition, since it does not depend on the model.
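Both this exercise and Ex.3 can be visualized with a single ridge path on simulated data (a sketch; the seed, dimensions, and lambda grid are arbitrary). Training MSE rises monotonically with \(\lambda\), while test MSE typically traces the U-shape from (b):

library(glmnet)

set.seed(1)
n <- 50; p <- 40
X <- matrix(rnorm(2 * n * p), ncol = p)
beta <- c(rep(2, 5), rep(0, p - 5))
y <- as.vector(X %*% beta) + rnorm(2 * n)
train <- 1:n

fit <- glmnet(X[train, ], y[train], alpha = 0,
              lambda = 10^seq(-3, 3, length.out = 60))

# columns of predict() correspond to the lambda grid
train_mse <- colMeans((predict(fit, X[train, ]) - y[train])^2)
test_mse  <- colMeans((predict(fit, X[-train, ]) - y[-train])^2)

matplot(log10(fit$lambda), cbind(train_mse, test_mse), type = "l", lty = 1,
        col = c("blue", "red"), xlab = "log10(lambda)", ylab = "MSE")
legend("topleft", legend = c("train", "test"), col = c("blue", "red"), lty = 1)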

Ex.5

It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1+y_2 = 0\) and \(x_{11}+x_{21} = 0\) and \(x_{12}+x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat\beta_0 = 0\).

(a) Write out the ridge regression optimization problem in this setting.

Answer: We have to minimize the following:

\[\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right)^2 + \left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right)^2 + \lambda\left(\beta_1^2 + \beta_2^2\right)\]

(b) Argue that in this setting, the ridge coefficient estimates satisfy \(\hat\beta_1 = \hat\beta_2\)

Answer: To find the coefficients we differentiate with respect to \(\beta_1\) and \(\beta_2\) and set the derivatives to zero. Since \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), the two partial derivatives are identical except for the penalty term:

\[-2\sum_{i=1}^{2}x_{i1}\bigl(y_i - (\beta_1 + \beta_2)x_{i1}\bigr) + 2\lambda\beta_1 = 0, \qquad -2\sum_{i=1}^{2}x_{i1}\bigl(y_i - (\beta_1 + \beta_2)x_{i1}\bigr) + 2\lambda\beta_2 = 0.\]

Subtracting one equation from the other gives \(2\lambda(\hat\beta_1 - \hat\beta_2) = 0\), hence \(\hat\beta_1 = \hat\beta_2\) for any \(\lambda > 0\).

(c) Write out the lasso optimization problem in this setting.

(d) Argue that in this setting, the lasso coefficients \(\hat\beta_1\) and \(\hat\beta_2\) are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.

Answer (c)&(d): The lasso problem is

\[\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right)^2 + \left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right)^2 + \lambda\left(\lvert\beta_1\rvert + \lvert\beta_2\rvert\right).\]

Because the two predictors are identical, the fit depends on \(\beta_1\) and \(\beta_2\) only through their sum \(\beta_1 + \beta_2\), and for coefficients of the same sign the penalty does too. Any pair with the same sum (and sign) therefore attains the same objective value: the solutions form a whole line segment \(\beta_1 + \beta_2 = c\) rather than a unique point.
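A numerical check of the non-uniqueness (a sketch; the data values below are made up but satisfy the constraints of the exercise):

# x11 = x12 = 2, x21 = x22 = -2; y1 + y2 = 0
x1 <- c(2, -2)
y  <- c(3, -3)
lambda <- 1

lasso_obj <- function(b1, b2) {
  sum((y - (b1 + b2) * x1)^2) + lambda * (abs(b1) + abs(b2))
}

# coefficient pairs with the same (same-sign) sum give the same objective:
c(lasso_obj(0.7, 0.1), lasso_obj(0.4, 0.4), lasso_obj(0.1, 0.7))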

Ex.6

We will now explore (6.12) and (6.13) further.

(a) Consider (6.12) with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.12) as a function of \(\beta_1\). Your plot should confirm that (6.12) is solved by (6.14).

Answer: Following is the plot and the confirmation:

# grid of beta_1 values, plus the y and lambda choices for the plot
beta1 <- seq(from = 1, to = 5, by = 0.1)
y <- 3
lambda <- 0.5

# ridge objective (6.12) with p = 1: (y - beta_1)^2 + lambda * beta_1^2
fbeta1_reg_6a <- function(beta1, y, lambda){
  (y - beta1)^2 + lambda*beta1^2
}

fbeta1 <- map_dbl(beta1, fbeta1_reg_6a, y, lambda)

# closed-form minimizer (6.14): beta_1 = y / (1 + lambda)
beta1formula_6a <- y / (1 + lambda)

ggplot() +
        geom_line(aes(y = fbeta1, x = beta1), color = "blue") + 
        geom_hline(aes(yintercept = y), color = "red") +
        geom_vline(aes(xintercept = beta1formula_6a), color = "green")

(b) Consider (6.13) with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).

Answer: Following is the plot and the confirmation:

beta1 <- seq(from = 1, to = 5, by = 0.1)
y <- 3
lambda <- 0.5

# lasso objective (6.13) with p = 1: (y - beta_1)^2 + lambda * |beta_1|
fbeta1_reg_6b <- function(beta1, y, lambda){
  (y - beta1)^2 + lambda*abs(beta1)
}

fbeta1 <- map_dbl(beta1, fbeta1_reg_6b, y, lambda)

# closed-form minimizer (6.15) in the case y > lambda/2: beta_1 = y - lambda/2
beta1formula_6b <- y - lambda/2

ggplot() +
        geom_line(aes(y = fbeta1, x = beta1), color = "blue") + 
        geom_hline(aes(yintercept = y), color = "red") +
        geom_vline(aes(xintercept = beta1formula_6b), color = "green")

Ex.7

We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.

(a) Suppose that \(y_i = \beta_0 + \sum_{j=1}^px_{ij}\beta_j + \epsilon_i\) where \(\epsilon_1, ...., \epsilon_n\) are independent and identically distributed from a \(N(0, \sigma^2)\) distribution. Write out the likelihood for the data.

Answer: Following is the likelihood:

\[L(Y \mid X, \beta) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\right)^2}{2\sigma^2}\right)\]

(b) Assume the following prior for \(\beta: \beta_1, . . . , \beta_p\) are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter b: i.e.\(p(\beta) = \frac{1}{2b}exp(\frac{-\lvert\beta\rvert}{b})\). Write out the posterior for \(\beta\) in this setting.

Answer: By Bayes' theorem, the posterior is proportional to the likelihood times the prior:

\[p(\beta \mid X, Y) \propto L(Y \mid X, \beta)\,p(\beta) = L(Y \mid X, \beta)\prod_{j=1}^{p}\frac{1}{2b}\exp\left(-\frac{\lvert\beta_j\rvert}{b}\right)\]

(c) Argue that the lasso estimate is the \(mode\) for \(\beta\) under this posterior distribution.

Answer: Taking the logarithm of the posterior, maximizing it (i.e. finding the mode) is equivalent to minimizing

\[\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\right)^2 + \frac{2\sigma^2}{b}\sum_{j=1}^{p}\lvert\beta_j\rvert,\]

which is exactly the lasso objective with \(\lambda = 2\sigma^2/b\). The posterior mode is therefore the lasso estimate.
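A numerical sanity check of this equivalence (a sketch; one predictor, \(\sigma^2\) known, and the values of b and the data are arbitrary):

set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1.5 * x + rnorm(n)
sigma2 <- 1
b <- 0.5

# log-posterior up to an additive constant
log_post <- function(beta) -sum((y - beta * x)^2) / (2 * sigma2) - abs(beta) / b
# the matching lasso objective, with lambda = 2 * sigma2 / b
lasso_obj <- function(beta) sum((y - beta * x)^2) + (2 * sigma2 / b) * abs(beta)

grid <- seq(-1, 3, by = 1e-4)
c(map = grid[which.max(sapply(grid, log_post))],
  lasso = grid[which.min(sapply(grid, lasso_obj))])

The two grid searches return the same value, since the lasso objective is just \(-2\sigma^2\) times the log-posterior plus a constant.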

(d) Now assume the following prior for \(\beta: \beta_1,....,\beta_p\) are independent and identically distributed according to a normal distribution with mean zero and variance \(c\). Write out the posterior for \(\beta\) in this setting.

Answer: With the normal prior, the posterior becomes

\[p(\beta \mid X, Y) \propto L(Y \mid X, \beta)\prod_{j=1}^{p}\frac{1}{\sqrt{2\pi c}}\exp\left(-\frac{\beta_j^2}{2c}\right)\]

(e) Argue that the ridge regression estimate is both the mode and the mean for \(\beta\) under this posterior distribution.

Answer: Taking logs as in (c), maximizing this posterior is equivalent to minimizing \(\mathrm{RSS} + \frac{\sigma^2}{c}\sum_{j=1}^{p}\beta_j^2\), i.e. ridge regression with \(\lambda = \sigma^2/c\), so the ridge estimate is the posterior mode. Moreover, this posterior is itself Gaussian, and the mode of a Gaussian distribution equals its mean, so the ridge estimate is both.


Applied


Ex.8

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(a) Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).

Answer: Following are the vectors:

set.seed(123) 
x <- rnorm(100)
set.seed(321)
eps <- rnorm(100)

(b) Generate a response vector \(Y\) of length \(n = 100\) according to the model

\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)

where \(\beta_0, \beta_1, \beta_2, \beta_3\) are constants of your choice.

Answer: Following is the response vector:

beta_0 <- 0.5
beta_1 <- 3
beta_2 <- 1.2
beta_3 <- -0.123

y <- beta_0 + beta_1*x + beta_2*x^2 + beta_3*x^3 + eps

tbl_data <- tibble(
  x,
  y
)

tbl_data
## # A tibble: 100 x 2
##          x      y
##      <dbl>  <dbl>
##  1 -0.560   0.922
##  2 -0.230  -0.837
##  3  1.56    7.35 
##  4  0.0705  0.598
##  5  0.129   0.784
##  6  1.72    8.82 
##  7  0.461   2.85 
##  8 -1.27   -0.893
##  9 -0.687  -0.615
## 10 -0.446  -1.14 
## # ... with 90 more rows

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2,...., X^{10}\).

What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)?

Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).

Answer: We start by performing best subset selection, then select the best model obtained.

n <- 10
mod_form <- as.formula(y ~ poly(x, n, raw = T))

bestsel <- regsubsets(mod_form, tbl_data, nvmax = 10)

tbestsel <- tidy.regsubsets(bestsel) %>%
  mutate(model.number = 1:n()) %>%
  select(model.number, everything())

tbestsel
## # A tibble: 10 x 17
##    model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
##           <int> <lgl>         <lgl>            <lgl>           
##  1            1 TRUE          TRUE             FALSE           
##  2            2 TRUE          TRUE             TRUE            
##  3            3 TRUE          TRUE             TRUE            
##  4            4 TRUE          TRUE             TRUE            
##  5            5 TRUE          TRUE             TRUE            
##  6            6 TRUE          TRUE             FALSE           
##  7            7 TRUE          TRUE             FALSE           
##  8            8 TRUE          TRUE             FALSE           
##  9            9 TRUE          TRUE             TRUE            
## 10           10 TRUE          TRUE             TRUE            
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## #   raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## #   T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## #   T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## #   T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbestsel, aes(y = adj.r.squared, x = model.number)) + 
  geom_point(color = "red", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")

gg.cp <- ggplot(data = tbestsel, aes(y = mallows_cp, x = model.number)) + 
  geom_point(color = "blue", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")

gg.bic <- ggplot(data = tbestsel, aes(y = BIC, x = model.number)) + 
  geom_point(color = "green", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")


grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)

Model #2 exhibits the best characteristics across the three criteria.

coef(bestsel, 2) %>% broom::tidy()
## # A tibble: 3 x 2
##   names                    x
##   <chr>                <dbl>
## 1 (Intercept)          0.537
## 2 poly(x, n, raw = T)1 2.60 
## 3 poly(x, n, raw = T)2 1.17

(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

Answer: First we do forward and then backwards.

forstepsel <- 
  regsubsets(mod_form, tbl_data, nvmax = 19, method = "forward")
  
tforstepsel <-  
  forstepsel %>%
  tidy %>%
  mutate(model.number = 1:n()) %>%
  select(model.number, everything())

tforstepsel
## # A tibble: 10 x 17
##    model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
##           <int> <lgl>         <lgl>            <lgl>           
##  1            1 TRUE          TRUE             FALSE           
##  2            2 TRUE          TRUE             TRUE            
##  3            3 TRUE          TRUE             TRUE            
##  4            4 TRUE          TRUE             TRUE            
##  5            5 TRUE          TRUE             TRUE            
##  6            6 TRUE          TRUE             TRUE            
##  7            7 TRUE          TRUE             TRUE            
##  8            8 TRUE          TRUE             TRUE            
##  9            9 TRUE          TRUE             TRUE            
## 10           10 TRUE          TRUE             TRUE            
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## #   raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## #   T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## #   T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## #   T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tforstepsel, aes(y = adj.r.squared, x = model.number)) + 
  geom_point(color = "red", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
  xlab("model number")

gg.cp <- ggplot(data = tforstepsel, aes(y = mallows_cp, x = model.number)) + 
  geom_point(color = "blue", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
  xlab("model number")

gg.bic <- ggplot(data = tforstepsel, aes(y = BIC, x = model.number)) + 
  geom_point(color = "green", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
  xlab("model number")


grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)

Model #2 exhibits the best characteristics and matches the choice from the best subset method.

coef(forstepsel, 2) %>% broom::tidy()
## # A tibble: 3 x 2
##   names                    x
##   <chr>                <dbl>
## 1 (Intercept)          0.537
## 2 poly(x, n, raw = T)1 2.60 
## 3 poly(x, n, raw = T)2 1.17
backstepsel <- 
  regsubsets(mod_form, tbl_data, nvmax = 19, method = "backward")
  
tbackstepsel <-  
  backstepsel %>%
  tidy %>%
  mutate(model.number = 1:n()) %>%
  select(model.number, everything())

tbackstepsel
## # A tibble: 10 x 17
##    model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
##           <int> <lgl>         <lgl>            <lgl>           
##  1            1 TRUE          TRUE             FALSE           
##  2            2 TRUE          TRUE             FALSE           
##  3            3 TRUE          TRUE             FALSE           
##  4            4 TRUE          TRUE             FALSE           
##  5            5 TRUE          TRUE             FALSE           
##  6            6 TRUE          TRUE             FALSE           
##  7            7 TRUE          TRUE             FALSE           
##  8            8 TRUE          TRUE             FALSE           
##  9            9 TRUE          TRUE             TRUE            
## 10           10 TRUE          TRUE             TRUE            
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## #   raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## #   T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## #   T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## #   T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbackstepsel, aes(y = adj.r.squared, x = model.number)) + 
  geom_point(color = "red", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
  xlab("model number")

gg.cp <- ggplot(data = tbackstepsel, aes(y = mallows_cp, x = model.number)) + 
  geom_point(color = "blue", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
  xlab("model number")

gg.bic <- ggplot(data = tbackstepsel, aes(y = BIC, x = model.number)) + 
  geom_point(color = "green", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
  xlab("model number")


grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)

Model #3 exhibits the best characteristics with this method.

coef(backstepsel, 3) %>% broom::tidy()
## # A tibble: 4 x 2
##   names                      x
##   <chr>                  <dbl>
## 1 (Intercept)           0.534 
## 2 poly(x, n, raw = T)1  2.79  
## 3 poly(x, n, raw = T)2  1.18  
## 4 poly(x, n, raw = T)3 -0.0823

(e) Now fit a lasso model to the simulated data, again using \(X, X^2,..., X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.

Answer: First we fit a lasso model and select the optimal \(\lambda\), then we create the plots, and last we report and discuss.

# Find the best lambda for the lasso using cross-validation
x <- model.matrix(mod_form, tbl_data)[,-1]
y <- tbl_data$y

cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 0.2866097
plot(cv)

# Fit the final lasso model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)

# Display regression coefficients
coef(model) %>% tidy
##                    row column     value
## 1          (Intercept)     s0 0.7376612
## 2 poly(x, n, raw = T)1     s0 2.3338827
## 3 poly(x, n, raw = T)2     s0 0.9580560

So the lasso retains non-zero coefficients only for \(X\) and \(X^2\); the small true \(X^3\) effect (\(\beta_3 = -0.123\)) is shrunk to zero.

(f) Now generate a response vector \(Y\) according to the model

\(Y = \beta_0 + \beta_7X^7 + \epsilon\)

and perform best subset selection and the lasso. Discuss the results obtained.

Answer: First we generate the data, then we perform best subset selection, after that the lasso, and lastly we discuss the results obtained.

set.seed(123) 
x <- rnorm(100)
set.seed(321) 
eps <- rnorm(100)

beta_0 <- 0.123
beta_7 <- 1.4

y <- beta_0 + beta_7*x^7 + eps

tbl_data <- tibble(
  x,
  y
)

tbl_data
## # A tibble: 100 x 2
##          x          y
##      <dbl>      <dbl>
##  1 -0.560    1.80    
##  2 -0.230   -0.589   
##  3  1.56    31.1     
##  4  0.0705   0.00335 
##  5  0.129   -0.000960
##  6  1.72    61.5     
##  7  0.461    0.856   
##  8 -1.27    -6.90    
##  9 -0.687    0.361   
## 10 -0.446   -0.434   
## # ... with 90 more rows
n <- 10
mod_form <- 
  as.formula(y ~ poly(x, n, raw = T))

bestsel <- 
  regsubsets(mod_form, tbl_data)

tbestsel <- 
  tidy.regsubsets(bestsel) %>%
  mutate(model.number = 1:n()) %>%
  select(model.number, everything())

tbestsel
## # A tibble: 8 x 17
##   model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
##          <int> <lgl>         <lgl>            <lgl>           
## 1            1 TRUE          FALSE            FALSE           
## 2            2 TRUE          FALSE            FALSE           
## 3            3 TRUE          TRUE             FALSE           
## 4            4 TRUE          TRUE             FALSE           
## 5            5 TRUE          TRUE             FALSE           
## 6            6 TRUE          FALSE            TRUE            
## 7            7 TRUE          TRUE             TRUE            
## 8            8 TRUE          FALSE            TRUE            
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## #   raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## #   T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## #   T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## #   T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbestsel, aes(y = adj.r.squared, x = model.number)) + 
  geom_point(color = "red", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")

gg.cp <- ggplot(data = tbestsel, aes(y = mallows_cp, x = model.number)) + 
  geom_point(color = "blue", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")

gg.bic <- ggplot(data = tbestsel, aes(y = BIC, x = model.number)) + 
  geom_point(color = "green", size = 3) +
  geom_line() +
  scale_x_discrete(limits = c(1:nrow(tbestsel))) +
  xlab("model number")


grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)

We observe that model #2 exhibits the highest adjusted \(R^2\) and the lowest \(C_p\), though its \(BIC\) is not the minimum.

coef(bestsel, 2) %>% tidy()
## # A tibble: 3 x 2
##   names                     x
##   <chr>                 <dbl>
## 1 (Intercept)           0.151
## 2 poly(x, n, raw = T)3 -0.129
## 3 poly(x, n, raw = T)7  1.41
# Find the best lambda for the lasso using cross-validation
x <- model.matrix(mod_form, tbl_data)[,-1]
y <- tbl_data$y

cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
#cv$lambda.min

# Fit the final lasso model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)

# Display regression coefficients
coef(model) %>% tidy()
##                    row column     value
## 1          (Intercept)     s0 0.5466296
## 2 poly(x, n, raw = T)7     s0 1.2464720

The lasso keeps only the \(X^7\) term, matching the true model (with its coefficient shrunk slightly below the true \(\beta_7 = 1.4\)), whereas best subset's two-variable model also includes a spurious \(X^3\) term. On this sparse truth, the lasso recovers the correct support more cleanly.

Ex.9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

Answer: Following is the split

set.seed(123)
College_split <- initial_split(College)
College_train <- training(College_split)
College_test <- testing(College_split)

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

Answer: Following is the linear fit and test error

linear_fit <- lm(Apps ~ ., data = College_train)
College_augm <- broom::augment(linear_fit, newdata = College_test)
mse <- mean((College_augm$Apps - College_augm$.fitted)^2)

tbl_ex9 <- tibble(
  model = as_factor("linear"),
  mse = mse
)

format(mse, big.mark = " ")
## [1] "748 605.3"

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

Answer: Following is the ridge regression fit and the test error

set.seed(1)

mod_form_9c <- 
  as.formula(Apps ~ .)

x_train <- model.matrix(mod_form_9c, College_train)[,-1]
x_test <- model.matrix(mod_form_9c, College_test)[,-1]
y_train <- College_train$Apps
y_test <- College_test$Apps

cv.out_9c <- cv.glmnet(x_train, y_train, alpha = 0)
bestlam_9c <- cv.out_9c$lambda.min

ridge_fit_9c <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam_9c)

# evaluate on the held-out test set, not the training data
ridge_pred <- predict(ridge_fit_9c, s = bestlam_9c, newx = x_test)
mse_9c <- mean((ridge_pred - y_test)^2)

tbl_ex9 <- 
  tbl_ex9 %>% add_row(
  model = as_factor("ridge"),
  mse = mse_9c
)

format(mse_9c, big.mark = " ")
## [1] "1 543 276"

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Answer: Following is the lasso fit, the test error, and the number of non-zero coefficients

cv.out_9d <- cv.glmnet(x_train, y_train, alpha = 1)
bestlam_9d <- cv.out_9d$lambda.min

lasso_fit_9d <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam_9d)

# evaluate on the held-out test set
lasso_pred <- predict(lasso_fit_9d, s = bestlam_9d, newx = x_test)
mse_9d <- mean((lasso_pred - y_test)^2)

tbl_ex9 <- 
  tbl_ex9 %>% add_row(
  model = as_factor("lasso"),
  mse = mse_9d
)

format(mse_9d, big.mark = " ")
## [1] "1 171 588"
coef(lasso_fit_9d)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -692.40339314
## PrivateYes  -441.46287133
## Accept         1.56817007
## Enroll        -0.62688724
## Top10perc     50.49907167
## Top25perc    -13.31351997
## F.Undergrad    0.02073397
## P.Undergrad    0.04127057
## Outstate      -0.09524040
## Room.Board     0.16146686
## Books          0.15593419
## Personal       0.02170125
## PhD           -8.81033807
## Terminal      -1.90393631
## S.F.Ratio     13.30074873
## perc.alumni    1.63783734
## Expend         0.07832900
## Grad.Rate      8.86207078

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

Answer: Following is the PCR fit, then the value of M, and finally the test error

set.seed(2)
pcr_fit <- pcr(mod_form_9c, data = College_train, scale=TRUE, validation ="CV")
summary(pcr_fit)
## Data:    X dimension: 583 17 
##  Y dimension: 583 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4126     4061     2186     2178     1985     1778     1750
## adjCV         4126     4061     2182     2174     1990     1761     1745
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1706     1689     1649      1645      1651      1654      1662
## adjCV     1696     1680     1644      1640      1645      1648      1657
##        14 comps  15 comps  16 comps  17 comps
## CV         1666      1591      1239      1197
## adjCV      1661      1561      1230      1190
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      30.457    56.17    63.48    69.31    74.75    79.79    83.49
## Apps    3.673    73.18    73.53    78.61    83.32    83.48    84.50
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.06    90.23     92.79     94.91     96.71     97.83     98.71
## Apps    84.88    85.52     85.76     85.76     85.76     85.79     85.81
##       15 comps  16 comps  17 comps
## X        99.35     99.83     100.0
## Apps     91.04     92.75      93.1
validationplot(pcr_fit, val.type="MSEP")

From \(M = 5\) onward the CV RMSE improves only slightly, so we settle on \(M = 6\)
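Instead of eyeballing the validation plot, the pls package also offers a heuristic (a sketch; selectNcomp() is available in recent versions of pls and applies equally to the plsr fit in (f)):

# smallest M whose CV error is within one standard error of the minimum
selectNcomp(pcr_fit, method = "onesigma", plot = TRUE)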

pcr_pred <- predict(pcr_fit, College_test, ncomp = 6)
mse_9e <- mean((pcr_pred - y_test)^2)

tbl_ex9 <- 
  tbl_ex9 %>% add_row(
  model = as_factor("pcr"),
  mse = mse_9e
)

format(mse_9e, big.mark = " ")
## [1] "1 594 157"

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

Answer: Following is the PLS fit, then the value of M, and finally the test error

set.seed(2)
plsr_fit <- plsr(mod_form_9c, data = College_train, scale=TRUE, validation ="CV")
summary(plsr_fit)
## Data:    X dimension: 583 17 
##  Y dimension: 583 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4126     2001     1686     1569     1514     1364     1236
## adjCV         4126     1995     1682     1562     1496     1346     1226
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1219     1211     1203      1197      1200      1197      1197
## adjCV     1211     1204     1196      1190      1193      1190      1190
##        14 comps  15 comps  16 comps  17 comps
## CV         1197      1198      1197      1197
## adjCV      1190      1190      1190      1190
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       25.89    39.42    61.49    64.66    68.69    73.13    77.43
## Apps    78.03    85.26    87.62    90.60    92.35    92.88    92.94
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       80.63    82.39     85.55     88.45     90.74     92.17     94.87
## Apps    92.99    93.06     93.07     93.09     93.10     93.10     93.10
##       15 comps  16 comps  17 comps
## X        96.97     98.26     100.0
## Apps     93.10     93.10      93.1
validationplot(plsr_fit, val.type="MSEP")

From \(M = 5\) onward the CV RMSE improves only slightly, so we settle on \(M = 6\)

plsr_pred <- predict(plsr_fit, College_test, ncomp = 6)
mse_9f <- mean((plsr_pred - y_test)^2)

tbl_ex9 <- 
  tbl_ex9 %>% add_row(
  model = as_factor("pls"),
  mse = mse_9f
)

format(mse_9f, big.mark = " ")
## [1] "794 459.7"

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Answer: Following is the comparison

ggplot(data = tbl_ex9) +
  geom_line(aes(x = reorder(model, mse), y = mse, group = 1)) +
  geom_point(aes(x = reorder(model, mse), y = mse, shape = model, color = model), size = 3) +
  xlab("model")

Judging by the reported errors, plain least squares and PLS achieve the lowest test MSE, while ridge and PCR give the highest; the five approaches differ by roughly a factor of two, so the choice of method matters but no approach is dramatically better here.

Ex.10

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a) Generate a data set with \(p = 20\) features, \(n = 1,000\) observations, and an associated quantitative response vector generated according to the model

\(Y = X\beta + \epsilon\)

where \(\beta\) has some elements that are exactly equal to zero.

Answer: Following is the generation of the data set

set.seed(123)
features_10a <- 20
obs_10a <- 1000

eps_10a <- rnorm(obs_10a)
x_seeds_10a <- rnorm(features_10a)

betas_10a <- rnorm(features_10a)
betas_10a[1] <- 0
betas_10a[5] <- 0
betas_10a[10] <- 0
betas_10a[15] <- 0
betas_10a[20] <- 0

generate_x_10a <- function(seed, nobs){
  rnorm(n = nobs, mean = seed)
}

tbl_10a <- 
  map_dfc(x_seeds_10a, generate_x_10a, obs_10a) %>%
  add_column(eps_10a)

# y = X beta + eps, computed with a single matrix product instead of
# writing out all twenty terms by hand
y_10a <- as.vector(as.matrix(tbl_10a %>% select(-eps_10a)) %*% betas_10a) + eps_10a

tbl_10a <- tbl_10a %>%
  add_column(y_10a = y_10a) %>%
  select(y_10a, everything())

head(tbl_10a)
## # A tibble: 6 x 22
##    y_10a     V1     V2       V3     V4    V5    V6     V7    V8    V9
##    <dbl>  <dbl>  <dbl>    <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1  5.73  -2.02  -0.994 -0.00846  1.73  -1.23 2.13   1.58   3.53 1.84 
## 2  0.800 -0.390 -0.387  0.399   -0.185 -2.13 0.259  0.942  2.66 0.672
## 3  3.78  -1.08  -2.69   0.609    0.201 -2.67 2.68   0.320  1.63 0.900
## 4  4.00  -1.26  -1.35   1.19     1.18  -1.97 0.870  0.197  1.15 2.10 
## 5  2.82  -0.532 -0.465  0.755   -0.307 -3.34 0.387 -0.747  2.76 1.46 
## 6 -0.868 -2.02  -1.56  -1.40    -1.18  -2.28 0.333  1.24   1.66 0.603
## # ... with 12 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## #   V20 <dbl>, eps_10a <dbl>

(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations.

Answer: Following is the split

set.seed(123)
data_split_10b <- initial_split(tbl_10a %>% select(-eps_10a), prop = 1/10)
data_split_10b_train <- training(data_split_10b)
data_split_10b_test <- testing(data_split_10b)
head(data_split_10b_train)
## # A tibble: 6 x 21
##    y_10a    V1     V2       V3      V4    V5     V6     V7    V8       V9
##    <dbl> <dbl>  <dbl>    <dbl>   <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>
## 1  5.73  -2.02 -0.994 -0.00846  1.73   -1.23  2.13   1.58  3.53   1.84   
## 2  3.78  -1.08 -2.69   0.609    0.201  -2.67  2.68   0.320 1.63   0.900  
## 3  6.54  -2.31 -1.10  -0.380    0.862  -2.38  1.23   1.29  4.22  -0.266  
## 4  0.671 -1.19 -1.39   0.997   -2.89   -2.55  1.10   1.99  1.40  -0.00260
## 5  6.59  -2.08 -1.20   1.47    -0.0867 -1.07 -0.257  0.115 3.53   0.765  
## 6 -4.46  -1.37 -1.33   1.09    -2.26   -2.08  0.727 -0.581 0.895  0.227  
## # ... with 11 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## #   V20 <dbl>

(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

Answer: Following is the subset selection, then the plot

mod_form_10c <- as.formula(y_10a ~ .)

bestsel_10c <- regsubsets(mod_form_10c, data_split_10b_train, nvmax = 20)

tbestsel_10c <- tidy.regsubsets(bestsel_10c) %>%
  mutate(var.number = 1:n()) %>%
  select(var.number, everything())

tbestsel_10c
## # A tibble: 20 x 27
##    var.number `(Intercept)` V1    V2    V3    V4    V5    V6    V7    V8   
##         <int> <lgl>         <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1          1 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  2          2 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
##  3          3 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE TRUE  TRUE 
##  4          4 TRUE          FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  TRUE 
##  5          5 TRUE          FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  TRUE 
##  6          6 TRUE          FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  TRUE 
##  7          7 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
##  8          8 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
##  9          9 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 10         10 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 11         11 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 12         12 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 13         13 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 14         14 TRUE          TRUE  FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 15         15 TRUE          TRUE  TRUE  TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 16         16 TRUE          TRUE  TRUE  TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 17         17 TRUE          TRUE  TRUE  TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## 18         18 TRUE          TRUE  TRUE  TRUE  TRUE  FALSE TRUE  TRUE  TRUE 
## 19         19 TRUE          TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
## 20         20 TRUE          TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
## # ... with 17 more variables: V9 <lgl>, V10 <lgl>, V11 <lgl>, V12 <lgl>,
## #   V13 <lgl>, V14 <lgl>, V15 <lgl>, V16 <lgl>, V17 <lgl>, V18 <lgl>,
## #   V19 <lgl>, V20 <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>
# prediction method for regsubsets: rebuild the model matrix for the new
# data and multiply by the coefficients of the id-variable model
predict.regsubsets <- function(id, object, formula, newdata, ...){
  form <- as.formula(formula)
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[ , xvars] %*% coefi
}

y_pred_10c <- 
  map_dfc(tbestsel_10c$var.number, 
          predict.regsubsets, 
          object = bestsel_10c, formula = mod_form_10c, newdata = data_split_10b_train)

mse.calculate <- function(y_pred, y_data){
  mean((as.vector(y_pred) - y_data)^2)
}

mse.error <- 
  map_dbl(y_pred_10c, 
          mse.calculate, 
          y_data = data_split_10b_train %>% select(y_10a) %>% pull())

tbestsel_10c_1 <-
bind_cols(
  tbestsel_10c,
  mse = mse.error
)


ggplot(data = tbestsel_10c_1) +
  geom_line(aes(x = var.number, y = mse, group = 1)) +
  geom_point(aes(x = var.number, y = mse), size = 3) +
  xlab("var.number")

(d) Plot the test set MSE associated with the best model of each size.

Answer: Following is the test MSE

y_pred_10d <- 
  map_dfc(tbestsel_10c$var.number, 
          predict.regsubsets, 
          object = bestsel_10c, formula = mod_form_10c, newdata = data_split_10b_test)

mse.error_10d <- 
  map_dbl(y_pred_10d, 
          mse.calculate, 
          y_data = data_split_10b_test %>% select(y_10a) %>% pull())

tbestsel_10d <-
bind_cols(
  tbestsel_10c,
  mse = mse.error_10d
)


ggplot(data = tbestsel_10d) +
  geom_line(aes(x = var.number, y = mse, group = 1)) +
  geom_point(aes(x = var.number, y = mse), size = 3) +
  xlab("var.number")

(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

Answer: Following is the model with the min test MSE

tbestsel_10d %>%
  arrange(mse) %>%
  head(1)
## # A tibble: 1 x 28
##   var.number `(Intercept)` V1    V2    V3    V4    V5    V6    V7    V8   
##        <int> <lgl>         <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1         13 TRUE          FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE 
## # ... with 18 more variables: V9 <lgl>, V10 <lgl>, V11 <lgl>, V12 <lgl>,
## #   V13 <lgl>, V14 <lgl>, V15 <lgl>, V16 <lgl>, V17 <lgl>, V18 <lgl>,
## #   V19 <lgl>, V20 <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## #   mallows_cp <dbl>, rss <dbl>, mse <dbl>

(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

Answer: Following are the coefficients, then the comparison with the true values

coef_10a <- 
  coef(bestsel_10c, id = 13) %>% 
  enframe() %>%
  rename(estimates_13 = value)

coef_10a
## # A tibble: 14 x 2
##    name        estimates_13
##    <chr>              <dbl>
##  1 (Intercept)        0.921
##  2 V3                 1.38 
##  3 V4                 0.783
##  4 V7                 1.59 
##  5 V8                 1.36 
##  6 V9                 0.407
##  7 V11               -1.15 
##  8 V12                0.407
##  9 V13               -0.863
## 10 V14               -0.296
## 11 V16               -1.38 
## 12 V17                0.231
## 13 V18                0.489
## 14 V20               -0.296
betas_10f <-
  betas_10a %>% 
  enframe %>% 
  mutate(name = str_c("V", name)) %>% 
  rename(!!str_c("betas_", nrow(.)) := value) 

betas_10f %>%
  left_join(coef_10a, by = "name") %>%
  mutate(estimates_13 = if_else(is.na(estimates_13), 0, estimates_13))
## # A tibble: 20 x 3
##    name  betas_20 estimates_13
##    <chr>    <dbl>        <dbl>
##  1 V1      0             0    
##  2 V2     -0.163         0    
##  3 V3      1.40          1.38 
##  4 V4      0.898         0.783
##  5 V5      0             0    
##  6 V6      0.229         0    
##  7 V7      1.65          1.59 
##  8 V8      1.42          1.36 
##  9 V9      0.420         0.407
## 10 V10     0             0    
## 11 V11    -1.20         -1.15 
## 12 V12     0.300         0.407
## 13 V13    -0.954        -0.863
## 14 V14    -0.458        -0.296
## 15 V15     0             0    
## 16 V16    -1.14         -1.38 
## 17 V17     0.267         0.231
## 18 V18     0.428         0.489
## 19 V19     0.0549        0    
## 20 V20     0            -0.296

The selected 13-variable model correctly zeroes the true-zero coefficients \(\beta_1, \beta_5, \beta_{10}, \beta_{15}\), but it also drops the small true coefficients \(\beta_2, \beta_6, \beta_{19}\) and picks up a spurious \(\beta_{20}\) term; the remaining estimates are close to the true values.

(g) Create a plot displaying \(\sqrt{\sum_{j=1}^p(\beta_j - \hat{\beta}_j^r)^2}\) for a range of values of \(r\), where \(\hat\beta_j^r\) is the \(j^{th}\) coefficient estimate for the best model containing \(r\) coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

Answer: First we wrangle the data to align all true betas and estimates, then we compute the distances, and finally we plot them

get_coef_10g <- function(idx, obj){

  coef(object = obj, id = idx) %>%
    enframe %>%
    right_join(betas_10f %>% select(name), by = "name") %>%
    mutate(!!str_c("estimates_", idx) := if_else(is.na(value), 0, value)) %>%
    select(-name, -value)
}

estimates_10g <- 
  tbestsel_10c$var.number %>%
  map_dfc(get_coef_10g, bestsel_10c)

betas_estimates_10g <-
  estimates_10g %>%
  bind_cols(betas_10f) %>%
  select(name, betas_20, everything())

betas_estimates_10g
## # A tibble: 20 x 22
##    name  betas_20 estimates_1 estimates_2 estimates_3 estimates_4
##    <chr>    <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 V1      0             0           0           0           0   
##  2 V2     -0.163         0           0           0           0   
##  3 V3      1.40          0           0           0           1.26
##  4 V4      0.898         0           0           0           0   
##  5 V5      0             0           0           0           0   
##  6 V6      0.229         0           0           0           0   
##  7 V7      1.65          0           0           1.52        1.63
##  8 V8      1.42          0           1.25        1.52        1.53
##  9 V9      0.420         0           0           0           0   
## 10 V10     0             0           0           0           0   
## 11 V11    -1.20          0           0           0           0   
## 12 V12     0.300         0           0           0           0   
## 13 V13    -0.954         0           0           0           0   
## 14 V14    -0.458         0           0           0           0   
## 15 V15     0             0           0           0           0   
## 16 V16    -1.14         -1.71       -1.68       -1.56       -1.38
## 17 V17     0.267         0           0           0           0   
## 18 V18     0.428         0           0           0           0   
## 19 V19     0.0549        0           0           0           0   
## 20 V20     0             0           0           0           0   
## # ... with 16 more variables: estimates_5 <dbl>, estimates_6 <dbl>,
## #   estimates_7 <dbl>, estimates_8 <dbl>, estimates_9 <dbl>,
## #   estimates_10 <dbl>, estimates_11 <dbl>, estimates_12 <dbl>,
## #   estimates_13 <dbl>, estimates_14 <dbl>, estimates_15 <dbl>,
## #   estimates_16 <dbl>, estimates_17 <dbl>, estimates_18 <dbl>,
## #   estimates_19 <dbl>, estimates_20 <dbl>
# root of the sum of squared deviations between true and estimated coefficients
sqdiff <- function(est, beta){
  sqrt(sum((beta - est)^2))
}

map_dfc(betas_estimates_10g %>% select(-name, -betas_20),
        sqdiff,
        betas_estimates_10g %>% select(betas_20)
) %>%
  gather("estimates_1":"estimates_20", key = "coef", value = "diff") %>%
  mutate(coef = as_factor(coef)) %>%
  ggplot() +
  geom_point(aes(x = coef, y = diff), size = 3) +
  geom_line(aes(x = coef, y = diff, group = 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The two plots look very similar and tell the same story as the test MSE in (d)

#clean up and load again kr funcs
#
rm(list = ls())
source("kRfunlib.R")

Ex.11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Answer: Following are the methods

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(123)
Boston_split <- initial_split(Boston)
Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)
mod_form <- 
  as.formula(crim ~ .)

bestsel <- 
  regsubsets(mod_form, Boston_train, nvmax = 20)

tbestsel <- 
  tidy.regsubsets(bestsel) %>%
  mutate(model.number = 1:n()) %>%
  select(model.number, everything())

tbestsel
## # A tibble: 13 x 20
##    model.number `(Intercept)` zn    indus chas  nox   rm    age   dis  
##           <int> <lgl>         <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1            1 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  2            2 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  3            3 TRUE          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  4            4 TRUE          TRUE  FALSE FALSE FALSE FALSE FALSE TRUE 
##  5            5 TRUE          TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6            6 TRUE          TRUE  FALSE FALSE TRUE  FALSE FALSE TRUE 
##  7            7 TRUE          TRUE  FALSE FALSE TRUE  TRUE  FALSE TRUE 
##  8            8 TRUE          TRUE  FALSE FALSE TRUE  TRUE  FALSE TRUE 
##  9            9 TRUE          TRUE  FALSE FALSE TRUE  TRUE  FALSE TRUE 
## 10           10 TRUE          TRUE  TRUE  FALSE TRUE  TRUE  FALSE TRUE 
## 11           11 TRUE          TRUE  TRUE  TRUE  TRUE  TRUE  FALSE TRUE 
## 12           12 TRUE          TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
## 13           13 TRUE          TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
## # ... with 11 more variables: rad <lgl>, tax <lgl>, ptratio <lgl>,
## #   black <lgl>, lstat <lgl>, medv <lgl>, r.squared <dbl>,
## #   adj.r.squared <dbl>, BIC <dbl>, mallows_cp <dbl>, rss <dbl>
graph.bestsubset(tbestsel)

Model #7 exhibits the best overall characteristics: a high adjusted \(R^2\), low \(RSS\) and \(C_p\), and a reasonable \(BIC\)

set.seed(123)
# Find the best lambda for the lasso using cross-validation
x <- model.matrix(mod_form, Boston_train)[,-1]
y <- Boston_train$crim

cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
# cv$lambda.min

# Fit the final lasso model on the training data
lasso_fit <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)

# Display regression coefficients
coef(lasso_fit) #%>% tidy()
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) -1.02819343
## zn           .         
## indus        .         
## chas         .         
## nox          .         
## rm           .         
## age          .         
## dis         -0.05570023
## rad          0.47363519
## tax          .         
## ptratio      .         
## black        .         
## lstat        0.14129766
## medv        -0.05978400
set.seed(123)
cv_out <- cv.glmnet(x, y, alpha = 0)
ridge_best_lam <- cv_out$lambda.min

ridge_fit <- glmnet(x, y, alpha = 0, lambda = ridge_best_lam)

coef(ridge_fit)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  3.449868943
## zn           0.036856637
## indus       -0.073853615
## chas        -0.841350172
## nox         -5.199251845
## rm           0.968489639
## age         -0.002793221
## dis         -0.794183807
## rad          0.423181680
## tax          0.004221480
## ptratio     -0.134793539
## black       -0.001688167
## lstat        0.173746242
## medv        -0.183980701
set.seed(123)
pcr_fit <- pcr(mod_form, data = Boston_train, scale=TRUE, validation ="CV")
summary(pcr_fit)
## Data:    X dimension: 380 13 
##  Y dimension: 380 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            9.07    7.599    7.611    7.252    7.279    7.276    7.254
## adjCV         9.07    7.597    7.609    7.246    7.272    7.270    7.246
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.247    7.114    7.147     7.122     7.126     7.063     7.001
## adjCV    7.239    7.104    7.136     7.107     7.112     7.049     6.987
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       47.66    60.73    69.94    76.85    83.23    88.24    91.36
## crim    30.21    30.24    37.39    37.41    37.44    38.43    38.83
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       93.58    95.50     97.09     98.58     99.57    100.00
## crim    41.16    41.18     41.81     42.27     43.40     44.73
validationplot(pcr_fit, val.type="MSEP")

PCR with \(M = 8\) components offers the best trade-off: beyond that point, the CV error improves only marginally.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

Answer: We compare the test MSE of each candidate model.

mse.bestsubset(tbestsel, mod_fit = bestsel, mod_form, data_in = Boston_test, "crim")$graph

mse.bestsubset(tbestsel, bestsel, mod_form, Boston_test, "crim")$tbl %>%
  select(model.number, mse) %>%
  filter(model.number == 6) %>%
  pull(mse)
## [1] 26.84306

Model #6 exhibits the lowest test error among the best subset models

x_test <- model.matrix(mod_form, Boston_test)[,-1]
y_test <- Boston_test$crim

lasso_pred <- predict(lasso_fit, s = cv$lambda.min, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)

lasso_mse
## [1] 26.38916
ridge_pred <- predict(ridge_fit, s = ridge_best_lam, newx = x_test)
ridge_mse <- mean(( ridge_pred - y_test)^2)

ridge_mse
## [1] 26.4498
pcr_pred <- predict(pcr_fit, Boston_test, ncomp = 8)
pcr_mse <- mean((pcr_pred - Boston_test$crim)^2)

pcr_mse
## [1] 26.23383

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Answer: The chosen model is PCR with \(ncomp = 8\), which exhibits the lowest test MSE. Note that it does involve all of the features: although only 8 components are used, each component is a linear combination of all 13 predictors, so the dimension is reduced without dropping any variable. By contrast, the lasso fit involves only a handful of the original features.

#clean up and load again kr funcs
#
rm(list = ls())
source("kRfunlib.R")