We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
Answer: Best subset selection. For a given k it evaluates all \(\binom{p}{k}\) models with k predictors and keeps the one with the smallest training RSS, whereas forward and backward stepwise search only a restricted sequence of models.
(b) Which of the three models with k predictors has the smallest test RSS?
Answer: We cannot say for certain. Best subset searches the largest model space, so it finds the best training fit, but that same exhaustive search also makes it the most prone to overfitting; any of the three approaches could end up with the smallest test RSS.
(c) True or False:
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
Answer: True. Forward stepwise obtains the (k+1)-variable model by adding one predictor to its k-variable model, so the smaller model is always contained in the larger one.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.
Answer: True. Backward stepwise obtains the k-variable model by removing one predictor from its (k+1)-variable model, so the containment holds by construction.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by forward stepwise selection.
Answer: False. Forward and backward stepwise trace different search paths, so there is no guarantee that the backward k-variable model is contained in the forward (k+1)-variable model.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
Answer: False, for the same reason: the two search paths need not be related.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
Answer: False. Best subset performs a fresh exhaustive search at each model size, so the best k-variable model need not be nested in the best (k+1)-variable model.
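The nesting claims in i. and ii. are easy to check empirically. Below is a small sketch using leaps::regsubsets() on the built-in mtcars data (my own illustration, not part of the exercise): the forward-stepwise membership matrix is nested row by row by construction, while the best subset matrix need not be.
library(leaps)
fwd  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
w_fwd  <- summary(fwd)$which   # logical matrix: one row per model size
w_best <- summary(best)$which
# forward stepwise: every k-variable model is contained in the (k+1)-variable model
all(sapply(1:9, function(k) all(w_fwd[k, ] <= w_fwd[k + 1, ])))  # TRUE by construction
# best subset: the analogous check can fail for some k
sapply(1:9, function(k) all(w_best[k, ] <= w_best[k + 1, ]))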
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Answer: Incorrect. The lasso is less flexible than least squares, not more: the \(\lambda\) penalty constrains the coefficient estimates, which helps prevent overfitting.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: Incorrect, for the same reason as i.: the lasso is less flexible, not more.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Answer: Correct. The lasso is less flexible, and it improves prediction accuracy whenever the increase in bias it introduces is smaller than the decrease in variance it achieves.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: Incorrect; this states the opposite of the trade-off in iii.
(b) Repeat (a) for ridge regression relative to least squares.
Answer: iii., as in (a). Ridge regression is also less flexible than least squares and improves prediction accuracy when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
Answer: Non-linear methods are more flexible than least squares and will give improved prediction accuracy when their increase in variance is less than their decrease in bias; hence ii. is correct.
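As a quick empirical illustration of why iii. is the right answer in (a), here is a sketch on simulated sparse data (the sizes n = 60, p = 40 and the coefficients are arbitrary choices of mine): with many true zero coefficients, the lasso's variance reduction usually outweighs its added bias, so it beats least squares in test MSE.
library(glmnet)
set.seed(42)
n <- 60; p <- 40
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))                  # only 5 real signals
y <- as.vector(X %*% beta + rnorm(n))
X_test <- matrix(rnorm(n * p), n, p)
y_test <- as.vector(X_test %*% beta + rnorm(n))
ols <- lm(y ~ X)
mean((y_test - cbind(1, X_test) %*% coef(ols))^2)    # least squares test MSE
cvfit <- cv.glmnet(X, y, alpha = 1)
mean((y_test - predict(cvfit, X_test, s = "lambda.min"))^2)  # lasso test MSE, typically smaller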
Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p\lvert\beta_j\rvert \le s\]
for a particular value of \(s\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
(a) As we increase \(s\) from 0, the training RSS will:
Answer: iv. is correct: training RSS will steadily decrease. At \(s = 0\) all coefficients are forced to zero; as \(s\) increases the constraint relaxes, the fit approaches the least squares fit, and the training RSS can only go down.
(b) Repeat (a) for test RSS.
Answer: ii. is correct: test RSS will decrease initially, while the added flexibility reduces bias faster than it adds variance, and then eventually start increasing once the model begins to overfit, producing a U shape.
(c) Repeat (a) for variance.
Answer: iii. is correct: variance will steadily increase, since a larger \(s\) allows a more flexible model.
(d) Repeat (a) for (squared) bias.
Answer: iv. is correct: squared bias will steadily decrease, since the relaxed constraint lets the model track the true relationship more closely.
(e) Repeat (a) for the irreducible error.
Answer: v. is correct: the irreducible error remains constant, since it depends on neither the model nor \(s\).
Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\Big)^2 + \lambda\sum_{j=1}^p\beta_j^2\]
for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
(a) As we increase \(\lambda\) from 0, the training RSS will:
Answer: iii. is correct: training RSS will steadily increase. As \(\lambda\) grows, the penalty shrinks the \(\beta\) estimates further from their least squares values, so the training fit steadily worsens.
(b) Repeat (a) for test RSS.
Answer: ii. is correct: test RSS will decrease initially, then eventually start increasing, in a U shape. The initial decrease occurs because the growing penalty prevents overfitting; the eventual increase occurs because a very large \(\lambda\) shrinks the coefficients so much that the model deviates substantially from the truth.
(c) Repeat (a) for variance.
Answer: iv. is correct: variance will steadily decrease. Increasing \(\lambda\) makes the model less and less flexible, hence the variance decreases.
(d) Repeat (a) for (squared) bias.
Answer: iii. is correct: squared bias will steadily increase. With the decrease in flexibility comes an increase in bias.
(e) Repeat (a) for the irreducible error.
Answer: v. is correct: the irreducible error remains constant, since by definition it does not depend on the model or on \(\lambda\).
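A small glmnet sketch (simulated data of my own choosing) makes the training RSS behaviour in (a) concrete; note that a large \(\lambda\) here corresponds to a small budget \(s\) in the previous exercise, which is why the two sets of answers mirror each other.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- as.vector(x %*% rnorm(5) + rnorm(100))
lambdas <- c(100, 10, 1, 0.1, 0)   # glmnet prefers a decreasing sequence
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)
# training RSS for each lambda: non-decreasing in lambda
rss <- colSums((y - predict(fit, newx = x))^2)
setNames(rss, lambdas)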
It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.
Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1+y_2 = 0\) and \(x_{11}+x_{21} = 0\) and \(x_{12}+x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat\beta_0 = 0\).
(a) Write out the ridge regression optimization problem in this setting.
Answer: With \(\hat\beta_0 = 0\), we have to minimize
\[(y_1 - \beta_1x_{11} - \beta_2x_{12})^2 + (y_2 - \beta_1x_{21} - \beta_2x_{22})^2 + \lambda(\beta_1^2 + \beta_2^2).\]
(b) Argue that in this setting, the ridge coefficient estimates satisfy \(\hat\beta_1 = \hat\beta_2\)
Answer: Set the partial derivatives with respect to \(\beta_1\) and \(\beta_2\) to zero. Because \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), the two equations are identical except for the penalty terms, so subtracting one from the other leaves \(2\lambda(\beta_1 - \beta_2) = 0\). For \(\lambda > 0\) this forces \(\hat\beta_1 = \hat\beta_2\).
(c) Write out the lasso optimization problem in this setting.
(d) Argue that in this setting, the lasso coefficients \(\hat\beta_1\) and \(\hat\beta_2\) are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
Answer (c)&(d): The lasso problem is to minimize
\[(y_1 - \beta_1x_{11} - \beta_2x_{12})^2 + (y_2 - \beta_1x_{21} - \beta_2x_{22})^2 + \lambda(\lvert\beta_1\rvert + \lvert\beta_2\rvert).\]
Since \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), the squared-error terms depend on \(\beta_1\) and \(\beta_2\) only through the sum \(\beta_1 + \beta_2\), and when the two coefficients share the same sign the penalty also depends only on that sum. The solutions therefore form a whole family: any pair with \(\beta_1 + \beta_2 = s^*\) equal to the optimal sum, with both coefficients of the same sign as \(s^*\), attains the same minimum, so the lasso coefficients are not unique.
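A quick numeric check of the non-uniqueness, on a toy instance I chose to satisfy the conditions of the exercise (\(x_{11} = x_{12} = 1\), \(x_{21} = x_{22} = -1\), \(y_1 = 2\), \(y_2 = -2\), \(\lambda = 1\)): every non-negative pair with the same sum attains the same objective value.
# lasso objective for the toy instance described above
lasso_obj <- function(b, lambda = 1) {
  (2 - b[1] - b[2])^2 + (-2 + b[1] + b[2])^2 + lambda * sum(abs(b))
}
lasso_obj(c(1.5, 0.0))   # b1 + b2 = 1.5
lasso_obj(c(0.5, 1.0))   # same sum, same objective value
lasso_obj(c(0.75, 0.75)) # ditto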
We will now explore (6.12) and (6.13) further.
(a) Consider (6.12) with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.12) as a function of \(\beta_1\). Your plot should confirm that (6.12) is solved by (6.14).
Answer: Following is the plot and the confirmation:
library(tidyverse)  # map_dbl() and ggplot() are used throughout
beta1 <- seq(from = 1, to = 5, by = 0.1)
y <- 3
lambda <- 0.5
# ridge criterion (6.12) with p = 1: (y - beta1)^2 + lambda * beta1^2
fbeta1_reg_6a <- function(beta1, y, lambda){
  (y - beta1)^2 + lambda*beta1^2
}
fbeta1 <- map_dbl(beta1, fbeta1_reg_6a, y, lambda)
beta1formula_6a <- y / (1 + lambda)  # analytic minimizer (6.14)
ggplot() +
geom_line(aes(y = fbeta1, x = beta1), color = "blue") +
geom_hline(aes(yintercept = y), color = "red") +
geom_vline(aes(xintercept = beta1formula_6a), color = "green")
(b) Consider (6.13) with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).
Answer: Following is the plot and the confirmation:
beta1 <- seq(from = 1, to = 5, by = 0.1)
y <- 3
lambda <- 0.5
# lasso criterion (6.13) with p = 1: (y - beta1)^2 + lambda * |beta1|
fbeta1_reg_6b <- function(beta1, y, lambda){
  (y - beta1)^2 + lambda*abs(beta1)
}
fbeta1 <- map_dbl(beta1, fbeta1_reg_6b, y, lambda)
beta1formula_6b <- y - lambda/2  # analytic minimizer (6.15), valid here since y > lambda/2
ggplot() +
geom_line(aes(y = fbeta1, x = beta1), color = "blue") +
geom_hline(aes(yintercept = y), color = "red") +
geom_vline(aes(xintercept = beta1formula_6b), color = "green")
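For completeness, the analytic confirmation behind both plots: differentiating (6.12) and setting the derivative to zero gives \(-2(y_1 - \beta_1) + 2\lambda\beta_1 = 0\), so \(\hat\beta_1 = y_1/(1+\lambda)\), which is (6.14). For (6.13) with \(\beta_1 > 0\) the condition is \(-2(y_1 - \beta_1) + \lambda = 0\), so \(\hat\beta_1 = y_1 - \lambda/2\), matching (6.15) in the case \(y_1 > \lambda/2\) that both plots use.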
We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.
(a) Suppose that \(y_i = \beta_0 + \sum_{j=1}^px_{ij}\beta_j + \epsilon_i\) where \(\epsilon_1, \ldots, \epsilon_n\) are independent and identically distributed from a \(N(0, \sigma^2)\) distribution. Write out the likelihood for the data.
Answer: Following is the likelihood:
\[L(Y \mid X, \beta) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\!\Big(-\frac{(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j)^2}{2\sigma^2}\Big) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^{n} \exp\!\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2\Big).\]
(b) Assume the following prior for \(\beta\): \(\beta_1, \ldots, \beta_p\) are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter b: i.e. \(p(\beta) = \frac{1}{2b}\exp\!\big(-\frac{\lvert\beta\rvert}{b}\big)\). Write out the posterior for \(\beta\) in this setting.
Answer: By Bayes' rule the posterior is proportional to the likelihood times the prior:
\[p(\beta \mid X, Y) \propto L(Y \mid X, \beta)\,p(\beta) = L(Y \mid X, \beta)\prod_{j=1}^p \frac{1}{2b}\exp\!\Big(-\frac{\lvert\beta_j\rvert}{b}\Big).\]
(c) Argue that the lasso estimate is the \(mode\) for \(\beta\) under this posterior distribution.
Answer: The mode maximizes the posterior, or equivalently minimizes its negative log. Taking \(-\log\) of the posterior above and dropping constants leaves
\[\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \frac{1}{b}\sum_{j=1}^p\lvert\beta_j\rvert,\]
which is the lasso criterion with \(\lambda = 2\sigma^2/b\). Hence the lasso estimate is the posterior mode.
(d) Now assume the following prior for \(\beta\): \(\beta_1, \ldots, \beta_p\) are independent and identically distributed according to a normal distribution with mean zero and variance \(c\). Write out the posterior for \(\beta\) in this setting.
Answer: With the normal prior \(p(\beta) = \prod_{j=1}^p \frac{1}{\sqrt{2\pi c}}\exp\!\big(-\frac{\beta_j^2}{2c}\big)\), the posterior becomes
\[p(\beta \mid X, Y) \propto L(Y \mid X, \beta)\prod_{j=1}^p \frac{1}{\sqrt{2\pi c}}\exp\!\Big(-\frac{\beta_j^2}{2c}\Big).\]
(e) Argue that the ridge regression estimate is both the mode and the mean for \(\beta\) under this posterior distribution.
Answer: Taking \(-\log\) and dropping constants as before leaves
\[\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \frac{1}{2c}\sum_{j=1}^p\beta_j^2,\]
the ridge criterion with \(\lambda = \sigma^2/c\), so the ridge estimate is the posterior mode. Moreover, this posterior is proportional to a multivariate Gaussian in \(\beta\), which is symmetric and unimodal, so its mode and mean coincide; the ridge estimate is therefore also the posterior mean.
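A numeric sanity check of (c) in the simplest case, \(p = 1\) with no intercept (a toy setup of my own, with \(\sigma\) and \(b\) chosen arbitrarily): the minimizer of the negative log posterior should match the closed-form one-predictor lasso (soft-thresholding) solution with \(\lambda = 2\sigma^2/b\).
set.seed(1)
n <- 50; sigma <- 1; b <- 0.5
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = sigma)
lambda <- 2 * sigma^2 / b
# negative log posterior (up to constants) under the double-exponential prior
neg_log_post <- function(beta) sum((y - beta * x)^2) / (2 * sigma^2) + abs(beta) / b
optimize(neg_log_post, c(-10, 10))$minimum
# closed-form lasso solution: soft-threshold sum(x*y) at lambda/2
z <- sum(x * y)
sign(z) * max(abs(z) - lambda / 2, 0) / sum(x^2)   # should agree with the mode above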
In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).
Answer: Following are the vectors:
set.seed(123)
x <- rnorm(100)
set.seed(321)
eps <- rnorm(100)
(b) Generate a response vector \(Y\) of length \(n = 100\) according to the model
\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)
where \(\beta_0, \beta_1, \beta_2, \beta_3\) are constants of your choice.
Answer: Following is the response vector:
beta_0 <- 0.5
beta_1 <- 3
beta_2 <- 1.2
beta_3 <- -0.123
y <- beta_0 + beta_1*x + beta_2*x^2 + beta_3*x^3 + eps
tbl_data <- tibble(
x,
y
)
tbl_data
## # A tibble: 100 x 2
## x y
## <dbl> <dbl>
## 1 -0.560 0.922
## 2 -0.230 -0.837
## 3 1.56 7.35
## 4 0.0705 0.598
## 5 0.129 0.784
## 6 1.72 8.82
## 7 0.461 2.85
## 8 -1.27 -0.893
## 9 -0.687 -0.615
## 10 -0.446 -1.14
## # ... with 90 more rows
(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2, \ldots, X^{10}\). What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).
Answer: We first perform best subset selection, then select the best model obtained.
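The tidy.regsubsets() helper used below comes from the author's kRfunlib.R, which is not shown here; a minimal sketch of what such a tidier might look like (an assumption on my part, the real helper may differ) is:
# hypothetical tidier: one row per model size, with membership flags
# and the usual selection statistics
tidy.regsubsets <- function(x, ...) {
  s <- summary(x)
  dplyr::bind_cols(
    tibble::as_tibble(s$which),
    tibble::tibble(r.squared = s$rsq, adj.r.squared = s$adjr2,
                   BIC = s$bic, mallows_cp = s$cp, rss = s$rss)
  )
}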
n <- 10  # polynomial degree (not the sample size)
mod_form <- as.formula(y ~ poly(x, n, raw = T))
bestsel <- regsubsets(mod_form, tbl_data, nvmax = 10)
tbestsel <- tidy.regsubsets(bestsel) %>%
mutate(model.number = 1:n()) %>%
select(model.number, everything())
tbestsel
## # A tibble: 10 x 17
## model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
## <int> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE FALSE
## 2 2 TRUE TRUE TRUE
## 3 3 TRUE TRUE TRUE
## 4 4 TRUE TRUE TRUE
## 5 5 TRUE TRUE TRUE
## 6 6 TRUE TRUE FALSE
## 7 7 TRUE TRUE FALSE
## 8 8 TRUE TRUE FALSE
## 9 9 TRUE TRUE TRUE
## 10 10 TRUE TRUE TRUE
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## # raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## # T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## # T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## # T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbestsel, aes(y = adj.r.squared, x = model.number)) +
geom_point(color = "red", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
gg.cp <- ggplot(data = tbestsel, aes(y = mallows_cp, x = model.number)) +
geom_point(color = "blue", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
gg.bic <- ggplot(data = tbestsel, aes(y = BIC, x = model.number)) +
geom_point(color = "green", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)
Model #2 exhibits the best characteristics across adjusted \(R^2\), \(C_p\), and \(BIC\).
coef(bestsel, 2) %>% broom::tidy()
## # A tibble: 3 x 2
## names x
## <chr> <dbl>
## 1 (Intercept) 0.537
## 2 poly(x, n, raw = T)1 2.60
## 3 poly(x, n, raw = T)2 1.17
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
Answer: First we perform forward stepwise selection, then backward.
forstepsel <-
regsubsets(mod_form, tbl_data, nvmax = 10, method = "forward")
tforstepsel <-
forstepsel %>%
tidy %>%
mutate(model.number = 1:n()) %>%
select(model.number, everything())
tforstepsel
## # A tibble: 10 x 17
## model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
## <int> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE FALSE
## 2 2 TRUE TRUE TRUE
## 3 3 TRUE TRUE TRUE
## 4 4 TRUE TRUE TRUE
## 5 5 TRUE TRUE TRUE
## 6 6 TRUE TRUE TRUE
## 7 7 TRUE TRUE TRUE
## 8 8 TRUE TRUE TRUE
## 9 9 TRUE TRUE TRUE
## 10 10 TRUE TRUE TRUE
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## # raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## # T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## # T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## # T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tforstepsel, aes(y = adj.r.squared, x = model.number)) +
geom_point(color = "red", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
xlab("model number")
gg.cp <- ggplot(data = tforstepsel, aes(y = mallows_cp, x = model.number)) +
geom_point(color = "blue", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
xlab("model number")
gg.bic <- ggplot(data = tforstepsel, aes(y = BIC, x = model.number)) +
geom_point(color = "green", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tforstepsel))) +
xlab("model number")
grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)
Model #2 again exhibits the best characteristics, the same choice as with the best subset method.
coef(forstepsel, 2) %>% broom::tidy()
## # A tibble: 3 x 2
## names x
## <chr> <dbl>
## 1 (Intercept) 0.537
## 2 poly(x, n, raw = T)1 2.60
## 3 poly(x, n, raw = T)2 1.17
backstepsel <-
regsubsets(mod_form, tbl_data, nvmax = 10, method = "backward")
tbackstepsel <-
backstepsel %>%
tidy %>%
mutate(model.number = 1:n()) %>%
select(model.number, everything())
tbackstepsel
## # A tibble: 10 x 17
## model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
## <int> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE FALSE
## 2 2 TRUE TRUE FALSE
## 3 3 TRUE TRUE FALSE
## 4 4 TRUE TRUE FALSE
## 5 5 TRUE TRUE FALSE
## 6 6 TRUE TRUE FALSE
## 7 7 TRUE TRUE FALSE
## 8 8 TRUE TRUE FALSE
## 9 9 TRUE TRUE TRUE
## 10 10 TRUE TRUE TRUE
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## # raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## # T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## # T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## # T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbackstepsel, aes(y = adj.r.squared, x = model.number)) +
geom_point(color = "red", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
xlab("model number")
gg.cp <- ggplot(data = tbackstepsel, aes(y = mallows_cp, x = model.number)) +
geom_point(color = "blue", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
xlab("model number")
gg.bic <- ggplot(data = tbackstepsel, aes(y = BIC, x = model.number)) +
geom_point(color = "green", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbackstepsel))) +
xlab("model number")
grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)
Model #3 exhibits the best characteristics with this method, one term more than best subset and forward stepwise selected.
coef(backstepsel, 3) %>% broom::tidy()
(e) Now fit a lasso model to the simulated data, again using \(X, X^2, \ldots, X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.
Answer: First we fit a lasso model and select the optimal \(\lambda\), then we create plots, and last we report and discuss.
# Find the best lambda using cross-validation (alpha = 1 for the lasso)
x <- model.matrix(mod_form, tbl_data)[,-1]
y <- tbl_data$y
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 0.2866097
plot(cv)
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model) %>% tidy
## row column value
## 1 (Intercept) s0 0.7376612
## 2 poly(x, n, raw = T)1 s0 2.3338827
## 3 poly(x, n, raw = T)2 s0 0.9580560
So the lasso keeps non-zero coefficients only for \(X\) and \(X^2\); the small true cubic term is shrunk to zero, and the fit matches the dominant quadratic structure of the data.
(f) Now generate a response vector \(Y\) according to the model
\(Y = \beta_0 + \beta_7X^7 + \epsilon\)
and perform best subset selection and the lasso. Discuss the results obtained.
Answer: First we generate the data, then we perform best subset selection, after that we fit the lasso, and lastly we discuss the results obtained.
set.seed(123)
x <- rnorm(100)
set.seed(321)
eps <- rnorm(100)
beta_0 <- 0.123
beta_7 <- 1.4
y <- beta_0 + beta_7*x^7 + eps
tbl_data <- tibble(
x,
y
)
tbl_data
## # A tibble: 100 x 2
## x y
## <dbl> <dbl>
## 1 -0.560 1.80
## 2 -0.230 -0.589
## 3 1.56 31.1
## 4 0.0705 0.00335
## 5 0.129 -0.000960
## 6 1.72 61.5
## 7 0.461 0.856
## 8 -1.27 -6.90
## 9 -0.687 0.361
## 10 -0.446 -0.434
## # ... with 90 more rows
n <- 10
mod_form <-
as.formula(y ~ poly(x, n, raw = T))
bestsel <-
regsubsets(mod_form, tbl_data)
tbestsel <-
tidy.regsubsets(bestsel) %>%
mutate(model.number = 1:n()) %>%
select(model.number, everything())
tbestsel
## # A tibble: 8 x 17
## model.number `(Intercept)` `poly(x, n, raw~ `poly(x, n, raw~
## <int> <lgl> <lgl> <lgl>
## 1 1 TRUE FALSE FALSE
## 2 2 TRUE FALSE FALSE
## 3 3 TRUE TRUE FALSE
## 4 4 TRUE TRUE FALSE
## 5 5 TRUE TRUE FALSE
## 6 6 TRUE FALSE TRUE
## 7 7 TRUE TRUE TRUE
## 8 8 TRUE FALSE TRUE
## # ... with 13 more variables: `poly(x, n, raw = T)3` <lgl>, `poly(x, n,
## # raw = T)4` <lgl>, `poly(x, n, raw = T)5` <lgl>, `poly(x, n, raw =
## # T)6` <lgl>, `poly(x, n, raw = T)7` <lgl>, `poly(x, n, raw =
## # T)8` <lgl>, `poly(x, n, raw = T)9` <lgl>, `poly(x, n, raw =
## # T)10` <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>
gg.adj.r <- ggplot(data = tbestsel, aes(y = adj.r.squared, x = model.number)) +
geom_point(color = "red", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
gg.cp <- ggplot(data = tbestsel, aes(y = mallows_cp, x = model.number)) +
geom_point(color = "blue", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
gg.bic <- ggplot(data = tbestsel, aes(y = BIC, x = model.number)) +
geom_point(color = "green", size = 3) +
geom_line() +
scale_x_discrete(limits = c(1:nrow(tbestsel))) +
xlab("model number")
grid.arrange(gg.adj.r, gg.cp, gg.bic, ncol = 1)
We observe that model #2 exhibits the highest adjusted \(R^2\) and the lowest \(C_p\), though its \(BIC\) is not quite the minimum.
coef(bestsel, 2) %>% tidy()
## # A tibble: 3 x 2
## names x
## <chr> <dbl>
## 1 (Intercept) 0.151
## 2 poly(x, n, raw = T)3 -0.129
## 3 poly(x, n, raw = T)7 1.41
# Find the best lambda using cross-validation (alpha = 1 for the lasso)
x <- model.matrix(mod_form, tbl_data)[,-1]
y <- tbl_data$y
cv <- cv.glmnet(x, y, alpha = 1)
# Best lambda value (display suppressed)
#cv$lambda.min
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model) %>% tidy()
## row column value
## 1 (Intercept) s0 0.5466296
## 2 poly(x, n, raw = T)7 s0 1.2464720
The lasso performs well here: it zeroes out every term except \(X^7\), though its coefficient (about 1.25) is shrunk below the true \(\beta_7 = 1.4\) and the intercept is biased upward. Best subset estimated \(X^7\) almost exactly (coefficient 1.41) but admitted a small spurious \(X^3\) term, so the lasso gives the cleaner recovery of the true sparse model.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
Answer: Following is the split
set.seed(123)
College_split <- initial_split(College)
College_train <- training(College_split)
College_test <- testing(College_split)
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
Answer: Following is the linear fit and test error
linear_fit <- lm(Apps ~ ., data = College_train)
College_augm <- broom::augment(linear_fit, newdata = College_test)
mse <- mean((College_augm$Apps - College_augm$.fitted)^2)
tbl_ex9 <- tibble(
model = as_factor("linear"),
mse = mse
)
format(mse, big.mark = " ")
## [1] "748 605.3"
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
Answer: Following is the ridge regression fit and the test error
set.seed(1)
mod_form_9c <-
as.formula(Apps ~ .)
x_train <- model.matrix(mod_form_9c, College_train)[,-1]
x_test <- model.matrix(mod_form_9c, College_test)[,-1]
y_train <- College_train$Apps
y_test <- College_test$Apps
cv.out_9c <- cv.glmnet(x_train, y_train, alpha = 0)
bestlam_9c <- cv.out_9c$lambda.min
ridge_fit_9c <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam_9c)
# evaluate on the held-out test set, as the exercise asks
ridge_pred <- predict(ridge_fit_9c, s = bestlam_9c, newx = x_test)
mse_9c <- mean((ridge_pred - y_test)^2)
tbl_ex9 <-
tbl_ex9 %>% add_row(
model = as_factor("ridge"),
mse = mse_9c
)
format(mse_9c, big.mark = " ")
## [1] "1 543 276"
(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
Answer: Following is the lasso fit, the test error, and the number of non-zero coefficients
cv.out_9d <- cv.glmnet(x_train, y_train, alpha = 1)
bestlam_9d <- cv.out_9d$lambda.min
lasso_fit_9d <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam_9d)
# evaluate on the held-out test set
lasso_pred <- predict(lasso_fit_9d, s = bestlam_9d, newx = x_test)
mse_9d <- mean((lasso_pred - y_test)^2)
tbl_ex9 <-
tbl_ex9 %>% add_row(
model = as_factor("lasso"),
mse = mse_9d
)
format(mse_9d, big.mark = " ")
## [1] "1 171 588"
coef(lasso_fit_9d)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -692.40339314
## PrivateYes -441.46287133
## Accept 1.56817007
## Enroll -0.62688724
## Top10perc 50.49907167
## Top25perc -13.31351997
## F.Undergrad 0.02073397
## P.Undergrad 0.04127057
## Outstate -0.09524040
## Room.Board 0.16146686
## Books 0.15593419
## Personal 0.02170125
## PhD -8.81033807
## Terminal -1.90393631
## S.F.Ratio 13.30074873
## perc.alumni 1.63783734
## Expend 0.07832900
## Grad.Rate 8.86207078
All 17 predictors retain non-zero coefficients: at this \(\lambda\) the lasso penalty is too small to set any estimate exactly to zero.
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
Answer: Following is the PCR fit, then the value of M, and finally the test error
set.seed(2)
pcr_fit <- pcr(mod_form_9c, data = College_train, scale=TRUE, validation ="CV")
summary(pcr_fit)
## Data: X dimension: 583 17
## Y dimension: 583 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4126 4061 2186 2178 1985 1778 1750
## adjCV 4126 4061 2182 2174 1990 1761 1745
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1706 1689 1649 1645 1651 1654 1662
## adjCV 1696 1680 1644 1640 1645 1648 1657
## 14 comps 15 comps 16 comps 17 comps
## CV 1666 1591 1239 1197
## adjCV 1661 1561 1230 1190
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 30.457 56.17 63.48 69.31 74.75 79.79 83.49
## Apps 3.673 73.18 73.53 78.61 83.32 83.48 84.50
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.06 90.23 92.79 94.91 96.71 97.83 98.71
## Apps 84.88 85.52 85.76 85.76 85.76 85.79 85.81
## 15 comps 16 comps 17 comps
## X 99.35 99.83 100.0
## Apps 91.04 92.75 93.1
validationplot(pcr_fit, val.type="MSEP")
From \(M = 5\) onward the cross-validation error improves only slightly until nearly all components are included, so we settle on the parsimonious choice \(M = 6\).
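As a programmatic cross-check of the visual choice, the cross-validated RMSEP can be read off the fitted object directly (a sketch using the pls package's RMSEP() accessor; the first entry of the returned array is the intercept-only model, hence the subtraction of one, and the same check applies to the PLS fit below):
cv_rmsep <- RMSEP(pcr_fit)$val["CV", 1, ]
which.min(cv_rmsep) - 1   # component count minimizing CV error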
pcr_pred <- predict(pcr_fit, College_test, ncomp = 6)
mse_9e <- mean((pcr_pred - y_test)^2)
tbl_ex9 <-
tbl_ex9 %>% add_row(
model = as_factor("pcr"),
mse = mse_9e
)
format(mse_9e, big.mark = " ")
## [1] "1 594 157"
(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
Answer: Following is the PLS fit, then the value of M, and finally the test error
set.seed(2)
plsr_fit <- plsr(mod_form_9c, data = College_train, scale=TRUE, validation ="CV")
summary(plsr_fit)
## Data: X dimension: 583 17
## Y dimension: 583 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4126 2001 1686 1569 1514 1364 1236
## adjCV 4126 1995 1682 1562 1496 1346 1226
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1219 1211 1203 1197 1200 1197 1197
## adjCV 1211 1204 1196 1190 1193 1190 1190
## 14 comps 15 comps 16 comps 17 comps
## CV 1197 1198 1197 1197
## adjCV 1190 1190 1190 1190
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 25.89 39.42 61.49 64.66 68.69 73.13 77.43
## Apps 78.03 85.26 87.62 90.60 92.35 92.88 92.94
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.63 82.39 85.55 88.45 90.74 92.17 94.87
## Apps 92.99 93.06 93.07 93.09 93.10 93.10 93.10
## 15 comps 16 comps 17 comps
## X 96.97 98.26 100.0
## Apps 93.10 93.10 93.1
validationplot(plsr_fit, val.type="MSEP")
From \(M = 6\) onward the cross-validation error is essentially flat, so we again choose \(M = 6\).
plsr_pred <- predict(plsr_fit, College_test, ncomp = 6)
mse_9f <- mean((plsr_pred - y_test)^2)
tbl_ex9 <-
tbl_ex9 %>% add_row(
model = as_factor("pls"),
mse = mse_9f
)
format(mse_9f, big.mark = " ")
## [1] "794 459.7"
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Answer: Following is the comparison
ggplot(data = tbl_ex9) +
geom_line(aes(x = reorder(model, mse), y = mse, group = 1)) +
geom_point(aes(x = reorder(model, mse), y = mse, shape = model, color = model), size = 3) +
xlab("model")
Plain least squares attains the lowest test MSE (about 749,000) and PCR the highest (about 1,594,000), with PLS close behind least squares and the shrinkage methods in between. All five approaches land in the same broad range, so none dramatically outperforms the rest, and the number of applications can be predicted only with substantial error.
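To put the MSE values on a more interpretable scale, each can be converted into a test-set \(R^2\) (a small sketch using the objects defined above):
# fraction of test-set variance in Apps explained by each model
tbl_ex9 %>%
  mutate(test.r2 = 1 - mse / mean((College_test$Apps - mean(College_test$Apps))^2))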
We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a) Generate a data set with \(p = 20\) features, \(n = 1,000\) observations, and an associated quantitative response vector generated according to the model
\(Y = X\beta + \epsilon\)
where \(\beta\) has some elements that are exactly equal to zero.
Answer: Following is the generation of the data set
set.seed(123)
features_10a <- 20
obs_10a <- 1000
eps_10a <- rnorm(obs_10a)
x_seeds_10a <- rnorm(features_10a)
betas_10a <- rnorm(features_10a)
betas_10a[1] <- 0
betas_10a[5] <- 0
betas_10a[10] <- 0
betas_10a[15] <- 0
betas_10a[20] <- 0
generate_x_10a <- function(seed, nobs){
rnorm(n = nobs, mean = seed)
}
generate_y_10a <- function(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, eps_10a, betas_10a){
y <- V1*betas_10a[1] + V2*betas_10a[2] + V3*betas_10a[3] + V4*betas_10a[4] + V5*betas_10a[5] + V6*betas_10a[6] + V7*betas_10a[7] + V8*betas_10a[8] + V9*betas_10a[9] + V10*betas_10a[10] + V11*betas_10a[11] + V12*betas_10a[12] + V13*betas_10a[13] + V14*betas_10a[14] + V15*betas_10a[15] + V16*betas_10a[16] + V17*betas_10a[17]+ V18*betas_10a[18] + V19*betas_10a[19] + V20*betas_10a[20] + eps_10a
}
tbl_10a <-
map_dfc(x_seeds_10a, generate_x_10a, obs_10a) %>%
add_column(eps_10a)
ls_10a <- map(tbl_10a, list)
y_10a <- pmap(ls_10a, generate_y_10a, betas_10a)
tbl_10a <- tbl_10a %>%
add_column(y_10a = y_10a[[1]]) %>%
select(y_10a, everything())
head(tbl_10a)
## # A tibble: 6 x 22
## y_10a V1 V2 V3 V4 V5 V6 V7 V8 V9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.73 -2.02 -0.994 -0.00846 1.73 -1.23 2.13 1.58 3.53 1.84
## 2 0.800 -0.390 -0.387 0.399 -0.185 -2.13 0.259 0.942 2.66 0.672
## 3 3.78 -1.08 -2.69 0.609 0.201 -2.67 2.68 0.320 1.63 0.900
## 4 4.00 -1.26 -1.35 1.19 1.18 -1.97 0.870 0.197 1.15 2.10
## 5 2.82 -0.532 -0.465 0.755 -0.307 -3.34 0.387 -0.747 2.76 1.46
## 6 -0.868 -2.02 -1.56 -1.40 -1.18 -2.28 0.333 1.24 1.66 0.603
## # ... with 12 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## # V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## # V20 <dbl>, eps_10a <dbl>
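The pmap() construction above works but is verbose; an equivalent and more compact way to generate this kind of data is direct matrix algebra (a sketch; with the same seed and the same draw order it should reproduce the data set above, though treat that as an assumption):
set.seed(123)
n_obs <- 1000; n_feat <- 20
eps   <- rnorm(n_obs)                    # drawn first, as above
seeds <- rnorm(n_feat)                   # column means (x_seeds_10a)
betas <- rnorm(n_feat)
betas[c(1, 5, 10, 15, 20)] <- 0          # elements of beta that are exactly zero
X <- sapply(seeds, function(m) rnorm(n_obs, mean = m))
y <- as.vector(X %*% betas + eps)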
(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations.
Answer: Following is the split
set.seed(123)
data_split_10b <- initial_split(tbl_10a %>% select(-eps_10a), prop = 1/10)
data_split_10b_train <- training(data_split_10b)
data_split_10b_test <- testing(data_split_10b)
head(data_split_10b_train)
## # A tibble: 6 x 21
## y_10a V1 V2 V3 V4 V5 V6 V7 V8 V9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.73 -2.02 -0.994 -0.00846 1.73 -1.23 2.13 1.58 3.53 1.84
## 2 3.78 -1.08 -2.69 0.609 0.201 -2.67 2.68 0.320 1.63 0.900
## 3 6.54 -2.31 -1.10 -0.380 0.862 -2.38 1.23 1.29 4.22 -0.266
## 4 0.671 -1.19 -1.39 0.997 -2.89 -2.55 1.10 1.99 1.40 -0.00260
## 5 6.59 -2.08 -1.20 1.47 -0.0867 -1.07 -0.257 0.115 3.53 0.765
## 6 -4.46 -1.37 -1.33 1.09 -2.26 -2.08 0.727 -0.581 0.895 0.227
## # ... with 11 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## # V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## # V20 <dbl>
(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
Answer: Following is the subset selection, then the plot
mod_form_10c <- as.formula(y_10a ~ .)
bestsel_10c <- regsubsets(mod_form_10c, data_split_10b_train, nvmax = 20)
tbestsel_10c <- tidy.regsubsets(bestsel_10c) %>%
mutate(var.number = 1:n()) %>%
select(var.number, everything())
tbestsel_10c
## # A tibble: 20 x 27
## var.number `(Intercept)` V1 V2 V3 V4 V5 V6 V7 V8
## <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## 4 4 TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 5 TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 6 6 TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 7 7 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 8 8 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 9 9 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 10 10 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 11 11 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 12 12 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 13 13 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 14 14 TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## 15 15 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 16 16 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 17 17 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 18 18 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 19 19 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 20 20 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## # ... with 17 more variables: V9 <lgl>, V10 <lgl>, V11 <lgl>, V12 <lgl>,
## # V13 <lgl>, V14 <lgl>, V15 <lgl>, V16 <lgl>, V17 <lgl>, V18 <lgl>,
## # V19 <lgl>, V20 <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>
# regsubsets() has no predict() method, so we define one; `id` selects the model size
predict.regsubsets <- function(id, object, formula, newdata, ...){
  form <- as.formula(formula)
  mat <- model.matrix(form, newdata)   # design matrix for the new data
  coefi <- coef(object, id = id)       # coefficients of the id-variable model
  xvars <- names(coefi)
  mat[ , xvars] %*% coefi              # predictions
}
y_pred_10c <-
map_dfc(tbestsel_10c$var.number,
predict.regsubsets,
object = bestsel_10c, formula = mod_form_10c, newdata = data_split_10b_train)
mse.calculate <- function(y_pred, y_data){
mean((as.vector(y_pred) - y_data)^2)
}
mse.error <-
map_dbl(y_pred_10c,
mse.calculate,
y_data = data_split_10b_train %>% select(y_10a) %>% pull())
tbestsel_10c_1 <-
bind_cols(
tbestsel_10c,
mse = mse.error
)
ggplot(data = tbestsel_10c_1) +
geom_line(aes(x = var.number, y = mse, group = 1)) +
geom_point(aes(x = var.number, y = mse), size = 3) +
xlab("var.number")
(d) Plot the test set MSE associated with the best model of each size.
Answer: Following is the test MSE
y_pred_10d <-
map_dfc(tbestsel_10c$var.number,
predict.regsubsets,
object = bestsel_10c, formula = mod_form_10c, newdata = data_split_10b_test)
mse.error_10d <-
map_dbl(y_pred_10d,
mse.calculate,
y_data = data_split_10b_test %>% select(y_10a) %>% pull())
tbestsel_10d <-
bind_cols(
tbestsel_10c,
mse = mse.error_10d
)
ggplot(data = tbestsel_10d) +
geom_line(aes(x = var.number, y = mse, group = 1)) +
geom_point(aes(x = var.number, y = mse), size = 3) +
xlab("var.number")
(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
Answer: Following is the model with the min test MSE
tbestsel_10d %>%
arrange(mse) %>%
head(1)
## # A tibble: 1 x 28
## var.number `(Intercept)` V1 V2 V3 V4 V5 V6 V7 V8
## <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 13 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## # ... with 18 more variables: V9 <lgl>, V10 <lgl>, V11 <lgl>, V12 <lgl>,
## # V13 <lgl>, V14 <lgl>, V15 <lgl>, V16 <lgl>, V17 <lgl>, V18 <lgl>,
## # V19 <lgl>, V20 <lgl>, r.squared <dbl>, adj.r.squared <dbl>, BIC <dbl>,
## # mallows_cp <dbl>, rss <dbl>, mse <dbl>
(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
Answer: Following are the coefficients of the selected model, then the comparison with the true \(\beta\) values
coef_10a <-
coef(bestsel_10c, id = 13) %>%
enframe() %>%
rename(estimates_13 = value)
coef_10a
## # A tibble: 14 x 2
## name estimates_13
## <chr> <dbl>
## 1 (Intercept) 0.921
## 2 V3 1.38
## 3 V4 0.783
## 4 V7 1.59
## 5 V8 1.36
## 6 V9 0.407
## 7 V11 -1.15
## 8 V12 0.407
## 9 V13 -0.863
## 10 V14 -0.296
## 11 V16 -1.38
## 12 V17 0.231
## 13 V18 0.489
## 14 V20 -0.296
betas_10f <-
betas_10a %>%
enframe %>%
mutate(name = str_c("V", name)) %>%
rename(!!str_c("betas_", nrow(.)) := value)
betas_10f %>%
left_join(coef_10a, by = "name") %>%
mutate(estimates_13 = if_else(is.na(estimates_13), 0, estimates_13))
## # A tibble: 20 x 3
## name betas_20 estimates_13
## <chr> <dbl> <dbl>
## 1 V1 0 0
## 2 V2 -0.163 0
## 3 V3 1.40 1.38
## 4 V4 0.898 0.783
## 5 V5 0 0
## 6 V6 0.229 0
## 7 V7 1.65 1.59
## 8 V8 1.42 1.36
## 9 V9 0.420 0.407
## 10 V10 0 0
## 11 V11 -1.20 -1.15
## 12 V12 0.300 0.407
## 13 V13 -0.954 -0.863
## 14 V14 -0.458 -0.296
## 15 V15 0 0
## 16 V16 -1.14 -1.38
## 17 V17 0.267 0.231
## 18 V18 0.428 0.489
## 19 V19 0.0549 0
## 20 V20 0 -0.296
Relative to the true \(\beta\), the selected 13-variable model zeroes out the three smallest non-zero coefficients (\(\beta_2\), \(\beta_6\), and \(\beta_{19}\)), estimates the larger coefficients quite closely, and incorrectly includes \(V20\) (true coefficient 0) with a small estimate.
(g) Create a plot displaying \(\sqrt{\sum_{j=1}^p(\beta_j - \hat{\beta}_j^r)^2}\) for a range of values of \(r\), where \(\hat\beta_j^r\) is the \(j^{th}\) coefficient estimate for the best model containing \(r\) coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?
Answer: First we wrangle the data to line up all the true betas with the estimates, then we calculate the distances, and finally we plot
get_coef_10g <- function(idx, obj){
coef(object = obj, id = idx) %>%
enframe %>%
right_join(betas_10f %>% select(name), by = "name") %>%
mutate(!!str_c("estimates_", idx) := if_else(is.na(value), 0, value)) %>%
select(-name, -value)
}
estimates_10g <-
tbestsel_10c$var.number %>%
map_dfc(get_coef_10g, bestsel_10c)
betas_estimates_10g <-
estimates_10g %>%
bind_cols(betas_10f) %>%
select(name, betas_20, everything())
betas_estimates_10g
## # A tibble: 20 x 22
## name betas_20 estimates_1 estimates_2 estimates_3 estimates_4
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 V1 0 0 0 0 0
## 2 V2 -0.163 0 0 0 0
## 3 V3 1.40 0 0 0 1.26
## 4 V4 0.898 0 0 0 0
## 5 V5 0 0 0 0 0
## 6 V6 0.229 0 0 0 0
## 7 V7 1.65 0 0 1.52 1.63
## 8 V8 1.42 0 1.25 1.52 1.53
## 9 V9 0.420 0 0 0 0
## 10 V10 0 0 0 0 0
## 11 V11 -1.20 0 0 0 0
## 12 V12 0.300 0 0 0 0
## 13 V13 -0.954 0 0 0 0
## 14 V14 -0.458 0 0 0 0
## 15 V15 0 0 0 0 0
## 16 V16 -1.14 -1.71 -1.68 -1.56 -1.38
## 17 V17 0.267 0 0 0 0
## 18 V18 0.428 0 0 0 0
## 19 V19 0.0549 0 0 0 0
## 20 V20 0 0 0 0 0
## # ... with 16 more variables: estimates_5 <dbl>, estimates_6 <dbl>,
## # estimates_7 <dbl>, estimates_8 <dbl>, estimates_9 <dbl>,
## # estimates_10 <dbl>, estimates_11 <dbl>, estimates_12 <dbl>,
## # estimates_13 <dbl>, estimates_14 <dbl>, estimates_15 <dbl>,
## # estimates_16 <dbl>, estimates_17 <dbl>, estimates_18 <dbl>,
## # estimates_19 <dbl>, estimates_20 <dbl>
sqdiff <- function(est, beta){
  sqrt(sum((beta - est)^2))  # root-sum-of-squares distance, as the exercise specifies
}
map_dfc(betas_estimates_10g %>% select(-name, -betas_20),
sqdiff,
betas_estimates_10g %>% select(betas_20)
) %>%
gather("estimates_1":"estimates_20", key = "coef", value = "diff") %>%
mutate(coef = as_factor(coef)) %>%
ggplot() +
geom_point(aes(x = coef, y = diff), size = 3) +
geom_line(aes(x = coef, y = diff, group = 1)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
The coefficient-error plot closely tracks the test MSE plot from (d): both drop steeply as the important predictors enter the model and then level off, so models that estimate the true coefficients well also tend to predict well.
# clean up and reload helper functions
rm(list = ls())
source("kRfunlib.R")
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Answer: Following are the methods
A quick look at the Boston data:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(123)
Boston_split <- initial_split(Boston)
Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)
mod_form <-
as.formula(crim ~ .)
bestsel <-
regsubsets(mod_form, Boston_train, nvmax = 20)
tbestsel <-
tidy.regsubsets(bestsel) %>%
mutate(model.number = 1:n()) %>%
select(model.number, everything())
tbestsel
## # A tibble: 13 x 20
## model.number `(Intercept)` zn indus chas nox rm age dis
## <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 4 TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 5 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 6 6 TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## 7 7 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## 8 8 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## 9 9 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## 10 10 TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## 11 11 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 12 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 13 13 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## # ... with 11 more variables: rad <lgl>, tax <lgl>, ptratio <lgl>,
## # black <lgl>, lstat <lgl>, medv <lgl>, r.squared <dbl>,
## # adj.r.squared <dbl>, BIC <dbl>, mallows_cp <dbl>, rss <dbl>
graph.bestsubset(tbestsel)
Model #7 exhibits the best overall parameters: high adjusted \(R^2\), low \(RSS\) and \(C_p\), and a reasonable \(BIC\)
set.seed(123)
# Find the best lambda using cross-validation (alpha = 1 for the lasso)
x <- model.matrix(mod_form, Boston_train)[,-1]
y <- Boston_train$crim
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
# cv$lambda.min
# Fit the final model on the training data
lasso_fit <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(lasso_fit) #%>% tidy()
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.02819343
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis -0.05570023
## rad 0.47363519
## tax .
## ptratio .
## black .
## lstat 0.14129766
## medv -0.05978400
set.seed(123)
cv_out <- cv.glmnet(x, y, alpha = 0)
ridge_best_lam <- cv_out$lambda.min
ridge_fit <- glmnet(x, y, alpha = 0, lambda = ridge_best_lam)
coef(ridge_fit)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.449868943
## zn 0.036856637
## indus -0.073853615
## chas -0.841350172
## nox -5.199251845
## rm 0.968489639
## age -0.002793221
## dis -0.794183807
## rad 0.423181680
## tax 0.004221480
## ptratio -0.134793539
## black -0.001688167
## lstat 0.173746242
## medv -0.183980701
set.seed(123)
pcr_fit <- pcr(mod_form, data = Boston_train, scale=TRUE, validation ="CV")
summary(pcr_fit)
## Data: X dimension: 380 13
## Y dimension: 380 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.07 7.599 7.611 7.252 7.279 7.276 7.254
## adjCV 9.07 7.597 7.609 7.246 7.272 7.270 7.246
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.247 7.114 7.147 7.122 7.126 7.063 7.001
## adjCV 7.239 7.104 7.136 7.107 7.112 7.049 6.987
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.66 60.73 69.94 76.85 83.23 88.24 91.36
## crim 30.21 30.24 37.39 37.41 37.44 38.43 38.83
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.58 95.50 97.09 98.58 99.57 100.00
## crim 41.16 41.18 41.81 42.27 43.40 44.73
validationplot(pcr_fit, val.type="MSEP")
PCR with 8 components exhibits the best parameters: the cross-validation error improves only marginally beyond that point.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
Answer: Following is the comparison of held-out test errors
mse.bestsubset(tbestsel, mod_fit = bestsel, mod_form, data_in = Boston_test, "crim")$graph
mse.bestsubset(tbestsel, bestsel, mod_form, Boston_test, "crim")$tbl %>%
select(model.number, mse) %>%
filter(model.number == 6) %>%
pull(mse)
## [1] 26.84306
Model #6 exhibits the lowest test error among the best subset models
x_test <- model.matrix(mod_form, Boston_test)[,-1]
y_test <- Boston_test$crim
lasso_pred <- predict(lasso_fit, s = cv$lambda.min, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse
## [1] 26.38916
ridge_pred <- predict(ridge_fit, s = ridge_best_lam, newx = x_test)
ridge_mse <- mean(( ridge_pred - y_test)^2)
ridge_mse
## [1] 26.4498
pcr_pred <- predict(pcr_fit, Boston_test, ncomp = 8)
pcr_mse <- mean((pcr_pred - Boston_test$crim)^2)
pcr_mse
## [1] 26.23383
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Answer: The chosen model is PCR with \(ncomp = 8\), which exhibits the lowest test MSE. It does not involve all of the features in the usual sense: each principal component is a linear combination of all 13 predictors, so every feature contributes, but only 8 of the 13 possible components are kept, discarding the directions of the predictor space that carry little variance.
# clean up and reload helper functions
rm(list = ls())
source("kRfunlib.R")