In the Monte Carlo simulations below, we illustrate the bias of the naive post-Lasso estimator. The bias comes from the failure of Lasso to reliably detect small non-zero coefficients.

```{r}
library(glmnet)
n=100 # sample size
R=300 # number of Monte Carlo repetitions
```

The data generating process for the simulations:

$$\begin{aligned}
Y_i &=\beta_1 X_{i,1}+\beta_2 X_{i,2}+\beta_3 X_{i,3}+U_i,\\
\beta_1 &=0,\\
\beta_2 &=0.35,\\
\beta_3 &=0,\\
X_{i,2} &= \rho X_{i,1}+Z_{i,2},\\
X_{i,1},U_i,Z_{i,2},X_{i,3} &\sim \; \text{iid}\;N(0,1)\;\text{and independent of each other}.
\end{aligned}$$

There are three potential regressors:

* Regressor 1 is the main regressor.
* Regressor 2 has a small coefficient, and its correlation with Regressor 1 depends on the magnitude of $\rho$.
* Regressor 3 is irrelevant.

```{r}
beta1=0
beta2=0.35
```

## "Large" $\rho$: $\rho=1$

We first consider the case where Regressor 1 is strongly correlated with the control (Regressor 2).

```{r}
rho=1
```

We write a function for generating data:

```{r}
data_sim<-function(n,beta1,beta2,rho){
  X=matrix(rnorm(n*3),ncol=3)        # three independent N(0,1) regressors
  X[,2]<-rho*X[,1]+X[,2]             # make X2 correlated with X1
  Y=beta1*X[,1]+beta2*X[,2]+rnorm(n) # X3 has a zero coefficient
  list(Y=Y,X=X)                      # return the simulated sample
}
```

We'll use LASSO with cross-validation to select the controls and then estimate the effect of Regressor 1 on $Y$. We set the penalty weight of Regressor 1 to 0 so that it is always included.

```{r}
w=rep(1,3)
w[1]=0 # zero penalty on X1 so it is never dropped
data<-data_sim(n,beta1,beta2,rho)
CV.Lasso=cv.glmnet(data$X,data$Y,family="gaussian",alpha=1,penalty.factor=w)
Included=which(coef(CV.Lasso,s=CV.Lasso$lambda.1se)[-1]!=0)
Included
```

We generate data, select covariates using LASSO (with Regressor 1 always included), and store the t-statistic for the coefficient of the first regressor.

```{r}
rho=1
set.seed(42,sample.kind = "Rejection")
IN2=0 # counter for inclusion of X2
T_Beta1_post=rep(0,R) # Vector to store T-stats for the main regressor
for (r in 1:R){
  data<-data_sim(n,beta1,beta2,rho)
  CV.Lasso=cv.glmnet(data$X,data$Y,family="gaussian",alpha=1,penalty.factor=w)
  Included=which(coef(CV.Lasso,s=CV.Lasso$lambda.1se)[-1]!=0)
  Post_OLS=lm(data$Y~data$X[,Included])
  T_Beta1_post[r]=coef(summary(Post_OLS))[2,3] # selects the t-statistic on X1
  IN2=IN2+(coef(CV.Lasso,s=CV.Lasso$lambda.1se)[3]!=0) # X2 is the 3rd coefficient (after intercept and X1)
}
print("Prob. of X2 included")
IN2/R
```

We plot the histogram of the post-LASSO t-statistic for the first regressor and compare it with the $N(0,1)$ density. If the selection step could be ignored, the t-statistic would be approximately $N(0,1)$ and centered at zero, since the true coefficient is zero. Because Regressor 2 is correlated with Regressor 1 and is omitted with high probability, the distribution is distorted and shifted away from zero.

```{r}
low=min(T_Beta1_post)
high=max(T_Beta1_post)
step=(high-low)/20
hist(T_Beta1_post,breaks=seq(low-2*step,high+2*step,step),xlab="t-statistics",
     main="The exact distribution of the post-LASSO t-statistic vs N(0,1)",freq=FALSE,ylim=c(0,0.5))
# add a vertical line at zero (the true coefficient value)
abline(v=beta1,col="blue")
# add the plot of the N(0,1) pdf
x=seq(-4,4,0.01)
f=exp(-x^2/2)/sqrt(2*pi)
lines(x,f,col="red")
```

## Repeat with a "small" $\rho$: $\rho=0.25$

We rerun the simulations with a smaller $\rho$. Regressor 2 is still dropped with a high probability.
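Before rerunning the loop, a quick back-of-the-envelope calculation suggests where the t-statistic should be centered in each case. If the Lasso drops Regressor 2, the post-selection OLS coefficient on Regressor 1 picks up an omitted-variable bias of roughly $\rho\beta_2$ (since $\mathrm{Var}(X_{i,1})=1$ and $\mathrm{Cov}(X_{i,1},X_{i,2})=\rho$), the residual variance is roughly $1+\beta_2^2$, and so the t-statistic is centered near $\sqrt{n}\,\rho\beta_2/\sqrt{1+\beta_2^2}$ instead of zero. The chunk below is a minimal sketch of this approximation (the helper `approx_center` is our own addition, and the formula ignores the randomness of the selection step):

```{r}
# Approximate centering of the post-LASSO t-statistic when X2 is always dropped.
# Illustrative approximation only; not part of the simulation itself.
approx_center <- function(n, beta2, rho) sqrt(n)*rho*beta2/sqrt(1+beta2^2)
approx_center(n, beta2, rho=1)    # the "large" rho case above
approx_center(n, beta2, rho=0.25) # the "small" rho case below
```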
```{r}
rho=0.25
set.seed(42,sample.kind = "Rejection")
IN2=0 # counter for inclusion of X2
T_Beta1_post=rep(0,R) # Vector to store T-stats for the main regressor
for (r in 1:R){
  data<-data_sim(n,beta1,beta2,rho)
  CV.Lasso=cv.glmnet(data$X,data$Y,family="gaussian",alpha=1,penalty.factor=w)
  Included=which(coef(CV.Lasso,s=CV.Lasso$lambda.1se)[-1]!=0)
  Post_OLS=lm(data$Y~data$X[,Included])
  T_Beta1_post[r]=coef(summary(Post_OLS))[2,3] # selects the t-statistic on X1
  IN2=IN2+(coef(CV.Lasso,s=CV.Lasso$lambda.1se)[3]!=0) # X2 is the 3rd coefficient (after intercept and X1)
}
print("Prob. of X2 included")
IN2/R
```

The exact distribution is now less distorted: when the second regressor is only weakly correlated with Regressor 1, omitting it shifts the t-statistic by much less.

```{r}
low=min(T_Beta1_post)
high=max(T_Beta1_post)
step=(high-low)/20
hist(T_Beta1_post,breaks=seq(low-2*step,high+2*step,step),xlab="t-statistics",
     main="The exact distribution of the post-LASSO t-statistic vs N(0,1)",freq=FALSE,ylim=c(0,0.5))
# add a vertical line at zero (the true coefficient value)
abline(v=beta1,col="blue")
# add the plot of the N(0,1) pdf
x=seq(-4,4,0.01)
f=exp(-x^2/2)/sqrt(2*pi)
lines(x,f,col="red")
```
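For comparison, here is a minimal sketch of the infeasible "oracle" benchmark in which Regressor 2 is always included. This benchmark is our own addition: it reuses `data_sim`, `n`, `beta1`, `beta2`, and `R` from above and sets $\rho=1$, the case where the naive post-Lasso procedure was most distorted. Without a selection step, the t-statistic for Regressor 1 should be approximately $N(0,1)$:

```{r}
# Illustrative oracle benchmark (no selection step): always regress Y on X1 and X2.
set.seed(42,sample.kind = "Rejection")
rho=1
T_Beta1_oracle=rep(0,R) # Vector to store t-stats for the main regressor
for (r in 1:R){
  data<-data_sim(n,beta1,beta2,rho)
  Oracle_OLS=lm(data$Y~data$X[,1:2])
  T_Beta1_oracle[r]=coef(summary(Oracle_OLS))[2,3] # t-statistic on X1
}
hist(T_Beta1_oracle,breaks=20,freq=FALSE,xlab="t-statistics",
     main="The oracle t-statistic (X2 always included) vs N(0,1)",ylim=c(0,0.5))
abline(v=beta1,col="blue")
x=seq(-4,4,0.01)
lines(x,dnorm(x),col="red")
```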