BetaVisualizer

Vadim Tyuryaev1

Aleksandr Tsybakin2

Jane Heffernan3

Hanna Jankowski4

Kevin McGregor5

LIBRARIES

library(RegrCoeffsExplorer)
library(gridExtra)
library(glmnet)
library(selectiveInference)
library(dplyr)
library(faraway)
library(MASS)
library(psych)
library(ggplot2)
library(rlang)

EXAMPLES

1. LM object

1.1 Data

To assess the impact of automobile design and performance characteristics on fuel efficiency, measured in miles per gallon (MPG), we apply our data visualization tool to the mtcars dataset.


# help(mtcars)

df_mtcars=as.data.frame(mtcars)

df_mtcars[c("cyl","vs","am","gear")] =
  lapply(df_mtcars[c("cyl","vs","am","gear")] , factor) # convert to factor

head(df_mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

lm_object=lm(mpg~cyl+hp+wt+disp+vs+am+carb,data=df_mtcars)

summary(lm_object)
#> 
#> Call:
#> lm(formula = mpg ~ cyl + hp + wt + disp + vs + am + carb, data = df_mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.8806 -1.1961 -0.2563  1.2542  4.6905 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 31.96076    3.66507   8.720 9.47e-09 ***
#> cyl6        -2.57040    1.79214  -1.434   0.1650    
#> cyl8         0.20422    3.73915   0.055   0.9569    
#> hp          -0.04911    0.02456  -2.000   0.0575 .  
#> wt          -3.16405    1.42802  -2.216   0.0369 *  
#> disp         0.01032    0.01570   0.657   0.5176    
#> vs1          2.53765    1.97564   1.284   0.2118    
#> am1          2.44093    1.68650   1.447   0.1613    
#> carb         0.53464    0.76313   0.701   0.4906    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.468 on 23 degrees of freedom
#> Multiple R-squared:  0.8756, Adjusted R-squared:  0.8323 
#> F-statistic: 20.23 on 8 and 23 DF,  p-value: 1.105e-08

1.2 Default side-by-side plots


grid.arrange(vis_reg(lm_object)$"SidebySide")

It is imperative to acknowledge that variables such as engine configuration (specifically, the straight engine vs1) and vehicle weight influence fuel efficiency the most, with the effect of vs variable remaining consistent when examining changes in coefficients of continuous variables occurring within empirical data spanning from the first (Q1) to the third (Q3) quartiles by default. Nonetheless, a paradigm shift is observed for several other variables when the analysis transitions from a per-unit change perspective (as depicted in the left plot) to an examination of variations between the Q1 and Q3 quartiles (illustrated in the right plot). Under this new analytical lens, displacement and horsepower emerge as the third positive and first negative most influential factors, respectively. This shift in variable significance can be attributed to the fact that the differences in displacement and horsepower among the majority of vehicles do not typically equate to a mere 1 cubic inch or 1 horsepower. Consequently, this phenomenon underscores the criticality of considering the distribution of variables in the interpretation of regression outcomes, as reliance on per-unit interpretations may lead to misconceptions.

1.3 Visualizing per-unit change together with an intercept


vis_reg(lm_object, intercept=T)$"PerUnitVis"

1.4 Adding Confidence Intervals (CIs)


vis_reg(lm_object, intercept=T, CI=T)$"PerUnitVis"

1.4 Customizing pallete, title, and modifying default realized effect size calculations


grid.arrange(vis_reg(
             lm_object, CI=T,palette=c("palegreen4","tomato1"),
             eff_size_diff=c(1,5),
             title=c(
               "Regression - Unit Change", 
               "Regression - Effective Change (min --> max)"
                     )
                     )$"SidebySide"
             )

1.5 Customizing individual graphs


# obtain coefficients for vs and wt
vline1=lm_object$coefficients['vs1'][[1]]
vline2=lm_object$coefficients['wt'][[1]]

vis_reg(lm_object)$"PerUnitVis"+
  geom_hline(yintercept=vline1, linetype="dashed", color = "blue", size=1)+   # add a vertical line
  geom_hline(yintercept=vline2, linetype="dashed", color = "orange", size=1)+
  ggtitle("Visualization of Regression Results (per unit change)")+
  ylim(-5,5)+                                                                 # note the coordinate flip
  xlab("aspects")+
  ylab("coefficients")+
  theme_bw()+
  scale_fill_manual(values = c("black","pink" ))+                             # change mappings 
  theme(plot.title = element_text(hjust = 0.5))                               # place title in the center
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

2. GLM object

We employ the High School and Beyond dataset (hsb) to visualize the odds of selecting the Academic high school program. This analysis is based on predictors such as sex, race, socioeconomic status and scores on several subjects.

2.1 Data and fitted object


# ?hsb

glm_object=glm(
  I(prog == "academic") ~ gender +math+ read + write + science + socst,
  family = binomial(link="logit"), 
  data = faraway::hsb)

summary(glm_object)
#> 
#> Call:
#> glm(formula = I(prog == "academic") ~ gender + math + read + 
#>     write + science + socst, family = binomial(link = "logit"), 
#>     data = faraway::hsb)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -7.86563    1.33027  -5.913 3.36e-09 ***
#> gendermale   0.25675    0.37566   0.683 0.494314    
#> math         0.10454    0.02996   3.490 0.000484 ***
#> read         0.03869    0.02618   1.478 0.139455    
#> write        0.03794    0.02767   1.371 0.170272    
#> science     -0.08102    0.02676  -3.028 0.002460 ** 
#> socst        0.04908    0.02260   2.172 0.029860 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 276.76  on 199  degrees of freedom
#> Residual deviance: 208.87  on 193  degrees of freedom
#> AIC: 222.87
#> 
#> Number of Fisher Scoring iterations: 4

2.2 Default side-by-side plots, CIs and 99% confidence interval


grid.arrange(vis_reg(
             glm_object,
             CI=T, 
             alpha=0.01
                    )$"SidebySide"
             )

Upon examination of the regression coefficients derived from the empirical data distribution for a change between Q1 and Q3 for continuous independent variables, it is evident that the math score variable exerts the highest impact on the odds of selecting an academic program as shown on the right plot. Concurrently, the variable gendermale which predominates in influence as depicted in the left plot, transitions to the position of minimal positive impact within this context.

3 GLMNET model objects

3.1. Data

We utilize the LASSO regression to understand how various car characteristics influence sales price using a data set from 93 Cars on Sale in the USA in 1993.


df_glmnet=data.frame(Cars93)

df_glmnet[sample(dim(df_glmnet)[1], 5), ] # examine 5 randomly selected rows
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
39 Geo Metro Small 6.7 8.4 10.0 46 50 None Front 3 1.0 55 5700 3755 Yes 10.6 4 151 93 63 34 27.5 10 1695 non-USA Geo Metro
38 Ford Crown_Victoria Large 20.1 20.9 21.7 18 26 Driver only Rear 8 4.6 190 4200 1415 No 20.0 6 212 114 78 43 30.0 21 3950 USA Ford Crown_Victoria
77 Pontiac Bonneville Large 19.4 24.4 29.4 19 28 Driver & Passenger Front 6 3.8 170 4800 1565 No 18.0 6 177 111 74 43 30.5 18 3495 USA Pontiac Bonneville
12 Chevrolet Cavalier Compact 8.5 13.4 18.3 25 36 None Front 4 2.2 110 5200 2380 Yes 15.2 5 182 101 66 38 25.0 13 2490 USA Chevrolet Cavalier
13 Chevrolet Corsica Compact 11.4 11.4 11.4 25 34 Driver only Front 4 2.2 110 5200 2665 Yes 15.6 5 184 103 68 39 26.0 14 2785 USA Chevrolet Corsica

levels(df_glmnet$Origin)                                                        # check level attributes
#> [1] "USA"     "non-USA"

df_glmnet=df_glmnet %>% mutate(MPG.avg = (MPG.city + MPG.highway) / 2)          # calculate average MPG

3.2 LASSO - data preparation and model


y_lasso=df_glmnet$Price

x_lasso=model.matrix(
        as.formula(paste("~",
                         paste(c("MPG.avg","Horsepower","RPM","Wheelbase", 
                                 "Passengers","Length", "Width", "Weight",
                                 "Origin","Man.trans.avail"
                                 ), collapse = "+"
                               ),sep = ""
                         )
                   ), data=df_glmnet
                     )
                                                   

x_lasso = x_lasso[, -1]                                                         # remove intercept

ndim_lasso=dim(x_lasso)[1]

cv_model_lasso = cv.glmnet(x_lasso, y_lasso, family="gaussian", alpha=1)        # LASSO regression

# extract value of lambda that gives minimum mean cross-validated error
best_lambda_lasso = cv_model_lasso$lambda.min                                   

plot(cv_model_lasso)


best_model_lasso = glmnet(x_lasso, y_lasso, family="gaussian", alpha=1, 
                            lambda=best_lambda_lasso)

coefficients(best_model_lasso)
#> 11 x 1 sparse Matrix of class "dgCMatrix"
#>                              s0
#> (Intercept)        51.264201929
#> MPG.avg            -0.311611268
#> Horsepower          0.155706428
#> RPM                -0.003256390
#> Wheelbase           0.623949866
#> Passengers         -1.329682491
#> Length              0.094327741
#> Width              -1.387744830
#> Weight             -0.002804314
#> Originnon-USA       4.045929420
#> Man.trans.availYes -2.098636064

Note that on Lasso regression plots two values of regularization parameter \(\lambda\) are indicated: \(\lambda_{min}\) and \(\lambda_{1se}\). What is the difference? The first, \(\lambda_{min}\)is the value that minimizes the cross-validated error, leading to a model that fits the data with the lowest prediction error, but with a potential risk of overfitting. Conversely,\(\lambda_{1se}\). is a more conservative choice, representing the largest \(\lambda\) within one standard error of the minimum error, resulting in a simpler, more robust model that is less likely to overfit while maintaining a prediction error close to the minimum. For our analysis we select \(\lambda_{min}\).

The LASSO regression has reduced the coefficient for the weight variable to zero, likely due to its high correlation with other variables included in the analysis.

3.3 Checking correlations for numeric variables of interest


df_glmnet_num=df_glmnet%>%select_if(function(x) is.numeric(x))                  

cols_to_select = c("MPG.avg","Horsepower","RPM","Wheelbase","Passengers",
                   "Length", "Width", "Weight")

df_glmnet_num=df_glmnet_num %>%select(all_of(cols_to_select))                 

corPlot(df_glmnet_num,xlas=2)

The correlation matrix substantiates our hypothesis, revealing a high correlation between weight and multiple variables incorporated in the model.

3.4 Default LASSO plots with custom realized effect size


grid.arrange(vis_reg(best_model_lasso,eff_size_diff=c(1,3),         # Q2 - minimum
                     glmnet_fct_var="Originnon-USA")$"SidebySide")  # note the naming pattern for categorical variables

Note that the Weight variable is retained and remains consistently equal to \(0\) across both plots. Additionally, the variation in regression coefficients and their interpretation align with the paradigm change discussed previously.

3.5 Modifying idividial plots and arraning them back together


plt_1=vis_reg(best_model_lasso,eff_size_diff=c(1,3),
              glmnet_fct_var="Originnon-USA")$"PerUnitVis"+
              ggtitle("Visualization of CV.GLMNET Results (per unit change)")+
              ylim(-4,4)+
              xlab("Car characteristics")+
              ylab("LASSO coefficients")+
              theme_bw()+
              scale_fill_manual(values = c("red","whitesmoke" ))+                            
              theme(plot.title = element_text(hjust = 0.5))   
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

plt_2=vis_reg(best_model_lasso, eff_size_diff=c(1,3), 
              glmnet_fct_var="Originnon-USA")$"RealizedEffectVis"+
              ggtitle("Visualization of CV.GLMNET Results (effective:min --> Q2)")+
              ylim(-15,15)+        
              xlab("Car characteristics")+
              ylab("LASSO coefficients")+
              theme_bw()+
              scale_fill_manual(values = c("maroon1","palegreen1" ))+                            
              theme(plot.title = element_text(hjust = 0.5))   
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

plt_3=arrangeGrob(plt_1,plt_2, nrow=1, widths = c(1,1))

grid.arrange(plt_3)

Note that coefficients with absolute values exceeding those specified in the ylim vector will not be visualized. For instance, setting ylim=c(-2,2) for the left plot would result in the omission of the Originnon-USA coefficient from the visualization.

3.6 Post Selection Inference

3.6.1 Data

We employ the Stanford Heart Transplant data (jasa) which bcontains detailed records of heart transplant patients, including their survival times, status, and other clinical variables, used for survival analysis to demonstrate the construction of CIs for glmnet type objects.


# ?jasa

heart_df=as.data.frame(survival::jasa)

heart_df_filtered = heart_df %>%filter(!rowSums(is.na(.)))                      # remove rows containing NaN values

# check last 6 rows of the data frame
tail(heart_df_filtered)
birth.dt accept.dt tx.date fu.date fustat surgery age futime wait.time transplant mismatch hla.a2 mscore reject
93 1925-10-10 1973-07-11 1973-08-07 1974-04-01 0 0 47.75086 264 27 1 2 0 0.33 0
94 1929-11-11 1973-09-14 1973-09-17 1974-02-25 1 1 43.84120 164 3 1 3 0 1.20 1
96 1947-02-09 1973-10-04 1973-10-16 1974-04-01 0 0 26.65024 179 12 1 2 0 0.46 0
97 1950-04-11 1973-11-22 1973-12-12 1974-04-01 0 0 23.61670 130 20 1 3 1 1.78 0
98 1945-04-28 1973-12-14 1974-03-19 1974-04-01 0 0 28.62971 108 95 1 4 1 0.77 0
100 1939-01-31 1974-02-22 1974-03-31 1974-04-01 0 1 35.06092 38 37 1 3 0 0.67 0

3.6.2 Data observations


# filtered data only contains patients who received a transplant,
sum(heart_df_filtered$transplant!=1)
#> [1] 0

# mismatch scores are weakly correlated,
print('Correlation between mismatch scores:')
#> [1] "Correlation between mismatch scores:"
cor(heart_df_filtered$mscore,heart_df_filtered$mismatch)
#> [1] 0.3881104

# if rejection occurs, the death is certain, at least, in this data set
heart_cont_table=table(heart_df_filtered$reject,heart_df_filtered$fustat)
dimnames(heart_cont_table) =list(
  Reject = c("No", "Yes"), 
  Status = c("Alive", "Deceased")
  )
heart_cont_table
#>       Status
#> Reject Alive Deceased
#>    No     24       12
#>    Yes     0       29
                       
# 'age' is skewed variable with a very big range
paste("Range of '\ age \' variable is : ",diff(range(heart_df_filtered$age)))
#> [1] "Range of ' age ' variable is :  44.8569472963724"

old_par = par()      
par(mfrow=c(2,2))
hist(heart_df_filtered$age, main="Histogram of Age", xlab="age")
boxplot(heart_df_filtered$age,main="Boxplot of Age", ylab="age")
hist(sqrt(heart_df_filtered$age),main="Histogram of transformed data", xlab="Sqrt(age)")
boxplot(sqrt(heart_df_filtered$age),main="Boxplot of transformed data", ylab="Sqrt(age)")

par(old_par)          

3.6.3 A note about rounding


# observe that age variable is not rounded

# it is calculated in the following manner
age_calc_example=difftime(heart_df_filtered$accept.dt, 
                          heart_df_filtered$birth.dt,units = "days")/365.25

# check the first calculated value
age_calc_example[1]==heart_df_filtered[1,]$age
#> [1] TRUE

# check randomly selected value
n_samp=sample(dim(heart_df_filtered)[1],1)
age_calc_example[n_samp]==heart_df_filtered[n_samp,]$age
#> [1] TRUE

# check 5 point summary 
heart_df_filtered$age%>%summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   19.55   42.50   48.02   46.03   52.08   64.41

# check 5 point summary for data rounded down to the nearest integer
heart_df_filtered$age%>%floor()%>%summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   19.00   42.00   48.00   45.54   52.00   64.00

In the realm of our visualization tool, two primary inquiries emerge:

*How does the Odds Ratio (OR) change with a unit increment in the variables under scrutiny?

*How does the OR vary in response to alterations exceeding a single unit, such as the disparity between the first (Q1) and third (Q3) quartiles within the data distribution?

It is crucial to acknowledge that the data distribution may not always support a per-unit interpretation, as exemplified by the age variable within our dataset. Consequently, when engaging in calculations that encompass changes across quartiles, it is advisable to employ rounding strategies (either floor or ceiling functions) prior to data input. This approach facilitates the comparison of ORs associated with unit age discrepancies (e.g., 1 year) against those pertaining to more substantial differences (e.g., 10 years).

Absence of rounding can lead to nuanced interpretations. Consider, for instance, the interquartile range for the age variable, which is calculated as Q3 - Q1 (52.08 - 42.50 = 9.58 years). In such scenarios, the OR derived from the Q3 to Q1 variation in age essentially compares the odds of mortality among individuals with an age gap of 9.58 years, a differential that may not intuitively serve as the most illustrative measure. In the vis_reg() function, the round_func parameter allows for the specification of rounding the calculated differences either upwards or downwards to the nearest integer, thus providing a more instinctual explication.

3.6.4 Model


# reject categorical variable in not included due to the reason previously stated
heart_df_filtered = heart_df_filtered %>%
  mutate(across(all_of(c("surgery")), as.factor))

# apply 'sqrt()' transformation to 'age' variable
heart_df_filtered$sqrt.age=sqrt(heart_df_filtered$age)

y_heart=heart_df_filtered$fustat

x_heart=model.matrix(as.formula(paste("~",
         paste(c("sqrt.age" ,"mismatch","mscore", "surgery"),collapse = "+"),
         sep = "")), data=heart_df_filtered)

x_heart=x_heart[, -1]
x_heart_orig=x_heart                                                            # save original data set
x_heart=scale(x_heart,T,T)                                                    

gfit_heart = cv.glmnet(x_heart,y_heart,standardize=F,family="binomial")

lambda_heart=gfit_heart$lambda.min
n_heart=dim(x_heart)[1]

beta_hat_heart=coef(gfit_heart, x=x_heart, y=y_heart, s=lambda_heart, exact=T)

# note that lambda should be multiplied by the number of rows
out_heart = fixedLassoInf(x_heart,y_heart,beta_hat_heart,lambda_heart*n_heart,
                          family="binomial")
#check the output
out_heart
#> 
#> Call:
#> fixedLassoInf(x = x_heart, y = y_heart, beta = beta_hat_heart, 
#>     lambda = lambda_heart * n_heart, family = "binomial")
#> 
#> Testing results at lambda = 0.921, with alpha = 0.100
#> 
#>  Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
#>    1  0.819   2.718   0.009     0.280    1.319       0.050      0.049
#>    2 -0.524  -1.685   0.140    -1.031    0.308       0.050      0.049
#>    3  0.509   1.486   0.197    -0.506    1.065       0.049      0.049
#>    4 -0.435  -1.606   0.135    -0.882    0.241       0.048      0.050
#> 
#> Note: coefficients shown are full regression coefficients

# note the class
class(out_heart)
#> [1] "fixedLogitLassoInf"

Although the data input is centered and scaled, the coefficients and CIs are presented on the original scale. The package includes a function named detransform that carries out the re-scaling and de-centering process for effective size difference calculations. Alternatively, consider rounding down or up before passing the data to the function.

3.6.5 A note on data scaling and centering in relation to glmnet objects

 
# back transformation logic
 x_heart_reconstructed = t(apply(x_heart, 1, function(x) 
   x*attr(x_heart,'scaled:scale') + attr(x_heart, 'scaled:center')))

 # check
 all.equal(x_heart_orig,x_heart_reconstructed)
#> [1] TRUE
 
 # same via a function
 x_heart_reconstructed.2=detransform(x_heart)
 all.equal(x_heart_orig,x_heart_reconstructed.2)
#> [1] TRUE

3.6.6 LASSO regression with CIs and custom realized effect size


grid.arrange(vis_reg(out_heart, CI=T, glmnet_fct_var=c("surgery1"), 
                     round_func="none",eff_size_diff=c(1,3))$"SidebySide"
             )

3.6.7 A note on Selective Inference

In the domain of Selective Inference, it is noteworthy that CIs may not encompass the estimated coefficients. To elucidate, scenarios may arise wherein both bounds of the confidence intervals are positioned beneath the estimated coefficients. The following example is reproduced without any changes from “Tools for Post-Selection Inference” (pp.9-10).


set.seed(43)
n = 50
p = 10
sigma = 1
x = matrix(rnorm(n*p),n,p)
x=scale(x,TRUE,TRUE)
beta = c(3,2,rep(0,p-2))
y = x%*%beta + sigma*rnorm(n)
pf=c(rep(1,7),rep(.1,3)) #define penalty factors
pf=p*pf/sum(pf) # penalty factors should be rescaled so they sum to p
xs=scale(x,FALSE,pf) #scale cols of x by penalty factors
# first run glmnet
gfit = glmnet(xs, y, standardize=FALSE)
# extract coef for a given lambda; note the 1/n factor!
# (and we don't save the intercept term)
lambda = .8
beta_hat = coef(gfit, x=xs, y=y, s=lambda/n, exact=TRUE)[-1]
# compute fixed lambda p-values and selection intervals
out = fixedLassoInf(xs,y,beta_hat,lambda,sigma=sigma)
#rescale conf points to undo the penalty factor
out$ci=t(scale(t(out$ci),FALSE,pf[out$vars]))
out
#> 
#> Call:
#> fixedLassoInf(x = xs, y = y, beta = beta_hat, lambda = lambda, 
#>     sigma = sigma)
#> 
#> Standard deviation of noise (specified or estimated) sigma = 1.000
#> 
#> Testing results at lambda = 0.800, with alpha = 0.100
#> 
#>  Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
#>    1  3.987  18.880   0.000     2.657    3.229       0.049      0.050
#>    2  2.911  13.765   0.000     1.454    2.364       0.050      0.049
#>    3  0.187   0.776   0.303    -0.747    1.671       0.050      0.050
#>    4  0.149   0.695   0.625    -1.040    0.353       0.050      0.049
#>    5 -0.294  -1.221   0.743    -0.379    2.681       0.050      0.050
#>    6 -0.206  -0.978   0.685    -0.349    1.568       0.049      0.050
#>    7  0.195   0.914   0.407    -0.487    0.401       0.049      0.049
#>    8  0.006   0.295   0.758    -1.711    0.363       0.050      0.049
#>    9 -0.015  -0.723   0.458    -0.368    0.531       0.050      0.050
#>   10 -0.003  -0.157   0.948    -0.011    7.828       0.050      0.050
#> 
#> Note: coefficients shown are partial regression coefficients

Note that confidence intervals for the first two variables contain the true values c(3,2) and do not encompass the estimated coefficients c(3.987,2.911).

3.6.7 GLMNET with penalty factor and CIs


pf_heart=c(0.3, 0.1,0.1,0.1)
p_l=length(pf_heart)
pf_heart=p_l*pf_heart/sum(pf_heart)

xs_heart_res=scale(x_heart,FALSE,pf_heart)                                      # note that the data is being scaled again

gfit_heart_pef_fac_res = cv.glmnet(xs_heart_res, y_heart, standardize=FALSE, 
                                   family="binomial")

lambda_heart_pef_fac_res=gfit_heart_pef_fac_res$lambda.min

beta_hat_heart_res=coef(gfit_heart_pef_fac_res, x=xs_heart_res, y=y_heart,
                        s=lambda_heart_pef_fac_res, exact=F)

out_heart_res = fixedLassoInf(xs_heart_res,y_heart,beta_hat_heart_res,
                              lambda_heart_pef_fac_res*n_heart,family="binomial")

out_heart_res$ci=t(scale(t(out_heart_res$ci),FALSE,pf_heart[out_heart_res$vars]))

out_heart_res
#> 
#> Call:
#> fixedLassoInf(x = xs_heart_res, y = y_heart, beta = beta_hat_heart_res, 
#>     lambda = lambda_heart_pef_fac_res * n_heart, family = "binomial")
#> 
#> Testing results at lambda = 0.877, with alpha = 0.100
#> 
#>  Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
#>    1  1.630   2.774   0.009     0.273    1.304       0.050      0.048
#>    2 -0.351  -1.690   0.121    -1.038    0.244       0.049      0.049
#>    3  0.342   1.491   0.167    -0.397    1.078       0.049      0.049
#>    4 -0.291  -1.608   0.127    -0.884    0.219       0.049      0.050
#> 
#> Note: coefficients shown are full regression coefficients

3.6.8 A second note on data scaling and centering in relation to fixedLassoInf objects


x_heart_test_3=detransform(xs_heart_res, attr_center=NULL)

x_heart_test_3=detransform(x_heart_test_3,
                           attr_scale=attr(x_heart, 'scaled:scale'),
                           attr_center=attr(x_heart, 'scaled:center')
                           )
# check
all.equal(x_heart_test_3,x_heart_orig)
#> [1] TRUE

The vis_reg() function operates by extracting the necessary information from the provided object. However, in the context of generating CIS with penalty factors for fixedLassoInf type objects, a dual transformation, as illustrated previously, is necessary. Direct reconstruction from the passed object is not possible in such instances. Therefore, to obtain CIs for fixedLassoInf objects that have been fitted using penalty factors, it is essential to supply the original, non-transformed data.

3.6.9 Post selection inference with CIs and penalty factors


# note that case_penalty=T and x_data_orig must be specified

# effective change between Q1(2) and max(5)
grid.arrange(vis_reg(out_heart_res, CI=T, glmnet_fct_var=c("surgery1"), 
                     case_penalty=T, x_data_orig=x_heart_orig,
                     eff_size_diff=c(2,5))$"SidebySide")

It is important to observe that when the computed effective size difference is below 1, such would have been the case if we utilized default Q3 - Q1 difference which is 7.217 - 6.519 = 0.698 ( see summary(heart_df_filtered$sqrt.age) ), the OR on the right plot would correspond to a change of less than one unit. As a result, the numerical values presented on the right plot would be lower than those on the left plot. This outcome may appear counter intuitive at first glance.


  1. York University,Mathematics and Statistic, ↩︎

  2. York University,Mathematics and Statistic, ↩︎

  3. York University,Mathematics and Statistic, ↩︎

  4. York University,Mathematics and Statistic, ↩︎

  5. York University,Mathematics and Statistic, ↩︎