bestridge
One of the main tasks of statistical modeling is to exploit the association between a response variable and multiple predictors. The linear model (LM), a simple parametric regression model, is often used to capture linear dependence between the response and the predictors. Its two common extensions, the generalized linear model (GLM) and Cox's proportional hazards (CoxPH) model, chosen according to the type of response, relate the response to a linear combination of the predictors. When the number of predictors is large, parameter estimation in these models can be computationally burdensome. At the same time, a widely accepted heuristic rule for statistical modeling is Occam's razor, which asks modelers to keep a good balance between goodness of fit and model complexity, leading to a relatively small subset of important predictors.
bestridge is a toolkit for the best subset ridge regression (BSRR) problem. Under many sparse learning regimes, L0 regularization can outperform commonly used feature selection methods (e.g., L1 and MCP). We state the problem as
$$\min_{\beta} \; -\log L(\beta) + \lambda \Vert \beta \Vert_2^2, \quad \text{subject to } \Vert \beta \Vert_0 \le s,$$
where log L(β) is the log-likelihood function in the GLM case or the log partial likelihood function in the Cox model, and ‖β‖0 denotes the L0 norm of β, i.e., the number of non-zero entries of β. The vanilla L0 regularization yields the minimum-variance unbiased estimator on the active set. For restricted fits, the estimation bias is positive, and it is worthwhile if the decrease in variance exceeds the increase in bias. Here we introduce the best subset ridge regression as an option for trading off bias and variance: it adds an L2 norm shrinkage to the coefficients, the strength of which is controlled by the parameter λ. The fitting is done over a grid of sparsity levels s and λ values to generate a regularization path. For each candidate model size and λ, the best subset ridge regression
is solved by the L2
penalized primal-dual active set algorithm. This algorithm utilizes an
active set updating strategy via primal and dual variables and fits the
sub-model by exploiting the fact that their support sets are non-overlapping and complementary. For method = "sequential" with warm.start = TRUE, we run this algorithm over a sequence of combinations of model size and λ and use the estimate from the last iteration as a warm start for the next. For method = "psequential" or method = "pgsection", the Powell conjugate direction method is implemented. This method finds the optimal parameters by a bi-directional search along each search vector, in turn. The bi-directional line search along each search vector can be done either over a sequential search path or by golden section search; the line search method is specified by method = "psequential" or method = "pgsection", respectively.
This quick start guide walks you through fitting the best subset ridge regression on linear and logistic models.
We apply the methods to the data set reported in Scheetz et al. (2006), which contains gene expression levels of 18975 probes obtained from 120 rats. We are interested in finding probes related to the expression of gene TRIM32 using linear regression. This gene is known to cause Bardet-Biedl syndrome, a genetic disease affecting multiple organ systems, including the retina. For simplicity, the number of probes is reduced to 500 by selecting the 500 probes with the largest marginal correlation with TRIM32 expression.
Suppose we want to fit a best subset ridge regression model with λ taking values at 100 grid points from 100 to 0.001 and the model size ranging from 1 to min{p, n/log(n)} (rounded to an integer), which is the default model size range, where p denotes the number of variables and n the sample size, with tuning via the Generalized Information Criterion. We call the bsrr function as follows:
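A minimal sketch of such a call, assuming the 500-probe matrix is stored in x, the TRIM32 expression in y, and that lm.bsrr is simply our choice of object name:

library(bestridge)
# fit BSRR for the linear (Gaussian) model; by default lambda runs over
# 100 log-spaced values from 100 to 0.001 and tuning is done by GIC
lm.bsrr <- bsrr(x, y, family = "gaussian")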
The fitted coefficients can be extracted by running the
coef
function.
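For instance, continuing with the lm.bsrr object sketched above:

# estimated coefficients of the selected model
coef(lm.bsrr)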
To make a prediction on the new data, a predict
function
can be used as follows:
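A sketch, assuming x.new is a matrix of new observations on the same 500 probes:

# predictions for the new data
pred <- predict(lm.bsrr, newx = x.new)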
If the option newx
is omitted, the fitted linear
predictors are used.
We now turn to the classification task. We use the data set duke (West et al. 2001). This data set contains a microarray experiment for 86 breast cancer patients. We are interested in classifying the patients into estrogen receptor-positive (Status = 0) and estrogen receptor-negative (Status = 1) based on the expression levels of the considered genes.
Suppose we want to perform this classification based on logistic regression with a BSRR model. This can be realized by calling the bsrr function with the parameter family = "binomial" as follows:
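A sketch, assuming the gene expression matrix of the duke data is in x and the binary Status vector in y; the object name logi.bsrr is ours, and type = "bsrr" requests the ridge-regularized fit referred to below:

# fit BSRR for logistic regression
logi.bsrr <- bsrr(x, y, family = "binomial", type = "bsrr")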
Calling the plot routine on a bsrr object fitted with type = "bsrr" will produce a heatmap of the GIC over the fitted path.
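For instance, with the logistic fit sketched above:

# heatmap of GIC over the (model size, lambda) grid
plot(logi.bsrr)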
The bestridge package also supports studying the relationship between predictor variables and survival time based on the Cox proportional hazards model. We now demonstrate the usage on a real data set of gene-expression measurements in lymphoma patients (Alizadeh et al. 2000). We illustrate the usage on a subset of the full data set: 50 patients with measurements on 1000 probes.
To implement the best subset ridge regression on the Cox proportional hazards model, call the bsrr function with family specified as "cox".
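A sketch, assuming the 1000-probe matrix is in x and the survival response y is a two-column matrix of observed times and censoring status (the two-column form is our assumption, following the usual survival-response convention):

# fit BSRR under the Cox proportional hazards model
cox.bsrr <- bsrr(x, y, family = "cox")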
We use the summary function to display a summary of the fitted bsrr object. According to the result, 10 of the 1000 probes are selected.
summary(cox.bsrr)
#> -------------------------------------------------------------------------------------------
#> Penalized Primal-dual active algorithm with tuning parameter determined by
#> powell method using golden section method for line search
#>
#> Best model with k = 10 lambda = 0.002287302 includes predictors:
#>
#> x4 x5 x114 x195 x394 x533 x604 x656
#> 14.285682 -8.229564 -6.160176 2.227762 11.106649 -4.282352 5.125808 -4.680172
#> x879 x913
#> -8.494358 3.594189
#>
#> log-likelihood: -37.16859
#> deviance: 74.33718
#> GIC: 168.5587
#> -------------------------------------------------------------------------------------------
+ Information criterion: AIC, BIC, GIC, EBIC
+ Cross-validation
So far, we have stuck to the default Generalized Information Criterion for tuning parameter selection. In this package, we provide several criteria, including the Akaike information criterion (Akaike 1974), the Bayesian information criterion (Schwarz 1978), the Generalized Information Criterion (Konishi and Kitagawa 1996), and the extended BIC (Chen and Chen 2008, 2012), as well as cross-validation. By default, bsrr selects the tuning parameters according to the Generalized Information Criterion.
To choose one of the information criteria for selecting the optimal values of the model size s and the shrinkage λ, the input value tune in the bsrr function needs to be specified:
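For example, to tune by the extended BIC instead of the default GIC (the lowercase value "ebic" and the object name are our assumptions):

# select (s, lambda) by EBIC
lm.bsrr.ebic <- bsrr(x, y, tune = "ebic")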
To use cross-validation for parameter selection, set the input value
of tune = "cv"
. By default, 5-fold cross-validation will be
conducted.
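A sketch of the cross-validated fit:

# 5-fold cross-validation (the default number of folds)
lm.bsrr.cv <- bsrr(x, y, tune = "cv")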
+ Sequential method
+ Powell's method
We shall give a more explicit description of the parameter tuning paths. As mentioned in the introduction, we either apply a method = "sequential" path, which examines each combination of the two tuning parameters s and λ sequentially and chooses the optimal one according to a certain information criterion or cross-validation, or the less computationally burdensome Powell method (method = "psequential" or method = "pgsection"). The initial starting point $x_0$ of the Powell method is set to $(s_{\min}, \lambda_{\min})$, and the initial search vectors $v_1, v_2$ are the coordinate unit vectors. The method minimizes the value of the chosen information criterion or the cross-validation error by a bi-directional search along each search vector, in turn, on the 2-dimensional tuning parameter space composed of s and λ. Callers can choose to search for the optimal combination along each search vector sequentially or to conduct a golden-section search by setting method = "psequential" or method = "pgsection", respectively. If warm.start = TRUE, we use the estimate from the last iteration in the PDAS algorithm as a warm start.
By default, the tuning parameters are determined by the Powell method with a golden-section line search, and warm starts are used. The minimum λ value is set to 0.001 and the maximum to 100. The maximum model size is the smaller of the total number of variables p and n/log(n). Advanced users of this toolkit can change this default behavior and supply their own tuning path.
To conduct parameter selection sequentially, users should specify the method to be "sequential" and provide a list of λ values through lambda.list as well as an increasing list of model sizes through s.list.
# 10 candidate lambda values from 10 to 0.01, equally spaced on the log scale
my.lambda.list <- exp(seq(log(10), log(0.01), length.out = 10))
# candidate model sizes 1 through 10
my.s.list <- 1:10
lm.bsrr.seq <- bsrr(x, y, method = "sequential", s.list = my.s.list,
                    lambda.list = my.lambda.list)
To conduct parameter selection using the Powell method on a user-supplied tuning parameter space, users should assign values to s.min, s.max, lambda.min, and lambda.max. 100 values of λ will be generated, decreasing from lambda.max to lambda.min on the log scale. Here we do the line search sequentially.
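A sketch of such a call; the particular bounds below are illustrative choices, not package defaults:

# Powell search over a user-supplied (s, lambda) space,
# with a sequential line search along each direction
lm.bsrr.pow <- bsrr(x, y, method = "psequential",
                    s.min = 1, s.max = 20,
                    lambda.min = 0.01, lambda.max = 10)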
We also provide a feature screening option to deal with ultrahigh-dimensional data for computational and statistical efficiency. Users can apply the sure independence screening (SIS) method (Saldana and Feng 2018), based on maximum marginal likelihood estimators, to pre-exclude some irrelevant variables before fitting a model by passing an integer to screening.num. SIS will retain a set of variables whose size equals screening.num. The active set updates are then restricted to this set of variables.
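For example, retaining 100 variables after screening (the number 100 is an illustrative choice):

# pre-screen to 100 variables before running the BSRR fit
lm.bsrr.sis <- bsrr(x, y, screening.num = 100)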
In this section, we conduct a Monte Carlo study of our proposed best subset ridge regression method and compare it with other commonly used variable selection methods in three aspects. The first aspect is the predictive performance on held-out testing data of the same size as the training data. For linear regression, this is defined as $\frac{\Vert X \beta^\dagger - X \hat{\beta}\Vert _2}{\Vert X \beta^\dagger\Vert_2}$, where β† is the true coefficient vector and β̂ is an estimator. For logistic regression, we measure classification accuracy by the area under the ROC curve (AUC). The second aspect is the coefficient estimation performance, defined as ‖β† − β̂‖, where β† denotes the underlying true coefficients and β̂ is our estimate. The third aspect is the selection performance in terms of true positive rate (TPR), false positive rate (FPR), the Matthews correlation coefficient (MCC) score, and the model size.
We generate a multivariate Gaussian data matrix $X_{n \times p} \sim MVN(0, \Sigma)$ and consider the variables to have exponential correlation. Here n denotes the sample size and p the number of variables, and we set n = 200, p = 2000. We consider the following instance of $\Sigma := ((\sigma_{ij}))$, setting each entry $\sigma_{ij} = 0.5^{|i - j|}$. We use a sparse coefficient vector β† with 20 equi-spaced nonzero entries, each set to 1. The response vector y is then generated with Gaussian noise added. For linear regression, y = Xβ† + ϵ. For logistic regression, y ∼ Bernoulli(Pr(Y = 1)), where $\Pr(Y = 1) = \exp(x^\top \beta^\dagger + \epsilon)/(1 + \exp(x^\top \beta^\dagger + \epsilon))$. Here ϵ ∼ N(0, σ²) is independent of X. We define the signal-to-noise ratio (SNR) by SNR = $\frac{Var(X\beta^\dagger)}{Var(\epsilon)} = \frac{\beta^{\dagger T} \Sigma \beta^\dagger}{\sigma^2}$.
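A brief sketch of this data-generating process in R; the dimension p and the SNR value below are scaled-down, illustrative choices rather than the exact settings of the study:

library(MASS)
n <- 200; p <- 500                            # the study uses p = 2000
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))        # exponential correlation
x <- mvrnorm(n, rep(0, p), Sigma)
beta <- rep(0, p)
beta[round(seq(1, p, length.out = 20))] <- 1  # 20 equi-spaced nonzero entries
snr <- 10                                     # illustrative SNR
sigma2 <- drop(t(beta) %*% Sigma %*% beta) / snr
eps <- rnorm(n, 0, sqrt(sigma2))
eta <- drop(x %*% beta) + eps
y.lm <- eta                                   # linear-regression response
y.logit <- rbinom(n, 1, exp(eta) / (1 + exp(eta)))  # logistic response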
We compare the performance of the following methods:
(BSRR methods): Tuning parameters selected through a sequential path and the Powell method. Denoted as Grid-BSRR and Powell-BSRR.
(BSS method): The best subset selection method. We consider the primal-dual active set approach to best subset selection and use our own implementation in the bestridge package.
(L1 method): Lasso estimator. We use the implementation of glmnet.
(L1L2 method): Elastic net estimator. This uses a combination of the L1 and L2 regularization. We use the implementation of glmnet.
For BSRR estimators, the 2D grid of tuning parameters has λ taking 100 values between λmax and λmin on a log scale; λmax and λmin will be specified. For the BSS method, the model size parameter takes values in {1, …, 30}. The mixing parameter combining the L1 penalty and the L2 penalty in the Elastic Net is fixed at 0.5. For the L1 and L1L2 methods, we use the default settings. The regularization parameters are chosen by 5-fold cross-validation.
Results for linear regression are reported in Figure 1. When the SNR is low, BSS has the largest estimation error and selects the smallest model size compared with the other methods, while the extra L2 penalty in BSRR effectively lowers the estimation error relative to BSS. For high SNR, the performance of BSS and BSRR is similar. In terms of prediction error, all methods do well in the high-SNR scenario, but the Lasso and the Elastic Net require very dense supports and still cannot catch up with BSS and BSRR. Notice that for SNR = 100, the selected model size of the Elastic Net is almost 7 times the true model size, and the trend is no better for the Lasso. Across the whole SNR range and all measures, BSRR generally exhibits excellent performance, being accurate in both variable selection and prediction.
Results for logistic regression are reported in Figure 2. The results show that as the SNR rises, the extra shrinkage in the BSRR methods leads to impressive improvements in prediction accuracy, estimation, and variable selection, outperforming the state-of-the-art variable selection methods Lasso and Elastic Net.
In the bestridge toolkit, we introduce the best subset ridge regression for solving the L2 regularized best subset selection problem. The L2 penalized PDAS algorithm identifies the best sub-model with a prespecified model size and shrinkage via a primal-dual formulation of feasible solutions. To determine the best sub-model over all possible model sizes and shrinkage parameters, both a sequential search method and a Powell search method are provided. We find that BSRR estimators do a better job than the BSS estimator and usually outperform other variable selection methods in prediction, estimation, and variable selection.