Title: A Comprehensive R Package for Best Subset Selection
Description: The bestridge package is designed to provide a one-stop service for users to successfully carry out best ridge regression in various complex situations via the primal-dual active set algorithm proposed by Wen, C., Zhang, A., Quan, S. and Wang, X. (2020) <doi:10.18637/jss.v094.i04>. This package allows users to perform regression, classification, count regression and censored regression for (ultra) high dimensional data, and it also supports advanced usages such as group variable selection and nuisance variable selection.
Authors: Liyuan Hu [aut, cre]
Maintainer: Liyuan Hu <[email protected]>
License: GPL-3
Version: 1.0.7
Built: 2025-03-13 05:03:42 UTC
Source: https://github.com/cran/bestridge
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.
Best subset ridge regression for generalized linear models and the Cox proportional hazards model.
bsrr(
  x,
  y,
  family = c("gaussian", "binomial", "poisson", "cox"),
  method = c("pgsection", "sequential", "psequential"),
  tune = c("gic", "ebic", "bic", "aic", "cv"),
  s.list,
  lambda.list = 0,
  s.min,
  s.max,
  lambda.min = 0.001,
  lambda.max = 100,
  nlambda = 100,
  always.include = NULL,
  screening.num = NULL,
  normalize = NULL,
  weight = NULL,
  max.iter = 20,
  warm.start = TRUE,
  nfolds = 5,
  group.index = NULL,
  seed = NULL
)
x: Input matrix, of dimension n x p.
y: The response variable, of length n (a two-column matrix for family = "cox"; see the examples).
family: One of the following models: "gaussian", "binomial", "poisson" or "cox".
method: The method to be used to select the optimal model size and lambda: "pgsection", "sequential" or "psequential" (see Details).
tune: The criterion for choosing the model size and lambda: "gic", "ebic", "bic", "aic" or "cv".
s.list: An increasing list of sequential values representing the model sizes. Only used for method = "sequential".
lambda.list: A lambda sequence for method = "sequential". Default is 0.
s.min: The minimum value of model sizes. Only used for the Powell methods ("pgsection" and "psequential").
s.max: The maximum value of model sizes. Only used for the Powell methods ("pgsection" and "psequential").
lambda.min: The minimum value of lambda. Only used for the Powell methods. Default is 0.001.
lambda.max: The maximum value of lambda. Only used for the Powell methods. Default is 100.
nlambda: The number of lambda values. Default is 100.
always.include: An integer vector containing the indexes of variables that should always be included in the model. Default is NULL.
screening.num: Users can pre-exclude some irrelevant variables according to maximum marginal likelihood estimators before fitting a model by passing an integer, the number of variables to retain after screening, to screening.num. Default is NULL (no screening).
normalize: Options for normalization. Default is NULL.
weight: Observation weights. Default is NULL, i.e., equal weights.
max.iter: The maximum number of iterations of the algorithm. Default is 20.
warm.start: Whether to use the last solution as a warm start. Default is TRUE.
nfolds: The number of folds in cross-validation. Default is 5.
group.index: A vector of integers indicating which group each variable is in. Variables in the same group should be located in adjacent columns of x. Default is NULL (no group structure).
seed: Seed used to divide the sample into cross-validation folds. Default is NULL.
The best ridge regression problem with model size s and shrinkage parameter lambda is

  min_beta  -2 logL(beta) + lambda * ||beta||_2^2   subject to   ||beta||_0 <= s.

In the GLM case, logL(beta) is the log-likelihood function; in the Cox model, logL(beta) is the log partial likelihood function. The best subset selection problem is a special case of the best ridge regression problem with the shrinkage lambda = 0.

For each candidate model size s and lambda, the best subset ridge regression problem is solved by the L2-penalized primal-dual active set (PDAS) algorithm; see Wen et al. (2020) for details. This algorithm utilizes an active set updating strategy via primal and dual variables and fits the sub-model by exploiting the fact that their support sets are non-overlapping and complementary. For method = "sequential", if warm.start = TRUE, we run the PDAS algorithm over a list of sequential model sizes and use the estimate from the last iteration as a warm start. For method = "psequential" and method = "pgsection", the Powell method, with a sequential line search or a golden section search respectively, is used to determine the tuning parameters.
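A brief sketch (not taken from the package manual) contrasting the two search strategies on simulated data; the model-size and lambda grids below are illustrative choices rather than package defaults.

# Sketch: sequential grid search vs. Powell golden-section search.
library(bestridge)

Data <- gen.data(n = 200, p = 20, k = 5, rho = 0.4, family = "gaussian", seed = 1)

# Sequential search: every (s, lambda) pair on the supplied grids is visited,
# with warm starts between adjacent model sizes.
fit.seq <- bsrr(Data$x, Data$y,
                method = "sequential",
                s.list = 1:10,
                lambda.list = exp(seq(log(5), log(0.1), length.out = 10)),
                warm.start = TRUE)

# Powell search with a golden-section line search over (s, lambda).
fit.pg <- bsrr(Data$x, Data$y,
               method = "pgsection",
               s.min = 1, s.max = 10,
               lambda.min = 0.001, lambda.max = 100)

# For a well-separated signal both searches should recover a similar support.
which(fit.seq$beta != 0)
which(fit.pg$beta != 0)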
A list with class attribute 'bsrr' and named components:

beta: The best fitting coefficients.
coef0: The best fitting intercept.
loss: The training loss of the best fitting model.
ic: The information criterion of the best fitting model, when model selection is based on an information criterion.
cvm: The mean cross-validated error of the best fitting model, when model selection is based on cross-validation.
lambda: The lambda chosen for the best fitting model.
beta.all: The fitted coefficients of each candidate model visited during the search.
coef0.all: The fitted intercept of each candidate model visited during the search.
loss.all: The training loss of each candidate model visited during the search.
ic.all: The information criterion of each candidate model, when model selection is based on an information criterion.
cvm.all: The mean cross-validated error of each candidate model, when model selection is based on cross-validation.
lambda.all: The lambda used in each step of the search.
family: Type of the model.
s.list: The input s.list.
nsample: The sample size.
type: The type of the fitted model.
method: Method used for tuning parameter selection.
ic.type: The criterion of model selection.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
plot.bsrr, summary.bsrr, coef.bsrr, predict.bsrr.
#-------------------linear model----------------------#
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(x, y, method = "pgsection")

coef(lm.bsrr)
print(lm.bsrr)
summary(lm.bsrr)
pred.bsrr <- predict(lm.bsrr, newx = x_new)

# generate plots
plot(lm.bsrr)

#-------------------logistic model----------------------#
# Generate simulated data
Data <- gen.data(n, p, k, rho, family = "binomial", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
logi.bsrr <- bsrr(x, y, family = "binomial", lambda.list = lambda.list)

coef(logi.bsrr)
print(logi.bsrr)
summary(logi.bsrr)
pred.bsrr <- predict(logi.bsrr, newx = x_new)

# generate plots
plot(logi.bsrr)

#-------------------poisson model----------------------#
Data <- gen.data(n, p, k, rho = 0.3, family = "poisson", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
poi.bsrr <- bsrr(x, y, family = "poisson", lambda.list = lambda.list)

coef(poi.bsrr)
print(poi.bsrr)
summary(poi.bsrr)
pred.bsrr <- predict(poi.bsrr, newx = x_new)

# generate plots
plot(poi.bsrr)

#-------------------coxph model----------------------#
# Generate simulated data
Data <- gen.data(n, p, k, rho, family = "cox", scal = 10, beta = Tbeta)
x <- Data$x[1:140, ]
y <- Data$y[1:140, ]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200, ]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
cox.bsrr <- bsrr(x, y, family = "cox", lambda.list = lambda.list)

coef(cox.bsrr)
print(cox.bsrr)
summary(cox.bsrr)
pred.bsrr <- predict(cox.bsrr, newx = x_new)

# generate plots
plot(cox.bsrr)

#----------------------High dimensional linear models--------------------#
## Not run:
data <- gen.data(n, p = 1000, k, family = "gaussian", seed = seed)
# Best subset selection with SIS screening
lm.high <- bsrr(data$x, data$y, screening.num = 100)
## End(Not run)

#-------------------group selection----------------------#
beta <- rep(c(rep(1, 2), rep(0, 3)), 4)
Data <- gen.data(200, 20, 5, rho = 0.4, beta = beta, seed = 10)
x <- Data$x
y <- Data$y
group.index <- c(rep(1, 2), rep(2, 3), rep(3, 2), rep(4, 3),
                 rep(5, 2), rep(6, 3), rep(7, 2), rep(8, 3))
lm.groupbsrr <- bsrr(x, y, s.min = 1, s.max = 8, group.index = group.index)

coef(lm.groupbsrr)
print(lm.groupbsrr)
summary(lm.groupbsrr)
pred.groupl0l2 <- predict(lm.groupbsrr, newx = x_new)

#-------------------include specified variables----------------------#
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lm.bsrr <- bsrr(Data$x, Data$y, always.include = 2)
This function provides estimated coefficients from a fitted "bsrr" object.
## S3 method for class 'bsrr'
coef(object, sparse = TRUE, ...)
object: A "bsrr" object.
sparse: Logical or NULL, specifying whether the coefficients should be presented as a sparse matrix or not. Default is TRUE.
...: Other arguments.
If sparse == FALSE, a vector containing the estimated coefficients from a fitted "bsrr" object is returned. If sparse == TRUE, a dgCMatrix containing the estimated coefficients is returned.
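A short sketch of the two return formats (a minimal fit on simulated data; the exact shapes of the returned objects are indicative only):

library(bestridge)

Data <- gen.data(200, 20, 5, family = "gaussian", seed = 1)
fit <- bsrr(Data$x, Data$y)

coef(fit, sparse = TRUE)   # a dgCMatrix; zero entries are stored implicitly
coef(fit, sparse = FALSE)  # the same coefficients as a plain numeric vector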
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(Data$x, Data$y, method = "pgsection")

coef(lm.bsrr)
Similar to other deviance methods, this function returns the deviance from a fitted "bsrr" object.
## S3 method for class 'bsrr'
deviance(object, best.model = TRUE, ...)
object: A "bsrr" object.
best.model: Whether to return the deviance of the best model only. Default is TRUE.
...: Additional arguments.
A matrix or vector containing the deviance for each model is returned. For bsrr objects fitted by the sequential method, each row of the returned matrix corresponds to a model size in s.list and each column to a shrinkage parameter in lambda.list. For bsrr objects fitted by the gsection, pgsection and psequential methods, the returned vector contains the deviance of the fitted model in each iteration. The coefficients of those models can be extracted from beta.all and coef0.all in the bsrr object.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", seed = seed)
lm.bsrr <- bsrr(Data$x, Data$y, method = "sequential")

deviance(lm.bsrr)
deviance(lm.bsrr, best.model = FALSE)
This data set is from a microarray experiment on breast cancer patients.
A data frame with 46 rows and 7130 variables, where the first variable is the label of estrogen receptor-positive/negative, and the remaining 7129 variables are 7129 genes.
The binary variable Status is used to classify the patients into estrogen receptor-positive (y = 0) and estrogen receptor-negative (y = 1). The other variables contain the expression level of the considered genes.
M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J.A. Olson, Jr., J.R. Marks and Joseph R. Nevins (2001) <doi:10.1073/pnas.201162998> Predicting the clinical status of human breast cancer by using gene expression profiles, Proceedings of the National Academy of Sciences of the USA, Vol 98(20), 11462-11467.
Generate data for simulations under the generalized linear model and Cox model.
gen.data(
  n,
  p,
  k = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  censoring = TRUE,
  c = 1,
  scal,
  sigma = 1,
  seed = 1
)
n: The number of observations.
p: The number of predictors of interest.
k: The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.
rho: A parameter used to characterize the pairwise correlation in predictors. Default is 0.
family: The distribution of the simulated data: "gaussian", "binomial", "poisson" or "cox".
beta: The coefficient values in the underlying regression model. Default is NULL.
cortype: The correlation structure (see Details). Default is 1.
snr: A numerical value controlling the signal-to-noise ratio (SNR), defined as the variance of X beta divided by the variance of the noise. Default is 10.
censoring: Whether data is censored or not. Valid only for family = "cox". Default is TRUE.
c: The censoring rate. Default is 1.
scal: A parameter in generating survival times based on the Weibull distribution. Only used for the "cox" family.
sigma: A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance. Default is 1.
seed: Seed to be used in generating the random numbers. Default is 1.
We generate an n x p random Gaussian design matrix X with mean 0 and a covariance matrix with an exponential structure or a constant structure. For the exponential structure, the (i, j) entry of the covariance matrix equals rho^|i - j|. For the constant structure, the (i, j) entry of the covariance matrix is rho for every i != j and 1 on the diagonal. For the moving average structure, we first generate an n x p random Gaussian matrix whose entries are i.i.d. N(0, 1), normalize its columns to length sqrt(n), and then generate the design matrix X by taking a moving average over adjacent columns.

For family = "gaussian", the data model is Y = X beta + epsilon. The underlying regression coefficient beta has uniform distribution [m, 100m].

For family = "binomial", the data model is Prob(Y = 1) = exp(X beta + epsilon) / (1 + exp(X beta + epsilon)). The underlying regression coefficient beta has uniform distribution [2m, 10m].

For family = "poisson", the response is generated from a Poisson model with mean exp(X beta + epsilon).

For family = "cox", the survival time is generated from a Weibull model with scale parameter scal. The censoring time is generated from a uniform distribution, and the censoring status is the indicator that the survival time does not exceed the censoring time. The underlying regression coefficient beta has uniform distribution [2m, 10m].

In the above models, epsilon follows a normal distribution with mean zero and a variance determined by snr.
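As a sanity-check sketch of the Gaussian model above (it assumes the snr argument is the ratio of the variance of X beta to the noise variance, as described in the arguments section):

# Sketch: empirically check Y = X beta + epsilon for family = "gaussian".
library(bestridge)

n <- 500
p <- 10
beta <- c(3, 1.5, 0, 0, 2, rep(0, p - 5))   # user-supplied coefficients
Data <- gen.data(n, p, family = "gaussian", beta = beta, snr = 10, seed = 1)

signal   <- as.numeric(Data$x %*% Data$Tbeta)   # Tbeta stores the coefficients actually used
residual <- Data$y - signal
# With snr = 10, the signal variance should be roughly ten times the noise variance.
var(signal) / var(residual)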
x: Design matrix of predictors.
y: Response variable.
Tbeta: The coefficients used in the underlying regression model.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype,
                 snr = SNR, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(x, y, method = "pgsection")
Gravier et al. (2010) have considered small, invasive ductal carcinomas without axillary lymph node involvement (T1T2N0) to predict metastasis of small node-negative breast carcinoma. Using comparative genomic hybridization arrays, they examined 168 patients over a five-year period. The 111 patients with no event after diagnosis were labelled good, and the 57 patients with early metastasis were labelled poor.
A list containing the design matrix X and response matrix y
No return value
Eleonore Gravier., Gaelle Pierron., and Anne Vincent-Salomon. (2010). A prognostic DNA signature for T1T2 node-negative breast cancer patients.
This function returns the log-likelihood for the fitted models.
## S3 method for class 'bsrr'
logLik(object, best.model = TRUE, ...)
object: A "bsrr" object.
best.model: Whether to return the log-likelihood of the best model only. Default is TRUE.
...: Additional arguments.
The log-likelihood can be returned either for the best model, chosen by an information criterion or by cross-validation according to the call to bsrr, or for all of the candidate models with model size and lambda in the original s.list and lambda.list (or along the iteration path). For "lm" fits it is assumed that the scale has been estimated (by maximum likelihood or REML), and all the constants in the log-likelihood are included.
A matrix or vector containing the log-likelihood for each model is returned. For bsrr objects fitted by the sequential method, each row of the returned matrix corresponds to a model size in s.list and each column to a shrinkage parameter in lambda.list. For bsrr objects fitted by the gsection, pgsection and psequential methods, the returned vector contains the log-likelihood of the fitted model in each iteration. The coefficients of those models can be extracted from beta.all and coef0.all in the bsrr object.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype,
                 snr = SNR, seed = seed)
lm.bsrr <- bsrr(Data$x, Data$y, method = "sequential")

logLik(lm.bsrr, best.model = FALSE)
Lymphoma patients data set
patient.data: A list with survival times, status and covariates from patients.
A subset of the data set of lymphoma patients used in the study of Alizadeh et al. (2000) and also Simon et al. (2011).
Alizadeh, A. A., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503. Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), 1.
Produces a coefficient profile plot of the coefficient or loss function paths.
## S3 method for class 'bsrr'
plot(
  x,
  type = c("tune", "coefficients"),
  lambda = NULL,
  sign.lambda = 0,
  breaks = T,
  K = NULL,
  ...
)
x: A "bsrr" object.
type: One of "tune" and "coefficients".
lambda: Used when type = "coefficients". Default is NULL.
sign.lambda: Used when type = "tune". Default is 0.
breaks: Used when type = "coefficients". Default is TRUE.
K: Used when type = "coefficients". Default is NULL.
...: Other graphical parameters to plot.
No return value, called for plot generation.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
bsrr.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(Data$x, Data$y, method = "pgsection")

# generate plots
plot(lm.bsrr)
Returns predictions from a fitted "bsrr" object.
## S3 method for class 'bsrr'
predict(object, newx, type = c("link", "response"), ...)
object: Output from the bsrr function.
newx: New data used for prediction. If omitted, the fitted linear predictors are used.
type: Either "link" or "response".
...: Additional arguments affecting the predictions produced.
The object returned depends on the type of family.
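For a binomial fit, the two options can be compared as in the sketch below; it assumes that "link" returns the linear predictor and "response" the fitted probability, following the usual GLM convention.

# Sketch: linear predictor vs. fitted probabilities for a logistic bsrr fit.
library(bestridge)

Data <- gen.data(200, 20, 5, rho = 0.4, family = "binomial", seed = 1)
fit  <- bsrr(Data$x[1:150, ], Data$y[1:150], family = "binomial")

eta  <- predict(fit, newx = Data$x[151:200, ], type = "link")
prob <- predict(fit, newx = Data$x[151:200, ], type = "response")

# Under the stated assumption, prob should equal plogis(eta).
all.equal(as.numeric(prob), as.numeric(plogis(eta)))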
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
bsrr.
#-------------------linear model----------------------#
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(x, y, method = "pgsection")

pred.bsrr <- predict(lm.bsrr, newx = x_new)

#-------------------logistic model----------------------#
# Generate simulated data
Data <- gen.data(n, p, k, rho, family = "binomial", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
logi.bsrr <- bsrr(x, y, tune = "cv", family = "binomial",
                  lambda.list = lambda.list, method = "sequential")

pred.bsrr <- predict(logi.bsrr, newx = x_new)

#-------------------coxph model----------------------#
# Generate simulated data
Data <- gen.data(n, p, k, rho, family = "cox", beta = Tbeta, scal = 10)
x <- Data$x[1:140, ]
y <- Data$y[1:140, ]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200, ]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
cox.bsrr <- bsrr(x, y, family = "cox", lambda.list = lambda.list)

pred.bsrr <- predict(cox.bsrr, newx = x_new)

#-------------------group selection----------------------#
beta <- rep(c(rep(1, 2), rep(0, 3)), 4)
Data <- gen.data(200, 20, 5, rho = 0.4, beta = beta, seed = 10)
x <- Data$x
y <- Data$y
group.index <- c(rep(1, 2), rep(2, 3), rep(3, 2), rep(4, 3),
                 rep(5, 2), rep(6, 3), rep(7, 2), rep(8, 3))
lm.groupbsrr <- bsrr(x, y, s.min = 1, s.max = 8, group.index = group.index)

pred.groupbsrr <- predict(lm.groupbsrr, newx = x_new)
Print the primary elements of the "bsrr" object.
## S3 method for class 'bsrr'
print(x, digits = max(5, getOption("digits") - 5), nonzero = FALSE, ...)
x: A "bsrr" object.
digits: Minimum number of significant digits to be used.
nonzero: Whether the output should only contain the non-zero coefficients.
...: Additional print arguments.
Prints the fitted model and returns it invisibly.
No return value, called for side effects.
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(Data$x, Data$y, lambda.list = lambda.list, method = "sequential")

print(lm.bsrr)
Data from a study by Stamey et al. (1989) to examine the association between prostate specific antigen (PSA) and several clinical measures that are potentially associated with PSA in men who were about to receive a radical prostatectomy. The variables are as follows:
lcavol: Log cancer volume
lweight: Log prostate weight
age: The man's age
lbph: Log of the amount of benign hyperplasia
svi: Seminal vesicle invasion; 1=Yes, 0=No
lcp: Log of capsular penetration
gleason: Gleason score
pgg45: Percent of Gleason scores 4 or 5
lpsa: Log PSA
A data frame with 97 observations on 9 variables
No return value
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E. and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II. Radical prostatectomy treated patients, Journal of Urology 16: 1076-1083.
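A hedged sketch of how this data set might be analysed with bsrr; the object name prostate and the use of lpsa as the response are assumptions, not confirmed by this page.

# Sketch only: the dataset object name `prostate` is an assumption.
library(bestridge)
data(prostate, package = "bestridge")

x <- as.matrix(prostate[, setdiff(colnames(prostate), "lpsa")])
y <- prostate$lpsa   # log PSA used as the response (assumed)

fit <- bsrr(x, y, family = "gaussian", tune = "cv", nfolds = 5)
summary(fit)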
Data from a subset of the Coronary Risk-Factor Study baseline survey, carried out in rural South Africa.
The variables are as follows:
sbp: Systolic blood pressure
tobacco: Cumulative tobacco consumption, in kg
ldl: Low-density lipoprotein cholesterol
adiposity: Adipose tissue concentration
famhist: Family history of heart disease (1=Present, 0=Absent)
typea: Score on test designed to measure type-A behavior
obesity: Obesity
alcohol: Current consumption of alcohol
age: Age of subject
chd: Coronary heart disease at baseline; 1=Yes 0=No
A data frame with 462 observations on 10 variables
No return value
Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities. South African Medical Journal 64: 430-436.
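A hedged sketch of a logistic fit on this data; the object name SAheart is an assumption, and the predictors are treated as numeric since famhist and chd are described above as 0/1 codes.

# Sketch only: the dataset object name `SAheart` is an assumption.
library(bestridge)
data(SAheart, package = "bestridge")

x <- as.matrix(SAheart[, setdiff(colnames(SAheart), "chd")])
y <- SAheart$chd

fit <- bsrr(x, y, family = "binomial", tune = "ebic")
coef(fit)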
Print a summary of the "bsrr" object.
## S3 method for class 'bsrr'
summary(object, ...)
object: A "bsrr" object.
...: Additional print arguments.
No return value
Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.
bsrr.
#-------------------linear model----------------------#
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k):floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(Data$x, Data$y, method = "pgsection")

summary(lm.bsrr)

#-------------------group selection----------------------#
beta <- rep(c(rep(1, 2), rep(0, 3)), 4)
Data <- gen.data(200, 20, 5, rho = 0.4, beta = beta, snr = 100, seed = 10)
group.index <- c(rep(1, 2), rep(2, 3), rep(3, 2), rep(4, 3),
                 rep(5, 2), rep(6, 3), rep(7, 2), rep(8, 3))
lm.groupbsrr <- bsrr(Data$x, Data$y, s.min = 1, s.max = 8, group.index = group.index)

summary(lm.groupbsrr)
Gene expression data (500 gene probes for 120 samples) from the microarray experiments of mammalian eye tissue samples of Scheetz et al. (2006).
A data frame with 120 rows and 501 variables, where the first variable is the expression level of TRIM32 gene, and the remaining 500 variables are 500 gene probes.
In this study, laboratory rats (Rattus norvegicus) were studied to learn about gene expression and regulation in the mammalian eye. Inbred rat strains were crossed and tissue extracted from the eyes of 120 animals from the F2 generation. Microarrays were used to measure levels of RNA expression in the isolated eye tissues of each subject. Of the 31,000 different probes, 18,976 were detected at a sufficient level to be considered expressed in the mammalian eye. For the purposes of this analysis, we treat one of those genes, Trim32, as the outcome. Trim32 is known to be linked with a genetic disorder called Bardet-Biedl Syndrome (BBS): the mutation (P130S) in Trim32 gives rise to BBS.
This data set contains 120 samples with 500 predictors. The 500 predictors are features with maximum marginal correlation to Trim32 gene.
T. Scheetz, K. Kim, R. Swiderski, A. Philp, T. Braun, K. Knudtson, A. Dorrance, G. DiBona, J. Huang, T. Casavant, V. Sheffield, E. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences of the United States of America, 2006.
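A hedged sketch of a high-dimensional fit on this data; the object name trim32 is an assumption, and the column layout follows the description above (column 1 is the TRIM32 expression level, columns 2-501 are the probes).

# Sketch only: the dataset object name `trim32` is an assumption.
library(bestridge)
data(trim32, package = "bestridge")

y <- trim32[, 1]
x <- as.matrix(trim32[, -1])

# n = 120 and p = 500, so marginal screening keeps the active-set search light.
fit <- bsrr(x, y, family = "gaussian", screening.num = 100)
summary(fit)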