Groupyr: Sparse Group Lasso in Python
Groupyr is a scikit-learn compatible implementation of the sparse group lasso linear model. It is intended for high-dimensional supervised learning problems where related covariates can be assigned to predefined groups.
The Sparse Group Lasso
The sparse group lasso [1] is a penalized regression approach that combines the group lasso with the lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable \(\hat{y}\) from a feature matrix \(\mathbf{X}\), using

\[\hat{y} = \mathbf{X} \hat{\beta},\]

where the coefficients in \(\hat{\beta}\) characterize the relationship between the features and the target and must satisfy [1]

\[\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \frac{1}{2} \left\| y - \sum_{\ell=1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} \right\|_2^2 + (1 - \alpha) \lambda \sum_{\ell=1}^{G} \sqrt{p_{\ell}} \left\| \beta^{(\ell)} \right\|_2 + \alpha \lambda \left\| \beta \right\|_1,\]

where \(G\) is the total number of groups, \(\mathbf{X}^{(\ell)}\) is the submatrix of \(\mathbf{X}\) with columns belonging to group \(\ell\), \(\beta^{(\ell)}\) is the coefficient vector of group \(\ell\), and \(p_{\ell}\) is the length of \(\beta^{(\ell)}\). The model hyperparameter \(\alpha\) controls the combination of the group lasso and the lasso, with \(\alpha=0\) giving the group lasso fit and \(\alpha=1\) yielding the lasso fit. The hyperparameter \(\lambda\) controls the strength of the regularization.
Installation
See the installation guide for installation instructions.
Usage
Groupyr is compatible with the scikit-learn API and its estimators offer the same instantiate, fit, predict workflow that will be familiar to scikit-learn users. See the API and examples for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with groupyr.
For syntactic parallelism with the scikit-learn ElasticNet estimator, we use the keyword l1_ratio to refer to SGL's \(\alpha\) hyperparameter above, which controls the mixture of group lasso and lasso penalties. In addition to the keyword parameters shared with scikit-learn's ElasticNet, ElasticNetCV, LogisticRegression, and LogisticRegressionCV estimators, users must specify the group assignments for the columns of the feature matrix X. This is done during estimator instantiation using the groups parameter, which accepts a list of numpy arrays, where the \(i\)-th array specifies the feature indices of the \(i\)-th group. If no grouping information is provided, the default behavior assigns all features to one group.
Groupyr also offers cross-validation estimators that automatically select the best values of the hyperparameters \(\alpha\) and \(\lambda\) using either an exhaustive grid search (with tuning_strategy="grid") or sequential model-based optimization (SMBO) using the scikit-optimize library (with tuning_strategy="bayes"). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn's GridSearchCV because it makes use of warm starting: the model is fit along a pre-defined regularization path, and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm-start strategy, since it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies.
API Documentation
See the API Documentation for detailed documentation of the API.
Examples
See the example gallery for a set of introductory examples.
Citing groupyr
If you use groupyr in a scientific publication, we would appreciate citations. Please see our citation instructions for the latest reference and a BibTeX entry.
Acknowledgements
Groupyr development is supported through a grant from the Gordon and Betty Moore Foundation and from the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).
The API design of groupyr was facilitated by the scikit-learn project template and it therefore borrows heavily from scikit-learn [2]. Groupyr relies on the copt optimization library [3] for its solver. The groupyr logo is a flipped silhouette of an image from J. E. Randall and is licensed CC BY-SA.