Groupyr: Sparse Group Lasso in Python

Groupyr is a scikit-learn compatible implementation of the sparse group lasso linear model. It is intended for high-dimensional supervised learning problems where related covariates can be assigned to predefined groups.

The Sparse Group Lasso

The sparse group lasso [1] is a penalized regression approach that combines the group lasso penalty with the standard lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable \(\hat{y}\) from a feature matrix \(\mathbf{X}\), using

\[\hat{y} = \mathbf{X} \hat{\beta},\]

where the coefficients in \(\hat{\beta}\) characterize the relationship between the features and the target and must satisfy [1]

\[\hat{\beta} = \operatorname*{arg\,min}_{\beta} \frac{1}{2} \left\| y - \sum_{\ell = 1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} \right\|_2^2 + (1 - \alpha) \lambda \sum_{\ell = 1}^{G} \sqrt{p_{\ell}} \left\|\beta^{(\ell)}\right\|_2 + \alpha \lambda \left\|\beta\right\|_1,\]

where \(G\) is the total number of groups, \(\mathbf{X}^{(\ell)}\) is the submatrix of \(\mathbf{X}\) with columns belonging to group \(\ell\), \(\beta^{(\ell)}\) is the coefficient vector of group \(\ell\), and \(p_{\ell}\) is the length of \(\beta^{(\ell)}\). The model hyperparameter \(\alpha\) controls the combination of the group-lasso and the lasso, with \(\alpha=0\) giving the group lasso fit and \(\alpha=1\) yielding the lasso fit. The hyperparameter \(\lambda\) controls the strength of the regularization.
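To make the two penalty terms concrete, the following is a minimal NumPy sketch that evaluates the penalty for a given coefficient vector. The function name `sgl_penalty` and the example values are illustrative assumptions and are not part of groupyr.

```python
import numpy as np


def sgl_penalty(beta, groups, alpha, lam):
    """Evaluate the sparse group lasso penalty (hypothetical helper).

    beta   : 1-D coefficient vector
    groups : list of integer index arrays, one per group
    alpha  : mixing parameter (0 -> group lasso, 1 -> lasso)
    lam    : overall regularization strength
    """
    # Group lasso term: sum over groups of sqrt(p_l) * ||beta^(l)||_2
    group_term = sum(
        np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups
    )
    # Lasso term: ||beta||_1
    lasso_term = np.sum(np.abs(beta))
    return (1.0 - alpha) * lam * group_term + alpha * lam * lasso_term


# Illustrative values: two groups of two coefficients each.
beta = np.array([3.0, 4.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3])]

lasso_only = sgl_penalty(beta, groups, alpha=1.0, lam=1.0)  # |3| + |4|
group_only = sgl_penalty(beta, groups, alpha=0.0, lam=1.0)  # sqrt(2) * 5
```

With \(\alpha=1\) only the \(\ell_1\) norm contributes, and with \(\alpha=0\) only the size-weighted group norms contribute, matching the two limiting cases described above.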

Installation 

See the installation guide for installation instructions.

Usage

Groupyr is compatible with the scikit-learn API and its estimators offer the same instantiate, fit, predict workflow that will be familiar to scikit-learn users. See the API and examples for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with groupyr.

For syntactic parallelism with the scikit-learn ElasticNet estimator, we use the keyword l1_ratio to refer to SGL’s \(\alpha\) hyperparameter above that controls the mixture of group lasso and lasso penalties. In addition to keyword parameters shared with scikit-learn’s ElasticNet, ElasticNetCV, LogisticRegression, and LogisticRegressionCV estimators, users must specify the group assignments for the columns of the feature matrix X. This is done during estimator instantiation using the groups parameter, which accepts a list of numpy arrays, where the \(i\)-th array specifies the feature indices of the \(i\)-th group. If no grouping information is provided, the default behavior assigns all features to one group.
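As a sketch of the format described above, the snippet below builds a `groups` list for contiguous blocks of columns. The helper `contiguous_groups` and the wrapper `fit_sgl` are hypothetical names introduced for illustration. The wrapper assumes that groupyr exposes an `SGL` estimator accepting `l1_ratio`, `alpha`, and `groups` keywords, with `alpha` playing the role of the regularization strength as in ElasticNet; check the API documentation before relying on these names.

```python
import numpy as np


def contiguous_groups(group_sizes):
    """Build one index array per group for contiguous blocks of columns."""
    edges = np.cumsum([0, *group_sizes])
    return [np.arange(start, stop) for start, stop in zip(edges[:-1], edges[1:])]


# Nine features split into groups of 3, 2, and 4 columns.
groups = contiguous_groups([3, 2, 4])


def fit_sgl(X, y, groups, l1_ratio=0.5, alpha=0.1):
    """Fit a sparse group lasso model (assumed groupyr API, see lead-in)."""
    # Imported lazily so the group construction above works without groupyr.
    from groupyr import SGL

    return SGL(l1_ratio=l1_ratio, alpha=alpha, groups=groups).fit(X, y)
```

Here the \(i\)-th array in `groups` holds the column indices of the \(i\)-th group, so `groups[1]` refers to columns 3 and 4 of `X`.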

Groupyr also offers cross-validation estimators that automatically select the best values of the hyperparameters \(\alpha\) and \(\lambda\) using either an exhaustive grid search (with tuning_strategy="grid") or sequential model-based optimization (SMBO) using the scikit-optimize library (with tuning_strategy="bayes"). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn’s GridSearchCV because it makes use of warm-starting: the model is fit along a pre-defined regularization path, and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm-start strategy, because it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies.
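The warm-starting idea can be illustrated with a small, self-contained sketch. It is not groupyr's solver (groupyr relies on copt). Instead it uses a plain proximal gradient loop in NumPy, and all function names are hypothetical. The path is traversed from the largest to the smallest \(\lambda\), and each fit starts from the previous solution.

```python
import numpy as np


def sgl_prox(v, groups, l1_ratio, lam, step):
    """Proximal operator of the SGL penalty: soft-threshold, then group-shrink."""
    z = np.sign(v) * np.maximum(np.abs(v) - step * l1_ratio * lam, 0.0)
    out = np.zeros_like(z)
    for idx in groups:
        norm = np.linalg.norm(z[idx])
        thresh = step * (1.0 - l1_ratio) * lam * np.sqrt(len(idx))
        if norm > thresh:
            out[idx] = (1.0 - thresh / norm) * z[idx]
    return out


def fit_prox_grad(X, y, groups, l1_ratio, lam, beta0, tol=1e-8, max_iter=5000):
    """Proximal gradient descent on the SGL objective, started from beta0."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = beta0.copy()
    for n_iter in range(1, max_iter + 1):
        grad = X.T @ (X @ beta - y)
        new_beta = sgl_prox(beta - step * grad, groups, l1_ratio, lam, step)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta, n_iter
        beta = new_beta
    return beta, max_iter


def warm_start_path(X, y, groups, l1_ratio, lambdas):
    """Fit along a decreasing lambda path, reusing each solution as the next start."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):
        beta, n_iter = fit_prox_grad(X, y, groups, l1_ratio, lam, beta0=beta)
        path.append((lam, beta.copy(), n_iter))
    return path


# Synthetic data: only the first group carries signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
y = X[:, :2] @ np.array([2.0, -1.0])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]

lam_big = 2.0 * np.max(np.abs(X.T @ y)) + 1.0  # large enough to zero out all coefficients
lambdas = lam_big * np.array([1.0, 0.3, 0.1, 0.03, 0.01])
path = warm_start_path(X, y, groups, l1_ratio=0.5, lambdas=lambdas)
```

Each entry of `path` holds the \(\lambda\) value, the fitted coefficients, and the number of iterations used. Comparing the iteration counts against fits that each start from zero is one way to gauge the benefit of warm-starting on a given problem.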

API Documentation 

See the API Documentation for detailed documentation of the API.

Examples 

See the example gallery for a set of introductory examples.

Citing groupyr

If you use groupyr in a scientific publication, we would appreciate citations. Please see our citation instructions for the latest reference and a bibtex entry.

Acknowledgements

Groupyr development is supported through a grant from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).

The API design of groupyr was facilitated by the scikit-learn project template and it therefore borrows heavily from scikit-learn [2]. Groupyr relies on the copt optimization library [3] for its solver. The groupyr logo is a flipped silhouette of an image from J. E. Randall and is licensed CC BY-SA.
