*Groupyr*: Sparse Group Lasso in Python ======================================= *Groupyr* is a scikit-learn compatible implementation of the sparse group lasso linear model. It is intended for high-dimensional supervised learning problems where related covariates can be assigned to predefined groups. The Sparse Group Lasso ---------------------- The sparse group lasso [1]_ is a penalized regression approach that combines the group lasso with the normal lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable :math:`\hat{y}` from a feature matrix :math:`\mathbf{X}`, using .. math:: \hat{y} = \mathbf{X} \hat{\beta}, where the coefficients in :math:`\hat{\beta}` characterize the relationship between the features and the target and must satisfy [1]_ .. math:: \hat{\beta} = \min_{\beta} \frac{1}{2} || y - \sum_{\ell = 1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} ||_2^2 + (1 - \alpha) \lambda \sum_{\ell = 1}^{G} \sqrt{p_{\ell}} ||\beta^{(\ell)}||_2 + \alpha \lambda ||\beta||_1, where :math:`G` is the total number of groups, :math:`\mathbf{X}^{(\ell)}` is the submatrix of :math:`\mathbf{X}` with columns belonging to group :math:`\ell`, :math:`\beta^{(\ell)}` is the coefficient vector of group :math:`\ell`, and :math:`p_{\ell}` is the length of :math:`\beta^{(\ell)}`. The model hyperparameter :math:`\alpha` controls the combination of the group-lasso and the lasso, with :math:`\alpha=0` giving the group lasso fit and :math:`\alpha=1` yielding the lasso fit. The hyperparameter :math:`\lambda` controls the strength of the regularization. .. toctree:: :hidden: :titlesonly: Home .. toctree:: :maxdepth: 3 :hidden: install auto_examples/index getting_help api FAQ contributing Groupyr on GitHub `Installation `_ ------------------------------ See the `installation guide `_ for installation instructions. Usage ----- *Groupyr* is compatible with the scikit-learn API and its estimators offer the same instantiate, ``fit``, ``predict`` workflow that will be familiar to scikit-learn users. See the `API `_ and `examples `_ for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with *groupyr*. For syntactic parallelism with the scikit-learn ``ElasticNet`` estimator, we use the keyword ``l1_ratio`` to refer to SGL's :math:`\alpha` hyperparameter above that controls the mixture of group lasso and lasso penalties. In addition to keyword parameters shared with scikit-learn's ``ElasticNet``, ``ElasticNetCV``, ``LogisticRegression``, and ``LogisticRegressionCV`` estimators, users must specify the group assignments for the columns of the feature matrix ``X``. This is done during estimator instantiation using the ``groups`` parameter, which accepts a list of numpy arrays, where the :math:`i`-th array specifies the feature indices of the :math:`i`-th group. If no grouping information is provided, the default behavior assigns all features to one group. *Groupyr* also offers cross-validation estimators that automatically select the best values of the hyperparameters :math:`\alpha` and :math:`\lambda` using either an exhaustive grid search (with ``tuning_strategy="grid"``) or sequential model based optimization (SMBO) using the scikit-optimize library (with ``tuning_strategy="bayes"``). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn's ``GridSearchCV`` because it makes use of warm-starting, where the model is fit along a pre-defined regularization path and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm start strategy; it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide `examples `_ of both strategies. `API Documentation `_ ------------------------------- See the `API Documentation `_ for detailed documentation of the API. `Examples `_ -------------------------------------- And look at the `example gallery `_ for a set of introductory examples. Citing groupyr -------------- If you use *groupyr* in a scientific publication, we would appreciate citations. Please see our `citation instructions `_ for the latest reference and a bibtex entry. Acknowledgements ---------------- *Groupyr* development is supported through a grant from the `Gordon and Betty Moore Foundation `_ and from the `Alfred P. Sloan Foundation `_ to the `University of Washington eScience Institute `_, as well as `NIMH BRAIN Initiative grant 1RF1MH121868-01 `_ to Ariel Rokem (University of Washington). The API design of *groupyr* was facilitated by the `scikit-learn project template`_ and it therefore borrows heavily from `scikit-learn`_ [2]_. *Groupyr* relies on the copt optimization library [3]_ for its solver. The *groupyr* logo is a flipped silhouette of an `image from J. E. Randall`_ and is licensed `CC BY-SA`_. .. _scikit-learn project template: https://github.com/scikit-learn-contrib/project-template .. _scikit-learn: https://scikit-learn.org/stable/index.html .. _image from J. E. Randall: https://commons.wikimedia.org/wiki/File:Epinephelus_amblycephalus,_banded_grouper.jpg .. _CC BY-SA: https://creativecommons.org/licenses/by-sa/3.0 References ---------- .. [1] Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245. .. [2] Pedregosa et al. (2011). `Scikit-learn: Machine Learning in Python`_. Journal of Machine Learning Research, 12, 2825-2830; Buitnick et al. (2013). `API design for machine learning software: experiences from the scikit-learn project`_. ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108-122. .. [3] Pedregosa et al. (2020). `copt: composite optimization in Python`__. DOI:10.5281/zenodo.1283339. .. _Scikit-learn\: Machine Learning in Python: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html .. _API design for machine learning software\: experiences from the scikit-learn project: https://arxiv.org/abs/1309.0238 .. __: http://openopt.github.io/copt/