L1General
Mark Schmidt (2006)
L1General is a set of Matlab routines implementing several of the available
strategies for solving L1-regularization problems.
Specifically, they solve the problem of minimizing the sum of a differentiable
function f(x) and a (weighted) penalty on the absolute values of the
parameters:
min_x: f(x) + sum_i v_i |x_i|
The v_i must be non-negative, but can be zero if we do not want to
penalize some variables.
There is a lot of interest in solving problems of this form from a variety of
communities, such as signal processing and statistics. This is because the
penalty on the absolute values of the variables encourages the solution to
be sparse (many variables are set to exactly 0).
The absolute value function is not differentiable when a variable is zero,
which makes the problem harder than a standard differentiable
optimization problem. However, the
absolute value function has a lot of nice properties, and there are a wide
variety of interesting approaches for solving L1-regularization problems. Usually these
focus on the case where f(x) is the linear squared error
function or the logistic regression negative log-likelihood.
Instead of focusing on a specific form of f(x), the L1General software
only assumes that the user can provide a 'black box' function that
returns f(x) and its gradient for a given parameter setting. This is
similar to optimization codes for differentiable optimization like minFunc.
The purpose of writing the L1General codes was to compare the performance of
different optimization strategies in this black box setting. Although treating
the function as a black box means that the codes don't fully take advantage of the
structure of the objective, this perspective makes it very easy to use the codes to
solve a variety of different L1-regularization problems.
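As a concrete illustration, a least-squares objective can be wrapped as a black-box
function of this form (the data and problem size below are purely illustrative):
>> A = randn(50,10);                                   % example design matrix
>> b = randn(50,1);                                    % example targets
>> funObj = @(x)deal(0.5*norm(A*x-b)^2, A'*(A*x-b));   % returns f(x) and its gradient
Here the anonymous function uses deal so that calling [f,g] = funObj(x) returns both
the function value and the gradient.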
Usage
All the methods have a common interface:
>> x = L1General2_PSSgb(funObj,x0,lambda,options);
The input parameters are:
- funObj: an anonymous function that returns the function and gradient
value given a parameter setting 'x'. The second-order methods also require
funObj to return the Hessian.
- x0: an initial guess of the optimal solution. Some of the methods
will take substantially fewer iterations if this is close to the optimal
solution.
- lambda: a vector that is the same size as x0, giving the (non-negative)
weights on the penalties on the absolute values of the parameters.
Often, these will all be set to the same positive value. To avoid penalizing
a variable, set the corresponding element of this vector to 0.
- options: a structure that specifies any non-default parameter
settings you want the method to use. This is an optional argument.
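Putting these pieces together with the least-squares sketch from above, a complete
call might look like the following (the choice of weights, and leaving the first
variable unpenalized, are only illustrative):
>> x0 = zeros(10,1);                         % initial guess
>> lambda = ones(10,1);                      % equal weight on every variable
>> lambda(1) = 0;                            % leave the first variable unpenalized (e.g., a bias)
>> x = L1General2_PSSgb(funObj,x0,lambda);   % the options argument can simply be omitted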
You can replace L1General2_PSSgb with one of the other functions to use
a different method. For example, you could use L1GeneralProjection
instead. Roughly, the (newer) 'L1General2' methods only require funObj
to return the function and gradient value, while the (older) 'L1General'
methods also require that funObj returns the matrix of second
derivatives. Alternately, the older methods use a BFGS approximation if you set options.order = 1.
The list of available methods is given in the updates section of this webpage.
Download and Examples
To use the code, download and unzip L1General.zip.
Then, in Matlab, type:
>> cd L1General % Change to the unzipped directory
>> addpath(genpath(pwd)) % Add all directories to the path
>> mexAll % Compile mex files (not necessary on all systems)
>> example_L1General % Runs a demo of the (older) L1General codes
>> demo_L1General % Runs a demo of the (newer) L1General codes
>> demo_L1Generalpath % Runs a demo of computing a regularization path
The first two demos apply a variety of the different methods available to solve the
same problem with the older and newer codes (respectively), pausing between running
each method. The third demo shows how to embed the
codes in an active-set continuation method for handling a larger number of
variables and/or for regularization path calculation.
I prepared a more extensive set of examples of using L1General on the webpage
Examples of using L1General. Rather than comparing the performance of different
optimization methods, these demos demonstrate the 'plug and play' nature of the code.
The full list of demos is:
- LASSO
- Elastic Net
- Logistic Regression
- Logistic Regression (larger number of variables)
- Lasso Regularization Path
- Huber Robust Regression Regularization Path
- Logistic Regression Regularization Path
- Probit Regression, Smooth SVM, Huberized SVM
- Non-Parametric Logistic Regression with Sparse Prototypes
- Multinomial Logistic Regression
- Compressed Sensing
- Chain-Structured Conditional Random Field
- Graphical Lasso
- Markov Random Field Structure Learning
- Neural Network with Sparse Connections
- Deep Neural Network with Sparse Connections
The code, some documentation, and examples of the results of these demos can be
found here. Note that these demos
use the older code but any of the newer codes could be used in its place.
Citations
If you use this code in a publication, please cite one of the following works:
- Mark Schmidt. Graphical Model Structure Learning with
L1-Regularization. Ph.D. Thesis, University of British Columbia, 2010.
- Mark Schmidt, Glenn Fung, Romer Rosales. Optimization Methods for
L1-Regularization. UBC Technical Report TR-2009-19, 2009.
- Mark Schmidt, Glenn Fung, Romer Rosales. Fast Optimization Methods for L1
Regularization: A Comparative Study and Two New Approaches. European
Conference on Machine Learning (ECML), 2007.
Warm-Starting and Sub-Optimization
Note that the performance of many of the methods is strongly dependent on how close the initial parameter vector x0 is to the optimal solution. Further, some methods may terminate before reaching a minimizer if they are not started sufficiently close to one.
One way to obtain a good initial guess is to use the solution for a close value of lambda. When solving for multiple values of lambda, this warm-starting strategy can often substantially reduce the number of iterations. Another way to improve performance is to only optimize over a subset of non-zero variables. This sub-optimization step may be able to make substantial progress while only computing the partial derivatives with respect to a subset of the variables. However, this requires extra coding on the part of the user as it violates the black-box principle behind the code. Both warm-starting and sub-optimization are demonstrated in the special case of logistic regression in the demo_L1Generalpath demo.
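A rough sketch of the warm-starting idea over a decreasing sequence of regularization
levels (the particular values, problem size, and choice of L1General2_PSSgb are only
illustrative, re-using the funObj from the Usage section):
lambdaValues = [100 50 25 10 5 1];              % decreasing regularization levels
x = zeros(10,1);                                % start from zero at the strongest level
for i = 1:length(lambdaValues)
    lambda = lambdaValues(i)*ones(10,1);
    x = L1General2_PSSgb(funObj,x,lambda);      % warm-start from the previous solution
end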
Updates
There have been several updates to the L1General software:
2019
Updated the "mexAll.m" function to be compatible with the current versions of Matlab, by adding the "-compatibleArrayDims" flag (thanks to Hatice Ulku Osmanbeyoglu).
2010
This update contains the 'L1General2' codes used in my thesis:
- Mark Schmidt. Graphical Model Structure Learning with
L1-Regularization. Ph.D. Thesis, University of British Columbia, 2010 (pdf).
The methods available in L1General2 are:
- L1General2_SPG: Spectral projected gradient.
- L1General2_BBST: Barzilai-Borwein soft-threshold.
- L1General2_BBSG: Barzilai-Borwein sub-gradient.
- L1General2_OPG: Optimal projected gradient.
- L1General2_DSST: Diagonally-scaled soft-threshold.
- L1General2_OWL: Orthant-wise learning.
- L1General2_AS: Active set.
- L1General2_TMP: Two-metric projection.
- L1General2_PSSgb: Projected scaled sub-gradient (Gafni-Bertsekas
variant).
- L1General2_PSSsp: Projected scaled sub-gradient (sign projection
variant).
- L1General2_PSSas: Projected scaled sub-gradient (active set variant).
The L1General2 codes focus on
limited-memory (L-BFGS) versions of the most effective methods from L1General,
as well as other available first-order solvers.
This update also added several strategies to the older L1General code:
- L1GeneralCoordinateDescent: Added a fully greedy variant.
- L1GeneralPatternSearch: A generalization of the pattern-search
method.
- L1GeneralProjectedSubGradient: A two-metric sub-gradient projection
method.
- L1GeneralProjectedSubGradientBB: A projected sub-gradient method
with the Barzilai-Borwein step size.
- L1GeneralCompositeGradientAccelerated: An implementation of
the accelerated iterative soft-thresholding method.
2008
An update of the original codes, with several variants added,
set up as an on-line appendix to the technical report:
- Mark Schmidt, Glenn Fung, Romer Rosales. Optimization Methods for
L1-Regularization. UBC Technical Report TR-2009-19, 2009 (pdf).
The major updates and new methods added were:
- L1GeneralSubGradient: Added a variant that greedily adds a variable
after each iteration.
- L1GeneralOrthantWise: An implementation of the orthant-wise
method.
This version added variants of all methods that don't require the exact Hessian but
instead use a BFGS approximation (this feature can be turned on by setting
options.order = 1), and can still be downloaded here.
2007
The first web release of L1General, set up as an on-line appendix
to the paper:
- Mark Schmidt, Glenn Fung, Romer Rosales. Fast Optimization Methods for L1
Regularization: A Comparative Study and Two New Approaches. European
Conference on Machine Learning (ECML), 2007 (pdf).
The methods present in this initial release were:
- L1GeneralCoordinateDescent: Cyclic and partially greedy
coordinate descent methods based on an exact line search.
- L1GeneralGrafting: An active-set sub-gradient method that greedily
adds a variable after each sub-optimization.
- L1GeneralSubGradient: A naive scaled sub-gradient method.
- L1GeneralUnconstrainedApx: Solving a smoothed version of the problem
with an unconstrained optimizer (several smoothing methods and a continuation
strategy are available).
- L1GeneralIteratedRidge: A bound-optimization approach based on a
scale-mixture representation of the regularizer.
- L1GeneralSequentialQuadraticProgramming: A projected Newton approach
applied to a bound-constrained re-formulation.
- L1GeneralProjection: A two-metric projection approach applied to a
bound-constrained re-formulation.
- L1GeneralPrimalDualLogBarrier: A primal-dual interior-point method
applied to a bound-constrained re-formulation.
This older paper focused on scenarios where the Hessian matrix is available;
the corresponding version of the code can still be downloaded here.
Group L1-Regularization
If you want to solve general 'group' L1-regularization problems, several
methods are available via the L1GeneralGroup and minConf packages
in my thesis code.