
Approaches to Linear Regression

Linear regression assumes a linear relationship between the examples and their corresponding labels. We previously talked about ridge regression, a simple way to perform regression that can be coded in a few lines. Here we recap ridge regression and go into some detail on the LASSO (Least Absolute Shrinkage and Selection Operator) and Elastic Net. These algorithms are broadly similar to ridge regression but differ in how they regularise the objective function. Regularisation, broadly, is a way to prevent overfitting to the training dataset. As well as outlining the algorithms, we will also see how they differ in their performance on some example datasets.

Data Setup

Let's start with a definition of the data used for learning. We have \(n\) observations or examples \(\textbf{x}_1, \textbf{x}_2, \ldots, \textbf{x}_n\) and a corresponding set of labels \(y_1, y_2, \ldots, y_n\). Each example is a vector/array of numbers consisting of \(m\) features, and the labels are scalars. For instance, each example could be a vector of properties of a person (age, education, gender, ethnicity, occupation) and the label could be their annual income. Let's say that there is some function linking \(\textbf{x}\) and \(y\) so that \(f^*(\textbf{x}) \approx y\). Due to the nature of the problem, in general we can never recover \(f^*\) (owing to the finite number of examples and noisy data, amongst other things), but we can recover a function \(f\) which is a reasonable approximation of \(f^*\). The important aspect of \(f\) is that it makes good predictions for \(y\) not just on the training data, but on new unseen examples. Let's assume that the features are centered, in other words that they have zero mean.

Ridge Regression

So here is one way of finding the function \(f\): try to minimise the error between the \(i\)th predicted and actual label using the square of the difference, i.e. minimise \((y_i - \textbf{x}_i^T \textbf{w})^2\) for all examples. The prediction we are making for the \(i\)th label in this case is \(\textbf{x}_i^T \textbf{w}\). Notice that we are choosing functions which are the dot products between the example \(\textbf{x}_i\) and a weight vector \(\textbf{w}\). This form is easy to study analytically. Note that if \(\textbf{x}_i^T \textbf{w} = y_i\) then the error becomes zero for that example. Putting the pieces together results in the following problem

$$ \min_{\textbf{w}} \frac{1}{2} \sum_{i=1}^n (y_i - \textbf{x}_i^T \textbf{w})^2 + \lambda \|\textbf{w}\|^2_2, $$

where \(\|\textbf{w}\|^2_2\) is the square of the L2 norm of \(\textbf{w}\) and equal to \(\textbf{w}^T\textbf{w}\) (the squared magnitude of this vector), and \(\lambda\) is a user-defined regularisation parameter. The first term is simply the sum of the squared errors over the examples. The second term may not be completely obvious, but it is required to improve the generalisation ability of this learner: it penalises large weights and is called a regulariser.
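Since this objective is quadratic in \(\textbf{w}\), setting its gradient to zero gives a closed-form solution, \(\textbf{w} = (X^T X + 2\lambda I)^{-1} X^T \textbf{y}\), where \(X\) is the \(n \times m\) matrix whose rows are the examples (the factor of 2 arises because the \(\frac{1}{2}\) multiplies the loss but not the regulariser). A minimal NumPy sketch of this solution, assuming centered features as above:

import numpy

def ridge_weights(X, y, lam):
    # Solve (X^T X + 2*lam*I) w = X^T y for the ridge weight vector
    m = X.shape[1]
    return numpy.linalg.solve(X.T @ X + 2 * lam * numpy.eye(m), X.T @ y)

This is the sense in which ridge regression can be "coded in a few lines".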

The LASSO

The LASSO is similar to ridge regression, but with a key difference: the regularisation term is now the L1 norm as opposed to the L2:

$$ \min_{\textbf{w}} \frac{1}{2} \sum_{i=1}^n (y_i - \textbf{x}_i^T \textbf{w})^2 + \lambda \|\textbf{w}\|_1, $$

where \(\|\textbf{w}\|_1 = \sum_{i=1}^m |w_i| \) is the L1 norm and there are \(m\) features in the input data.

What difference does the change in the regularisation term make? To get some insight, let's look at the plots of fixed values of the L1 and L2 norms (below).

[Figure vectornorms: the L2 unit ball (left) and the L1 unit ball (right)]

The plot on the left shows the curve \(\|\textbf{w}\|_2 = 1\) and the one on the right \(\|\textbf{w}\|_1 = 1\) for 2-dimensional \(\textbf{w}\). We can see that the L1 ball has corners, and at each corner some elements of the weight vector are exactly zero. This implies a tendency towards sparse solutions: for a given L2 norm, a sparse vector has a smaller L1 norm (for example, \((1, 0)\) and \((0.71, 0.71)\) have roughly the same L2 norm, but L1 norms of 1 and 1.42 respectively), so the L1 penalty favours it. In other words, when some of the elements of \(\textbf{w}\) are zero, only some of the features of the data are being used for prediction. Hence LASSO performs feature selection in conjunction with regression, which is particularly useful if there are a large number of irrelevant features.
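Unlike ridge regression, the LASSO has no closed-form solution, but it can be solved by coordinate descent, where each weight is updated in turn via a soft-thresholding step that sets small values exactly to zero. The toy sketch below illustrates the idea for the objective above; it is not how scikit-learn implements LASSO, and the function names are purely illustrative.

import numpy

def soft_threshold(z, gamma):
    # Shrink z towards zero; values with |z| <= gamma become exactly 0
    return numpy.sign(z) * numpy.maximum(numpy.abs(z) - gamma, 0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    # Toy coordinate descent for (1/2) * sum((y - Xw)^2) + lam * ||w||_1
    n, m = X.shape
    w = numpy.zeros(m)
    for _ in range(n_iters):
        for j in range(m):
            # Residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return w

The soft_threshold step is where the corners of the L1 ball appear in practice: any coefficient whose contribution falls below \(\lambda\) is set exactly to zero, which is how the sparsity arises.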

Elastic Net

A disadvantage of LASSO is that when you have more features than examples, i.e. \(m > n\), it selects at most \(n\) features. Furthermore, when features are correlated, LASSO tends to pick just one feature from each correlated group, and ridge regression tends to perform better in this setting. Elastic Net generalises both of these algorithms by penalising in terms of both the L1 and L2 norms:

$$ \min_{\textbf{w}} \frac{1}{2} \sum_{i=1}^n (y_i - \textbf{x}_i^T \textbf{w})^2 + \alpha \lambda \|\textbf{w}\|_1 + (1 - \alpha)\lambda \|\textbf{w}\|_2^2, $$

where \(\alpha \in [0, 1]\) is a compromise between penalising the L1 and L2 norm, and \(\lambda\) is the penalty parameter. It is easy to see that when \(\alpha = 1\) we recover LASSO and when \(\alpha = 0\) we get ridge regression.

Some Examples

Having explained ridge regression, LASSO and Elastic Net, let's now look at their performance on some example datasets. We start with a synthetic dataset composed of 100 examples and 50 features, of which 30 are informative for the label. Handily, scikit-learn has a function to generate these kinds of datasets:

from sklearn.datasets import make_regression

sigma = 0.5
# 100 examples, 50 features (30 informative), noisy labels and correlated features
X, y, v = make_regression(n_samples=100, n_features=50, n_informative=30, noise=sigma, effective_rank=20, coef=True)

Here sigma is the standard deviation of Gaussian noise added to the labels, and effective_rank limits the rank of the feature matrix so that some features are correlated. We then process the data in the following way:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

The StandardScaler simply ensures that the features in the examples (i.e. the columns of X) have mean 0 and standard deviation 1. The data is then split into training and test sets with approximately 66% in the training set.

Next, we will perform model selection for each regression approach in turn on the training set and get the mean squared error on the test set:

import numpy
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

folds = 5
alphas = numpy.linspace(10.0**-3, 10, 200)

mses = numpy.zeros_like(alphas)

# The synthetic data is generated without a bias term and the features are
# standardised, so we set fit_intercept=False throughout.
# Cross-validate over the regularisation parameter using the training set only.
for i, alpha in enumerate(alphas):
    learner = Ridge(alpha=alpha, fit_intercept=False)
    scores = cross_val_score(learner, X_train, y_train, cv=folds, scoring="neg_mean_squared_error")
    mses[i] = numpy.abs(scores.mean())

# Refit with the best alpha on the full training set and evaluate on the test set.
learner = Ridge(alpha=alphas[numpy.argmin(mses)], fit_intercept=False)
learner.fit(X_train, y_train)
y_pred = learner.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Ridge Regression: alpha={:.4f} mse={:.3f} nnz={}".format(alphas[numpy.argmin(mses)], mse, numpy.count_nonzero(learner.coef_)))

In the code snippet above, alpha is the regularisation parameter for ridge regression, and we try 200 values between 0.001 and 10. The for loop cross-validates ridge regression on the training set and records a mean squared error for each value of alpha. Then, using the alpha with the smallest cross-validation error, we output an error for the test set, as well as the number of non-zero coefficients of the weight vector in the final line.

More or less the same process is performed for LASSO and Elastic Net. With Elastic Net there is an additional parameter, l1_ratio, which corresponds to \(\alpha\) in our description above (the alpha parameter in the code plays the role of \(\lambda\)); these two loops are sketched below.
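The following is a sketch of how the LASSO and Elastic Net runs might look, mirroring the ridge loop above; the exact grids (for instance the l1_ratio values) are assumptions rather than the original code.

from sklearn.linear_model import Lasso, ElasticNet

# LASSO: the same search over the regularisation parameter as for ridge
mses = numpy.zeros_like(alphas)
for i, alpha in enumerate(alphas):
    learner = Lasso(alpha=alpha, fit_intercept=False)
    scores = cross_val_score(learner, X_train, y_train, cv=folds, scoring="neg_mean_squared_error")
    mses[i] = numpy.abs(scores.mean())

learner = Lasso(alpha=alphas[numpy.argmin(mses)], fit_intercept=False)
learner.fit(X_train, y_train)
mse = mean_squared_error(y_test, learner.predict(X_test))
print("LASSO: alpha={:.4f} mse={:.3f} nnz={}".format(alphas[numpy.argmin(mses)], mse, numpy.count_nonzero(learner.coef_)))

# Elastic Net: search over alpha and l1_ratio jointly
l1_ratios = numpy.linspace(0, 1, 11)
mses = numpy.zeros((alphas.shape[0], l1_ratios.shape[0]))
for i, alpha in enumerate(alphas):
    for j, l1_ratio in enumerate(l1_ratios):
        learner = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
        scores = cross_val_score(learner, X_train, y_train, cv=folds, scoring="neg_mean_squared_error")
        mses[i, j] = numpy.abs(scores.mean())

i, j = numpy.unravel_index(numpy.argmin(mses), mses.shape)
learner = ElasticNet(alpha=alphas[i], l1_ratio=l1_ratios[j], fit_intercept=False)
learner.fit(X_train, y_train)
mse = mean_squared_error(y_test, learner.predict(X_test))
print("Elastic Net: alpha={:.4f} l1_ratio={:.4f} mse={:.3f} nnz={}".format(alphas[i], l1_ratios[j], mse, numpy.count_nonzero(learner.coef_)))

Here is the output for all three approaches: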

Ridge Regression: alpha=0.0763 mse=69.951 nnz=50
LASSO: alpha=0.2922 mse=31.366 nnz=34
Elastic Net: alpha=0.2922 l1_ratio=1.0000 mse=31.366 nnz=34

Notice that LASSO is significantly better than ridge regression in terms of the error and picks fewer features for the prediction (34 vs 50). Elastic Net matches LASSO because it selects an l1_ratio of 1, which corresponds exactly to LASSO. Clearly using the L1 norm helps for this dataset, since only 30 of the 50 features are relevant; ridge regression uses all of the features, so its performance suffers.

Next, we generate a dataset composed of 100 examples and 200 features, of which 150 are informative for the label, and repeat the same model selection process (the generation call is sketched below).
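A sketch of the second dataset's generation, assuming the same noise level and effective_rank setting as in the first experiment:

X, y, v = make_regression(n_samples=100, n_features=200, n_informative=150, noise=sigma, effective_rank=20, coef=True)

This results in the following output: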

Ridge Regression: alpha=0.0010 mse=2069.554 nnz=200
LASSO: alpha=10.0000 mse=2400.141 nnz=5
Elastic Net: alpha=0.0010 l1_ratio=0.0000 mse=2070.300 nnz=200

Clearly LASSO suffers in this case, with an error worse than ridge regression and only 5 features selected despite 150 being informative. Elastic Net once again represents the best of both worlds: it chooses l1_ratio=0, corresponding to ridge regression, with the same alpha value and nearly the same error.

The full code for these examples is given here.

Summary

We described three linear regression approaches with different regularisation terms: ridge regression, LASSO and Elastic Net. Respectively, the algorithms use L2, L1 and a mixture of L2 and L1 regularisation. On a pair of synthetic datasets we demonstrated the performance of the approaches. Elastic Net is a generalisation of ridge regression and LASSO and in principle should at least match their performance, at the cost of extra computation for model selection.

Further Reading

  • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. New York: Springer Series in Statistics, 2001.
