
# The Problem of Irrelevant Features

Think for ten seconds about the following sequence of pairs of numbers: (4, 8), (9, 1), (16, 0), (25, 5), (36, 9). Can you guess the next pair in the sequence? You have probably worked out that the first element in each pair is the square of the numbers 2, 3, 4, 5 and 6, but the second element is completely random and unpredictable (and may have thrown you off the pattern in the first element). This encapsulates the essence of feature extraction: to make a prediction we must have features which are relevant to the quantity we are trying to predict. An irrelevant feature can result in worse prediction accuracy and higher computational cost.

Let's now put feature extraction in the context of a machine learning algorithm. Consider the popular Iris flower dataset which contains examples of three types of plant characterised by four features: sepal length, sepal width, petal length and petal width. The aim is to predict the plant type (setosa, versicolour or virginica) from these features. In this article, we introduce some key feature extraction algorithms and demonstrate them on real world datasets.

## A Classical Approach: Principal Components Analysis

In the Iris dataset we can see that the sepal width and length are strongly correlated (the black points represent Iris setosa plants, the red points are Iris versicolour, and the blue ones are Iris virginica plants):

Clearly, we could eliminate one of these features without losing much information. This is the general idea of Principal Components Analysis (PCA, [^n]). More precisely, PCA projects the examples onto a basis (a set of orthogonal vectors) which is oriented to maximise the variance (spread) of the projection. For the plot above this corresponds to projecting onto a diagonal line going towards the top-right of the graph. We can generalise this example so that if we start with n features, we can choose an output dimension k for the projection. Hence, we have chosen k new features, called Principal Components, that maximise the variance and hence capture the interesting information. The underlying mathematics is not too complicated, and there is a concise description here.
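To make the idea concrete, here is a minimal numpy sketch (my own illustration, not from the article) that computes principal components as eigenvectors of the covariance matrix on a toy two-feature dataset with correlated features, much like the sepal measurements above:

```python
import numpy

# Toy data: two strongly correlated features
rng = numpy.random.RandomState(0)
t = rng.randn(100)
X = numpy.column_stack([t, 0.9 * t + 0.1 * rng.randn(100)])
X = X - X.mean(0)  # centre the features

# Principal components are eigenvectors of the covariance matrix,
# ordered by eigenvalue (the variance captured along each direction)
cov = numpy.cov(X, rowvar=False)
eigvals, eigvecs = numpy.linalg.eigh(cov)
order = numpy.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the examples onto the first principal component (k = 1)
X_proj = X @ eigvecs[:, :1]

print(eigvecs[:, 0])               # roughly the diagonal direction
print(eigvals[0] / eigvals.sum())  # fraction of total variance captured
```

Because the two features are almost collinear, the first component points along the diagonal and captures nearly all the variance, which is exactly why the second component can be dropped.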

What does an application of PCA look like on the Iris dataset? The following example loads the Iris dataset and then computes the variance of each feature before and after an application of PCA.

import numpy
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

dataset = load_iris()

X = dataset["data"]

# Centre each feature and scale the variances to be unitary
X = preprocessing.scale(X)

# Compute the total variance across the columns
print(numpy.var(X, 0).sum())

# Now apply PCA with 3 components
pca = PCA(3)
X2 = pca.fit_transform(X)
print(numpy.var(X2, 0).sum())


The total variance of the features is 4 due to the way we standardised the data using scale, and after an application of PCA with 3 components the total variance becomes 3.97. This implies that almost all of the variance of the original 4 features is captured using a projection onto 3 vectors. The missing variance can be considered as noise and hence a prediction with the PCA features can often increase prediction accuracy.
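To see how that captured variance splits across the individual components, scikit-learn exposes the `explained_variance_ratio_` attribute of a fitted PCA object. A small self-contained sketch (my own, not from the article):

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Standardise the Iris features as before
X = preprocessing.scale(load_iris()["data"])

pca = PCA(3).fit(X)

# Per-component share of the total variance; useful for choosing k
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # roughly 0.99 with 3 components
```

The first component alone accounts for well over half of the variance, which is one common heuristic for deciding how many components to keep.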

PCA does come with some disadvantages, however. For example, it is not always clear how many components to choose, and the complexity of PCA is cubic in the number of features, which limits its use on high-dimensional datasets. Fortunately, scikit-learn has some alternatives:

• RandomizedPCA uses random projections to improve efficiency whilst losing only a small amount of accuracy.
• SparsePCA puts an L1 norm penalty on the projection vectors.
• MiniBatchSparsePCA is a faster but less accurate version of SparsePCA.

The sparse variants of PCA are sparse in the sense that their projection vectors have only a few non-zero entries, so each component depends on just a few of the original features; this can make the components easier to interpret and help them generalise better than those of plain PCA.
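As a brief sketch of what that sparsity looks like in practice (my own example, not from the article; the `alpha` value is an arbitrary choice), SparsePCA's L1 penalty drives many entries of the projection vectors to exactly zero:

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.decomposition import SparsePCA

X = preprocessing.scale(load_iris()["data"])

# alpha controls the strength of the L1 penalty on the projection vectors
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
spca.fit(X)

# Each row is one component; a number of entries are exactly zero
print(spca.components_)
print((spca.components_ == 0).sum(), "zero loadings")
```

Increasing `alpha` yields sparser components at the cost of capturing less variance.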

## Supervised Feature Extraction: Partial Least Squares

PCA is commonly used as a preprocessing step before predicting a target feature. However, in this scenario it makes sense to use the target in the feature extraction process, and this is the motivation behind Partial Least Squares (PLS). The intuition behind PLS is simple: find a projection vector that maximises the covariance between the projected examples and the label. Then remove the resulting component of the projection from the examples and repeat. The resulting projections form a new set of uncorrelated features, which are then used in least squares regression. For more details, see reference [^n].
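The find-project-deflate loop described above can be sketched in a few lines of numpy for a single target (a PLS1-style illustration of my own, not the article's code or scikit-learn's exact algorithm):

```python
import numpy
from sklearn import preprocessing
from sklearn.datasets import load_iris

dataset = load_iris()
X = preprocessing.scale(dataset["data"])
y = preprocessing.scale(dataset["target"].astype(float))

# The direction maximising covariance with the label is proportional to X^T y
w = X.T @ y
w /= numpy.linalg.norm(w)

# Project the examples onto w to obtain the first PLS component (scores)
t = X @ w

# Deflate: remove the component just found; the step would then repeat
p = (X.T @ t) / (t @ t)
X_deflated = X - numpy.outer(t, p)

# The deflated data is uncorrelated with the extracted component
print(numpy.abs(X_deflated.T @ t).max())  # ~0 up to floating point
```

Repeating this step on `X_deflated` produces the next component, and by construction the resulting features are mutually uncorrelated.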

Continuing from the previous code, the following example illustrates the use of PLS on Iris.

from sklearn.cross_decomposition import PLSRegression

y = dataset["target"]

pls = PLSRegression(3)
pls.fit(X, y)
X2 = pls.transform(X)
print(numpy.var(X2, 0).sum())


The output variance is 3.88, which, as expected, is less than that obtained with PCA, since PCA explicitly maximises variance.

The advantage of PLS comes when we make predictions. In the following code snippet, we compare PCA and PLS feature extraction in the context of classifying the plant types using a Support Vector Machine (SVM).

from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold
from sklearn.metrics import zero_one_loss

pca_error = 0
pls_error = 0
n_folds = 10

svc = LinearSVC()

for train_inds, test_inds in KFold(n_splits=n_folds).split(X):
    X_train, X_test = X[train_inds], X[test_inds]
    y_train, y_test = y[train_inds], y[test_inds]

    # Use PCA and then classify using an SVM
    X_train2 = pca.fit_transform(X_train)
    X_test2 = pca.transform(X_test)

    svc.fit(X_train2, y_train)
    y_pred = svc.predict(X_test2)
    pca_error += zero_one_loss(y_test, y_pred)

    # Use PLS and then classify using an SVM
    X_train2, y_train2 = pls.fit_transform(X_train, y_train)
    X_test2 = pls.transform(X_test)

    svc.fit(X_train2, y_train)
    y_pred = svc.predict(X_test2)
    pls_error += zero_one_loss(y_test, y_pred)

print(pca_error / n_folds)
print(pls_error / n_folds)


Note that we have used 10-fold cross-validation to evaluate the error on different train/test splits. The respective errors with PCA and PLS are 0.0867 and 0.08, hence there is a small improvement when using PLS. More compelling evidence can be found, for example, in the article "Efficient sparse kernel feature extraction based on partial least squares", which also discusses sparse variants of PLS that come with efficiency and generalisation advantages.