
Essential Descriptive Statistics in Pandas

The ultimate goal of machine learning is to make generalisable predictions based on data. The form of the features and the associated domain knowledge are incredibly important in this endeavour. For example, instead of using raw pixel values for face recognition, more sophisticated feature representations yield better results[1]. Hence a preliminary step in any machine learning pipeline is, at a minimum, understanding the data through an examination of its features. This article outlines some essential descriptive statistics for feature analysis, along with the corresponding code implemented using Pandas in Python.

Univariate Statistics

First let's define the dataset as a sample \(S\) of \(m\) observations/examples of real vectors \(\{\textbf{x}_1, \ldots, \textbf{x}_m\}\), such that the \(\textbf{x}_i\)s are \(n\)-dimensional (i.e. there are \(n\) features/variables) and sampled independently of one another. Of course we could consider other types of data such as strings or graphs, but vectors are more commonly used and easier to deal with. In addition, other object types can be converted into vectors. In the case that a feature is categorical, e.g. rainy/sunny/windy, it can be mapped onto integers, e.g. 0/1/2, as sketched below.
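As a quick illustrative sketch of such a mapping (the weather values here are made up, and factorize is just one of several possible encodings in Pandas):

import pandas

# Map a categorical feature onto integer codes
weather = pandas.Series(["rainy", "sunny", "windy", "rainy"])
codes, labels = pandas.factorize(weather)
print(codes)   # [0 1 2 0]
print(labels)  # Index(['rainy', 'sunny', 'windy'], dtype='object')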

In this section we will focus on single features, so let's call our chosen feature \(S_j = \{z_1, \ldots, z_m\}\). Each feature is either discrete or continuous. A discrete feature takes values from a finite set, e.g. a die has 6 values, 1 to 6. A continuous feature takes real values within a particular range, e.g. the height of adults in a population is between 1.5 and 2.5 metres. Which statistics can we compute on the \(z\)s? Two obvious ones are the minimum and maximum values. In addition, we can consider some notion of average which characterises the middle value in some sense. The most common average measure is the mean (denoted by \(\bar{z}\)):

$$ \bar{z} = \frac{1}{m} \sum_{i=1}^m z_i, $$

which is simply the sum of all values divided by \(m\). The median is the middle value if we sort the \(z\)s. Finally, the mode is the most common value and is usually only applied to discrete variables.
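As a small worked example, take the five values \(\{1, 2, 2, 3, 5\}\): the mean is

$$ \bar{z} = \frac{1 + 2 + 2 + 3 + 5}{5} = 2.6, $$

the median is 2 (the middle of the sorted values) and the mode is also 2 (the most frequent value).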

Next, it's instructive to consider measures of dispersion, i.e. how "spread out" is the data? The variance is a common way of measuring this property:

$$ var(z) = \frac{1}{m} \sum_{i=1}^m (z_i - \bar{z})^2. $$

The variance is just the average squared difference between each observation and the mean; squaring ensures that each term is non-negative. Of course, since the terms in the sum are squared, it is often more intuitive to take the square root of the variance:

$$ std(z) = \sqrt{var(z)} $$

and this forms the standard deviation. The standard deviation is, in a sense, on the same scale as \(z\). We can, in addition, use the median (denoted by \(med(z)\)) as the anchor point for measuring deviation (where \(|\cdot|\) denotes absolute value):

$$ mad(z) = \frac{1}{m} \sum_{i=1}^m |z_i - med(z)|, $$

and this quantity is known as the mean absolute deviation (about the median). Unlike the variance and standard deviation, it is robust to outliers: a single extreme value has little influence on it.
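Continuing the small worked example \(\{1, 2, 2, 3, 5\}\), with \(\bar{z} = 2.6\) and \(med(z) = 2\):

$$ var(z) = \frac{2.56 + 0.36 + 0.36 + 0.16 + 5.76}{5} = 1.84, \qquad std(z) = \sqrt{1.84} \approx 1.36, \qquad mad(z) = \frac{1 + 0 + 0 + 1 + 3}{5} = 1. $$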

Finally, instead of considering all points in the measure of dispersion, we can focus on the spread of the middle 50% of the data, which eliminates the effect of outliers. This is the difference between the 75th and 25th percentiles (the third and first quartiles) and is known as the interquartile range.

So what does this look like in code?

import numpy
import pandas

# Generate 1000 random numbers from a standard normal distribution
z = pandas.Series(numpy.random.randn(1000))

# Minimum
print(z.min())
# Maximum
print(z.max())
# Mean
print(z.mean())
# Median
print(z.median())
# Variance (note that Pandas divides by m - 1 rather than m by default)
print(z.var())
# Standard deviation
print(z.std())
# Mean absolute deviation about the median, matching the definition above
print((z - z.median()).abs().mean())
# Interquartile range
print(z.quantile(0.75) - z.quantile(0.25))

A handy shortcut to some of these statistics is z.describe(), which works on both DataFrame and Series objects.
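To see the robustness claim from earlier in action, here is a minimal sketch (the injected value of 100 and the variable names are purely illustrative) that corrupts a single observation and compares how the different dispersion measures react:

import numpy
import pandas

z = pandas.Series(numpy.random.randn(1000))
z_outlier = z.copy()
z_outlier.iloc[0] = 100.0  # inject a single extreme outlier

# The standard deviation is inflated considerably by the single outlier...
print(z.std(), z_outlier.std())

# ...while the median-anchored mean absolute deviation and the interquartile
# range barely move
print((z - z.median()).abs().mean(),
      (z_outlier - z_outlier.median()).abs().mean())
print(z.quantile(0.75) - z.quantile(0.25),
      z_outlier.quantile(0.75) - z_outlier.quantile(0.25))

On a typical run the standard deviation increases several-fold, while the two median-based measures change only slightly.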

To get a more detailed picture of a feature, we can plot the distribution of its values in a histogram. Effectively, a histogram partitions the range of the feature into bins and then counts the number of observations falling in each bin. For discrete variables it is simply the count of each distinct value. Pandas makes plotting histograms easy:

z.plot(kind="hist")

which yields the following plot. Notice that the histogram has an approximately normal (bell-shaped) profile, since this is how the feature was generated.

[Figure: A histogram of 1000 randomly generated observations from the normal distribution.]
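For the discrete case mentioned above, a quick sketch of the equivalent picture (the simulated die rolls here are just an illustration) uses value_counts() to count each value and plots the counts as a bar chart:

import numpy
import pandas

# Simulate 1000 rolls of a fair die and count each outcome
die = pandas.Series(numpy.random.randint(1, 7, size=1000))
counts = die.value_counts().sort_index()
print(counts)
counts.plot(kind="bar")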

Multivariate Statistics

There is only so much one can learn from single features. Since we are considering multivariate data, an analysis of the features in conjunction with each other is also useful. Two essential ways of characterising the linear relationship between pairs of features are covariance and correlation.

The covariance is a measure of how two variables change together. Let's now define these two features as \(S_i = \{y_1, \ldots, y_m\}\) and \(S_j = \{z_1, \ldots, z_m\}\). The covariance is defined as

$$ cov(y, z) = \frac{1}{m} \sum_{i=1}^m (y_i - \bar{y})(z_i - \bar{z}). $$

This looks a little like the variance, except the terms inside the sum are products of mean-subtracted \(y_i\)s and mean-subtracted \(z_i\)s. When the two features tend to move together in this sense, the covariance is large and positive. Conversely, if one feature tends to increase as the other decreases, the covariance is negative.

The scale of the covariance is determined by the scales of the input features. In contrast, the correlation can be considered a normalised variant of the covariance, with values between -1 and 1:

$$ corr(y, z) = \frac{cov(y, z)}{std(y)std(z)}, $$

and this is known as Pearson's product-moment correlation coefficient. A correlation of 1 or -1 indicates a perfect correspondence or inverse correspondence between the features, and a correlation of 0 means there is no linear relationship. The advantage of using the correlation is that quantities can be compared easily without being concerned with the scalings of the features.

Pandas makes computing covariances and correlations simple:

import numpy
import pandas

num_examples = 1000
x = pandas.Series(numpy.random.randn(num_examples))
y = x + pandas.Series(numpy.random.randn(num_examples))
z = x + pandas.Series(numpy.random.randn(num_examples))

# Covariance
print(y.cov(z))

# Covariance of y with itself is equal to variance
print(y.cov(y), y.var())

# Correlation
print(y.corr(z))

Here we generated a hidden feature \(x\) and then produced \(y\) and \(z\) by adding independent random perturbations to \(x\). The correlation between \(y\) and \(z\) is approximately 0.5.
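To see why, note that \(x\) and the two noise terms are independent with unit variance, so in terms of the corresponding population quantities (which the sample estimates above approximate)

$$ cov(y, z) = var(x) = 1, \qquad var(y) = var(z) = var(x) + 1 = 2, $$

$$ corr(y, z) = \frac{cov(y, z)}{std(y)\,std(z)} = \frac{1}{\sqrt{2}\sqrt{2}} = 0.5. $$

Rescaling either feature, say multiplying \(z\) by 10, would multiply the covariance by the same factor but leave this correlation unchanged.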

Summary

Understanding a dataset before applying machine learning is a worthwhile step, as it can help to improve accuracy. We presented some essential univariate and multivariate statistics to further this aim. Computing these quantities is beautifully simple in Pandas, as we have demonstrated, and so there is no excuse not to evaluate them. More advanced techniques include Principal Components Analysis (PCA) and other feature extraction algorithms.

Footnotes


  1. Viola P, Jones MJ. Robust real-time face detection. International Journal of Computer Vision. 2004;57(2):137-154.