
What Makes a Good Prediction?

Michel de Nostradamus was a 16th-century French physician who supposedly predicted the rise of Hitler. Is this an example of an incredible prediction, made centuries in advance? Here is James Randi's translation [^n] of an excerpt from Les Propheties by Nostradamus corresponding to this prediction:

Beasts mad with hunger will swim across rivers,
Most of the army will be against the Lower Danube.
The great one shall be dragged in an iron cage
When the child brother will observe nothing.

This mostly nonsensical verse does not correspond to actual events without generous post-hoc interpretation. It is clearly not a good prediction. A more precise prediction is that the sun will rise tomorrow. However, it is almost trivial to state. Finally, a lottery winner has made a precise and non-trivial prediction, but how many of us think that lottery winners could reliably repeat their wins?

From these examples, we can see that good predictions must be precise, non-trivial and repeatable. By non-trivial I mean that the predicted outcomes are neither impossible nor certain, but somewhere in between. An example is forecasting the rainfall in London. We would, however, like a way to assess quantitatively whether one prediction is better than another. In this article we present some approaches to quantitative comparison for two important classes of prediction problems.

Rainy Days

Let's consider a class of prediction problems known as classification problems. A classification is a prediction from a set of classes, e.g. rainy/sunny, yes/no, or heads/tails (we'll only talk about two-class, or binary, predictions). We need to make a series of \(n\) predictions, for a large number \(n\), to ensure that we are getting the right answer consistently and not just by chance.

For example, when predicting whether it will be sunny or rainy tomorrow, we evaluate the predictions over \(n=10\) days. Let's say that our predictions are \(z_1, \ldots, z_{10}\) and the actual weather for these days is given by \(y_1, \ldots, y_{10}\). Then we simply count the days \(i\) on which \(y_i = z_i\) and report this count as a proportion of the total number of observations. This proportion is known as the accuracy and lies between 0 and 1. An accuracy of 0.8, for example, means that we correctly forecast the weather 80% of the time. The predictor with the highest accuracy is the best one. It is important to note, however, that we should aim to exceed the trivial predictor. If 60% of days are sunny, then a trivial predictor which always predicts sunny will achieve an accuracy of 0.6.
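The accuracy calculation and the comparison against the trivial predictor can be sketched in a few lines of Python. The function name `accuracy` and the ten days of weather are made up for illustration; they are not taken from a real record.

```python
def accuracy(actual, predicted):
    """Proportion of predictions that match the actual outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for y, z in zip(actual, predicted) if y == z)
    return matches / len(actual)


# Hypothetical weather for n = 10 days (6 sunny, 4 rainy).
actual = ["sunny", "sunny", "rainy", "sunny", "rainy",
          "sunny", "sunny", "rainy", "sunny", "rainy"]

# A hypothetical forecast that is wrong on days 4 and 8.
forecast = ["sunny", "sunny", "rainy", "rainy", "rainy",
            "sunny", "sunny", "sunny", "sunny", "rainy"]

# The trivial predictor: always forecast the most common class.
always_sunny = ["sunny"] * len(actual)

print(accuracy(actual, forecast))      # 0.8
print(accuracy(actual, always_sunny))  # 0.6
```

Here the forecast is only worth using because its accuracy of 0.8 exceeds the 0.6 achieved by always predicting sunny.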

Accuracy, however, is not a complete picture of a set of classification predictions. There are two important special cases below.

Rain Aversion

Imagine now that we organise a picnic in the park whenever the forecast says it is sunny. If the forecast says it will be sunny and it rains, our picnic is ruined and we want to avoid this case. However, if the forecast says it will rain and it doesn't then we don't mind. In this setting two predictions with the same accuracy may not be equivalent. Consider the predictions given below for example.

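| | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
|---|---|---|---|---|---|
| Actual weather | Sunny | Sunny | Rainy | Sunny | Rainy |
| Predictor 1 | Sunny | Sunny | Rainy | Rainy | Rainy |
| Predictor 2 | Sunny | Sunny | Rainy | Sunny | Sunny |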

Both predictors 1 and 2 have accuracies of 80% (they each make 4 out of 5 correct predictions). In the context of rain aversion, predictor 1 is preferred because there are no cases in which a forecast of sunny coincides with rain, which is not true for predictor 2 (on the 5th day its prediction of sun is incorrect).

Of course, not all classification problems will have sunny and rainy as their classes, so instead we will talk about sunny days as positive and rainy days as negative. A true positive (\(tp\)) forecast occurs when we predict positive and the prediction is correct. Similarly, a false positive (\(fp\)) is where we predict positive and the actual class is negative. Rain aversion is therefore a preference for precision, given by \(tp/(tp+fp)\), which has a value between 0 and 1. Precision is large when most of the examples that are predicted positive are actually positive. In the example above, predictor 1 has a precision of 1 (2 true positives and no false positives), whereas predictor 2 has a precision of 3/4 (3 true positives and 1 false positive).
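As a minimal sketch, precision for the two predictors in the table can be computed as follows. The function name `precision` and the choice to return `None` when nothing is predicted positive are assumptions made for this illustration.

```python
def precision(actual, predicted, positive="Sunny"):
    """tp / (tp + fp): the share of positive predictions that were correct."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, z in pairs if z == positive and y == positive)
    fp = sum(1 for y, z in pairs if z == positive and y != positive)
    if tp + fp == 0:
        return None  # assumption: precision is undefined with no positive predictions
    return tp / (tp + fp)


# The five days from the table, with "Sunny" as the positive class.
actual      = ["Sunny", "Sunny", "Rainy", "Sunny", "Rainy"]
predictor_1 = ["Sunny", "Sunny", "Rainy", "Rainy", "Rainy"]
predictor_2 = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]

print(precision(actual, predictor_1))  # 1.0  (2 true positives, 0 false positives)
print(precision(actual, predictor_2))  # 0.75 (3 true positives, 1 false positive)
```

Both predictors have the same accuracy, yet precision separates them in favour of predictor 1, which matches the rain-averse preference.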

Sun Obsession

Now imagine instead that we don't mind if it rains on a picnic day, but would mind if we fail to organise a picnic on a sunny day because the forecast said it would rain. A prediction of rain on a sunny day is called a false negative (\(fn\)). In this case we are interested in a quantity known as recall, which is calculated as \(tp / (tp + fn)\). As the number of false negatives tends to zero, the recall tends to one. In the example above, predictor 2 has a recall of 1 since there are no days on which it predicts rain and the day turns out sunny. In contrast, predictor 1 has a recall of 2/3 since there are 2 true positives and 1 false negative.
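Recall can be sketched in the same way, again on the five days from the earlier table. The function name `recall` and the `None` return value for the case with no actual positives are assumptions made for this illustration.

```python
def recall(actual, predicted, positive="Sunny"):
    """tp / (tp + fn): the share of actual positives that were predicted positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, z in pairs if y == positive and z == positive)
    fn = sum(1 for y, z in pairs if y == positive and z != positive)
    if tp + fn == 0:
        return None  # assumption: recall is undefined when there are no actual positives
    return tp / (tp + fn)


# The five days from the table, with "Sunny" as the positive class.
actual      = ["Sunny", "Sunny", "Rainy", "Sunny", "Rainy"]
predictor_1 = ["Sunny", "Sunny", "Rainy", "Rainy", "Rainy"]
predictor_2 = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]

recall_1 = recall(actual, predictor_1)  # 2 true positives, 1 false negative: 2/3
recall_2 = recall(actual, predictor_2)  # 3 true positives, 0 false negatives: 1
print(recall_1, recall_2)
```

The ranking is the reverse of the rain-averse one: a sun-obsessed picnicker would choose predictor 2.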

Summary of error types for classifications

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True positive (tp) | False positive (fp) |
| Predicted Negative | False negative (fn) | True negative (tn) |

House Prices

A related problem to classification is that of predicting numerical quantities, such as height, speed and income. These are known as regression problems. As an example, consider the prediction of house values based on features such as crime rate, location, number of rooms, traffic, etc. Let's say that we have 10 predicted prices \(z_1, \ldots, z_{10}\) and the real prices are given by \(y_1, \ldots, y_{10}\). Here are two ways of measuring prediction quality:

  1. The mean of \((y_i - z_i)^2\) for all \(i\), known as mean squared error
  2. The mean of \(|y_i - z_i|\) for all \(i\), known as mean absolute error

Why are the errors phrased in this way? Clearly we want to measure the size of the difference between the predicted and real values, so that the error for the \(i\)th prediction is a positive number, or zero if the prediction is perfectly correct. Both squaring and taking the absolute value achieve this. The difference between the two error metrics is that the squared error emphasises larger errors, because squaring grows faster than the absolute value as the difference increases. Therefore we might choose the mean squared error when we want to penalise predictors that make large errors on individual predictions.
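The following sketch illustrates the difference between the two metrics. The house prices (in thousands) and the two predictors, called `steady` and `erratic` here, are hypothetical and chosen so that both have the same mean absolute error.

```python
def mean_squared_error(actual, predicted):
    """Mean of (y_i - z_i)^2 over all predictions."""
    return sum((y - z) ** 2 for y, z in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Mean of |y_i - z_i| over all predictions."""
    return sum(abs(y - z) for y, z in zip(actual, predicted)) / len(actual)


# Hypothetical real prices, in thousands.
actual = [200, 300, 400, 500]

# "steady" is off by 10 on every house; "erratic" is perfect on three
# houses and off by 40 on the last one.
steady = [210, 290, 410, 490]
erratic = [200, 300, 400, 540]

print(mean_absolute_error(actual, steady))   # 10.0
print(mean_absolute_error(actual, erratic))  # 10.0
print(mean_squared_error(actual, steady))    # 100.0
print(mean_squared_error(actual, erratic))   # 400.0
```

The mean absolute error cannot tell the two predictors apart, while the mean squared error penalises the single large miss four times as heavily.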


We looked at the properties which make a predictor good: precision, non-triviality and repeatability. In machine learning, classification and regression are key types of prediction problems. For classification problems, the metrics of accuracy, precision and recall give important insights into the nature of the predictions. For the regression case, we outlined mean squared error and mean absolute error.

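Note that the metrics we have described can easily be generalised. In the multilabel classification case, we can simply compute the metrics for each label and then average. In the regression case, when each prediction is a vector rather than a single number, we could use a vector norm of the difference between the predicted and real values.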
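One possible reading of these generalisations is sketched below. The function names, the "rain" and "wind" labels, the vectors, and the choice of the Euclidean norm are all assumptions made for illustration; other averaging schemes and norms are equally valid.

```python
import math


def accuracy(actual, predicted):
    """Proportion of predictions that match the actual outcomes."""
    return sum(1 for y, z in zip(actual, predicted) if y == z) / len(actual)


def label_averaged_accuracy(actual_by_label, predicted_by_label):
    """Compute the accuracy separately for each label, then average the scores."""
    scores = [accuracy(actual_by_label[label], predicted_by_label[label])
              for label in actual_by_label]
    return sum(scores) / len(scores)


def mean_euclidean_error(actual, predicted):
    """Mean Euclidean norm of the difference between real and predicted vectors."""
    norms = [math.dist(y, z) for y, z in zip(actual, predicted)]
    return sum(norms) / len(norms)


# Hypothetical multilabel data: each day can be rainy and/or windy (1 = yes, 0 = no).
actual_labels = {"rain": [1, 0, 1, 1], "wind": [0, 0, 1, 1]}
predicted_labels = {"rain": [1, 0, 0, 1], "wind": [0, 0, 1, 1]}
print(label_averaged_accuracy(actual_labels, predicted_labels))  # 0.875

# Hypothetical vector-valued regression targets and predictions.
actual_vectors = [(3.0, 4.0), (0.0, 0.0)]
predicted_vectors = [(0.0, 0.0), (0.0, 0.0)]
print(mean_euclidean_error(actual_vectors, predicted_vectors))  # 2.5
```

The same per-label averaging applies to precision and recall by swapping the metric function.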


Image credit: Dave Winer (cropped and vignetted)