/ anomaly detection

Machine Learning for Network Anomaly Detection

How would you characterise this group of events: a meteor shower, the birth of quadruplets and being struck by lightning? They are interesting by their nature, because they are rare and deviate from the normal, and known as anomalies or outliers. Finding anomalies is useful in a number of applications such as detecting credit card fraud, finding abnormal example test scores and uncovering mechanical components which are likely to fail. In the context of network security an anomaly could well be a potential intrusion, so anomaly detection is an important line of defence in network security.

Since many network attacks are automated, it makes sense to have an automated approach to their detection. Traditional methods for intrusion detection rely on developing signatures representing the attack pattern, coded by human experts (e.g. with Snort). However a variation in the attack patterns would not be detected by the signatures. In addition, by the nature of anomalies there are few examples of them and hence it is hard to uncover patterns. The use of machine learning in this context means that algorithms adapt as new attacks are developed. In this article, we will discuss the application of machine learning techniques in anomaly detection.

Intrusion Detection Data

To get a better idea of the kind of data that is useful for intrusion detection, we look at the KDD CUP 99 dataset. Here the task was to distinguish between "good" and "bad" connections. Data was collected over several weeks pertaining to the TCP dump of connections over a Local Area Network (LAN). Each observation in the dataset corresponds to a single connection, and the fields in each observation include:

  • Source IP
  • Destination IP
  • Starting and ending times
  • Protocol type
  • Number of failed login attempts
  • Number of file operations
  • Number of connections from same host within the last 2 seconds

Each connection is labelled as either normal or attack and the latter category has an associated type such as denial of service (DoS), port scanning or password guessing.

Classifying Connections

To detect an intrusion, we use a classification algorithm for predicting the connection type. According to the results of KDD-CUP-99, the 1-nearest neighbour algorithm scored better than all but 9 entries. Below we will explain some simple Python code to perform this prediction.

First, we need to download the data, in particular kdd.cup.data.gz and corrected.gz. The labels are processed using this awk script to convert them into integers.

Next we read the example and labels files using pandas.read_csv and process the resulting DataFrames using the following function:

def process_data(X, y):
    X = X.drop(41, 1)
    X[1], uniques = pandas.factorize(X[1])
    X[2], uniques = pandas.factorize(X[2])
    X[3], uniques = pandas.factorize(X[3])

    num_examples = 10**6
    X = X[0:num_examples]
    y = y[0:num_examples]

    X = numpy.array(X)
    y = numpy.array(y).ravel()

    return X, y

All this function does is drop the connection type (column 41) from X and turn categorical features into integers, then samples the first million rows and returns the resulting numpy arrays. The data is sampled because it was too big to fit in the memory of my laptop which has 8GB RAM. Naturally, if you have more RAM you can use more data (or better still use an engine for large scale data processing such as Spark).

Next, comes the actual prediction.

learner = KNeighborsClassifier(1, n_jobs=-1)
learner.fit(train_X, train_y)
pred_y = learner.predict(test_X)

results = confusion_matrix(test_y, pred_y)
error = zero_one_loss(test_y, pred_y)

print(results)
print(error)

We create a KNeighborsClassifier and fit it to our training examples and labels train_X and train_y respectively. We then make predictions on the test examples, test_X. The output is a classification error of 0.261 and a confusion matrix as follows:

0 1 2 3 4
0 60258 188 111 0 36
1 1270 2687 192 0 17
2 61725 1273 166855 0 0
3 201 16 0 8 3
4 15669 339 3 1 177

In the confusion matrix, the rows represent actual class labels and the columns are the predicted ones. In the ideal case there are non-zero elements along the major diagonal only. We can see that predictions of classes 0, 1 and 2, corresponding to normal connections, probes and denials of service, are quite good. In contrast, user-to-root and remote-to-local predictions (labels 3 and 4) are quite poor due to lack of training examples in these cases. Therefore, one can naturally improve these areas by obtaining more examples of these cases. The full code of this approach can be found here.

In a practical settings one would like a system which is computationally efficient when dealing with potentially large amounts of traffic. Unfortunately, the nearest neighbours approach is not fast enough to run in real time as each new connection requires a scan through the training examples to find the nearest one in the naive case. Scikit-learn uses more efficient KDTree and BallTree algorithms however they are still not suitable for real-time prediction. The third place algorithm of the competition uses decision trees which would be able to give efficient predictions, provided the trees were not too large.

Summary

We introduced the problem of network anomaly detection for detecting intruders. Machine learning can be an effective approach to solving this problem since attack patterns can change and learning algorithms are well-suited to adapt to change. As a practical example, we used the KDD-CUP-99 dataset which classifies network connections into normal and abnormal, and showed how to form a simple and effective intrusion detection system.

Footnotes

Image credit: Arne Hjorth Johansen