CERN's Large Hadron Collider (LHC) is to date the world's largest and most expensive particle accelerator. On 4 July 2012, it was used to discover an elementary particle called the Higgs boson, a quantum excitation of the Higgs field, which gives mass to elementary particles. This discovery is deeply important because it enables us to understand a fundamental aspect of nature [^n]. A key part of the discovery of the Higgs was the use of advanced techniques from statistics and machine learning to analyse the data generated by the LHC. In this article, we use the dataset of Baldi et al. [^n] to distinguish between signals corresponding to Higgs bosons and a background process with identical decay products. The dataset comprises 11 million observations and 28 features, so we use out-of-core algorithms in scikit-learn to perform the classification on a single machine. Out-of-core algorithms work on one part of a dataset at a time, enabling one to scale to large datasets on modest machines (in this case, my laptop).
On The Limits of Algorithms
Any algorithm that works on an input \(S\) of size \(n\) consumes a certain amount of memory and takes a certain amount of time; call these quantities \(f(S)\) and \(g(S)\) respectively. The study of how memory and time grow with \(S\) is known as computational complexity. It is not the purpose of this article to delve into this interesting field, but one of its most important concerns is worst-case performance, i.e. what are the upper bounds on time and memory? In a machine learning context, with fixed memory (RAM) and time and for a particular algorithm, we are therefore limited in the size of the dataset we can learn from.
As a simple example, consider the classic feature extraction algorithm kernel Principal Components Analysis (KPCA). It scales cubically with the number of examples in time, and quadratically in space (memory). Doubling the number of examples from 100 to 200 therefore increases the runtime by a factor of \(2^3 = 8\), and a tenfold increase from 100 to 1,000 increases the runtime by a factor of 1,000! Clearly, for an algorithm to scale to large datasets it must have reasonable time and space complexity. Typically we can say that an algorithm that scales linearly or better is suitable for large-scale learning: as the size of the dataset doubles, the runtime at most doubles.
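To make the scaling concrete, here is a tiny sketch in plain Python (with a hypothetical `cubic_ratio` helper, not part of any library) that computes the growth factors quoted above for a cubic-time algorithm such as KPCA:

```python
def cubic_ratio(n_old, n_new):
    """Runtime growth factor for an O(n^3) algorithm when the
    number of examples grows from n_old to n_new."""
    return (n_new / n_old) ** 3

print(cubic_ratio(100, 200))   # doubling the data: 8x the runtime
print(cubic_ratio(100, 1000))  # tenfold data: 1000x the runtime
```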
An additional limitation of many machine learning algorithms is that they require the entire dataset to be present in memory. Out-of-core algorithms instead work on parts of the full dataset at a time, fetching new data when required. Since they continually retrieve data from a hard disk, an efficient out-of-core approach should scan the whole dataset as few times as possible. In some scenarios, such as data streamed from a network, each observation is seen only once by the algorithm.
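As a minimal illustration of the out-of-core pattern (not the classifier used below), the following sketch computes a running mean over a stream of chunks without ever holding the full data in memory; the chunks here are small in-memory lists standing in for blocks read from disk:

```python
def running_mean(chunks):
    """Mean over a stream of chunks, holding only one chunk at a time."""
    total, count = 0.0, 0
    for chunk in chunks:        # each chunk is a small list of numbers
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Stand-in for chunks read from disk or a network stream.
data_stream = ([1, 2, 3], [4, 5], [6])
print(running_mean(data_stream))  # 3.5
```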
Returning to the Higgs dataset, we use an out-of-core classification algorithm to distinguish between Higgs bosons and background signals. Scikit-learn has a great variety of algorithms for "small-scale" learning, but it also contains several large-scale ones, including Naive Bayes, Stochastic Gradient Descent (SGD), the Passive-Aggressive classifier and the Perceptron. A bit of experimentation shows that SGD gives the best results on the Higgs dataset, so we stick with it for the rest of the article. SGD learns a linear classifier and is an efficient way to optimise the corresponding objective: it scales linearly in the number of examples.
First, we load parts of the data using pandas and divide them into training and test sets. The complete data has 11 million examples; we fix the test set to the last 500,000 examples of the file (as in the paper), and the initial training set is 100,000 examples. These datasets are standardised using scikit-learn's StandardScaler, so that each feature has zero mean and unit variance.
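A minimal sketch of the standardisation step, using randomly generated data as a stand-in for the first block of the HIGGS file (column 0 is the label, the remaining 28 columns are the features); the scaler is fitted once on the initial training block and later blocks are transformed with the same statistics:

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the first block of the HIGGS file.
rng = numpy.random.RandomState(0)
block = numpy.hstack([rng.randint(0, 2, (1000, 1)), rng.randn(1000, 28)])

X_train = block[:, 1:]
y_train = block[:, 0].astype(int)

# Fit the scaler on the initial training block only; subsequent
# blocks reuse scaler.transform with these statistics.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
print(X_train.mean(axis=0)[:3])  # approximately zero after standardisation
```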
SGD is a linear classifier and hence cannot model non-linear functions. Fortunately, it can be combined with various feature extraction approaches that capture non-linearity. Inspired by a talk by Andreas Mueller, we use random forests to generate features which are then fed into the classification algorithm.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Learn a forest on the initial training set, then one-hot encode the
# leaf index that each tree assigns to each example.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
forest.fit(X_train, y_train)

encoder = OneHotEncoder()
encoder.fit(forest.apply(X_train))
```
The general idea is to take the initial training set of 100,000 examples and learn a random forest. We then record, for each example and each tree in the forest, the index of the leaf node that the example falls into (using `forest.apply`). These leaf indices are fed into a `OneHotEncoder`, which converts the categorical integer features into indicator 0/1 features.
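To see what this encoding does, here is a toy example with hypothetical leaf indices from a two-tree forest (the kind of array `forest.apply` returns): each distinct leaf index in each column becomes its own 0/1 indicator column.

```python
import numpy
from sklearn.preprocessing import OneHotEncoder

# Leaf indices for three examples in a hypothetical two-tree forest:
# one row per example, one column per tree.
leaves = numpy.array([[3, 1],
                      [3, 2],
                      [5, 1]])

encoder = OneHotEncoder()
coded = encoder.fit_transform(leaves)  # sparse 0/1 indicator matrix
print(coded.toarray())
```

Tree 0 has leaves {3, 5} and tree 1 has leaves {1, 2}, so the encoded matrix has four indicator columns.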
The initial parameters of the SGDClassifier are set up as follows:
```python
learner = SGDClassifier(loss="hinge", penalty="l2",
                        learning_rate="invscaling", alpha=0.001,
                        average=10**4, eta0=1.0,
                        class_weight=class_weights)
```
Next we present the main training loop which iterates through the complete file 3 times:
```python
num_passes = 3
aucs = []
for j in range(num_passes):
    for i in range(0, num_train_examples, block_size):
        # Read the next block of the file and preprocess it.
        df = pandas.read_csv(input_filename, skiprows=i, nrows=block_size)
        X_train = df.values[:, 1:]
        X_train = scaler.transform(X_train)
        X_train = encoder.transform(forest.apply(X_train))
        y_train = df.values[:, 0].astype(int)

        # Update the learnt model with the current block.
        learner.partial_fit(X_train, y_train, classes=numpy.array([0, 1]))

        # Track the AUC on the fixed test set.
        y_pred_prob = learner.decision_function(X_test)
        auc = roc_auc_score(y_test, y_pred_prob)
        aucs.append([i + block_size * j, auc])
        print(aucs[-1])
```
The outer `for` loop iterates over the complete dataset once per pass, and the inner one processes the dataset in `block_size` chunks. Inside the inner loop we read the relevant part of the input file and then preprocess the training examples with the scaler, the forest and the `OneHotEncoder` in turn. The `partial_fit` method then updates the learnt model using the current block of training examples; the presence of a `partial_fit` method in a scikit-learn estimator is what indicates that it can be used for out-of-core learning. Finally, we make predictions for the test set and compute the Area Under the ROC Curve (AUC) metric. The complete code for this example is provided here.
The resulting AUC curve is given by the plot below.
As we can see, performance improves quickly at first and then levels out at an AUC of approximately 0.824. It looks as though the algorithm has converged, although it would be better to monitor the loss function being optimised, which scikit-learn does not readily expose.
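One workaround, sketched below with hypothetical decision-function scores, is to compute the hinge loss (the objective SGD is minimising here) directly from the margins via `sklearn.metrics.hinge_loss`:

```python
import numpy
from sklearn.metrics import hinge_loss

# Hypothetical margins standing in for learner.decision_function(X_test).
y_test = numpy.array([0, 1, 1, 0])
scores = numpy.array([-1.2, 0.8, 2.0, 0.3])

# Average of max(0, 1 - y * score), with labels mapped to {-1, +1}.
print(hinge_loss(y_test, scores))  # 0.375
```

Tracking this value across blocks, alongside the AUC, gives a more direct picture of convergence.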
How could the results be improved? One could tune the parameters of both the random forest and SGD more thoroughly. Furthermore, one could use the ensemble methods in Apache Spark, or deep learning with Pylearn2 as Baldi et al. did, obtaining better results on much more powerful hardware.
The discovery of the Higgs boson is incredibly important for understanding the nature of the universe. We participated in this endeavour by using an out-of-core machine learning algorithm to distinguish Higgs bosons from background signals. The combination of random forests for feature extraction and stochastic gradient descent for classification gave good results on 11 million examples whilst running on modest hardware.
Image Credit: Maximilien Brice