# On Artificial Neural Networks and Deep Learning

The most complex machines in the known universe are neural networks, and each of us is lucky enough to possess one. Inspired by their biological counterparts, artificial neural networks (ANNs) are a class of algorithms that loosely simulate real neural networks. They have been studied since the 1950s; however, recent advances in ANNs, coupled with improvements in hardware, have enabled them to make a resurgence under the name of *deep learners*, driven by impressive empirical results. For example, in a Kaggle competition to distinguish images of cats and dogs, the winning entry used deep learning and was correct 98.9% of the time, and an ANN with more than 1 billion connections was used to find cats in YouTube videos. In this article we explain how deep learning works and give a practical example of its usage.

## A Classical Approach: Backpropagation

Deep learning algorithms are essentially a specific class of ANNs. A famous, classical example is the feed-forward network trained with backpropagation. Take a look at the network below, which is an example of such a backpropagation network [^n].

The network is composed of 9 *neurons*, represented by circles. Each neuron takes a vector of inputs and computes a single output from the weighted sum of those inputs. The function applied to the weighted sum varies from network to network; common choices are the logistic (sigmoid) and softmax functions. The neurons are arranged as in the diagram above, in 3 layers: input, hidden and output. The output of each neuron in one layer is used as an input to the neurons in the next. The input data (e.g. an image) is fed into the input layer, and the output layer then makes a prediction (e.g. cat or dog).
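To make the neuron computation concrete, here is a minimal sketch in plain Python. The weights, inputs and layer sizes are made-up values for illustration, and the bias term is an assumption that the description above does not mention (most implementations include one).

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(z)

def layer_output(inputs, weight_rows, biases):
    """Every neuron in a layer sees the same inputs but has its own weights."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical values: 3 inputs feeding a hidden layer of 2 neurons.
x = [0.5, -1.0, 2.0]
hidden_weights = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]
hidden_biases = [0.0, 0.1]
hidden = layer_output(x, hidden_weights, hidden_biases)
```

Feeding `hidden` into another call to `layer_output` would give the next layer, which is all that "the output of each neuron is used as an input to the next layer" amounts to.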

So how does this network learn? Recall that each neuron weights its inputs. The learning takes place as we adapt these weights to the examples provided in the input. At a high level, we give some examples to the network, propagate the signals to the output layer and compare this to the output we expect by computing an error. This error is then propagated backwards through the network and used to adjust the weights. The process is repeated until the weights converge on a set of values or the number of errors made on the examples falls below a threshold. A more detailed explanation can be found here and there is a step-by-step example here.
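The loop described above can be sketched in a few dozen lines of plain Python. This is an illustrative toy under our own assumptions, not the code behind any library: one hidden layer, sigmoid activations, a squared-error loss, a fixed number of passes instead of a convergence test, and the logical OR function as hypothetical training data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Propagate one example to the output; return hidden and output activations."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, o

def mean_squared_error(data, W1, b1, W2, b2):
    return sum((forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    initial_error = mean_squared_error(data, W1, b1, W2, b2)
    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x, W1, b1, W2, b2)
            # Error signal at the output: derivative of 0.5 * (o - t)^2 through the sigmoid.
            delta_o = (o - t) * o * (1.0 - o)
            # Propagate the error backwards to each hidden neuron.
            delta_h = [delta_o * W2[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # Adjust every weight a small step against its gradient.
            for j in range(n_hidden):
                W2[j] -= lr * delta_o * h[j]
                b1[j] -= lr * delta_h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_h[j] * x[i]
            b2 -= lr * delta_o
    final_error = mean_squared_error(data, W1, b1, W2, b2)
    return (W1, b1, W2, b2), initial_error, final_error

# Hypothetical training set: the logical OR of two inputs.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
params, initial_error, final_error = train(or_data)
```

Comparing `initial_error` with `final_error` shows the effect of the weight updates; a real implementation would also stop early once the error falls below a threshold, as described above.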

## Deep Learners

Deep learning algorithms [^n][^n] have one or more hidden layers (sometimes thousands), which allow them to model more complex functions. In addition, the following features give deep learning algorithms advantages over "shallow" approaches:

- The brain is a deep network and models the input at different levels of abstraction with more abstraction further along the network. This seems to be intuitive in the ways that humans organise their thoughts and ideas (e.g. understanding how a plane is constructed can be at the level of large components such as wings and engines, or atomic mechanical parts).
- A deep network models a family of functions considerably more efficiently than a shallow one, particularly those involved in visual recognition.
- New advances in optimisation techniques such as Stochastic Gradient Descent (SGD) enable computationally efficient learning as well as parallelisation. Implementations using Graphics Processing Units (GPUs) as opposed to Central Processing Units (CPUs) provide massive speedups.
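As a rough illustration of the idea behind SGD, the sketch below fits a straight line rather than a neural network, which is a simplification we chose for brevity. Each update uses the gradient estimated from a small random batch instead of the whole dataset. The data, learning rate and batch size are made-up values.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.1, epochs=100, batch_size=10, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error, estimated on this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 50.0 - 1.0 for i in range(101)]  # inputs evenly spaced in [-1, 1]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Because each step touches only `batch_size` examples, the cost of an update does not grow with the size of the dataset, and the per-example gradients inside a batch can be computed in parallel.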

Along with these great properties, however, come some disadvantages. The structure of deep networks makes them difficult to analyse in a mathematical sense. In addition, the optimisation problems that arise from deep learning are *non-convex*, and hence do not have unique solutions and are not particularly easy to solve. Despite these issues, deep learning algorithms remain an active area of research and have garnered a lot of attention in industry.

## Classifying Handwritten Digits

To get a better sense of deep learning, let's dive into a simple code example. We will use Keras for the deep learning algorithms, as it is easy to use and supports a wide variety of networks and optimisation methods. Since deep learning seems to work well for image classification, we will use the Digits dataset, which contains 1797 images of 8x8 pixel handwritten digits (0-9) along with their associated classifications. This dataset is built into scikit-learn, so we can load it as follows:

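```
import pandas
from sklearn.datasets import load_digits

dataset = load_digits()
X = dataset["data"]      # one row of 64 pixel values per image
y = dataset["target"]    # integer class labels 0-9
# One indicator (one-hot) column per digit class
y_indicators = pandas.get_dummies(y).values
```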

Each row of `X` is the flattened array of pixel values for one image, and `y` holds the corresponding class (0-9). For the neural network we also build a set of indicator classes in `y_indicators`, so that, for example, `0`, `1` and `2` are represented as `[1,0,0,0,0,0,0,0,0,0]`, `[0,1,0,0,0,0,0,0,0,0]` and `[0,0,1,0,0,0,0,0,0,0]` respectively. This representation is required so that each neuron in the output layer corresponds to a class.
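The indicator encoding is simple enough to write by hand. The sketch below is a plain-Python stand-in for what `pandas.get_dummies` produces here, assuming the ten digit classes are ordered from 0 to 9; the helper names `one_hot` and `decode` are our own.

```python
def one_hot(label, n_classes=10):
    """Return a list with a 1 in the position of the class and 0 elsewhere."""
    vec = [0] * n_classes
    vec[label] = 1
    return vec

def decode(indicator):
    """Recover the integer class: the position holding the largest value."""
    return max(range(len(indicator)), key=lambda i: indicator[i])

labels = [0, 1, 2, 9]
indicators = [one_hot(y) for y in labels]
```

With this ordering, digit 0 maps to a vector whose first entry is 1, and `decode` inverts the mapping.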

Here is how we set up the deep learner:

```
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# Argument names follow the older Keras API used in this article; newer
# releases rename output_dim to units and init to kernel_initializer.
ann = Sequential()
ann.add(Dense(output_dim=256, input_dim=X.shape[1], init="glorot_uniform"))
ann.add(Activation("relu"))
ann.add(Dense(output_dim=10, init="glorot_uniform"))
ann.add(Activation("sigmoid"))
ann.compile(loss='categorical_crossentropy', optimizer='sgd')
```

A `Sequential` object represents a linear stack of layers, so we can build up a network incrementally. The input dimension must match the number of pixels (64), and the hidden layer that follows contains 256 neurons. Note that the `init` parameter sets the initial values of the weights; `glorot_uniform` draws them from a uniform distribution whose range is scaled by the layer's input and output sizes. The `Activation` layer added after the first `Dense` layer sets how each neuron's weighted sum of inputs is transformed into its output. We then create an output layer with 10 outputs and the `sigmoid` activation function. The final line compiles the model by specifying a loss function and a method for minimising it. In this case the optimiser is Stochastic Gradient Descent (SGD).
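To give a feel for what the `glorot_uniform` initialiser does, here is a minimal plain-Python sketch of the scheme as it is usually defined: weights drawn uniformly from `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`. The function below is our own illustration, not Keras code, and the parameter count assumes each `Dense` layer also carries one bias per neuron.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix uniformly from [-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

rng = random.Random(0)
# Same layer sizes as the network above: 64 pixels -> 256 hidden -> 10 outputs.
hidden_weights, hidden_limit = glorot_uniform(64, 256, rng)
output_weights, output_limit = glorot_uniform(256, 10, rng)

# Weights plus one bias per neuron in each Dense layer.
n_parameters = (64 * 256 + 256) + (256 * 10 + 10)
```

The range shrinks as layers get wider, which keeps the size of the signals roughly stable from layer to layer at the start of training.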

Training and prediction are performed using `fit` and `predict`, which follow the scikit-learn conventions:

```
# Requires `import numpy`; nb_epoch is the older Keras name for epochs
ann.fit(X_train, y_train_indicators, nb_epoch=50, batch_size=32)
y_pred = ann.predict(X_test)
# The predicted class is the output response with the largest value
y_pred = numpy.argmax(y_pred, axis=1)
```

This code snippet is inside a `for` loop which performs cross validation and splits `X` into `X_train` and `X_test`, and so on. Note that the final line of this block converts a set of predicted indicator values back into an integer class. For brevity we omitted the full code, but it is available here.
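Since the full script is not reproduced here, the following is only a guess at the shape of the surrounding loop, written in plain Python. The choice of 5 folds, the shuffling, and the helper names `k_fold_indices`, `argmax_rows` and `error_rate` are all assumptions made for illustration; the original code may split the data differently.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def argmax_rows(responses):
    """Convert rows of output responses back into integer classes."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in responses]

def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# 1797 is the number of images in the Digits dataset.
splits = list(k_fold_indices(1797))

# Made-up output responses for two examples with three classes each.
example_pred = argmax_rows([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]])
example_error = error_rate([1, 2], example_pred)
```

Inside each fold, the training indices would select `X_train` and `y_train_indicators`, the model would be fitted and used to predict on `X_test`, and the per-fold error rates would be averaged.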

We made a comparison with a Support Vector Machine (SVM), which yields an error of 0.066 compared to 0.057 with the deep learner. Hence, we found an improvement (although we did not spend a lot of time optimising parameters). With Keras there are plenty of ways to vary the structure of the network, the activation functions and the optimisation routine, so it's likely this result could be improved.

## Summary

This article has explained why there is so much buzz around deep learning. We introduced Backpropagation as a simple classical deep learner, and discussed some of the advantages and disadvantages of this class of algorithms. Finally, we saw an example of the effectiveness of a deep learner using the Keras library for Python.

### Footnotes

Image credit: *DJ*
