An essential requirement of any scientific software application is speed. Speed increases can be obtained relatively easily with faster CPUs and more memory. These gains, however, come at a financial cost and may bring additional disadvantages such as increased power usage. A software-driven approach to optimisation is much more desirable, and in this article we benchmark some of the core operations on matrices which are fundamental to many scientific and machine learning algorithms. In particular, we look at the dot product, matrix addition and some elementwise matrix operations. The operations are tested for NumPy, Theano and TensorFlow, of which the latter two can make use of fast Graphics Processing Units (GPUs) for dramatic speed increases.
Perhaps the most useful operation on matrices is the dot product, also known as matrix multiplication. It underpins gradient-based optimisation, which is central to deep learning and to a great many other machine learning algorithms. We also look at matrix addition, which is used in optimisation update steps, for example. Finally, we consider a number of elementwise operations: the sum, minimum element, exponent and mean value. Note that we didn't consider the very useful eigenvalue decomposition and Singular Value Decomposition, since they are not accelerated in either TensorFlow or Theano.
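For reference, the elementwise operations listed above are all one-liners in NumPy. A minimal sketch (the 4×4 size here is purely illustrative):

```python
import numpy

# Small random float32 matrix (the GPU libraries require float32)
A = numpy.random.rand(4, 4).astype(numpy.float32)

total = A.sum()       # sum of all elements
smallest = A.min()    # minimum element
expA = numpy.exp(A)   # elementwise exponent
average = A.mean()    # mean value
```

These are the CPU baselines that the Theano and TensorFlow equivalents are compared against below.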
As stated, we test these operations in three Python libraries: NumPy, Theano and TensorFlow. NumPy is the most widely used scientific library in Python, and our test system is set up to use the optimised OpenBLAS for linear algebra. Theano and TensorFlow are primarily deep learning libraries, but they also allow key linear algebra to be performed on a GPU, resulting in huge speedups over a CPU. We test on an Intel Core i5-4460 CPU with 16 GiB RAM and an Nvidia GTX 970 with 4 GiB RAM, using Theano 0.8.2, TensorFlow 0.11.0 and CUDA 8.0 on Linux Mint 18.
The following code snippet shows how to generate some timings for the NumPy dot product:
```python
import numpy
import timeit

# i (the matrix size) and num_repeats are set by the surrounding benchmark loop
A = numpy.random.rand(i, i).astype(numpy.float32)
B = numpy.random.rand(i, i).astype(numpy.float32)
timer = timeit.Timer("numpy.dot(A, B)", "import numpy; from __main__ import A, B")
numpy_times_list = timer.repeat(num_repeats, 1)
```
Here A and B are randomly generated matrices of size i × i. They have to be of type float32 in order to be accelerated with the GPU. We use the timeit module to time repeated iterations of numpy.dot(A, B). Note that numpy_times_list is a list containing a timing for each repetition, and we simply take the minimum time in this list.
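One detail worth noting: timeit.Timer also accepts a callable, which sidesteps the "from __main__ import ..." setup string entirely. A small sketch of the timing pattern (the size and repeat count here are illustrative, not the benchmark's):

```python
import timeit

import numpy

i = 500          # illustrative size; the benchmark sweeps much larger matrices
num_repeats = 5

A = numpy.random.rand(i, i).astype(numpy.float32)
B = numpy.random.rand(i, i).astype(numpy.float32)

# Passing a callable avoids importing names from __main__ in the setup string
timer = timeit.Timer(lambda: numpy.dot(A, B))
times = timer.repeat(num_repeats, 1)  # one timing per repetition
best = min(times)  # the minimum filters out noise from background load
```

The minimum is preferred over the mean because background load can only ever slow a run down, never speed it up.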
With Theano things are slightly more complicated:
```python
import timeit

import theano
import theano.tensor as T
from theano import function

A = theano.shared(A)
B = theano.shared(B)
z = T.dot(A, B)
f = function([], z)  # compile a function with no inputs that returns z
timer = timeit.Timer("f()", "from __main__ import f")
theano_times_list = timer.repeat(num_repeats, 1)
```
You can see that we need to set up shared tensor variables A and B, which are then used to define the function f and its output z. By using shared variables we ensure that they are present in GPU memory, if a GPU is available. The difference in style from the NumPy code arises because Theano uses an optimising compiler to evaluate its operations.
TensorFlow works in a similar fashion:
```python
import timeit

import tensorflow

A = tensorflow.constant(A)
B = tensorflow.constant(B)
result = tensorflow.matmul(A, B)

# Create a session object to evaluate the graph
sess = tensorflow.Session()
timer = timeit.Timer("sess.run(result)", "from __main__ import sess, result")
tensorflow_times_list = timer.repeat(num_repeats, 1)
sess.close()
```
In the above code block, A and B are TensorFlow constants, and result = tensorflow.matmul(A, B) defines the operation we are interested in. The matrix multiplication is then profiled with a call to sess.run(result).
These tests are run for matrix sizes from 500 to 5000 in steps of 500. In addition we benchmark all the operations detailed in the previous section. The full benchmark code is available here. Feel free to try it on your system.
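The sweep over sizes is then just a loop around the timing pattern shown earlier. A NumPy-only sketch, truncated to 1500 here to keep it quick (the actual benchmark goes up to 5000):

```python
import timeit

import numpy

num_repeats = 3
results = {}

# The benchmark sweeps 500 to 5000 in steps of 500; we stop at 1500 for brevity
for i in range(500, 1501, 500):
    A = numpy.random.rand(i, i).astype(numpy.float32)
    B = numpy.random.rand(i, i).astype(numpy.float32)
    timer = timeit.Timer(lambda: numpy.dot(A, B))
    results[i] = min(timer.repeat(num_repeats, 1))  # best-of-n time per size
```

The same loop structure applies to the Theano and TensorFlow versions, with the compiled function or session call substituted for the NumPy statement.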
The results are shown in the plots above. It is clear that the main strengths of Theano and TensorFlow are very fast dot products and matrix exponents. For the largest matrices, the dot product is approximately 8 times faster with Theano and 7 times faster with TensorFlow than with NumPy. Strangely, matrix addition is slow with the GPU libraries, and NumPy is the fastest in those tests.
The minimum and mean of matrices are slow in Theano and quick in TensorFlow. It is not clear why Theano is so slow (worse than NumPy) for these operations. Note that both Theano and TensorFlow are under active development, so these results may well improve quickly.
Our aim in this article was to benchmark the performance of key matrix operations in NumPy, Theano and TensorFlow. The latter two libraries have the advantage of using a GPU, when available, to massively speed up operations. On the whole, Theano and TensorFlow provide very fast dot products and elementwise exponents, and on average TensorFlow is the faster of the two for the operations under consideration.
Update 19/11/16: Updated to the latest TensorFlow/CUDA libraries and corrected errors pointed out by mauinz.