There are quite a few programming languages in existence (1500+), so which one should you choose for machine learning (ML)? If you read the title you'll know that the answer is Python. In this brief article we will outline what makes Python a great language, and present some of the key libraries that make it suitable for machine learning.
Python is a high-level interpreted language created by Guido van Rossum and first released in 1991. It is designed to be easier to read and less complicated to write than languages such as C++ and Java. It offers the features expected of a modern language, such as object orientation, automatic memory management, functional and imperative programming, and strong typing. Its key strengths are that it has an open source implementation, supports multiple programming paradigms and, as an interpreted language, is highly portable across operating systems. Speed can be an issue with interpreted languages, but Python plays nicely with C and C++, so performance-critical components can be optimised.
The popularity and extensibility of Python have made it a good choice for developers, and particularly for scientists and mathematicians. Other popular languages in this domain include Matlab, which is non-free, and R, which is a specialised language for statistics.
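To make the "multiple paradigms" point concrete, here is a minimal sketch that computes the same sum of squares in imperative, functional and object-oriented styles, and then shows strong typing in action. The names (values, SquareSummer and so on) are made up for illustration.

```python
from dataclasses import dataclass
from functools import reduce

values = [1, 2, 3, 4]

# Imperative style: an explicit loop that updates a running total.
total_imperative = 0
for v in values:
    total_imperative += v * v

# Functional style: map and reduce, with no variable being mutated.
total_functional = reduce(lambda acc, sq: acc + sq, map(lambda v: v * v, values), 0)


# Object-oriented style: the data and the behaviour live together in a class.
@dataclass
class SquareSummer:  # hypothetical class, for illustration only
    values: list

    def total(self) -> int:
        return sum(v * v for v in self.values)


total_oo = SquareSummer(values).total()

# Strong typing: Python refuses to silently mix a string and an integer.
try:
    "1" + 1
    mixed_types_allowed = True
except TypeError:
    mixed_types_allowed = False

print(total_imperative, total_functional, total_oo, mixed_types_allowed)
```

All three totals come out the same, and the attempt to add a string to an integer raises a TypeError rather than guessing what was meant.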
Machine Learning Libraries
Without further ado we will outline some of the core Python libraries that make machine learning so easy.
NumPy and SciPy
The grandfather of all ML libraries is NumPy, which is used for creating and manipulating arrays or matrices of any dimension. Its linear algebra computations are very fast because, under the hood, NumPy uses BLAS and LAPACK (like Matlab). SciPy is a companion library to NumPy and includes additional features such as probability distributions, optimisation, signal processing and sparse matrices. Together these two libraries are used as the basis for almost all machine learning algorithms, and are vital for anyone developing their own custom-made approaches.
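As a rough illustration of how the two libraries fit together, the sketch below solves a small linear system with NumPy and then touches three of the SciPy features mentioned above. The matrix, vector and objective function are arbitrary examples chosen for this article.

```python
import numpy as np
from scipy import optimize, sparse, stats

# A small 2-D array (matrix) and a right-hand-side vector.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve A x = b; NumPy hands this off to a LAPACK routine.
x = np.linalg.solve(A, b)

# Probability distributions: P(Z <= 0) for a standard normal variable.
p = stats.norm.cdf(0.0)

# Sparse matrices: store only the non-zero entries of a 3x3 identity matrix.
identity = sparse.csr_matrix(np.eye(3))

# Optimisation: find the minimum of a simple one-dimensional function.
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

print(x, p, identity.nnz, result.x)
```

The solution of the linear system is x = [2, 3], which you can check by substituting back into the two equations.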
Pandas
Pandas stands for the Python Data Analysis Library. It provides a DataFrame class (a matrix with column headers and row indices) which is similar to the data frame found in R. Pandas offers powerful tools for manipulating DataFrames, including merging, joining, filling in missing data and time series functionality, and it can read a variety of file formats.
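Here is a small sketch of the operations just described: a merge, filling in a missing value, a group-by and a simple time series resample. The tables and column names (store, revenue, region) are invented for the example.

```python
import pandas as pd

# Two small tables that share a "store" column; one revenue value is missing.
sales = pd.DataFrame({
    "store": ["A", "B", "C"],
    "revenue": [100.0, None, 250.0],
})
regions = pd.DataFrame({
    "store": ["A", "B", "C"],
    "region": ["north", "south", "north"],
})

# Merge (join) the two tables on the shared column.
merged = sales.merge(regions, on="store", how="left")

# Fill the missing revenue with the mean of the known values.
merged["revenue"] = merged["revenue"].fillna(merged["revenue"].mean())

# Total revenue per region.
by_region = merged.groupby("region")["revenue"].sum()

# Time series: four daily observations summed into two-day bins.
daily = pd.Series([1, 2, 3, 4],
                  index=pd.date_range("2024-01-01", periods=4, freq="D"))
two_day = daily.resample("2D").sum()

print(merged)
print(by_region)
print(two_day)
```

Filling with the mean is only one of several possible strategies; which one is appropriate depends on your data.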
Scikit-learn
Scikit-learn is the most popular machine learning library for Python (9100 stars on GitHub at the time of writing), and is developed at a rapid pace. It currently contains most of the well-known machine learning algorithms, including support vector machines, random forests, boosting, k-means clustering and principal components analysis. It does not, however, include time series analysis; for that you can try Statsmodels.
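To give a feel for the library, here is a minimal sketch that trains a random forest on the iris dataset bundled with scikit-learn. The split size, number of trees and random seed are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a random forest and measure its accuracy on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow this same fit and score (or predict) pattern, so swapping in a support vector machine or a boosting model usually means changing only the line that creates the model.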
Cython
If you need to develop your own machine learning algorithms, then Cython can help make your code as fast and efficient as possible. Cython is a compiler for Python and Python-like code (the Cython programming language). It can be used to interface Python with C/C++ and can produce large speedups over pure Python. Although it is not quite as fast as pure C/C++, it comes close.
Apache Spark MLlib
For truly huge amounts of data, Apache Spark MLlib is a great choice, since it contains a number of distributed machine learning algorithms and can be used from Python through PySpark. Its impressive array of algorithms includes support vector machines, k-means clustering, principal components analysis, random forests and recommendation algorithms.
We have given some reasons why Python is the language of choice, both in a general sense and in particular for machine learning on small and big data. If there are any important libraries that you think we left out, please leave a comment below!