High-level overview of the scikit-learn dependencies
scikit-learn is mainly written in Python and is built on top of some core libraries of the scientific Python ecosystem.
This ecosystem allows high expressiveness and interactivity: one can perform complex operations in a few lines of code and get the results straight away.
It also allowed setting up simple conventions which make the code base's algorithms easy to understand and improve for new contributors.
It also allows delegating most of the complex operations to well-tested third-party libraries. For instance, calls to functions implemented in numpy.linalg, scipy.linalg, and scipy.sparse.linalg are delegated to BLAS, LAPACK, and ARPACK interfaces.
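For instance, the following calls all end up in compiled BLAS/LAPACK/ARPACK routines (a minimal sketch; the routine names given in the comments are indicative and depend on the installed backend):

import numpy as np
import scipy.linalg
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.default_rng(0)
A = rng.random((1000, 1000))
b = rng.random(1000)

# Dense linear solve: dispatched to a LAPACK routine (e.g. *gesv).
x = np.linalg.solve(A, b)

# Singular value decomposition: dispatched to LAPACK (e.g. *gesdd).
U, s, Vt = scipy.linalg.svd(A)

# A few eigenpairs of a large sparse symmetric matrix: ARPACK.
S = scipy.sparse.random(10_000, 10_000, density=1e-4, random_state=0)
S = S + S.T  # symmetrise so that eigsh applies
vals, vecs = scipy.sparse.linalg.eigsh(S, k=5)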
Main reasons for limited performance
The PyData stack is simple but is not tailored for optimal performance for several reasons.
CPython internals
CPython — the main implementation of Python — is slow.
First, CPython has an interpreter: there is a cost in converting Python instructions into an intermediate representation — the bytecode — and in executing the instructions by interpreting that bytecode.
Secondly, nearly every value in CPython is boxed into a PyObject — implemented as a C struct. As such, simple operations (like adding two floats) come with a non-negligible dispatch overhead, as the interpreter has to check the type, which is unknown in advance.
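To make this overhead concrete, here is a minimal sketch (timings are indicative and machine-dependent) comparing a pure-Python loop with its vectorised numpy equivalent:

import timeit
import numpy as np

n = 1_000_000
a = list(range(n))
b = np.arange(n)

# Pure-Python loop: every addition goes through PyObject dispatch.
python_time = timeit.timeit(lambda: [x + 1 for x in a], number=10)

# Vectorised equivalent: a single typed loop in compiled C.
numpy_time = timeit.timeit(lambda: b + 1, number=10)

print(f"pure Python: {python_time:.3f}s, numpy: {numpy_time:.3f}s")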
Thirdly, for memory management, CPython relies on a global mutex on its interpreter called the Global Interpreter Lock1. This mechanism comes in handy, but it restricts computations in most cases to sequential execution in a single thread, removing the benefit of using threads.
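A minimal sketch illustrating this (timings are machine-dependent): running a CPU-bound pure-Python function in two threads takes about as long as running it twice sequentially, because the GIL serialises the threads:

import threading
import time

def busy(n):
    # CPU-bound pure-Python work: holds the GIL while running.
    s = 0
    for i in range(n):
        s += i * i
    return s

start = time.perf_counter()
busy(5_000_000)
busy(5_000_000)
sequential = time.perf_counter() - start

threads = [threading.Thread(target=busy, args=(5_000_000,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# On CPython, threaded is about the same as (or worse than)
# sequential: the GIL serialises the two threads.
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")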
Suboptimal use of the memory hierarchy
numpy supports high-level operations but this comes with intermediate and dynamically-allocated arrays.
Moreover, this pattern is inefficient from a memory perspective: during the execution, blocks of data are moved back and forth between the RAM and the different CPU caches several times, not making optimal use of the caches.
For instance, consider this minimalistic toy example:
import numpy as np
A = np.random.rand(100_000_000)
B = np.random.rand(100_000_000)
X = np.exp(A + B)
The following is performed:
- a first temporary array is allocated for A + B
- a second array is allocated to store np.exp(A + B), and the first temporary array is discarded
This temporary allocation makes the implementation suboptimal as memory allocation on the heap is slow.
Furthermore, high-level operations on X come with more data moves between the RAM and the CPU than needed to compute the elements of X, and hardly make use of the memory hierarchy and the size of the caches.
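As a side note, one way to limit such temporary allocations within numpy is to reuse a preallocated buffer through the out= parameter of ufuncs. A minimal sketch, using smaller arrays for illustration:

import numpy as np

A = np.random.rand(10_000_000)
B = np.random.rand(10_000_000)

# Preallocate the output buffer once.
X = np.empty_like(A)
np.add(A, B, out=X)  # X = A + B, without a temporary array
np.exp(X, out=X)     # X = exp(X), computed in place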
No “bare-metal” data-structures
The Python ecosystem comes with a few high-level containers such as numpy arrays and pandas DataFrames.
Contrary to other languages' standard libraries (like the one of C++), no "bare-metal" data structures, such as heaps or memory-contiguous resizable buffers (as implemented in C++ by std::priority_queue and std::vector), are available to implement some algorithms efficiently from both a computational-complexity and a technical perspective.
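For comparison, Python's closest built-in equivalent of a heap, heapq, operates on lists of boxed PyObjects rather than on raw contiguous values, as this minimal sketch shows:

import heapq

# heapq provides a binary heap over a Python list: every element
# is a boxed PyObject, so each push/pop pays pointer-indirection
# and dispatch costs that a C++ std::priority_queue over raw
# doubles would avoid.
heap = []
for value in [0.5, 0.1, 0.9, 0.3]:
    heapq.heappush(heap, value)

smallest = heapq.heappop(heap)  # 0.1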
Cython: combining the conciseness of Python and the speed of C
In brief, Cython allows transpiling a superset of Python to C code and makes it possible to use code written in C or C++, which makes bypassing some of CPython's internals possible. Moreover, Cython allows using OpenMP, an API providing lower-level parallelism primitives for implementations written in C or Fortran2.
In most cases, the features provided by Cython are sufficient to reach optimal implementations for many scientific algorithms for which static task scheduling — at the level of C via OpenMP — is the most natural and optimal choice. Plus, its syntax makes this language expressive enough to get nearly optimal performance while keeping the instructions short and concise — which is a real advantage for developers coming from Python who are looking for performance, and a relief for C or C++ developers3.
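As an illustration, here is a minimal sketch of what a Cython kernel parallelised with OpenMP could look like (the module and function names are hypothetical, and the extension must be compiled with OpenMP enabled):

# hypothetical file: parallel_sum.pyx
cimport cython
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum(double[::1] x):
    # Typed variables compile down to plain C variables.
    cdef Py_ssize_t i
    cdef double total = 0.0
    # prange releases the GIL and distributes the iterations across
    # OpenMP threads; total is treated as a reduction variable.
    for i in prange(x.shape[0], nogil=True):
        total += x[i]
    return total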
As such, many algorithms in scikit-learn are implemented in Cython for performance, some of which use OpenMP when possible. This is for instance the case of KMeans, which was initially written in Python using numpy and was rewritten in Cython by Jérémie du Boisberranger, improving the execution time by a factor of 5 for this algorithm4.
In the following posts, the case of \(k\)-nearest neighbors search — the base routine for KNeighborsClassifier, KNeighborsRegressor, and other scikit-learn interfaces — is covered, and a new Cython implementation is proposed.
Notes
- For more information about the GIL, see this reference from the Python Wiki. ↩
- For more information on Cython, see its documentation. ↩
- Compared to C or C++, Cython is also less verbose and can be integrated into Python build systems more easily. ↩
- For more information about KMeans, see the original contribution, scikit-learn#11950, and this blog post. ↩