High-level overview of the scikit-learn dependences
scikit-learn is mainly written in Python and is built on top of some core libraries of the scientific Python ecosystem.
This ecosystem allows high expressiveness and interactivity: one can perform complex operations in a few lines of code and get the results straight away.
It also allowed setting up simple conventions which makes the code-base algorithms easy to understand and to improve for new contributors.
It also allows delegating most of the complex operations
to well tested third party libraries. For instance, calls
to functions implemented in
are delegated to
and ARPACK interfaces.
Main reasons for limited performance
The PyData stack is simple but is not tailored for optimal performance for several reasons.
CPython — the main implementation of Python — is slow.
First, CPython has an interpreter: there’s a cost in converting Python instructions into another intermediate representation — the byte-code — and executing the instructions by interpreting their byte-code.
Secondly, nearly every value in CPython is boxed into a
— implemented as a C struct. As such, simple operations
(like adding two floats) come with a non-negligeable dispatch
overhead as the interpreter has to check the type which is unknown
Thirdly, CPython for memory management relies on a global mutex on its interpreter called the Global Interpreter Lock1. This mechanism comes in handy but computations are restricted in most cases to sequential execution in a single thread, removing the benefit of using threads.
Memory-hierarchy suboptimal implementations
numpy supports high-level operations but this comes with intermediate
and dynamically-allocated arrays.
Moreover, this pattern is inefficient from a memory perspective: during the execution, blocks of data are moved back and forth between the RAM and the different CPU caches several times, not making optimally use of the caches.
For instance, based on this minimalistic toy example:
import numpy as np A = np.random.rand(100_000_000) B = np.random.rand(100_000_000) X = np.exp(A + B)
The following is performed:
- a first temporary array is allocated for
A + B
- a second array is allocated to store
np.exp(A + B)and the first temporary array is discarded
This temporary allocation makes the implementation suboptimal as memory allocation on the heap is slow.
Furthermore, high level operations on
X come with more data
moves between the RAM and the CPU than needed to compute the
X and hardly make use of the memory hierarchy
and the size of the caches.
No “bare-metal” data-structures
The Python ecosystem comes with a few high-level containers such as numpy arrays, and pandas DataFrames.
Contrarily to other languages’ standard libraries (like the one of
C++), no “bare-metal” data-structures, including heaps, or
memory-contiguous resizable buffers (as implemented in C++ by
are available to implement some algorithms efficiently
from a computational complexity and a technical perspectives.
Cython: combining the conciseness of Python and the speed of C
In brief, Cython allows transpiling a superset of Python to C code and allows using code which was written in C or C++, which makes bypassing the some of CPython’s internals possible. Moreover, Cython allows using OpenMP, an API which allows using lower-level parallelism primitives for implementations written in C or Fortran2.
In most cases, features provided by Cython are sufficient enough to reach optimal implementations for many scientific algorithms for which a static tasks scheduling — at the level of C via OpenMP — is the most natural and optimal one. Plus, its syntax makes this language expressive enough to get nearly optimal performance while keeping the instructions short and concise — which is a real advantage for developers coming from Python which are looking for performance and a relief for C or C++ developers3.
As such, many algorithms in
scikit-learn are implemented in Cython for performance reasons, some of which use OpenMP when possible. This is for instance the case of
KMeans which was initially written in Python using numpy and which was rewritten in Cython by Jérémie du Boisberranger, improving the execution time by a factor of 5 for this algorithm4.
In the following posts, the case of \(k\)-nearest neighbors search — the base routine
KNearestNeighborsRegressor and others scikit-learn interfaces — is covered
and a new Cython implementation is proposed.
- For more information about the GIL, see this reference from the Python Wiki. ↩
- For more information on Cython, see its documentation. ↩
- Compared to C or C++, Cython is also less verbose and can be integrated Python build system more easily. ↩
- For more information about
KMeans, see the original contribution,
scikit-learn#11950, and this blog post. ↩