High-level overview of the scikit-learn stack
scikit-learn is built on top of the scientific Python ecosystem.
Its most notable dependencies are:
- numpy, for generic operations on arrays
- scipy, for generic operations in linear algebra, among others
- joblib, for back-end-agnostic, high-level parallelism
This comes in handy because it allows high expressiveness in a few lines.
It also allows delegating most of the complex operations
to well-tested third-party libraries. For instance, efficient
linear algebra routines can be delegated to BLAS, LAPACK, and ARPACK
via numpy.linalg, scipy.linalg, and scipy.sparse.linalg.
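As a small illustration (a toy example added here, not from the original text), solving a linear system through scipy.linalg dispatches the actual work to LAPACK:

```python
import numpy as np
from scipy import linalg

# Toy example: solve A x = b. scipy.linalg.solve dispatches to LAPACK
# routines under the hood, so the heavy lifting runs in compiled code.
rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 5 * np.eye(5)  # well-conditioned square matrix
b = rng.random(5)

x = linalg.solve(A, b)

# The residual of the solution should be numerically negligible.
assert np.allclose(A @ x, b)
```

A few lines of Python thus express what is, underneath, a carefully optimized Fortran routine.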
This simple stack keeps most of scikit-learn's implementations concise.
Moreover, this simple stack also allowed setting up simple conventions which make the code-base accessible to new contributors.
The main reasons for limited performance
This stack is simple, but it is not tailored for optimal performance, for several reasons.
CPython internals
CPython, the main implementation of Python, uses a global mutex called the Global Interpreter Lock (GIL). It restricts the interpreter to executing bytecode on a single thread at a time, which blocks thread-based parallelism for CPU-bound Python code.
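As a hedged illustration (toy code added here, not from the original text), the following shows threads running a pure-Python, CPU-bound task: they complete correctly, but because of the GIL their bytecode is interleaved rather than executed in parallel, so threading brings no speed-up for this kind of work.

```python
import threading

# Toy CPU-bound task: a pure-Python counting loop. Under the GIL, only one
# thread executes Python bytecode at any given moment.
def count(n, results, i):
    total = 0
    for _ in range(n):
        total += 1
    results[i] = total

results = [0, 0]
threads = [
    threading.Thread(target=count, args=(100_000, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads finish with correct results, but the wall-clock time is
# roughly the same as running the two loops sequentially.
assert results == [100_000, 100_000]
```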
Memory-hierarchy suboptimal implementations
numpy supports high-level operations, but these come at the cost of intermediate,
dynamically allocated temporary arrays.
Moreover, this pattern is inefficient from a memory perspective: during execution, blocks of data are moved back and forth between the RAM and the various CPU caches several times, making poor use of the caches.
For instance, consider this minimal toy example:
import numpy as np
A = np.random.rand(100)
B = np.random.rand(100)
C = np.random.rand(100)
X = A + B + C
The following is performed:
- a first temporary array is allocated for A + B
- a second array is allocated to store A + B + C, and the first temporary array is discarded

This is costly because memory allocation on the heap is slow and involves blocking calls at the operating-system level.
Moreover, such high-level operations on X:
- come with more data moves between the RAM and the CPU than needed to compute the elements of X¹
- hardly make use of the memory hierarchy and the sizes of the caches
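The intermediate temporary can be avoided with NumPy's out= parameter, which writes results into a preallocated buffer. A minimal sketch (an illustration added here, not code from the original text):

```python
import numpy as np

A = np.random.rand(100)
B = np.random.rand(100)
C = np.random.rand(100)

# Naive version: `A + B + C` allocates a temporary array for A + B,
# then another array for the final result.
X_naive = A + B + C

# In-place version: reuse a single preallocated buffer, so no
# intermediate temporary array is allocated on the heap.
X = np.empty_like(A)
np.add(A, B, out=X)   # X = A + B, written directly into X
np.add(X, C, out=X)   # X = (A + B) + C, still in the same buffer

assert np.allclose(X, X_naive)
```

This trades readability for control over allocations, which is precisely the kind of trade-off high-level numpy code usually avoids making explicit.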
No simple datastructures
The Python ecosystem comes with a variety of high-level containers, such as numpy arrays and pandas DataFrames, but no low-level data structures that can be used to implement some algorithms efficiently, from both a theoretical and a technical perspective.
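For instance (a hedged sketch added here, not from the original text), maintaining the k smallest distances is a common building block of nearest-neighbors search. In pure Python one can do it with heapq, but heapq operates on Python lists of boxed floats; there is no direct Python-level equivalent of a contiguous, C-level max-heap over raw doubles, as used in compiled implementations.

```python
import heapq

def k_smallest(values, k):
    """Keep the k smallest values seen so far, using a simulated max-heap.

    heapq only provides a min-heap, so values are negated: the largest
    kept value sits at heap[0] and is evicted when a smaller one arrives.
    """
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif -heap[0] > v:
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

# The three smallest values are retained, in sorted order.
assert k_smallest([5.0, 1.0, 4.0, 2.0, 3.0], 3) == [1.0, 2.0, 3.0]
```

The logic is simple, but every element is a heap of Python object pointers, which is far from the cache-friendly, fixed-width layout a compiled language would use.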
Improving performance: the plan
The general plan to improve the performance of a library:
- Measuring speed-ups and profiling implementations
- Identifying cases that are worth improving
- Understanding what’s wrong with those cases
- Proposing a new implementation for those cases
This plan can be performed iteratively.
Measuring speed-ups and profiling
Mathis Batoul worked on setting up benchmarks to compare scikit-learn's current implementations to other libraries in the ecosystem, such as XGBoost and CatBoost. This setup also embeds some profiling results.
The results of those benchmarks are accessible online here: https://mbatoul.github.io/sklearn_benchmarks/
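A minimal profiling sketch (added here for illustration; this is not the benchmark suite linked above) using the standard-library profiler on a toy pairwise-distance computation:

```python
import cProfile
import io
import pstats

import numpy as np

def pairwise_sq_dists(X, Y):
    # Broadcasting: (n, 1, d) - (1, m, d) -> (n, m, d), then reduce over d.
    diff = X[:, None, :] - Y[None, :, :]
    return np.einsum("ijk,ijk->ij", diff, diff)

X = np.random.rand(200, 10)
Y = np.random.rand(300, 10)

# Profile a single call and collect statistics.
profiler = cProfile.Profile()
profiler.enable()
D = pairwise_sq_dists(X, Y)
profiler.disable()

# Report the most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()

assert D.shape == (200, 300)
```

Such reports point at the functions where time is actually spent, which is the information the "identify cases worth improving" step of the plan relies on.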
From those benchmarks, a few notable implementations stand out as candidates for improvement, especially:
LogisticRegression
LinearRegression
KNeighborsClassifier
KNeighborsRegressor
In what follows, we are going to focus on \(k\)-nearest neighbors search, used
as a base routine for KNeighborsClassifier and KNeighborsRegressor.
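To fix ideas, here is a hedged sketch of brute-force \(k\)-nearest neighbors search in numpy (an illustration added here, not scikit-learn's actual implementation):

```python
import numpy as np

def knn_brute_force(X_train, X_query, k):
    """Return indices of the k nearest training points for each query point.

    Brute-force sketch: compute all squared Euclidean distances, then
    select the k smallest per row with argpartition (linear-time selection).
    """
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, via matrix products.
    sq_train = (X_train ** 2).sum(axis=1)
    sq_query = (X_query ** 2).sum(axis=1)
    d2 = sq_query[:, None] - 2.0 * X_query @ X_train.T + sq_train[None, :]
    # argpartition places the k smallest first (unordered); sort only those k.
    part = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rows = np.arange(X_query.shape[0])[:, None]
    order = np.argsort(d2[rows, part], axis=1)
    return part[rows, order]

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
X_query = np.array([[0.4]])
# The two points nearest to 0.4 are 0.0 (index 0) then 1.0 (index 1).
assert knn_brute_force(X_train, X_query, 2).tolist() == [[0, 1]]
```

Note that the distance-matrix expression above allocates several large temporaries, exactly the pattern discussed earlier, which is one reason this routine benefits from a lower-level implementation.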
Notes
- In this example, there are exactly 6 moves instead of 4. ↩