Performances and scikit-learn (2/4)

Hardware scalability issue: the k-neighbors search example

Published on the: 17.12.2021
Last modified on the: 12.06.2022
Estimated reading time: ~ 4 min.

\(k\)-nearest neighbors search in scikit-learn

\(k\)-nearest neighbors search is at the base of many implementations used within scikit-learn.

For instance, it is used in Affinity Propagation, BIRCH, Mean Shift, OPTICS, Spectral Clustering, TSNE, KNeighbors Regressor and KNeighbors Classifier.

Whilst many libraries implement approximated versions of \(k\)-nearest neighbors search to speed-up the execution time1, scikit-learn’s implementation aims at returning the exact \(k\)-nearest neighbors.

Computing chunks of the distance matrix computations

The general steps for \(k\)-nearest neighbors search are:

  • Compute \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\), the distance matrix between the vectors of two arrays \(\mathbf{X}\) and \(\mathbf{Y}\).
  • Reduce rows of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) appropriately for the given algorithm: for instance, the adapted reduction for \(k\)-nn search is to return the \(k\) smallest indices of values in an ordered set. In what follows, we call this reduction \(\texttt{argkmin}\).
  • Perform extra operations on results of the reductions (here sort values).

Generally, one does not compute \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) entirely because its space complexity is \(\Theta(n_x \times n_y)\). Practically, \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) does not fit in RAM for sound datasets.

Thus, in practice, one computes chunks of this dataset and reduced them directly. This is what is currently done as of scikit-learn 1.02.

Current issues

The current implementations relies on a general parallelisation scheme using higher-levels functions with joblib.Parallel.

This technically is not the most efficient: working at this level with views on numpy arrays moves large chunks of data back and forth several time between the RAM and the CPUs’ caches, hardly make use of caches, and allocate temporary results.

Moreover, the cost of manipulating those chunks of data in the CPython interpreter causes a non negligeable overhead because they are associated to Python objects which are bound to the Global Lock Interpreter for their reference counting.

Hence, this does not allow getting proper hardware scalability: an ideal parallel implementation of an algorithm would run close to \(n\) times faster when running on \(n\) threads on \(n\) CPU cores compared to sequentially on \(1\) thread.

For instance, the current implementation of \(k\)-nearest neighbors search of scikit-learn cannot efficiently leverage all the available CPUs on a machine — as shown by this figure bellow.

Scalability of kneighbors as of scikit-learn 1.0

When using \(8\) threads, it only run \(\times 2\) faster than the sequential implementation and adding more threads and CPUs beyond \(8\) does not help getting better performances.

In the next blog, we go over the design of a new implementation to improve the scalability of \(k\)-nn search.


Notes

  1. Approximate nearest neighbors search algorithms comes in many different flavours, and there’s even a benchmark suite comparing them!. ↩
  2. KNearestNeighbors.kneighbors is the interface to look for. ↩