## \(k\)-nearest neighbors search in scikit-learn

\(k\)-nearest neighbors search is at the base of many implementations used within scikit-learn.

For instance, it is used in Affinity Propagation, BIRCH, Mean Shift, OPTICS, Spectral Clustering, TSNE, KNeighbors Regressor and KNeighbors Classifier.

Whilst many libraries implement approximated versions of \(k\)-nearest neighbors search to speed up
execution^{1}, scikit-learn's implementation aims at returning the exact \(k\)-nearest neighbors.

## Computing chunks of the distance matrix

The general steps for \(k\)-nearest neighbors search are:

- Compute \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\), the distance matrix between the vectors of two arrays \(\mathbf{X}\) and \(\mathbf{Y}\).
- Reduce rows of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) appropriately for the given algorithm: for instance, the reduction adapted to \(k\)-nn search returns, for each row, the indices of its \(k\) smallest values. In what follows, we call this reduction \(\texttt{argkmin}\).
- Perform extra operations on the results of the reduction (here, sort values).
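To make the reduction concrete, here is a minimal NumPy sketch of \(\texttt{argkmin}\); the function name and the use of squared Euclidean distance are illustrative choices, not scikit-learn's internals:

```python
import numpy as np

def argkmin(X, Y, k):
    """For each row of X, return the indices of its k nearest
    neighbors in Y (squared Euclidean distance), sorted by distance."""
    # Full (n_x, n_y) distance matrix -- exactly what chunking avoids.
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # argpartition places the k smallest entries of each row first...
    idx = np.argpartition(D, kth=k - 1, axis=1)[:, :k]
    rows = np.arange(X.shape[0])[:, None]
    # ...then only those k entries need to be sorted.
    return idx[rows, np.argsort(D[rows, idx], axis=1)]

X = np.array([[0.0, 0.0], [10.0, 10.0]])
Y = np.array([[0.0, 1.0], [9.0, 9.0], [5.0, 5.0]])
print(argkmin(X, Y, k=2))  # [[0 2]
                           #  [1 2]]
```

Using `argpartition` before sorting keeps the per-row cost linear in \(n_y\) rather than \(n_y \log n_y\).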

Generally, one does not compute \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) entirely because its space complexity is \(\Theta(n_x \times n_y)\). Practically, \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) does not fit in RAM for realistically sized datasets.
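To get a sense of the orders of magnitude involved, with float64 entries (8 bytes each) the full matrix for a million vectors on each side would need terabytes of memory:

```python
# Each float64 entry of the distance matrix takes 8 bytes.
n_x = n_y = 1_000_000
matrix_bytes = n_x * n_y * 8
print(matrix_bytes / 1e12)  # 8.0 (terabytes), far beyond typical RAM
```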

Thus, in practice, one computes chunks of this matrix and reduces them directly.
This is what is currently done as of scikit-learn 1.0^{2}.
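This chunk-and-reduce pattern can be reproduced with scikit-learn's `sklearn.metrics.pairwise_distances_chunked` utility; the reduction function below is a sketch, and the array sizes and `working_memory` value are arbitrary:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
Y = rng.standard_normal((300, 8))
k = 5

def reduce_argkmin(D_chunk, start):
    # Reduce each chunk of rows of D to the indices of its k smallest
    # values; the full (200, 300) matrix need not be materialised at once.
    idx = np.argpartition(D_chunk, kth=k - 1, axis=1)[:, :k]
    rows = np.arange(D_chunk.shape[0])[:, None]
    return idx[rows, np.argsort(D_chunk[rows, idx], axis=1)]

# working_memory caps the size (in MiB) of each distance chunk.
argkmin_indices = np.vstack(list(
    pairwise_distances_chunked(X, Y, reduce_func=reduce_argkmin,
                               working_memory=1)
))
print(argkmin_indices.shape)  # (200, 5)
```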

## Current issues

The current implementation relies on a general parallelisation scheme using higher-level functions with `joblib.Parallel`.

Technically, this is not the most efficient: working at this level with views on numpy arrays moves large chunks of data back and forth several times between the RAM and the CPUs' caches, hardly makes use of caches, and allocates temporary results.

Moreover, the cost of manipulating those chunks of data in the CPython interpreter causes a non-negligible overhead because they are associated with Python objects which are bound to the Global Interpreter Lock for their reference counting.
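The scheme criticised above looks roughly like the following sketch; the chunk size, helper names, and thread backend are illustrative, not scikit-learn's actual code:

```python
import numpy as np
from joblib import Parallel, delayed

def _argkmin_one_chunk(X_chunk, Y, k):
    # Each task materialises a temporary distance chunk, a Python-level
    # allocation whose reference counting is serialised by the GIL.
    D = ((X_chunk[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argpartition(D, kth=k - 1, axis=1)[:, :k]
    rows = np.arange(D.shape[0])[:, None]
    return idx[rows, np.argsort(D[rows, idx], axis=1)]

def parallel_argkmin(X, Y, k, chunk_size=64, n_jobs=2):
    # Views on X are handed to joblib tasks; the interpreter mediates
    # every chunk object, which contributes to the overhead above.
    chunks = [X[i:i + chunk_size] for i in range(0, X.shape[0], chunk_size)]
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(_argkmin_one_chunk)(chunk, Y, k) for chunk in chunks
    )
    return np.vstack(results)

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((200, 8)), rng.standard_normal((300, 8))
print(parallel_argkmin(X, Y, k=3).shape)  # (200, 3)
```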

Hence, this does not allow getting proper *hardware scalability*: an ideal parallel
implementation of an algorithm would run close to \(n\) times faster when running on
\(n\) threads on \(n\) CPU cores compared to sequentially on \(1\) thread.

For instance, the current implementation of \(k\)-nearest neighbors search in scikit-learn cannot efficiently leverage all the available CPUs on a machine, as shown by the figure below.

When using \(8\) threads, it only runs \(2\times\) faster than the sequential implementation, and adding more threads and CPUs beyond \(8\) does not improve performance.

In the next blog post, we go over the design of a new implementation that improves the scalability of \(k\)-nn search.

## Notes

- Approximate nearest neighbors search algorithms come in many different flavours, and there's even a benchmark suite comparing them! ↩
- `KNearestNeighbors.kneighbors` is the interface to look for. ↩