Context
We have seen that \(\text{argkmin}\) is the reduction which is performed on pairwise distances for \(k\)nearest neighbors search.
Yet, there exist other reductions over pairwise distances (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.) which are at the core of computational foundations of many machine learning algorithms.
This blog post presents a design which takes into account the requirements of the existing implementations
to introduce a set of new abstraction to implement reductions over pairwise distances: PairwiseDistancesReduction
. This set of interfaces aims at reimplementing pattern that are similar to the \(k\)nn search in Cython, to improving the
performance of its computational foundations, and thus the ones of its userfacing interfaces.
To our knowledge, though some projects like KeOps implement those patterns efficiently for GPUs, no project implements such operations for CPUs efficiently.
This blog post won’t introduce every technical details for the sake of conciseness and to respect the single source of truth principle as much as possible. If the reader is interested in knowing those, they can read the first proposed implementations given in
scikitlearn#22134
.
Some notation and wording formalism
In what follows, the following notations are used:
 \(p\): the dimension of vectors
 \([n] \triangleq \{0, \cdots, n  1\}\)
 \(\mathbf{X} \in \mathbb{R}^{n_x \times p}\): a first dataset
 \(\mathbf{X}_{i\cdot} \in \mathbb{R}^{p}\): the \(i\)th vector of \(\mathbf{X}\)
 \(\mathbf{Y} \in \mathbb{R}^{n_y \times p}\): a second dataset
 \(\mathbf{Y}_{j\cdot} \in \mathbb{R}^{p}\): the \(j\)th vector of \(\mathbf{Y}\)
 \(c\): the chunk size, i.e. the number of vectors in a chunk (a group of adjacent vectors)
 \(c_x \triangleq \left\lceil \frac{n_x}{c} \right\rceil\), the number of chunks for \(\mathbf{X}\)
 \(c_y \triangleq \left\lceil \frac{n_y}{c} \right\rceil\), the number of chunks for \(\mathbf{Y}\)
 \((\mathbf{X}_c^{(l)})_{l \in [c_x]}\): the ordered family of all the chunks of \(\mathbf{X}\)
 \((\mathbf{Y}_c^{(k)})_{k \in [c_y]}\): the ordered family of all the chunks of \(\mathbf{Y}\)
 \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)} \triangleq \left(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}\right)_{(l,k) \in [c_x] \times [c_y] }\): the ordered family of all the pairs of chunks
 \(d\), the distance metric to use
 \(\mathbf{D}_d(\mathbf{A}, \mathbf{B}) \in \mathbf{R}^{n_a \times n_b}\) the distance matrix for \(d\) between vectors of two matrices \(\mathbf{A} \in \mathbb{R}^{n_a \times p}\) and \(\mathbf{B} \in \mathbb{R}^{n_b \times p}\):
 \(k\): parameter for the \(\text{argkmin}\) operation at the base of \(k\) nearest neighbors search
Moreover, the terms “samples” and “vectors” will also be used interchangeably.
Requirements for reductions over pairwise distances
The following requirements are currently supported within scikitlearn’s implementations:
 Support for 32bit datasets pairs and 64bit datasets pairs
 Support for fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 dense \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Support all the distance metrics as defined via
sklearn.metrics.DistanceMetric
 Parallelise computations effectively on all cores
 Prevent threads’ oversubscription^{1} (by OpenMP, joblib, or any BLAS implementations)
 Implement adapted operations for each reduction (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.)
 Support generic returned values for reductions (varying number, varying types, varying shapes, etc.)
 Optimise the Euclidean distance computations
Proposed design
The following design proposes treating the given requirements as much independently as possible.
DatasetsPair
: an abstract class for manipulating datasets^{2}
This allows:
 Supporting 32bit datasets pairs and 64bit datasets pairs
 Supporting fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs via concrete implementation, i.e.:
DenseDenseDatasetsPair
SparseDenseDatasetsPair
DenseSparseDatasetsPair
SparseSparseDatasetsPair
 Supporting all the distance metrics as defined via
sklearn.metrics.DistanceMetric
Internally, a DatasetsPair
wraps \((\mathbf{X}, \mathbf{Y}, d)\) and exposes an interface which allows computing \(d(\mathbf{X}_{i\cdot}, \mathbf{Y}_{j\cdot})\) for a given tuple \((i, j)\).
PairwiseDistancesReduction
: an abstract class defining parallelisation templates
This allows:
 Parallelising computations effectively on all cores
 Preventing threads’ oversubscription (by OpenMP, joblib, or any BLAS implementations)
 Supporting generic returned values for reductions (varying number, varying types, varying shapes, etc.)
This is made possible by:
 setting up a general interface which performs the parallelisation of computations on \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\): two strategies of parallelisation are implemented as it’s worth parallelising on \(\mathbf{X}\) or on \(\mathbf{Y}\) depending on the context. To choose one or the other strategy, a simple heuristic comparing \(c_x\) and \(c_y\) with regard to the number of available threads is used and is sufficient.
 using a
threadpoolctl.threadpool_limits
context at the start of the execution of the generic parallel template  having a flexible Python interface to return results and have the parallel computations be defined agnostically from the datastructures being modified in concret classes^{3}.
The critical areas of the computations — that is the computations of the chunk of the distance matrix associated to \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) and its reduction — is made abstract. This way, when defining a concrete PairwiseDistancesReduction
, a sole method is to define up to some eventual python helpers methods^{4}.
PairwiseDistancesReductionArgKmin
: a first concrete PairwiseDistancesReduction
for \(\text{argkmin}\)
For this reduction, one can simply use maxheaps which are by design doing the work of keeping the first \(k\) minimum values with their indices. scikitlearn current implementation of maxheaps is simple, readable and efficient^{5} and can be used to manipulate the datastructures that we need^{6}.
Specialising reductions for the Euclidean distance metric
Generally, distances associated to neighbors aren’t returned to the user. This allows some optimisation.
In the case of the Euclidean distance metric, one can use the Squared Euclidean distance metric as a proxy: it is less costly, it preserves ordering and it can be computed efficiently.
Indeed, \(\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\) — the \((l,k)\)th chunk of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) — can be computed as follows:
This allows using two optimisations:

\(\left[\Vert \mathbf{X}_{i\cdot}\Vert_2^2\right]_{i \in [n_x]}\) and \(\left[\Vert \mathbf{Y}_{j\cdot}\Vert_2^2\right]_{j \in [n_y]}\) can be computed once and for all at the start and be cached. Those two vectors will be reused on each chunk of the distance matrix.

More importantly, \( 2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top\) can be computed using the GEneral Matrix Multiplication from BLAS Level 3 — hereinafter refered to as GEMM. This allows getting the maximum arithmetic intensity for the computations, making use of recent BLAS backends implementing vectorised kernels, such as OpenBLAS.
For instance FastEuclideanPairwiseDistancesArgkmin
is the main specialisation of PairwiseDistancesArgkmin
for the
Euclidean distance metric. This specialisation solely recomputes the actual Euclidean distances when the caller asked them to be returned.
Interfacing PairwiseDistancesReductions
with scikitlearn’s algorithms
As of now, the overall design was covered without mentionning ways they can be plugged with the existing scikitlearn algorithms, progressively migrating most algorithms’ backend to those new implementations.
Furthermore, in the future, specialised implementations for various vendors of CPUs and GPUs can be created.
In this case, we want to have such specialised implementations separated from scikitlearn source code (e.g. by having them in optional and vendorspecific packages) so as to keep PairwiseDistancesReductions
interfaces vendorspecialisationagnostic but still be able to dispatch the computations to the most adapted and available implementations.
To touch two birds with one tiny stone^{7}, the new implementations can be used conditionally to the yetsupported cases based on provided datasets and executed agnostically from them.
This can be implemented by a PairwiseDistancesReduction.{is_usable_for,compute}
pattern:
PairwiseDistancesReduction.is_usable_for
returnsTrue
if any implementation for the provided \((\mathbf{X}, \mathbf{Y}, d)\) can be used. If none is available, the caller can default to the current implementation within scikitlearn.PairwiseDistancesReduction.compute
returns the results of the reduction. Internally, it is responsible for choosing the most appropriate implementation prior to executing it.
In this context, aforementioned vendorspecific packages could register custom implementations explicitly (i.e. with a python context manager as suggested by Olivier Grisel) or implicitly (by some package reflection when importing relevant interfaces).
Implementing the design
A few first experiments have been made and converged to sklearn#22134
, a contribution which proposes integrating the previous interfaces progressively via a feature branch.
Future work
Further work would treat the last requirements:
 Support for 32 bits datasets pairs
 Support for the last fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Implement adapted operations for each reductions (radius neighborhood, threshold filtering, cumulative sum, etc.)
If you are interested in reading more about this, read this section from the extra notes.
Acknowledgement
This was a joint work with other coredevelopers — namely Olivier Grisel, Jérémie du Boisberranger, Thomas J. Fan and Christian Lorentzen.
Finally and more importantly, the implementations presented here are made possible thanks to other notable opensource projects, especially Cython but also of OpenBLAS, which provides fast vectorized kernels implemented in C and assembly for BLAS.
Notes
 Threads’ oversubscription happens when threads are spawned at various levels of parallelism, causing the OS to use more threads than necessary for the execution of the program to be optimal. ↩
 We use the term “abstract class” here so as to talk about the design: no such concept exist in Cython. ↩
 A set of template methods are defined so as to have concrete implementations modify datastructures when and where needed. ↩
 If you are looking for the concrete implementations’ critical regions, look for
_compute_and_reduce_distances_on_chunks
. ↩  Thanks to Jake VanDerplas! ↩
 We mainly use heapallocated buffers that we manipulate through pointers and offsets at the lowest lowel of this new implementation for maximum efficiency. ↩
 Disclaimer: during this work, no animal were killed, nor hurt; nor are and nor will. ↩