Context
We have seen that \(\text{argkmin}\) is the reduction performed on pairwise distances for \(k\)-nearest neighbors search.
Yet, there exist other reductions over pairwise distances (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.) which are at the core of the computational foundations of many machine learning algorithms.
This blog post presents a design which takes into account the requirements of the existing implementations to introduce a set of new abstractions for implementing reductions over pairwise distances: PairwiseDistancesReduction. This set of interfaces aims at reimplementing patterns similar to the \(k\)-NN search in Cython, to improve the performance of its computational foundations, and thus of its user-facing interfaces.
To our knowledge, though some projects like KeOps implement those patterns efficiently for GPUs, no project implements such operations efficiently for CPUs.
This blog post won't introduce every technical detail for the sake of conciseness and to respect the single source of truth principle as much as possible. Readers interested in those details can read the first proposed implementation given in scikit-learn#22134.
Some notation and wording formalism
In what follows, the following notations are used:
 \(p\): the dimension of vectors
 \([n] \triangleq \{0, \cdots, n - 1\}\)
 \(\mathbf{X} \in \mathbb{R}^{n_x \times p}\): a first dataset
 \(\mathbf{X}_{i\cdot} \in \mathbb{R}^{p}\): the \(i\)th vector of \(\mathbf{X}\)
 \(\mathbf{Y} \in \mathbb{R}^{n_y \times p}\): a second dataset
 \(\mathbf{Y}_{j\cdot} \in \mathbb{R}^{p}\): the \(j\)th vector of \(\mathbf{Y}\)
 \(c\): the chunk size, i.e. the number of vectors in a chunk (a group of adjacent vectors)
 \(c_x \triangleq \left\lceil \frac{n_x}{c} \right\rceil\), the number of chunks for \(\mathbf{X}\)
 \(c_y \triangleq \left\lceil \frac{n_y}{c} \right\rceil\), the number of chunks for \(\mathbf{Y}\)
 \((\mathbf{X}_c^{(l)})_{l \in [c_x]}\): the ordered family of all the chunks of \(\mathbf{X}\)
 \((\mathbf{Y}_c^{(k)})_{k \in [c_y]}\): the ordered family of all the chunks of \(\mathbf{Y}\)
 \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)} \triangleq \left(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}\right)_{(l,k) \in [c_x] \times [c_y] }\): the ordered family of all the pairs of chunks
 \(d\), the distance metric to use
 \(\mathbf{D}_d(\mathbf{A}, \mathbf{B}) \in \mathbb{R}^{n_a \times n_b}\): the distance matrix for \(d\) between vectors of two matrices \(\mathbf{A} \in \mathbb{R}^{n_a \times p}\) and \(\mathbf{B} \in \mathbb{R}^{n_b \times p}\), defined by \(\mathbf{D}_d(\mathbf{A}, \mathbf{B})_{ij} \triangleq d(\mathbf{A}_{i\cdot}, \mathbf{B}_{j\cdot})\)
 \(k\): parameter for the \(\text{argkmin}\) operation at the base of \(k\)-nearest neighbors search
Moreover, the terms "samples" and "vectors" will also be used interchangeably.
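As a small illustration of this chunking notation, here is a sketch in plain Python/NumPy (the variable names mirror the notation above; this is not scikit-learn code):

```python
import numpy as np

# Illustrative sketch of the chunking notation above; not scikit-learn code.
rng = np.random.default_rng(0)
n_x, n_y, p, c = 100, 75, 10, 32  # dataset sizes, dimension, chunk size

X = rng.random((n_x, p))
Y = rng.random((n_y, p))

c_x = -(-n_x // c)  # ceil(n_x / c), the number of chunks of X
c_y = -(-n_y // c)  # ceil(n_y / c), the number of chunks of Y

# C_chunk_size(X, Y): the ordered family of all the pairs of chunks.
chunk_pairs = [
    (X[l * c:(l + 1) * c], Y[k * c:(k + 1) * c])
    for l in range(c_x)
    for k in range(c_y)
]
```

Note that the last chunk of each dataset may hold fewer than \(c\) vectors when \(c\) does not divide \(n_x\) or \(n_y\).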
Requirements for reductions over pairwise distances
The following requirements are currently supported within scikit-learn's implementations:
 Support for 32-bit datasets pairs and 64-bit datasets pairs
 Support for fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 dense \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Support all the distance metrics as defined via sklearn.metrics.DistanceMetric
 Parallelise computations effectively on all cores
 Prevent threads' oversubscription^{1} (by OpenMP, joblib, or any BLAS implementations)
 Implement adapted operations for each reduction (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.)
 Support generic returned values for reductions (varying number, varying types, varying shapes, etc.)
 Optimise the Euclidean distance computations
ProposedÂ design
The following design proposes treating the given requirements as independently as possible.
DatasetsPair: an abstract class for manipulating datasets^{2}
This allows:
 Supporting 32-bit datasets pairs and 64-bit datasets pairs
 Supporting fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs via concrete implementations, i.e.:
DenseDenseDatasetsPair
SparseDenseDatasetsPair
DenseSparseDatasetsPair
SparseSparseDatasetsPair
 Supporting all the distance metrics as defined via sklearn.metrics.DistanceMetric
Internally, a DatasetsPair wraps \((\mathbf{X}, \mathbf{Y}, d)\) and exposes an interface which allows computing \(d(\mathbf{X}_{i\cdot}, \mathbf{Y}_{j\cdot})\) for a given tuple \((i, j)\).
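As a rough sketch of that interface in plain Python (the real classes are written in Cython, and the constructor signatures here are simplified assumptions):

```python
import numpy as np
from abc import ABC, abstractmethod


class DatasetsPair(ABC):
    """Wraps (X, Y, d) and exposes distances between pairs of vectors.

    Simplified sketch; the actual scikit-learn classes are written in Cython.
    """

    def __init__(self, distance_metric):
        self.distance_metric = distance_metric

    @abstractmethod
    def dist(self, i, j):
        """Return d(X_i, Y_j)."""


class DenseDenseDatasetsPair(DatasetsPair):
    """Concrete pair for two dense datasets."""

    def __init__(self, X, Y, distance_metric):
        super().__init__(distance_metric)
        self.X, self.Y = X, Y

    def dist(self, i, j):
        return self.distance_metric(self.X[i], self.Y[j])


# Usage with a plain Euclidean metric.
def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))


pair = DenseDenseDatasetsPair(np.eye(3), np.eye(3), euclidean)
```

The sparse variants would follow the same interface, differing only in how they index into their wrapped datasets.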
PairwiseDistancesReduction: an abstract class defining parallelisation templates
This allows:
 Parallelising computations effectively on all cores
 Preventing threads' oversubscription (by OpenMP, joblib, or any BLAS implementations)
 Supporting generic returned values for reductions (varying number, varying types, varying shapes, etc.)
This is made possible by:
 setting up a general interface which performs the parallelisation of computations on \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\): two parallelisation strategies are implemented, as it is worth parallelising on \(\mathbf{X}\) or on \(\mathbf{Y}\) depending on the context. To choose one or the other strategy, a simple heuristic comparing \(c_x\) and \(c_y\) with regard to the number of available threads is used, and it is sufficient.
 using a threadpoolctl.threadpool_limits context at the start of the execution of the generic parallel template
 having a flexible Python interface to return results and have the parallel computations be defined agnostically from the data structures being modified in concrete classes^{3}
The critical area of the computations, that is the computation of the chunk of the distance matrix associated with \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) and its reduction, is made abstract. This way, when defining a concrete PairwiseDistancesReduction, a sole method is to be defined, up to some eventual Python helper methods^{4}.
PairwiseDistancesReductionArgKmin: a first concrete PairwiseDistancesReduction for \(\text{argkmin}\)
For this reduction, one can simply use max-heaps, which by design do the work of keeping the first \(k\) minimum values with their indices. scikit-learn's current implementation of max-heaps is simple, readable and efficient^{5}, and can be used to manipulate the data structures that we need^{6}.
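To illustrate the idea in plain Python: heapq provides a min-heap, so storing negated distances emulates the max-heap behaviour described above (this is only a sketch; the Cython implementation works on raw buffers):

```python
import heapq
import numpy as np


def argkmin_chunk(dist_chunk, k, heaps):
    """Push one chunk of distances into per-row max-heaps.

    `heaps[i]` keeps the k smallest distances seen so far for row i,
    stored as (-distance, j) so that Python's min-heap behaves like a
    max-heap over distances.
    """
    n_rows, n_cols = dist_chunk.shape
    for i in range(n_rows):
        for j in range(n_cols):
            item = (-dist_chunk[i, j], j)
            if len(heaps[i]) < k:
                heapq.heappush(heaps[i], item)
            elif item > heaps[i][0]:  # closer than the current k-th neighbor
                heapq.heapreplace(heaps[i], item)


# Usage: k=2 smallest entries per row of a small distance matrix.
D = np.array([[3.0, 1.0, 2.0],
              [0.5, 4.0, 0.1]])
heaps = [[], []]
argkmin_chunk(D, k=2, heaps=heaps)
indices = [sorted(j for _, j in h) for h in heaps]
```

Because each row's heap only ever holds \(k\) items, pushing a full chunk costs \(O(c^2 \log k)\) regardless of how many chunks have already been processed.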
Specialising reductions for the Euclidean distance metric
Generally, distances associated with neighbors aren't returned to the user. This allows some optimisations.
In the case of the Euclidean distance metric, one can use the squared Euclidean distance metric as a proxy: it is less costly, it preserves ordering, and it can be computed efficiently.
Indeed, \(\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\), the \((l,k)\)th chunk of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\), can be computed element-wise as follows:
\[\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})_{ij}^2 = \Vert \mathbf{X}_{i\cdot}\Vert_2^2 - 2 \mathbf{X}_{i\cdot} {\mathbf{Y}_{j\cdot}}^\top + \Vert \mathbf{Y}_{j\cdot}\Vert_2^2\]
This allows using two optimisations:

\(\left[\Vert \mathbf{X}_{i\cdot}\Vert_2^2\right]_{i \in [n_x]}\) and \(\left[\Vert \mathbf{Y}_{j\cdot}\Vert_2^2\right]_{j \in [n_y]}\) can be computed once and for all at the start and be cached. Those two vectors will be reused on each chunk of the distance matrix.

More importantly, \(-2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top\) can be computed using the GEneral Matrix Multiplication from BLAS Level 3, hereinafter referred to as GEMM. This allows getting the maximum arithmetic intensity for the computations, making use of recent BLAS backends implementing vectorised kernels, such as OpenBLAS.
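A NumPy sketch of this decomposition (the function name is ours; in NumPy, X_chunk @ Y_chunk.T dispatches to the GEMM routine of the installed BLAS):

```python
import numpy as np


def squared_euclidean_chunk(X_chunk, Y_chunk, X_sq_norms, Y_sq_norms):
    """Squared Euclidean distances for one pair of chunks.

    X_sq_norms / Y_sq_norms are the cached squared norms of the chunks'
    rows; the -2 X Y^T term below is a single GEMM call under the hood.
    """
    return (
        X_sq_norms[:, np.newaxis]      # ||X_i||^2, broadcast over columns
        - 2.0 * (X_chunk @ Y_chunk.T)  # BLAS Level 3 GEMM
        + Y_sq_norms[np.newaxis, :]    # ||Y_j||^2, broadcast over rows
    )


# Usage: matches the naive pairwise computation up to rounding error.
rng = np.random.default_rng(0)
X, Y = rng.random((5, 3)), rng.random((4, 3))
D2 = squared_euclidean_chunk(X, Y, (X ** 2).sum(axis=1), (Y ** 2).sum(axis=1))
naive = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
```

Note that this decomposition can introduce small catastrophic-cancellation errors for nearly identical vectors, which is one reason the exact distances may be recomputed when the caller requests them.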
For instance, FastEuclideanPairwiseDistancesArgkmin is the main specialisation of PairwiseDistancesArgkmin for the Euclidean distance metric. This specialisation solely recomputes the actual Euclidean distances when the caller asks for them to be returned.
Interfacing PairwiseDistancesReductions with scikit-learn's algorithms
As of now, the overall design has been covered without mentioning how these interfaces can be plugged into the existing scikit-learn algorithms, progressively migrating most algorithms' backends to those new implementations.
Furthermore, in the future, specialised implementations for various vendors of CPUs and GPUs can be created.
In this case, we want to have such specialised implementations separated from the scikit-learn source code (e.g. by having them in optional and vendor-specific packages) so as to keep the PairwiseDistancesReductions interfaces agnostic of vendor specialisation but still be able to dispatch the computations to the most adapted and available implementations.
To touch two birds with one tiny stone^{7}, the new implementations can be used conditionally, for the cases they already support based on the provided datasets, and executed agnostically from them.
This can be implemented by a PairwiseDistancesReduction.{is_usable_for,compute} pattern:
PairwiseDistancesReduction.is_usable_for returns True if any implementation for the provided \((\mathbf{X}, \mathbf{Y}, d)\) can be used. If none is available, the caller can default to the current implementation within scikit-learn.
PairwiseDistancesReduction.compute returns the results of the reduction. Internally, it is responsible for choosing the most appropriate implementation prior to executing it.
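A simplified Python sketch of this pattern (method bodies, signatures and the usability conditions are illustrative assumptions, not scikit-learn's actual API):

```python
import numpy as np


class PairwiseDistancesReduction:
    """Sketch of the is_usable_for / compute dispatching pattern."""

    @classmethod
    def is_usable_for(cls, X, Y, metric):
        """True if some implementation handles the provided (X, Y, metric)."""
        return (
            isinstance(X, np.ndarray)
            and isinstance(Y, np.ndarray)
            and X.dtype == Y.dtype == np.float64
            and metric in ("euclidean", "manhattan")
        )

    @classmethod
    def compute(cls, X, Y, metric):
        """Choose the most appropriate implementation and execute it."""
        if not cls.is_usable_for(X, Y, metric):
            raise ValueError("no usable implementation for these inputs")
        # Dispatch to the chosen implementation here (elided in this sketch).


# Caller-side pattern: use the new backend when possible, else fall back.
def caller(X, Y, metric="euclidean"):
    if PairwiseDistancesReduction.is_usable_for(X, Y, metric):
        return "new backend"
    return "legacy backend"
```

The fallback stays entirely on the caller's side, so algorithms can adopt the new backend without changing their own public behaviour.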
In this context, the aforementioned vendor-specific packages could register custom implementations explicitly (i.e. with a Python context manager, as suggested by Olivier Grisel) or implicitly (by some package reflection when importing the relevant interfaces).
Implementing theÂ design
A few first experiments have been made and converged to scikit-learn#22134, a contribution which proposes integrating the previous interfaces progressively via a feature branch.
FutureÂ work
Further work would treat the last requirements:
 Support for 32-bit datasets pairs
 Support for the last fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Implement adapted operations for each reduction (radius neighborhood, threshold filtering, cumulative sum, etc.)
If you are interested in reading more about this, read this section from the extra notes.
Acknowledgement
This was a joint work with other core developers, namely Olivier Grisel, Jérémie du Boisberranger, Thomas J. Fan and Christian Lorentzen.
Finally and more importantly, the implementations presented here are made possible thanks to other notable open-source projects, especially Cython but also OpenBLAS, which provides fast vectorised kernels implemented in C and assembly for BLAS.
Notes
 Threads' oversubscription happens when threads are spawned at various levels of parallelism, causing the OS to use more threads than necessary for the execution of the program to be optimal. ↩
 We use the term "abstract class" here so as to talk about the design: no such concept exists in Cython. ↩
 A set of template methods is defined so as to have concrete implementations modify data structures when and where needed. ↩
 If you are looking for the concrete implementations' critical regions, look for _compute_and_reduce_distances_on_chunks. ↩
 Thanks to Jake VanderPlas! ↩
 We mainly use heap-allocated buffers that we manipulate through pointers and offsets at the lowest level of this new implementation for maximum efficiency. ↩
 Disclaimer: during this work, no animals were killed nor hurt; none are and none will be. ↩