Context
We have seen that \(\text{argkmin}\) is the reduction that is performed on pairwise distances for \(k\)nearest neighbors search.
Yet, there exist other reductions over pairwise distances (\(\text{argmin}\), threshold filtering, cumulative sum, etc.) which are at the core of the computational foundations of many machine learning algorithms.
This blog post presents a design that takes into account the requirements of the existing implementations
to introduce a set of new abstractions to implement reductions over pairwise distances: PairwiseDistancesReduction
. This set of interfaces aims at reimplementing patterns that are similar to the \(k\)nn search in Cython, to improve the
performance of its computational foundations, and thus the ones of its userfacing interfaces.
To our knowledge, though some projects like KeOps implement those patterns efficiently for GPUs, no project implements such operations for CPUs efficiently.
💡 This blog post won’t introduce every technical detail for the sake of conciseness, maintenance, and to respect the single source of truth principle as much as possible. The implementations are available in the
sklearn.metrics._pairwise_distances_reduction
submodule.💡 This presentation gives elements of the design of
PairwiseDistancesReductions
.
Notation
In what follows, the following notations are used:
 \(p\): the dimension of vectors
 \([n] \triangleq \{0, \cdots, n  1\}\)
 \(\mathbf{X} \in \mathbb{R}^{n_x \times p}\): a first dataset
 \(\mathbf{X}_{i\cdot} \in \mathbb{R}^{p}\): the \(i\)th vector of \(\mathbf{X}\)
 \(\mathbf{Y} \in \mathbb{R}^{n_y \times p}\): a second dataset
 \(\mathbf{Y}_{j\cdot} \in \mathbb{R}^{p}\): the \(j\)th vector of \(\mathbf{Y}\)
 \(c\): the chunk size, i.e. the number of vectors in a chunk (a group of adjacent vectors)
 \(c_x \triangleq \left\lceil \frac{n_x}{c} \right\rceil\), the number of chunks for \(\mathbf{X}\)
 \(c_y \triangleq \left\lceil \frac{n_y}{c} \right\rceil\), the number of chunks for \(\mathbf{Y}\)
 \((\mathbf{X}_c^{(l)})_{l \in [c_x]}\): the ordered family of all the chunks of \(\mathbf{X}\)
 \((\mathbf{Y}_c^{(k)})_{k \in [c_y]}\): the ordered family of all the chunks of \(\mathbf{Y}\)
 \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)} \triangleq \left(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}\right)_{(l,k) \in [c_x] \times [c_y] }\): the ordered family of all the pairs of chunks
 \(d\), the distance metric to use
 \(\mathbf{D}_d(\mathbf{A}, \mathbf{B}) \in \mathbf{R}^{n_a \times n_b}\) the distance matrix for \(d\) between vectors of two matrices \(\mathbf{A} \in \mathbb{R}^{n_a \times p}\) and \(\mathbf{B} \in \mathbb{R}^{n_b \times p}\):
 \(k\): parameter for the \(\text{argkmin}\) operation at the base of \(k\) nearest neighbors search
Moreover, the terms “samples” and “vectors” will also be used interchangeably.
Requirements for reductions over pairwise distances
The following requirements are currently supported within scikitlearn’s implementations:
 Support for float32 datasets pairs and float64 datasets pairs
 Support for fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 dense \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Support all the distance metrics as defined via
sklearn.metrics.DistanceMetric
 Parallelise computations effectively on all cores
 Prevent threads’ oversubscription^{1} (by OpenMP, joblib, or any BLAS implementations)
 Implement adapted operations for each reduction (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.)
 Support generic returned values for reductions (varying number, varying types, varying shapes, etc.)
 Optimise the Euclidean distance metric computations
Proposed design
The following design proposes treating the given requirements as independently from one another as possible.
DatasetsPair
: an abstract class for manipulating datasets^{2}
This allows:
 Supporting float32 datasets pairs and float64 datasets pairs
 Supporting fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs via concrete implementation, i.e.:
DenseDenseDatasetsPair
SparseDenseDatasetsPair
DenseSparseDatasetsPair
SparseSparseDatasetsPair
 Supporting all the distance metrics as defined via
sklearn.metrics.DistanceMetric
Internally, a DatasetsPair
wraps \((\mathbf{X}, \mathbf{Y}, d)\) and exposes an interface which allows computing \(d(\mathbf{X}_{i\cdot}, \mathbf{Y}_{j\cdot})\) for a given tuple \((i, j)\).
PairwiseDistancesReduction
: an abstract class defining parallelization templates
This allows:
 Parallelising computations effectively on all cores
 Preventing threads’ oversubscription (by OpenMP, joblib, or any BLAS implementations)
 Supporting generic returned values for reductions (varying number, varying types, varying shapes, etc.)
This is made possible by:
 setting up a general interface that performs the parallelization of computations on \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\): two strategies of parallelization are implemented as it’s worth parallelizing on \(\mathbf{X}\) or on \(\mathbf{Y}\) depending on the context. To choose one or the other strategy, a simple heuristic comparing \(c_x\) and \(c_y\) about the number of available threads is used and is sufficient.
 using a
threadpoolctl.threadpool_limits
context at the start of the execution of the generic parallel template  having a flexible Python interface to return results and have the parallel computations be defined agnostically from the data structures being modified in concrete classes^{3}.
The critical areas of the computations — that is the computations of the chunk of the distance matrix associated to \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) and its reduction — is made abstract. This way, when defining a concrete PairwiseDistancesReduction
, a sole method is to define up to some eventual python helpers methods^{4}.
PairwiseDistancesReductionArgKmin
: a first concrete PairwiseDistancesReduction
for \(\text{argkmin}\)
For this reduction, one can use maxheaps which are by design doing the work of keeping the first \(k\) minimum values with their indices. scikitlearn current implementation of maxheaps is simple, readable and efficient^{5} and can be used to manipulate the data structures that we need^{6}.
Specialising reductions for the Euclidean distance metric
Generally, distances associated with neighbors aren’t returned to the user. This allows some optimization.
In the case of the Euclidean distance metric, one can use the Squared Euclidean distance metric as a proxy: it is less costly, it preserves ordering and it can be computed efficiently.
Indeed, \(\mathbf{D}^{\odot 2}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\) — the elementwise squared version of the \((l,k)\)th chunk of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\) — can be computed as follows:
This allows using two optimizations:

\(\left[\Vert \mathbf{X}_{i\cdot}\Vert_2^2\right]_{i \in [n_x]}\) and \(\left[\Vert \mathbf{Y}_{j\cdot}\Vert_2^2\right]_{j \in [n_y]}\) can be computed once and for all at the start and be cached. Those two vectors will be reused on each chunk of the distance matrix.

More importantly, \( 2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top\) can be computed using the GEneral Matrix Multiplication from BLAS Level 3 — hereinafter refered to as GEMM. This allows getting the maximum arithmetic intensity for the computations, making use of recent BLAS backends implementing vectorised kernels, such as OpenBLAS.
For instance EuclideanPairwiseDistancesArgkmin
is the main specialization of PairwiseDistancesArgkmin
for the
Euclidean distance metric. This specialization solely recomputes the actual Euclidean distances when the caller asked them to be returned.
Interfacing PairwiseDistancesReductions
with scikitlearn’s algorithms
As of now, the overall design was covered without mentioning ways they can be plugged with the existing scikitlearn algorithms, progressively migrating most algorithms’ backend to those new implementations.
Furthermore, in the future, specialized implementations for various vendors of CPUs and GPUs can be created.
In this case, we want to have such specialized implementations separated from scikitlearn source code (e.g. by having them in optional and vendorspecific packages) so as to keep PairwiseDistancesReductions
interfaces vendorspecializationagnostic but still be able to dispatch the computations to the most adapted and available implementations.
To touch two birds with one tiny stone^{7}, the new implementations can be used conditionally to the yetsupported cases based on provided datasets and executed agnostically from them.
This can be implemented by a PairwiseDistancesReduction.{is_usable_for,compute}
pattern:
PairwiseDistancesReduction.is_usable_for
returnsTrue
if any implementation for the provided \((\mathbf{X}, \mathbf{Y}, d)\) can be used. If none is available, the caller can default to the current implementation within scikitlearn.PairwiseDistancesReduction.compute
returns the results of the reduction. Internally, it is responsible for choosing the most appropriate implementation prior to executing it.
In this context, aforementioned vendorspecific packages could register custom implementations explicitly (i.e. with a python context manager as suggested by Olivier Grisel) or implicitly (by some package reflection when importing relevant interfaces).
Implementing the design
A few first experiments have been made and converged to sklearn#22134
, a contribution that proposes integrating the previous interfaces progressively via a feature branch.
Future work
Further work would treat the last requirements:
 Support for 32 bits datasets pairs
 Support for the last fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Implement adapted operations for each reduction (radius neighborhood, threshold filtering, cumulative sum, etc.)
If you are interested in reading more about this, read this section from the extra notes.
Acknowledgement
This was a joint work with other coredevelopers — namely Olivier Grisel, Jérémie du Boisberranger, Thomas J. Fan and Christian Lorentzen.
Finally and more importantly, the implementations presented here are made possible thanks to other notable opensource projects, especially Cython but also OpenBLAS, which provides fast vectorized kernels implemented in C and assembly for BLAS.
Notes
 Threads’ oversubscription happens when threads are spawned at various levels of parallelism, causing the OS to use more threads than necessary for the execution of the program to be optimal. ↩
 We use the term “abstract class” here to talk about the design: no such concept exists in Cython. ↩
 A set of template methods are defined so as to have concrete implementations modify data structures when and where needed. ↩
 If you are looking for the concrete implementations’ critical regions, look for
_compute_and_reduce_distances_on_chunks
. ↩  Thanks to Jake VanDerplas! ↩
 We mainly use heapallocated buffers that we manipulate through pointers and offsets at the lowest level of this new implementation for maximum efficiency. ↩
 Disclaimer: during this work, no animal was killed, nor hurt; nor are and nor will. ↩