⚠ Disclaimer: this is being written.
In particular, diagrams are being improved.
Context
We have seen that \(\text{argkmin}\) is the reduction performed on pairwise distances for \(k\)-nearest neighbors search.
Yet there are plenty of other reductions over pairwise distances (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.) which are at the core of the computational foundations of many machine learning algorithms.
This blog post presents a design which takes into account the requirements of the existing implementations
to introduce a new abstraction for implementing reductions over pairwise distances, PairwiseDistancesReduction,
reimplementing those operations in a private Cython module for scikit-learn, hopefully improving the overall
performance of its computational foundations, and thus of its user-facing interfaces.
This blog post won't introduce every technical detail, for the sake of conciseness and to respect the single source of truth principle as much as possible. Readers interested in those details can read the first proposed implementation given in
scikit-learn#21462
.
Reductions over pairwise distances in scikit-learn 1.0
Unfortunately, the current implementations have evolved organically and are neither unified nor easy to follow.
This makes extending those reductions difficult. Moreover, most of those reductions are implemented using Python functions as reduction functions in a general parallelisation pattern which relies on joblib.Parallel
.
This makes the code concise and readable, but it is technically not the most efficient: working at this
level with views on NumPy arrays moves large chunks of data back and forth several times between the RAM and the CPUs' caches and allocates temporary results. Moreover, the cost of manipulating those chunks of data in CPython's interpreter causes a non-negligible overhead.
This prevents proper hardware scalability, especially when using more than 8 cores.
Hence, this motivates refactoring those patterns so as to improve both their computational efficiency and their code readability.
To our knowledge, though some projects like KeOps implement those patterns efficiently for GPUs, no project implements such operations efficiently for CPUs.
Notation and wording
In what follows, the following notations are used:
 \([n] \triangleq \{0, \cdots, n-1\}\)
 \(\mathbf{X} \in \mathbb{R}^{n_x \times p}\): a first dataset
 \(\mathbf{X}_{i\cdot} \in \mathbb{R}^{p}\): the \(i\)-th vector of \(\mathbf{X}\)
 \(\mathbf{Y} \in \mathbb{R}^{n_y \times p}\): a second dataset
 \(\mathbf{Y}_{j\cdot} \in \mathbb{R}^{p}\): the \(j\)-th vector of \(\mathbf{Y}\)
 \(c\): the chunk size, i.e. the number of vectors in a chunk (a group of adjacent vectors)
 \(c_x \triangleq \left\lceil \frac{n_x}{c} \right\rceil\), the number of chunks for \(\mathbf{X}\)
 \(c_y \triangleq \left\lceil \frac{n_y}{c} \right\rceil\), the number of chunks for \(\mathbf{Y}\)
 \((\mathbf{X}_c^{(l)})_{l \in [c_x]}\): the ordered family of all the chunks of \(\mathbf{X}\)
 \((\mathbf{Y}_c^{(k)})_{k \in [c_y]}\): the ordered family of all the chunks of \(\mathbf{Y}\)
 \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)} \triangleq \left(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}\right)_{(l,k) \in [c_x] \times [c_y]}\): the ordered family of all the pairs of chunks
 \(d\): the distance metric to use
 \(\mathbf{D}_d(\mathbf{A}, \mathbf{B}) \in \mathbb{R}^{n_a \times n_b}\): the distance matrix for \(d\) between the vectors of two matrices \(\mathbf{A} \in \mathbb{R}^{n_a \times p}\) and \(\mathbf{B} \in \mathbb{R}^{n_b \times p}\), i.e. \(\left[\mathbf{D}_d(\mathbf{A}, \mathbf{B})\right]_{ij} \triangleq d(\mathbf{A}_{i\cdot}, \mathbf{B}_{j\cdot})\)
 \(k\): the parameter of the \(\text{argkmin}\) operation at the base of \(k\)-nearest neighbors search
The terms "samples" and "vectors" will also be used interchangeably.
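To make this notation concrete, here is a small NumPy sketch that splits two toy datasets into chunks of at most \(c\) vectors and enumerates the \(c_x \times c_y\) pairs of chunks. The names and sizes are purely illustrative and are not taken from the scikit-learn code:

```python
import numpy as np

# Illustrative datasets: n_x = 10 and n_y = 7 samples with p = 3 features.
X = np.random.rand(10, 3)
Y = np.random.rand(7, 3)
chunk_size = 4  # c, the number of vectors per chunk

# The ordered families of chunks of X and of Y.
X_chunks = [X[i:i + chunk_size] for i in range(0, X.shape[0], chunk_size)]
Y_chunks = [Y[j:j + chunk_size] for j in range(0, Y.shape[0], chunk_size)]

# c_x = ceil(10 / 4) = 3 and c_y = ceil(7 / 4) = 2, hence
# c_x * c_y = 6 pairs of chunks to process in total.
pairs = [(Xc, Yc) for Xc in X_chunks for Yc in Y_chunks]
print(len(X_chunks), len(Y_chunks), len(pairs))  # 3 2 6
```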
Visually, this gives:
Requirements for reductions over pairwise distances
The following requirements are currently supported within scikit-learn's implementations:
 Support for 32-bit and 64-bit dataset pairs
 Support for fused \(\{\text{sparse}, \text{dense}\}^2\) dataset pairs, i.e.:
 dense \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Support all the distance metrics as defined via
sklearn.metrics.DistanceMetric
 Parallelise computations effectively on all cores
 Prevent threads' oversubscription^{1} (by OpenMP, joblib, or any BLAS implementation)
 Implement adapted operations for each reduction (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.)
 Support generic returned values for reductions (varying number, varying types, varying shapes, etc.)
 Optimise the Euclidean distance computations
Proposed design
The following design treats the given requirements as independently as possible.
DatasetsPair
: an abstract class for manipulating datasets^{2}
This allows:
 Supporting 32-bit and 64-bit dataset pairs
 Supporting fused \(\{\text{sparse}, \text{dense}\}^2\) dataset pairs via concrete implementations, i.e.:
DenseDenseDatasetsPair
SparseDenseDatasetsPair
DenseSparseDatasetsPair
SparseSparseDatasetsPair
 Supporting all the distance metrics as defined via
sklearn.metrics.DistanceMetric
Internally, a DatasetsPair
wraps \((\mathbf{X}, \mathbf{Y}, d)\) and exposes an interface which allows computing \(d(\mathbf{X}_{i\cdot}, \mathbf{Y}_{j\cdot})\) for a given pair of indices \((i, j)\).
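As a rough sketch of what this interface looks like, here is a plain-Python stand-in (the real implementation is in Cython); the hypothetical DenseDenseDatasetsPair below is hard-coded to the Euclidean distance for illustration, whereas the real classes support any DistanceMetric:

```python
from abc import ABC, abstractmethod
import numpy as np

class DatasetsPair(ABC):
    """Sketch: wraps (X, Y, d) and exposes the distance between
    the i-th vector of X and the j-th vector of Y."""

    @abstractmethod
    def dist(self, i: int, j: int) -> float:
        ...

class DenseDenseDatasetsPair(DatasetsPair):
    # Illustrative concrete pair for two dense datasets; the metric
    # is fixed to Euclidean here purely for the sake of the example.
    def __init__(self, X, Y):
        self.X, self.Y = X, Y

    def dist(self, i: int, j: int) -> float:
        return float(np.linalg.norm(self.X[i] - self.Y[j]))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 3.0]])
pair = DenseDenseDatasetsPair(X, Y)
print(pair.dist(0, 0))  # 3.0
```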
PairwiseDistancesReduction
: an abstract class to reduce pairwise distances in a streaming fashion
ThisÂ allows:
 Parallelising computations effectively on all cores
 Preventing threads' oversubscription (by OpenMP, joblib, or any BLAS implementation)
 Supporting generic returned values for reductions (varying number, varying types, varying shapes, etc.)
This is made possible by:
 setting up a general interface which performs the parallelisation of computations on \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\): two parallelisation strategies are implemented, as it is worth parallelising on \(\mathbf{X}\) or on \(\mathbf{Y}\) depending on the context. To choose between them, a simple heuristic comparing \(c_x\) and \(c_y\) with regard to the number of available threads is used and is sufficient.
 using a
threadpoolctl.threadpool_limits
context at the start of the execution of the generic parallel template
 having a flexible Python interface to return results and have the parallel computations be defined agnostically from the data structures being modified in concrete classes^{3}.
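The strategy-selection heuristic can be sketched as follows; the exact criterion used in the actual implementation may differ, so treat the function and threshold below as illustrative assumptions:

```python
from math import ceil

def choose_strategy(n_x: int, n_y: int, chunk_size: int, n_threads: int) -> str:
    # Hypothetical heuristic: parallelise on X when there are enough
    # chunks of X to keep all threads busy, otherwise parallelise on Y.
    c_x = ceil(n_x / chunk_size)
    return "parallel_on_X" if c_x >= n_threads else "parallel_on_Y"

# Many X chunks: each thread gets its own slice of X.
print(choose_strategy(1_000_000, 1_000, 256, 8))  # parallel_on_X
# Few X chunks: threads would starve, so parallelise on Y instead.
print(choose_strategy(100, 1_000_000, 256, 8))    # parallel_on_Y
```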
The critical area of the computations, that is the computation of the chunk of the distance matrix associated with \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) and its reduction, is made abstract. This way, when defining a concrete PairwiseDistancesReduction
, a sole method is to be defined, up to some optional Python helper methods^{4}.
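In object-oriented terms this is a template-method pattern. The sketch below is a simplified, sequential stand-in for the real parallel Cython machinery: only the method name _compute_and_reduce_distances_on_chunks comes from the actual code, and the toy CountPairs reduction is purely illustrative:

```python
from abc import ABC, abstractmethod

class PairwiseDistancesReduction(ABC):
    """Sketch: the base class owns the chunked traversal (parallelised
    with OpenMP in the real Cython code); concrete reductions implement
    only the critical area."""

    def __init__(self, n_x: int, n_y: int, chunk_size: int):
        self.n_x, self.n_y, self.chunk_size = n_x, n_y, chunk_size

    def compute(self):
        # Iterate over all pairs of chunks of X and Y.
        for x_start in range(0, self.n_x, self.chunk_size):
            x_end = min(x_start + self.chunk_size, self.n_x)
            for y_start in range(0, self.n_y, self.chunk_size):
                y_end = min(y_start + self.chunk_size, self.n_y)
                self._compute_and_reduce_distances_on_chunks(
                    x_start, x_end, y_start, y_end
                )
        return self._finalize_results()

    @abstractmethod
    def _compute_and_reduce_distances_on_chunks(
        self, x_start, x_end, y_start, y_end
    ):
        """The sole critical method a concrete reduction must define."""

    def _finalize_results(self):
        return None

class CountPairs(PairwiseDistancesReduction):
    # Toy reduction: count how many (i, j) pairs are visited.
    def __init__(self, n_x, n_y, chunk_size):
        super().__init__(n_x, n_y, chunk_size)
        self.count = 0

    def _compute_and_reduce_distances_on_chunks(self, x_start, x_end, y_start, y_end):
        self.count += (x_end - x_start) * (y_end - y_start)

    def _finalize_results(self):
        return self.count

print(CountPairs(n_x=10, n_y=7, chunk_size=4).compute())  # 70
```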
Specialisations for the Euclidean distance metric are covered in the next sections.
PairwiseDistancesReductionArgKmin
: a first concrete PairwiseDistancesReduction
for \(\text{argkmin}\)
For this reduction, one can simply use max-heaps, which do the smart job of keeping the first \(k\) minimum values as well as their indices in \(\mathcal{O}(\log k)\). There exist more advanced heap implementations, such as rank-pairing heaps^{5}, but scikit-learn's current implementation of max-heaps is simple, readable, and efficient enough that we will stick with it^{6}; it can easily be used to manipulate the data structures that we need^{7}.
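To illustrate the idea with Python's heapq module (a min-heap, so distances are negated to emulate the max-heap of size \(k\) used in the Cython code; the function name is hypothetical):

```python
import heapq

def argkmin_for_row(distances, k):
    # Keep the k smallest distances (and their indices) with a
    # max-heap of size k: each push/replace costs O(log k).
    heap = []  # entries are (-distance, index)
    for j, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, j))
        elif -heap[0][0] > d:  # current k-th smallest is larger than d
            heapq.heapreplace(heap, (-d, j))
    # Return the k retained indices sorted by increasing distance.
    return [j for _, j in sorted((-neg_d, j) for neg_d, j in heap)]

print(argkmin_for_row([0.5, 0.1, 0.9, 0.2, 0.7], k=2))  # [1, 3]
```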
Specialising reductions for the Euclidean distance metric
In the case of the Euclidean distance metric, one can use the squared Euclidean distance as a proxy: it is less costly, it preserves ordering, and it can be computed efficiently.
Indeed, for \(\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\), the \((l,k)\)-th chunk of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\):

\[
\left[\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\right]^2_{ij} = \Vert \mathbf{X}_{i\cdot}\Vert_2^2 - 2 \mathbf{X}_{i\cdot}^\top \mathbf{Y}_{j\cdot} + \Vert \mathbf{Y}_{j\cdot}\Vert_2^2
\]
This allows using two optimisations:

\(\left[\Vert \mathbf{X}_{i\cdot}\Vert_2^2\right]_{i \in [n_x]}\) and \(\left[\Vert \mathbf{Y}_{j\cdot}\Vert_2^2\right]_{j \in [n_y]}\) can be computed once and for all at the start and be cached. Those two vectors are reused on each chunk of the distance matrix.

More importantly, \(-2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top\) can be computed using the GEneral Matrix Multiplication from BLAS Level 3, hereinafter referred to as GEMM. This allows getting the maximum arithmetic intensity for the computations by making use of recent BLAS implementations' vectorised computational kernels, such as OpenBLAS's.
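A NumPy sketch of this specialisation (the @ operator dispatches to the BLAS GEMM; all variable names are illustrative, and the real code works chunk by chunk rather than on the full matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, k = rng.random((6, 3)), rng.random((5, 3)), 2

# Squared-distance decomposition: ||x||^2 - 2 x.y + ||y||^2.
x_sq = (X ** 2).sum(axis=1)[:, np.newaxis]  # cached once for the whole run
y_sq = (Y ** 2).sum(axis=1)[np.newaxis, :]  # cached once for the whole run
sq_dists = x_sq + y_sq - 2 * X @ Y.T        # the -2 X Y^T term is one GEMM

argkmin = np.argsort(sq_dists, axis=1)[:, :k]

# The proxy preserves ordering: same argkmin as with actual distances.
exact = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
assert np.array_equal(argkmin, np.argsort(exact, axis=1)[:, :k])
print(argkmin.shape)  # (6, 2)
```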
For instance, FastEuclideanPairwiseDistancesArgkmin
is the main specialisation of this concrete implementation for the
Euclidean distance metric case; it solely computes or recomputes what is needed (e.g. the actual Euclidean distances when the caller asked for them to be returned).
Interfacing PairwiseDistancesReductions
with scikit-learn's algorithms
So far, the overall design was covered without mentioning how it can be plugged into the existing scikit-learn algorithms, progressively migrating most algorithms' backend to those new implementations.
Moreover, in the future, we could imagine creating specialised implementations for various vendors of CPUs and GPUs.
In this case, we want to have such specialised implementations separated from the scikit-learn source code (e.g. by having them in optional, vendor-specific packages) so as to keep the PairwiseDistancesReductions
interfaces vendor-specialisation-agnostic, while still being able to dispatch the computations to the most adapted available implementations.
To touch two birds with one tiny stone^{8}, the new implementations can be used conditionally on the currently supported cases, based on the provided data, and executed agnostically from them. This can be implemented with a PairwiseDistancesReduction.{is_usable_for,compute}
pattern:
PairwiseDistancesReduction.is_usable_for
returns True
if any implementation for the provided \((\mathbf{X}, \mathbf{Y}, d)\) can be used. If none is available, the caller can default to the current implementation within scikit-learn.
PairwiseDistancesReduction.compute
returns the results of the reduction. Internally, it is responsible for choosing the best implementation prior to executing the computations.
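This dispatching pattern can be sketched as follows; the usability criterion and the compute body below are illustrative assumptions, not the actual scikit-learn logic:

```python
import numpy as np

class PairwiseDistancesArgKmin:
    """Sketch of the is_usable_for / compute dispatching pattern."""

    @classmethod
    def is_usable_for(cls, X, Y, metric) -> bool:
        # Illustrative criterion: this sketch only covers dense float64
        # data and the Euclidean metric; anything else falls back.
        return (
            isinstance(X, np.ndarray) and isinstance(Y, np.ndarray)
            and X.dtype == Y.dtype == np.float64
            and metric == "euclidean"
        )

    @classmethod
    def compute(cls, X, Y, metric, k):
        # Would select the best available implementation here; this
        # sketch just uses the squared-distance proxy directly.
        assert cls.is_usable_for(X, Y, metric)
        sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
        return np.argsort(sq, axis=1)[:, :k]

X = np.random.rand(8, 4)
Y = np.random.rand(5, 4)
if PairwiseDistancesArgKmin.is_usable_for(X, Y, "euclidean"):
    result = PairwiseDistancesArgKmin.compute(X, Y, "euclidean", k=3)
else:
    result = None  # the caller defaults to the previous backend
print(None if result is None else result.shape)  # (8, 3)
```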
In this context, the aforementioned vendor-specific packages could register custom implementations explicitly (i.e. with a Python context manager, as suggested by Olivier Grisel) or implicitly (by some package reflection when importing the relevant interfaces).
Implementing the design
A few first experiments have been made and converged to scikit-learn#21462
, a contribution which proposes integrating the previous interfaces.
If you are interested in reading more about this, read this section from the extra notes.
Future work
Further work would address the remaining requirements:
 Support for 32-bit dataset pairs
 Support for the remaining fused \(\{\text{sparse}, \text{dense}\}^2\) dataset pairs, i.e.:
 sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
 dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
 Implement adapted operations for each reduction (radius neighborhood, threshold filtering, cumulative sum, etc.)
But first, scikit-learn#21462
is to be reviewed and agreed upon before proceeding with subsequent PRs implementing such changes.
If you are interested in reading more about this, read this section from the extra notes.
Acknowledgement
Other scikit-learn core developers' suggestions, namely the ones from Olivier Grisel, Jérémie du Boisberranger, Thomas J. Fan and Christian Lorentzen, helped improve the design and performance.
Finally and most importantly, the implementations presented here are made possible thanks to other notable open-source projects, especially:
 Cython, which allows getting performance à la bare-metal C using a superset of Python
 OpenBLAS, which provides handcrafted kernels in C and assembly for BLAS
Many thanks go to their maintainers!
References
 Threads' oversubscription happens when threads are spawned at various levels of parallelism, causing the OS to use more threads than necessary for the execution of the program to be optimal.
 We use the term "abstract class" here so as to talk about the design: no such concept exists in Cython.
 A set of template methods is defined so as to have concrete implementations modify data structures when and where needed.
 If you are looking for the concrete implementations' carburetor, look for
_compute_and_reduce_distances_on_chunks
.
 Haeupler B., Sen S., Tarjan R. E. (2009) Rank-Pairing Heaps. In: Fiat A., Sanders P. (eds) Algorithms - ESA 2009. Lecture Notes in Computer Science, vol 5757. Springer, Berlin, Heidelberg. DOI:
https://doi.org/10.1007/978-3-642-04128-0_59
 Thanks to Jake VanderPlas!
 We mainly use heap-allocated buffers that we manipulate through pointers and offsets at the lowest level of this new implementation for maximum efficiency.
 Disclaimer: during this work, no animals were killed or hurt; none are, and none will be.