Performances and scikit-learn: Pairwise Distances Reductions

Abstracting the k-nn search pattern

Published on: 15.12.2021
Last modified on: 16.01.2022
Estimated reading time: ~ 11 min.

⚠ Disclaimer: this post is a work in progress.

In particular, diagrams are being improved.


Context

We have seen that \(\text{argkmin}\) is the reduction which is performed on pairwise distances for \(k\)-nearest neighbors search.

Yet, there are plenty of other reductions over pairwise distances (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.) which are at the core of the computational foundations of many machine learning algorithms.
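As a reminder, the following NumPy sketch shows what the \(\text{argkmin}\) reduction computes for the Euclidean distance. It is a naive version which materialises the full distance matrix — precisely what the design presented below avoids — and the function name is illustrative, not scikit-learn’s API:

```python
import numpy as np


def naive_argkmin(X, Y, k):
    """For each row of X, return the indices of its k nearest rows of Y."""
    # Full (n_x, n_y) matrix of squared Euclidean distances: this is the
    # memory-hungry pattern that the chunked design below avoids.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # Unordered indices of the k smallest distances per row...
    argkmin = np.argpartition(sq_dists, kth=k - 1, axis=1)[:, :k]
    # ...then sorted by increasing distance.
    rows = np.arange(X.shape[0])[:, None]
    return argkmin[rows, np.argsort(sq_dists[rows, argkmin], axis=1)]


rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 10)), rng.normal(size=(300, 10))
neighbors = naive_argkmin(X, Y, k=5)  # shape: (100, 5)
```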

This blog post presents a design which takes the requirements of the existing implementations into account and introduces a new abstraction for implementing reductions over pairwise distances — PairwiseDistancesReduction — reimplementing those operations in a private Cython module for scikit-learn, hopefully improving the overall performance of its computational foundations, and thus of its user-facing interfaces.

For the sake of conciseness, and to respect the single source of truth principle as much as possible, this blog post won’t go through every technical detail. Readers interested in those can look at the first proposed implementations given in scikit-learn#21462.


Reductions over pairwise distances in scikit-learn 1.0

Unfortunately, the current implementations have evolved organically and are neither unified nor easy to follow, which makes extending those reductions difficult. Moreover, most of those reductions are implemented using Python functions as reduction functions within a general parallelisation pattern relying on joblib.Parallel. This keeps the code concise and readable, but it is technically not the most efficient: working at this level with views on NumPy arrays moves large chunks of data back and forth several times between the RAM and the CPUs’ caches and allocates temporary results. Moreover, the cost of manipulating those chunks of data in CPython’s interpreter causes a non-negligible overhead.

This prevents proper hardware scalability, especially when using more than 8 cores.

Hence the motivation to refactor those patterns so as to improve both their computational efficiency and their code readability.

To our knowledge, though some projects like KeOps implement those patterns efficiently for GPUs, no project implements such operations efficiently for CPUs.

Notation and wording

In what follows, the following notations are used:

  • \([n] \triangleq \{0, \cdots, n - 1\}\)
  • \(\mathbf{X} \in \mathbb{R}^{n_x \times p}\): a first dataset
  • \(\mathbf{X}_{i\cdot} \in \mathbb{R}^{p}\): the \(i\)-th vector of \(\mathbf{X}\)
  • \(\mathbf{Y} \in \mathbb{R}^{n_y \times p}\): a second dataset
  • \(\mathbf{Y}_{j\cdot} \in \mathbb{R}^{p}\): the \(j\)-th vector of \(\mathbf{Y}\)
  • \(c\): the chunk size, i.e. the number of vectors in a chunk (a group of adjacent vectors)
  • \(c_x \triangleq \left\lceil \frac{n_x}{c} \right\rceil\), the number of chunks for \(\mathbf{X}\)
  • \(c_y \triangleq \left\lceil \frac{n_y}{c} \right\rceil\), the number of chunks for \(\mathbf{Y}\)
  • \((\mathbf{X}_c^{(l)})_{l \in [c_x]}\): the ordered family of all the chunks of \(\mathbf{X}\)
  • \((\mathbf{Y}_c^{(k)})_{k \in [c_y]}\): the ordered family of all the chunks of \(\mathbf{Y}\)
  • \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)} \triangleq \left(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}\right)_{(l,k) \in [c_x] \times [c_y] }\): the ordered family of all the pairs of chunks
  • \(d\), the distance metric to use
$$ d: \mathbb{R}^{p} \times \mathbb{R}^{p} \longrightarrow \mathbb{R}_+ $$
  • \(\mathbf{D}_d(\mathbf{A}, \mathbf{B}) \in \mathbb{R}^{n_a \times n_b}\): the distance matrix for \(d\) between vectors of two matrices \(\mathbf{A} \in \mathbb{R}^{n_a \times p}\) and \(\mathbf{B} \in \mathbb{R}^{n_b \times p}\):
$$ \forall (i, j) \in [n_a]\times [n_b], \quad \mathbf{D}_d(\mathbf{A}, \mathbf{B})_{i,j} = d\left(\mathbf{A}_{i\cdot}, \mathbf{B}_{j\cdot}\right) $$
  • \(k\): parameter for the \(\text{argkmin}\) operation at the base of \(k\) nearest neighbors search
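For instance, with the illustrative values \(n_x = 1000\), \(n_y = 600\) and \(c = 256\), one gets \(c_x = \lceil 1000 / 256 \rceil = 4\) and \(c_y = \lceil 600 / 256 \rceil = 3\), so that \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) contains \(4 \times 3 = 12\) pairs of chunks.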

The terms “samples” and “vectors” will also be used interchangeably.

Visually, this gives:

[Figure: overview of the chunking of \(\mathbf{X}\) and \(\mathbf{Y}\)]

Requirements for reductions over pairwise distances

The following requirements are currently supported within scikit-learn’s implementations:

  • Support for 32bit datasets pairs and 64bit datasets pairs
  • Support for fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
    • dense \(\mathbf{X}\) and dense \(\mathbf{Y}\)
    • sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
    • dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
    • sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
  • Support all the distance metrics as defined via sklearn.metrics.DistanceMetric
  • Parallelise computations effectively on all cores
  • Prevent threads’ oversubscription1 (by OpenMP, joblib, or any BLAS implementations)
  • Implement adapted operations for each reduction (\(\text{argmin}\), \(\text{argkmin}\), threshold filtering, cumulative sum, etc.)
  • Support generic returned values for reductions (varying number, varying types, varying shapes, etc.)
  • Optimise the Euclidean distance computations

Proposed design

The following design proposes treating the given requirements as independently as possible.

DatasetsPair: an abstract class for manipulating datasets2

This allows:

  • Supporting 32bit datasets pairs and 64bit datasets pairs
  • Supporting fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs via concrete implementation, i.e.:
    • DenseDenseDatasetsPair
    • SparseDenseDatasetsPair
    • DenseSparseDatasetsPair
    • SparseSparseDatasetsPair
  • Supporting all the distance metrics as defined via sklearn.metrics.DistanceMetric

Internally, a DatasetsPair wraps \((\mathbf{X}, \mathbf{Y}, d)\) and exposes an interface which allows computing \(d(\mathbf{X}_i, \mathbf{Y}_j)\) for a given tuple \((i, j)\).
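To give a feel for this interface, here is a minimal Python sketch; the actual classes are written in Cython and their exact signatures may differ, and the callable used as distance metric here merely stands in for sklearn.metrics.DistanceMetric:

```python
from abc import ABC, abstractmethod

import numpy as np


class DatasetsPair(ABC):
    """Wraps (X, Y, d) and exposes d(X_i, Y_j) for a given pair of indices."""

    def __init__(self, distance_metric):
        self.distance_metric = distance_metric

    @abstractmethod
    def n_samples_X(self):
        """Number of vectors in X."""

    @abstractmethod
    def n_samples_Y(self):
        """Number of vectors in Y."""

    @abstractmethod
    def dist(self, i, j):
        """Distance between the i-th vector of X and the j-th vector of Y."""


class DenseDenseDatasetsPair(DatasetsPair):
    """Concrete pair where both X and Y are dense arrays."""

    def __init__(self, X, Y, distance_metric):
        super().__init__(distance_metric)
        self.X = np.ascontiguousarray(X)
        self.Y = np.ascontiguousarray(Y)

    def n_samples_X(self):
        return self.X.shape[0]

    def n_samples_Y(self):
        return self.Y.shape[0]

    def dist(self, i, j):
        return self.distance_metric(self.X[i], self.Y[j])


# Example usage with a Euclidean callable standing in for DistanceMetric
rng = np.random.default_rng(0)
pair = DenseDenseDatasetsPair(
    rng.random((5, 3)), rng.random((7, 3)),
    distance_metric=lambda a, b: np.sqrt(((a - b) ** 2).sum()),
)
d_03 = pair.dist(0, 3)
```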

PairwiseDistancesReduction: an abstract class to reduce distances on the fly

This allows:

  • Parallelising computations effectively on all cores
  • Preventing threads’ oversubscription (by OpenMP, joblib, or any BLAS implementations)
  • Supporting generic returned values for reductions (varying number, varying types, varying shapes, etc.)

This is made possible by:

  • setting up a general interface which parallelises the computations on \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\): two parallelisation strategies are implemented, because it is worth parallelising either on \(\mathbf{X}\) or on \(\mathbf{Y}\) depending on the context. To choose between them, a simple heuristic comparing \(c_x\) and \(c_y\) with regard to the number of available threads is used, and it is sufficient.
  • using a threadpoolctl.threadpool_limits context at the start of the execution of the generic parallel template
  • having a flexible Python interface to return results and having the parallel computations defined agnostically from the data structures being modified in concrete classes3.

The critical area of the computations — that is, computing the chunk of the distance matrix associated with each element of \(\mathbf{C}_\text{chunk_size}\mathbf{(X, Y)}\) and reducing it — is made abstract. This way, when defining a concrete PairwiseDistancesReduction, only a single method has to be defined, up to a few optional Python helper methods4.
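Schematically, the “parallel on \(\mathbf{X}\)” strategy of this template can be sketched as follows in pure Python; the real implementation uses Cython’s prange and OpenMP, and the function and parameter names below are illustrative, not scikit-learn’s API:

```python
from concurrent.futures import ThreadPoolExecutor

from threadpoolctl import threadpool_limits


def parallel_reduction_on_X(X, Y, chunk_size, reduce_on_chunks, n_threads):
    """Generic template parallelising the reduction over the chunks of X.

    `reduce_on_chunks(X_chunk, Y_chunk, x_start, y_start)` stands for the
    single abstract step a concrete reduction must provide: it computes the
    distances for one pair of chunks and reduces them into the reduction's
    own data structures.
    """
    x_starts = range(0, X.shape[0], chunk_size)
    y_starts = range(0, Y.shape[0], chunk_size)

    def _process_x_chunk(x_start):
        X_chunk = X[x_start:x_start + chunk_size]
        for y_start in y_starts:
            reduce_on_chunks(X_chunk, Y[y_start:y_start + chunk_size],
                             x_start, y_start)

    # Prevent over-subscription: BLAS-level threading is limited while the
    # outer loop is parallelised here. A symmetric "parallel on Y" strategy
    # is chosen instead when there are too few chunks of X to feed all threads.
    with threadpool_limits(limits=1, user_api="blas"):
        with ThreadPoolExecutor(max_workers=n_threads) as executor:
            list(executor.map(_process_x_chunk, x_starts))
```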

[Figure: computing distances on chunks]

Specialisations for the Euclidean distance metric are covered in the next sections.

PairwiseDistancesReductionArgKmin: a first concrete PairwiseDistancesReduction for \(\text{argkmin}\)

For this reduction, one can simply use max-heaps, which do the smart job of keeping the \(k\) smallest values together with their indices in \(\mathcal{O}(\log k)\). There might be more advanced implementations of heaps — such as rank-pairing heaps5 — but scikit-learn’s current implementation of max-heaps is simple, readable and efficient enough that we will stick with it6, and it can easily be used to manipulate the data structures that we need7.
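For illustration, here is how a fixed-size heap keeps the \(k\) smallest distances seen so far, sketched in pure Python with the standard library’s heapq (a min-heap used as a max-heap by negating distances); scikit-learn’s Cython heap works directly on preallocated buffers:

```python
import heapq


def argkmin_one_row(distances_iter, k):
    """Keep the k smallest (distance, index) pairs seen in a stream.

    heapq is a min-heap, so distances are negated to simulate a max-heap:
    the root then holds the largest of the k current best distances and
    can be evicted in O(log k) when a smaller distance comes in.
    """
    heap = []  # items are (-distance, index)
    for index, distance in distances_iter:
        if len(heap) < k:
            heapq.heappush(heap, (-distance, index))
        elif distance < -heap[0][0]:
            heapq.heapreplace(heap, (-distance, index))
    # Sorted from the smallest to the largest of the k retained distances
    return [(-d, i) for d, i in sorted(heap, reverse=True)]


# Example: the 3 closest of ten scalar "distances"
dists = enumerate([5.0, 1.2, 7.3, 0.4, 9.9, 3.3, 2.1, 8.8, 0.9, 4.2])
print(argkmin_one_row(dists, k=3))  # [(0.4, 3), (0.9, 8), (1.2, 1)]
```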

Specialising reductions for the Euclidean distance metric

In the case of the Euclidean distance metric, one can use the Squared Euclidean Distance as a proxy: it is less costly, it preserves ordering and it can be computed efficiently.

Indeed, for \(\mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)})\), the \((l,k)\)-th chunk of \(\mathbf{D}_d(\mathbf{X}, \mathbf{Y})\):

$$ \mathbf{D}_d(\mathbf{X}_c^{(l)}, \mathbf{Y}_c^{(k)}) \triangleq \left[\Vert \mathbf{X}_{i\cdot}^{(l)} - \mathbf{Y}_{j\cdot}^{(k)} \Vert^2_2\right]_{(i,j)\in [c]^2} = \left[\Vert \mathbf{X}_{i\cdot}^{(l)}\Vert^2_2 \right]_{(i,j)\in [c]^2} + \left[\Vert \mathbf{Y}_{j\cdot }^{(k)}\Vert^2_2 \right]_{(i, j)\in [c]^2} - 2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top $$

This allows using two optimisations:

  1. \(\left[\Vert \mathbf{X}_{i\cdot}\Vert_2^2\right]_{i \in [n_x]}\) and \(\left[\Vert \mathbf{Y}_{j\cdot}\Vert_2^2\right]_{j \in [n_y]}\) can be computed once and for all at the start and be cached. Those two vectors will be reused on each chunk of the distance matrix.

  2. More importantly, \(- 2 \mathbf{X}^{(l)} {\mathbf{Y}^{(k)}}^\top\) can be computed using the GEneral Matrix Multiplication routine from BLAS Level 3 — hereinafter referred to as GEMM. This allows reaching the maximum arithmetic intensity for these computations by relying on the vectorised computational kernels of recent BLAS implementations such as OpenBLAS (see the sketch just after this list).
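A NumPy sketch of this decomposition on one pair of chunks, where the X_chunk @ Y_chunk.T product is the GEMM call (helper and variable names are illustrative):

```python
import numpy as np


def squared_euclidean_chunk(X_chunk, Y_chunk, X_sq_norms, Y_sq_norms):
    """Chunk of the squared Euclidean distance matrix via the GEMM trick.

    X_sq_norms and Y_sq_norms are slices of the squared row norms of X and Y,
    computed once and cached before iterating on the chunks.
    """
    # -2 X Y^T: a single GEMM call, where most of the arithmetic happens
    distances = -2.0 * (X_chunk @ Y_chunk.T)
    # Add the cached squared norms, broadcast over rows and columns
    distances += X_sq_norms[:, np.newaxis]
    distances += Y_sq_norms[np.newaxis, :]
    # Guard against small negative values caused by rounding errors
    return np.maximum(distances, 0.0)


rng = np.random.default_rng(0)
X_chunk, Y_chunk = rng.normal(size=(256, 30)), rng.normal(size=(192, 30))
chunk = squared_euclidean_chunk(
    X_chunk, Y_chunk,
    (X_chunk ** 2).sum(axis=1), (Y_chunk ** 2).sum(axis=1),
)  # shape: (256, 192)
```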

For instance, FastEuclideanPairwiseDistancesArgKmin is the main specialisation of this concrete implementation for the Euclidean distance metric; it solely computes or recomputes what is needed (e.g. the actual Euclidean distances when the caller asks for them to be returned).

Interfacing PairwiseDistancesReductions with scikit-learn’s algorithms

So far, the overall design has been covered without mentioning how these implementations can be plugged into existing scikit-learn algorithms so as to progressively migrate most algorithms’ back-ends to them.

Moreover, in the future, we could imagine creating specialised implementations for various vendors of CPUs and GPUs. In this case, we want to have such specialised implementations separated from the scikit-learn source code (e.g. by having them in optional and vendor-specific packages) so as to keep the PairwiseDistancesReductions interfaces vendor-specialisation-agnostic while still being able to dispatch the computations to the most adapted and available implementations.

To touch two birds with one tiny stone8, the new implementations can be used conditionally — only for the cases they support, based on the provided data — and be called agnostically by the rest of the code base. This can be implemented by a PairwiseDistancesReduction.{is_usable_for,compute} pattern:

  • PairwiseDistancesReduction.is_usable_for returns True if any implementation for the provided \((\mathbf{X}, \mathbf{Y}, d)\) can be used. If none is available, the caller can default to the current implementation within scikit-learn.
  • PairwiseDistancesReduction.compute returns the results of the reduction. Internally, it is responsible for choosing the best implementation prior to executing the computations.

In this context, the aforementioned vendor-specific packages could register custom implementations explicitly (i.e. with a Python context manager, as suggested by Olivier Grisel) or implicitly (by some package reflection when importing relevant interfaces).
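A minimal Python sketch of this pattern; the checks and signatures below are simplified assumptions rather than the actual ones of scikit-learn#21462, and the fallback helper in the final comment is hypothetical:

```python
import numpy as np


class PairwiseDistancesReductionArgKmin:
    """Illustration of the is_usable_for / compute dispatching pattern."""

    @classmethod
    def is_usable_for(cls, X, Y, metric):
        """Return True if an optimised implementation handles (X, Y, metric)."""
        # Simplified checks; the real ones would also inspect sparsity,
        # memory layout and the set of supported metrics.
        return (
            isinstance(X, np.ndarray)
            and isinstance(Y, np.ndarray)
            and X.dtype == Y.dtype == np.float64
            and metric == "euclidean"
        )

    @classmethod
    def compute(cls, X, Y, k, metric="euclidean"):
        """Choose the best available implementation and run the reduction.

        Stand-in body: a plain NumPy argkmin over the full distance matrix.
        """
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.argsort(sq_dists, axis=1)[:, :k]


# Caller side, e.g. inside an estimator:
#
#   if PairwiseDistancesReductionArgKmin.is_usable_for(X, Y, metric):
#       results = PairwiseDistancesReductionArgKmin.compute(X, Y, k, metric)
#   else:
#       results = fallback_joblib_based_implementation(X, Y, k, metric)
```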

Implementing the design

A few initial experiments have been made and converged on scikit-learn#21462, a contribution which proposes integrating the previous interfaces.

If you are interested in reading more about this, read this section from the extra notes.

Future work

Further work will address the remaining requirements:

  • Support for 32bit datasets pairs
  • Support for the last fused \(\{\text{sparse}, \text{dense}\}^2\) datasets pairs, i.e.:
    • sparse \(\mathbf{X}\) and dense \(\mathbf{Y}\)
    • dense \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
    • sparse \(\mathbf{X}\) and sparse \(\mathbf{Y}\)
  • Implement adapted operations for the other reductions (radius neighborhood, threshold filtering, cumulative sum, etc.)

But first, scikit-learn#21462 is to be reviewed and agreed upon before proceeding with subsequent PRs implementing such changes.

If you are interested in reading more about this, read this section from the extra notes.

Acknowledgement

Other scikit-learn core developers’ suggestions — namely the ones from Olivier Grisel, Jérémie du Boisberranger, Thomas J. Fan and Christian Lorentzen — helped improve the design and the performance.

Finally and more importantly, the implementations presented here are made possible thanks to other notable open-source projects, especially:

  • Cython, which allows getting bare-metal C performance using a superset of Python
  • OpenBLAS, which provides handcrafted kernels in C and assembly for BLAS routines

Many thanks go to their maintainers!


References

  1. Threads’ oversubscription happens when threads are spawned at various levels of parallelism, causing the OS to use more threads than necessary for the execution of the program to be optimal. ↩
  2. We use the term “abstract class” here so as to talk about the design: no such concept exists in Cython. ↩
  3. A set of template methods are defined so as to have concrete implementations modify datastructures when and where needed. ↩
  4. If you are looking for the concrete implementations’ carburettor, look for _compute_and_reduce_distances_on_chunks. ↩
  5. Haeupler B., Sen S., Tarjan R.E.(2009) Rank-Pairing Heaps. In: Fiat A., Sanders P. (eds) Algorithms - ESA 2009. ESA 2009. Lecture Notes in Computer Science, vol 5757. Springer, Berlin, Heidelberg. DOI: https://doi.org/10.1007/978-3-642-04128-0_59 ↩
  6. Thanks to Jake VanDerplas! ↩
  7. We mainly use heap-allocated buffers that we manipulate through pointers and offsets at the lowest level of this new implementation for maximum efficiency. ↩
  8. Disclaimer: during this work, no animals were killed or hurt; none are and none will be. ↩