## Disclaimer

This whole blog post is about some experimentations I made at INRIA and is hosted there temporarily.

It is work in progress and will be improved over time.

It assumes that the reader is knowledgeable about a few subjects, namely:

- Python and CPython (its main implementation's internals)
- parallel computing programming models (such as OpenMP's)
- `KDTree` use-cases and internals, and ideally scikit-learn interfaces, especially `sklearn.neighbors.KDTree`
- elements of sorting algorithms

`tl;dr`

This blog post introduces a way of optimally constructing a `KDTree` (an efficient datastructure for querying vectors in a space) with Cython+, an extension of the Cython project developed by Nexedi.

It compares two possible implementations of the construction of this datastructure: the one within scikit-learn and one using Cython+.

Finally, comparisons with `sklearn.neighbors.KDTree` are drawn^{1}.

The code presented by this prototype is accessible online here: `https://lab.nexedi.com/cython-plus/kdtree`.

## Context

Querying points in a vector space (to look for nearest neighbours, for instance) is a common computational task in many algorithms, especially in machine learning ones.

Several algorithms exist to perform nearest neighbors queries. One of them is particularly adapted to queries in low-dimensional spaces: the `KDTree`^{2}.

This datastructure is at the core of many machine learning algorithms. As an example, `sklearn.neighbors.KDTree` (the implementation of the `KDTree` available within scikit-learn) is used under the hood by many implementations of this library.

Yet, this datastructure's construction can be improved using a suitable pattern of task parallelisation.
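As a reminder of the interface this post benchmarks against, here is a minimal example of building and querying a `sklearn.neighbors.KDTree` (the data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))  # 1000 points in a 3-dimensional space

# Build the tree; leaf_size controls how many points each leaf holds.
tree = KDTree(X, leaf_size=40)

# Query the 5 nearest neighbours of the first 2 points.
distances, indices = tree.query(X[:2], k=5)
```

Each row of `indices` lists the neighbours of one query point, closest first, so a point queried against the tree it was built from is its own nearest neighbour.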

## Notation

In what follows, we use the following parameters:

- \(n_\text{samples}\): the number of vectors to construct the `KDTree` on. This is given.
- \(n_\text{features}\): the dimension of the vector space. This is given.
- \(\text{leaf_size}\): the number of vectors in each leaf of the tree. This has to be set.
- \(n_\text{nodes}\): the number of nodes in the `KDTree`.

## Internals of the `KDTree`

The datastructures of a `KDTree` are:

- `data`, the buffer of `X`, the numpy array passed by the caller, which:
  - stores coordinates of samples
  - is of shape \((n_\text{samples}, n_\text{features})\)
  - will not be modified
- `indices`, of shape \((n_\text{samples}, 1)\), which stores the vectors' indices as sorted by the tree
- `node_data`, of shape \((n_\text{nodes}, 1)\), an array of entries which store the following information of a node:
  - `start_idx`: the start of a `Node` range
  - `end_index`: the end of a `Node` range
  - `is_leaf`: whether the `Node` is a leaf (terminal `Node`)
- `node_bounds`, of shape \((2, n_\text{nodes}, n_\text{features})\), an array which stores the bounds of the subspace covered by each `Node`

The number of nodes can easily be deduced from \(n_\text{samples}\) and \(\text{leaf_size}\).
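As a sketch of how the number of nodes can be deduced, assuming (as scikit-learn's binary trees do) a complete binary tree whose leaves each hold on the order of `leaf_size` samples:

```python
import math

def n_nodes(n_samples: int, leaf_size: int) -> int:
    # Number of levels of the complete binary tree: depth grows until
    # each leaf range holds roughly leaf_size samples.
    n_levels = 1 + int(math.log2(max(1, (n_samples - 1) / leaf_size)))
    # A complete binary tree with n_levels levels has 2**n_levels - 1 nodes.
    return 2 ** n_levels - 1
```

For instance, 1000 samples with a `leaf_size` of 40 yield a 5-level tree of 31 nodes under this convention.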

## Constructing a `KDTree`: a perfect fit for asynchronous parallel execution

The goal of constructing the `KDTree` is to partition the space into two subspaces along a given dimension and recurse. The steps of the algorithm are materialised as `Node`s, breaking down the construction of the `KDTree` into several steps.

This is a *decomposable parallel algorithm*: at a given `Node`, the first half of the indices' partition is independent of the second. Hence, one can modify this array in place concurrently without any race conditions on individual elements of the array.

It is therefore safe to schedule tasks dynamically to construct the tree; the one thing needed is a runtime which allows executing tasks dynamically, which is something Cython+ proposes^{3}.
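To illustrate why this is race-free, here is a minimal Python sketch (not the Cython+ implementation): two threads reorder disjoint halves of the same indices array in place, each along its own dimension.

```python
import threading

import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((8, 2))   # 8 points in 2 dimensions
indices = np.arange(8)          # the shared indices array, modified in place

def reorder(start, end, dim):
    """Reorder indices[start:end] by the coordinates on dimension `dim`."""
    order = np.argsort(X[indices[start:end], dim])
    indices[start:end] = indices[start:end][order]  # touches only [start, end)

# The two halves are disjoint slices of `indices`: no element is written
# by both threads, hence no race condition on individual elements.
left = threading.Thread(target=reorder, args=(0, 4, 0))
right = threading.Thread(target=reorder, args=(4, 8, 1))
left.start(); right.start()
left.join(); right.join()
```

Each thread only permutes the indices inside its own range, so both halves still contain exactly the elements they started with.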

### A few notes about `sklearn.neighbors.KDTree` construction

The construction of the `sklearn.neighbors.KDTree` is sequential. The datastructures are allocated up front based on the provided numpy array. Then `Node`s are recursively created, starting from the root of the `KDTree`.

At each `Node` creation, indices are partitioned on the `Node` range based on a pivot. This pivot is chosen as the median on the chosen dimension, which can generally be found in \(\mathcal{O}(n_\text{samples})\)^{4}.
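For instance, a linear-time median pivot can be obtained with an introselect-based partial sort such as numpy's `argpartition` (shown here as a sketch; scikit-learn uses its own C-level partitioning routine):

```python
import numpy as np

def median_pivot(values: np.ndarray) -> int:
    """Index of the median element, found in O(n) on average
    using numpy's introselect-based partial sort."""
    k = values.shape[0] // 2
    # argpartition places the index of the k-th smallest value at
    # position k without fully sorting the array.
    return int(np.argpartition(values, k)[k])

values = np.array([7.0, 1.0, 5.0, 3.0, 9.0])
pivot = median_pivot(values)   # values[pivot] is the median, 5.0
```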

### `KDTree` construction with Cython+

Cython+ allows scheduling tasks asynchronously thanks to a system of `Actor`s^{5}.

With Cython+, the algorithm can be broken down by defining:

- a main `KDTree`, which keeps track of its root `Node` and which wraps the datastructures presented above.
- `Node`s, which are actors responsible for:
  - partitioning indices on their range of indices, given a dimension, and obtaining a pivot.
    - There are several heuristics to choose the dimension to partition indices on and to partition them^{6}. Here, we solely consider two simple heuristics: the first is to round-robin over dimensions at each `Node` level; the second is to choose the dimension with the maximum spread on the given subspace at each `Node` level (as done by `sklearn.neighbors.KDTree`).
    - Moreover, similarly to `sklearn.neighbors.KDTree`, we define the pivot to be the index of the median of the coordinates on the chosen dimension.
  - recursing into their left and right `Node`s, hence partitioning indices on subranges to the left and the right of the pivot. *This is the part which can be made non-blocking using the proposed asynchronous back-end.*
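The overall pattern can be sketched in plain Python, with a thread pool standing in for the Cython+ actor runtime (names such as `build_node` are illustrative, not the prototype's API; the dimension choice here is the round-robin heuristic):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((64, 3))
indices = np.arange(64)
LEAF_SIZE = 8
executor = ThreadPoolExecutor()
pending = []

def build_node(start, end, level):
    """Partition indices[start:end] around the median on a
    round-robin dimension, then recurse into both halves."""
    if end - start <= LEAF_SIZE:
        return                               # terminal Node: nothing to do
    dim = level % X.shape[1]                 # round-robin dimension choice
    middle = (start + end) // 2
    chunk = indices[start:end]
    order = np.argpartition(X[chunk, dim], middle - start)
    indices[start:end] = chunk[order]        # in-place partition of the range
    # The two recursive calls touch disjoint subranges, so they can be
    # scheduled asynchronously (this is what the Cython+ actors do).
    pending.append(executor.submit(build_node, start, middle, level + 1))
    pending.append(executor.submit(build_node, middle, end, level + 1))

build_node(0, 64, 0)
while pending:
    pending.pop().result()   # wait until every scheduled task has run
executor.shutdown()
```

After construction, `indices` is a permutation of the original indices in which every node's left subrange holds coordinates no greater, on that node's dimension, than its right subrange.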

If the reader is interested in the implementation, it is available online here. Note that Cython+ is still being developed and that the current implementation has some quirks due to its use of experimental features, which are being improved over time^{7}.

### Program design: A simple actor pattern for synchronisation

Frictions can exist between synchronous and asynchronous computations. This is especially the case here between the synchronous Python interface users interact with and its implementation, which runs asynchronously.

For instance, interfacing the asynchronous logic with a synchronous execution is done internally via a `Counter`, which encapsulates the number of sorted indices and is updated by leaves (terminal `Node`s). This mechanism, which is similar to a semaphore, makes the main caller wait while the construction of the `KDTree` is not entirely done.
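A minimal Python analogue of such a `Counter` (the Cython+ version is an actor; here a condition variable plays that role, and the class shape is an assumption, not the prototype's API) could look like:

```python
import threading

class Counter:
    """Blocks a synchronous caller until `target` units of work
    (here: sorted indices) have been reported by asynchronous tasks."""

    def __init__(self, target: int):
        self._target = target
        self._count = 0
        self._condition = threading.Condition()

    def add(self, amount: int) -> None:
        # Called by each leaf once its range of indices is sorted.
        with self._condition:
            self._count += amount
            if self._count >= self._target:
                self._condition.notify_all()

    def wait(self) -> None:
        # Called by the synchronous interface: returns only once
        # every index has been accounted for.
        with self._condition:
            self._condition.wait_for(lambda: self._count >= self._target)

counter = Counter(target=4)
for _ in range(4):
    threading.Thread(target=counter.add, args=(1,)).start()
counter.wait()  # returns once all four "leaves" have reported
```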

## Comparison ofÂ implementations

The experiments presented in this blog post are available online. The results shown here were obtained on a machine with 40 cores and can easily be reproduced using the setup documented there.

This pattern, used via Cython+, can be made nearly optimal^{8}, making it faster than the current scikit-learn implementation (as of scikit-learn 1.0) by a factor of more than 10 when partitioning in a round-robin fashion.

It is also faster by a factor of approximately 2 when partitioning on the dimension with the maximum spread^{9}.

When using the round-robin choice of the dimension, the execution is memory bound.

This case is interesting because it shows that the Cython+ runtime really has a small memory footprint and a fast execution. Indeed, a simple usage of `perf(1)` shows that about 90% of the time spent constructing the tree is spent partitioning indices using the `std::nth_element` interface from the C++ standard library (which, in this configuration, uses an implementation of introspective sorting^{10} under the hood, effectively partitioning points in \(\mathcal{O}(n_\text{samples})\)):

```
Samples: 15K of event 'cycles:u', Event count (approx.): 11409982420
- 100.00% python
   - 92.91% kdtree.cpython-39-x86_64-linux-gnu.so
        88.45% [.] std::__introselect<int*, long, __gnu_cxx::__ops::_Iter
         0.64% [.] CyLock::wlock
         0.50% [.] __pyx_t_7runtime_7runtime_Worker::get_queue
         0.34% [.] __pyx_t_7runtime_7runtime_BatchMailBox::activate
         0.21% [.] CyLock::unwlock
         0.21% [.] CyObject::CyObject_DECREF
         0.19% [.] __pyx_t_7runtime_7runtime_Scheduler::post_queue
```

## Conclusion and furtherÂ work

First, we think that the runtime of Cython+ might be beneficial for bringing asynchronous parallel execution of tasks to Cython.

Secondly, this prototype allowed us to identify some patterns for synchronisation between synchronous and asynchronous execution contexts.

Finally, further work would involve using the type system of Cython+ to ensure proper reference safety at compile time (which is one of the proposals of Cython+), as well as comparisons with other `KDTree` implementations.

## References

1. `sklearn.neighbors.KDTree` has been implemented with its querying methods by Jake VanderPlas and was later refined by other contributors. ↩
2. Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept. 1975), 509-517. DOI: `https://doi.org/10.1145/361002.361007` ↩
3. For more information about Cython+ design and features, you can watch Xavier Thompson's presentation at EuroPython 2021 and/or read his presentation's slides. ↩
4. This search can be done efficiently using the `std::nth_element` interface from the C++ standard library. ↩
5. See the different blog posts from Nexedi and this blog post. ↩
6. Brown, Russell. (2014). Building a Balanced k-d Tree in \(\mathcal{O}(kn \log n)\) Time. `arXiv:1410.5420` ↩
7. This implementation itself is being extended to perform queries. ↩
8. In this case, one can choose a proper specification of the `KDTree` via `leaf_size` to use the optimal number of workers for the loads. As of now, it is up to the caller to make this choice, but a heuristic could be introduced in the logic of the `KDTree` itself. ↩
9. Full benchmark results for both cases are accessible online here. ↩
10. David R. Musser. 1997. Introspective sorting and selection algorithms. Softw. Pract. Exper. 27, 8 (Aug. 1997), 983-993. DOI: `https://doi.org/10.1002/(SICI)1097-024X(199708)27:8<983::AID-SPE117>3.0.CO;2-%23` ↩