Towards asynchronously constructing KDTree

Some experimentations with an asynchronous runtime for Cython.

Published on the: 02.11.2021
Last modified on the: 13.01.2022
Estimated reading time: ~ 9 min.

⚠ Disclaimer

This whole blog post is about some experimentations I made at INRIA and is hosted there temporarily.

It is work in progress and will be improved over time.

It assumes that the reader is knowledgeable about a few subjects, namely:

  • Python and CPython (its main implementation internals)
  • parallel computing programming model (such as OpenMP’s)
  • KDTree use-cases and internals, ideally
  • scikit-learn interfaces eventually, especially sklearn.neighbors.KDtree
  • elements of sorting algorithms

tl;dr

This blog post introduces a way of optimally constructing KDTree — an efficient datastructure for space vector elements queries — with Cython+, an extension of the Cython project developed by Nexedi.

It compares various possible implementations of such a data-structure as implemented within scikit-learn and with Cython+.

Finally, comparisons with sklearn.neighbors.KDTree are drawn1.

The code presented by this prototype is accessible online here: https://lab.nexedi.com/cython-plus/kdtree.


Context

Querying points in a vector space (to look for nearest neighbours, for instance) is a common computational tasks in many algorithms, especially in machine learning ones.

Several algorithms exist to perform nearest neighbors queries. One of them is particularly adapted in the case of queries in low dimensional spaces: the KDTree2.

This datastructure is at the core of many machine learning algorithms. As an example, sklearn.neighbors.KDTree — the implementation of the KDTree available within scikit-learn — and is used under the hood by many implementations of this library .

Yet, this datastructure’s construction can be speeded-up using an suitable pattern of tasks parallelisation.

Notation

In what follows, we use the following parameters:

  • \(n_\mbox{samples}\): the number of vectors to construct the KDTree on. This is given.
  • \(n_\mbox{features}\): the dimension of the vector space. This is given.
  • \(\mbox{leaf_size}\): the number of vectors in each leaf of the tree. This has to be set.
  • \(n_\mbox{nodes}\): the number of nodes in the KDTree.

Internals of the KDTree

The datastructures of a KDTree are:

  • data, the buffer of X, the numpy array passed by the caller.
    • stores coordinates of samples
    • of shape \((n_\mbox{samples}, n_\mbox{features})\)
    • will not be modified
  • indices of shape \((n_\mbox{samples}, 1)\), which stores the vectors’ indices as sorted by the tree
  • node_data of shape \((n_\mbox{nodes}, 1)\), an array of entries which are storing the following information of a node:
    • start_idx: the start of a Node range
    • end_index: the end of a Node range
    • is_leaf: whether the Node is a leaf (terminal Node)
  • node_bounds of shape \((2, n_\mbox{nodes}, n_\mbox{features})\), an array which stores the bounds of the subspace covered by each Nodes

In layman’s diagram3:

Datastructures of a KDTree

The number of nodes can easily be deduced:

\begin{align} \begin{cases} n_\mbox{levels} = \left \lfloor \log_2\left( \max\left(1, \frac{n_\mbox{samples} - 1}{\mbox{leaf_size}}\right) +1 \right) \right \rfloor \\ n_\mbox{nodes} = 2^{n_\mbox{levels}} - 1 \end{cases} \end{align}

Constructing a KDTree: a perfect fit for asynchronous parallel execution

The goal of constructing the KDTree is to partition the space in two subspaces on a given dimension and recurse. The steps of the algorithm are materialised as Nodes, breaking down the construction of the KDTree in several steps.

This is an decomposable parallel algorithm: at a given Node, the first half of the indices’ partition is independent of its second. Hence, one can modify this array inplace concurrently without any races conditions on individual elements of the array.

Hence it is safe scheduling tasks dynamically to construct the tree, one thing that is needed is a runtime which allows executing tasks dynamically, which is something Cython+ proposes4.

A few notes about sklearn.neighbors.KDTree construction

The construction of the sklearn.neighbors.KDTree is sequential. The datastructures are allocated based on the provided numpy array on start. Then Nodes are recursively created, starting from the root of the KDTree.

At each Node creation, indices are partitioned on the Node range based on a pivot. This pivot is chosen as the median on the chosen dimension which can generally be found in \(\mathcal{O}(n_\text{samples})\)5.

KDTree construction with Cython+

Cython+ allows scheduling tasks asynchronously thanks to a system of Actor6.

With Cython+, the algorithm can be broken down defining:

  • a main KDTree, which keeps tracks of its root Node and which wraps the datastructures presented above.
  • Nodes which are actors responsible for:
    • partitionning indices with respect on their range of indices given a dimension and obtaining a pivot.
      • There are several heuristics to choose the dimension to partition indices on and to partition them7. Here, we solely consider two simple heuristics: the first choice is just to round-robin over dimensions at each Node level, the second is to choose the dimension with the maximum spread on the given subspace at each Node level (as done by sklearn.neighbors.KDTree) .
      • Moreover and similarly to sklearn.neighbors.KDTree, we define the pivot to be the index of median of the coordinates on the chosen dimension.
    • recursing to their left and right Nodes, hence partitionning indices on subranges at the right and the left of the pivot. This is the part which can be made non-blocking using the proposed asynchronous back-end.

If the reader is interested in the implementation, it is available online here. Note that Cython+ is still being developped and that the current implementation has some quircks as to use its experimental features which are being improved overtime8.

Program design: A simple actor pattern for synchronisation

Frictions can exist between synchronous and asynchronous computations. This is especially the case here between the synchronous Python interface users interact with and the implementation of it which runs asynchronously.

For instance interfacing the asynchronuous logic with a synchronous execution is done internally via a Counter which encapsulates the number of sorted indices being updated by leafs (terminal Nodes). This mecanism, which is similar to a semaphore, allows making the main caller wait while the construction of the KDTree is not entirely done.

Comparison of implementations

The experimentations made in this blog post are present online. The results which are presented here have been made on a machine with 40 cores and can easily be reproduced using the setup which is documented there.

This pattern used via Cython+ can be made nearly optimal9 as to be faster than the current scikit-learn implementation (as of scikit-learn 1.0) by a factor than more than 10 when partitionning in a round-robin fashion.

Benchmarks results (round robin choice)

It is also approximately faster by factor 2 when partitioning on the dimension with the maximum spread10.

Benchmarks results (maximum spread choice)

When using the round-robin choice of the dimension, the execution is memory bound by the CPUs.

This case is interesting because it allows showing that Cython+ runtime really has a small memory footprint and a fast execution. Indeed, a simple usage of perf(1) shows that about 90% of the time spent constructing the tree is spent partitioning indices using the std::nth_element interface from the standard library of C++ (which uses an implementation of introspective sorting 11 under the hood in this configuration, partitionning points effectively in \(\mathcal{O}(n_\text{samples})\)):

Samples: 15K of event 'cycles:u', Event count (approx.): 11409982420

-  100.00%        python                                                          ◆
   -   92.91%        kdtree.cpython-39-x86_64-linux-gnu.so                        ▒
          88.45%        [.] std::__introselect<int*, long, __gnu_cxx::__ops::_Iter▒
           0.64%        [.] CyLock::wlock                                         ▒
           0.50%        [.] __pyx_t_7runtime_7runtime_Worker::get_queue           ▒
           0.34%        [.] __pyx_t_7runtime_7runtime_BatchMailBox::activate      ▒
           0.21%        [.] CyLock::unwlock                                       ▒
           0.21%        [.] CyObject::CyObject_DECREF                             ▒
           0.19%        [.] __pyx_t_7runtime_7runtime_Scheduler::post_queue

Conclusion and further work

First, we think that the runtime of Cython+ might be beneficial to bring asynchronous parallel execution of tasks to Cython.

Secondly, this prototype allowed us identifying some patterns for synchronisation between synchronous and asynchronous execution contexts.

Finally, further work would involve using the type system of Cython+ to ensure proper reference safety at compile time — which is one of the proposals of Cython+ — as well as comparison with other KDTrees‘s implementation.


References

  1. sklearn.neighbors.KDTree has been implemented with the querying methods by Jake VanderPlas and was latter refined by other contributors — thanks to them!. ↩
  2. Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept. 1975), 509–517. DOI:https://doi.org/10.1145/361002.361007 ↩
  3. Once again, I beg the readers’ perdon: this diagram is to be improved. ↩
  4. For more information about Cython+ design and features, you can watch Xavier Thompson’s presentation at EuroPython 2021 and/or read his presentation’s slides. ↩
  5. This search can be done efficiently using std::nth_element interface from C++ standard library. ↩
  6. See the different blog posts from Nexedi and this blog post. ↩
  7. Brown, Russell. (2014). Building a Balanced k-d Tree in \(\mathcal{O}(kn \log n)\) Time. arXiv:1410.5420 ↩
  8. This implementation itself is being extended to perform queries. ↩
  9. In this case, one can choose a proper specification of the KDTree via leaf_size to use the optimal number of workers for the loads. As of now, it’s up to the caller to perform this choice but a heuristic could be introduced in the logic of the KDTree itself. ↩
  10. Full benchmarks results for both cases are accessible online here. ↩
  11. David R. Musser. 1997. Introspective sorting and selection algorithms. Softw. Pract. Exper. 27, 8 (Aug. 1997), 983–993.DOI: https://doi.org/10.1002/(SICI)1097-024X(199708)27:8<983::AID-SPE117>3.0.CO;2-%23 ↩