⚠ Disclaimer
This whole blog post is about some experiments I made at INRIA and is hosted there temporarily.
It is work in progress and will be improved over time.
It assumes that the reader is knowledgeable about a few subjects, namely:
- Python and the internals of CPython (its main implementation)
- parallel computing programming models (such as OpenMP's)
- `KDTree` use-cases and internals, ideally
- scikit-learn interfaces, especially `sklearn.neighbors.KDTree`
- elements of sorting algorithms
tl;dr
This blog post introduces a way of optimally constructing a `KDTree` (an efficient data structure for querying points in a vector space) with Cython+, an extension of the Cython project developed by Nexedi.
It compares two possible implementations of the construction of this data structure: the one within scikit-learn and one written with Cython+.
Finally, comparisons with `sklearn.neighbors.KDTree` are drawn[1].
The code presented in this prototype is accessible online at https://lab.nexedi.com/cython-plus/kdtree.
Context
Querying points in a vector space (to look for nearest neighbours, for instance) is a common computational task in many algorithms, especially in machine learning.
Several algorithms exist to perform nearest neighbors queries. One of them is particularly well suited to queries in low-dimensional spaces: the `KDTree`[2].
This data structure is at the core of many machine learning algorithms. As an example, `sklearn.neighbors.KDTree` (the implementation of the `KDTree` available within scikit-learn) is used under the hood by many implementations in this library.
Yet, the construction of this data structure can be improved using a suitable task-parallelisation pattern.
Notation
In what follows, we use the following parameters:
- \(n_\text{samples}\): the number of vectors to construct the `KDTree` on. This is given.
- \(n_\text{features}\): the dimension of the vector space. This is given.
- \(\text{leaf\_size}\): the number of vectors in each leaf of the tree. This has to be set.
- \(n_\text{nodes}\): the number of nodes in the `KDTree`.
Internals of the `KDTree`
The data structures of a `KDTree` are:
- `data`, the buffer of `X`, the numpy array passed by the caller:
  - stores the coordinates of the samples
  - of shape \((n_\text{samples}, n_\text{features})\)
  - will not be modified
- `indices`, of shape \((n_\text{samples}, 1)\), which stores the vectors' indices as sorted by the tree
- `node_data`, of shape \((n_\text{nodes}, 1)\), an array of entries storing the following information for each node:
  - `start_idx`: the start of a `Node` range
  - `end_index`: the end of a `Node` range
  - `is_leaf`: whether the `Node` is a leaf (a terminal `Node`)
- `node_bounds`, of shape \((2, n_\text{nodes}, n_\text{features})\), an array which stores the bounds of the subspace covered by each `Node`
The number of nodes can easily be deduced from \(n_\text{samples}\) and \(\text{leaf\_size}\).
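Assuming the same convention as `sklearn.neighbors.KDTree` (a complete binary tree whose leaves each hold on the order of \(\text{leaf\_size}\) samples), the deduction would be:

\[
n_\text{levels} = 1 + \left\lfloor \log_2\!\left(\max\!\left(1, \frac{n_\text{samples} - 1}{\text{leaf\_size}}\right)\right)\right\rfloor,
\qquad
n_\text{nodes} = 2^{\,n_\text{levels}} - 1.
\]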
Constructing a `KDTree`: a perfect fit for asynchronous parallel execution
The goal of constructing the `KDTree` is to partition the space into two subspaces along a given dimension and to recurse. The steps of the algorithm are materialised as `Node`s, breaking down the construction of the `KDTree` into several steps.
This is a decomposable parallel algorithm: at a given `Node`, the first half of the indices' partition is independent of the second. Hence, one can modify this array in place concurrently without any race conditions on individual elements of the array.
It is therefore safe to schedule tasks dynamically to construct the tree; the one thing needed is a runtime which allows executing tasks dynamically, which is something Cython+ provides[3].
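To illustrate why this decomposition is safe, here is a minimal, level-synchronous sketch using plain numpy and threads (not the Cython+ actor runtime; the function and parameter names are made up): the nodes of a given level cover disjoint sub-ranges of `indices`, so they can be partitioned concurrently without races.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_node(data, indices, start, end, dim):
    """Partition indices[start:end] in place around the median along `dim`."""
    mid = (start + end) // 2
    values = data[indices[start:end], dim]
    order = np.argpartition(values, mid - start)   # O(n) selection, no full sort
    indices[start:end] = indices[start:end][order]
    return (start, mid), (mid, end)

def build(data, leaf_size=40, n_workers=4):
    """Fill `indices` level by level: the two halves produced by a node never
    overlap, so nodes of the same level can be partitioned concurrently."""
    n_samples, n_features = data.shape
    indices = np.arange(n_samples)
    level, depth = [(0, n_samples)], 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while level:
            dim = depth % n_features               # round-robin choice of the dimension
            todo = [(s, e) for (s, e) in level if e - s > leaf_size]
            children = pool.map(lambda r, d=dim: split_node(data, indices, r[0], r[1], d), todo)
            level = [child for pair in children for child in pair]
            depth += 1
    return indices
```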
A few notes about `sklearn.neighbors.KDTree` construction
The construction of the `sklearn.neighbors.KDTree` is sequential. The data structures are allocated up front based on the provided numpy array. Then `Node`s are created recursively, starting from the root of the `KDTree`.
At each `Node` creation, indices are partitioned over the `Node` range based on a pivot. This pivot is chosen as the median along the chosen dimension, which can generally be found in \(\mathcal{O}(n_\text{samples})\)[4].
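For illustration only, this \(\mathcal{O}(n_\text{samples})\) median selection can be emulated in Python with `numpy.argpartition`, whose introselect algorithm plays the same role as the `std::nth_element` routine mentioned in [4]; the variable names below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(1_000_000)        # coordinates of a node's points along the chosen dimension
mid = values.shape[0] // 2

order = np.argpartition(values, mid)  # O(n) on average, no full sort
pivot = order[mid]                    # index of the median element

# every point before the pivot has a coordinate no larger than the median,
# every point after it has a coordinate no smaller
assert (values[order[:mid]] <= values[pivot]).all()
assert (values[order[mid + 1:]] >= values[pivot]).all()
```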
`KDTree` construction with Cython+
Cython+ allows scheduling tasks asynchronously thanks to a system of `Actor`s[5].
With Cython+, the algorithm can be broken down by defining:
- a main `KDTree`, which keeps track of its root `Node` and which wraps the data structures presented above.
- `Node`s, which are actors responsible for:
  - partitioning the indices within their range, along a given dimension, and obtaining a pivot.
    - There are several heuristics to choose the dimension to partition the indices on and to partition them[6]. Here, we solely consider two simple heuristics: the first is to round-robin over dimensions at each `Node` level, the second is to choose the dimension with the maximum spread on the given subspace at each `Node` level (as done by `sklearn.neighbors.KDTree`). Both heuristics are sketched right after this list.
    - Moreover, similarly to `sklearn.neighbors.KDTree`, we define the pivot to be the index of the median of the coordinates along the chosen dimension.
  - recursing into their left and right `Node`s, hence partitioning the indices on the sub-ranges to the left and to the right of the pivot. This is the part which can be made non-blocking using the proposed asynchronous back-end.
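Here is a small Python stand-in for these two dimension-selection heuristics; the function names are hypothetical and not those used in the Cython+ implementation:

```python
import numpy as np

def choose_dim_round_robin(depth, n_features):
    """First heuristic: simply cycle over the dimensions according to the node depth."""
    return depth % n_features

def choose_dim_max_spread(data, indices, start, end):
    """Second heuristic (the one used by sklearn.neighbors.KDTree): pick the
    dimension with the largest spread over the node's subset of points."""
    points = data[indices[start:end]]
    spread = points.max(axis=0) - points.min(axis=0)
    return int(np.argmax(spread))
```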
If the reader is interested in the implementation, it is available online here. Note that Cython+ is still being developed and that the current implementation has some quirks due to the use of its experimental features, which are being improved over time[7].
Program design: A simple actor pattern for synchronisation
Frictions can exist between synchronous and asynchronous computations. This is especially the case here between the synchronous Python interface users interact with and its implementation, which runs asynchronously.
For instance, interfacing the asynchronous logic with a synchronous execution is done internally via a `Counter` which encapsulates the number of sorted indices updated by leaves (terminal `Node`s). This mechanism, which is similar to a semaphore, makes the main caller wait until the construction of the `KDTree` is entirely done.
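The following is a minimal, hypothetical pure-Python analogue of this `Counter` (the real one is a Cython+ actor): each leaf reports how many indices it has finalised, and the synchronous caller blocks until every sample has been placed.

```python
import threading

class Counter:
    """Semaphore-like latch: wait() returns once `n_samples` indices are sorted."""
    def __init__(self, n_samples):
        self._remaining = n_samples
        self._cond = threading.Condition()

    def add(self, n_sorted):
        # called by each leaf once its range of indices is final
        with self._cond:
            self._remaining -= n_sorted
            if self._remaining <= 0:
                self._cond.notify_all()

    def wait(self):
        # called by the synchronous KDTree constructor before returning to the user
        with self._cond:
            self._cond.wait_for(lambda: self._remaining <= 0)
```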
Comparison of implementations
The experiments made for this blog post are available online. The results presented here have been obtained on a machine with 40 cores and can easily be reproduced using the setup which is documented there.
This pattern, used via Cython+, can be made nearly optimal[8], making it faster than the current scikit-learn implementation (as of scikit-learn 1.0) by a factor of more than 10 when partitioning in a round-robin fashion.
It is also faster by a factor of approximately 2 when partitioning on the dimension with the maximum spread[9].
When using the round-robin choice of the dimension, the execution is memory-bound on the CPUs.
This case is interesting because it shows that the Cython+ runtime really has a small memory footprint and a fast execution. Indeed, a simple use of `perf(1)` shows that about 90% of the time spent constructing the tree is spent partitioning indices using the `std::nth_element` interface from the C++ standard library (which, in this configuration, uses an implementation of introspective selection[10] under the hood, effectively partitioning points in \(\mathcal{O}(n_\text{samples})\)):
```
Samples: 15K of event 'cycles:u', Event count (approx.): 11409982420
-  100.00% python
   -  92.91% kdtree.cpython-39-x86_64-linux-gnu.so
        88.45% [.] std::__introselect<int*, long, __gnu_cxx::__ops::_Iter…
         0.64% [.] CyLock::wlock
         0.50% [.] __pyx_t_7runtime_7runtime_Worker::get_queue
         0.34% [.] __pyx_t_7runtime_7runtime_BatchMailBox::activate
         0.21% [.] CyLock::unwlock
         0.21% [.] CyObject::CyObject_DECREF
         0.19% [.] __pyx_t_7runtime_7runtime_Scheduler::post_queue
```
Conclusion and further work
First, we think that the Cython+ runtime might be beneficial for bringing asynchronous parallel execution of tasks to Cython.
Secondly, this prototype allowed us to identify some patterns for synchronisation between synchronous and asynchronous execution contexts.
Finally, further work would involve using the type system of Cython+ to ensure proper reference safety at compile time (which is one of the proposals of Cython+), as well as comparisons with other `KDTree` implementations.
References
1. `sklearn.neighbors.KDTree` has been implemented with its querying methods by Jake VanderPlas and was later refined by other contributors.
2. Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept. 1975), 509-517. DOI: https://doi.org/10.1145/361002.361007
3. For more information about Cython+ design and features, you can watch Xavier Thompson's presentation at EuroPython 2021 and/or read his presentation's slides.
4. This search can be done efficiently using the `std::nth_element` interface from the C++ standard library.
5. See the different blog posts from Nexedi and this blog post.
6. Brown, Russell. (2014). Building a Balanced k-d Tree in \(\mathcal{O}(kn \log n)\) Time. arXiv:1410.5420
7. This implementation itself is being extended to perform queries.
8. In this case, one can choose a proper specification of the `KDTree` via `leaf_size` so as to use the optimal number of workers for the load. As of now, it is up to the caller to make this choice, but a heuristic could be introduced in the logic of the `KDTree` itself.
9. Full benchmark results for both cases are accessible online here.
10. David R. Musser. 1997. Introspective sorting and selection algorithms. Softw. Pract. Exper. 27, 8 (Aug. 1997), 983-993. DOI: https://doi.org/10.1002/(SICI)1097-024X(199708)27:8<983::AID-SPE117>3.0.CO;2-%23