Julien Jerphanion

⚠ Disclaimer

This whole blog post is about some experimentations I made at INRIA and is hosted there temporarily.

It is work in progress and will be improved over time.

It assumes that the reader is knowledgeable about a few subjects, namely:

Python and CPython (its main implementation internals)
parallel computing programming model (such as OpenMP’s)

`tl;dr`

This exposes extentions made to Cython allow implementing new algorithms which are breakable and executable via dynamic tasks scheduling in this language.

First the context of parallel computing in Python is briefly presented. Then, an overview of the proposed experimental backend is given.

Finally, an example using this scheduler is presented as well as results showing the capacity of this experimental backend.

Context: the state of parallel computing in Python

Python and CPython — Python’s main implementation — were not initially designed to parallelize computations natively.

As such, computations are restricted in most case to sequential execution, mainly due to a global mutex on its interpreter: the Global Interpreter Lock as reffered to as the GIL¹.

Yet, several libraries and submodules allow parallelising computations at a higher level on a single machine, such as joblib².

Those solutions come in handy because:

their interfaces are intuitive and decouples the definition of tasks from their execution: for instance, joblib.Parallel decouples the definition of tasks from their execution by simplify specifying the backend keyword
the expressiveness is high: a couple of instructions allows parallelising computations on a machine CPUs, for instance, in the case of joblib.Parallel complex computations can be defined in a couple of lines

Yet, those solutions suffers from several limitations:

they offer limited control on the way effective computations are defined and dispatched on CPUs³
they make the execution of program memory-bound because it is tight to the cost of the internals of CPython interpreter and PyObject

Still a few projects allow implementing lower-level parallelism, removing those restructions. One of such projects is Cython.

Cython and low-level parallelism

In brief, Cython allows transpilling a superset of Python to C code and using code which was written in C or C++, which makes bypassing the some of CPython’s internals possible. Moreover, Cython allows using OpenMP, an API which allows using lower-level parallelism primitives for implementation written in C or Fortran⁴.

In most cases, features provided by Cython are sufficient enough to reach optimal implementations for many scientific algorithms for which a static scheduling of tasks — at the level of C via OpenMP — is the most natural and optimal one. Plus, its syntax makes this language expressive enough to get nearly optimal performances while keeping the instructions short and concise — which is a real advantage for developers coming from Python which are looking for performance and a relief for C and C++ developers.

As such and as an example, many algorithms in scikit-learn are implemented in Cython for performance reasons, some of which using OpenMP when possible. This is for instance the case of KMeans which was initially written in Python using numpy and which was rewritten in Cython by Jérémie du Boisberranger, allowing getting up to ×5 faster execution for this algorithm⁵.

Some recent work also showed that using Cython allows removing all the barrier of CPython and getting up to ×20 speed-ups on computational primitives of scikit-learn⁶.

Still in some cases, some algorithms can better be executed via dynamic and asynchronous task scheduling, yet Cython does not provide constructs nor extensions allowing to do so⁷.

Hence the question: Could it be possible to build a simple scheduling system for dynamic and asynchronous task scheduling for Cython?

An experimental scheduler for dynamic and asynchronous task scheduling

Experimental extensions of Cython are being developped at Nexedi for asynchronous and dynamic task scheduling.

The core of one of those extensions is a simple scheduler built on top of Actor abstractions. To overly simplify decades of research in a few words: Actors have been introduced in programming languages as to enforce encapsulation in a concurrent and asynchronuous execution context⁸: from an object-oriented perspective, Actors can be seen as objects with an extra caching mecanism storing calls to their methods. This caching mecanism is generally implemented using messages (which materializes methods’ calls) and mailboxes (which stores those messages).

Hence, this abstraction allows bringing flexibility by making the tasks execution agnostic from their execution on hardware: programs are written using Actors which interact with one another unknowingly from the way effective CPUs instructions are called.

Bellow this programming model, a \(M-N\) scheduler can be used: this scheduler maps a varying number of messages \(M\) (calls to Actors‘ methods) to a given (constant) number \(N\) of Workers executing in their own thread.

Messages are getting stacked onto queues which are themselves encapsulated in a FIFO datastructure in Workers. In layman’s diagram:

Cython+ scheduler -- overview

Part of the efficiency of Cython+ scheduler is due to its scheduling algorithm à la work stealing algorithm⁹: when Workers do not have queues of tasks to execute, they just steal queues in some other Workers‘ FIFO. In layman’s diagram¹⁰:

Cython+ scheduler -- work stealing

In practice, such simple scheduling algorithm is efficient enough to be used in modern languages’ scheduler, such as Go’s¹¹.

Scanning the file system

WIP: ⚠ this whole section which illustrate our point with an example is under construction

Conclusion and further work

As of now, this simple backend for the scheduler allows getting somewhat good performances.

Yet, this backend shows some limitations: tasks can run endlessly and can thus block a whole program execution. Might one way to solve this issue, be to implement a context switching mecanism on top of the scheduling?

References

For more information about the GIL, see this reference from the Python Wiki. ↩
For more information, a global overview of such libraries and submodules are given in the Python wiki. ↩
Writting kernels in Python is possible thanks to some recent projects such as numba and numba-dppy but they might not provide full and exact control especially for types attributes for memory alignment. On this affirmation, I might be wrong and please do change my mind if this is the case. ↩
For more information on Cython, see its documentation. ↩
For more information about KMeans, see the original contribution, scikit-learn#11950, and this blog post. ↩
For more information about this work, refer to scikit-learn#22134. ↩
It might eventually be possible to use frameworks or libraries written in C and C++ — such as Intel’s oneTBB — via Cython, but this would come with both more dependencies to support and unneededly complicated interfaces’ adaptations which might introduce high maintenance costs. ↩
Gul Agha. 1986. Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge, MA, USA. https://dl.acm.org/doi/book/10.5555/7929 ↩
Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (Sept. 1999), 720–748. DOI: https://doi.org/10.1145/324133.324234 ↩
I beg the readers’ pardon for my drawing skills. Those are temporary drawings which will be improved over time. ↩
For more information on such a scheduling algorithm, see this blog post on Go’s work-stealing scheduler from Jaana Dogan. ↩

A backend for a dynamic and asynchronous tasks scheduling for Cython

Towards expanding parallel computing programming model in Cython.