Performances and scikit-learn: Context

The current state of scikit-learn performances

Published on the: 13.01.2022
Last modified on the: 16.01.2022
Estimated reading time: ~ 4 min.

High-level overview of the scikit-learn stack

scikit-learn is built on top the scientific Python ecosystem.

Its most notable dependences are:

  • numpy, for generic operations on arrays
  • scipy, for generic operations in linear algebra and others
  • joblib, for a back-end agnostic high-level parallelism

This comes in handy because, because this allows having high expressiveness in a few lines.

It also allows deleguating most of the complex operations to well tested third party libraries. For instance, efficient linear algebra routines of car deleguated to BLAS, LAPACK, ARPACK via numpy.linalg, scipy.linalg, and scipy.sparse.linalg.

This simple stack allows having most of the implementations of scikit-learn concise.

Moreover, this simple stack also allowed setting up simple conventions which making the code-base accessible for new contributors.

The main reasons for limited performances

This stack is simple but is not tailored for optimal performance for several reasons.

CPython internals

CPython, the main implementation of Python does use a global mutex, called the Global Interpreter Lock. This just blocks the execution of a program to use a single thread at a time.

Memory-hierarchy suboptimal implementations

numpy supports high-level operations but this comes with intermediate and dynamically-allocated temporary arrays.

Moreover, this pattern is inefficient from a memory perspective: during the execution, blocks of data are moved back and forth between the RAM and the different CPU caches several time, not making use of the caches.

For instance, based on this minimalistic toy example:

import numpy as np

A = np.random.rand(100)
B = np.random.rand(100)
C = np.random.rand(100)

X = A + B + C

The following is performed:

  • a first temporary array is allocated for A + B
  • a second array is allocated to store A + B + C and the first temporary array is discarded

This is costly because memory allocation on the heap is slow and blocks at the level of the operating system.

Moreover, with high level operations on X:

  • comes with data moves between the RAM and the CPU than needed to compute the elements of X1
  • hardly make use of the memory hierarchy and the size of the caches

No simple datastructures

The Python ecosystem comes with a variety of high-level containers such as numpy arrays, and pandas DataFrames.

No low-level datastructures can be used to implement some algorithms efficiently from a theoretical and a technical perspectives.

Improving performances: the plan

General plan to improve the performances of a library:

  • Measuring speed-ups and profiling implementations
  • Identifying cases that are worth improving
  • Understanding what’s wrong with those cases
  • Proposing a new implementation for those cases

This plan can be performed iteratively.

Measuring speed-ups and profiling

Mathis Batoul worked on a setting up benchmarks to compare scikit-learn current implementations to other libraries in the ecosystem, such as XGBoost and CatBoost. This setup also embeds some results of profiling.

The results of those benchmarks are accessible online here: https://mbatoul.github.io/sklearn_benchmarks/

From those benchmarks, a few notable implementations can be improved, especially:

  • LogisticRegression
  • LinearRegression
  • KNearestNeighborsClassifier
  • KNearestNeighborsRegressor

In what follows, we are going to focus on \(k\)-nearest neighbors search used as a base routine for KNearestNeighborsClassifier and KNearestNeighborsRegressor.


Notes

  1. In this example, there is exactly 6 moves instead of 4.