Scientific python and sparse arrays (scipy summary + future directions)

Hi everyone,

First, thanks @ivirshup for starting discussions and setting the context on this topic.

Here are a few inputs on scikit-learn usage of sparse matrices.

scikit-learn and sparse arrays tl;dr: SciPy CSR matrices are the most common case, yet there is room for alternatives.

scikit-learn supports sparse matrices, and CSR is the format it uses most often.

A variety of scikit-learn estimators support SciPy sparse matrices through the Python API. Some algorithms, implemented in Cython, work directly on the CSR matrices’ data, indices, and indptr NumPy arrays via memoryviews. SciPy and scikit-learn use Cython to respectively provide and work with the matrices’ internals, which makes the work easier and the implementations efficient (for instance, see this set of utility functions). For those algorithms, the arrays need not originate from SciPy CSR matrices, but they currently do.
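To make the "working on the internals" part concrete, here is a minimal pure-Python sketch of the access pattern such code follows. It is not taken from scikit-learn: the function name `csr_matvec` and the toy matrix are made up for illustration, and a real implementation would be a typed Cython loop over memoryviews.

```python
import numpy as np
import scipy.sparse as sp


def csr_matvec(data, indices, indptr, v):
    """Multiply a CSR matrix, given as its three raw arrays, by a dense vector.

    Hypothetical helper for illustration only: it never touches a SciPy
    object, which is why the arrays could come from any CSR provider.
    """
    n_rows = indptr.shape[0] - 1
    out = np.zeros(n_rows, dtype=np.result_type(data, v))
    for i in range(n_rows):
        # Non-zeros of row i are stored in data[indptr[i]:indptr[i + 1]],
        # and their column positions in the matching slice of indices.
        for k in range(indptr[i], indptr[i + 1]):
            out[i] += data[k] * v[indices[k]]
    return out


X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 0.0, 3.0],
                            [4.0, 5.0, 0.0]]))
v = np.array([1.0, 2.0, 3.0])

result = csr_matvec(X.data, X.indices, X.indptr, v)
```

Since the function only needs the three arrays, nothing in it depends on SciPy itself; SciPy is used here only to build the example input.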

We do not use or provide any sparse linear operations apart from this safe sparse dot product.

Using SciPy sparse matrices was (and in my opinion still is) the most natural choice: SciPy is one of the few dependencies of scikit-learn, users might use its sparse matrices for upstream and downstream tasks in their work, and potentially sub-optimal performance of the Python API is not a problem. Yet technically, and up to a few changes, scikit-learn need not stay restricted to SciPy in the future and could support other libraries’ implementations of sparse arrays. Such support should nonetheless be discussed among scikit-learn maintainers.

A problem with SciPy CSR matrices: variant indptr dtype

The most significant problem that we currently face with SciPy is statically typing SciPy CSR matrices’ indptr array in Cython, since its dtype can be int32 or int64 depending on the matrix. This relates to previous discussions:

This is useful to know. :+1:

I think it would be worth using int64 by default for indptr, while keeping the possibility to require int32 for this array.

@rgommers: scikit-learn definitely had this need. Should I create an issue or PR for SciPy?
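To illustrate the dtype issue described above, here is a rough sketch of the kind of workaround a consumer can apply today: inspect the index dtype SciPy picked and cast the index arrays to one fixed dtype before handing them to statically typed code. The helper name `csr_arrays_with_index_dtype` is hypothetical and is not a SciPy or scikit-learn function.

```python
import numpy as np
import scipy.sparse as sp


def csr_arrays_with_index_dtype(X, index_dtype=np.int64):
    """Return (data, indices, indptr) with both index arrays cast to index_dtype.

    Hypothetical helper for illustration; it leaves X itself untouched.
    """
    X = sp.csr_matrix(X)
    index_dtype = np.dtype(index_dtype)
    # indptr holds values up to nnz and indices up to n_cols - 1, so both
    # must fit in the requested dtype before casting (matters for int32).
    if max(X.nnz, X.shape[1]) > np.iinfo(index_dtype).max:
        raise ValueError(f"index values do not fit in {index_dtype}")
    indices = X.indices.astype(index_dtype, copy=False)
    indptr = X.indptr.astype(index_dtype, copy=False)
    return X.data, indices, indptr


X_small = sp.csr_matrix(np.eye(3))
# The dtype chosen by SciPy depends on the matrix, hence the typing problem.
original_indptr_dtype = X_small.indptr.dtype

data, indices, indptr = csr_arrays_with_index_dtype(X_small, np.int64)
```

The cast costs a copy whenever the dtype differs, which is part of why a way to request a given index dtype at construction time would be preferable to this kind of workaround.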

For reference, some relevant issues reported for scikit-learn contributions and maintenance include:

Side note: tabmat, efficient matrix representations for working with tabular data

tabmat is a project that might be worth considering. In scikit-learn, some newly considered solver implementations have bottlenecks that fall under tabmat’s use case.

