Scientific python and sparse arrays (scipy summary + future directions)

jjerphan · August 4, 2022, 6:51am

Hi everyone,

First, thanks @ivirshup for starting discussions and setting the context on this topic.

Here is some input on scikit-learn's usage of sparse matrices.

scikit-learn and sparse arrays `tl;dr` : SciPy CSR matrices as the most common case, yet room for alternatives.

scikit-learn supports sparse matrices, with CSR being the most commonly used format.

A variety of scikit-learn estimators support SciPy sparse matrices through the Python API, and some algorithms (implemented in Cython) work directly on the CSR matrices’ `data`, `indices`, and `indptr` NumPy arrays via memoryviews. SciPy and scikit-learn use Cython to respectively provide and work with the matrices’ internals, which makes the work easier and the implementations efficient (for instance, see this set of utility functions). For those algorithms, the arrays need not originate from SciPy CSR matrices, but they currently do.
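To make the three arrays concrete, here is a minimal pure-Python sketch of a matrix-vector product that touches only `data`, `indices`, and `indptr`. The function name `csr_matvec` is made up for illustration and is not a scikit-learn or SciPy function; the actual Cython code differs, but the access pattern is the same idea.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small 3x4 matrix with 4 stored values.
dense = np.array(
    [
        [1.0, 0.0, 2.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 3.0, 0.0, 4.0],
    ]
)
X = csr_matrix(dense)


def csr_matvec(data, indices, indptr, v):
    """Compute X @ v from the raw CSR arrays (hypothetical helper)."""
    n_rows = indptr.shape[0] - 1
    out = np.zeros(n_rows, dtype=np.float64)
    for i in range(n_rows):
        # The stored values of row i are data[indptr[i]:indptr[i + 1]],
        # and indices[k] is the column of the k-th stored value.
        for k in range(indptr[i], indptr[i + 1]):
            out[i] += data[k] * v[indices[k]]
    return out


v = np.array([1.0, 2.0, 3.0, 4.0])
result = csr_matvec(X.data, X.indices, X.indptr, v)
```

Since the function only receives plain arrays, nothing in it depends on the arrays coming from a SciPy object, which is the point made above.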

We do not use or have any sparse linear operations other than this safe sparse dot product.

Using SciPy sparse matrices was (and in my opinion still is) the most natural choice: SciPy is one of the few dependencies of scikit-learn, users might use its sparse matrices for downstream and upstream tasks in their work, and potentially sub-optimal performance of the Python API is not a problem. Yet technically, up to a few changes, scikit-learn need not be restricted to SciPy and could in the future support other libraries’ implementations of sparse arrays. This support should nonetheless be discussed among scikit-learn maintainers.

A problem with SciPy CSR matrices: variant `indptr` dtype

The most significant problem that we currently face with SciPy is statically typing SciPy CSR matrices’ `indptr` array in Cython, which relates to previous discussions:

This is useful to know.

I think it would be worth having int64 be used by default for indptr but have the possibility to require the use of int32 for this array.

@rgommers: scikit-learn definitely had this need. Should I create an issue or PR for SciPy?
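As a rough illustration of what a fixed index dtype means for statically typed code, here is a sketch that hands out the raw CSR arrays with both index arrays cast to one requested dtype. The helper `csr_arrays_with_index_dtype` is hypothetical (it exists in neither SciPy nor scikit-learn), and it returns plain arrays rather than a new matrix on the assumption that the consumer is compiled code expecting a single dtype.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_arrays_with_index_dtype(X, index_dtype=np.int64):
    """Return (data, indices, indptr) with index arrays cast to index_dtype.

    Hypothetical helper for illustration only.
    """
    index_dtype = np.dtype(index_dtype)
    max_allowed = np.iinfo(index_dtype).max
    # indptr[-1] is the number of stored values and shape[1] - 1 is the
    # largest possible column index: both must fit in the target dtype.
    largest = max(int(X.indptr[-1]), X.shape[1] - 1)
    if largest > max_allowed:
        raise ValueError(
            f"index values up to {largest} do not fit in {index_dtype}"
        )
    indices = X.indices.astype(index_dtype, copy=False)
    indptr = X.indptr.astype(index_dtype, copy=False)
    return X.data, indices, indptr


X_small = csr_matrix(np.eye(3))
data, indices, indptr = csr_arrays_with_index_dtype(X_small, np.int64)
```

Requesting `np.int32` for a matrix that needs wider indices raises a `ValueError` instead of silently overflowing.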

For reference, some relevant issues reported for scikit-learn contributions and maintenance include:

github.com/scikit-learn/scikit-learn

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code

opened 02:45PM - 16 Jun 22 UTC

ogrisel

API Needs Decision cython

At the moment we do not have systematic support for very large sparse matrices in our Cython code. That would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values.

The purpose of this issue is to link:

- reference all related issues in scikit-learn.
- decide if we want to have some uniform support guarantees or not
- decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.

### Related issues and PRs (feel free to update this list):

For polynomial feature expansion (quite popular request):

- #16803
- #17554
- #19676

Other models with open issues:

- #11355
- #11356
- #18090
- #18403

Other Cython estimators that could also be updated:

- neighbors models (k-NN and radius-based models)
  - related issues not just about this problem: #23604
- k-means & variants
- Feature Hasher / Hashing Vectorizer (`sklearn/feature_extraction/_hashing_fast.pyx`)

The following PR will introduce a scikit-learn transformer that can output `int64` indexed sparse matrices (even if it's input is `int32` indexed).

- #23731

### Helpful Python snippet

SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed `.indices` and `.indptr` attributes:

```python
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
```

github.com/scipy/scipy

BUG: `sparse.hstack` returns incorrect result when the stack would result in indices too large for `np.int32`

opened 11:11PM - 08 Jul 22 UTC

closed 03:44PM - 27 Sep 22 UTC

Micky774

defect scipy.sparse

### Describe your issue.

Over in scikit-learn we ran into this bug in the pursuit of [this PR](https://github.com/scikit-learn/scikit-learn/pull/23731). Essentially when using `sparse.hstack` on a collection of sparse (csr) matrices whose `indices` arrays contain values no greater than the maximum for `np.int32` the operation produces incorrect results. I believe the problem is within `sparse._construct._stack_along_minor_axis`. In particular its `dtype` resolution for `indices` and `indptr` misses this edge case (and perhaps some others).

### Reproducing Code Example

```python
from scipy import sparse
import numpy as np

data = [1.0]
row = [0]

max_int32 = np.iinfo(np.int32).max
ind_1 = max_int32
ind_2 = 2

assert ind_1 + ind_2 - 1 > max_int32  # condition of failure
assert max(ind_1 - 1, ind_2 - 1) < max_int32  # condition of failure

col_1 = [ind_1 - 1]
col_2 = [ind_2 - 1]

X_1 = sparse.csr_matrix((data, (row, col_1)))
X_2 = sparse.csr_matrix((data, (row, col_2)))
Z = sparse.hstack([X_1, X_2], format="csr")
print(Z.indices)  # [65534 -2147450882]
assert Z.indices.max() == ind_1 + ind_2 - 1
```

### Error message

```shell
N/A
```

### SciPy/NumPy/Python version information

1.10.0.dev0+0.df3fe4e 1.24.0.dev0+449.g353fea031 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)
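The failure condition in the SciPy issue above can be restated as a small check: each block fits in int32 on its own, but the stacked result does not. The sketch below, with a made-up helper `hstack_index_dtype`, picks an index dtype from the block widths and stored-value counts. It is not how SciPy resolves dtypes internally, only an illustration of the arithmetic involved.

```python
import numpy as np


def hstack_index_dtype(n_cols_per_block, nnz_per_block):
    """Pick an index dtype wide enough for a horizontal stack.

    Hypothetical helper: takes the number of columns and the number of
    stored values of each block to be stacked side by side.
    """
    total_cols = sum(int(n) for n in n_cols_per_block)
    total_nnz = sum(int(n) for n in nnz_per_block)
    # The largest column index is total_cols - 1, and indptr must be able
    # to hold total_nnz.
    largest = max(total_cols - 1, total_nnz)
    if largest > np.iinfo(np.int32).max:
        return np.dtype(np.int64)
    return np.dtype(np.int32)


max_int32 = np.iinfo(np.int32).max
# Same shapes as in the issue: one block with max_int32 columns, one with 2.
dtype_large = hstack_index_dtype([max_int32, 2], [1, 1])
dtype_small = hstack_index_dtype([10, 20], [3, 4])
```

With the shapes from the issue, the stacked matrix has `max_int32 + 2` columns, so the check selects `int64` even though neither input needs it.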

Side note: tabmat, efficient matrix representations for working with tabular data

tabmat is a project that might be worth considering. In scikit-learn, some newly considered solver implementations have bottlenecks that fall under tabmat's use case.
