At this year’s fantastic scipy conference I had the opportunity to meet a number of other developers in our community who are interested in the future of sparse arrays in python.
Associated projects included:
I’m creating this thread to start a discussion about the state of sparse arrays in the scientific python ecosystem. I would like to find out what people are working on, what our common needs are, and if we can figure out how we can collaborate on a path forward. I would love for other members of the community to chime in on this thread about their experiences and work with sparse data.
First I’ll summarize a meet up a few of us had, then will suggest some possible future directions.
Via the conference slack, we had a small meet up of some maintainers interested in sparse data. Projects represented include
- python-graphblas represented by @jim22k (working at anaconda)
- QuTip represented by @hodgestar (working at Riken) and @Ericgig (working at Sherbrooke)
- scverse (the anndata package specifically) represented by myself (@ivirshup)
We first talked a bit about how each of us are working with sparse data.
QuTip has recently included Cython implementations of sparse arrays. They’ve done this because they found the scipy implementations too slow. Particular issues they ran into were:
- Sparse addition doing unnecessary copies
- Initialisation and complex types
- There are lots of checks making things slow
- Casting around complex types
- Downcasting indices/ indptr
- E.g. they want 64 bit indptr, but scipy aggressively downcasts these to int32
- Checking that the values are within int32 range, and making copies takes an unnecessary amount of time
The main use cases for sparse data were Linear algebra and ODE related
python-graphblas is a new python api for the very cool suitesparse/ GraphBLAS library. This provides high performance graph operations via sparse adjacency matrix representations of graphs. It allows users to apply arbitrary semi-rings which is a an extremely powerful abstraction for sparse matrix operations in general. I would highly recommend checking out some of the literature on GraphBLAS – like:
- Mathematical foundations of the GraphBLAS for a short introduction
- SuiteSparse:GraphBLAS: Graph Algorithms in the Language of Sparse Linear Algebra for something longer
Some key features of this library:
- “sparse values” that aren’t zero
- suite sparse – sparse representation
- Masked arrays
- COO, CSR, and hyper-sparse
- MLIR graph blas
- Not Jit
- Matrices must be 64 bit
scverse is a consortium of python tools for working with single cell high-dimensional omics data. We need representations for sparse data because:
- Much of our data is sparse (e.g. .5-2k out of 20k total non-zero features per cell for up to millions of cells)
- Nearest neighbor and spatial graphs are very useful for our data. We represent these as sparse adjacency matrices.
We pass around data between analysis packages using the AnnData object and it’s on disk format. The AnnData object itself is like a special case of an Xarray dataset, but with two required dimensions (observations and variables), the ability to hold scipy sparse matrices and DataFrames, and support for adjacency matrices.
One of the key features of the AnnData object is it’s on-disk representation (via hdf5 or zarr) which allows partial access to sparse arrays.
Some common pain points that came up during discussion:
- None of us like how scipy sparse tries to downcast everything aggressively
- AnnData is starting to need 64bit
indptr, which can quickly lead to dtype errors when interacting with arrays that have been down casted
- This blocks zero-copy interaction with python-graphblas and the QuTip libraries
- AnnData is starting to need 64bit
- No common standards for this
- Varying access patterns – one could make highly specialized formats
- Possible solutions:
- Trying AnnData’s on disk representation (docs)
- tileDB based storage
- So far more interest than need from QuTip and graphblas
- Some tools:
It was really great meeting with so many maintainers who are working with sparse arrays in python. Even though we are are working on similar data structures, few of us new of the sparse-specific work that others were doing.
To me, this suggests we need increased communication and visibility here. I think creation of the #sparse-arrays tag on this forum and the suggested sparse array summit are great starting points.
One of the most critical features of any data structure is ecosystem support. How can we come together in supporting a fully featured sparse array library?
I think this will need input from maintainers of the large libraries that currently work with sparse arrays including scikit-learn, xarray, and scipy. In particular I would love to hear their current thoughts on what’s happening with pydata/sparse and the new scipy sparse arrays.
Clearly there is need for a common sparse array representation – as demonstrated by
scipy.sparses continued usage. What else do we need to share here?
- More formats? This could include optimized n-dimensional arrays and hypersparse formats
- More extensive linear algebra operations?
- Dask (or other out-of-core) support? In particular, do people need specialized chunking strategies?
I would love to hear more input on this topic from the rest of the scientific python community. This could include more pain points, cool new efforts, or just interest in future conversations.