Scientific python and sparse arrays (scipy summary + future directions)

At this year’s fantastic scipy conference I had the opportunity to meet a number of other developers in our community who are interested in the future of sparse arrays in python.

Associated projects included: networkx, open2c, pangeo, python-graphblas, QuTiP, and scverse.

I’m creating this thread to start a discussion about the state of sparse arrays in the scientific python ecosystem. I would like to find out what people are working on, what our common needs are, and if we can figure out how we can collaborate on a path forward. I would love for other members of the community to chime in on this thread about their experiences and work with sparse data.

First I’ll summarize a meet up a few of us had, then I’ll suggest some possible future directions.

Sparse data informal meet up

original notes

Via the conference slack, we had a small meet up of some maintainers interested in sparse data. The projects represented are described below.

We first talked a bit about how each of us is working with sparse data.

The projects

QuTiP

QuTiP has recently added its own Cython implementations of sparse arrays, because they found the scipy implementations too slow. Particular issues they ran into were:

  • Sparse addition doing unnecessary copies
  • Initialisation and complex types
    • There are lots of checks that make things slow
    • Casting around complex types
  • Downcasting of indices/indptr
    • E.g. they want 64-bit indptr, but scipy aggressively downcasts these to int32
    • Checking that the values are within int32 range and making copies takes an unnecessary amount of time

The main use cases for sparse data were linear algebra and ODE solving.

python-graphblas

python-graphblas is a new Python API for the very cool SuiteSparse:GraphBLAS library. It provides high-performance graph operations via sparse adjacency matrix representations of graphs, and it allows users to apply arbitrary semirings, which is an extremely powerful abstraction for sparse matrix operations in general. I would highly recommend checking out some of the literature on GraphBLAS.
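To give a flavor of why semirings are such a powerful abstraction, here is a conceptual sketch in plain NumPy (deliberately not the python-graphblas API): replacing the usual (+, ×) of matrix multiplication with (min, +) turns the very same operation into a shortest-path computation.

```python
import numpy as np

# Adjacency matrix of a tiny weighted graph; np.inf marks "no edge".
A = np.array([[0.0,    1.0,    np.inf],
              [np.inf, 0.0,    2.0],
              [np.inf, np.inf, 0.0]])

def min_plus(A, B):
    # "Matrix multiplication" over the (min, +) semiring:
    # C[i, j] = min_k (A[i, k] + B[k, j])
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

two_hop = min_plus(A, A)   # shortest path costs using at most two edges
print(two_hop[0, 2])       # 3.0, via the intermediate node 1
```

python-graphblas exposes this idea over sparse matrices, so graph algorithms become short sequences of semiring matrix operations.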

Some key features of this library:

  • “Sparse” (implicit) values that aren’t necessarily zero
  • SuiteSparse sparse representations
    • Masked arrays
    • COO, CSR, and hyper-sparse formats
    • MLIR GraphBLAS
  • Not JIT-compiled
  • Matrices must be 64-bit

anndata / scverse

scverse is a consortium of python tools for working with single cell high-dimensional omics data. We need representations for sparse data because:

  • Much of our data is sparse (e.g. 0.5-2k non-zero features out of ~20k total per cell, for up to millions of cells)
  • Nearest neighbor and spatial graphs are very useful for our data. We represent these as sparse adjacency matrices.

We pass around data between analysis packages using the AnnData object and its on-disk format. The AnnData object itself is like a special case of an Xarray dataset, but with two required dimensions (observations and variables), the ability to hold scipy sparse matrices and DataFrames, and support for adjacency matrices.

One of the key features of the AnnData object is its on-disk representation (via HDF5 or Zarr), which allows partial access to sparse arrays.
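To make that concrete, here is a simplified sketch of the general idea behind the format: store the dense component arrays of a CSR matrix in a group, plus shape metadata (the real spec also records encoding type and version attributes; see the AnnData on-disk format docs for details).

```python
import h5py
import scipy.sparse as sp

def write_csr(group: h5py.Group, mat: sp.csr_matrix) -> None:
    # Store the three CSR component arrays as ordinary dense datasets.
    group.create_dataset("data", data=mat.data)
    group.create_dataset("indices", data=mat.indices)
    group.create_dataset("indptr", data=mat.indptr)
    group.attrs["shape"] = mat.shape

def read_csr(group: h5py.Group) -> sp.csr_matrix:
    # Reconstruct the matrix; partial reads would slice indptr/indices/data.
    return sp.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]),
        shape=tuple(group.attrs["shape"]),
    )

with h5py.File("example.h5", "w") as f:
    write_csr(f.create_group("X"), sp.random(100, 50, density=0.1, format="csr"))
with h5py.File("example.h5", "r") as f:
    X = read_csr(f["X"])
```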

Common pain points / possible solutions

Some common pain points that came up during discussion:

Scipy sparse casting

  • None of us likes how scipy sparse tries to downcast everything aggressively (a small illustration follows this list)
    • AnnData is starting to need 64-bit indptr, which can quickly lead to dtype errors when interacting with arrays that have been downcast
    • This blocks zero-copy interaction with python-graphblas and the QuTiP libraries
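A small illustration of the behavior in question (exact details vary by SciPy version): constructing a CSR matrix from 64-bit index arrays whose values happen to fit in int32 triggers a content check and a downcasting copy.

```python
import numpy as np
import scipy.sparse as sp

indptr = np.array([0, 1, 2], dtype=np.int64)
indices = np.array([0, 1], dtype=np.int64)
data = np.array([1.0, 2.0])

# SciPy checks whether the index values fit in int32 and, if they do,
# copies them down to int32, so the 64-bit arrays passed in are no
# longer shared with the matrix (blocking zero-copy interchange).
mat = sp.csr_matrix((data, indices, indptr), shape=(2, 2))
print(mat.indptr.dtype)      # typically int32
print(mat.indptr is indptr)  # False, a copy was made
```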

On-disk representations for sparse arrays

  • No common standards for this
  • Varying access patterns – one could make highly specialized formats
  • Possible solutions:
    • Trying AnnData’s on-disk representation (docs)
    • TileDB-based storage

Out of core

  • So far more interest than need from QuTiP and graphblas
  • Some tools:
    • Dask GraphBLAS from @eriknw
    • Vitessce performs out-of-core access to AnnData-format sparse matrices in Zarr stores for browser-based viz
    • anndata-rs is a Rust-based reader and writer for HDF5-based AnnData files with increased out-of-core functionality

Future directions / open questions

Communication

It was really great meeting with so many maintainers who are working with sparse arrays in python. Even though we are working on similar data structures, few of us knew of the sparse-specific work that others were doing.

To me, this suggests we need increased communication and visibility here. I think creation of the sparse-arrays tag on this forum and the suggested sparse array summit are great starting points.

https://discuss.scientific-python.org/t/sparse-array-summit/365

Standardization/ interoperability

One of the most critical features of any data structure is ecosystem support. How can we come together in supporting a fully featured sparse array library?

I think this will need input from maintainers of the large libraries that currently work with sparse arrays including scikit-learn, xarray, and scipy. In particular I would love to hear their current thoughts on what’s happening with pydata/sparse and the new scipy sparse arrays.

How much can be shared? What is reasonable to collaborate on?

Clearly there is a need for a common sparse array representation – as demonstrated by scipy.sparse’s continued usage. What else do we need to share here?

  • More formats? This could include optimized n-dimensional arrays and hypersparse formats
  • More extensive linear algebra operations?
  • Dask (or other out-of-core) support? In particular, do people need specialized chunking strategies?

I would love to hear more input on this topic from the rest of the scientific python community. This could include more pain points, cool new efforts, or just interest in future conversations.


@ivirshup Thank you for getting the tag set up, driving the discussion and taking comprehensive notes.

I’m also keen to hear the thoughts of those involved with scipy.sparse. Where do they see it going next?

Internally in QuTiP we’ve found it very useful to support a low-level gather pattern where a collection of smaller sparse matrices or individual values are “gathered” while performing an operation that will return a new sparse matrix, and then “reduced” into a new sparse matrix at the end. Efficient reduction can be done when one can make some guarantees about where the collection of elements or matrices are.

I wonder if designing and exposing such a low-level operation would be of interest to others, especially for the case of performing more complex operations on distributed sparse matrices. Maybe there are other libraries that already provide good abstractions for this?

@ivirshup Thanks for getting this thread started.

I really hope this effort can lead to a common storage and interchange format for different sparse libraries. Matrix Market feels like the “CSV” of sparse arrays – human readable, functional, stable, and very slow. The AnnData format looks like a good start. I also have a proof-of-concept library called sscdf doing something similar with netCDF4+HDF5 to store metadata and underlying dense arrays for CSR/CSC, plus the other formats that SuiteSparse:GraphBLAS supports.

I’m also intrigued by ASDF which was shown at SciPy. It has support for metadata and dense arrays, so it could be a lighter-weight alternative to HDF5.

Whatever the container chosen, the basic idea of storing the dense component arrays along with metadata describing how to reconstruct the sparse array feels like something we could discuss and agree on.

Indeed, thanks @ivirshup for starting the thread.

This is obviously a huge topic with many dimensions (ha) and I agree very much with the various points raised above. I just wanted to add another component that I think would represent a major improvement on the usability/adoption front: data structures with an ndarray interface instead of a matrix one. I don’t want to belabor the point here, but just to highlight that I think this is critically important for users.

There has already been some work on this front both within scipy (see e.g. scipy/scipy#14822) and in other projects like pydata-sparse. I just wanted to draw attention to this topic in the thread!
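For anyone who hasn’t run into the distinction, a small sketch (assuming a SciPy recent enough to ship csr_array alongside the legacy csr_matrix):

```python
import scipy.sparse as sp

A = sp.random(2, 3, density=0.5, format="csr")  # legacy csr_matrix interface
B = sp.csr_array(A)                             # new array interface

B * B       # elementwise product, matching ndarray semantics
B @ B.T     # matrix product is spelled explicitly with @
A @ A.T     # works with both interfaces
# A * A     # with the matrix interface `*` means matmul, so this raises
```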

Thanks for the great write-up @ivirshup. I think from SciPy’s perspective the data structure has some challenges, and if a better replacement gets created, that can be used or incorporated. A key thing to plan for there, I think, is how such a new data structure connects to scipy.sparse.linalg and scipy.sparse.csgraph; that code would be a lot harder to replace. PyData Sparse array instances are understood and supported, via conversion to the native SciPy format, since ENH: sparse/linalg: support pydata/sparse matrices by pv · Pull Request #10901 · scipy/scipy · GitHub. A new format could be supported like that. Or a true replacement could be tried - but that’s a lot more effort of course.

This does not seem like something that would be hard to support in scipy.sparse. I think the history there is that the data structure was 32-bit only first, and extended to 64-bit in a fully backwards-compatible way. I don’t remember a discussion about either doing a deprecation cycle or opting in to 64-bit only behavior. It’d be great if someone with this need could propose that.


Thanks for chiming in everyone!

@hodgestar

Internally in QuTiP we’ve found it very useful to support a low-level gather pattern where a collection of smaller sparse matrices or individual values are “gathered” while performing an operation that will return a new sparse matrix, and then “reduced” into a new sparse matrix at the end.

Just to make sure I’m understanding this right, this is like a lazy API for sparse linear algebra?

If so, I think this would be super cool. I think this was a potential goal for pydata/sparse through TACO bindings.

@jim22k

The AnnData format looks like a good start. I also have a proof-of-concept library called sscdf doing something similar with netCDF4+HDF5 to store metadata and underlying dense arrays for CSR/CSC, plus the other formats that SuiteSparse:GraphBLAS supports.

It would be nice to move beyond Matrix Market. @eriknw seems to be in discussion with the GraphBLAS team about this.

As I mentioned to you and Erik at scipy, I have the sense people often come up with complicated on-disk sparse formats which are optimized for specific classes of computation, e.g. TileDB and its density-based chunking. I’m curious to see if we can come up with something that’s focused on interchange (so it’s fairly simple), but can be augmented for more specific use cases.

I’m also intrigued by ASDF

I am interested in the schemas from ASDF. However, I’m not sure how widespread implementations are. For AnnData we primarily suggest HDF5 because most languages can read it. Zarr also seems to be gaining adoption (netCDF can use Zarr, for example).

I believe there was a meeting between the zarr and ASDF teams at scipy. I’ve pinged the zarr team about outcomes of this meeting and will get back on this.

Do you think it’d be worth starting a new topic specifically for discussion of on disk formats? @nvictus may have some thoughts on this as well.

@rossbar

That is absolutely on the top of my mind! I would actually really love to hear if there is any consensus between scipy.sparse, pydata/sparse, and downstream libraries like scikit-learn on where we can focus efforts!

I was more involved in this conversation on the pydata/sparse side about a year ago, but then left my PhD for an RSE position on the other side of the planet so fell out of the loop :sweat_smile:.

@rgommers

Thanks for the response! It’s really nice to hear from the scipy team here!

A key thing to plan for there I think is how such a new data structure connects to scipy.sparse.linalg and scipy.sparse.csgraph

I’m definitely a big fan of “not re-implementing”, or at least “implementing as little as possible”. It would make a lot of sense to me if these libraries were re-used, at least for the more standard sparse formats.

This does not seem like something that would be hard to support in scipy.sparse.

Last I was up to date on these conversations, it seemed to me like the eventual goal was for pydata/sparse to replace scipy.sparse (or at least the classes). If that were still the goal, I think there’s already a solution to this because pydata/sparse matrices are effectively parametric on the index dtype as well as the value dtype.

I had thought a quick path to replacing the csr_matrix/csc_matrix would be to just call out to their methods from sparse.CSR and sparse.CSC classes. I’m a little confused about the path forward now that there are coo_arrays.
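Roughly the kind of thin delegation I mean (class and method names here are purely illustrative, not an existing API):

```python
import scipy.sparse as sp

class CSR:
    """Hypothetical wrapper holding the CSR component arrays and
    delegating the heavy lifting to scipy.sparse.csr_matrix."""

    def __init__(self, data, indices, indptr, shape):
        self.data, self.indices, self.indptr = data, indices, indptr
        self.shape = shape

    def to_scipy(self):
        # A real implementation would need to sidestep SciPy's index downcasting.
        return sp.csr_matrix(
            (self.data, self.indices, self.indptr), shape=self.shape
        )

    def __matmul__(self, other):
        out = self.to_scipy() @ other.to_scipy()
        return CSR(out.data, out.indices, out.indptr, out.shape)
```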

Could you offer a perspective here? Do you think there’s renewed appetite (maybe even funding) for improving the scipy.sparse implementations? One implementation point here: I’m personally much more capable of writing numba than C or Cython.

Hello! Maintainer of PyData/Sparse checking in. I’m working with a GSoC student to incorporate TACO as a full implementation for sparse arrays. I have read the needs of the various parties above, and TACO (or a framework based on it) seems to cover everything. I’m aware the TACO maintainers are looking for use-cases and believe they would be very interested in working with us.

They are also thinking along the same dimensions as those presented here. The biggest concern is the need for a compiler at run-time – I’m not sure how much of a blocker that would be for folks here. Of course, one could always pre-generate kernels and then not need a compiler.

I hadn’t thought of it as a lazy API, but yes, I suppose it is. I’m glad it sounds cool to you too!

I’m not a TACO expert – would you mind explaining how TACO would support this? I see it can combine multiple operations into one step, but what I’m imagining is a bit different perhaps.

For example, when performing matrix multiplication, one has to multiply elements together pairwise and then sum up those that contribute to the same entry in the new matrix. An efficient way to do this is to keep track of all the elements and then to sum up the entries at the end while creating the new sparse matrix.

Another example: when constructing a very large sparse matrix that consists of many small, dense blocks, it can help significantly (e.g. O(N**2) vs O(N)) to generate all the blocks as small blocks, and then combine them into one sparse matrix only at the end.

So it’s more of a way to implement new operations than to combine old ones (although there are definite similarities).
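To make the block-construction case concrete, here’s a rough sketch with plain SciPy: gather COO triplets from all the blocks, then reduce them into the sparse matrix once at the end (names are illustrative, not QuTiP’s internal API):

```python
import numpy as np
import scipy.sparse as sp

def assemble(blocks, offsets, shape):
    # Gather: collect COO triplets from each small dense block.
    rows, cols, vals = [], [], []
    for block, (r0, c0) in zip(blocks, offsets):
        r, c = np.nonzero(block)
        rows.append(r + r0)
        cols.append(c + c0)
        vals.append(block[r, c])
    # Reduce: build the big sparse matrix exactly once at the end.
    return sp.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=shape,
    ).tocsr()

big = assemble(
    blocks=[np.ones((2, 2)), 2 * np.ones((2, 2))],
    offsets=[(0, 0), (2, 2)],
    shape=(4, 4),
)
```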

Hi everyone,

First, thanks @ivirshup for starting discussions and setting the context on this topic.

Here are a few inputs on scikit-learn usage of sparse matrices.

scikit-learn and sparse arrays, tl;dr: SciPy CSR matrices are the most common case, yet there is room for alternatives.

In scikit-learn, sparse matrices are supported, with CSR being the most commonly used kind of sparse matrix.

A variety of scikit-learn Estimators support SciPy sparse matrices via the Python API, and some algorithms (implemented in Cython) work on the CSR matrices’ data, indices, and indptr numpy arrays directly via memoryviews: SciPy and scikit-learn using Cython to respectively expose and consume the matrices’ internals allows both ease of development and efficient implementations (for instance, see this set of utility functions). In the case of those algorithms, the arrays need not originate from SciPy CSR matrices, but they currently do.
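A pure-Python sketch of that access pattern, working on the same three arrays that the Cython routines take as typed memoryviews:

```python
import numpy as np
import scipy.sparse as sp

def row_sums(X: sp.csr_matrix) -> np.ndarray:
    # Operate directly on the CSR component arrays: for row i, its
    # non-zero values live in data[indptr[i]:indptr[i + 1]].
    out = np.zeros(X.shape[0], dtype=X.data.dtype)
    for i in range(X.shape[0]):
        out[i] = X.data[X.indptr[i]:X.indptr[i + 1]].sum()
    return out

row_sums(sp.random(5, 4, density=0.3, format="csr"))
```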

We do not use or implement any sparse linear algebra operations apart from this safe sparse dot product.

Using SciPy sparse matrices was (and in my opinion still is) the most natural choice: SciPy is one of the few dependencies of scikit-learn, users might use its sparse matrices for downstream and upstream tasks in their work, and potentially sub-optimal performance of the Python API is not a problem. Yet technically, with a few changes, scikit-learn need not be restricted to SciPy in the future and could support other libraries’ implementations of sparse arrays; this support should nonetheless be discussed among scikit-learn maintainers.

A problem with SciPy CSR matrices: varying indptr dtype

The most significant problem that we currently face with SciPy is statically typing SciPy CSR matrices’ indptr array in Cython, which relates to previous discussions:

This is useful to know. :+1:

I think it would be worth having int64 be used by default for indptr but have the possibility to require the use of int32 for this array.

@rgommers: scikit-learn definitely has this need. Should I create an issue or PR for SciPy?

For reference, some relevant issues reported for scikit-learn contributions and maintenance include:

Side note: tabmat, efficient matrix representations for working with tabular data

tabmat is a project that might be worth considering. In scikit-learn, some newly considered solver implementations have bottlenecks which fall under tabmat’s use-case.



Yes please! Since this is an important topic, I would suggest creating an issue first for discoverability and then a PR to follow up.

I’ve opened this issue to pursue discussion:

I would first go through the given references’ past discussions before opening a PR.


I would refer you to this video, except imagine multiply/add can be any elemwise/reduction.

Hi all,

Thank you for initiating this conversation @ivirshup and great to meet everyone! So much great information in here to digest, and I’m excited to see what comes out of this discussion and future interactions.

I’m representing Open2C and the cooler format. I’ve also contributed some work at a SciPy sprint a couple years back to make pydata/sparse and xarray work together, allowing for sparse xarrays.

Like scverse, we work in the genomics space, but more often than not with high-dimensional bulk data rather than single-cell data. Our field specifically maps out the organization of genomes in 3D, using fancy DNA sequencing technologies instead of microscopes. These methods generate massive matrices like this.

The data is fundamentally sparse: millions to billions of counts dispersed in a peculiar way over a very large discrete lattice (genome-squared) up to order 10^9 x 10^9 in resolution. In our use-case, there are some obvious image and geo-spatial analogies. For example, there are logical divisions in the heatmap representation (e.g., chromosome boundaries). Querying is also often done in bounding boxes of arbitrary size. Most importantly, multiscale (pre-aggregation at multiple coarse resolutions) is important for both visualization and analysis.

Our storage solution cooler is very similar to AnnData. It is a CSR matrix format modeled using dense arrays in HDF5 (it is really a COO/CSR hybrid; we chose to keep the row-value array hanging around because it gets compressed so well, but it’s not needed at all). Since the axes of our array are genomic intervals, we also store a metadata table describing the rows and columns, similar to an xarray.DataArray. So the array is “labeled” or “annotated”. Also, our matrices are normally symmetric, so we only keep the upper triangle. This adds some complexity to querying 2D ranges when pulling from disk.

Summary in an image.
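To make the upper-triangle trick concrete, here is a conceptual sketch with SciPy (cooler’s actual 2D range queries work against the indexed on-disk arrays rather than materializing the whole matrix):

```python
import scipy.sparse as sp

def full_from_upper(upper):
    # The stored matrix keeps only entries with col >= row; mirror the
    # strictly-upper entries to recover the full symmetric matrix.
    lower = sp.tril(upper.T, k=-1)
    return upper + lower

upper = sp.coo_matrix(([1.0, 2.0, 3.0], ([0, 0, 1], [0, 1, 1])), shape=(2, 2))
full = full_from_upper(upper)
```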

Some thoughts on the topics brought up:

On-disk representations

Whatever the container chosen, the basic idea of storing the dense component arrays along with metadata describing how to reconstruct the sparse array feels like something we could discuss and agree on.

Totally agree! Since HDF5 and Zarr share the same data model, we also support Zarr as a backend, which I consider to be a “protocol” rather than a “format”. It would be nice to come up with a similar protocol for sparse storage, in analogy to how Zarr abstracts away the storage details of chunked, encoded, n-dimensional arrays. Thank you for sharing sscdf!

On a related note, the improved interop between zarr, fsspec, and HDF5/NetCDF has created an exciting new way to bypass the cumbersome libhdf5 and netcdf C libraries for efficient read access to remote data: see kerchunk.

An additional type of sparse storage modality that we once considered for data like ours (sparse but often accessed in a geo-spatial-like fashion) is space-filling curve indexes. This method sorts the dense component arrays in a way that preserves spatial locality, unlike CSR/CSC which rely on grouping the elements by row or column. There is a Hilbert curve implementation for dataframes in spatialpandas, and former maintainer Jon Mease also pointed me to dask-geopandas, so it looks like some of the plumbing for this now exists in the ecosystem. There are also tiered storage schemes, which I think is what TileDB does: break the sparse array into smaller tiles and store them in a more usual format like CSR.
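As a toy illustration of the space-filling-curve idea (a Z-order/Morton key rather than the Hilbert curve mentioned above, since it fits in a few lines):

```python
def morton_key(row: int, col: int, bits: int = 32) -> int:
    # Interleave the bits of (row, col); sorting COO entries by this key
    # keeps spatially nearby entries close together on disk.
    key = 0
    for i in range(bits):
        key |= ((row >> i) & 1) << (2 * i + 1)
        key |= ((col >> i) & 1) << (2 * i)
    return key

# Entries from the same 2x2 neighborhood get adjacent keys.
print(sorted(morton_key(r, c) for r in range(2) for c in range(2)))
```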

Out-of-core

Besides storage/access and visualization, sparse representations are also vital for the types of analyses we do, including linear algebra and graph algorithms. Other types of computations we do could be considered “convolutional scoring” (sweeping a kernel over a region or along the main diagonal) for feature detection and quantification. We also do aggregate analyses that sound similar to what @hodgestar was describing. Often it involves mapping the locations of a large number of small, fixed-size “snippets”, collecting them, and reducing them into composite heatmaps. So flexibility in querying and some amount of laziness is important.

We maintain a collection of hand-crafted, mostly out-of-core, common operations on cooler files in a separate package called cooltools. Many of these tools actually end up relying on dataframe representations of chunks of array data, rather than sparse array data structures (though sometimes we manipulate the row, col, data and indptr arrays of CSR matrices, or convert small submatrices to dense in memory). As a result, the algorithms are often clunky and can be challenging to reason about. Better out-of-core sparse array machinery would definitely make our lives easier.
