Scientific python and sparse arrays (scipy summary + future directions)

At this year’s fantastic scipy conference I had the opportunity to meet a number of other developers in our community who are interested in the future of sparse arrays in python.

Associated projects included: networkx, open2c, pangeo, python-graphblas, QuTiP, and scverse.

I’m creating this thread to start a discussion about the state of sparse arrays in the scientific python ecosystem. I would like to find out what people are working on, what our common needs are, and if we can figure out how we can collaborate on a path forward. I would love for other members of the community to chime in on this thread about their experiences and work with sparse data.

First I’ll summarize a meet up a few of us had, then I’ll suggest some possible future directions.

Sparse data informal meet up

original notes

Via the conference slack, we had a small meet up of some maintainers interested in sparse data. The projects represented are described below.

We first talked a bit about how each of us is working with sparse data.

The projects

QuTiP

QuTiP has recently added its own Cython implementations of sparse arrays, because they found the scipy implementations too slow. Particular issues they ran into were:

  • Sparse addition doing unnecessary copies
  • Initialisation and complex types
    • There are lots of checks that make things slow
    • Casting around complex types
  • Downcasting of indices/indptr
    • E.g. they want 64-bit indptr, but scipy aggressively downcasts these to int32
    • Checking that the values are within int32 range and making copies takes an unnecessary amount of time

The main use cases for sparse data were linear algebra and ODE solving.

python-graphblas

python-graphblas is a new Python API for the very cool SuiteSparse:GraphBLAS library. It provides high-performance graph operations via sparse adjacency matrix representations of graphs, and it allows users to apply arbitrary semirings, which is an extremely powerful abstraction for sparse matrix operations in general. I would highly recommend checking out some of the literature on GraphBLAS.
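To give a flavor of why semirings are such a powerful abstraction, here is a conceptual sketch in plain NumPy (deliberately not the python-graphblas API): replacing the usual (+, ×) of matrix multiplication with (min, +) turns the very same operation into a shortest-path computation.

```python
import numpy as np

# Adjacency matrix of a tiny weighted graph; np.inf marks "no edge".
A = np.array([[0.0,    1.0,    np.inf],
              [np.inf, 0.0,    2.0],
              [np.inf, np.inf, 0.0]])

def min_plus(A, B):
    # "Matrix multiplication" over the (min, +) semiring:
    # C[i, j] = min_k (A[i, k] + B[k, j])
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

two_hop = min_plus(A, A)   # shortest path costs using at most two edges
print(two_hop[0, 2])       # 3.0, via the intermediate node 1
```

python-graphblas exposes this idea over sparse matrices, so graph algorithms become short sequences of semiring matrix operations.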

Some key features of this library:

  • “Sparse” (implicit) values that aren’t necessarily zero
  • SuiteSparse sparse representations
    • Masked arrays
    • COO, CSR, and hyper-sparse formats
    • MLIR GraphBLAS
  • Not JIT-compiled
  • Matrices must be 64-bit

anndata / scverse

scverse is a consortium of python tools for working with single cell high-dimensional omics data. We need representations for sparse data because:

  • Much of our data is sparse (e.g. 0.5-2k non-zero features out of ~20k total per cell, for up to millions of cells)
  • Nearest neighbor and spatial graphs are very useful for our data. We represent these as sparse adjacency matrices.

We pass around data between analysis packages using the AnnData object and its on-disk format. The AnnData object itself is like a special case of an Xarray dataset, but with two required dimensions (observations and variables), the ability to hold scipy sparse matrices and DataFrames, and support for adjacency matrices.

One of the key features of the AnnData object is its on-disk representation (via HDF5 or Zarr), which allows partial access to sparse arrays.
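To make that concrete, here is a simplified sketch of the general idea behind the format: store the dense component arrays of a CSR matrix in a group, plus shape metadata (the real spec also records encoding type and version attributes; see the AnnData on-disk format docs for details).

```python
import h5py
import scipy.sparse as sp

def write_csr(group: h5py.Group, mat: sp.csr_matrix) -> None:
    # Store the three CSR component arrays as ordinary dense datasets.
    group.create_dataset("data", data=mat.data)
    group.create_dataset("indices", data=mat.indices)
    group.create_dataset("indptr", data=mat.indptr)
    group.attrs["shape"] = mat.shape

def read_csr(group: h5py.Group) -> sp.csr_matrix:
    # Reconstruct the matrix; partial reads would slice indptr/indices/data.
    return sp.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]),
        shape=tuple(group.attrs["shape"]),
    )

with h5py.File("example.h5", "w") as f:
    write_csr(f.create_group("X"), sp.random(100, 50, density=0.1, format="csr"))
with h5py.File("example.h5", "r") as f:
    X = read_csr(f["X"])
```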

Common pain points / possible solutions

Some common pain points that came up during discussion:

Scipy sparse casting

  • None of us likes how scipy sparse tries to downcast everything aggressively (a small illustration follows this list)
    • AnnData is starting to need 64-bit indptr, which can quickly lead to dtype errors when interacting with arrays that have been downcast
    • This blocks zero-copy interaction with python-graphblas and the QuTiP libraries
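A small illustration of the behavior in question (exact details vary by SciPy version): constructing a CSR matrix from 64-bit index arrays whose values happen to fit in int32 triggers a content check and a downcasting copy.

```python
import numpy as np
import scipy.sparse as sp

indptr = np.array([0, 1, 2], dtype=np.int64)
indices = np.array([0, 1], dtype=np.int64)
data = np.array([1.0, 2.0])

# SciPy checks whether the index values fit in int32 and, if they do,
# copies them down to int32, so the 64-bit arrays passed in are no
# longer shared with the matrix (blocking zero-copy interchange).
mat = sp.csr_matrix((data, indices, indptr), shape=(2, 2))
print(mat.indptr.dtype)      # typically int32
print(mat.indptr is indptr)  # False, a copy was made
```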

On-disk representations for sparse arrays

  • No common standards for this
  • Varying access patterns – one could make highly specialized formats
  • Possible solutions:
    • Trying AnnData’s on-disk representation (docs)
    • TileDB-based storage

Out of core

  • So far more interest than need from QuTiP and graphblas
  • Some tools:
    • Dask GraphBLAS from @eriknw
    • Vitessce performs out-of-core access to AnnData-format sparse matrices in Zarr stores for browser-based viz
    • anndata-rs is a Rust-based reader and writer for HDF5-based AnnData files with increased out-of-core functionality

Future directions / open questions

Communication

It was really great meeting with so many maintainers who are working with sparse arrays in python. Even though we are working on similar data structures, few of us knew of the sparse-specific work that others were doing.

To me, this suggests we need increased communication and visibility here. I think creation of the sparse-arrays tag on this forum and the suggested sparse array summit are great starting points.

https://discuss.scientific-python.org/t/sparse-array-summit/365

Standardization/ interoperability

One of the most critical features of any data structure is ecosystem support. How can we come together in supporting a fully featured sparse array library?

I think this will need input from maintainers of the large libraries that currently work with sparse arrays including scikit-learn, xarray, and scipy. In particular I would love to hear their current thoughts on what’s happening with pydata/sparse and the new scipy sparse arrays.

How much can be shared? What is reasonable to collaborate on?

Clearly there is a need for a common sparse array representation – as demonstrated by scipy.sparse’s continued usage. What else do we need to share here?

  • More formats? This could include optimized n-dimensional arrays and hypersparse formats
  • More extensive linear algebra operations?
  • Dask (or other out-of-core) support? In particular, do people need specialized chunking strategies?

I would love to hear more input on this topic from the rest of the scientific python community. This could include more pain points, cool new efforts, or just interest in future conversations.


@ivirshup Thank you for getting the tag set up, driving the discussion and taking comprehensive notes.

I’m also keen to hear the thoughts of those involved with scipy.sparse. Where do they see it going next?

Internally in QuTiP we’ve found it very useful to support a low-level gather pattern where a collection of smaller sparse matrices or individual values are “gathered” while performing an operation that will return a new sparse matrix, and then “reduced” into a new sparse matrix at the end. Efficient reduction can be done when one can make some guarantees about where the collection of elements or matrices are.

I wonder if designing and exposing such a low-level operation would be of interest to others, especially for the case of performing more complex operations on distributed sparse matrices. Maybe there are other libraries that already provide good abstractions for this?

@ivirshup Thanks for getting this thread started.

I really hope this effort can lead to a common storage and interchange format for different sparse libraries. Matrix Market feels like the “CSV” of sparse arrays – human readable, functional, stable, and very slow. The AnnData format looks like a good start. I also have a proof-of-concept library called sscdf doing something similar with netCDF4+HDF5 to store metadata and underlying dense arrays for CSR/CSC, plus the other formats that SuiteSparse:GraphBLAS supports.

I’m also intrigued by ASDF which was shown at SciPy. It has support for metadata and dense arrays, so it could be a lighter-weight alternative to HDF5.

Whatever the container chosen, the basic idea of storing the dense component arrays along with metadata describing how to reconstruct the sparse array feels like something we could discuss and agree on.

Indeed, thanks @ivirshup for starting the thread.

This is obviously a huge topic with many dimensions (ha) and I agree very much with the various points raised above. I just wanted to add another component that I think would represent a major improvement on the usability/adoption front: data structures with an ndarray interface instead of a matrix one. I don’t want to belabor the point here, but just to highlight that I think this is critically important for users.

There has already been some work on this front both within scipy (see e.g. scipy/scipy#14822) and in other projects like pydata-sparse. I just wanted to draw attention to this topic in the thread!
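For anyone who hasn’t run into the distinction, a small sketch (assuming a SciPy recent enough to ship csr_array alongside the legacy csr_matrix):

```python
import scipy.sparse as sp

A = sp.random(2, 3, density=0.5, format="csr")  # legacy csr_matrix interface
B = sp.csr_array(A)                             # new array interface

B * B       # elementwise product, matching ndarray semantics
B @ B.T     # matrix product is spelled explicitly with @
A @ A.T     # works with both interfaces
# A * A     # with the matrix interface `*` means matmul, so this raises
```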

Thanks for the great write-up @ivirshup. I think from SciPy’s perspective the data structure has some challenges, and if a better replacement gets created, that can be used or incorporated. A key thing to plan for there, I think, is how such a new data structure connects to scipy.sparse.linalg and scipy.sparse.csgraph; that code would be a lot harder to replace. PyData Sparse array instances are understood and supported, via conversion to the native SciPy format, since ENH: sparse/linalg: support pydata/sparse matrices by pv · Pull Request #10901 · scipy/scipy · GitHub. A new format could be supported like that. Or a true replacement could be tried - but that’s a lot more effort of course.

This does not seem like something that would be hard to support in scipy.sparse. I think the history there is that the data structure was 32-bit only first, and extended to 64-bit in a fully backwards-compatible way. I don’t remember a discussion about either doing a deprecation cycle or opting in to 64-bit only behavior. It’d be great if someone with this need could propose that.


Thanks for chiming in everyone!

@hodgestar

Internally in QuTiP we’ve found it very useful to support a low-level gather pattern where a collection of smaller sparse matrices or individual values are “gathered” while performing an operation that will return a new sparse matrix, and then “reduced” into a new sparse matrix at the end.

Just to make sure I’m understanding this right, this is like a lazy API for sparse linear algebra?

If so, I think this would be super cool. I think this was a potential goal for pydata/sparse through TACO bindings.

@jim22k

The AnnData format looks like a good start. I also have a proof-of-concept library called sscdf doing something similar with netCDF4+HDF5 to store metadata and underlying dense arrays for CSR/CSC, plus the other formats that SuiteSparse:GraphBLAS supports.

It would be nice to move beyond Matrix Market. @eriknw seems to be in discussion with the GraphBLAS team about this.

As I mentioned to you and Erik at scipy, I have the sense people often come up with complicated on-disk sparse formats which are optimized for specific classes of computation, e.g. TileDB and its density-based chunking. I’m curious to see if we can come up with something that’s focused on interchange (so it’s fairly simple), but can be augmented for more specific use cases.

I’m also intrigued by ASDF

I am interested in the schemas from ASDF. However, I’m not sure how widespread implementations are. For AnnData we primarily suggest HDF5 because most languages can read it. Zarr also seems to be gaining adoption (netCDF can use Zarr, for example).

I believe there was a meeting between the zarr and ASDF teams at scipy. I’ve pinged the zarr team about outcomes of this meeting and will get back on this.

Do you think it’d be worth starting a new topic specifically for discussion of on disk formats? @nvictus may have some thoughts on this as well.

@rossbar

That is absolutely on the top of my mind! I would actually really love to hear if there is any consensus between scipy.sparse, pydata/sparse, and downstream libraries like scikit-learn on where we can focus efforts!

I was more involved in this conversation on the pydata/sparse side about a year ago, but then left my PhD for an RSE position on the other side of the planet so fell out of the loop :sweat_smile:.

@rgommers

Thanks for the response! It’s really nice to hear from the scipy team here!

A key thing to plan for there I think is how such a new data structure connects to scipy.sparse.linalg and scipy.sparse.csgraph

I’m definitely a big fan of “not re-implementing”, or at least “implementing as little as possible”. It would make a lot of sense to me if these libraries were re-used, at least for the more standard sparse formats.

This does not seem like something that would be hard to support in scipy.sparse.

Last I was up to date on these conversations, it seemed to me like the eventual goal was for pydata/sparse to replace scipy.sparse (or at least the classes). If that were still the goal, I think there’s already a solution to this because pydata/sparse matrices are effectively parametric on the index dtype as well as the value dtype.

I had thought a quick path to replacing the csr_matrix/csc_matrix would be to just call out to their methods from sparse.CSR and sparse.CSC classes. I’m a little confused about the path forward now that there are coo_arrays.
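Roughly the kind of thin delegation I mean (class and method names here are purely illustrative, not an existing API):

```python
import scipy.sparse as sp

class CSR:
    """Hypothetical wrapper holding the CSR component arrays and
    delegating the heavy lifting to scipy.sparse.csr_matrix."""

    def __init__(self, data, indices, indptr, shape):
        self.data, self.indices, self.indptr = data, indices, indptr
        self.shape = shape

    def to_scipy(self):
        # A real implementation would need to sidestep SciPy's index downcasting.
        return sp.csr_matrix(
            (self.data, self.indices, self.indptr), shape=self.shape
        )

    def __matmul__(self, other):
        out = self.to_scipy() @ other.to_scipy()
        return CSR(out.data, out.indices, out.indptr, out.shape)
```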

Could you offer a perspective here? Do you think there’s renewed appetite (maybe even funding) for improving the scipy.sparse implementations? One implementation point here: I’m personally much more capable of writing numba than C or Cython.

Hello! Maintainer of PyData/Sparse checking in. I’m working with a GSoC student to incorporate TACO as a full implementation for sparse arrays. I have read the needs of the various parties above, and TACO (or a framework based on it) seems to cover everything. I’m aware the TACO maintainers are looking for use-cases and believe they would be very interested in working with us.

They are also thinking along the same dimensions as those presented here. The biggest concern is the need for a compiler at run-time – I’m not sure how much of a blocker that would be for folks here. Of course, one could always pre-generate kernels and then not need a compiler.

I hadn’t thought of it as a lazy API, but yes, I suppose it is. I’m glad it sounds cool to you too!

I’m not a TACO expert – would you mind explaining how TACO would support this? I see it can combine multiple operations into one step, but what I’m imagining is a bit different perhaps.

For example, when performing matrix multiplication, one has to multiply elements together pairwise and then sum up those that contribute to the same entry in the new matrix. An efficient way to do this is to keep track of all the elements and then to sum up the entries at the end while creating the new sparse matrix.

Another example: when constructing a very large sparse matrix that consists of many small, dense blocks, it can help significantly (e.g. O(N**2) vs O(N)) to generate all the blocks as small blocks, and then combine them into one sparse matrix only at the end.

So it’s more of a way to implement new operations than to combine old ones (although there are definite similarities).
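To make the block-construction case concrete, here’s a rough sketch with plain SciPy: gather COO triplets from all the blocks, then reduce them into the sparse matrix once at the end (names are illustrative, not QuTiP’s internal API):

```python
import numpy as np
import scipy.sparse as sp

def assemble(blocks, offsets, shape):
    # Gather: collect COO triplets from each small dense block.
    rows, cols, vals = [], [], []
    for block, (r0, c0) in zip(blocks, offsets):
        r, c = np.nonzero(block)
        rows.append(r + r0)
        cols.append(c + c0)
        vals.append(block[r, c])
    # Reduce: build the big sparse matrix exactly once at the end.
    return sp.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=shape,
    ).tocsr()

big = assemble(
    blocks=[np.ones((2, 2)), 2 * np.ones((2, 2))],
    offsets=[(0, 0), (2, 2)],
    shape=(4, 4),
)
```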

Hi everyone,

First, thanks @ivirshup for starting discussions and setting the context on this topic.

Here are a few inputs on scikit-learn usage of sparse matrices.

scikit-learn and sparse arrays, tl;dr: SciPy CSR matrices are the most common case, yet there is room for alternatives.

In scikit-learn, sparse matrices are supported, with CSR being the most commonly used kind of sparse matrix.

A variety of scikit-learn Estimators support SciPy sparse matrices via the Python API, and some algorithms (implemented in Cython) work on the CSR matrices’ data, indices, and indptr numpy arrays directly via memoryviews: SciPy and scikit-learn using Cython to respectively expose and consume the matrices’ internals allows both ease of development and efficient implementations (for instance, see this set of utility functions). In the case of those algorithms, the arrays need not originate from SciPy CSR matrices, but they currently do.
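A pure-Python sketch of that access pattern, working on the same three arrays that the Cython routines take as typed memoryviews:

```python
import numpy as np
import scipy.sparse as sp

def row_sums(X: sp.csr_matrix) -> np.ndarray:
    # Operate directly on the CSR component arrays: for row i, its
    # non-zero values live in data[indptr[i]:indptr[i + 1]].
    out = np.zeros(X.shape[0], dtype=X.data.dtype)
    for i in range(X.shape[0]):
        out[i] = X.data[X.indptr[i]:X.indptr[i + 1]].sum()
    return out

row_sums(sp.random(5, 4, density=0.3, format="csr"))
```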

We do not use or implement any sparse linear algebra operations apart from this safe sparse dot product.

Using SciPy sparse matrices was (and in my opinion still is) the most natural choice: SciPy is one of the few dependencies of scikit-learn, users might use its sparse matrices for downstream and upstream tasks in their work, and potentially sub-optimal performance of the Python API is not a problem. Yet technically, with a few changes, scikit-learn need not be restricted to SciPy in the future and could support other libraries’ implementations of sparse arrays; this support should nonetheless be discussed among scikit-learn maintainers.

A problem with SciPy CSR matrices: varying indptr dtype

The most significant problem that we currently face with SciPy is statically typing SciPy CSR matrices’ indptr array in Cython, which relates to previous discussions:

This is useful to know. :+1:

I think it would be worth having int64 be used by default for indptr but have the possibility to require the use of int32 for this array.

@rgommers: scikit-learn definitely has this need. Should I create an issue or PR for SciPy?

For reference, some relevant issues reported for scikit-learn contributions and maintenance include:

Side note: tabmat, efficient matrix representations for working with tabular data

tabmat is a project that might be worth considering. In scikit-learn, some newly considered solver implementations have bottlenecks which fall under tabmat’s use-case.



Yes please! Since this is an important topic, I would suggest creating an issue first for discoverability and then a PR to follow up.

I’ve opened this issue to pursue discussion:

I would first go through the given references’ past discussions before opening a PR.


I would refer you to this video, except imagine multiply/add can be any elemwise/reduction.

Hi all,

Thank you for initiating this conversation @ivirshup and great to meet everyone! So much great information in here to digest, and I’m excited to see what comes out of this discussion and future interactions.

I’m representing Open2C and the cooler format. I’ve also contributed some work at a SciPy sprint a couple years back to make pydata/sparse and xarray work together, allowing for sparse xarrays.

Like scverse, we work in the genomics space, but more often than not with high-dimensional bulk data rather than single-cell data. Our field specifically maps out the organization of genomes in 3D, using fancy DNA sequencing technologies instead of microscopes. These methods generate massive matrices like this.

The data is fundamentally sparse: millions to billions of counts dispersed in a peculiar way over a very large discrete lattice (genome-squared) up to order 10^9 x 10^9 in resolution. In our use-case, there are some obvious image and geo-spatial analogies. For example, there are logical divisions in the heatmap representation (e.g., chromosome boundaries). Querying is also often done in bounding boxes of arbitrary size. Most importantly, multiscale (pre-aggregation at multiple coarse resolutions) is important for both visualization and analysis.

Our storage solution cooler is very similar to AnnData. It is a CSR matrix format modeled using dense arrays in HDF5 (it is really a COO/CSR hybrid; we chose to keep the row-value array hanging around because it gets compressed so well, but it’s not needed at all). Since the axes of our array are genomic intervals, we also store a metadata table describing the rows and columns, similar to an xarray.DataArray. So the array is “labeled” or “annotated”. Also, our matrices are normally symmetric, so we only keep the upper triangle. This adds some complexity to querying 2D ranges when pulling from disk.

Summary in an image.
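To make the upper-triangle trick concrete, here is a conceptual sketch with SciPy (cooler’s actual 2D range queries work against the indexed on-disk arrays rather than materializing the whole matrix):

```python
import scipy.sparse as sp

def full_from_upper(upper):
    # The stored matrix keeps only entries with col >= row; mirror the
    # strictly-upper entries to recover the full symmetric matrix.
    lower = sp.tril(upper.T, k=-1)
    return upper + lower

upper = sp.coo_matrix(([1.0, 2.0, 3.0], ([0, 0, 1], [0, 1, 1])), shape=(2, 2))
full = full_from_upper(upper)
```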

Some thoughts on the topics brought up:

On-disk representations

Whatever the container chosen, the basic idea of storing the dense component arrays along with metadata describing how to reconstruct the sparse array feels like something we could discuss and agree on.

Totally agree! Since HDF5 and Zarr share the same data model, we also support Zarr as a backend, which I consider to be a “protocol” rather than a “format”. It would be nice to come up with a similar protocol for sparse storage, in analogy to how Zarr abstracts away the storage details of chunked, encoded, n-dimensional arrays. Thank you for sharing sscdf!

On a related note, the improved interop between zarr, fsspec, and HDF5/NetCDF has created an exciting new way to bypass the cumbersome libhdf5 and netcdf C libraries for efficient read access to remote data: see kerchunk.

An additional type of sparse storage modality that we once considered for data like ours (sparse but often accessed in a geo-spatial-like fashion) is space-filling curve indexes. This method sorts the dense component arrays in a way that preserves spatial locality, unlike CSR/CSC which rely on grouping the elements by row or column. There is a Hilbert curve implementation for dataframes in spatialpandas, and former maintainer Jon Mease also pointed me to dask-geopandas, so it looks like some of the plumbing for this now exists in the ecosystem. There are also tiered storage schemes, which I think is what TileDB does: break the sparse array into smaller tiles and store them in a more usual format like CSR.
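As a toy illustration of the space-filling-curve idea (a Z-order/Morton key rather than the Hilbert curve mentioned above, since it fits in a few lines):

```python
def morton_key(row: int, col: int, bits: int = 32) -> int:
    # Interleave the bits of (row, col); sorting COO entries by this key
    # keeps spatially nearby entries close together on disk.
    key = 0
    for i in range(bits):
        key |= ((row >> i) & 1) << (2 * i + 1)
        key |= ((col >> i) & 1) << (2 * i)
    return key

# Entries from the same 2x2 neighborhood get adjacent keys.
print(sorted(morton_key(r, c) for r in range(2) for c in range(2)))
```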

Out-of-core

Besides storage/access and visualization, sparse representations are also vital for the types of analyses we do, including linear algebra and graph algorithms. Other types of computations we do could be considered “convolutional scoring” (sweeping a kernel over a region or along the main diagonal) for feature detection and quantification. We also do aggregate analyses that sound similar to what @hodgestar was describing. Often it involves mapping the locations of a large number of small, fixed-size “snippets”, collecting them, and reducing them into composite heatmaps. So flexibility in querying and some amount of laziness is important.

We maintain a collection of hand-crafted, mostly out-of-core, common operations on cooler files in a separate package called cooltools. Many of these tools actually end up relying on dataframe representations of chunks of array data, rather than sparse array data structures (though sometimes we manipulate the row, col, data and indptr arrays of CSR matrices, or convert small submatrices to dense in memory). As a result, the algorithms are often clunky and can be challenging to reason about. Better out-of-core sparse array machinery would definitely make our lives easier.
