Sparse data refers to datasets where a high percentage of the values are zero or empty. This happens when relationships across dimensions (e.g. rows and columns) don’t exist or are neglected. Sparse datasets are ubiquitous in modern scientific computing, including network analysis, signal processing, image processing, machine learning, etc. There exist many sparse data formats which save memory by only storing non-zero values, yet still allow efficient computation and manipulation.
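As a minimal illustration of the memory savings, here is a small sketch using scipy's CSR format, which stores only the non-zero values plus their index arrays (the sizes below assume default float64 storage):

```python
import numpy as np
from scipy.sparse import csr_array

# A mostly-zero matrix: dense storage keeps every entry,
# while CSR keeps only the non-zero values plus index arrays.
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[500, 250] = 2.0

sparse = csr_array(dense)

print(sparse.nnz)          # 2 non-zero values stored
print(dense.nbytes)        # 8,000,000 bytes for the dense array
print(sparse.data.nbytes)  # 16 bytes for the stored values
```

Computation still works directly on the compressed form, e.g. `sparse @ np.ones(1000)`, without ever densifying.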
Recently, a sparse array API was added to scipy.sparse as a first step in removing the sparse matrix API and eventually np.matrix. This effort is complementary to work being done on the PyData sparse package, which provides n-dimensional sparse data structures that support array semantics appropriate for Numba compiled code.
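For anyone who hasn't followed that change: the key user-visible difference is that the new array classes follow np.ndarray semantics rather than np.matrix semantics, so `*` is element-wise and `@` is matrix multiplication. A quick comparison:

```python
import numpy as np
from scipy.sparse import csr_array, csr_matrix

a = np.array([[1, 2], [3, 4]])

# Old matrix API: `*` means matrix multiplication (np.matrix semantics).
m = csr_matrix(a)
print((m * m).toarray())   # [[ 7 10] [15 22]]

# New array API: `*` is element-wise, matching np.ndarray semantics.
s = csr_array(a)
print((s * s).toarray())   # [[ 1  4] [ 9 16]]
print((s @ s).toarray())   # [[ 7 10] [15 22]] -- use `@` for the matrix product
```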
The summit would bring together developers and users of sparse arrays to discuss shortcomings of the current implementations, the needs of various scientific communities, and to develop a shared roadmap and vision for better supporting sparse arrays.
Depending on interest, this could be one of a series of sparse array summits over the next few years. The first summit would likely occur near the end of 2022 or the beginning of 2023. Please let me know if you would be interested in attending, or if there are important communities or projects that should be represented at the summit.
More details about the summit will be posted here as they become available.
I’d also be interested in having a place for lower-commitment and asynchronous communication about this topic. I think this Discourse and the Discord would be ideal platforms. To that end, I’ve created a post summarizing some discussions from SciPy and asking for more feedback here:
I’d be interested in coming as well. I’m representing PyData/Sparse and XSparse (a very-WIP re-implementation of TACO).
I might also add that I’m incredibly happy the community has come together to sort these things out, and that discussion is heating up. I would love nothing more than to be an active member of the community. I will contribute on some weekends and possibly come to meetups, but that said, my time on this may be limited in the short term.
I don’t have a ton of experience with coding sparse arrays but I encounter them from time to time in my area of interest. It would be wonderful to have something that dynamically (and automagically?) optimized for sparseness. In that spirit, it would be nice to generalize to compressible large arrays, for instance when most of the entries are identical. We occasionally (often?) meet such matrices in portfolio optimization through dimensional reduction, the most extreme probably being Elton-Gruber where all the off diagonal entries are replaced by a single average value.
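To make the Elton-Gruber example concrete: a constant-correlation matrix has 1 on the diagonal and a single average value everywhere off-diagonal, so it never needs dense storage. This is a hypothetical sketch (not an existing library feature): storing just the scalar `rho` lets a matrix-vector product run in O(n) instead of O(n²).

```python
import numpy as np

def constant_corr_matvec(rho, x):
    """Compute C @ x for C = (1 - rho) * I + rho * ones((n, n)),
    without ever materializing the n x n matrix."""
    return (1.0 - rho) * x + rho * x.sum()

n, rho = 4, 0.3
x = np.arange(1.0, n + 1)

# Dense reference matrix, only built here to check the shortcut.
C = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
print(np.allclose(C @ x, constant_corr_matvec(rho, x)))  # True
```

Something that dispatched to structure-aware kernels like this automatically is exactly the kind of "compressible array" generalization described above.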
Anyway, I’m looking for a place to start making code contributions, and this seems as good a place as any.