Ragged Array Summit

SimonHeybrock · September 2, 2022, 10:31am

As a topic distinct from the Sparse Array Summit: ragged arrays (for lack of a better generic name; or, more generally, arrays with irregular (sub)structures?) come up in a number of fields. Unlike sparse arrays, ragged arrays are not simply a more compact representation of a dense array with missing/zero entries.

I know of two communities/implementations:

Awkward Array, used by the HEP community (and more?).
Scipp (by me/us), used for neutron-scattering data, where data is recorded in “event-mode”.

Who else might be interested?

jpivarski · September 2, 2022, 4:47pm

I’m up for that! Some more instances of libraries with ragged arrays:

Apache Arrow is a big project with a lot of momentum that has the same set of data types as Awkward Array (our aim is to be compatible with Arrow). There’s a lot of overlap in development between Arrow and the Parquet file format, which has the same suite of data types.
Ragged arrays are a feature of Zarr v2, and we’ve had off-and-on discussions about ensuring that it will be possible to wrap Awkward Arrays as a v3 extension (Protocol extensions for awkward arrays · Issue #62 · zarr-developers/zarr-specs · GitHub).
For that matter, HDF5 has a varlen data type, too. When combined with compound types, you have almost as much expressiveness as Arrow, but it’s a record-oriented format, not a columnar one.
TensorFlow has had RaggedTensors for a few years, and PyTorch has recently mainlined NestedTensor.
Historically, Continuum Analytics and Quansight were working on DyND and XND, intended as NumPy replacements with first-class ragged array support, but it is my understanding that these projects are no longer being developed.

There seem to be two main categories: (a) systems that provide a complete type system (as much as a typical programming language would have), which include Awkward, Arrow, Parquet, DyND, and XND, and (b) systems that focus on ragged arrays of numbers, which include Zarr, TensorFlow, PyTorch, and Scipp. Discussions about integrating Awkward Array into AnnData (first attempt to support awkward arrays by giovp · Pull Request #647 · scverse/anndata · GitHub) and xarray (Awkward array backend? · Issue #4285 · pydata/xarray · GitHub) highlight this tension, since they’re generally looking for the ragged array functionality only.

We should also link back to “Ragged data in Awkward vs. Scipp · Discussion #1663 · scikit-hep/awkward · GitHub”, which links here.

SimonHeybrock · September 5, 2022, 5:04am

Another file format that supports (a very special case of) ragged data is the NeXus data format. The NXevent_data group (defining an HDF5 structure) is essentially what Awkward would call a ListOffsetArray. It is, however, defined in a very specific manner, so it is not useful beyond the one particular purpose of storing raw neutron event data from (pulsed) neutron scattering facilities.
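To illustrate the offsets-plus-content idea behind a ListOffsetArray-style layout, here is a minimal sketch in plain Python. The names `content`, `offsets`, and `sublist` and the numbers are hypothetical and chosen for illustration only; this is neither the NXevent_data field layout nor the Awkward Array API.

```python
# Flat buffer holding every value back to back (hypothetical data).
content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

# offsets[i]:offsets[i + 1] delimits sublist i, so there is one more
# offset than there are sublists. Equal neighbours mean an empty sublist.
offsets = [0, 3, 3, 5, 6]


def sublist(i):
    """Return the i-th variable-length sublist as a slice of the flat buffer."""
    return content[offsets[i]:offsets[i + 1]]


# Materialise the ragged structure as nested Python lists.
ragged = [sublist(i) for i in range(len(offsets) - 1)]

# Sublist lengths follow directly from consecutive offsets.
lengths = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
```

The point of this layout is that all values live in one contiguous buffer, so the ragged structure costs only one extra integer array.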

jjerphan · September 10, 2022, 2:32pm

Thank you for initiating discussions for this summit, @SimonHeybrock.

I am interested in participating.

For some context, scikit-learn’s data structures are mainly 1D and 2D C- and F-contiguous NumPy arrays.

Yet in some contexts, ragged arrays are used via NumPy arrays of NumPy arrays (for instance, as return values of radius-neighborhood-based interfaces such as sklearn.neighbors.NearestNeighbors.radius_neighbors).
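As a rough illustration of the "NumPy array of NumPy arrays" representation, the sketch below builds a 1D object-dtype array whose elements are integer arrays of different lengths. The neighbor indices are made up for illustration, and this is not scikit-learn's actual code.

```python
import numpy as np

# Hypothetical per-query neighbor indices, each with a different length.
neighbors = [np.array([0, 2]), np.array([1]), np.array([0, 1, 2])]

# Allocate a 1D object array first and fill it element by element, so that
# NumPy never tries to stack the rows into a regular 2D array.
ragged = np.empty(len(neighbors), dtype=object)
for i, row in enumerate(neighbors):
    ragged[i] = row

# Each element is an ordinary ndarray, but the outer array only stores
# references, so vectorised operations do not apply across rows.
lengths = [len(row) for row in ragged]
```

The outer array here is just a container of Python object references, which is part of why this representation is convenient to return but limited for further computation.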

At the lowest level of the implementation, libcpp is used via Cython (namely std::vector<std::vector > and move semantics) to manipulate buffers, which are then coerced into the NumPy representation given above before being returned to callers. This coercion is a bit awkward, but fortunately it works for our use case.

I think scikit-learn maintainers might be interested in discussions that lead to better data structures in the ecosystem, whether or not they end up being used in scikit-learn (we want to keep our number of dependencies small).