Ragged Array Summit

I’m up for that! Some more instances of libraries with ragged arrays:

  • Apache Arrow is a big project with a lot of momentum that has the same set of data types as Awkward Array (our aim is to be compatible with Arrow). There’s a lot of overlap in development between Arrow and the Parquet file format, which has the same suite of data types.
  • Ragged arrays are a feature of Zarr v2, and we’ve had off-and-on discussions about ensuring that it will be possible to wrap Awkward Arrays as a v3 extension (Protocol extensions for awkward arrays · Issue #62 · zarr-developers/zarr-specs · GitHub).
  • For that matter, HDF5 has a variable-length (vlen) data type, too. When combined with compound types, you have almost as much expressiveness as Arrow, but it’s a record-oriented format, not a columnar one.
  • TensorFlow has had RaggedTensors for a few years, and PyTorch has recently mainlined NestedTensor.
  • Historically, Continuum Analytics and Quansight were working on DyND and XND, intended as NumPy replacements with first-class ragged array support, but it is my understanding that these projects are no longer being developed.
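
For anyone less familiar with the term, here is a minimal pure-Python sketch of the offsets-plus-content layout that Arrow's variable-size list type and Awkward Array's ListOffsetArray use for a ragged array of numbers. The helper names `to_offsets` and `get_list` are made up for illustration and are not part of either library.

```python
def to_offsets(lists):
    """Flatten a list of variable-length lists into (offsets, content)."""
    offsets = [0]
    content = []
    for sub in lists:
        content.extend(sub)
        # Each offset marks where the next sublist starts in `content`.
        offsets.append(len(content))
    return offsets, content


def get_list(offsets, content, i):
    """Recover the i-th sublist from the flattened representation."""
    return content[offsets[i]:offsets[i + 1]]


ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
offsets, content = to_offsets(ragged)
print(offsets)                        # [0, 3, 3, 5]
print(content)                        # [1.1, 2.2, 3.3, 4.4, 5.5]
print(get_list(offsets, content, 2))  # [4.4, 5.5]
```

The empty sublist costs nothing in `content`; it shows up only as two equal consecutive offsets.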

There seem to be two main categories: (a) systems that provide a complete type system (as much as a typical programming language would have), which include Awkward, Arrow, Parquet, DyND, and XND, and (b) systems that focus on ragged arrays of numbers, which include Zarr, TensorFlow, PyTorch, and Scipp. Discussions about integrating Awkward Array into AnnData (first attempt to support awkward arrays by giovp · Pull Request #647 · scverse/anndata · GitHub) and xarray (Awkward array backend? · Issue #4285 · pydata/xarray · GitHub) highlight the tension between these two categories, since those projects are generally looking for the ragged array functionality only.
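
To make the distinction concrete, here is a rough sketch in plain Python of what category (a) adds on top of category (b). A ragged array of numbers needs only offsets and one flat buffer, whereas a ragged array of records also needs one flat column per field, all sharing the same offsets. The field names `x` and `y` and the helper `records_to_columns` are hypothetical, chosen only for illustration; this is not how any of the libraries above is implemented internally.

```python
def records_to_columns(lists_of_records, fields):
    """Store variable-length lists of records as offsets plus one column per field."""
    offsets = [0]
    columns = {name: [] for name in fields}
    for sub in lists_of_records:
        for rec in sub:
            for name in fields:
                columns[name].append(rec[name])
        offsets.append(offsets[-1] + len(sub))
    return offsets, columns


# Hypothetical data: three entries holding 2, 0, and 1 records respectively.
events = [
    [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}],
    [],
    [{"x": 3, "y": 3.3}],
]
offsets, columns = records_to_columns(events, ["x", "y"])
print(offsets)       # [0, 2, 2, 3]
print(columns["x"])  # [1, 2, 3]
print(columns["y"])  # [1.1, 2.2, 3.3]
```

A library that only wants ragged arrays of numbers can stop at a single `content` buffer; the per-field columns (and, beyond this sketch, things like nullable and union types) are where the fuller type systems come in.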

We should also link back to “Ragged data in Awkward vs. Scipp · Discussion #1663 · scikit-hep/awkward · GitHub”, which links here.
