Multiple duck array layers: Composability, implementations, and usability concerns

SimonHeybrock · October 10, 2022, 8:13am

Some of you may have seen my recent discussions with @jthielen in, e.g., Best practices for "carrying through" type-specific operations to wrapped types · Issue #6 · pydata/duck-array-discussion · GitHub. I was exploring the idea of composing arrays from multiple layers of duck arrays, where each layer provides distinct features. Based on my journey through many issues on GitHub (from Xarray and Pint to Awkward and more), various NEPs (such as NEP 41 — First step towards a new datatype system — NumPy Enhancement Proposals), as well as our own needs, I identified the following requirements:

Dimension labels (as in xarray.Variable or the proposed xarray-lite)
Physical units (as in pint.Quantity)
Coordinates (xarray.DataArray)
Vectors or other spatial dtypes (such as linear transforms, see also Conceptualizing `xfield`: an xarray extension package for generic and geospatial scalar and vector fields · Discussion #6331 · pydata/xarray · GitHub)
Bin edges or cell boundaries (see scipp for the former, xgcm for the latter)
Masks (as in numpy.ma, or in scipp.DataArray, which provides a dict of masks)
Uncertainties
Sparse data
Ragged data (similar to awkward or as in scipp)

In my eyes, there are two conflicting requirements: (1) providing simple, self-contained array implementations that each add mainly a single feature to an underlying array (such as pint adding physical units, or xarray.Variable adding dimension labels), i.e., avoiding a single super-package that does all of the above; and (2) providing a coherent and usable solution. The need for projects such as GitHub - xarray-contrib/pint-xarray: Interface for using pint with xarray, providing convenience accessors is an example of this conflict. Once we extend the approach of writing such pair-wise glue packages to more of the duck arrays listed above, it might reach its limits.

Anyway, based on those thoughts I implemented a prototype that demonstrates how multiple duck-array layers can interact without requiring changes to individual layers or special packages for pair-wise (or tuple-wise) interactions. I won’t delve into details here initially (see the issue linked at the top for a few examples); the README in the repository points to some relevant places to look at. The central mechanism is the introduction of __array_getattr__, which lets duck array implementations opt in to exposing selected properties without dropping important metadata (such as dimension labels) from wrapping layers.
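To make the idea a bit more concrete, here is a minimal sketch of how such a hook could work. It is not the prototype's code: the classes UnitsLayer and DimsLayer, the wrap callback, and the exact signature of __array_getattr__ are assumptions made up for illustration, and plain Python lists stand in for real arrays.

```python
class UnitsLayer:
    """Hypothetical inner layer: attaches a unit string to plain values."""

    def __init__(self, values, units):
        self.values = list(values)
        self.units = units

    def __array_getattr__(self, name, wrap):
        # Opt-in: only the names handled here are exposed to outer layers.
        # The (name, wrap) signature is an assumption for this sketch.
        if name == "units":
            return self.units          # plain metadata, nothing to re-wrap
        if name == "magnitude":
            return wrap(self.values)   # array-valued, outer layers re-wrap it
        raise AttributeError(name)


class DimsLayer:
    """Hypothetical outer layer: attaches dimension labels to any wrapped array."""

    def __init__(self, data, dims):
        self.data = data
        self.dims = tuple(dims)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in ("data", "dims"):
            # Guard against infinite recursion on a half-initialized object.
            raise AttributeError(name)
        hook = getattr(self.data, "__array_getattr__", None)
        if hook is None:
            raise AttributeError(name)
        # Forward the lookup inward; array-valued results get the dims back.
        return hook(name, wrap=lambda arr: DimsLayer(arr, self.dims))


x = DimsLayer(UnitsLayer([1.0, 2.0], "m"), dims=("x",))
print(x.units)            # forwarded to the units layer
print(x.magnitude.dims)   # unit stripped, dimension labels kept
```

The point of the sketch is that DimsLayer knows nothing about units, and UnitsLayer knows nothing about dimension labels, yet x.magnitude still carries its dims. Whether this kind of forwarding scales to many layers is exactly the open question.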

I think I got almost everything (including dask support) to work in the prototype as I wanted. Nevertheless, I am far from convinced that this is a desirable solution (I am not entirely sure why, but I fear that it may turn out to be too brittle). So please see this as a starting point for a discussion, not as a proposal.

Highlighting (sorry for the noise; if there are others who should see this, please mention them!): @jthielen @TomNicholas @shoyer @rgommers @rabernat @benbovy @jpivarski