Some of you may have seen my recent discussions with @jthielen in, e.g., Best practices for "carrying through" type-specific operations to wrapped types · Issue #6 · pydata/duck-array-discussion · GitHub. I was exploring the idea of composing arrays from multiple layers of duck arrays, in which each layer provides a distinct feature. Based on my journey through many issues on GitHub (from Xarray via Pint to Awkward and more), various NEPs (such as NEP 41 — First step towards a new datatype system — NumPy Enhancement Proposals), as well as our own needs, I identified the following requirements:
- Dimension labels (as in `xarray.Variable` or the proposed `xarray-lite`)
- Physical units (as in `pint.Quantity`)
- Coordinates (`xarray.DataArray`)
- Vectors or other spatial dtypes (such as linear transforms, see also Conceptualizing `xfield`: an xarray extension package for generic and geospatial scalar and vector fields · Discussion #6331 · pydata/xarray · GitHub)
- Bin edges or cell boundaries (see `scipp` for the former, `xgcm` for the latter)
- Masks (as in `numpy.ma` or `scipp.DataArray`, which provides a dict of masks)
- Uncertainties
- Sparse data
- Ragged data (similar to `awkward` or as in `scipp`)
In my eyes, there are two conflicting requirements: (1) providing simple and self-contained array implementations (i.e., avoiding a single super-package that does all of the above), each of which mainly adds a single feature to an underlying array (such as `pint` adding physical units, or `xarray.Variable` adding dimension labels), and (2) providing a coherent and usable solution. The need for projects such as GitHub - xarray-contrib/pint-xarray: Interface for using pint with xarray, providing convenience accessors is an example of this conflict. Once we extend this approach of dedicated glue packages to more of the duck arrays listed above, it might reach its limits.
Anyway, based on those thoughts I implemented a prototype that demonstrates how multiple duck-array layers can interact without requiring changes to individual layers or special packages for pair-wise (or tuple-wise) interactions. I won't delve into details here for now (see the issue linked at the top for a few examples); the README in the repository points to some relevant places to look at. The central mechanism is the introduction of `__array_getattr__`, which lets duck array implementations opt in to exposing selected properties without dropping important metadata (such as dimension labels) from wrapping layers.
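To make the idea a bit more concrete, here is a minimal sketch of how such a hook could work. This is not the prototype's actual code: the class names `UnitsLayer` and `DimsLayer`, the `rewrap` helper, and the `__array_getattr__(self, wrapper, name)` signature are all assumptions made up for illustration, and a plain Python list stands in for a real array so the snippet has no dependencies.

```python
class UnitsLayer:
    """Hypothetical inner layer: attaches a unit string to some values."""

    def __init__(self, values, units):
        self.values = values  # plain list as a stand-in for a real array
        self.units = units

    def __array_getattr__(self, wrapper, name):
        # Assumed signature: called by a wrapping layer that does not
        # know the attribute `name` itself.
        if name == "units":
            return self.units  # plain metadata, nothing to re-wrap
        if name == "magnitude":
            # Drop the units, but let the wrapper re-wrap the result so
            # that its own metadata (here: dims) is not lost.
            return wrapper.rewrap(self.values)
        raise AttributeError(name)


class DimsLayer:
    """Hypothetical outer layer: attaches dimension labels."""

    def __init__(self, dims, values):
        self.dims = tuple(dims)
        self.values = values

    def rewrap(self, new_values):
        # Same dims, new payload.
        return DimsLayer(self.dims, new_values)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        inner = self.__dict__.get("values")
        hook = getattr(type(inner), "__array_getattr__", None)
        if hook is None:
            raise AttributeError(name)
        return hook(inner, self, name)


speed = DimsLayer(("time",), UnitsLayer([1.0, 2.5], "m/s"))
print(speed.units)  # m/s
bare = speed.magnitude
print(type(bare).__name__, bare.dims, bare.values)  # DimsLayer ('time',) [1.0, 2.5]
```

In this sketch, `DimsLayer` knows nothing about units, and `UnitsLayer` knows nothing about dimension labels, yet `speed.magnitude` returns a `DimsLayer` that still carries `("time",)`. Whether passing the wrapper into the hook is the right design is exactly the kind of question I would like to discuss.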
I think I got almost everything (including dask support) to work in the prototype as I wanted. Nevertheless, I am far from convinced that this is a desirable solution (I am not entirely sure why, but I fear that it may turn out to be too brittle). In other words, please see this as a starting point for a discussion, not as a proposal.
Highlighting (sorry for the noise; if I missed anyone, please mention them!): @jthielen @TomNicholas @shoyer @rgommers @rabernat @benbovy @jpivarski
Related:
- Addition/removal of layers in a nested duck array · Issue #5 · pydata/duck-array-discussion · GitHub
- Representing vector quantities in xarray · Discussion #5775 · pydata/xarray · GitHub
- [Proposal] Expose Variable without Pandas dependency · Issue #3981 · pydata/xarray · GitHub
- Awkward array backend? · Issue #4285 · pydata/xarray · GitHub
- many more?