Creating community standards for meta arrays (arrays that wrap other arrays)

@SimonHeybrock created a recent discussion (Multiple duck array layers: Composability, implementations, and usability concerns) regarding usability concerns with multiple duck array layers, which I would recommend reading if you haven’t encountered use cases where multiply-layered array types are needed. While the discussion there around a prototype protocol for exchange of select properties/attributes between array layers is certainly a crucial concern, it doesn’t quite encompass the full scope of coordination that is still needed within the community about arrays that wrap other arrays. And so, building on community discussions that began over a year ago but have now languished (see GitHub - pydata/duck-array-discussion: A repository for discussing duck array hierarchies with Python), I’d like to revisit these topics from a lens of working towards community standards.

Overview

Several years ago, the combination of NEPs 13 and 18 (the __array_ufunc__ and __array_function__ protocols) allowed NumPy-like arrays to be used interchangeably within much of the NumPy API. This in turn led to a greatly increased potential for compatibility between libraries both implementing and consuming arrays, particularly those that do both (e.g., Dask, Pint, xarray). However, compatibility of array APIs and interaction between array types have remained a challenge across the ecosystem, and so many efforts (e.g., the Python Array API standard; NEPs 30, 31, 35, 37, and 47; SPEC 2) have been (and continue to be) made to improve compatibility between array types.

The progress made in these compatibility efforts has been remarkable; however, one area where community-wide engagement has languished is the interaction (and cooperation) between libraries that both implement and consume arrays (i.e., arrays that wrap other arrays). In late September 2021, many developers from across the PyData ecosystem met to discuss the issues in this space, and a GitHub repo for continuing asynchronous discussion was created. The primary topics of concern were:

  • Nested array reprs
  • Defining pairwise interactions (to avoid cyclic type deferrals)
  • Consistency of type deferral (e.g., between construction, binops, and NumPy ufuncs, functions, and modules)
  • Addition/removal of layers in a nested duck array
  • Best practices for “carrying through” type-specific operations to wrapped types

However, very little engagement occurred following that initial meeting, and so, the many issues that prompted these coordination efforts have remained. For example:

  • When applying dask array functions to an array type that wraps dask (e.g., pint), dask treats that array as wrappable, so we end up with broken constructions like dask wrapping pint wrapping dask
  • Libraries applying inconsistent procedures for determining what array types they can handle vs. not (e.g., Dask Array's NEP-13 and NEP-18 protocols should use common utils · Issue #6672 · dask/dask · GitHub, Consistent Handling of Type Casting Hierarchy · Issue #3950 · pydata/xarray · GitHub)
  • Performing operations specific to an array type nested within other arrays doesn’t have well-defined protocols or APIs, and can frequently require a user to deconstruct and then reconstruct the stack of arrays
  • When a user introduces a custom array type, there is little guidance on what needs to be done for it to behave well with other types in a well-defined type casting hierarchy, especially when existing array-wrapping libraries inconsistently use either allow-lists or deny-lists for what other array types they can wrap

Naming

Prior discussions across issue trackers, discussion boards, and virtual meetings have used several different phrases for “duck arrays that wrap other duck arrays,” including:

  • wrapping arrays
  • meta arrays
  • array containers/container arrays
  • nested/nesting duck arrays

The remainder of this write-up will use “meta arrays” (both following a suggestion by @rgommers and for brevity), however, this can certainly be reconsidered if there is strong sentiment otherwise.

Main Topics

Achieving Consensus on Type Priority Between Types

Background

In order for the dispatch between types to work out consistently and unambiguously, the directed graph of interactions between array types needs to be acyclic, which thereby requires some form of agreement/coordination between duck array libraries (see NEP 13’s definition and discussion of the type casting hierarchy for more details). A simple (and often suggested) approach for arranging these interactions is through a (linked-)list of priorities (as can arise from __array_priority__). However this is insufficient in practice because any given meta array may only have a limited set of array types it can take in, rather than any arbitrary array of lower ranked priority (in the language of graph theory, we cannot treat the type casting hierarchy graph as equivalent to its transitive reduction). Instead, all the specific pairwise interactions between array types need to be defined in some fashion, and, ideally, coordinated and verified between libraries so that an acyclic graph always results.
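To make the acyclicness requirement concrete, here is a minimal sketch (type names are illustrative only, not any real library's API) of representing pairwise "X can wrap Y" interactions as a directed graph and detecting cycles:

```python
def find_cycle(can_wrap):
    """Return a list of type names forming a cycle, or None if the graph is a DAG.

    ``can_wrap`` maps each array type name to the set of type names it wraps.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {node: WHITE for node in can_wrap}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in can_wrap.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:
                # Found a back edge: slice out the offending loop
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(can_wrap):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Roughly the 2020-era consensus: xarray -> pint -> dask -> numpy
dag = {"xarray": {"pint", "dask"}, "pint": {"dask", "numpy"},
       "dask": {"numpy"}, "numpy": set()}
assert find_cycle(dag) is None

# The "dask wrapping pint wrapping dask" breakage corresponds to a cycle
broken = dict(dag, dask={"numpy", "pint"})
assert find_cycle(broken) is not None
```

A shared registry of pairwise interactions could run this kind of traversal whenever a new type's interactions are declared, rejecting any declaration that would introduce a cycle.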

Between major participating libraries, the consensus directed acyclic graph (DAG) of these interactions looked like the following as of 2020:

DAG of 2020 array types

However, this example DAG is

  1. emergent from bespoke handling in each library
  2. poorly adaptable for new meta arrays
  3. subtly broken due to inconsistencies across the kinds of interaction between arrays (such as constructors vs. binops vs. array functions)
  4. incomplete due to interactions that are merely planned or handled by generic protocols but untested
  5. outdated, as it does not account for compatibility efforts since 2020, such as the Python Array API and various NEPs

And so, some key goals for community standards to resolve these concerns are:

  • The directed graph of array type interactions is agreed upon across the community and remains acyclic
  • Introduction of new types to the accepted DAG is easy and can be safely done without introducing cyclicness
  • Ensure that separate kinds of array operations/interactions do not partially introduce cyclicness to the otherwise acyclic graph of interactions

Proposing a Common Library to Manage Type Hierarchy

The September 2021 coordination meeting came to a loose consensus that a common library to manage the type casting hierarchy for meta arrays should be created. This way, there cannot be inconsistencies among participating libraries, any conflicts would be resolved publicly in the same place, and hooks could be introduced for one-off inclusion of a user-defined array type within the “official” DAG.

Features in this shared library could include:

  • Registering or collecting meta array types in the present environment
  • Checking/verification that the DAG works out
    • optionally, or
    • with enforcement (including utils to raise errors where relevant)
  • Providing utilities that meta arrays can (or must?) use in their implementations of wrapping/binops/__array_ufunc__/__array_function__/array function modules
  • Hooks to introduce user-defined meta arrays into the DAG

Even with a shared library for coordinating meta arrays, the additional question of how these pairwise interactions are defined (by what mechanism or protocol) remains. Several ideas have been suggested previously.

Finally, it is uncertain where this kind of shared library should be hosted (Python for Data · GitHub, or elsewhere?) or what it should be named.

I would love to see further discussion of these items below, but given the poor past results I’ve seen with merely raising the topic, I’ll give an initial concept to start things off. Please feel free to critique, adapt, or replace this concept below!

@jthielen’s Initial Concept for Common Library

  1. Create a package titled metaarray-management hosted within PyData (https://github.com/pydata/metaarray-management)
  2. Ensure critical mass of meta array libraries agree to participate (need at least Xarray, Dask, and Pint for this to be meaningful)
  3. Use NEP 37’s __array_module__ as protocol for any meta array to define what types it handles
    • In practice, the check will then look like:
      is_handled = this_type.__array_module__((other_type,)) is not NotImplemented
      
    • This will have the side-effect of enforcing (at least some degree of) NEP 37 support in all meta array libraries
  4. metaarray-management will be based around a run-time registry of meta array types and DAG checking. For performance reasons, registry assembly and DAG checks will only occur on demand. Nightly CI for metaarray-management will operate on a list of participating meta array types to ensure that compliance does not break. Participating libraries agree to treat any such failure traceable to their meta array as a bug.
  5. Participating libraries need not import/have dependency on metaarray-management, however, they may optionally do so to take advantage of meta array utils that end up being usefully reused across meta array types, or for their own compatibility checks.
  6. Participating libraries agree to provide an entrypoint (or other well-documented procedure) for adding a new type to its list of wrappable types, and contribute this for inclusion within metaarray-management.
  7. User-defined meta arrays must adopt NEP 37 (to define what arrays they can handle), as well as register with metaarray-management which existing meta array types should be able to wrap this new array. metaarray-management will then verify that the DAG is maintained, or else reject this new meta array’s specifications.
  8. Python Array API implementors that are not meta arrays need not be concerned with metaarray-management, as they can always be considered to exist at an endpoint/bottom of the DAG. It will be assumed that a meta array can choose to support Python Array API implementors that are not meta arrays generically, rather than having an explicit list of all such types. To enable this, all meta arrays must implement a detectable protocol so that they can be readily distinguished from non-meta arrays (see __meta_array_getattr__ below for such a protocol suggestion).
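As a rough illustration of the `__array_module__`-based check in step 3, here is a toy mock (all class names are hypothetical stand-ins, not real Pint/Dask code):

```python
import types

class ToyDuck:
    """A plain duck array with no wrapping behavior."""

class ToyMeta:
    """A meta array whose allow list contains only ToyDuck."""
    _HANDLED_TYPES = (ToyDuck,)

    @classmethod
    def __array_module__(cls, array_types):
        # NEP 37 style: return a module-like namespace if all types are
        # handled, otherwise NotImplemented
        if all(issubclass(t, cls._HANDLED_TYPES + (cls,)) for t in array_types):
            return types.SimpleNamespace(name="toy_meta_module")
        return NotImplemented

def is_handled(meta_type, other_type):
    return meta_type.__array_module__((other_type,)) is not NotImplemented

assert is_handled(ToyMeta, ToyDuck)
assert not is_handled(ToyMeta, dict)  # unrelated types are rejected
```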

Recommendation of “Allow List” over “Deny List”

Among existing packages, the aforementioned interactions have been defined informally through independent/ad-hoc implementations in each meta array library. Two main approaches have arisen so far:

  • Have an “allow list” of types that this array type can handle/wrap, and defer to any other (e.g., Dask)
  • Have a “deny list” of types to which this array type defers, and assume any other “sufficiently array-like” type can be handled/wrapped (e.g., Pint, but also xarray if the “deny list” is effectively empty)

For a limited set of commonly-used array types in the PyData stack, this has often worked out in practice so far. However, as the number of duck array libraries increases, maintaining agreement between libraries through these existing independent approaches becomes difficult.

Based on consensus from prior discussions, and the requirements of shared management of the type casting hierarchy, I would propose a community standard that all meta arrays should operate off of an “allow list” approach rather than “deny list.” The only exception to this should be non-meta-array Python Array API implementors, which could reasonably be supported generically (as there is no concern about such generic implementation introducing cyclicness to the type casting hierarchy).
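A minimal sketch of the allow-list approach in the NEP 13 style (the class and its allow list are hypothetical; a real implementation would additionally treat non-meta-array Array API implementors generically, per the exception above):

```python
import numpy as np

class AllowListArray:
    # Explicit allow list: anything not listed here is deferred to
    _HANDLED_TYPES = (np.ndarray, int, float)

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        for x in inputs:
            if not isinstance(x, self._HANDLED_TYPES + (AllowListArray,)):
                return NotImplemented  # defer to the unknown type
        arrays = [x.data if isinstance(x, AllowListArray) else x for x in inputs]
        return AllowListArray(getattr(ufunc, method)(*arrays, **kwargs))

a = AllowListArray([1.0, 2.0])
assert np.allclose(np.add(a, 1.0).data, [2.0, 3.0])

class Unknown:
    """Stands in for an array type outside the allow list."""

assert a.__array_ufunc__(np.add, "__call__", a, Unknown()) is NotImplemented
```

With a deny list, the `Unknown` type would instead be silently wrapped, which is exactly how cyclic constructions like dask-wrapping-pint-wrapping-dask arise.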

Consistency Between Forms of Interaction

As hinted at by the language of “array interaction” previously, there are several different interactions in practice in which handling/deferral needs to be considered. As far as I can recall, these are:

  • Meta array layer nesting order
  • Constructors
  • Binary operations (addition, multiplication, etc.)
  • __array_ufunc__ (NEP 13)
  • __array_function__ (NEP 18)
  • __array_module__ (NEP 37) and other importable modules of array functions (e.g., dask.array)

Presently, the consistency of type handling between these different forms of interaction varies. For example, Pint Quantities are designed to be consistent across these forms (where they exist, as Pint doesn’t have an array function module/namespace…yet), whereas Dask arrays are deliberately inconsistent (e.g., deferring to other types in binops and __array_*__, but assuming user intent to override the standard deferral in construction/wrapping and array functions), and (at least as I see it) xarray is accidentally inconsistent (see pydata/xarray#3950). The main sticking point in the discussions around each of these respective implementations can be summarized as “How much should we trust users not to ‘break’ the DAG through inconsistencies in type deferral?” or “How much should consistency be enforced?”

Consensus (or at least absence of any objections) from the September 2021 meeting supports that all these forms of interaction should be consistent. It will be up to every meta array implementor to do so, however, the previously proposed type hierarchy management library could both provide reusable utils to do so and run test suites to verify consistency.
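One reusable pattern such a library could offer (names here are illustrative, not an actual API) is routing every deferral decision, across binops and __array_ufunc__ alike, through a single predicate:

```python
import numpy as np

class ConsistentArray:
    _HANDLED_TYPES = (np.ndarray, int, float)

    def __init__(self, data):
        self.data = np.asarray(data)

    def _defers_to(self, other):
        """Single source of truth for deferral, shared by all interaction forms."""
        return not isinstance(other, self._HANDLED_TYPES + (ConsistentArray,))

    def __add__(self, other):
        if self._defers_to(other):
            return NotImplemented
        other_data = other.data if isinstance(other, ConsistentArray) else other
        return ConsistentArray(self.data + other_data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if any(self._defers_to(x) for x in inputs):
            return NotImplemented
        arrays = [x.data if isinstance(x, ConsistentArray) else x for x in inputs]
        return ConsistentArray(getattr(ufunc, method)(*arrays, **kwargs))

a = ConsistentArray([1.0])
# Both interaction forms agree on exactly what gets deferred:
assert a.__add__(object()) is NotImplemented
assert a.__array_ufunc__(np.add, "__call__", a, object()) is NotImplemented
assert np.allclose((a + 1.0).data, [2.0])
```

Because both paths consult the same `_defers_to`, the kind of partial cyclicness described above cannot be introduced by one interaction form drifting out of sync with another.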

At the very least, adopting this stance of “all forms of interaction between meta arrays must be consistent with respect to an agreed-upon type hierarchy” will require some changes to participating libraries.

TL;DR

When multiple meta array types exist, especially in combination, we need coordination and agreement to ensure that cyclicness is not introduced to the hierarchy of interactions. And so, we need community standards among meta arrays regarding:

  • a common library for managing the type hierarchy
  • how meta arrays determine what arrays they can wrap
  • the need for differing forms of array interactions to be consistently handled

Designing Protocols/Standards for Exchange Between Array Layers

Once the problem of verifying that meta arrays can consistently and robustly “play nice” with each other is addressed (prior section), there still remain several usability concerns. Particularly, what should the APIs for multiply nested meta arrays look like in practice, and how do we ensure that type-specific details within a given nested array construct are handled in a robust and easy-to-implement way?

reprs

Once a meta array is constructed, especially one with multiple nested array layers, a foremost user concern is the repr of this meta array: how are the most relevant details of each array layer exposed through the display of the outermost object? Reprs have been one of the oldest discussion points on the topic of meta arrays (see Design: nested _meta · Issue #5329 · dask/dask · GitHub and reprs of nested duck arrays · Issue #6637 · dask/dask · GitHub), and to this day there are still no well-defined expectations/standards. Instead, most meta array libraries implement some bespoke form of nested/recursive inheritance from the wrapped type (e.g., xarray’s usage of _repr_html_ and _repr_inline_).

Since several suggestions have been raised previously, I will simply present them here rather than suggest a single solution.

@keewis suggestion from Sep 2020

Simply including the repr of the wrapped array usually results in reprs that are way too long, so instead we could introduce a protocol (or method) that returns a mapping of type names to metadata, and a function to call that protocol. If a duck array does not implement the protocol, the default would be to return {typename: {}}. For example, calling repr(arr) where arr is a xarray(pint(dask(sparse))) would make xarray call the data’s __metadata__ and use that to construct its repr. This could be something like

def __metadata__(self):
    wrapped_metadata = self._data.__metadata__()
    metadata = {
        "a": str(self.a),
        "b": str(self.b),
    }
    # or something like {**wrapped_metadata, **{type(self).__qualname__: metadata}}
    return wrapped_metadata | {type(self).__qualname__: metadata}

In the example, the result of the first call could be something like:

{
    "sparse.COO": {"shape": (30, 1000), "dtype": "float64", "nnz": 768, "fill_value": 0},
    "dask.array.Array": {"chunks": (10, 100), "shape": (30, 1000), "dtype": "float64"},
    "pint.Quantity": {"units": "m", "shape": (30, 1000), "dtype": "float64"},
}

with that, xarray could either manually format the repr or use a helper function (which should probably have a max_width parameter).

This doesn’t work for duck arrays wrapping more than a single duck array, though. Also, I’m not sure if the type name should be the type’s __qualname__, and if shape and dtype, which every duck array has to implement as properties, should be included.

xref pydata/xarray#4324, pydata/xarray#4248
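To make the fallback behavior concrete, here is a runnable sketch of that default (toy classes; `get_metadata` is a hypothetical helper standing in for “a function to call that protocol”):

```python
def get_metadata(arr):
    """Call the ``__metadata__`` protocol, with the ``{typename: {}}`` default."""
    meta = getattr(arr, "__metadata__", None)
    if meta is not None:
        return meta()
    return {type(arr).__qualname__: {}}

class PlainDuck:
    """A duck array that does not implement the protocol."""

class UnitLayer:
    """A toy units layer that merges its metadata with the wrapped array's."""
    def __init__(self, data, units):
        self._data, self._units = data, units

    def __metadata__(self):
        wrapped = get_metadata(self._data)
        return wrapped | {type(self).__qualname__: {"units": self._units}}

meta = get_metadata(UnitLayer(PlainDuck(), "m"))
assert meta == {"PlainDuck": {}, "UnitLayer": {"units": "m"}}
```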

@hameerabbasi suggestion from Sep 2021

I thought about this, and calculating the repr is indeed a hard problem, but it becomes easy if we exclude distributed/chunked arrays.

What I propose is:

* every meta array have a `__array_meta__(): duck_array` which returns the inner array (as one can see for distributed arrays, this could require materialising the array).

* In addition, there should be an `__array_attrs__(html: bool): Dict[str, Union[int, bool, float, np.generic, str]]` which should be on ALL arrays that want to be included into meta-arrays. This method returns a `dict` containing the elements it would like displayed. One can then use an implementation like the following:
def __repr__(self):
    attrs = {}
    type_names = []
    current_arr = self
    while current_arr is not None:
        # This formulation guards against duplicate attrs
        attrs = dict(**attrs, **current_arr.__array_attrs__(False))
        type_names.append(type(current_arr).__name__)
        current_arr = getattr(current_arr, "__array_meta__", lambda: None)()
    return "<{super_type_name}: {attrs_repr}>".format(
        super_type_name=".".join(type_names),
        attrs_repr=", ".join(f"{k}: {v!r}" for k, v in attrs.items())
    )

For a chunked representation, one could also have a third slot, __attr_aggregate__(attr_name): Callable[[Iterable[Union[int, bool, float, np.generic, str]]], Union[int, bool, float, np.generic, str]] which could aggregate the given attribute. This way, Dask (or other distributed/chunked array libraries) could easily aggregate the attrs.
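A toy illustration of how that third slot could work (class names and the choice of `sum` as the aggregate are hypothetical):

```python
class Chunk:
    """A toy per-chunk array exposing one attribute through __array_attrs__."""
    def __init__(self, nnz):
        self.nnz = nnz

    def __array_attrs__(self, html):
        return {"nnz": self.nnz}

    @staticmethod
    def __attr_aggregate__(attr_name):
        # Summing is the sensible aggregate for "nnz"; fall back to the
        # first chunk's value for anything else
        return sum if attr_name == "nnz" else (lambda vals: next(iter(vals)))

class ChunkedArray:
    """A toy chunked container that aggregates chunk attrs for its own repr."""
    def __init__(self, chunks):
        self.chunks = chunks

    def __array_attrs__(self, html):
        agg = type(self.chunks[0]).__attr_aggregate__("nnz")
        return {"nnz": agg(c.__array_attrs__(html)["nnz"] for c in self.chunks)}

arr = ChunkedArray([Chunk(10), Chunk(5), Chunk(3)])
assert arr.__array_attrs__(False) == {"nnz": 18}
```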

Handling Wrapped-Type-Specific Details from Outermost Wrapping Array

Type Casting (Layer Addition/Removal) Within Layers of a Nested Array

It is often necessary to “mutate” what array layers exist within a multiply nested meta array (e.g., downcasting contents to NumPy, collecting a distributed array, stripping units). However, without high-level custom handling for particular nested structures (such as exists in compatibility packages like pint-xarray), this kind of inter-layer type casting requires manual deconstruction and reconstruction of the meta array, which is a poor user experience. This is a rather nebulous topic, as many array types have unique APIs for casting to other array types, and the problem can become complex rather quickly with explicit (non-protocol) support for many types (see sparse and other duck array issues · Issue #3245 · pydata/xarray · GitHub for discussion of many .to_* or .as_* methods).

Initial comments by @SimonHeybrock suggest that a well-designed property/method attribute protocol could solve this (e.g., Pint’s .magnitude property functions as a downcast to whatever Pint is wrapping), and so, I won’t offer any direct suggestions here, and instead defer to the next item.

__meta_array_getattr__ Protocol

This is an adaptation of the initial suggestion by @SimonHeybrock in Multiple duck array layers: Composability, implementations, and usability concerns, but renamed and with added discussion.

We can introduce a new protocol through which array layers in a meta array can expose a) select attributes directly or b) namespaced attribute collections for composable handling by parent meta arrays apart from standard Python __getattr__ handling (which is prone to namespace cluttering and/or collision).

To start, it is built around a nesting protocol of (optional) wrapped attributes like the following:

no_op = lambda x: x

class MetaArraySample:
    
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)
    
    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units". "magnitude" is
        # expected to be re-wrapped by parent meta array, but "units" should be
        # returned outright. Both are supported.
        contents = self._magnitude
        if name == 'magnitude':
            return wrap(contents)
        elif name == 'units':
            return self._units
        
        # Now, allow inheriting from wrapped contents
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        if hasattr(contents, '__meta_array_getattr__'):
            return contents.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )

When the number of “exchanged” attributes per layer is small, this direct handling of select attributes works well. However, many use cases would require further attributes (up to the entire public API of the array type minus standard Array API attributes), which would pollute any top-level meta array’s effective namespace. This could require an encapsulating object within the protocol to act as a namespace for a particular array layer (a la xarray’s accessor interface):

no_op = lambda x: x

class MetaArrayAttributeContainer:
    
    def __init__(self, metaarray, wrap):
        self._metaarray = metaarray
        self._wrap = wrap
        
    def to(self, other):
        return self._wrap(self._metaarray.to(other))
    
    def to_base_units(self):
        return self._wrap(self._metaarray.to_base_units())
    
    def m_as(self, other):
        return self._wrap(self._metaarray.m_as(other))
    
    @property
    def dimensionality(self):
        return self._metaarray.dimensionality

    
class MetaArraySample:
    
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)
    
    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units" as direct
        # attributes, and others via "pint" pseudo-namespace.
        contents = self._magnitude
        if name == 'pint':
            return MetaArrayAttributeContainer(self, wrap)
        elif name == 'magnitude':
            return wrap(contents)
        elif name == 'units':
            return self._units
        
        # Now, allow inheriting from wrapped contents
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        if hasattr(contents, '__meta_array_getattr__'):
            return contents.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )

Additionally, the single-content wrapping approach used in this initial example may not be robust enough for multiple contents (e.g., masked arrays). While there may be several approaches to handling this case, the scippx prototype (Copyright (c) 2022 Scipp contributors, reused with modification under BSD-3-Clause license) does so like the following:

def make_wrap(wrap, attr, extra_cols=None):

    def do_wrap(obj, *args, **kwargs):
        # case rewrap_result:
        # obj is result of call with args and kwargs, need those to apply to
        # extra_cols
        if extra_cols is None:
            return wrap(obj)
        cols = []
        for col in extra_cols:
            try:
                proto_col = col.__meta_array_getattr__(attr, wrap=no_op)
            except AttributeError:
                cols.append(col)
            else:
                if hasattr(proto_col, 'shape'):
                    cols.append(proto_col)
                else:  # callable we got from rewrap_result
                    cols.append(proto_col(*args, **kwargs))
        return wrap((obj, ) + tuple(cols))

    return do_wrap


class MetaArraySample:
    
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)
    
    def _forward_meta_array_getattr_to_content(self, name, wrap):
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        content = self._unwrap_content(self)
        if isinstance(content, tuple):
            content, *extra_cols = content
            wrap_with_self = make_wrap(wrap_with_self, name, extra_cols)
        if hasattr(content, '__meta_array_getattr__'):
            return content.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )
    
    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units" as direct
        # attributes, and others via "pint" pseudo-namespace.
        if name == 'pint':
            return MetaArrayAttributeContainer(self, wrap)
        elif name == 'magnitude':
            return wrap(self._magnitude)
        elif name == 'units':
            return self._units
        
        # Now, allow inheriting from wrapped contents
        return self._forward_meta_array_getattr_to_content(name, wrap)

The scippx prototype also includes an unwrap argument in this protocol, for which the need may arise in certain use cases. I personally have not grasped its utility, and so am refraining from demonstrating it here. However, it certainly merits discussion as we work towards an eventual standard if this style of side-chained & wrapped attribute protocol is endorsed by the community.
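Putting the pieces above together, here is a toy end-to-end run of the wrap-chaining behavior, with a Pint-like units layer nested inside an outer labeling layer (both classes are illustrative stand-ins, not any real library's API):

```python
no_op = lambda x: x

class UnitsLayer:
    """Toy inner layer (Pint-like): exposes magnitude (re-wrapped) and units."""
    def __init__(self, magnitude, units):
        self._magnitude, self._units = magnitude, units

    def __meta_array_getattr__(self, name, wrap):
        if name == 'magnitude':
            return wrap(self._magnitude)  # parent layers re-wrap this
        if name == 'units':
            return self._units  # returned outright, no re-wrapping
        raise AttributeError(name)

class OuterLayer:
    """Toy outer layer that forwards unknown attributes down the stack."""
    def __init__(self, data, label):
        self._data, self._label = data, label

    def _rewrap_content(self, contents):
        return OuterLayer(contents, self._label)

    def __getattr__(self, name):
        # As top level, start the chain with an identity wrap
        return self.__meta_array_getattr__(name, wrap=no_op)

    def __meta_array_getattr__(self, name, wrap):
        wrap_with_self = lambda c: wrap(self._rewrap_content(c))
        if hasattr(self._data, '__meta_array_getattr__'):
            return self._data.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(name)

outer = OuterLayer(UnitsLayer([1, 2, 3], 'm'), label='x')
mag = outer.magnitude          # units stripped, outer layer re-wrapped
assert isinstance(mag, OuterLayer) and mag._data == [1, 2, 3]
assert outer.units == 'm'      # plain attribute passed straight through
```

Note how the user never had to deconstruct and reconstruct the stack: the units layer decided what to re-wrap, and the outer layer supplied the re-wrapping via the `wrap` chain.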

This protocol would remove the need for existing __getattr__ deferral mechanisms (like that present in Pint) and partially obviate the need for compatibility packages (like pint-xarray). For the latter, compatibility needs solely for the wrapped array (e.g., pint-specific things, in case of pint-xarray) would be obviated, however, integration of otherwise unsupported features in the wrapping array (e.g., unit-aware xarray API features not otherwise supported out-of-the-box) would still benefit from such a compatibility package. This would further imply naming conflicts (i.e., should the “pint” attribute on an xarray object point to the pint-xarray accessor or to the encapsulated pint namespace through the meta array protocol?), but those would likely be best worked out independently when encountered, rather than addressed in a general meta array community standard.

Also, once a protocol is standardized, the previously referenced “common library” could include utilities to ease implementation within meta array libraries.

TL;DR

It would be highly beneficial for the community to develop a set of standards and protocols for meta arrays to seamlessly handle reprs as well as nested-array-specific operations from the outermost array object (i.e., the one with the exposed API surface) without falling back to a paradigm of deconstruct & reconstruct and/or requiring bespoke code for every possible nested array interaction.

Additional Discussion

Relation to Python Array API

Many prior discussions in this area have referenced the Python Array API. However, the issues referenced above in need of community standards are orthogonal to the Array API: these are specifically about meta arrays and the interactions between different levels of these arrays that wrap other arrays, rather than defining a shared API surface that could be targeted generically by any array-consuming library. That isn’t to say that there aren’t mutual considerations (e.g., the previously mentioned non-meta-array Array API implementors being generically considered to be at the bottom of the DAG). Regardless, I feel it best to move these meta array efforts forward within the Scientific Python community separately from the Python Array API standards.

Concluding Summary (or, even the TL;DRs were TL)

The Scientific Python community has seen some amazing work in the past several years around community standards for array-providing libraries to streamline compatibility within array-consuming libraries. However, there are libraries that do both (provide and consume arrays), which we can refer to as meta array libraries, and standardization work in this area (especially on concerns arising from using multiple meta arrays in combination) has languished. I propose we promptly move forward as a community by establishing standards and protocols in two major areas:

  • Achieving Consensus on Type Priority Between Array Types
  • Designing Protocols/Standards for Exchange Between Array Layers

Please share your thoughts below!


Thanks for that detailed write-up!

I think I agree with most of what you wrote, but I would like to carefully question the DAG approach. I may not have fully understood your suggestion, but I fear it is not sufficient, and creates a structure that is too rigid. I will try to give a couple of examples, but I am sure there must be more:

1. Mask array and physical units

The DAG shows pint wrapping numpy.ma. As masks have no units, this requires that pint knows about numpy.ma. This implies:

  • The approach cannot work with other masking implementations (such as Scipp’s, which provides a dict of masks), as we certainly do not want to change pint for such other implementations?
  • With NEP 41 — First step towards a new datatype system — NumPy Enhancement Proposals, units support could be moved into the NumPy dtype. This implies that the units are now suddenly on the inside, and numpy.ma is on the outside. Why did the order change? Is the DAG wrong? Yes, I know that with the units in the dtype this would not be part of the DAG any more, but conceptually we still have an inversion of concepts.

2. Spatial library for vectors, rotations, and other transformations

I don’t think this exists right now, but I do remember discussions on better support for vectors in Xarray, and a meta array could be one solution for this. Scipp handles this using custom dtypes (not NumPy dtypes) and provides 3-D vectors, rotations, linear_transforms, and affine_transforms. But it would be simple to implement a meta array that wraps a shape=(n, 3) array to provide a shape=(n,) array of vectors, or a dict of arrays could be used (one item per vector component). Now, masking vector or matrix elements is probably not useful, so when we have masks we would use numpy.ma to wrap our VectorArray. Then:

  • Often Pint->MaskArray->VectorArray would be fine…
  • … but people with other applications may have vectors that have different dimensions (units) for different components (I cannot think of a good example right now, Phase space - Wikipedia may be one, but I am sure I have seen others). Thus, we would need MaskArray->VectorArray->Pint, but this would break the DAG. We would thus artificially limit what our meta array can do.
  • Our hypothetical spatial library might provide AffineTransform. This has a rotation-part (a matrix) and a translation-part (a vector). Each of them would need different physical units, but as before this would break the DAG.

I think I agree with most of what you wrote, but I would like to carefully question the DAG approach. I may not have fully understood your suggestion, but I fear it is not sufficient, and creates a structure that is too rigid. I will try to give a couple of examples, but I am sure there must be more:

[…]

@SimonHeybrock Thank you for sharing your concerns! While I had thought that the following points would be implied in my (already too) lengthy post above, it does appear that there was some misunderstanding, and so I sincerely apologize. Having more people review these discussions is wonderful, as it helps identify things I assumed were sufficiently implied but may not have been clear to other readers! So, hopefully these points clear up those implications:

  1. The example DAG shown in the Background section is only that, an example. The ecosystem has already changed since 2020 (when that structure arose), and it will certainly change further if array types like those your comment suggests (a masked array type that isn’t an ndarray subclass, physical units on the dtype, VectorArray-likes, etc.) become well-established in the community. Note that all the proposals in that section are about having community-wide agreement on coordinating meta array interactions in the first place, not about what exactly that DAG would look like in 2022 and the years to come. Those would instead be secondary discussions for after the community decides how such interactions would be managed.

  2. A directed acyclic graph is already the most general model we have that doesn’t lead to ill-defined interactions. A directed graph of interactions can always be fit to a given set of array types (for a given interaction type, or if all forms of interaction are consistent), and acyclicity is a necessary condition for an unambiguous partial ordering (without it, very bad things happen: you could end up with an unresolvable loop of types deferring to each other, or multiple array layers of the same type ending up in a single nest without awareness of each other). Please read the relevant sections of NEP 13 if you haven’t already for more detail on this. Any model other than a DAG (such as a priority list) that still maintains unambiguous partial ordering would end up as a subset of what a DAG could depict. And so, in later discussions about how meta array interactions are decided upon, acyclicity must be a first-order constraint.
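As an illustration of why acyclicity matters for dispatch, here is a sketch of how an agreed-upon DAG could drive the NEP 13-style decision of which operand defers; the DAG contents and function names are purely illustrative, not a proposed standard:

```python
# An edge A -> B means "A wraps B": A handles the interaction and B
# must defer (return NotImplemented) when it encounters an A.
# This example hierarchy is illustrative only.
DAG = {
    "Dask": {"Pint"},        # Dask may wrap Pint
    "Pint": {"MaskArray"},   # Pint may wrap MaskArray
    "MaskArray": set(),
}

def outranks(a, b, dag=DAG):
    """True if type `a` sits above `b` in the DAG (i.e., `a` may
    wrap `b`), following edges transitively."""
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        for child in dag.get(node, ()):
            if child == b:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False

def should_defer(self_type, other_type):
    # Mirrors NEP 13's rule: the lower layer returns NotImplemented
    # so the higher-priority operand gets a chance to handle the op.
    return outranks(other_type, self_type)

print(should_defer("MaskArray", "Dask"))  # True: MaskArray defers
print(should_defer("Dask", "MaskArray"))  # False: Dask handles it
```

If the graph contained a cycle, `should_defer` could return True for both operands at once, and the operation would bounce between them unresolved, which is exactly the ambiguity described above.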

For example, in relation to your examples: if this new VectorArray can have components with differing units, and Pint.Quantity only describes an array with a single unit, then VectorArray needs to come above Pint.Quantity, simple as that. This would indeed mean that, for masked arrays to behave as you expect, we would need a new masked array type that exists toward the top of the type casting hierarchy, rather than near the bottom (and this is important to note; the only reason it has been there up to now is that the canonical implementation is a NumPy subclass, which constrains it to exist below non-subclass encapsulating types). If you can wrap your mind around prioritizing acyclicity above existing implementation details, and instead let well-defined interactions guide the implementations, then it’s hard for me to see the issue with framing our cross-library coordination efforts in the context of an agreed-upon DAG of meta array interactions.

I understood that. My point is that it is likely impossible to define a fixed or generic DAG. One DAG per application maybe, but even there I would be hesitant to bet on it. In other words, I think a DAG may not be sufficient and will limit what future users and libraries can do in a problematic manner.

I agree, but also disagree: Yes, for a single array we can use a DAG and everything must be acyclic. But my point is that as soon as we deal with multiple arrays (data, masks, coordinates, multiple data arrays, datasets, …) a different DAG may be required for each of them. Within each of the examples from my earlier post you can certainly find a DAG that works. But I believe you cannot, at least not in all cases, find a single DAG that conceptually captures all the cases that are needed within a single application (let alone a single DAG for all applications).

@SimonHeybrock Thank you for elaborating on your points, and I apologize for previously misunderstanding what you had to say! It does seem now that the prioritization of consistently unambiguous partial ordering vs. flexibility for any conceivable use case is a fundamental disagreement between our viewpoints. Though perhaps you have seen a solution that I have not? Do you happen to have an alternative proposal for how to always maintain unambiguous partial ordering of array interaction priorities while allowing more application-by-application flexibility than a single community-standard DAG would permit? If not, then I believe we are at an impasse, since I cannot conceptualize forsaking unambiguous partial ordering as a core constraint.

@jthielen No need to apologize, I was definitely too vague! I do not think I have a solution. That being said, my claim that a single DAG is not possible should probably be qualified, as it depends on what requirements you have, i.e., how ambitious the entire meta-array mechanism is and how you define the scope.

If we limit ourselves to meta arrays that contain exactly one array then one can probably come up with a unique DAG (though I need to think about this some more). So the question is then whether meta arrays containing multiple arrays is useful or required. I see the following applications — most but not all of these could be implemented as AOS (array-of-structures), but I consider SOA (structure-of-arrays) more useful and simpler in many situations:

  • MaskedArray: built from array of values and one or multiple arrays for the mask(s).
  • UncertaintiesArray: array of values and array of variances or standard-deviations.
  • RecordArray: unlike NumPy structured dtype, stored as multiple arrays, i.e., SOA.
  • Ragged data: like Awkward and Scipp, e.g., array of start/stop indices + content array (which could be a RecordArray, a 1-D DataArray, or a pd.DataFrame).
  • VectorArray: could be one array per component, useful if we need different units for different vector components.
  • AffineTransformArray: a spatial rotation and translation.
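As a sketch of what one of these SOA meta arrays could look like, here is a hypothetical UncertaintiesArray; the class name, API, and error-propagation rule (independent uncertainties, so variances add under addition) are illustrative assumptions:

```python
import numpy as np

class UncertaintiesArray:
    """Hypothetical SOA meta array: one array of values plus one
    array of variances, rather than a single structured-dtype array."""

    def __init__(self, values, variances):
        self.values = np.asarray(values)
        self.variances = np.asarray(variances)
        if self.values.shape != self.variances.shape:
            raise ValueError("values and variances must share a shape")

    def __add__(self, other):
        # Assuming independent errors: variances add under addition.
        return UncertaintiesArray(self.values + other.values,
                                  self.variances + other.variances)

a = UncertaintiesArray([1.0, 2.0], [0.1, 0.1])
b = UncertaintiesArray([3.0, 4.0], [0.2, 0.3])
c = a + b
print(c.values)     # [4. 6.]
print(c.variances)  # [0.3 0.4]
```

The two component arrays here could themselves be any duck arrays, which is precisely where the layering questions above come back in.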

If we consider how these may interact, e.g., with Pint, Xarray, and Dask, things can get tricky. For example, it would be more useful if Dask were not the lowest layer, as otherwise we will have multiple Dask arrays within a single meta array. But before going into more detailed considerations, I would like to hear whether we should consider these as in scope or out of scope.
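The layering concern can be illustrated with stand-in classes (no real Dask here; `Chunked` and `MaskedSOA` are hypothetical placeholders for a dask.array-like layer and an SOA masked array):

```python
import numpy as np

class MaskedSOA:
    """Stand-in SOA masked array holding two separate leaf arrays."""
    def __init__(self, values, mask):
        self.values, self.mask = values, mask
    def leaf_arrays(self):
        return [self.values, self.mask]

class Chunked:
    """Stand-in for a dask.array-like chunked layer."""
    def __init__(self, inner):
        self.inner = inner

# Chunked at the bottom: the masked array holds two independent
# chunked arrays (two separate task graphs to coordinate).
bottom = MaskedSOA(Chunked(np.zeros(4)), Chunked(np.zeros(4, bool)))
print(sum(isinstance(x, Chunked) for x in bottom.leaf_arrays()))  # 2

# Chunked on top: a single chunked layer wraps the whole SOA
# container, so there is only one graph.
top = Chunked(MaskedSOA(np.zeros(4), np.zeros(4, bool)))
print(isinstance(top, Chunked))  # True
```

This is only a toy illustration of the counting argument, not a claim about how Dask would actually treat such a container.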