@SimonHeybrock created a recent discussion (Multiple duck array layers: Composability, implementations, and usability concerns) regarding usability concerns with multiple duck array layers, which I would recommend reading if you haven’t encountered use cases where multiply-layered array types are needed. While the prototype protocol discussed there for exchanging select properties/attributes between array layers is certainly a crucial concern, it doesn’t quite encompass the full scope of coordination still needed within the community around arrays that wrap other arrays. And so, building on community discussions that began over a year ago but have since languished (see GitHub - pydata/duck-array-discussion: A repository for discussing duck array hierarchies with Python), I’d like to revisit these topics through the lens of working towards community standards.
Overview
Several years ago, the combination of NEPs 13 and 18 (the `__array_ufunc__` and `__array_function__` protocols) allowed NumPy-like arrays to be used interchangeably within much of the NumPy API. This in turn greatly increased the potential for compatibility between libraries implementing and consuming arrays, particularly those that do both (e.g., Dask, Pint, xarray). However, compatibility of array APIs and interaction between array types have remained a challenge across the ecosystem, and so many efforts (e.g., the Python Array API standard; NEPs 30, 31, 35, 37, and 47; SPEC 2) have been (and continue to be) made to improve compatibility between array types.
The progress made in these compatibility efforts has been remarkable; however, one area where community-wide engagement has languished is the interaction (and cooperation) between libraries that both implement and consume arrays (i.e., arrays that wrap other arrays). In late September 2021, many developers from across the PyData ecosystem met to discuss the issues in this space, and a GitHub repo for continuing asynchronous discussion was created. The primary topics of concern were:
- Nested array `repr`s
- Defining pairwise interactions (to avoid cyclic type deferrals)
- Consistency of type deferral (e.g., between construction, binops, and NumPy ufuncs, functions, and modules)
- Addition/removal of layers in a nested duck array
- Best practices for “carrying through” type-specific operations to wrapped types
However, very little engagement occurred following that initial meeting, and so the many issues that prompted these coordination efforts remain. For example:
- When applying dask array functions to an array type that wraps dask (e.g., pint), dask treats that array as wrappable, so we end up with broken constructions like dask wrapping pint wrapping dask
- Libraries apply inconsistent procedures for determining what array types they can handle vs. not (e.g., Dask Array's NEP-13 and NEP-18 protocols should use common utils · Issue #6672 · dask/dask · GitHub, Consistent Handling of Type Casting Hierarchy · Issue #3950 · pydata/xarray · GitHub)
- Performing operations specific to an array type nested within other arrays doesn’t have well-defined protocols or APIs, and can frequently require a user to deconstruct and then reconstruct the stack of arrays
- When a user introduces a custom array type, there is little guidance on what needs to be done for it to behave well with other types in a well-defined type casting hierarchy, especially when existing array-wrapping libraries inconsistently use either allow-lists or deny-lists for what other array types they can wrap
Naming
Prior discussions across issue trackers, discussion boards, and virtual meetings have used several different phrases for “duck arrays that wrap other duck arrays,” including:
- wrapping arrays
- meta arrays
- array containers/container arrays
- nested/nesting duck arrays
The remainder of this write-up will use “meta arrays” (both following a suggestion by @rgommers and for brevity), however, this can certainly be reconsidered if there is strong sentiment otherwise.
Main Topics
Achieving Consensus on Type Priority Between Types
Background
In order for the dispatch between types to work out consistently and unambiguously, the directed graph of interactions between array types needs to be acyclic, which thereby requires some form of agreement/coordination between duck array libraries (see NEP 13’s definition and discussion of the type casting hierarchy for more details). A simple (and often suggested) approach for arranging these interactions is through a (linked-)list of priorities (as can arise from `__array_priority__`). However, this is insufficient in practice because any given meta array may only have a limited set of array types it can take in, rather than any arbitrary array of lower priority (in the language of graph theory, we cannot treat the type casting hierarchy graph as equivalent to its transitive reduction). Instead, all the specific pairwise interactions between array types need to be defined in some fashion, and, ideally, coordinated and verified between libraries so that an acyclic graph always results.
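To make the transitive-reduction point concrete, here is a minimal sketch (all type names and the `can_wrap` edge set are illustrative, not the actual capabilities of these libraries) showing how a pure priority list over-promises compared to explicit pairwise edges:

```python
# Hypothetical pairwise "can wrap" edges between array types. A linear
# priority list implies transitivity: anything higher-priority could wrap
# anything lower-priority. In practice each library supports only a limited
# set of wrapped types, so the explicit edge set must be kept, not a ranking.
can_wrap = {
    "xarray": {"pint", "dask", "numpy"},
    "pint": {"dask", "numpy"},
    "dask": {"numpy", "sparse"},
}

def implied_by_priority(priority, outer, inner):
    # Under a pure priority scheme, any higher-priority type wraps any lower one.
    return priority.index(outer) < priority.index(inner)

priority = ["xarray", "pint", "dask", "numpy", "sparse"]

# The priority list claims pint can wrap sparse...
assert implied_by_priority(priority, "pint", "sparse")
# ...but the actual pairwise capabilities say otherwise.
assert "sparse" not in can_wrap["pint"]
```

This is why the full pairwise graph, rather than a single ordering, has to be agreed upon and checked for acyclicity.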
Between the major participating libraries, the consensus directed acyclic graph (DAG) of these interactions looked like the following as of 2020:
However, this example DAG is
- emergent from bespoke handling in each library
- poorly adaptable to new meta arrays
- subtly broken due to inconsistencies between the kinds of interaction between arrays (such as constructors vs. binops vs. array functions)
- incomplete due to interactions that are merely planned or handled by generic protocols but untested
- unreflective of compatibility efforts since 2020, such as the Python Array API and various NEPs
And so, some key goals for community standards to resolve these concerns are:
- The directed graph of array type interactions is agreed upon across the community and remains acyclic
- Introduction of new types to the accepted DAG is easy and can be safely done without introducing cyclicness
- Separate kinds of array operations/interactions do not partially introduce cyclicness to the otherwise acyclic graph of interactions
Proposing a Common Library to Manage Type Hierarchy
The September 2021 coordination meeting came to a loose consensus that a common library to manage the type casting hierarchy for meta arrays should be created. This way, there cannot be inconsistencies among participating libraries, any conflicts would be resolved publicly in the same place, and hooks could be introduced for one-off inclusion of a user-defined array type within the “official” DAG.
Features in this shared library could include:
- Registering or collecting meta array types in the present environment
- Checking/verification that the DAG works out
  - Optional, or,
  - Enforcement (including utils to raise errors where relevant)
- Providing utilities that meta arrays can (or must?) use in their implementations of wrapping/binops/`__array_ufunc__`/`__array_function__`/array function modules
- Hooks to introduce user-defined meta arrays into the DAG
Even with a shared library for coordinating meta arrays, the additional question of how these pairwise interactions are defined (by what mechanism or protocol) remains. Several ideas have been suggested previously:
- interpreted from NEP 37’s `__array_module__`
- a new slot for handled types
- a registry within the shared meta array library
Finally, it is uncertain where this kind of shared library should be hosted (Python for Data · GitHub, or elsewhere?) or what it should be named.
I would love to see further discussion of these items below, but given the poor past results I’ve seen with merely raising the topic, I’ll give an initial concept to start things off. Please feel free to critique, adapt, or replace this concept below!
@jthielen’s Initial Concept for Common Library
- Create a package titled `metaarray-management` hosted within PyData (https://github.com/pydata/metaarray-management)
- Ensure a critical mass of meta array libraries agree to participate (need at least Xarray, Dask, and Pint for this to be meaningful)
- Use NEP 37’s `__array_module__` as the protocol for any meta array to define what types it handles
  - In practice, the check will then look like: `is_handled = this_type.__array_module__((other_type,)) is not NotImplemented`
  - This will have the side-effect of enforcing (at least some degree of) NEP 37 support in all meta array libraries
- `metaarray-management` will be based around a run-time registry of meta array types and DAG checking. For performance reasons, registry assembly and DAG checks will only occur on demand. Nightly CI for `metaarray-management` will operate on a list of participating meta array types to ensure that compliance does not break. Participating libraries agree to treat any such failure traceable to their meta array as a bug.
- Participating libraries need not import or depend on `metaarray-management`; however, they may optionally do so to take advantage of meta array utils that end up being usefully reused across meta array types, or for their own compatibility checks.
- Participating libraries agree to provide an entrypoint (or other well-documented procedure) for adding a new type to their list of wrappable types, and contribute this for inclusion within `metaarray-management`.
- User-defined meta arrays must adopt NEP 37 (to define what arrays they can handle), as well as register with `metaarray-management` which existing meta array types should be made able to wrap the new array. `metaarray-management` will then verify that the DAG is maintained, or else reject the new meta array’s specifications.
- Python Array API implementors that are not meta arrays need not be concerned with `metaarray-management`, as they can always be considered to exist at an endpoint/bottom of the DAG. It will be assumed that a meta array can choose to support non-meta-array Python Array API implementors generically, rather than having an explicit list of all such types. To enable this, all meta arrays must implement a detectable protocol so that they can be readily distinguished from non-meta arrays (see `__meta_array_getattr__` below for such a protocol suggestion).
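As a very rough sketch of what the registry-plus-DAG-checking core of such a library could look like (the class and method names here are hypothetical, not a proposed `metaarray-management` API), the standard library's `graphlib` already provides the needed cycle detection:

```python
from graphlib import TopologicalSorter, CycleError

# Minimal sketch: each meta array type registers the set of types it can
# wrap, and the registry verifies that the resulting directed graph of
# "wraps" relationships stays acyclic before accepting the registration.
class MetaArrayRegistry:
    def __init__(self):
        self._handles = {}  # type name -> set of wrappable type names

    def register(self, name, handles):
        proposed = dict(self._handles, **{name: set(handles)})
        try:
            # TopologicalSorter.prepare() raises CycleError on a cyclic graph
            TopologicalSorter(proposed).prepare()
        except CycleError:
            raise ValueError(f"registering {name!r} would make the hierarchy cyclic")
        self._handles = proposed

registry = MetaArrayRegistry()
registry.register("dask", {"numpy", "sparse"})
registry.register("pint", {"dask", "numpy"})
registry.register("xarray", {"pint", "dask", "numpy"})
try:
    registry.register("numpy", {"xarray"})  # would close a cycle
except ValueError as exc:
    print(exc)  # registering 'numpy' would make the hierarchy cyclic
```

In a real implementation the registry would hold types (or entrypoints) rather than names, and the check would only run on demand, per the CI/performance notes above.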
Recommendation of “Allow List” over “Deny List”
Among existing packages, the aforementioned interactions have been defined informally through independent/ad-hoc implementations in each meta array library. Two main approaches have arisen so far:
- Have an “allow list” of types that this array type can handle/wrap, and defer to any other (e.g., Dask)
- Have a “deny list” of types to which this array type defers, and assume any other “sufficiently array-like” type can be handled/wrapped (e.g., Pint, but also xarray if the “deny list” is effectively empty)
For a limited set of commonly-used array types in the PyData stack, this has often worked out in practice so far. However, as the number of duck array libraries increases, maintaining agreement between libraries through these existing independent approaches becomes difficult.
Based on consensus from prior discussions, and the requirements of shared management of the type casting hierarchy, I would propose a community standard that all meta arrays should operate off of an “allow list” approach rather than “deny list.” The only exception to this should be non-meta-array Python Array API implementors, which could reasonably be supported generically (as there is no concern about such generic implementation introducing cyclicness to the type casting hierarchy).
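The “allow list” approach is already the idiom recommended in NEP 13’s example implementation (a `_HANDLED_TYPES` tuple checked in `__array_ufunc__`). Here is a minimal sketch of that pattern; `Quantityish` is a mock, not Pint’s actual implementation:

```python
import numbers
import numpy as np

class Quantityish:
    # Allow list: explicit types this meta array will handle/wrap. Anything
    # else causes deferral via NotImplemented, per the NEP 13 pattern.
    _HANDLED_TYPES = (np.ndarray, numbers.Number)

    def __init__(self, magnitude):
        self._magnitude = magnitude

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        for x in inputs:
            if not isinstance(x, self._HANDLED_TYPES + (Quantityish,)):
                # Unknown type: defer so the other operand gets a chance,
                # instead of assuming it is wrappable (the deny-list trap).
                return NotImplemented
        arrays = [x._magnitude if isinstance(x, Quantityish) else x for x in inputs]
        return Quantityish(getattr(ufunc, method)(*arrays, **kwargs))

q = Quantityish(np.arange(3.0))
print(np.add(q, np.ones(3))._magnitude)  # [1. 2. 3.]
# An unknown type is deferred to rather than blindly wrapped:
assert q.__array_ufunc__(np.add, "__call__", q, object()) is NotImplemented
```

A real implementation would also inherit from `np.lib.mixins.NDArrayOperatorsMixin` so that binary operators route through this same allow list, which is exactly the kind of consistency discussed below.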
Consistency Between Forms of Interaction
As hinted at by the language of “array interaction” previously, there are several different interactions in practice in which handling/deferral needs to be considered. As far as I can recall, these are:
- Meta array layer nesting order
- Constructors
- Binary operations (addition, multiplication, etc.)
- `__array_ufunc__` (NEP 13)
- `__array_function__` (NEP 18)
- `__array_module__` (NEP 37) and other importable modules of array functions (e.g., `dask.array`)
Presently, the consistency of type handling between these different forms of interaction varies. For example, Pint Quantities are designed to be consistent (where these forms exist, as Pint doesn’t have an array function module/namespace…yet), whereas Dask arrays are deliberately inconsistent (e.g., deferring to other types in binops and `__array_*__`, but assuming user intent to override the standard deferral in construction/wrapping and array functions), and (at least as I see it) xarray is accidentally inconsistent (see pydata/xarray#3950). The main sticking point in the discussions involved in each of these respective implementations can be summarized as “How much should we trust the user to not ‘break’ the DAG through inconsistencies in type deferral?” or “How much should consistency be enforced?”
Consensus (or at least the absence of any objections) from the September 2021 meeting supports making all these forms of interaction consistent. It will be up to every meta array implementor to do so; however, the previously proposed type hierarchy management library could both provide reusable utils for this and run test suites to verify consistency.
At the very least, adopting this stance of “all forms of interaction between meta arrays must be consistent with respect to an agreed-upon type hierarchy” will require some changes to participating libraries, such as:
- xarray should carry out the changes in Consistent Handling of Type Casting Hierarchy xarray#3950
- Dask should carry out Dask Array’s NEP-13 and NEP-18 protocols should use common utils dask/dask#6672, while also reconsidering the intentional inconsistency in Add simple chunk type registry and defer as appropriate to upcast types dask/dask#6393, so that UserWarning when wrapping pint & dask arrays together xarray#5559 (comment) is resolved
TL;DR
When multiple meta array types exist, especially in combination, we need coordination and agreement to ensure that cyclicness is not introduced to the hierarchy of interactions. And so, we need community standards among meta arrays regarding:
- a common library for managing the type hierarchy
- how meta arrays determine what arrays they can wrap
- the need for differing forms of array interactions to be consistently handled
Designing Protocols/Standards for Exchange Between Array Layers
Once the problem of verifying that meta arrays can consistently and robustly “play nice” with each other is resolved (prior section), several usability concerns still remain. In particular, what should the APIs for multiply nested meta arrays look like in practice, and how do we ensure that type-specific details within a given nested array construct are handled in a robust and easy-to-implement way?
reprs
Once a meta array is constructed, especially one with multiple nested array layers, a foremost user concern is the `repr` of this meta array: how are the most relevant details of each array layer exposed through the display of the outermost object? `repr`s have been one of the oldest discussion points on the topic of meta arrays (see Design: nested _meta · Issue #5329 · dask/dask · GitHub and reprs of nested duck arrays · Issue #6637 · dask/dask · GitHub), and to this day no well-defined expectations/standards exist. Instead, most meta array libraries implement some bespoke form of nested/recursive inheritance from the wrapped type (e.g., xarray’s usage of `_repr_html_` and `_repr_inline_`).
Since several suggestions have been raised previously, I will simply present them here rather than suggest a single solution.
@keewis suggestion from Sep 2020
Simply including the repr of the wrapped array usually results in reprs that are way too long, so instead we could introduce a protocol (or method) that returns a mapping of type names to metadata, and a function to call that protocol. If a duck array does not implement the protocol, the default would be to return `{typename: {}}`. For example, calling `repr(arr)` where `arr` is a `xarray(pint(dask(sparse)))` would make `xarray` call the data’s `__metadata__` and use that to construct its `repr`. This could be something like:

```python
def __metadata__(self):
    wrapped_metadata = self._data.__metadata__()
    metadata = {
        "a": str(self.a),
        "b": str(self.b),
    }
    # or something like {**wrapped_metadata, **{type(self).__qualname__: metadata}}
    return wrapped_metadata | {type(self).__qualname__: metadata}
```
In the example, the result of the first call could be something like:

```python
{
    "sparse.COO": {"shape": (30, 1000), "dtype": "float64", "nnz": 768, "fill_value": 0},
    "dask.array.Array": {"chunks": (10, 100), "shape": (30, 1000), "dtype": "float64"},
    "pint.Quantity": {"units": "m", "shape": (30, 1000), "dtype": "float64"},
}
```
With that, `xarray` could either manually format the `repr` or use a helper function (which should probably have a `max_width` parameter).

This doesn’t work for duck arrays wrapping more than a single duck array, though. Also, I’m not sure if the type name should be the type’s `__qualname__`, and if `shape` and `dtype`, which every duck array has to implement as properties, should be included.
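To illustrate the helper-function half of this suggestion, here is a sketch of what formatting that metadata mapping could look like (the helper name `format_nested_repr` is hypothetical, and this is just one possible layout):

```python
def format_nested_repr(name, metadata, max_width=80):
    # Turn the {typename: {attr: value}} mapping from the proposed
    # __metadata__ protocol into one compact line per layer, truncating
    # each layer's line to max_width characters.
    lines = [f"<{name}>"]
    for typename, attrs in metadata.items():
        attr_str = ", ".join(f"{k}={v}" for k, v in attrs.items())
        line = f"  {typename}({attr_str})"
        if len(line) > max_width:
            line = line[: max_width - 3] + "..."
        lines.append(line)
    return "\n".join(lines)

metadata = {
    "sparse.COO": {"shape": (30, 1000), "dtype": "float64", "nnz": 768},
    "dask.array.Array": {"chunks": (10, 100), "shape": (30, 1000)},
    "pint.Quantity": {"units": "m", "shape": (30, 1000)},
}
print(format_nested_repr("xarray.DataArray", metadata, max_width=50))
```

A shared implementation of such a helper in the proposed common library would keep nested reprs visually consistent across wrapping libraries.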
@hameerabbasi suggestion from Sep 2021
I thought about this, and calculating the repr is indeed a hard problem, but it becomes easy if we exclude distributed/chunked arrays.
What I propose is:
* every meta array has an `__array_meta__(): duck_array` which returns the inner array (as one can see for distributed arrays, this could require materialising the array).
* In addition, there should be an `__array_attrs__(html: bool): Dict[str, Union[int, bool, float, np.generic, str]]` which should be on ALL arrays that want to be included into meta-arrays. This method returns a `dict` containing the elements it would like displayed. One can then use an implementation like the following:
```python
def __repr__(self):
    attrs = {}
    type_names = []
    current_arr = self
    while current_arr is not None:
        # This formulation guards against duplicate attrs
        attrs = dict(**attrs, **current_arr.__array_attrs__(False))
        type_names.append(type(current_arr).__name__)
        current_arr = getattr(current_arr, "__array_meta__", lambda: None)()
    return "<{super_type_name}: {attrs_repr}>".format(
        super_type_name=".".join(type_names),
        attrs_repr=", ".join(f"{k}: {v!r}" for k, v in attrs.items()),
    )
```
For a chunked representation, one could also have a third slot, `__attr_aggregate__(attr_name): Callable[[Iterable[Union[int, bool, float, np.generic, str]]], Union[int, bool, float, np.generic, str]]`, which could aggregate the given attribute. This way, Dask (or other distributed/chunked array libraries) could easily aggregate the attrs.
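To make the aggregation slot concrete, here is a sketch of how a chunked array library might implement it (`ChunkedMock` and its attributes are hypothetical; this is not an existing Dask API):

```python
class ChunkedMock:
    def __init__(self, chunk_attrs):
        # list of per-chunk __array_attrs__-style dicts
        self._chunk_attrs = chunk_attrs

    def __attr_aggregate__(self, attr_name):
        # Return a callable reducing per-chunk values to one displayable value
        if attr_name == "nnz":
            return sum  # total stored values across chunks
        if attr_name == "units":
            # all chunks share units; just take the first
            return lambda values: next(iter(values))
        raise KeyError(attr_name)

    def aggregated_attr(self, attr_name):
        agg = self.__attr_aggregate__(attr_name)
        return agg(d[attr_name] for d in self._chunk_attrs)

arr = ChunkedMock([{"nnz": 500, "units": "m"}, {"nnz": 268, "units": "m"}])
print(arr.aggregated_attr("nnz"))  # 768
print(arr.aggregated_attr("units"))  # m
```

The `__repr__` implementation above could then call `__attr_aggregate__` whenever a layer reports per-chunk rather than whole-array attrs.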
Handling Wrapped-Type-Specific Details from Outermost Wrapping Array
Type Casting (Layer Addition/Removal) Within Layers of a Nested Array
It is often necessary to “mutate” what array layers exist within a multiply nested meta array (e.g., downcasting contents to NumPy, collecting a distributed array, stripping units). However, without high-level custom handling for particular nested structures (such as exists in compatibility packages like pint-xarray), this kind of inter-layer type casting requires manual deconstruction and reconstruction of the meta array, which is a poor user experience. This is a rather nebulous topic, as many array types have unique APIs for casting to other array types, and the problem can become complex rather quickly with explicit (non-protocol) support for many types (see https://github.com/pydata/xarray/issues/3245 for discussion of the many `.to_*` or `.as_*` methods).
Initial comments by @SimonHeybrock suggest that a well-designed property/method attribute protocol could solve this (e.g., Pint’s `.magnitude` property functions as a downcast to whatever Pint is wrapping), and so, I won’t offer any direct suggestions here, and instead defer to the next item.
`__meta_array_getattr__` Protocol
This is an adaptation of the initial suggestion by @SimonHeybrock in Multiple duck array layers: Composability, implementations, and usability concerns, but renamed and with added discussion.
We can introduce a new protocol through which array layers in a meta array can expose a) select attributes directly or b) namespaced attribute collections for composable handling by parent meta arrays, apart from standard Python `__getattr__` handling (which is prone to namespace cluttering and/or collision).
To start, it is built around a nesting protocol of (optional) wrapped attributes like the following:
```python
no_op = lambda x: x

class MetaArraySample:
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)

    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units". "magnitude" is
        # expected to be re-wrapped by parent meta array, but "units" should be
        # returned outright. Both are supported.
        contents = self._magnitude
        if name == 'magnitude':
            return wrap(contents)
        elif name == 'units':
            return self._units

        # Now, allow inheriting from wrapped contents
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        if hasattr(contents, '__meta_array_getattr__'):
            return contents.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )
```
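To make the layered behavior concrete, here is a self-contained toy with two mock layers (both classes are hypothetical stand-ins, not real libraries): the `wrap` callback threads outer layers back around whatever an inner layer exposes, so a unit-stripping access on the outer object still returns an object of the outer type.

```python
no_op = lambda x: x

class LabelLayer:
    # Outer mock layer; forwards unknown attributes into its contents.
    def __init__(self, data, label):
        self._data, self.label = data, label

    def _rewrap_content(self, contents):
        return LabelLayer(contents, self.label)

    def __getattr__(self, name):
        return self.__meta_array_getattr__(name, wrap=no_op)

    def __meta_array_getattr__(self, name, wrap):
        wrap_with_self = lambda c: wrap(self._rewrap_content(c))
        if hasattr(type(self._data), '__meta_array_getattr__'):
            return self._data.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(name)

class UnitLayer:
    # Inner mock layer (a Pint-like Quantity).
    def __init__(self, magnitude, units):
        self._magnitude, self._units = magnitude, units

    def __getattr__(self, name):
        return self.__meta_array_getattr__(name, wrap=no_op)

    def __meta_array_getattr__(self, name, wrap):
        if name == 'magnitude':
            return wrap(self._magnitude)  # re-wrapped by any parent layers
        if name == 'units':
            return self._units  # returned outright, no re-wrapping
        raise AttributeError(name)

nested = LabelLayer(UnitLayer([1, 2, 3], 'm'), label='temperature')
print(nested.units)           # 'm' passes straight through the outer layer
stripped = nested.magnitude   # the unit layer is removed, but the outer
print(type(stripped).__name__)  # layer is preserved around the raw data
```

Note how `nested.magnitude` returns a `LabelLayer` directly wrapping the raw data: the inner layer was removed without the user ever deconstructing and reconstructing the stack by hand.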
When the number of “exchanged” attributes per layer is small, this direct handling of select attributes works well. However, many use cases would require further attributes (up to the entire public API of the array type minus standard Array API attributes), which would pollute any top-level meta array’s effective namespace. This could require an encapsulating object within the protocol to act as a namespace for a particular array layer (a la xarray’s accessor interface):
```python
no_op = lambda x: x

class MetaArrayAttributeContainer:
    def __init__(self, metaarray, wrap):
        self._metaarray = metaarray
        self._wrap = wrap

    def to(self, other):
        return self._wrap(self._metaarray.to(other))

    def to_base_units(self):
        return self._wrap(self._metaarray.to_base_units())

    def m_as(self, other):
        return self._wrap(self._metaarray.m_as(other))

    @property
    def dimensionality(self):
        return self._metaarray.dimensionality

class MetaArraySample:
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)

    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units" as direct
        # attributes, and others via "pint" pseudo-namespace.
        contents = self._magnitude
        if name == 'pint':
            return MetaArrayAttributeContainer(self, wrap)
        elif name == 'magnitude':
            return wrap(contents)
        elif name == 'units':
            return self._units

        # Now, allow inheriting from wrapped contents
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        if hasattr(contents, '__meta_array_getattr__'):
            return contents.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )
```
Additionally, the single-content wrapping approach used in this initial example may not be robust enough for multiple contents (e.g., masked arrays). While there may be several approaches to handling this case, the scippx prototype (Copyright (c) 2022 Scipp contributors, reused with modification under BSD-3-Clause license) does so like the following:
```python
def make_wrap(wrap, attr, extra_cols=None):
    def do_wrap(obj, *args, **kwargs):
        # case rewrap_result:
        #     obj is result of call with args and kwargs, need those to apply
        #     to extra_cols
        if extra_cols is None:
            return wrap(obj)
        cols = []
        for col in extra_cols:
            try:
                proto_col = col.__meta_array_getattr__(attr, wrap=no_op)
            except AttributeError:
                cols.append(col)
            else:
                if hasattr(proto_col, 'shape'):
                    cols.append(proto_col)
                else:  # callable we got from rewrap_result
                    cols.append(proto_col(*args, **kwargs))
        return wrap((obj,) + tuple(cols))
    return do_wrap
```
```python
class MetaArraySample:
    def __getattr__(self, name):
        # When called as top-level, defer to meta array attr protocol
        return self.__meta_array_getattr__(name, wrap=no_op)

    def _forward_meta_array_getattr_to_content(self, name, wrap):
        wrap_with_self = lambda contents: wrap(self._rewrap_content(contents))
        content = self._unwrap_content(self)
        if isinstance(content, tuple):
            content, *extra_cols = content
            wrap_with_self = make_wrap(wrap_with_self, name, extra_cols)
        if hasattr(content, '__meta_array_getattr__'):
            return content.__meta_array_getattr__(name, wrap=wrap_with_self)
        raise AttributeError(
            f'Meta array attribute {name} not present on {self.__class__}'
        )

    def __meta_array_getattr__(self, name, wrap):
        # Mock of a Pint Quantity, with "magnitude" and "units" as direct
        # attributes, and others via "pint" pseudo-namespace.
        if name == 'pint':
            return MetaArrayAttributeContainer(self, wrap)
        elif name == 'magnitude':
            return wrap(self._magnitude)
        elif name == 'units':
            return self._units

        # Now, allow inheriting from wrapped contents
        return self._forward_meta_array_getattr_to_content(name, wrap)
```
The scippx prototype also includes an `unwrap` argument in this protocol, for which the need may arise in certain use cases. I personally have not grasped its utility, and so am refraining from demonstrating it here. However, it certainly merits discussion as we work towards an eventual standard, if this style of side-chained & wrapped attribute protocol is endorsed by the community.
This protocol would remove the need for existing `__getattr__` deferral mechanisms (like that present in Pint) and partially obviate the need for compatibility packages (like pint-xarray). For the latter, compatibility shims solely for the wrapped array (e.g., pint-specific things, in the case of pint-xarray) would be obviated; however, integration of otherwise unsupported features in the wrapping array (e.g., unit-aware xarray API features not otherwise supported out-of-the-box) would still benefit from such a compatibility package. This would also raise naming conflicts (i.e., should the “pint” attribute on an xarray object point to the pint-xarray accessor or to the encapsulated pint namespace through the meta array protocol?), but those would likely be best worked out independently when encountered, rather than addressed in a general meta array community standard.
Also, once a protocol is standardized, the previously referenced “common library” could include utilities to ease implementation within meta array libraries.
TL;DR
It would be highly beneficial for the community to develop a set of standards and protocols for meta arrays to seamlessly handle `repr`s as well as nested-array-specific operations from the outermost array object (i.e., the one with the exposed API surface) without falling back to a paradigm of deconstruct & reconstruct and/or requiring bespoke code for every possible nested array interaction.
Additional Discussion
Relation to Python Array API
Many prior discussions in this area have referenced the Python Array API. However, the issues raised above as needing community standards are orthogonal to the Array API: they are specifically about meta arrays and the interactions between different levels of these arrays that wrap other arrays, rather than defining a shared API surface that could be targeted generically by any array-consuming library. That isn’t to say there aren’t mutual considerations (e.g., the previously mentioned non-meta-array Array API implementors being generically considered to be at the bottom of the DAG). But, regardless, I feel it best to move these meta array efforts forward within the Scientific Python community separately from the Python Array API standards.
Concluding Summary (or, even the TL;DRs were TL)
The Scientific Python community has seen some amazing work in the past several years around community standards for array-providing libraries to streamline compatibility within array-consuming libraries. However, there are libraries that do both (provide and consume arrays), which we can refer to as meta array libraries, and standardization work in this area (especially on concerns arising when using multiple meta arrays in combination) has languished. I propose we promptly move forward as a community by establishing standards and protocols in two major areas:
- Achieving Consensus on Type Priority Between Array Types
- Designing Protocols/Standards for Exchange Between Array Layers
Please share your thoughts below!