Experiment: Array API + named dimensions + units

Not sure if this will go anywhere, but I have been playing with some ideas bringing together:

  1. Arbitrary array of values (implementing the Python array API standard),
  2. Named dimensions, and
  3. Physical units.

We can then use, e.g., dask.array with astropy.units, or numpy with pint, or any other combination.
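A minimal sketch of what I mean (note that `DimensionedArray` and its layout here are hypothetical, not the actual PyDims API):

```python
# Purely illustrative sketch; `DimensionedArray` is a hypothetical name,
# not the actual PyDims API.
from dataclasses import dataclass
from typing import Any

import numpy as np
import pint


@dataclass
class DimensionedArray:
    values: Any             # 1. any array implementing the array API standard
    dims: tuple[str, ...]   # 2. one name per axis
    unit: Any               # 3. a unit object from, e.g., pint or astropy.units


ureg = pint.UnitRegistry()
temps = DimensionedArray(values=np.zeros((3, 4)), dims=("time", "x"), unit=ureg.K)
```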

This is obviously similar to Xarray’s effort to extract xarray.Variable as NamedArray, but takes a different and more explicit approach to handling units (as well as not supporting attrs).

My current idea would be that units might be handled using something like a “Python units API standard”, since I am not (yet?) convinced that integrating units with either the array (by having the units library consume and implement the array API) or the NumPy dtype system is ideal, e.g., in terms of UX.

Any feedback welcome!

Cheers,
Simon


Cool!

Two things come to my mind:

  1. As far as I can see, the dimension names are used to reorder the axes in order to do operations (as in xarray), and that is the main cause of discrepancies with the array API standard (for instance, adding arrays with dims ("x", "y") and ("y", "x") aligns the axes by name, whereas the array API would broadcast purely by position). Although that is of course useful functionality, sometimes I would want to add names (or other kinds of metadata) to dimensions or particular indexes without changing how operations and broadcasting work, and thus keep compatibility with the array API standard. Do you know a package that provides that?

  2. This package tries to do two things at once (names and units). In my opinion, the most composable approach would be to have specialized array wrappers that each do one thing well, accepting one array API-compatible object and exposing the same API. Then the array API standard could be extended so that it is possible to query parameters of these arrays independently of their position in the “stack” of wrappers. That way you could, for example:

    1. Pick a standard-compliant array, e.g. a CuPy array.
    2. Wrap 1 in an object that adds masking.
    3. Wrap 2 in an object that adds units.
    4. Wrap 3 in an object that adds shape-dependent metadata (such as names).

    and you should still be able to access the mask, units, or metadata easily, independently of the order in which the wrappers were applied (a minimal sketch follows below). What do you think of this?
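Here is a minimal sketch of the layering and the stack-walking query I have in mind (all names are hypothetical; nothing here exists in a library):

```python
# Hypothetical sketch of the wrapper stack; the classes only illustrate the idea.
import numpy as np


class Layer:
    def __init__(self, data):
        self.data = data  # the wrapped array, or another layer


class Masked(Layer):
    def __init__(self, data, mask):
        super().__init__(data)
        self.mask = mask


class WithUnits(Layer):
    def __init__(self, data, unit):
        super().__init__(data)
        self.unit = unit


def query(arr, attr):
    """Walk down the wrapper stack until some layer provides `attr`."""
    while arr is not None:
        if hasattr(arr, attr):
            return getattr(arr, attr)
        arr = getattr(arr, "data", None)
    raise AttributeError(attr)


x = Masked(WithUnits(np.ones(3), unit="m"), mask=np.zeros(3, dtype=bool))
query(x, "unit")   # 'm' -- found regardless of where the units layer sits
query(x, "mask")   # array([False, False, False])
```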

Since masks came up, I thought I’d mention GitHub - mdhaber/marray (“Masked versions of array API compatible arrays”), which wraps array API compliant libraries to support a mask.

Xarray has attrs for its coords (coordinates in Xarray take the role of what is called indexes in Pandas), which let you do this. But (as far as I know) you cannot have attrs for a bare dimension, i.e., you need a coordinate for that dimension.
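For example, with the actual xarray API:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros(3), dims="x", coords={"x": [0, 1, 2]})
da.coords["x"].attrs["units"] = "m"  # attrs live on the coordinate
# There is no attrs slot on the bare dimension "x" itself: without the
# coordinate above, there would be nowhere to put this metadata.
```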

I tried something like that last year, and came to the conclusion that it is not going to “work”. I am using quotation marks since one can actually build something that does this (see my prototype at GitHub - scipp/scippx: Prototype duck-array layering, and the discussion thread Multiple duck array layers: Composability, implementations, and usability concerns).

May I ask what difficulties you found with this approach? IMHO, this degree of composability is necessary to prevent a combinatorial explosion of array types in the Scientific Python ecosystem, each with a different set of half-baked functionalities, instead of one well-implemented and tested array type for each desired feature.

Great question! So far none, actually, but I believe this is mainly due to the early stage of the experiment, where I have only used core features present in all libraries.

As far as I can tell, the different units implementations in particular have made quite different choices, or support specific but required features that other units libraries do not. It is not clear to me yet if and how this could be supported. I hope defining a “Python units API standard” would go a long way (currently this exists only as tiny, incomplete compat layers in PyDims itself).
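To make this concrete, here is a purely hypothetical sketch of what a minimal units protocol might specify (this is not an existing spec; all names are illustrative):

```python
# Hypothetical sketch of a minimal "units API"; nothing here is an
# existing standard or library interface.
from typing import Protocol


class Unit(Protocol):
    def __mul__(self, other: "Unit") -> "Unit": ...
    def __truediv__(self, other: "Unit") -> "Unit": ...
    def __pow__(self, power: int) -> "Unit": ...


class UnitsNamespace(Protocol):
    def parse(self, s: str) -> Unit: ...

    def conversion_factor(self, src: Unit, dst: Unit) -> float:
        """Scale factor from src to dst; raises if dimensionally incompatible."""
        ...


# A compat layer would then adapt pint, astropy.units, etc. to this
# protocol, and PyDims-like code would program against the protocol only.
```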

I have further concerns regarding linear-algebra types and operations, as I am not sure handling vector/matrix axes using the standard named dims is a good choice. The logic of @ in the array API standard feels too implicit to me (relying on axis order and axis lengths). Another example is coordinate systems: if someone stores spherical coordinates $(r, \theta, \phi)$ in an array of inner length 3, @ should not work in the normal way. So my feeling is that linear algebra should be handled via dtypes, maybe… but that would deviate quite a bit from the array API standard, which would be problematic.
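To illustrate why @ feels too implicit to me, here is the failure mode in plain NumPy:

```python
import numpy as np

rot = np.eye(3)             # a 3x3 matrix
points = np.ones((100, 3))  # rows could hold (r, theta, phi)

# The array API's @ only checks that the inner axis lengths match, so this
# "works" even though a plain matrix product over spherical coordinates
# is meaningless:
out = points @ rot          # shape (100, 3), no error raised
```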

Finally, I think every science field has its own very specific needs, and users will always need the ability to make things work in their own particular way. Would PyDims still be generic enough to allow for this? Or is the Python array API actually the lowest common denominator already?

About the vectors/matrices: where can I read more on this? I am working (in my free time) to extend my library scikit-fda to provide a general type for “arrays of functions accepting and returning tensors” (which is the next step after “arrays of vectors/matrices/tensors”), so it would be good if we make the same decisions here.

In particular, I am choosing to make a separate array “type”, instead of using a custom dtype, so that I can make it work with all libraries that follow the array API, and not just NumPy; it should still behave as if it were a normal array with a “function” dtype.
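As a toy illustration of that design choice (this is not scikit-fda’s actual implementation, just a sketch): a separate wrapper type whose elements are functions, rather than a custom NumPy dtype:

```python
# Toy sketch only; not scikit-fda's actual design.
import numpy as np


class FunctionArray:
    """An 'array of functions': a separate wrapper type, not a custom dtype."""

    def __init__(self, funcs):
        self._funcs = np.asarray(funcs, dtype=object)
        self.shape = self._funcs.shape

    def __call__(self, t):
        # Evaluate every element function at t, yielding a plain float array.
        values = [f(t) for f in self._funcs.ravel()]
        return np.asarray(values).reshape(self.shape)


fa = FunctionArray([np.sin, np.cos])
fa(0.0)  # array([0., 1.])
```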

One goal is also to make it compatible with the array API standard. I think this should be possible at least for “scalars”, that is, functions accepting 0-d input and returning a scalar. However, it would be interesting to try to extend the functions, or at least most of them, to work generally with this “function” dtype. In doing so, it would be great to know what other libraries are doing with “arrays of vectors”, for comparison.