Advice and guidance about Array API for a units package

astrojuanlu · July 15, 2022, 7:06pm

Hi all,

A few months ago I started a proof of concept to develop a fast Python library for physical units (because we don’t have enough of them ) called fastunits. This is unrelated to the prototype @seberg presented today at SciPy 2022, which uses dtypes instead.

The current philosophy of fastunits is that Quantity wraps a value and a unit. This is how it should look like:

import numpy as np

from fastunits.??? import ??? as u  # TBC

qv1 = [1, 0, 0] << u.m
# Under the hood, this does
# qv1 = ArrayQuantity.from_list([1, 0, 0], u.m)

qv2 = [0, 1, 0] << u.cm
qv3 = np.random.randn(10_000) << u.m
print(qv1 + qv2)

As you can see, it’s heavily inspired by astropy.units, but I want to achieve at least a 10x improvement in performance by sacrificing some backwards compatibility and assumptions.

However, I have two questions on how to make this integrate nicely with the rest of the Scientific Python ecosystem:

Should I make *Quantity objects wrap any object conforming to the Python array API standard? This supposedly should allow folks to use fastunits not only with NumPy arrays but also any of the “implementors” (as described in array-api#1). However, after reading various NEPs and GitHub comments here and there, I’m not at all sure if the image below reflects the current consensus, an up-to-date list of implementors, etc.

Should I make *Quantity be an Array API implementor? My gut feeling was that anything that is not purely numerical arrays is beyond what the spec covers, and @seberg confirmed that intuition to me yesterday. However, there could be more clarity about what users expect in terms of convenience features. In other words: should numpy.add(q1, q2) return a fastunits.*Quantity? The alternative is that users need to wrap/unwrap quantities every time, or do things like

import fastunits.numpy_compat as funp

q = funp.add(q1, q2)

which is inconvenient.

As far as I understand, this should be done with NEP 18 __array_function__, but the draft of NEP 37 recollects contains several discouraging paragraphs, like (emphasis mine)

__array_function__ has significant implications for libraries that use it: […] users expect NumPy functions like np.concatenate to return NumPy arrays. […] Libraries like Dask and CuPy have looked at and accepted the backwards incompatibility impact of __array_function__; it would still have been better for them if that impact didn’t exist".

I feel like there are a lot of painful conversations behind these words, but as an outsider, I have zero context of what happened, and more importantly I’m not sure what is the blessed way to proceed.

Another reason I’m so hesitant is that I’ve seen other packages inherit np.ndarray for this, but I’m not sure at all it’s a good idea to do such a thing.

I tried to be as articulate and succinct as possible, looking forward to reading your opinions on this!

TomNicholas · July 15, 2022, 7:45pm

Great! I’m looking forward to looking into it. If you want help with making xarray wrapping work then ping me @TomNicholas

astrojuanlu · July 15, 2022, 9:36pm

Thanks @TomNicholas! Yes I’m aware of pint, but I devised a methodology that does not require sympy (like unyt) or parsing (like pint) by using a more algebraic approach that is also way simpler (details in the README) and I’m hoping to make it as fast as it gets.

endolith · July 17, 2022, 11:46am

The biggest problem with units support in Python is that there are too many solutions.

I made a comparison chart trying to decide which to use. I’ll add it here when I’m on my computer.

astrojuanlu · July 17, 2022, 3:52pm

Sure, there are lots of results for a “units” search on PyPI, but I don’t think there are that many libraries that are widely used, compatible with NumPy, and healthy, supported, and maintained as of July 2022:

The comparison by Trevor Bekolay (2013) tested a number of them, but the only ones that still had commits in 2022 are astropy.units and Pint.
The unyt paper (2018) compares unyt, astropy.units, and Pint.

Maybe 3 libraries are already too many (and it’s unclear if mine will become successful or if I’ll run out of steam quickly) but I think it’s natural that folks have slightly diverging use cases.

In any case, looking forward to seeing your chart @endolith!

astrojuanlu · July 17, 2022, 3:54pm

In any case, I’d like to bring back your attention to the fact that this is not a discussion of “should we have yet another units package?”, but rather, about how would someone create one from scratch and apply the most up-to-date interoperability standards.

endolith · July 17, 2022, 4:33pm

Found it:

Yeah I made it in 2016, and many are unmaintained, unfinished, etc.

rgommers · July 18, 2022, 5:22pm

It depends, I’d say you’re already taking on a big chunk of work, so perhaps design your wrapping of numpy such that it can be added to later? I have not seen too much interest in pytorch/cupy/jax + units, so I’m not sure if that extra generality will pay off from the start.

If you’re going to be writing functions that are in the Array API in your library (as your funp.add example indicates), then you may as well adhere to the function definitions in the array API standard rather than anything else. And at that point, “being an array API implementor” is a matter of adding the __array_namespace__ method.

That said, you will probably want to implement more than is in the standard. There’s no reason that’s not okay, you can add anything you need.

It’s not easy to figure out what your implementation will look like, so it’s hard to say what the “blessed way” is. Not that there even is such a thing perhaps; there’s multiple options with different trade-offs. Interoperability with NumPy — NumPy v2.0.dev0 Manual captures quite a few.

Inheriting from numpy.ndarray is possible, but a little fragile and perhaps the least appealing of all the options.

astrojuanlu · July 18, 2022, 5:50pm

Thanks a lot for your response @rgommers, that’s really helpful. As a final follow-up, what do you think of

In any case, I think I can follow a gradual approach:

For the data containers, prepare the code so that it uses get_namespace as proposed in NEP 47 (possibly aided by these hypothesis strategies), which in principle should make fastunits “backend-agnostic”.
For the operations with quantities, start by adding a fastunits.numpy_compat module with functions from the array API specification as I need them or as users request them (possibly aided by a future standalone vendored array protocol). Then if the project is successful, someone will open an issue saying “please make np.add(q1, q2) return a *Quantity”, and then I’ll have the “social proof” I need to go ahead with implementing __array_function__ .

rgommers · July 23, 2022, 2:41pm

I think yes, it should. Or alternatively, error out. But “array kind out == array kind in” is a good design rule imho, and losing the units is probably not intended by your user, so returning plain numpy arrays is the least attractive option.