Hello, we (Ralf Gommers and I) wrote a blog post about support for multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image, and other similar libraries. Ralf has started the conversation about the overall approach and proposed design here: A proposed design for supporting multiple array types across SciPy, scikit-learn, scikit-image and beyond
In this thread we’d like to discuss the following:
- Should we enable or disable new dispatch mechanisms by default?
- How should the new dispatch behavior be controlled?
Scikit-learn and SciPy currently convert any input data structure implementing a duck array to a NumPy array (using numpy.asarray, which relies on __array__ et al.) to ensure consistent behavior. We want to replace the call to numpy.asarray with __array_namespace__().asarray to enable the new dispatch mechanism.
Should the new behavior be enabled by default? If a library enables it by default (making it opt-out for users, or possibly giving users no choice at all), who controls that: the library that implements the n-D array object, or each individual array-consuming library?
I think array libraries can safely add the __array_namespace__ method to their array objects, indicating adoption of the Array API, and auto-register backends for uarray-based implementations. This is just a signal to users and array-consuming libraries that dispatching is supported; how to make use of that capability is up to the array consumers.
Whether or not it is enabled by default, I think we should add the ability to control the behavior, for example with a global flag that turns the feature on or off. This flag could be implemented per function or per module, but I think overly granular control would be cumbersome, which is why the proposal is to have one global switch per package and to document separately which parts of the library the switch affects. For inputs that don't implement the __array_namespace__ method, we would preserve the current behavior: converting to a NumPy array.
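To make the idea of a single package-level switch concrete, here is a minimal sketch of how a package might expose one, together with a context manager for temporary changes. All names here (_DISPATCH_ENABLED, get_dispatch_enabled, set_dispatch_enabled, dispatch_enabled) and the default value are assumptions for illustration, not an existing API in any of the libraries discussed.

```python
import contextlib

# Hypothetical package-level switch; the name and the default are assumptions.
_DISPATCH_ENABLED = False


def get_dispatch_enabled():
    """Return the current value of the package-wide switch."""
    return _DISPATCH_ENABLED


def set_dispatch_enabled(value):
    """Set the switch and return the previous value so callers can restore it."""
    global _DISPATCH_ENABLED
    previous = _DISPATCH_ENABLED
    _DISPATCH_ENABLED = bool(value)
    return previous


@contextlib.contextmanager
def dispatch_enabled(value=True):
    """Temporarily set the switch, restoring the old value on exit."""
    previous = set_dispatch_enabled(value)
    try:
        yield
    finally:
        set_dispatch_enabled(previous)


with dispatch_enabled(True):
    print(get_dispatch_enabled())  # True inside the block
print(get_dispatch_enabled())      # back to the default afterwards
```

A plain module-level variable like this is shared across threads; a real implementation might prefer thread-local or context-local state, which is a design choice this sketch leaves open.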
Let’s say we name the global switch DISPATCH_ENABLED in the sklearn package. Then, no matter what the default value of the switch is, the pattern to support multiple array libraries would be something like:
    def some_sklearn_func(x):
        if sklearn.DISPATCH_ENABLED and hasattr(x, '__array_namespace__'):
            xp = x.__array_namespace__()
        else:
            xp = numpy
        x_array = xp.asarray(x)
        # ...
        return result
uarray-based dispatching would check the global switch
DISPATCH_ENABLED and presence of the
Having dispatching enabled by default seems to be a good idea, but it gives rise to backward compatibility concerns. If and when an individual array-consuming library sets the default value of the switch to ENABLED, this will change the behavior for all users from coercing to a numpy.ndarray to returning the same type of array as the input. This is technically a break; in most projects, the impact may be small and acceptable to get the desired user experience in the long term, but this is then up to the maintainers of each individual library.
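The behavior change can be illustrated with a small sketch. ToyArray and ToyNamespace are hypothetical stand-ins for a non-NumPy array library, and coerce_input is an assumed helper that mirrors the pattern above with the switch passed in as an argument; none of these are real library APIs.

```python
import numpy as np


class ToyArray:
    """Hypothetical non-NumPy array type, used only for illustration."""

    def __init__(self, data):
        self.data = list(data)

    def __array__(self, dtype=None, copy=None):
        # Lets numpy.asarray coerce this object, as in the current behavior.
        return np.array(self.data, dtype=dtype)

    def __array_namespace__(self, *, api_version=None):
        # Signals Array API support by returning the library's namespace.
        return ToyNamespace


class ToyNamespace:
    """Hypothetical namespace exposing only asarray."""

    @staticmethod
    def asarray(obj):
        return obj if isinstance(obj, ToyArray) else ToyArray(obj)


def coerce_input(x, dispatch_enabled):
    # Same branching as the pattern above, with the switch as a parameter.
    if dispatch_enabled and hasattr(x, "__array_namespace__"):
        xp = x.__array_namespace__()
    else:
        xp = np
    return xp.asarray(x)


x = ToyArray([1.0, 2.0, 3.0])
print(type(coerce_input(x, dispatch_enabled=False)).__name__)  # ndarray
print(type(coerce_input(x, dispatch_enabled=True)).__name__)   # ToyArray
```

With the switch off, downstream code keeps receiving a numpy.ndarray; with it on, the same call hands back a ToyArray, so any user code that relied on NumPy-specific attributes of the result would be affected by a change of default.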