Hello, we (Ralf Gommers and I) wrote a blog post about support for multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image, and other similar libraries. Ralf has started the conversation about the overall approach and proposed design here: A proposed design for supporting multiple array types across SciPy, scikit-learn, scikit-image and beyond
In this thread we’d like to discuss the following:
- Should we enable or disable new dispatch mechanisms by default?
- How to control the new dispatch behavior?
Scikit-learn and SciPy convert any input data structure implementing a duck array to a NumPy array (using numpy.asarray, which relies on __array__ et al.) to ensure consistent behavior. We want to replace the call to numpy.asarray with __array_namespace__().asarray to enable the new dispatch mechanism.
Should the new behavior be enabled by default? If it’s enabled by default by a library (and hence opt-out for users, or maybe not even giving users a choice), is it the library that implements the n-D array object that controls this enabling or is it individual array-consuming libraries?
I think array libraries can safely add the __array_namespace__ method to their array objects, indicating adoption of the Array API, and auto-register backends for the uarray-based implementation. This is just a signal to users and array-consuming libraries that dispatching is supported; how to make use of that capability is up to the array consumers.
Enabled by default or not, I think we should add the ability to control the behavior, for example with a global flag that turns the feature on (or off). This flag could be implemented per function or per module, but I think overly granular control would be cumbersome. That's why the proposal is to have one global switch per package, and then separately document which parts of the library the switch affects. For inputs that don't implement the __array_namespace__ method, we would preserve the current behavior, that is, converting to a NumPy array.
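One way such a per-package switch could look is sketched below. This is only an illustration under assumptions: the names `set_dispatch`, `dispatch_enabled`, `dispatch_context`, `uses_dispatch`, and `ToyDuck` are hypothetical and not part of any existing library, and the default of "off" is chosen arbitrarily. A plain module-level value is used here, so it is not thread-local.

```python
import contextlib

# Hypothetical package-level switch; the name and the default are assumptions.
_config = {"dispatch_enabled": False}


def set_dispatch(enabled):
    """Turn the dispatch mechanism on or off for the whole package."""
    _config["dispatch_enabled"] = bool(enabled)


def dispatch_enabled():
    """Return the current value of the package-wide switch."""
    return _config["dispatch_enabled"]


@contextlib.contextmanager
def dispatch_context(enabled):
    """Temporarily override the switch, restoring the old value on exit."""
    previous = _config["dispatch_enabled"]
    _config["dispatch_enabled"] = bool(enabled)
    try:
        yield
    finally:
        _config["dispatch_enabled"] = previous


def uses_dispatch(x):
    """Dispatch only if the switch is on AND x advertises a namespace."""
    return dispatch_enabled() and hasattr(x, "__array_namespace__")


class ToyDuck:
    """Toy object that only advertises the __array_namespace__ method."""

    def __array_namespace__(self, api_version=None):
        return None  # a real array library would return its namespace here
```

With something like this, a user could opt in for a single block of code via `with dispatch_context(True): ...`, while inputs without `__array_namespace__` (a plain list, say) would keep following the existing NumPy conversion path regardless of the switch.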
Let’s say we name the global switch DISPATCH_ENABLED in the sklearn package. Then, no matter what the default value of the switch is, the pattern to support multiple array libraries would be something like:
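import numpy
import sklearn

def some_sklearn_func(x):
    # Use the input's own namespace only if dispatch is on and supported.
    if sklearn.DISPATCH_ENABLED and hasattr(x, '__array_namespace__'):
        xp = x.__array_namespace__()
    else:
        # Current behavior: fall back to NumPy.
        xp = numpy
    x_array = xp.asarray(x)
    # ...
    return result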
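To make the effect of the pattern concrete, here is a small self-contained sketch that can be run without any non-NumPy array library. Everything in it other than NumPy itself is a hypothetical stand-in: `settings.DISPATCH_ENABLED` plays the role of the proposed `sklearn.DISPATCH_ENABLED`, and `ToyArray` / `ToyNamespace` imitate a duck array and its namespace with just enough behavior for the demonstration.

```python
import types

import numpy as np

# Stand-in for the hypothetical sklearn.DISPATCH_ENABLED attribute.
settings = types.SimpleNamespace(DISPATCH_ENABLED=True)


class ToyNamespace:
    """Toy stand-in for an array namespace (only asarray is provided)."""

    @staticmethod
    def asarray(obj):
        return obj if isinstance(obj, ToyArray) else ToyArray(obj)


class ToyArray:
    """Toy duck array: convertible to NumPy, but also advertises a namespace."""

    def __init__(self, data):
        self.data = list(data)

    def __array__(self, dtype=None, copy=None):
        # Called by numpy.asarray when coercing to numpy.ndarray.
        return np.array(self.data, dtype=dtype)

    def __array_namespace__(self, api_version=None):
        return ToyNamespace


def get_namespace(x):
    """Pick the namespace following the pattern shown above."""
    if settings.DISPATCH_ENABLED and hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np


def passthrough(x):
    """Minimal array-consuming function: convert the input and return it."""
    xp = get_namespace(x)
    return xp.asarray(x)
```

With the switch on, `passthrough(ToyArray([1, 2, 3]))` stays a `ToyArray`; after setting `settings.DISPATCH_ENABLED = False`, the same call goes through `numpy.asarray` and comes back as a `numpy.ndarray`. A plain list is converted to a `numpy.ndarray` in both cases.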
Similarly, the uarray-based dispatching would check the global switch DISPATCH_ENABLED and the presence of the __array_namespace__ method.
Having the dispatching enabled by default seems to be a good idea, but it gives rise to backward compatibility concerns. If and when an individual array-consuming library sets the default value of its switch to ENABLED, this will change the behavior for all users from coercing to a numpy.ndarray to returning the same type of array as the input. This is technically a break; in most projects, the impact may be small and acceptable in exchange for the desired user experience in the long term, but that is up to the maintainers of each individual library.