Default dispatching behavior for supporting multiple array types across SciPy, scikit-learn, scikit-image

Hello, we (Ralf Gommers and I) wrote a blog post about supporting multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image, and other similar libraries. Ralf has started the conversation about the overall approach and proposed design here: A proposed design for supporting multiple array types across SciPy, scikit-learn, scikit-image and beyond

In this thread we’d like to discuss the following:

  • Should we enable or disable new dispatch mechanisms by default?
  • How to control the new dispatch behavior?

Scikit-learn and SciPy currently convert any input data structure that implements a duck-array interface to a NumPy array (using numpy.asarray, which relies on __array__ et al.) to ensure consistent behavior. We want to replace the call to numpy.asarray with __array_namespace__().asarray to enable the new dispatch mechanism.

Should the new behavior be enabled by default? If a library enables it by default (and hence makes it opt-out for users, or maybe gives users no choice at all), is it the library that implements the n-D array object that controls this enabling, or is it the individual array-consuming libraries?

I think array libraries can safely add the __array_namespace__ method to their array objects, indicating adoption of the Array API, and auto-register backends for uarray-based implementations. It’s just a signal to users and array-consuming libraries that dispatching is supported; how to make use of that capability is up to the array consumers. A minimal sketch of what such a signal could look like follows.
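
For illustration only: here is a minimal sketch of an array object advertising Array API support. MyArray is a made-up class, and NumPy’s experimental numpy.array_api module (available in NumPy 1.22+) is used as a stand-in; a real array library would return its own Array API-compatible namespace.

import numpy.array_api  # NumPy's reference Array API namespace

class MyArray:
    # Hypothetical array object from some array-providing library.
    def __init__(self, data):
        self._data = data

    def __array_namespace__(self, *, api_version=None):
        # Signal Array API adoption by returning the library's
        # Array API-compatible namespace (a stand-in is used here).
        return numpy.array_api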

Enabled by default or not, I think we should add the ability to control the behavior, for example with a global flag that turns the feature on (or off). The flag could be implemented per function or per module, but such granular control may be too cumbersome; that’s why the proposal is to have one global switch per package, and to separately document which parts of the library the switch affects. For inputs that don’t implement the __array_namespace__ method, we would preserve the current behavior of converting to a NumPy array.
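
As an illustration, such a per-package switch could look roughly like the sketch below; the module layout and the set_dispatch_enabled / dispatch_enabled names are made up, not an existing scikit-learn API.

# Hypothetical package-level switch, e.g. in a _config.py module.
import contextlib

DISPATCH_ENABLED = False  # whichever default the library chooses

def set_dispatch_enabled(value):
    global DISPATCH_ENABLED
    DISPATCH_ENABLED = bool(value)

@contextlib.contextmanager
def dispatch_enabled(value=True):
    # Temporarily flip the switch, e.g. to try out the new code path.
    global DISPATCH_ENABLED
    previous, DISPATCH_ENABLED = DISPATCH_ENABLED, bool(value)
    try:
        yield
    finally:
        DISPATCH_ENABLED = previous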

Let’s say we name the global switch DISPATCH_ENABLED in the sklearn package. Then, no matter what the default value of the switch is, the pattern to support multiple array libraries would be something like:

import numpy

def some_sklearn_func(x):
    # Dispatch to the input's own array namespace only when the switch is
    # on and the input advertises Array API support.
    if sklearn.DISPATCH_ENABLED and hasattr(x, '__array_namespace__'):
        xp = x.__array_namespace__()
    else:
        xp = numpy
    x_array = xp.asarray(x)
    # ... all further computation uses xp, so the result stays in the
    # input's array library
    return result

Similarly, the uarray-based dispatching would check the global switch DISPATCH_ENABLED and the presence of the __array_namespace__ method.
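
For reference, a uarray backend performing such a check could look roughly like the sketch below; the domain name and the sklearn.DISPATCH_ENABLED flag are assumptions from this proposal, not existing APIs.

import sklearn  # assumes the proposed sklearn.DISPATCH_ENABLED flag exists

class _ArrayAPIDispatchBackend:
    # Hypothetical uarray backend: dispatch through the input's Array API
    # namespace when allowed, otherwise let uarray fall back to the
    # default NumPy-based implementation.
    __ua_domain__ = "numpy.sklearn"  # made-up domain name

    @staticmethod
    def __ua_function__(method, args, kwargs):
        x = args[0] if args else None
        if sklearn.DISPATCH_ENABLED and hasattr(x, '__array_namespace__'):
            xp = x.__array_namespace__()
            return getattr(xp, method.__name__)(*args, **kwargs)
        return NotImplemented  # let uarray try the next registered backend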

Having the dispatching enabled by default seems like a good idea, but it raises backward compatibility concerns. If and when an individual array-consuming library sets the default value of the switch to enabled, this will change the behavior for all users from coercing to a numpy.ndarray to returning the same type of array as the input. This is technically a break; in most projects the impact may be small and acceptable to get the desired user experience in the long term, but that is then up to the maintainers of each individual library.


Hi, first of all, thanks for putting this together and helping move this forward!

Preface: this is the first time I am looking at all of this, so I might say something that does not make sense.

I think this should be configurable but enabled by default. Once we advertise it, a lot of users will want to try it, and enabling it by default would make wide adoption easier. For a lot of them, all of this will be new anyway, and they will be willing to do more experimenting and adjusting for the advertised gains.

On coercing the type of the output to the type of the input: if a function returns a modified version of the input array, it is natural to me to preserve the type. The issue is for cases where we return an array that is not directly linked to the input. There I can see how this could be confusing (thinking about summary statistics, for instance, which could be small arrays). But there are going to be a lot of different use cases here.

I see a few options:

  • Do nothing and advertise the new behaviour in the docs. I think the overall impact should be contained. This is my preference, see below.
  • Ensure the current return type is preserved, with a decorator or something else depending on the case, and emit a deprecation warning when another backend is used. That would be annoying, though, so it should be configurable; see the sketch after this list.
  • Add an environment variable to control the return type. I don’t think we want that; it adds extra complexity.
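
To make the second option concrete, here is a rough sketch of what such a decorator could look like; preserve_numpy_return is a made-up name, not an existing scikit-learn or SciPy API.

import functools
import warnings
import numpy

def preserve_numpy_return(func):
    # Hypothetical decorator: keep returning numpy.ndarray for now and
    # warn when the result was produced by another backend.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if not isinstance(result, numpy.ndarray):
            warnings.warn(
                "This function will start returning arrays of the same "
                "type as its input; coercing back to numpy.ndarray for now.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Some libraries (e.g. GPU arrays) may need an explicit
            # device-to-host transfer instead of numpy.asarray.
            result = numpy.asarray(result)
        return result
    return wrapper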

I also mind a bit less about potentially breaking things here, since moving data from one backend to another can be costly. Breaking users’ code a little could help pass along the message: if you have an array on your GPU and send it to SciPy, it will now return something that is still on the GPU. But that is for your own good, because you can keep working with it on the GPU without going back to the CPU and then to the GPU again. Some people would not realize this at all if everything were transparent.
