Default dispatching behavior for supporting multiple array types across SciPy, scikit-learn, scikit-image

Hello, we (Ralf Gommers and I) wrote a blog post about support for multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image, and other similar libraries. Ralf has started the conversation about the overall approach and proposed design here: A proposed design for supporting multiple array types across SciPy, scikit-learn, scikit-image and beyond

In this thread we’d like to discuss the following:

  • Should we enable or disable new dispatch mechanisms by default?
  • How to control the new dispatch behavior?

Scikit-learn and SciPy convert any array-like input to a NumPy array (using numpy.asarray, which relies on __array__ and related protocols) to ensure consistent behavior. We want to replace the call to numpy.asarray with __array_namespace__().asarray to enable the new dispatch mechanism.

Should the new behavior be enabled by default? If it’s enabled by default by a library (and hence opt-out for users, or perhaps not user-controllable at all), is it the library that implements the n-D array object that controls this, or is it the individual array-consuming libraries?

I think array libraries can safely add the __array_namespace__ method to their array objects, indicating adoption of the array API standard, and auto-register backends for uarray-based implementations. It’s just a signal to users and array-consuming libraries that dispatching is supported; how to make use of that capability is up to the array consumers.

Enabled by default or not, I think we should add the ability to control the behavior using, for example, a global flag that turns the feature on or off. This flag could be implemented per function or per module, but overly granular control could become cumbersome, which is why the proposal is to have one global switch per package and then separately document which parts of the library the switch affects. For inputs that don’t implement the __array_namespace__ method, we would preserve the current behavior of converting to a NumPy array.

Let’s say we name the global switch DISPATCH_ENABLED in the sklearn package. Then, regardless of the switch’s default value, the pattern to support multiple array libraries would be something like:

import numpy
import sklearn

def some_sklearn_func(x):
    # Dispatch only when the switch is on and the input supports the array API.
    if sklearn.DISPATCH_ENABLED and hasattr(x, '__array_namespace__'):
        xp = x.__array_namespace__()  # namespace of the input's own array library
    else:
        xp = numpy                    # current behavior: coerce to NumPy
    x_array = xp.asarray(x)
    # ... compute `result` using functions from `xp` ...
    return result

Similarly, uarray-based dispatching would check the global DISPATCH_ENABLED switch and the presence of the __array_namespace__ method.
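
For context, this kind of explicit opt-in already exists today in the uarray-based scipy.fft module, where the user selects a backend by hand. Here is a rough illustration (assuming CuPy and a CUDA device are available) of the behavior that the global switch plus the __array_namespace__ check would automate:

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

a = cp.random.random(1024)          # array already on the GPU
with scipy.fft.set_backend(cufft):
    b = scipy.fft.fft(a)            # dispatched to the CuPy backend
print(type(b))                      # cupy.ndarray, i.e. the result stays on the GPU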

Having the dispatching enabled by default seems like a good idea, but it gives rise to backward compatibility concerns. If and when an individual array-consuming library sets the switch to enabled by default, this will change the behavior for all users from coercing to a numpy.ndarray to returning the same type of array as the input. This is technically a break; in most projects the impact may be small and acceptable in order to get the desired user experience in the long term, but this is then up to the maintainers of each individual library.


Hi, first of all, thanks for putting this together and helping move this forward!

Preface: this is the first time I am looking at all of this, so I might say something that does not make sense.

I think this should be configurable but enabled by default. Once we advertise it, a lot of users will want to try it, and enabling it by default should make wide adoption easy. For a lot of them all of this will be new, and they will be willing to do some experimenting and adjusting for the advertised gains.

On coercing the type of the output to the type of the input: if a function returns a modified version of the input array, it seems natural to me to preserve the type. The issue is with cases where we return an array that is not directly linked to the input; there I can see how this could be confusing (think of summary statistics, for instance, which could be small arrays). But there are going to be a lot of different use cases here.

I see a few options

  • Do nothing and advertise the new behaviour in the docs. I think the overall impact should be contained. This is my preference; see below.
  • Ensure the current return type is preserved, with a decorator or something else depending on the case, and emit a deprecation warning when another backend is used (see the sketch after this list). But this would be annoying, so it should be configurable.
  • Add an environment variable to control the return type. I don’t think we want that; it adds extra complexity.
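
To make the second option a bit more concrete, here is a minimal sketch, with a hypothetical decorator name preserve_numpy_return (this is just my interpretation, not an existing scikit-learn or SciPy utility):

import functools
import warnings
import numpy as np

def preserve_numpy_return(func):
    # Hypothetical decorator for option 2: keep returning numpy.ndarray for now,
    # but warn that the return type will follow the input type in the future.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, np.ndarray):
            return result
        warnings.warn(
            f"{func.__name__} will return a {type(result).__name__} instead of "
            "a numpy.ndarray in a future release.",
            FutureWarning,
        )
        # Converting back may require an explicit device transfer for GPU arrays.
        return np.asarray(result)
    return wrapper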

I also mind a bit less about potentially breaking things here, since moving data from one backend to another can be costly. Breaking users’ code a little could help pass on the message: if you have an array on your GPU and send it to SciPy, it will now return something that is still on the GPU. But that’s for your own good, since you can keep working with it on the GPU without round-tripping to the CPU and back. Some people would not realize this at all if everything were transparent.


Hi Ivan. I have been reading all the documents (NEP 37, NEP 47, uarray docs, Array API standard, the relevant threads on this forum) related to this work. Thanks for helping me out with that.

Now, coming to the point of discussion. The first question is interesting and a little tricky, I think. Let’s consider the first case, that is, enabling by default. Will it be backwards incompatible? That depends on what happens right now. Taking your example, if some_sklearn_func(pytorch.Tensor) raises an error now but starts to work after enabling the new behaviour by default, then I think that would be good for end users. And if nobody is using non-NumPy arrays with some_sklearn_func-style functions at any significant scale right now, then enabling dispatching by default won’t have much negative impact. Basically, a small amount of collateral damage for the greater good?

Now the second case: making it configurable. That’s the most diplomatic approach, I’d say: enable by default but give users a switch to turn it off. This brings us to the next question: where to put that switch? A global one for the whole library or module might be risky, IMO. The reason I think so is that if I import sklearn and then launch two threads, and sklearn.DISPATCH_ENABLED is writable, one thread might want to switch it off while the other wants it on. If that is possible, the behaviour might become undefined, leading to inconsistencies. I am not sure whether this is a practical case to consider.

The other thing to consider is what happens if I want dispatching enabled for some functions but not for others. Maybe creating switches local to functions would be good. I think that can be done via uarray (e.g. by adding an extra attribute as a switch to the multimethod object and then using it to perform the dispatch)? Though that would have to be exposed to the end user and be writable.

Please feel free to correct me if you find anything I have written to be wrong or absurd. Thanks.

I agree with going with a global configuration option. scikit-learn already has a global configuration which we made thread-local, so different threads can modify the state without interfering with each other. We also explicitly copy the configuration over when spawning processes with our joblib backends.
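
For illustration, here is roughly how that thread-local behaviour looks with the existing configuration API, using the real assume_finite flag as a stand-in for a future dispatch flag (the dispatch flag itself is only an assumption here):

import threading
import sklearn

def worker(enable):
    # Each thread works with its own copy of the configuration.
    with sklearn.config_context(assume_finite=enable):
        print(threading.current_thread().name,
              sklearn.get_config()["assume_finite"])

threads = [threading.Thread(target=worker, args=(flag,)) for flag in (True, False)]
for t in threads:
    t.start()
for t in threads:
    t.join()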

The default value for the configuration will depend on how each library feels about backward compatibility, and I suspect libraries will make different choices.


I agree that enabling by default might provide a better user experience overall. However, as Thomas points out, there might be backward compatibility concerns. See, for example, the “Why __array_function__ hasn’t been enough” section in NEP 37.

Sure, it should be mostly avoided.

If such a switch is added to scikit-learn then, as Thomas points out, they already have infrastructure in place for global configuration flags.

That will certainly be the case during the development stage, as we can’t cover the whole library in one go, so the boundaries of support should be clearly documented.


For me the problem here is mainly about how to transition users, and not so much about the choice itself. The choice is important, but it is one that libraries can make individually; the dispatching system could try to make it easy or provide some unified API, but that is probably not even necessary.

It is also half orthogonal to the transitioning problem: a library may disable this by default even while JAX is trying to transition its users to a world where it is enabled by default.

I think there are three approaches to transitioning:

  1. Nobody actually cares: JAX was worried about it, but in practice nobody complained, e.g., about __array_function__ or __array_ufunc__ suddenly returning the correct type.
    Thus, sklearn will have an opt-in for a while, but it is really more of a “test phase” thing than a transitioning solution.
  2. The backend provider should transition: If JAX (or someone for JAX) registers an implementation, that implementation should be opt-in only for a transitioning period. During that period we will give a FutureWarning that the result type will be a JAX array in the future.
  3. The library transitions to a type-strict future: whenever an array-like (which is not already a NumPy array) is passed in and converted using np.asarray(), the library transitions to giving a warning or a TypeError (see the sketch after this list). This ensures that if JAX adds a new backend, users either already got an error, or were warned that a change in result type is possible.
    In some sense, I think there may be agreement about this, but the question is how convenient it actually is for end users who are used to everything “just working” thanks to the fact that everything can be coerced to a NumPy array.
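
A minimal sketch of the check in point 3, with a hypothetical helper name _as_numpy_strict (in practice this would live inside each function that currently coerces its inputs):

import warnings
import numpy as np

def _as_numpy_strict(x, func_name):
    # Hypothetical helper: warn whenever a non-NumPy input is silently coerced,
    # so users are prepared for the result type to change in the future.
    if not isinstance(x, np.ndarray):
        warnings.warn(
            f"{func_name} currently converts its input to a numpy.ndarray; in a "
            "future release the result may instead match the input's array type. "
            "Pass a NumPy array explicitly to keep the current behaviour.",
            FutureWarning,
        )
    return np.asarray(x)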

(I do think that these apply just as much to __array_function__; we just never developed the solutions for it there. __array_module__, as opposed to __array_function__, more clearly moved the transition burden to the library rather than the array object provider. That made point 3 easier and is a bit better in general, but did not quite remove the “transitioning problem”.)

If such a switch is added to scikit-learn then, as Thomas points out, they already have infrastructure in place for global configuration flags.

Awesome. Sounds good.