A proposed design for supporting multiple array types across SciPy, scikit-learn, scikit-image and beyond

Hi all, Ivan Yashchuk and I - with input from maintainers of all the libraries involved - wrote a blog post about a topic that is hopefully exciting to many of you: making support for multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image and other such libraries a reality. Here is the post: A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond | Quansight Labs

Because this topic spans multiple libraries at the core of the PyData ecosystem, we figured this forum is the best place to discuss the big picture. Topics specific to one project can be moved to that project’s mailing list or GitHub repo. We’d like to pose a couple of questions we think are important to discuss, and we invite feedback on the overall vision and proposed design.

Questions we’d like to cover:

  1. Decision making on this proposal
  2. How to delineate what is supported functionality for “multiple array types work” for a given library
  3. A few key technical questions
  4. Testability & maintainability

Decision making on this proposal

Meta-question regarding decision making: what would you like to see in terms of further design docs, and how do we agree those are good enough? For example, we could write a single SPEC document after the initial discussion here settles, then give that a “go ahead”, merge PRs that make sense for a project, and revisit the overall design once we have working code covering SciPy, scikit-learn and scikit-image before finalizing the SPEC.

How to delineate what is supported functionality

Let’s start with a statement that is hopefully uncontroversial: it must be easy for users to understand what parts of a library support multiple array types. It shouldn’t be a patchwork of things that do and don’t work.

There are multiple ways to go about this. What I was thinking is:

  • Implement support for a logical group of functionality completely before merging it into a library (e.g., “all of scipy.ndimage is now supported”)
  • For a very large submodule this may be more granular, but it should still make sense. E.g., “decompositions in scipy.linalg are supported, matrix functions are not” (see the scipy.linalg docs)
  • Documentation should have a uniform way of noting whether a function or submodule supports dispatching, for example via a custom Sphinx directive that can be put in the Notes section of a docstring (see the sketch after this list). This could become part of the numpydoc format.
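
To make that concrete, here is a minimal sketch of what such a docstring could look like. The supports-array-types directive name is hypothetical; it would have to be defined as a custom Sphinx directive and added to the numpydoc conventions:

def some_func(x):
    """Do something with an array.

    Parameters
    ----------
    x : array_like
        Input array.

    Notes
    -----
    .. supports-array-types::

       Dispatches to other array libraries via __array_namespace__;
       alternative backends can be selected via uarray.
    """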

A few key technical questions

As already mentioned in the blog post, the most tricky question is probably what the default behavior should be for the dispatching that’s happening, and how users opt into or opt out of that behavior. Ivan will start a separate thread on this forum to discuss that, because it’s a complex topic where there will very likely be different opinions.

The second key question is the choice of dispatching methods, namely __array_namespace__ and uarray. The blog post already goes into a bit of detail, but focuses more on what the overall design should achieve than on detailed arguments for or against these methods. So let’s give a couple of reasons for each. For __array_namespace__ (a short usage sketch follows this list):

  • Let’s start by linking to NEP 37, which gives a number of reasons why __array_function__ hasn’t been enough. __array_namespace__ is basically the worked-out version of __array_module__ from that NEP.
  • The array API standard (see its docs) has been designed with this portability in mind: it has a well-specified API that will work for GPU execution, JIT compiling, etc., and a reasonably sized, clearly specified API surface. It is supported in the upcoming releases of NumPy and CuPy, and PyTorch will follow (others are still thinking/discussing).
  • It is not tied to a single preferred array library. For example, libraries that are now built on top of PyTorch, JAX or CuPy can adopt __array_namespace__ too (which is not the case for dispatch methods that live only in NumPy).
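
Here is that short usage sketch. It assumes the input array comes from a library that implements __array_namespace__ (per the bullet above: upcoming NumPy and CuPy releases, with others to follow):

def scale_to_unit_variance(x):
    # Retrieve the array API namespace from the input array; this works
    # for any array library implementing __array_namespace__.
    xp = x.__array_namespace__()
    # mean() and std() are part of the array API standard, so this runs
    # unchanged on NumPy, CuPy, or any other conforming library.
    return (x - xp.mean(x)) / xp.std(x)

The result stays in the input’s array library (and on its device); no conversion to NumPy happens anywhere.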

For uarray:

  • It is the only method that meets all the goals/wishes outlined in the blog post. The other multiple-dispatch methods cannot (a) provide a way to select a different backend for the same array type, or (b) support “array duck types”.
  • It is already used by scipy.fft; over the summer a PR for scipy.ndimage was written (still unmerged), as well as the start of one for scipy.linalg. A minimal sketch of the pattern follows this list.
  • We could use another method for some modules where we are confident we don’t need the functionality in the first bullet (if such a method exists), but in any case it seems undesirable to have two separate systems doing the same job.
  • It has other functionality that we are likely to need, such as coercing scalar/sequence inputs to arrays, which was implemented with use cases like these in mind. Libraries that dispatch only on types look simpler in comparison, but lack such features.
  • It was the joint preference of the group of people we did some initial brainstorming and planning with, including multiple maintainers: Gregory Lee (SciPy, scikit-image), Thomas Fan (scikit-learn), Peter Bell (SciPy), and myself (NumPy, SciPy).
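
Here is the promised minimal sketch of the uarray pattern, modeled on scipy.fft; the domain name, multimethod and backend below are illustrative, not actual SciPy code:

import numpy as np
import uarray as ua

def _x_replacer(args, kwargs, dispatchables):
    # Put the (possibly coerced) array back into the arguments that are
    # passed on to the backend implementation.
    return (dispatchables[0],) + args[1:], kwargs

def _fft_extractor(x, n=None):
    # Mark x as the argument that backends dispatch (and may convert) on.
    return (ua.Dispatchable(x, np.ndarray),)

# The dispatching front-end function, registered under an example domain.
fft = ua.generate_multimethod(_fft_extractor, _x_replacer, domain="example.fft")

class NumpyBackend:
    __ua_domain__ = "example.fft"

    @staticmethod
    def __ua_function__(method, args, kwargs):
        # Trivial backend: there is only one multimethod in this sketch.
        return np.fft.fft(*args, **kwargs)

with ua.set_backend(NumpyBackend()):
    out = fft(np.ones(8))

A CuPy or PyTorch backend would look the same, with __ua_function__ routing to its own implementation (and optionally a __ua_convert__ method to coerce inputs).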

We should mention some downsides of uarray as well; these come to mind:

  • more lines of code per API function covered
  • the docs are too terse and need improving

The opt-in vs. opt-out question is not really a downside of uarray at this point. For context, it was discussed a lot ~2 years ago when comparing uarray with __array_function__, but that framing is the wrong way around: we should first decide what the desired semantics are (in the other thread on this forum, as mentioned above), and then uarray can be made to fit those semantics.

Testability

It seems like a given that this code must be testable. For the array API standard, there’s a portable test suite: GitHub - data-apis/array-api-tests: Test suite for the PyData Array APIs standard. It is already used by CuPy in CI, and NumPy testing is done on the array-api-tests repo itself. That test suite should be able to run against any conforming library, and given that the API is well-defined and relatively small, this is probably fine as is. When putting a dispatch layer in front of a set of functions though, that needs separate testing. Testing the dispatching itself is not difficult; for example, it takes ~150 lines of code for scipy.fft.
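
To illustrate, a dispatch test in the style of scipy.fft can be as simple as registering a mock backend and asserting that it gets used. The mock below is a sketch, not SciPy’s actual test code:

import numpy as np
from scipy import fft

class _MockBackend:
    # Records which multimethods were dispatched to this backend,
    # and returns a sentinel value instead of computing anything.
    __ua_domain__ = "numpy.scipy.fft"
    called = []

    @staticmethod
    def __ua_function__(method, args, kwargs):
        _MockBackend.called.append(method.__name__)
        return "mock-output"

def test_fft_dispatches_to_backend():
    with fft.set_backend(_MockBackend(), only=True):
        assert fft.fft(np.ones(8)) == "mock-output"
    assert "fft" in _MockBackend.called

The more interesting question though is: is that enough? Will it cover real-world code where one, for example, feeds CuPy arrays to SciPy and scikit-learn code?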

The tentative answer, based on experience and the relative lack of issues with the __array_ufunc__, __array_function__ and scipy.fft machinery, is yes - at least for dispatching on the top-level function call. For something like this:

def some_sklearn_func(x):
    # A bunch of pure Python code
    # ...
    # Some array API usage (__array_namespace__)
    # ...
    # Maybe a call to some other code dispatching via uarray
    return result

Are we completely confident that we can refactor that code and not break anything? It looks like that needs some dedicated testing, because if new code that isn’t yet covered by CuPy is introduced somewhere in the middle, then existing user code like some_sklearn_func(x_cupy) will break. This was always a potential problem with __array_function__ et al., and in practice it hasn’t been much of an issue as far as I know. But it remains technically undertested, so perhaps more is needed here?

Maintainability

None of us have a crystal ball, so this is a tricky topic. We suspect that the code will be relatively low-overhead to maintain, and worth it for the benefits we get - but it’s hard to be sure. So this is more of an open question; based on experience with other dispatching methods, and on the code and PRs linked above, are there general or specific worries about longer-term maintainability?

One thing that may be worth discussing is API stability. Projects implementing a backend for a particular API effectively commit to following the choices of the project that owns that API, including backwards-incompatible changes and API extensions. This seems to be working well for CuPy and Dask already, but for other libraries that like making their own choices there may be some trade-offs. Or perhaps they should get more involved in the design of the base library?

Ralf, thanks very much for writing this up. This post triggered a lot of conversation, and I think we’re all gathering our thoughts before responding. Just wanted you to know that it’s been seen and is being thought about!

Thanks @stefanv.

I’ve gotten a few questions in private as well, about this topic and also recently about the NumPy SIMD work - both involving a meta topic: collaboration between companies and the community. So I wanted to follow up with a few thoughts on that.

Companies are often interested in code/features/packages that help them - they typically need a good business reason to spend resources or funds. For multiple array types and the kind of dispatch/backend system we’re talking about here, that reason is often hardware-related: supporting new hardware, or making widely used code run faster on their hardware. This work often ends up being done in isolation or as a new project, because it’s nontrivial and time-consuming to interact with community projects. A few examples (not picking on these projects specifically, they’re just the first that came to mind):

  • RAPIDS reimplements large parts of scikit-learn, scikit-image and SciPy in separate packages to support CUDA. An NVIDIA project to support NVIDIA hardware.
  • scikit-learn-intelex re-implements parts of scikit-learn - an Intel project optimized for Intel CPUs.
  • NEP 36 was written largely because Anaconda defaults contained modified numpy (and IIRC also scikit-learn) packages for optimized performance on x86_64.

The above projects were done without much interaction with the relevant community projects, and a similar argument can be made about the earlier reimplementations of NumPy-like libraries. Sometimes that’s for good reason: if you have big plans and a few to a few dozen engineers on a project (even 100+ engineers in a few cases), it’s not going to work well to route everything through a project with a few handfuls of mostly-volunteer maintainers. But it doesn’t have to be that way - in many cases it will make more sense to collaborate.

Recently I’ve seen encouraging engagement from Intel, Huawei, IBM and Apple in NumPy - with performance optimizations for the specific hardware each company cared about. This is made possible by a structure like this:

  • Generic infrastructure
    • Backend 1 for hardware type 1
    • Backend 2 for hardware type 2
    • etc.

The NumPy universal SIMD framework proved to work nicely for that case: a lot of code is general-purpose and fast enough, and where needed, architecture-specific (AVX512, Neon, VSX, etc.) optimizations can be added incrementally. In the case of SciPy/sklearn/skimage APIs, something similar should be possible. Where there is compiled code, it’s hardware-specific and therefore needs to be reimplemented; where it’s Python, it can often be reused. In the end that allows higher-level packages and end users to write code using only their familiar PyData project APIs, rather than:

if device == CPU:
    from scipy import submodule
    submodule.func(x)
elif device == CUDA:
    import cusubmodule
    cusubmodule.func(x)
# Maybe need to add ROCm, SYCL, etc. - and then deal with
# optimized CPU backends too. Too complex, so won't happen.
else:
    raise TypeError("this thing is not supported!")

(the above can also be written as if check_isnumpy(x): ... elif check_istorch(x): ... elif check_iscupy(x): ... etc.)

My hope is that if we get more “generic infrastructure + hardware-specific backends”, we will see (a) more contributions - both engineering time and direct funding - from companies to community projects, and (b) more higher-level projects that work out of the box with multiple array types. That seems preferable to an ecosystem where the largest tech companies continue to build their own tech stacks in isolation.

I hope the above is helpful. If it seems of interest and broader than the initial post in this thread, happy to split it off as a separate conversation.

Cross-linking the scikit-image thread on this.