Hi all, Ivan Yashchuk and I - with input from a number of maintainers of all libraries involved - wrote a blog post about a topic that is hopefully exciting to many of you: making support for multiple (non-NumPy) array types in SciPy, scikit-learn, scikit-image and other such libraries a reality. Here is the post: A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond | Quansight Labs
Because this topic spans across multiple libraries at the core of the PyData ecosystem, we figured that this forum is the best place to have a discussion about the big picture here. Topics specific to one project can be moved to that project’s mailing list or GitHub repo. We’d like to pose a couple of questions we think are important to discuss, and invite feedback on the overall vision and proposed design.
Questions we’d like to cover:
- Decision making on this proposal
- How to delineate what is supported functionality for “multiple array types work” for a given library
- A few key technical questions
- Testability & maintainability
Meta-question regarding decision making: what would you like to see in terms of further design docs, and how do we agree those are good enough? For example, we could write a single SPEC document after the initial discussion here settles, then give that a “go ahead”, merge PRs that make sense for a project, and revisit the overall design once we have working code covering SciPy, scikit-learn and scikit-image before finalizing the SPEC.
Let’s start with a statement that is hopefully uncontroversial: it must be easy for users to understand what parts of a library support multiple array types. It shouldn’t be a patchwork of things that do and don’t work.
There are multiple ways to go about this. What I was thinking is:
- Implement something for a logical group of functionality completely, before merging it into a library (e.g., "all of `scipy.ndimage` is now supported")
- For a very large submodule it may be more granular, but it should still make sense, e.g., "decompositions in `scipy.linalg` are supported, matrix functions are not"
- Documentation should have a uniform way of noting whether a function or submodule supports dispatching, for example via a custom Sphinx directive that can be put in the Notes section of a docstring. Have this be part of the numpydoc format.
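As a sketch of what such a docstring note could look like (the `supports-array-types` directive name is hypothetical; no such numpydoc directive exists yet, and the function below is just an illustrative stub):

```python
def correlate(in1, in2):
    """Cross-correlate two N-dimensional arrays (signature and body trimmed).

    Notes
    -----
    .. supports-array-types:: numpy, cupy, torch

       This function dispatches to the array library of its inputs.
    """
    # Illustrative stub only; a real implementation would go here.
    raise NotImplementedError
```

A custom directive like this could render as a uniform badge or admonition across all projects, so users can tell at a glance which functions participate in dispatching.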
As already mentioned in the blog post, the most tricky question is probably what the default behavior should be for the dispatching that’s happening, and how users opt into or opt out of that behavior. Ivan will start a separate thread on this forum to discuss that, because it’s a complex topic where there will very likely be different opinions.
Choice of dispatching methods, namely `__array_namespace__` and `uarray`. The blog post goes into a bit of detail already, but focused more on what should be achieved by the overall design rather than on detailed arguments for or against these methods. So let's give a couple of reasons for this choice:
- Let's start by linking to NEP 37, which gives a number of reasons why `__array_function__` hasn't been enough. `__array_namespace__` is basically the worked-out version of `__array_module__` from that NEP.
- The array API standard (see its documentation) has been designed with this portability in mind: it has a well-specified API that will work for GPU execution, JIT compilation, etc., and a reasonably sized, clearly specified API surface. It has support in NumPy and CuPy in their upcoming releases, and PyTorch will follow (others are still thinking/discussing).
- It is not tied to a single preferred array library. For example, if there are libraries built on top of PyTorch, JAX or CuPy, then `__array_namespace__` can be adopted by them as well (which is not the case for the other protocols, which live in NumPy).
- It is the only method that meets all the goals/wishes outlined in the blog post. The other multiple-dispatch methods can neither (a) provide a way to select a different backend for the same array type, nor (b) support "array duck types".
- It is already used by `scipy.fft`, and over the summer a PR for `scipy.ndimage` (unmerged) was written, as well as the start of one for another submodule.
- We could use another method for modules where we are confident we don't need the functionality in the first bullet (if such a method exists), but either way it seems undesirable to have two separate systems doing the same job.
- It has other functionality we are likely to need, such as coercing scalar/sequence inputs to arrays, which was implemented with use cases like these in mind. The libraries that dispatch only on types look quite simple in comparison, but are missing such features.
- It was the joint preference of the group of people we did some initial brainstorming and planning with - including multiple maintainers: Gregory Lee (SciPy, scikit-image), Thomas Fan (scikit-learn), Peter Bell (SciPy), and myself (NumPy, SciPy).
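To make the `__array_namespace__` pattern from the bullets above concrete, here is a minimal sketch (the function names are illustrative, not taken from any of the linked PRs) of how a library function can stay array-library-agnostic; the NumPy fallback covers arrays that predate the protocol:

```python
import numpy as np

def get_namespace(x):
    """Return the array API namespace for x, falling back to NumPy.

    Arrays implementing the standard expose __array_namespace__();
    older NumPy arrays do not, hence the fallback (illustrative helper).
    """
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np

def softmax(x):
    # All operations go through the retrieved namespace, so the same
    # code runs on any library implementing the array API standard.
    xp = get_namespace(x)
    e = xp.exp(x - xp.max(x))
    return e / xp.sum(e)
```

The key point is that no array library is hard-coded: pass in a CuPy or PyTorch array implementing the standard, and the computation runs on that library (and its devices) instead.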
We should mention some of the downsides of `uarray` as well; these come to mind:
- more lines of code per API function covered
- the docs are too terse and need improving
The opt-in vs. opt-out question is not really a downside of `uarray` at this point. For context, it was discussed a lot ~2 years ago when comparing with `__array_function__`, but that's the wrong way around: we should first decide what the desired semantics are (in the other thread on this forum, as mentioned above), and then `uarray` can be made to fit those semantics.
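For readers unfamiliar with the `uarray` machinery that `scipy.fft` already uses, here is a minimal sketch of a backend; `LoggingBackend` is a made-up example that merely observes the dispatch and then defers to SciPy's default backend (a real backend, e.g. CuPy's, would return its own result instead of `NotImplemented`):

```python
import numpy as np
import scipy.fft

class LoggingBackend:
    """A minimal uarray backend for the scipy.fft domain (illustrative).

    It records each multimethod that gets dispatched, then returns
    NotImplemented so uarray falls back to SciPy's default backend.
    """
    __ua_domain__ = "numpy.scipy.fft"
    calls = []

    @staticmethod
    def __ua_function__(method, args, kwargs):
        LoggingBackend.calls.append(getattr(method, "__name__", repr(method)))
        return NotImplemented  # decline, so the default backend handles it

x = np.arange(8, dtype=np.float64)
with scipy.fft.set_backend(LoggingBackend):
    y = scipy.fft.fft(x)  # routed through LoggingBackend, then the default

print(LoggingBackend.calls)
```

Whatever opt-in/opt-out semantics the other thread settles on would be expressed through exactly this kind of backend registration and the `set_backend` context manager.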
It seems like a given that this code must be testable. For the array API standard, there's a portable test suite: GitHub - data-apis/array-api-tests: Test suite for the PyData Array APIs standard. It is already used by CuPy in CI, and NumPy testing is done on the `array-api-tests` repo itself. That test suite should be able to run against any conforming library. Given that the API is well-defined and relatively small, this is probably fine as is. Putting a dispatch layer in front of a set of functions, though, needs separate testing. Testing the dispatching itself is not difficult; for example, it takes ~150 lines of code for `scipy.fft`. The more interesting question is: is that enough? Will it cover real-world code where one, for example, feeds CuPy arrays to SciPy and scikit-learn code?
The tentative answer, based on experience and the relative lack of issues with the `scipy.fft` machinery, is yes - at least for dispatching on the top-level function call. Consider something like this:
```python
def some_sklearn_func(x):
    # A bunch of pure Python code
    # ...
    # Some array API usage (__array_namespace__)
    # ...
    # Maybe a call to some other code dispatching via uarray
    return result
```
Are we completely confident that we can refactor that code and not break anything? It looks like that needs some dedicated testing - because if new code is introduced somewhere in the middle that isn't yet covered by CuPy, then existing user code like `some_sklearn_func(x_cupy)` will break. This was always a problem with `__array_function__` et al., and in practice it hasn't been much of an issue as far as I know. But it remains technically undertested, so perhaps more is needed here?
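One possible shape for such dedicated testing, sketched under the assumption that pushing each supported array type end-to-end through the public function is the check we want (`some_func` and the library list are placeholders, and libraries that aren't installed are simply skipped):

```python
import importlib

# Array libraries whose arrays we want to push end-to-end through the
# function under test; missing ones are skipped rather than failing CI.
LIBRARIES = ["numpy", "cupy", "torch"]

def some_func(x):
    # Stand-in for a refactored library function (illustrative only).
    return (x * 2).sum()

def run_cross_library_checks():
    results = {}
    for name in LIBRARIES:
        try:
            lib = importlib.import_module(name)
        except ImportError:
            results[name] = "skipped (not installed)"
            continue
        x = lib.asarray([1.0, 2.0, 3.0])
        # If a refactor introduced an operation a library doesn't support,
        # this call is where it would surface.
        results[name] = float(some_func(x))
    return results
```

In a real test suite this would be a parametrized pytest fixture rather than a loop, but the idea is the same: exercise the whole public function with each array type, not just the dispatch layer in isolation.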
None of us have a crystal ball, so this is a tricky topic. We suspect that the code will be relatively low-overhead to maintain, and worth it for the benefits we get - but it’s hard to be sure. So this is more of an open question; based on experience with other dispatching methods, and on the code and PRs linked above, are there general or specific worries about longer-term maintainability?
One thing that may be worth discussing is API stability. Projects implementing a backend for a particular API effectively commit to following the choices of the project that owns that API, including backwards-incompatible changes and API extensions. This seems to be working well for CuPy and Dask already - but for libraries that prefer making their own choices, there may be some trade-offs. Or perhaps they should get more involved in the design of the base library?