Hi all,
Support for array types - primarily PyTorch tensors and CuPy and JAX arrays, and others like Marray and Dask as well - has been growing over time, and all the infrastructure (tests, docs, CI, runtime utilities, etc.) has been stable for at least half a year now, after a couple of years of hard work and evolution. Now that 1.18.x is branched, I think it is time to make this proposal: let’s make array types support public in the next release after 1.18, and let’s make that the 2.0 release.
In general, everything should be ready to make public, meaning that the behavior that is now gated on the SCIPY_ARRAY_API environment variable becomes the default behavior. The user-facing narrative docs should be moved from https://scipy.github.io/devdocs/dev/api-dev/array_api.html to the user guide and reference docs, but that’s probably the only thing that still needs doing beyond continuing to extend support (and it makes sense to do that once we decide on making it public).
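For anyone who hasn't tried the opt-in yet, here is a minimal sketch of what "gated on SCIPY_ARRAY_API" means in practice. It assumes a recent SciPy in which `scipy.special.logsumexp` is among the covered functions, and it treats PyTorch as optional; the choice of function and the input values are only illustrative.

```python
import math
import os

# The flag is read when SciPy is imported, so set it first.
os.environ["SCIPY_ARRAY_API"] = "1"

import numpy as np
from scipy.special import logsumexp

# NumPy input: the numerical result is the same with or without the flag.
x = np.log(np.array([1.0, 2.0, 3.0]))
result = logsumexp(x)
print(type(result), float(result), math.log(6.0))

# Optional: with the flag on, covered functions are expected to return the
# same array type they were given (assumption: PyTorch is installed and this
# SciPy version covers logsumexp).
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    t = torch.log(torch.tensor([1.0, 2.0, 3.0]))
    print(type(logsumexp(t)))
```

Under this proposal the first two lines of setup would no longer be needed, since the gated behavior would be the default.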
To get a sense of work done and coverage achieved:
- Over 400 PRs with the array types label: link
- There are coverage tables for CPU, GPU and JIT support at https://scipy.github.io/devdocs/dev/api-dev/array_api.html. I’ll note that those probably underestimate the level of support, because they treat all APIs/functions as contributing equally to % coverage. If you look at the trackers for large submodules like stats and special, you’ll see that many of the most heavily used functions are already supported, while the long tail of more niche functions has been left for later or even deliberately excluded.
- The original RFC with the high-level plan: scipy#18286
- Pinned tracker, linking out to per-submodule trackers: scipy#18867
Making the functionality public will make it much easier to use (it’s now really for early adopters, and hidden from the average user), and will also help downstream usage. E.g., scipy#21529 was opened by the scikit-learn team because they need to be able to enable SciPy’s currently private behavior (and cannot do so today). And the scikit-learn plan for making their own array types support public depends on ours: scikit-learn#33444.
Why do we need a 2.0 release? In a nutshell, for two reasons:
- Because there will be enough backwards compatibility impact to justify that. Not for NumPy arrays or Python scalars as input, but for things we don’t test structurally today as inputs: PyTorch tensors, Dask arrays, pandas dataframes, etc. We never guaranteed that that works, but it usually just did when the inputs are convertible to NumPy arrays.
- Because it’s such a massive feature and amount of work that it’s worth doing a major release. If this doesn’t qualify, I don’t know what would.
When talking to some people about this, there were both wishes for changes that could fit in a major release (e.g., deprecate stats.mstats and sparse matrices) and voices that advocated for as little disruption as possible (don’t see it as a license to make random breaking changes, or larger-scale API changes/cleanups like NumPy 2.0 did). I think both of those make sense. As long as the number of things we have to communicate is small, and nothing is hard-breaking unless it was already deprecated according to our regular backwards compatibility policy, I think we can satisfy both of those types of wishes.
I’ll also note that as of today, we no longer require a Fortran compiler - another huge change, after 25 years of Fortran compilation fun. That is also a 2.0-worthy achievement. Along with that came full, unconditional ILP64 BLAS library support, another large one. I think there may be a few more noteworthy items, such as batching support, which received a lot of work. Overall, I believe a 2.0 release makes a lot of sense: we have a lot to be proud of and can do it without much disruption.
What do you all think?