Making array types support public & SciPy 2.0

Hi all,

Support for array types - PyTorch tensors and CuPy and JAX arrays primarily, and others like MArray and Dask as well - has been growing over time, and all the infrastructure (tests, docs, CI, runtime utilities, etc.) has been stable for at least half a year now, after a couple of years of hard work and evolution. Now that 1.18.x is branched, I think it is time to make this proposal: let’s make array types support public in the next release after 1.18, and let’s make that the 2.0 release.

In general, everything should be ready to make public, meaning that the behavior now gated on the SCIPY_ARRAY_API environment variable becomes the default behavior. The user-facing narrative docs should be moved from https://scipy.github.io/devdocs/dev/api-dev/array_api.html to the user guide and reference docs, but, beyond continuing to extend support, that’s probably the only thing that still needs doing (and it makes sense to do that once we decide on making it public).
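
For readers who want to try the gated behavior today, here is a minimal sketch of opting in. It assumes the flag is read when SciPy is first imported, so it has to be set beforehand. The helper name `enable_scipy_array_api` is made up for this example, and the PyTorch part only runs if both libraries are installed. The choice of `scipy.special.logsumexp` is also an assumption; check the coverage tables for what is actually supported.

```python
import os
import sys


def enable_scipy_array_api() -> bool:
    """Set SCIPY_ARRAY_API=1 for this process (hypothetical helper).

    Returns True if SciPy had not been imported yet, i.e. the flag can
    still take effect; False if SciPy was already loaded.
    """
    already_imported = "scipy" in sys.modules
    os.environ["SCIPY_ARRAY_API"] = "1"
    return not already_imported


flag_effective = enable_scipy_array_api()

try:
    # Optional dependencies: the demo below is skipped if either is missing.
    import torch
    from scipy.special import logsumexp
except ImportError:
    torch = None

if torch is not None and flag_effective:
    x = torch.tensor([0.0, 1.0, 2.0])
    result = logsumexp(x)
    # With the flag active, the result is expected to stay a PyTorch
    # tensor instead of being converted to a NumPy type.
    print(type(result))
```

Under the proposal, the opt-in step would no longer be needed, since this behavior would become the default.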

To get a sense of work done and coverage achieved:

  • Over 400 PRs with the array types label: link
  • There are coverage tables for CPU, GPU and JIT support at https://scipy.github.io/devdocs/dev/api-dev/array_api.html. I’ll note that those probably underestimate the level of support, because they treat all APIs/functions as contributing equally to % coverage. If you look at the trackers of large submodules like stats and special, you’ll see that many of the most heavily used functions already have support, while the long tail of more niche functions is left for later or even deliberately excluded.
  • The original RFC with the high-level plan: scipy#18286
  • Pinned tracker, linking out to per-submodule trackers: scipy#18867

Making the functionality public will make it much easier to use (it’s now really for the early adopter, and hidden from the average user), and will also help downstream usage. E.g., scipy#21529 was opened by the scikit-learn team because they need to be able to enable SciPy’s private behavior, and cannot do so today. And the scikit-learn plan for making their own array types support public depends on ours: scikit-learn#33444.


Why do we need a 2.0 release? In a nutshell, for two reasons:

  1. Because there will be enough backwards compatibility impact to justify that. Not for NumPy arrays or Python scalars as input, but for things we don’t test structurally today as inputs: PyTorch tensors, Dask arrays, pandas dataframes, etc. We never guaranteed that that works, but it usually just did when the inputs are convertible to NumPy arrays.
  2. Because it’s such a massive feature and amount of work that it’s worth doing a major release. If this doesn’t qualify, I don’t know what would.

When talking to some people about this, there were both wishes that could fit in a major release (e.g., deprecate stats.mstats and sparse matrices) as well as voices that advocated for as little disruption as possible (don’t see it as a license to make random breaking changes, or larger-scale API changes/cleanups like NumPy 2.0 did). I think both of those make sense. As long as the number of things we have to communicate is small, and not hard-breaking unless it was already deprecated according to our regular backwards compatibility policy, I think we can satisfy both of those types of wishes.

I’ll also note that as of today, we no longer require a Fortran compiler - another huge change, after 25 years of Fortran compilation fun. That is also a 2.0-worthy achievement. We have also enabled full, unconditional ILP64 BLAS library support, another large one. I think there may be a few more noteworthy items, such as batching support, which received a lot of work. Overall, I believe a 2.0 release makes a lot of sense: we have a lot to be proud of and can do it without much disruption.

What do you all think?

I can squint a little and see the mstats and sparse matrix deprecations as part of a piece of the Array API rationalization effort and not separate things thrown in just because the 2.0 ship is sailing. +1 for including them specifically, but the caution not to throw everything in opportunistically is well-taken.

Thanks for proposing this. I definitely think it’s the right time to start planning for 2.0.

However, I am slightly concerned that making the next release the 2.0 release would be moving too fast. Not because I think array API isn’t ready, but because it won’t give us, the developers, enough time to plan for and execute other things. There has been an accumulation of changes (see the “SciPy 2.0: ideas for changes & cleanups” page on the scipy/scipy wiki for a non-exhaustive list) which we have punted because it was decided that they should wait until version 2.0.

Before committing to this decision I think we should collate what other changes had possibly been planned for 2.0, decide on their priority, and whether they could still be achieved under the schedule you are proposing.

I think the biggest question mark for me is how the stats distribution infrastructure work fits into the proposed timeline. scipy#15928 said:

The new infrastructure and distributions would be widely advertised, and documentation and existing code would transition to the new infrastructure. Existing infrastructure/distributions would be supported without deprecation warnings until two releases before SciPy 2.0, at which point they would be moved to a new package that is released once (without planned maintenance).

That was the proposal at the time, but I’m not sure it’s the best thing to do if we’re preparing for 2.0 and some of the rest of that proposal is still incomplete. One major thing that hasn’t materialized is fitting via zfit, for instance. So we could begin to de-emphasize the old infrastructure in 2.0 (e.g. use stats.Normal in examples rather than stats.norm), but probably not deprecate/remove yet.

Small correction: none of those were decided on, it was more a “not to be considered before 2.0”.

I think the first thing to agree on is the principles for the release, in particular this:

Most of the items on that wiki page seem like they’d clash with this. For a 2.0 release along the lines I described, we’re only removing some already-deprecated things (possibly with 1-2 exceptions) and doing a few more impactful deprecations. A six month timeline is easily compatible with that. The most work will be the actual preparation for making the array types support public.

As for concrete items on that list:

  • Already deprecated things are the compat shims for private APIs and possibly scipy.misc (since all its namespace members are deprecated).
  • Removals of APIs that aren’t yet deprecated (fftpack, mstats, ltisys/wavelets/splines, signal.cwt, sparse matrices, cKDTree, changes of return types away from tuples): that would all be out of scope.
    • uarray is the one that maybe we could get away with, since there are only a few large users and that functionality is essentially being replaced in large part by the dispatching on array types we will now make public
    • I’d also say that most of this is low-prio and if we want to get rid of it, a deprecation could be done in any regular feature release
  • rng keyword behavior changes is the one item that seems relevant and we’ve been working towards for a while.

Do you agree with the principles I proposed, or are you advocating for a much more breaking major release @j-bowhay?

For sparse matrices in particular, that seems about right: start being noisy at all sites of use in the next release, and remove 2 minor releases later. @dschult am I right in thinking that your plan (Tracking spmatrix deprecation progress and action items, scipy#24802) would be basically unchanged, apart from the move to v2.0 and v2.2?

For sparse matrices in particular, that seems about right: start being noisy at all sites of use in the next release, and remove 2 minor releases later.

Yes, removal in v1.21 is the current plan for sparse matrices. So the plan could remain intact with a shift in release numbers to: “v2.0(warnings) and v2.2(removal)”. We could move quicker than that if desired, but that timeline maintains the usual warning times for feature removals.

If I squint through a cloud of hopefulness I can even imagine a slightly easier transition if 2.0 coincides with the sparse matrix deprecation warnings: people who allocate more time to update their code for a major release could see the warnings and remove sparse matrix usage even though that is not yet required. :)

Indeed my original wording was imprecise, apologies. However, I do think we should explicitly make a decision about each of these.

I would like the option to do more, yes.

From the list here are my thoughts.

Uncontroversial (hopefully):

  • Removing all the compatibility shims for private APIs (see #14360 and others, e.g. #14919) - we have been warning about this since forever
  • Removing scipy.misc - this has also had a warning for a while and is empty

Things I would be +1 on:

  • Removing scipy.stats.mstats (gh-22194) - a lot of these functions are low quality and do a disservice to our users. Also having a different api for a different array type seems at odds with the philosophy of array api.
  • Consider removing spmatrix now that we have sparray - seems desirable to happen in a major release although we would need to give users some notice.
  • Remove scipy.fftpack - I don’t see why we should carry two complete fft APIs for forever. However, it would be nice to give users some forewarning about this.
  • Orthography choice for parameters named lambda. - Seems straightforward, and with the rename-parameter decorator we have, this doesn’t even have to be breaking.
  • Remove uarray - I would really like to see this removed as it makes the fft code really confusing.
  • ~~SPEC07 RNG behaviour changes - seems like we are most of the way there already?~~ sounds like we need more time
  • Enforce consistent policy for position-only and keyword-only parameters. - would make maintenance and new features easier
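
To illustrate why a parameter rename need not be breaking, here is a minimal sketch of a rename decorator. It is not SciPy's actual helper: `rename_parameter`, its `warn` flag, the `scale` function and the `lmbda`/`lam` spellings are all hypothetical names chosen for this example.

```python
import functools
import warnings


def rename_parameter(old_name, new_name, warn=False):
    """Accept `old_name` as a keyword alias for `new_name` (illustrative only)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if new_name in kwargs:
                    # Passing both spellings is ambiguous, so reject it.
                    raise TypeError(
                        f"{func.__name__}() got both {old_name!r} and {new_name!r}"
                    )
                if warn:
                    # Optional later stage: start nudging users to the new name.
                    warnings.warn(
                        f"{old_name!r} is deprecated, use {new_name!r} instead",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                # Forward the value under the new name.
                kwargs[new_name] = kwargs.pop(old_name)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@rename_parameter("lmbda", "lam")
def scale(values, lam=1.0):
    """Hypothetical function whose parameter used to be spelled `lmbda`."""
    return [lam * v for v in values]


new_spelling = scale([1.0, 2.0], lam=2.0)
old_spelling = scale([1.0, 2.0], lmbda=2.0)  # old keyword keeps working

try:
    scale([1.0], lam=2.0, lmbda=3.0)
except TypeError as exc:
    conflict_message = str(exc)
```

With `warn=False` existing callers see no change at all; switching to `warn=True` in a later release would start a regular deprecation cycle for the old spelling.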

Things I would be +0 on:

  • Deduplicate kdtree apis - no need to carry two implementations
  • Remove legacy APIs - I’d be in favour of removing the lesser used ones but things like fsolve would probably be too disruptive.

Things I would be -0 on:

  • Decide on the fate of control-related parts of signal, in particular ltisys, either retire them or commit to an overhaul with possible breakage. - I don’t know enough about this.
  • Removing other used-but-questionable functionality, such as signal.cwt and other wavelet and spline functionality in signal - also don’t know enough about this.

Things I would be -1 on:

  • Possibly making changes to the return types for non-numpy-array input (i.e., not coercing with np.asarray by default) - seems very disruptive for what benefit?
  • Return objects with attributes rather than tuples to facilitate future additions. - Too disruptive given we have a workaround

Other:

  • ~API redesign of the statistical distribution.~ Remove old distribution infrastructure. - Matt seemed to indicate we aren’t ready to do this.

For these ones, could you give your opinion on deprecating in 2.0 and removing a few releases after? As Dan said, I think that sounds fine for spmatrix (although I appreciate that scipy<2.0 is less cognitive load than scipy<2.2). fftpack and uarray I probably feel the same. Likewise for SPEC 7, I think we just need more time to raise some noise, but I don’t see much of a benefit to removal in a major version. mstats too, I think deprecating in 2.0 to ‘test the waters’ before we fully commit to it will hopefully encourage users to pick up MArray (the kind that are keen to upgrade to 2.0, at least).

This one I think we should revisit to check opinions, but I’m not sure if there is desirable breakage on the table. @h-vetinari and @rgommers at least seemed to be roughly agreed back at scipy#14714 (API: keywords only arguments) that a project-wide deprecation is not justified by the benefits.

Likewise for SPEC 7, I think we just need more time to raise some noise

With SPEC7 changes the lead time is looooong. After the introduction of the rng keyword we wait until all supported releases use the new keyword. Only then do we deprecate the old keyword (seed, random_state), and then it’s another few releases until the old keyword is removed. We’re only just about getting to the point where there’s lots of places where we can start deprecating the old keyword. There may also be places where we haven’t changed over.

(Unrelated, but this thread seems to use the wrong scipy tag, as it does not show up under “scipy - Scientific Python”.)

It wouldn’t be my first choice but I would be willing to accept it as a compromise.

It’s awesome to see this post. I was one of the ones who urged caution; to avoid using this as an opportunity to push through a laundry list of sweeping API changes. My main concern was that we could end up loading a substantial amount of work on our plates and risk either significantly delaying the 2.0 release or rushing to complete the work and producing a poor quality release. I fear the latter would leave a lasting bad impression and that it’s essential to stick the landing and ship something strong, even if it means limiting scope. My vote would be to stick to a minimal set of things that group into thematic clusters and which are either already somewhat mature, or which could be completed without much difficulty. I agree that the mstats and sparse matrix deprecations (with removal in 2.2) could reasonably fall into the array API thematic cluster and am +1 on including them. Beyond that, I’d prefer to stick to the things Ralf listed in his original post, and to pour energy into producing quality documentation and ensuring everything works as expected. Producing a strong, polished 2.0 release with minimal cause for complaint seems like the best thing for the health of the project.

If we can also include uarray in this list then I think I am happy with the proposal.

I can’t say I understand what all of the implications would be, but I’m on board with deprecating uarray. It seemed that Ralf thought we could possibly even get away with removing it completely without deprecating first.

Great, it seems like we’re on the same page then! For uarray, I agree indeed - it fits thematically, and it’s probably the only item on that list that is a little urgent because it is technically complex code and we’re currently having trouble upgrading it due to some hairy segfault. Let’s try to push for removal, and if we get pushback from downstream, we can deprecate and then remove a little later.

***

For the other items, I’d like to add that declaring an API legacy and leaving it there forever is perfectly okay - cleaning up for cleaning up’s sake is not a good goal. Example for fftpack: it is now a thin API shim over scipy.fft so incurs minimal cost, while if you do a code search on GitHub you’ll find O(100,000) hits: from scipy import fftpack results, from scipy.fftpack results (and there’ll be more in private code). Removing that shim would be a very poor tradeoff.

To digress slightly, do you have a plan for handling arrays that aren’t on the default device? So far I don’t think we have handled this properly, i.e. there are lots of places where arrays are created internally without explicitly passing a device. Ideally we also need a way of testing this that doesn’t require writing an explicit test for every function. Maybe we could monkey patch the xp test fixture so that, in the test functions, asarray creates on a non-default device?
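
The failure mode and the monkey-patching idea can be sketched with a toy stand-in for an array namespace. Everything here is hypothetical (`ToyArray`, `ToyNamespace`, the device names, the two `add_offset_*` functions); it only mimics the `device=` keyword and `.device` attribute pattern and is not SciPy's test fixture.

```python
from dataclasses import dataclass

DEFAULT_DEVICE = "cpu:0"  # made-up device names for the toy example
OTHER_DEVICE = "cpu:1"


@dataclass
class ToyArray:
    data: list
    device: str = DEFAULT_DEVICE


class ToyNamespace:
    """Tiny stand-in for an array namespace with device support."""

    def asarray(self, data, device=None):
        return ToyArray(list(data), device or DEFAULT_DEVICE)

    def zeros(self, n, device=None):
        return ToyArray([0.0] * n, device or DEFAULT_DEVICE)

    def add(self, a, b):
        # Like real multi-device libraries, refuse to mix devices.
        if a.device != b.device:
            raise ValueError(f"device mismatch: {a.device} vs {b.device}")
        return ToyArray([x + y for x, y in zip(a.data, b.data)], a.device)


class NonDefaultDeviceNamespace(ToyNamespace):
    """Test-side patch: asarray lands on a non-default device by default."""

    def asarray(self, data, device=None):
        return super().asarray(data, device=device or OTHER_DEVICE)


def add_offset_buggy(x, xp):
    offset = xp.zeros(len(x.data))  # device not passed: default device
    return xp.add(x, offset)


def add_offset_fixed(x, xp):
    offset = xp.zeros(len(x.data), device=x.device)  # follow the input
    return xp.add(x, offset)


# With the unpatched namespace the bug is invisible.
plain = ToyNamespace()
hidden = add_offset_buggy(plain.asarray([1.0, 2.0]), plain)

# With the patched namespace the same call fails, exposing the bug.
xp = NonDefaultDeviceNamespace()
x = xp.asarray([1.0, 2.0])
fixed = add_offset_fixed(x, xp)
try:
    add_offset_buggy(x, xp)
    bug_detected = False
except ValueError:
    bug_detected = True
```

The point of the patch is that existing tests would not need to change: any function that creates an internal array without forwarding the input's device would start failing on its own.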

rgommers said (June 5):

Great, it seems like we’re on the same page then! For uarray, I agree indeed - it fits thematically, and it’s probably the only item on that list that is a little urgent because it is technically complex code and we’re currently having trouble upgrading it due to some hairy segfault. Let’s try to push for removal, and if we get pushback from downstream, we can deprecate and then remove a little later.

Agreed that having a tightly scoped set of changes is good, and if we scope it and keep it thematically coherent, it does not matter all that much if 2.0 comes as the next release (in ~6 months) or one after next (~12 months). And the case for the next release is strong I think, so +1 for the plan above. Especially given that it looks like we’re not necessarily planning that much breakage.

One thing concerns me a bit though. We used “SciPy 2.0” for a long time as a bit of a nebulous target: some point far in the future when we’ll be able to make breaking changes (or more breaking than acceptable in a usual minor release). Since this was always far far away, and there never was any timeline, it might have worked as a deterrent: a project is not even worth starting because it cannot conclude in a minor release and there is no timeline for when it can.
And now we switch from “not any time soon” to a rather tight timescale. So, we should I think ask the community at large: is there something potentially more breaking than usual, that you would want to work on if there’s a definite timeline for a breaking release?
Example projects include items from Jake’s list, lti systems in scipy.signal, consistent handling of complex-valued signals across scipy.signal, or other larger-scale improvements.

For any of this to work, I think we’d need

  • a scoped proposal for a project (an informal one of course, nothing PEP-level needed);
  • a champion who wants to drive the work;
  • a time estimate it needs before a breaking release.

Then for each proposal we make a yes/no decision on this list, and work out the timeline for when these may land: add it to 2.0 or make a plan for a 3.0 (even if that comes just after 2.0, why not in principle).
So, any breaking projects, are there?

Evgeni

If someone wants to try this, it might be easier to prototype on array-api-extra first. We already have a non-default device fixture (tests/conftest.py in data-apis/array-api-extra, at commit 090c7a02a9f7d141975ddc64e5cd9f456cd97377), but it must be used explicitly in a test at the minute.