Should scipy tests assume float64 as the default floating-point dtype?

Hi,

Historically, scipy has assumed that the default floating-point type is double-precision float64. Now that we are moving towards the array API world, we face torch and jax, which default to float32 unless explicitly configured otherwise.

Thus, there’s a question of how we test scipy on these new frameworks. Note that here I am only talking about the scipy test suite.
End users are free to configure their scipy-using workloads as they see fit.

At the moment, scipy tests are inconsistent: we configure JAX to default to float64, but torch keeps defaulting to float32.

This leads to having to write tests along the lines of

xp_assert_close(..., atol=1e-7 if is_torch(xp) else 1e-15)

The ability to test things in the f32 world is apparently important to some of us, and there is a body of tests which were converted to use this kind of pattern.

The use of float64 for jax is needed AFAICS due to jax’s specifics. Not using fp64 for torch is, IMO, an oversight.

There’s a discussion in a recent PR: ENH: signal.windows: add array API support (take 2) by ev-br · Pull Request #21783 · scipy/scipy · GitHub

Proposal:

  • configure the scipy test suite for torch to default to float64
  • cook up a backend-agnostic way of querying the default fp dtype and use it in tests. Then, tests would use

atol=1e-7 if default_fp_type(xp) == xp.float32 else 1e-15

instead of if is_torch(xp). (This uses a made-up default_fp_type function; of course we can spell it differently. One possible spelling is sketched after this list.)

  • if desired, add a separate CI run with fp32 as the default.
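A minimal sketch of what such a made-up default_fp_type helper could look like (hypothetical, not an existing scipy function; it relies only on the fact that asarray of a Python float produces the backend's default floating dtype):

    def default_fp_type(xp):
        """Return the default floating-point dtype of the namespace xp."""
        # asarray of a Python float uses the backend's default float dtype:
        # float32 for torch out of the box, float64 for numpy
        return xp.asarray(1.0).dtype

    # usage in a test, mirroring the pattern above
    atol = 1e-7 if default_fp_type(xp) == xp.float32 else 1e-15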

In a recent community meeting, Tyler mentioned that there are even emerging frameworks that do not have float64 at all. If we want to test scipy with those, we could add a separate CI run for them, and explicitly skip tests of functionality that hard-requires float64. That would be a separate effort, IMO.

Thoughts?

Evgeni

This one is the most clear cut: we decided from the start that libraries must support float64, or they don’t get tested. The effort and churn to make all float64 usage conditional would be huge, and didn’t seem worth doing. We can always revisit that decision in the future of course, but it looks like for now our time will be much better spent in maturing support for libraries that do have float64. And a library with only float32 will still work - it’s just not tested by our test suite.

I think it is indeed an oversight. At first I thought it was a regression, but after reviewing the initial PR in which we added the array API infrastructure, I think it didn't come up there because the tests with PyTorch passed with tolerances like 1e-15 even without changing the default dtype: the cluster tests are not very sensitive to floating-point tolerances, and some tests explicitly used float64 for test array creation.

So this seems to be accidental indeed.

xp_assert_close(..., atol=1e-7 if is_torch(xp) else 1e-15)

Writing tests like this makes apples-to-apples comparisons very difficult. It would be better if tests that parametrize over xp switched only the array library, not the precision as well.

The ability to test things in the f32 world is apparently important to some of us

We have reasonable but far from complete test coverage for float32. Judiciously expanding that coverage seems perfectly fine (where it makes sense, let’s not double all tests), but it should switch on the actual dtype and not the array library.

Where we do test with float32, those tests are equally relevant for numpy arrays. So again a float32 test can then compare numpy vs. pytorch vs. cupy vs. jax, switching only the array library.
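For illustration, a float32 test of that shape could look roughly like the sketch below (this assumes the xp fixture that parametrizes over array libraries and the xp_assert_close helper already used in the scipy test suite):

    def test_log1p_float32(xp):
        # explicit float32 input: the test is about the dtype, not the backend
        x = xp.asarray([1e-4, 1e-3, 1e-2], dtype=xp.float32)
        expected = xp.asarray([9.99950003e-05, 9.99500333e-04, 9.95033085e-03],
                              dtype=xp.float32)
        # one tolerance, whether xp is numpy, torch, cupy or jax
        xp_assert_close(xp.log1p(x), expected, rtol=1e-5)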

So I agree with your proposal @ev-br.

:+1: from me as well.

I am seeing issues related to numerical stability reported quite frequently, especially for the signal module (e.g., #22035 and #21969 were reported within the last two weeks). It is often hard to decide whether such issues point to a sub-optimal numerical implementation or an inherent limitation of the algorithm itself (which might not be documented well enough).

I expect the frequency of such issues to increase with more widespread use of different float types. Perhaps it would make sense to take this idea a step further, i.e., explicitly document that the implementations are designed with the assumption that float64 is the default, and that numerical accuracy is not tested for other types unless explicitly stated.

I am not sure how best to approach this, though: do it globally, or state it for every module or every function?

The proposal seems good to me, but we should let @j-bowhay and @mdhaber chime in if they want to.

Without the third point, we would indeed lose test coverage. Thanks to check_dtype=True in the xp-assertions, we get a load of dtype checks "for free" where we haven't written explicit checks for the dtypes. Combined with the fact that we currently test with float32 on torch, this lets us check that input.dtype == output.dtype, since we cover both float dtypes.
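For concreteness, here is roughly what those "free" checks buy us (a sketch, assuming xp_assert_close from scipy/_lib/_array_api.py with its check_dtype=True default): if a function accidentally promotes float32 input to float64 output, the assertion fails on the dtype even when the values agree.

    import numpy as np
    from scipy._lib._array_api import xp_assert_close

    x = np.asarray([1.0, 2.0], dtype=np.float32)
    result = np.sqrt(x.astype(np.float64))   # stand-in for a buggy promotion to float64
    expected = np.sqrt(x)                    # float32, matching the input dtype
    xp_assert_close(result, expected)        # fails: dtypes do not match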

A (better) alternative to the third point is to write proper dtype tests for every function. But that is probably more work!

I don't think that conclusion follows. While using the default torch dtype initially in the cluster PR may have been an oversight, that is independent of what we do with JAX. We are seemingly forced into float64 for JAX due to 🔪 JAX - The Sharp Bits 🔪 — JAX documentation.

A few points to add from my perspective:

  • it seems reasonable that many users will end up using the default floating dtype of their library (for better or worse), so we should at least make some effort to test with it. JAX seems to be an outlier here because of its sharp corners, so I don't think it should influence the discussion.
  • we already have dtype-dependent default tolerances: scipy/scipy/_lib/_array_api.py at a56b88524003e97c010e67e96d9792482455b8b8 · scipy/scipy · GitHub
  • we have already converted a bunch of functions in stats, optimize, and differentiate which all respect the default floating dtype. In cases where we need greater precision we just explicitly set the dtype.

The crux of my argument is that I don't see why we need to make such a sweeping change for something that can be remedied as simply as xp.asarray(..., dtype=xp.float64), without needing to roll back a bunch of work that has already been done.

Jake, what work would be lost, though? I think usually it would mean that some of the trouble we went through was not necessary, but I would rather let that go to make it easier going forward.

I do think we need to replace the coverage, though. I have seen many cases in which dtype bugs would be exposed by one test but not others, so I am not sure that a single dedicated dtype test for each feature is sufficient. I would prefer to at least have a dedicated dtype CI job (all tests x all dtypes x 1 platform) and maybe also a dedicated dtype test (1 test x all dtypes x all platforms) if something more thorough is infeasible.

Sorry, bad phrasing on my part, but my point is that I don't want to see us flip the switch over to float64 without it being clear that we'll maintain the same standard of testing.


If recent testing turned up float32-specific bugs regularly, then those exist also with numpy and also in code that wasn’t made array type-agnostic. So I’d take it as a good data point that we will benefit from having a generic CI job as Matt and Evgeni propose.

I like your idea of a dedicated CI job, since that is pretty cheap compared to more heavily parametrizing the whole test suite by default.

It seems nontrivial to do this in a smart way without large-scale changes to the test code. Perhaps it's possible, though, if we can find a clever way of modifying some key functions for array creation and testing?

Hmm I just meant to agree with Evgeni’s suggestion w.r.t. the CI job; I don’t know how to implement it. : )

I didn’t realize until this thread that PyTorch’s default could be configured, so it sounds like changing just that would be easy enough for starters? Whenever writing or reviewing array API tests, I’ve tried to rely on the default dtype whenever possible, so for these sorts of tests, all we’d need to do is configure the PyTorch default.
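For reference, configuring the PyTorch default is just a call to torch.set_default_dtype, e.g. somewhere in the test-suite setup:

    import torch

    torch.set_default_dtype(torch.float64)
    print(torch.asarray([1.0, 2.0]).dtype)   # torch.float64 instead of torch.float32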

For tests that generate random data with NumPy before converting with xp.asarray, I think the default dtype of NumPy random transfers over. So maybe what we do is adjust the behavior of xp.asarray in tests. For tests that actually need to rely on a particular dtype, we could use astype instead of the dtype parameter of asarray?

I think almost every array API test we'd want to run uses xp.asarray, so hacking into that would get us most of the way.
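To make the idea concrete, "hacking into" asarray could look roughly like the wrapper below (purely hypothetical, nothing like this exists in scipy today; it assumes the standard isdtype/astype functions in the namespace):

    class DefaultFloatNamespace:
        """Behaves like xp, except asarray coerces floating input to a
        chosen default dtype when the caller did not request one."""

        def __init__(self, xp, default_float):
            self._xp = xp
            self._default_float = default_float

        def __getattr__(self, name):
            # everything else falls through to the wrapped namespace
            return getattr(self._xp, name)

        def asarray(self, obj, dtype=None, **kwargs):
            out = self._xp.asarray(obj, dtype=dtype, **kwargs)
            if dtype is None and self._xp.isdtype(out.dtype, "real floating"):
                out = self._xp.astype(out, self._default_float)
            return out

A test-suite xp fixture could then hand out DefaultFloatNamespace(xp, xp.float32) in a float32 CI job, leaving the library namespaces themselves untouched.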

While it sounds clever, I really don’t like the idea of “hacking into” xp.asarray to produce non-standard behaviour in the test suite :sweat_smile: I’m not sure how else to do it, though.

If testing with just array-api-strict would be sufficient, a flag to set the default dtypes (Allow changing the default dtypes · Issue #38 · data-apis/array-api-strict · GitHub) would almost be enough, but it wouldn’t cover cases like coercing arrays from numpy.random.

If recent testing turned up float32-specific bugs regularly, then those exist also with numpy

FWIW, I’m not aware of this being a frequent issue. Maybe in recent additions, like differentiate or elementwise frameworks — I am not familiar with those, so maybe it’s different there.

In more “traditional” parts of scipy (ndimage, signal, interpolate), what I observed a lot in the last couple of months was that a test with a hardcoded tolerance, written with numpy only in mind, starts failing under pytorch. There are two main scenarios:

  • torch genuinely differing from numpy. Either some edge cases are handled differently or not at all, or generally an algorithm is a bit different. These cases are actually rare IME. And in these cases, bumping the tolerance is perfectly reasonable.
  • xp.asarray(list) giving an f32 torch tensor, while test tolerances assume f64 (atol < finfo(float32).eps, this kind of thing; see the illustration right after this list). As mentioned above, a large part of those tests are covered by the recent dtype-dependent tolerances in xp_assert_close, but all tests which set tolerances explicitly are affected. And that's a lot of tests, meaning there's a lot of churn.
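A minimal illustration of the second scenario, with plain numpy and torch and no scipy involved:

    import numpy as np
    import torch

    print(np.asarray([1.0, 2.0]).dtype)      # float64
    print(torch.asarray([1.0, 2.0]).dtype)   # torch.float32 (torch's default)

    # a hardcoded atol=1e-12 is far below float32 resolution (eps ~ 1.2e-7),
    # so ordinary float32 round-off trips such an assertion under torch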

So I'd argue this is not smoking out pre-existing numerical problems; rather, we significantly expand the testing surface by allowing f32-default array providers via the array API.

Given that — again, IME and FWIW — converting tests already takes 80-90% of total conversion effort, this has a non-trivial effect on conversion/adoption velocity.

As far as specific strategies are concerned: I'd agree that hacking into asarray feels a bit much; the PyTorch default we can change today; and if the array API introspection/configuration capabilities grow a way to query/set the default dtype in a backend-agnostic way, that can propagate to array-api-strict like Lucas said.
Once that exists, we can configure a CI job with an env variable, so the total effort in scipy is a dozen LOC (a hypothetical sketch follows below).
Whether this new CI job is required, or whether we opt tests in submodule by submodule… I'd think the latter is preferable, for the same reasons: this whole f32 thing is an enhancement, and I'd prioritize array API coverage over enhancements.
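A hypothetical sketch of that dozen LOC, e.g. a conftest-level helper driven by a made-up SCIPY_DEFAULT_FLOAT environment variable (torch.set_default_dtype and jax_enable_x64 are the real backend knobs; the rest is illustrative):

    import os

    def configure_default_float():
        if os.environ.get("SCIPY_DEFAULT_FLOAT", "float64") == "float32":
            return  # leave torch (and friends) at their float32 defaults
        try:
            import torch
            torch.set_default_dtype(torch.float64)
        except ImportError:
            pass
        try:
            import jax
            jax.config.update("jax_enable_x64", True)  # JAX needs this to enable f64 at all
        except ImportError:
            pass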

Evgeni

That's because the tests are old and mostly exercise "good-weather" branches; we have not touched much of them for a long time. In particular, signal is in a bruised shape overall: the LTI object and the filter-design related parts use suboptimal methods, and they are easy to overflow/underflow.

There were long discussions about backwards compatibility, so I lost the appetite to start fixing things in signal, because I don't have a way to fix them while maintaining compatibility, even though I already have a large portion of it in harold, and we pushed a bit of that to python-control. Typically we tend to push these issues to SciPy 2.0, though I don't know when that will come, if ever.

Regardless, not every function will work properly in 32-bit precision (as well as in f64, without tripping) unless you code another branch specifically for reduced precision (especially in signal; think of any nontrivial polynomial-coefficient manipulation). That is to say, such functions would fail instead of giving less precise answers. So even writing f32 tests won't make much of a difference, as they would xfail. Hence I agree that if a framework does not have f64, tough luck.