Naming convention for generalized ufuncs in special

Indeed, and to add to the inconsistency, some functions have the “at” behaviour, some have the “complete” / “all” behaviour, and only the parabolic cylinder functions have been paired with “_seq”-suffixed counterparts.

sph_harm as it is right now has the “at” behaviour; lpmn as it is right now has the “complete” / “all” behaviour. That makes particularly little sense because they are closely related functions, with sph_harm being mathematically defined in terms of lpmn.
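
To make the contrast concrete, here is a minimal sketch of the two behaviours as they stand today (argument values are arbitrary; the output shapes are the point):

import numpy as np
from scipy import special

# "at" behaviour: sph_harm evaluates a single (m, n) term at the given angles.
y = special.sph_harm(2, 3, 0.5, 0.25)   # one complex value for m=2, n=3

# "complete" / "all" behaviour: lpmn returns every order and degree up to (m, n),
# as a pair of (m+1, n+1)-shaped arrays of values and derivatives.
p, dp = special.lpmn(2, 3, 0.7)
print(p.shape)   # (3, 4)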

I think things are getting a bit entangled. What you would like to do at the low-level C++ layer is not the issue here; I think you can go ahead with it immediately anyway. We can always connect the right cables to the right places later. So I don’t think there is anything blocking that.

The discussion is about whether the SciPy public API functions should get two names or not. We already have about 250 functions, and probably n of them admit this extension. (I am not claiming any understanding of the internals, but these are mostly the functions we translated from F77.)

There are n of them (I don’t know how many yet), so we would end up with 250 + n functions in scipy.special (an extra n for the _complete versions). Say we then switch to new names and we are dealing with 500 + 2n functions. I am a bit uncomfortable with this plan, if I’m honest.

Also, I don’t quite agree with Ralf’s array API points, and in fact I would like to re-raise what @seberg wrote. These are not well-argued “good practices”. Shape changing is everywhere in SciPy; I mentioned eig and svd as two examples. All the remaining alternatives are worse than returning differently shaped arrays, regardless of how much the array API discourages it.
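
For concreteness, the sort of thing meant here (standard scipy.linalg calls, shown purely as illustration):

import numpy as np
from scipy import linalg

a = np.ones((5, 3))

# svd: a boolean keyword changes the output shapes.
u, s, vh = linalg.svd(a)                        # u: (5, 5), vh: (3, 3)
u, s, vh = linalg.svd(a, full_matrices=False)   # u: (5, 3), vh: (3, 3)

# eig: boolean keywords change the number of outputs.
w, vr = linalg.eig(np.eye(3))                   # eigenvalues and right eigenvectors
w = linalg.eig(np.eye(3), right=False)          # eigenvalues only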

I would suggest we do the split and you can actually freely compose what the new version would look like and we can handle the remaining decisions on the concrete functions instead of contemplating what the future might hold.

Sorry @ilayn, when I said the whole catalog, I meant the whole catalog of relevant functions computed with a recurrence relation. It’s not even close to the full set of functions available in scipy.special. We’re working out the list now.

OK, I’ve tallied them up. This applies to more functions than I realized: I’ve counted 59 applicable functions in scipy.special (counting clpmn and lpmn as one, since they should not be separate functions). Of the functions below, separate seq versions currently exist in the public API for only 5 of them. Thanks for making me take this into consideration @ilayn; committing to adding 59 new functions to the public API is not something to do lightly. Not that adding keyword arguments doesn’t come with its own set of troubles.

Integer-order Bessel functions

kn
yn
We don’t have integer-order versions of the Bessel function of the first kind jv or the modified Bessel function of the first kind iv, but we may want to add them. I think we should have better names for the Bessel functions; I don’t think we’d want to add a function named in.

Spherical Bessel functions

spherical_jn
spherical_yn
spherical_in
spherical_kn

Riccati-Bessel functions

riccati_jn
riccati_yn

Legendre functions

sph_harm
clpmn: this and lpmn below should be combined into a single function; clpmn is just the complex-valued version of lpmn (see the sketch after this list).
lpmn
lqmn
lpn
lqn
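
A minimal sketch of what “combined” could mean, with a purely hypothetical name and dispatch on the input type (not an actual API proposal):

import numpy as np
from scipy import special

def legendre_pmn(m, n, z):
    # Hypothetical unified wrapper: complex input goes to clpmn, real input to lpmn.
    if np.iscomplexobj(z):
        return special.clpmn(m, n, z)
    return special.lpmn(m, n, z)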

Ellipsoidal Harmonics

ellip_harm
ellip_harm_2
ellip_normal

Orthogonal Polynomials

assoc_laguerre
eval_legendre
eval_chebyt
eval_chebyu
eval_chebyc
eval_chebys
eval_jacobi
eval_laguerre
eval_genlaguerre
eval_hermite
eval_hermitenorm
eval_gegenbauer
eval_sh_legendre
eval_sh_chebyt
eval_sh_chebyu
eval_sh_jacobi

Parabolic Cylinder functions

pbdv
pbvv
pbdn

Mathieu and related functions

mathieu_a
mathieu_b
mathieu_even_coef
mathieu_odd_coef
mathieu_cem
mathieu_sem
mathieu_modcem1
mathieu_modcem2
mathieu_modsem1
mathieu_modsem2

Spheroidal wave functions

pro_ang1
pro_rad1
pro_rad2
obl_ang1
obl_rad1
obl_rad2
pro_cv
obl_cv
The ones below differ from the non _cv versions in that they take a precomputed characteristic value.
pro_ang1_cv
pro_rad1_cv
pro_rad2_cv
obl_ang1_cv
obl_rad1_cv
obl_rad2_cv


OK, this is great. We can start regrouping these functions or restructuring the API a bit before we make decisions, then.

As an example, the last two groups (with and without _cv) have the same signature, so we could (maybe?) lump them together and combine them under a more readable naming scheme. But again, these are all suggestions; I don’t mean to make them sound like strongly held opinions.

Or even van Buren’s library rewritten in C++ (cf. ENH: integrate Arnie Van Buren's implementations of the spheroidal wave functions · Issue #14117 · scipy/scipy · GitHub), so that we have state-of-the-art implementations. But that seems like a lot of code.

Also, as you noticed, the keyword solution does not fix all the remaining issues either. So probably we should pay attention to how these things are actually used instead.

Yeah, the _cv versions have an extra argument for the characteristic value, which could and probably should become a keyword argument in unified functions with more readable names.
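
Roughly along these lines, as a sketch only (prolate_angular1 is a placeholder name, not a proposal; it just shows the characteristic value becoming an optional keyword over the existing pro_ang1 / pro_ang1_cv pair):

from scipy import special

def prolate_angular1(m, n, c, x, *, cv=None):
    # Hypothetical unified wrapper: compute the characteristic value internally
    # unless the caller supplies a precomputed one.
    if cv is None:
        return special.pro_ang1(m, n, c, x)
    return special.pro_ang1_cv(m, n, c, cv, x)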

I was thinking just now that non-ufuncs in special could have a method which has the same signature as the function itself, and returns the ufunc which covers those inputs. This could work for dispatching to different array API backends too. So if someone wants the underlying ufunc, they can get it easily.
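
Very roughly, that idea might look something like this (class and method names are made up for illustration; nothing like this exists in scipy.special today):

import numpy as np

class SpecialFunction:
    # Sketch of a Python-level wrapper that can hand back its underlying ufunc.

    def __init__(self, dispatch_table, default_ufunc):
        # e.g. {np.float64: float64_ufunc, np.complex128: complex128_ufunc}
        self._dispatch_table = dispatch_table
        self._default_ufunc = default_ufunc

    def resolve_ufunc(self, *args):
        # Same signature as the function itself; returns the ufunc covering these inputs.
        key = np.result_type(*args).type
        return self._dispatch_table.get(key, self._default_ufunc)

    def __call__(self, *args):
        return self.resolve_ufunc(*args)(*args)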

I forgot to respond to this. Yes, I plan to rewrite this in C++ sometime this year. It will probably take a few weeks of focused work.

Neither @seberg nor @izaid wrote anything that I really disagree with, nor anything that confirms that these aren’t good practices. E.g. I wrote “just don’t use such keywords if you have a choice” (emphasis mine) and @seberg wrote “Not saying that I wouldn’t generally avoid such flags, but …”. It all comes down to preferring things that work cleanly with static typing etc., unless there are strong reasons not to.

Some things are worse than others, obviously. Here is how I’d order them from most to least impactful (in a negative way) for behavior based on the value of a keyword:

  • Returning a different number of output values
  • Returning an array with a different dtype
  • Returning an array with a different number of dimensions

Something like np.quantile(method="...") is fine if it doesn’t do any of the above but only changes how values are calculated.

Agreed of course, since one sometimes needs dynamic features of the Python language. As long as the intent is to steer towards semantics that are cleanly statically typed, and only use dynamic features if there’s a really clear net benefit, I think the outcomes will be good.

I meant this part. “Who considers this bad practice” seemed like an array API argument, and that is what I disagree with.

If I misunderstood, apologies in advance.

We are strictly talking about output array shapes. But going through your list:

  • Returning a different number of output arguments is a cardinal sin, so we can skip that.
  • Changing dtype is unavoidable if you are using a compiled library (typically Fortran or C working only with doubles). With BLAS/LAPACK you are out of luck except for the fFdD types anyway.
  • Changing output shapes is unavoidable if you are selecting something with a keyword or choosing a threshold. “Return everything below 1e-5” will always return different shapes depending on the input.

The argument is whether you want to return something else based on a keyword, e.g. the default value is stuff=this and changing it to stuff=that causes the returned array shapes to differ.

If we can’t have this much flexibility but have to add more functions because some obscure library needs to know things at JIT time, I don’t see how that is our problem. If we can have static typing as a bonus, sure, let’s try to keep as much of it as we can. But if we can’t, it is not “Pythonic” to steer an orthogonal course just to get it, not for Numba, not for anything. That’s my nano diatribe.

To emphasize the disagreement point: static typing should not be a major driver of public Python API signature decisions; otherwise, folks are working in the wrong language.

ISTM we are all pretty much on the same page.
The argument is not about the array API or static typing, IMO. It’s just the ease of predicting the output shapes/dtypes, for both human and compiler readers.

And of course, there are no hard rules; it’s all about what we consider best practices in general. While the ultimate answer is ‘it depends’, let’s work out a few examples (opinionated conclusions below; I suspect we mostly agree on these, though):

  • Changing dtype is unavoidable if you are using a compiled library (typically Fortran or C working only with doubles). With BLAS/LAPACK you are out of luck except for the fFdD types anyway.
  • linalg.qr(int_array) -> float arrays is OK (just as you say, unavoidable unless we prohibit integer inputs)

  • special.comb(m, n, exact=True) -> python int while special.comb(m, n, exact=False) -> numpy float — The first one should have been a separate function or, better, live in a separate namespace

  • definitely not:

def weirdo(x, return_int):
    ...
    return result.astype(int) if return_int else result
  • Changing output shapes is unavoidable if you are selecting something with a keyword or choosing a threshold. “Return everything below 1e-5” will always return different shapes depending on the input.
  • linalg.qr(a, economic=True/False): this is perfectly fine.

  • np.meshgrid(*xi, sparse=True/False): these two are better as separate functions.

  • np.meshgrid(*xi, indexing="xy"): this is perfectly fine.

  • definitely not:

def weirdo_2(x, threshold=1e-5):
    ...
    return result if threshold < 1e-5 else result.squeeze()
  • this stanza is common in some parts of scipy:
if np.isscalar(result):
    return result[()]
else:
    return result

While this is probably done to imitate ufuncs, to me it’s somewhere between the “rather not” and “definitely not” categories.

Note that none of the above is a critique of existing code in scipy or elsewhere; it’s just a personal opinion on what would be best going forward.

Cheers,

Evgeni

It doesn’t. If you see any such statement in the array API docs, please link to it. A “these are our design principles for the array API” statement is very different from what you are saying.

The “don’t do it” for SciPy judgement is mine, based on multiple sources: array API design principles, NumPy design principles (which are similar), learnings from compiler folks like Numba, learnings from static typing usage & support, etc.

You’re misunderstanding; that is not what changing dtypes means. Fortran/C actually constrain you in the right direction here, because they’re statically typed languages. E.g. integer dtypes in → float64 out is perfectly fine. The point is to avoid functions that sometimes give float64 and sometimes something else, depending on the values (rather than types/dtypes) of an input argument.

The point here is to avoid writing such functions unless you really must, which is quite rare. This example with 1e-5 is something you don’t want to put in a public API. If it’s something like eigenvalues, it’s better to return them all and leave selecting below a threshold up to the user. There are some functions that need it, but they should be used sparingly and only when there are no better alternatives.

@ev-br’s examples are about right, I think. Except for a minor thing with the linalg.qr(a, economic=True/False) example: that’s not a thing, it’s mode='economic' instead of a boolean keyword. When we use strings like that, they’re basically our version of enums. That’s nicer than boolean keywords not only because it encourages passing a literal rather than a variable (which helps static typing & co), but also because it’s more extensible.
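
For reference, the shape difference that mode controls in practice (plain scipy.linalg usage, shown only to illustrate the enum-style keyword):

import numpy as np
from scipy import linalg

a = np.ones((5, 3))

q, r = linalg.qr(a)                     # q: (5, 5), r: (5, 3)
q, r = linalg.qr(a, mode="economic")    # q: (5, 3), r: (3, 3)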

The initial discussion stems from not having different output shapes (of the same dtype) because JAX and other libraries can’t build a graph, or something along those lines.

From the top post: Data-dependent output shapes — Python array API standard 2022.12 documentation

That’s fine, but that’s where we disagree. This is not something we can avoid in Python. I’m happy to be proven wrong without writing the same function n times, which is exactly what we are doing when we write Cython or C extensions (with Tempita or manually). This is the fundamental benefit of Python. Anything else is pushing things in the wrong direction. If folks need static typing, they should move to a statically typed language or use C++ templates or Julia’s type system etc. Otherwise it won’t work, regardless of which context you put it in.

Great. Then we are saying the same thing. The detail I emphasize is that a different output dtype based on the input is still a varying output dtype. If you push D in, you get D out, but we get f out when we push f in. So I’m not sure a typing system can handle that just by looking at the signature, without templating. And if it can, why it can’t figure out the shape is something I don’t understand. Even C does not have a problem if you type the output; it does not need to fix the output shape, so I’m not sure where this extra frozen-shape argument helps.

That won’t fly for many linear algebra functions. The point is not to waste huge arrays to get a few eigenvectors/SVDs/reduced decompositions etc. You don’t know how many there will be over the threshold, and you definitely don’t want a 400x400 array just to use 2 columns. That’s why C/Fortran allocate the memory up front: they don’t know either, and fill only the parts they need; we truncate it at the output. LAPACK’s infamous lwork query annoyance exists specifically because of this.

This is a well-defined problem. linalg.eigh in fact has two separate conditions to filter with: based on index and based on value. Likewise, linalg.pinv has cutoff values to decide where to detect rank deficiency, and so on. These examples can be extended to optimize and signal too. Having said all that, I aggressively agree with your remark: I think having a solid API with no surprises is a goal we should all strive for. But what I also want to add is that this is still Python we are working with.
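
To spell that out with the existing linalg.eigh keywords (illustration only): filtering by index keeps the output size predictable, while filtering by value makes it data dependent:

import numpy as np
from scipy import linalg

a = np.diag([-2.0, -1.0, 1.0, 3.0])

# Filter by index: exactly two eigenvalues come back, regardless of the data.
w = linalg.eigh(a, subset_by_index=[0, 1], eigvals_only=True)

# Filter by value: how many come back depends on the data, so the output shape does too.
w = linalg.eigh(a, subset_by_value=[-np.inf, 0], eigvals_only=True)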

Overall, I want to emphasize that I don’t at all mean we should let the API be the funkiest thing, like MATLAB’s nargin/nargout frenzy. But if we start taking inspiration from typed languages, then the Python API will suffer. Python is fundamentally a dynamically typed language. Type hints are fine, consistency is fine. But pretending that Python can be statically typed is not fine (in my opinion).

Moreover, if the typing libraries work with our existing functions, that’s great, let’s do more. But if something is perfectly fine for Python and only some static typing tool fails on a detail, I do believe it is detrimental to the majority of users just so a tiny minority can gain a bit more performance.

Anyway, I guess we have said the same thing multiple times already, so let’s conclude this here. I’ll let @steppi and @izaid decide. Since they did all the work, I should trust their judgement on it.

It can, something like this:

from typing import Any, TypeVar

import numpy as np
import numpy.typing as npt


# This line is not great, but you need only a few of these in a code base.
# npt.NDArray is parametrized by the scalar type, so the TypeVar is bound to np.generic.
_SCT = TypeVar("_SCT", bound=np.generic)


def func(x: npt.NDArray[Any]) -> npt.NDArray[np.float64]:
    """Input is an array with any dtype, output always has float64."""
    ...


def func2(x: npt.NDArray[_SCT]) -> npt.NDArray[_SCT]:
    """Input is an array with any dtype, output has the same dtype as the input."""
    ...

The details of using TypeVar are not always easy to understand, but this kind of logic doesn’t require templating of source code.
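
Continuing the snippet above, this is roughly what a type checker then infers (the exact wording of the output varies by tool; the array here is an arbitrary example):

from typing import reveal_type  # Python 3.11+; type checkers also recognize the bare name

x = np.zeros(3, dtype=np.int32)

reveal_type(func(x))    # checker infers an ndarray with float64 dtype
reveal_type(func2(x))   # checker infers an ndarray with the input's (int32) dtype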

Let’s just agree to disagree here. I’m personally only an occasional user of static typing and not too invested in the topic, but there is in general a large interest in static typing from both end users and open source contributors (e.g., the Typing section of https://discuss.python.org/ has a similar amount of traffic as both Packaging and Core Development).

+1