Requirements and discussion of a type dispatcher for the ecosystem

In uarray, implementations can return NotImplemented to signal that the next backend should be tried.
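
For reference, this is roughly what that looks like with the protocol scipy.fft uses (MyBackend and my_fft are made-up names; a minimal sketch, not production code):

import numpy as np

def my_fft(x):
    # Stand-in implementation for this sketch.
    return np.fft.fft(x)

class MyBackend:
    __ua_domain__ = "numpy.scipy.fft"

    @staticmethod
    def __ua_function__(method, args, kwargs):
        if method.__name__ != "fft":
            return NotImplemented  # tell uarray to try the next backend
        return my_fft(*args, **kwargs)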

The way uarray currently does priorities is that you can select an explicit implementation (maybe you can also give a sorted list of possible choices – not sure). Basically, you can float the backend you want to prefer to the top.

In this thread, I wanted to dive more into the type-dispatching, which I think is a bit distinct from prioritizing one implementation over another (since these are for the same types).
Assigning (mutable) priorities is one way to do backend selection though. Returning NotImplemented (or enable/disable) is another. (I don’t like priorities for type-dispatching, because operator priorities in Python or NumPy have not been a huge success – Python 1 had them! – although, it may well be that we don’t need priorities for type-dispatching as opposed to backend selection anyway.)

Yes. Though in my example, whichever backend is tried first will work, because both (mat_mul.impl1 and mat_mul.impl2) are able to process CuPy arrays (it’s just a matter of how they do it).

I see. That is great. Though it would be interesting to know how easy it would be to do that at run time – e.g. whether it requires de-registering backends, or whether you can just provide a new list and uarray will override the backend state it maintains.

Yes.

Ah! I see. Please let me know if this is not the right place to discuss this further, I will move my questions to a new issue in uarray.

Thanks a lot for answering my questions and sharing your thoughts.

It’s resolved. If we want a particular backend to be used for a block of code, we can use a context manager with our preferred backend and put the code block inside it. Thanks to Peter Bell for resolving this.
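
For example, with CuPy’s scipy.fft backend (this is the documented scipy.fft pattern):

import cupy
import cupyx.scipy.fft as cufft
import scipy.fft

x = cupy.random.standard_normal(1024)
with scipy.fft.set_backend(cufft):
    y = scipy.fft.fft(x)  # dispatched to CuPy for this block only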

From the perspective of scipy.ndimage or scikit-image I would say yes, this level of overhead is acceptable. Even a factor of 10 larger than that is unlikely to be an issue outside of extremely rare scenarios (processing extremely small images in a loop, maybe?). For example, all benchmarks given in the uskimage-demo repository (github.com/grlee77/uskimage-demo, a proof-of-concept uarray + scikit-image demo) are > 100 ms for the scikit-image backend, and all are still > 1 ms even for the purely-GPU CuPy backend.

I agree that the current uarray approach is unambiguous for libraries that only accept their own array types. However, there are some scenarios where different backends may take the same array types. Specific ones in the context of scikit-image are (some of these have a prototype in the demo at github.com/grlee77/uskimage-demo):

1.) A DIPlib-based backend for scikit-image functions. We could in principle make this only accept DIPlib array objects as the input, but I don’t think that is very user-friendly. It would be nice to just register the backend and get much faster filtering and morphology operations for existing numpy-based scikit-image code.
2.) There are two dask backends in that repository. One strictly requires dask.array.Array inputs while the other coerces the input to/from numpy arrays. The latter can allow simple extension of multiprocessing support to scikit-image for users who do not wish to work with Dask directly.

Currently the default scikit-image backend would be tried last (as it is registered using try_last), but if both the dask_numpy and diplib backends were registered, I don’t think the order in which they are tried is currently guaranteed. Someone could create an OpenCV backend for 2D images as well, so there could potentially be multiple backends that can in principle accept the same array types.

For scikit-image, a small number of footprint (aka structuring element) generation functions in skimage.morphology (such as ball, octagon, etc.) are array-creation functions. These do not take any array-like inputs so cannot dispatch purely based on array type. It seems this may be a case that benefits from uarray’s with context to select the desired implementation.

That said, those are a small minority of functions. Many more functions do things like allocate an empty or zeros array internally, but those would seem to be compatible with the __array_namespace__ approach of the Array API.

True, but overheads below 300ns (~2 lightweight Python function calls) seem very doable if the dispatcher knows it can use the types up-front. Ralf mentioned that some people were annoyed at the slowness of __array_function__. This is much worse, unless we find that library functions are in practice always a whole lot more heavyweight than NumPy functions.
My example was extreme, but note that in NumPy a length-2000 rfft also takes 10µs, which is not exactly tiny. (Images are rarely that lightweight, but what about scipy.stats, certain sklearn use-cases, etc.?)

Yeah, I think there are two cases:

  • We have e.g. type-dispatched implementations like Dask, which also accept ndarray (because it supports a mix of both) and it would be nice if we are sure that the Dask backend will not be asked if all inputs are ndarray (the Dask backend could reject this at a speed penalty, knowing that an ndarray version exists)
  • We have two backends which both try to provide an alternative implementation for the same types. In this case we always need some prioritization (which is why I consider this backend-selection vs. type-dispatching).

I still need to distinguish them clearly in my mind. But yeah, I think figuring out how we want to behave for backend-selection is also important (I care more about the first point here). We could also consider things like warnings. I somewhat suspect that “backends” will normally require some explicit user choice (at the least from evilcorp import faster_skimage), so there might even be some options like giving a UserWarning when things get ambiguous.

Yes, and I still argue that we shouldn’t use the with statement here, for the reasons noted above and in NEP 37. Rather, skimage may need to either add like= to these, or the user may need to write something like plum’s skimage.morphology.ball.invoke(cupy.ndarray)(...).
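
To illustrate, a hypothetical like=-enabled version of such a footprint function, in the spirit of NEP 35 (this is not skimage’s actual signature):

import numpy as np

def ball(radius, dtype=np.uint8, *, like=None):
    # Build the footprint with NumPy, then convert to the array type of `like`.
    Z, Y, X = np.mgrid[-radius:radius + 1, -radius:radius + 1, -radius:radius + 1]
    footprint = ((X**2 + Y**2 + Z**2) <= radius**2).astype(dtype)
    if like is None:
        return footprint
    return np.asarray(footprint, like=like)  # NEP 35 dispatch on `like`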

singledispatch is nice and simple. I made a package that wraps SciPy fft, ndimage, linalg, special functions with singledispatch for experimentation. It can be tried in a Colab notebook.
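
The pattern is roughly the following (an illustrative sketch of the idea, not the actual package code):

from functools import singledispatch

import numpy as np
import scipy.ndimage

@singledispatch
def gaussian_filter(image, sigma):
    # Default: coerce array-likes and use the SciPy implementation.
    return scipy.ndimage.gaussian_filter(np.asarray(image), sigma)

try:
    import cupy
    import cupyx.scipy.ndimage

    @gaussian_filter.register(cupy.ndarray)
    def _(image, sigma):
        return cupyx.scipy.ndimage.gaussian_filter(image, sigma)
except ImportError:
    pass  # CuPy backend simply isn't registered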


This is a concern, although I expect the vast majority of programs will use at most one or two array libraries at a time. It’s clear we want libraries to support say JAX and PyTorch but I can’t think of a reason why a user would register both backends in the same program.

It might be interesting though to have the backends declare statically which types they support, which would allow you to quickly prune backends that definitely don’t support the arguments being dispatched without calling into __ua_convert__.
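
E.g. via some hypothetical __ua_types__ declaration, a dispatcher could prune like this (a sketch; the attribute name is made up):

def prune_backends(backends, arg_types):
    viable = []
    for backend in backends:
        declared = getattr(backend, "__ua_types__", None)  # hypothetical attribute
        if declared is None:
            # No declaration: we must still ask __ua_convert__.
            viable.append(backend)
        elif any(issubclass(t, declared) for t in arg_types):
            viable.append(backend)
    return viable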

Sure, and I assume you can speed up uarray a bit also. But right now, we seem to be at a potential “typical case” overhead of 4 × N(backends) slower than what seems easy for a dispatcher that has all the info available up-front (i.e. a strict-type dispatcher similar to singledispatch):

from functools import singledispatch
import numbers

def plain(arg, arg2, arg3, arg4, arg5):
    return "plain", arg, arg2, arg3, arg4, arg5

@singledispatch
def disp(arg, arg2, arg3, arg4, arg5):
    return "base-impl", arg, arg2, arg3, arg4, arg5

# Super fast of course, ~73ns for me
%timeit plain(1, 2, 3, 4, 5)
# Much slower, the wrapping adds ~320ns even if there is only one choice :).
%timeit disp(1, 2, 3, 4, 5)

# Register a couple more implementations so the dispatcher has several choices.
@disp.register(float)
def _(arg, arg2, arg3, arg4, arg5):
    return "float", arg, arg2, arg3, arg4, arg5

@disp.register(numbers.Integral)
def _(arg, arg2, arg3, arg4, arg5):
    return "integral", arg, arg2, arg3, arg4, arg5

@disp.register(list)
def _(arg, arg2, arg3, arg4, arg5):
    return "base-impl", arg, arg2, arg3, arg4, arg5

# has to do more "work" now, but it is constant no matter which one we take
# and additional work is only 40ns and is a dict lookup, so basically constant
# at 360ns overhead.
%timeit disp(1, 2, 3, 4, 5)
%timeit disp("base", 2, 3, 4, 5)

And this is pure Python using *args, **kwargs. So you could probably knock off more than 100ns by moving to C! Let’s say the “relevant args” extraction adds ~100ns if it is done by a helper (as in __array_function__ and uarray, but it seems like this may be a convenient API in any case), and I guess you may need to add up to 20ns for every additional “relevant arg”.

That leaves us at <500ns overhead, with lots of potential to speed things up (some of which would require technical deep-dives though)! I would not be surprised if <200ns is possible in the vast majority of cases.

Yeah, you can speed things up with additional API. But if that requires backend buy-in, it is harder to add at some future point than to add it now (you have to fix all backends, rather than one frontend). And once we have the API for type-based “pruning”, it seems to me that we probably do not need __ua_convert__ at all for type-dispatching!
(There may be some convenience about supporting auto-conversion or some type of “replacer-API”, but it seems a feature that is completely independent of dispatching itself – at least most/typical dispatching.)

I believe there are a lot of good ideas and things in the concrete uarray proposal, but I also think we need to start by identifying them and looking at each choice individually to see if there are better alternatives.

We really need better arguments or fixes for:

  • The use of the with statement in its very broad form. (I think its availability to force a type-specific backend is a no-go. And there needs to be a reason why it must exist in this form, rather than this being a possible, but discouraged, abuse.)
  • An argument for why a dispatcher and registration process that is primarily based on types is not preferable for type-dispatching.
    • Performance is one aspect here, but I think convenience is almost as important.
    • Yes, there will be limits especially if we come to backend-selection, but frankly, they feel solvable (if a bit more tricky than the current version obviously).
  • Listing of all available implementations for a single function (I do not think a list of backends that might implement it is good). I.e. I don’t understand why the “domains” need to be so broad? Potentially allow libraries to blocklist outdated/buggy backend versions? …?
  • … probably more, but I would hope those will be more of the “technical details” kind.

I will be frank: if you were asking for inclusion in NumPy, unfortunately I think I would have to veto any proposal that does not address these points much more clearly. Right now, the alternatives look better to me when it comes to type-dispatching, and I do not understand why adding on backend-selection should be too hard.
uarray seems to me to have always been designed based on __array_ufunc__ and __array_function__, which means that an alternative like “multiple dispatch” was probably never thoroughly considered (happy to be proven wrong). Since then, a lot of things were fixed and solved (let’s learn from those!), but I think we need to backtrack to especially those early design choices and really be sure they were right.

Taking a concrete proposal on without them addressed feels like potentially taking on technical debt that may just be too painful to repay later. We should have the resources right now to carefully explain and also reconsider the design choices rather than hoping that we can fix them later. If there were no clear alternatives, that may be an argument, but they exist, they are just not implemented.

Let’s say I want to write a PyTorch backend that accepts all types that implement the __torch_function__ or __torch_dispatch__ protocols. In that case you can’t statically know which types you support. There isn’t even a type hierarchy, so you need more than basic multiple dispatch. You might introduce an abstract base class instead, but at that point I don’t see how it’s any better than calling a named function.

with set_backend(...) is necessary to enforce an ordering to dispatch. If I want my application to use mkl_fft for NumPy arrays instead of the default scipy.fft backend, I need some way of making it have a higher priority. I don’t see how this constitutes an “abuse”.

Domains are useful for libraries like dask that wrap other array libraries. It means you can write one __ua_function__ and have it work for functions you’ve not even heard of before.

An interesting alternative design used in PyTorch land is a dispatch table per function and additional global fallback implementations. If the dispatch table doesn’t have a direct entry for a specific function, you execute the fallback instead. The main downside here is probably the O(n_functions x n_backends) memory overhead.
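
In rough pseudo-Python (all names made up), that design is something like:

dispatch_tables = {}  # function name -> {backend key -> implementation}
fallbacks = {}        # backend key -> generic fallback implementation

def call(func_name, backend_key, *args, **kwargs):
    impl = dispatch_tables.get(func_name, {}).get(backend_key)
    if impl is None:
        # No direct entry: use the backend's global fallback,
        # e.g. materialize and re-dispatch.
        impl = fallbacks[backend_key]
    return impl(*args, **kwargs)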

Yes, I can: it is exactly those types that implement __torch_function__ and __torch_dispatch__. At least assuming they don’t get dynamically tagged on to the instances. So an ABC should be able to resolve this just fine, and you could also use ABC.register() if need be to resolve certain cases.

What am I missing? Because this was actually my argument for describing type-dispatching based on the type hierarchy! The type hierarchy should be able to automatically sort a potential pytorch (or subclass-of-pytorch) specific backend behind the __torch_function__ one: The correct one will always get picked without anyone having to worry about registration order!
This is something you lose completely with a strictly linear search that depends on registration order.
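
Concretely, a sketch of such an ABC (the hook is standard Python; the class name is just for illustration):

from abc import ABC

class TorchTensorLike(ABC):
    """Matches any class implementing the __torch_function__ protocol."""

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is TorchTensorLike:
            return hasattr(subclass, "__torch_function__")
        return NotImplemented

# issubclass(torch.Tensor, TorchTensorLike) is True, and the ABC machinery
# caches the result. A backend registered for torch.Tensor itself would
# still sort ahead of one registered for TorchTensorLike, being more specific.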

Right, this is what I try to call “backend-selection”, because I feel it is distinct from type-dispatching (ensuring the cupy implementation for cupy input). In backend-selection we (usually?) have two implementations which match our inputs equally well (based on the type they work on), and we need a way to pick which one to use. And we probably need to be able to pick a specific one context-locally?

Unlike implementations for different types (type-dispatching), for these backends, we can assume that they behave the same except for possible small precision differences. I.e. you can’t break a library by forcing a backend, unless it is due to reduced precision or so (which the backend should be explicit about if it is plausible).

Switching to a backend matching equally well (are there other possibilities?) is probably needed in some form, i.e. enabling/disabling the use of fftw if the input is a NumPy array. But I still have severe reservations about allowing the with statement to choose the return type in a context-local manner. This is better done locally, using like=... or function.invoke(cupy.ndarray)(...). But I think allowing the with statement to do this is the current design, and it is even used as an argument for the design.
This was probably the main reason behind the whole __array_namespace__ counter-proposal to uarray!

I honestly don’t know yet how to do backend-selection using a with statement with a normal multiple-dispatcher. It seems the dispatcher needs to query some global state (for each potential backend?), or alternatively, the with statement needs to explicitly modify resolution orders for each function while active (if “domains” are the functions, the backend has its “own” domain containing all functions it overrides).
That may be OK, I doubt it is super slow (we probably just need to change the order in some linked-list for 100 functions) but it seems not lightweight either.
And I am not sure whether with pyfftw(): needs to be lightweight, or if we can say that any place where it would have to be lightweight has to use pyfftw.fft directly.
(I.e. trade off dispatching speed, which is very common, for slower backend switching speed.)

(There are other possibilities, some backends may want to always be tried first, but promise to fall-back to the previous one quickly, e.g. for small arrays. It would be interesting to know what backend providers would be most interested in providing for their users.)

True, but is this important enough to burden the API-providing library and end-user with, if it means additional complexity or fewer features/optimizations for those? Can’t we find a solution to dynamically find all functions to register with, rather than trusting the backend to be quick about deciding whether it even implements a given function?

Yes, a per-function dispatch table is exactly what I think we must consider thoroughly (in whichever form). What does torch need the additional global one for – or is that the “duck-type” version that you try if all else fails (i.e. just like NumPy calls its own implementation if no argument provides __array_function__)?

I do not see much of a reason against it because of the memory footprint. Firstly, it does not seem large (compared to the size of each function, a table of functions should be small). Secondly, it only applies to such dask-like backends, which try to wrap more than they explicitly implement – which seems like a very specific and probably atypical use-case anyway. We won’t have dozens of such backends (and if we do, we have more problems than the memory footprint).

So concretely, say I have one backend that implements the ABC TorchTensorLike and another that implements torch.Tensor. IIUC you want the second backend to have implicitly higher priority because torch.Tensor is a subclass of TorchTensorLike?

The idea is interesting but I don’t think it’s enough on its own. There will be ambiguous situations where multiple backends handle the same type but neither is more specialised than the other. In these cases you still need the user to be able to impose an explicit priority. At which point it isn’t clear to me whether that implicit priority is valuable or just distracts from the tools we already have to set explicit priorities.

I’m not particularly attached to supporting calls without dispatchable arguments. I think that’s completely orthogonal to the with statement though. Whichever way you slice it, dispatching requires there to be some state that tells you which backends to call and the end-user needs to be able to manipulate it. This is the same for single/multiple dispatch, it’s just a decorator syntax there instead of a context manager.

You would need to update the dispatch table of every function that backend implements. uarray has less overhead here because we only need to maintain dispatch tables per-domain. However, I expect most use cases will set their backends once at the start of the program and leave them alone from there, so the setup itself likely isn’t performance sensitive.

I think it’s impossible to dynamically find all functions that a backend might implement because more functions can always be imported at any time, including after the backend is registered.

Not duck-type, but specific types with generic implementations that rely on re-dispatching to other backends to implement the function.

A nice example is conjugate views. They are views of normal in-memory tensors that represent their values after conjugation. Some functions have special handling of conjugated views like torch.conj which returns a non-conjugated view of the underlying tensor. Most functions however, are implemented by a fallback that materializes a new tensor with the conjugated values and re-dispatches to the normal CPU or CUDA implementation.

So with a relatively small amount of code you have a new type that implements interesting and useful semantics.
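
For illustration, with a recent PyTorch:

import torch

t = torch.tensor([1 + 2j, 3 - 4j])
v = torch.conj(t)            # a lazy conjugate view; no data is copied
assert v.is_conj()
w = torch.resolve_conj(v)    # the fallback path: materialize the values
assert not w.is_conj()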

  1. It’s not just the memory itself but also the time overhead associated with looking up dispatch tables, which may not be in cache, and, as you mentioned, initializing the dispatch tables when you register the backend. Another point that occurs to me is that registration for multiple dispatch requires you to eagerly import the functions to get at their dispatch tables, even if you never end up using them in practice. Whereas domain-based registration just needs to know the name of the domain.

  2. You need dispatch tables for every dispatching function in every library, and every backend that directly implements a function needs to make an entry in that function’s table, not just dask. In fact it’s the opposite; generic backends like dask don’t have to register per-function but concrete backends do.

This is why I prefer to make a clear distinction between multiple backends for the same types (i.e. those providing better performance) and type-dispatching (implementations enabling a function to work with more input types, such as cupy or torch).
The former clearly needs an additional mechanism on top of type-dispatching (e.g. priorities that may even have state)! But I think we are doing ourselves a huge disservice by assuming that the need for backend-selection means that we cannot make use of what, to me, currently seems like the clearly superior approach for the type-dispatching part.

I feel that type-dispatching is probably our most important and pervasive use-case. A type-based dispatcher (such as singledispatch) seems to have very concrete advantages, since uarray cannot get the prioritization correct there even in the type-dispatching case (you have to carefully craft the backend order, or the user has to choose manually).
So the singledispatch approach does better: it only needs user input for backend-selection/prioritization, whereas uarray currently may need user input (or a lot of care?) even to find the best implementation in the type-dispatching case.

Great, can we even pin this down as a strict requirement – either forbid it, or make sure it is documented that implementations doing type-dispatching must be written such that end-users cannot reasonably misuse them in this way?
In addition to not allowing calls without dispatchable arguments, there is also one important behaviour restriction that all implementations must adhere to: __ua_convert__ may only ever “match” if the backend can guarantee the output type will be correct for the inputs – e.g. a JAX backend must either not match on numpy-only inputs, or guarantee that the output is still a numpy array.
Because of this, I think the force=True option is probably best removed for the time being, since its purpose seems to be to explicitly allow the opposite.

(This requirement on __ua_convert__ also applies to any type-dispatcher, but the uarray documentation makes it seem like the opposite is encouraged: coercing in __ua_convert__ – and thus returning a different type(?) – is documented as correct so long as it is O(log(N))!)

I think this is an important thing to discuss and consider the options! Because, I feel there is a good chance we can solve this and then we can use type-dispatching on a per-function level for most of the work.

If we say that updating the function tables explicitly is too slow (and it might not be), we may be able to solve it e.g. with a prioritization scheme.

For uarray we currently get the typical lookup of:

  • Walk through all domains and all its backends (some of which may not implement the function)
  • “touch” the backend to check it has a matching implementation (through __ua_convert__)

Backend-selection modifies the order in which we check, but I think we can say that we may need to “touch” half the backends in typical average use-case. (“touching” is currently very slow in uarray.)

NB: Technical confusion about nested with-statement in different domains

What happens if I do:

with backend_in_fft_domain():
    with backend_in_scipy_domain():
        scipy.fft.fft()  # which backend gets to go first?

What happens if I swap the with statements? Do I always get the version defined by the inner-most one?

E.g. let’s say we find a way to assign “priorities” that can be used for backend-selection within a with statement (warning: not sure exactly how to assign the priority in practice, especially in cases like the tricky “scoped” one above). Priorities (if used) would have to be context-aware and would probably be stored in a ContextVar (I assume one for each backend).

So what would happen is:

  • We use type-dispatching to narrow things down to the backends that may apply (this is cachable, and fast as we do not need to “touch” the backends).
  • For the implementations with mutable priorities (some backends), we have to inspect their priority:
    • “touch” each one, by fetching its priority ContextVar
  • We sort them based on their priority (i.e. ContextVar result) (very fast)

The limiting part here is that we need to “touch” all applicable backends (hopefully, in practice always only 1 or 2!). However, “touching” may be as cheap as loading a single contextvar!

If you can do the “type-dispatching” part in uarray without ever touching the backends (i.e. using a cache), you might have the upper-hand (although I am not sure it would matter in practice). But without that this should be much faster. (Of course we need to figure out a scheme to “assign” that priority when entering the with statement.)

An improvement or alternative would be if we can find a way to have a “version”: A fast way to tell a function that it may need to update its dispatch order (invalidate its type-dispatch cache). If we have that, it should not really matter if reordering the backends is a bit of a slow process (but it may still be a bit tricky to figure out a good way to actually do that reordering).

So we can say this is the “Dask” use-case where you have an implementation that you know will always work for all functions (and there are no type-safety concerns, because whether or not materialization occurs is generally considered an implementation detail)?

In the case of conjugate views, it seems that the fallback is just a “default implementation” that applies to a large set of inputs? An option could also be to have an AnyTorchSubclass version (which materializes), and a pytorch.Tensor version (which does not need to materialize). The conjugate-view version uses the materializing one (it does not match the other one!), but could register a ConjugateView one if it exists.

I agree with the use-case, and that it may be helpful to have more. But I still tend to think it should be put low on the priority list for now and we should explore how “minimal” we can go (and start as minimal as possible!).

Yes, to restate and make sure we agree. If we ignore dask, we have two possibilities (using -> for the table and :: for lookup that should be simpler):

  • A per-backend table, mapping function -> implementation. Giving us the lookup chain function::domains::backend[function] -> implementation
  • A per-function table mapping backend -> implementation. Giving us the lookup chain function[types?] -> implementation (the implementation knows its backend if necessary).

Which to me seems pretty much identical storage wise (there always is a table somewhere). The Dask backend is the only exception, because it may not need a function -> implementation table at all, but would still need to put a DaskArray -> implementation_that_always_works mapping into the per-function table.
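
As a sketch of the two layouts (all names are made up):

import numpy as np

def numpy_fft(x): ...
def dask_fft(x): ...

class DaskArray: ...

# (a) per-backend table: backend -> {function name -> implementation}
per_backend = {"dask_backend": {"fft": dask_fft}}

# (b) per-function table: function -> {dispatch type -> implementation}
per_function = {"fft": {np.ndarray: numpy_fft, DaskArray: dask_fft}}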

I don’t really believe that this is meaningful, nor am I sure that it really is an advantage [reasons skipped]. But maybe we should ignore this, because it may not matter in practice…
What I think matters much more is having the ability to create a per-function cache that can do most of the work. That is probably such a gigantic win that it could make everything else almost irrelevant. I know how a cache can work on types, but I currently do not know how it can reasonably work with the uarray setup.

Btw. I realized that assigning priorities seems pretty easy. The reason is that we can create a 64-bit counter that is incremented each time a backend is set to the new “highest” priority. That is, each time the user uses with backend() we just update its priority via highest += 1; backend_priority = highest, and when we exit the context we reset the priority of the backend to the old value.

If we want to use the highest_priority to indicate whether any priority change may have happened, we would, intuitively, increment it in __exit__ as well (even though the highest actually used priority may now be lower).

64 bits would be more than enough to make overflow impossible during program runtime (even 32 bits should be sufficient in practice, but no need to gamble ;)).
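
A minimal sketch of that counter scheme, combined with the ContextVar idea from above (all names made up):

import contextvars

_highest = 0  # bumped each time a backend is made the highest priority

class Backend:
    def __init__(self, name):
        self.name = name
        self.priority = contextvars.ContextVar(f"{name}_priority", default=0)

    def __enter__(self):
        global _highest
        _highest += 1
        self._token = self.priority.set(_highest)
        return self

    def __exit__(self, *exc_info):
        self.priority.reset(self._token)  # restore the old priority

def select(applicable_backends):
    # "Touching" a backend is just one ContextVar load:
    return max(applicable_backends, key=lambda b: b.priority.get())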

EDIT: Hmmm, although I am a bit unsure what happens if you have multiple threads. I guess that may be fine, but if we want to cache the priorities, we also need to cache it thread-local, or so? Not sure how that plays out, does uarray have a pattern for thread-safety already that would cover this as well?

Code says more than a thousand words, and I got stuck thinking about it yesterday. So here is a pretty full-featured prototype for what I was suggesting: github.com/seberg/precodita (a rough prototype for a single-type but multiple-parameter type-dispatcher and user backend-selector).

The interesting place to start is example.py, which shows off how things work and what is supported.

To recap, two modes/features:

  1. Strict type-dispatching. The better-matching implementation is preferred. If two implementations have the same dispatch-type (promotables are currently ignored here), their priority is compared. Right now, the more recently defined backend (not the more recently registered one) has priority.
  2. Enabling and prioritization: some backends are opt-in. They are ignored by default, but can be enabled in a context. Within the context, their priority is set higher than all others (only a nested with statement can get a higher priority).

Very briefly, like __array_function__ (or uarray), I do use the “relevant argument extractor” approach, limited to a single kind of argument.

Unsolved stuff includes things like pickling (not even sure what to expect here). I somewhat expect that thread/asyncio-safety is close to working right, but honestly, probably not quite right yet.

The more specific domain has higher priority, so regardless of registration order backend_in_fft_domain gets called first.

You seem to have missed the point. It’s trivial to figure out the type dispatch, but it’s highly non-trivial to register it against individual functions. You can’t know the full set ahead of time!

Yes, either way you need to do some form of lookup to get the function in the backend. But of course a per-backend mapping already exists; it’s the module itself. Just do getattr(module, func.__name__).

The important part is that the registration of a function within a backend is usually static, whereas the registration of backends is dynamic. In precodita you point out that pickling is an issue. Well if you want multiprocessing then you have to pickle all of that dynamic state to send it to the other process. For domain-based registration this is fairly minimal, but for function-based registration you need to send over every individual function’s dispatch tables. Or somehow “replay” the sequence of calls that setup those dispatch tables.

I think it would be useful to map some of the design space explicitly. I think most of these are fairly independent choices so we don’t have to mix them together in discussion.

Backend registration

__array_function__ essentially registers a backend against the type which isn’t extensible over multiple libraries so doesn’t work here. Instead the choices are against the function or against an abstract “domain”.

Registering against functions is in some sense a simpler choice but:

  • Doesn’t support generic backends like dask
  • Multiprocessing requires serializing a lot of extra state
  • Functions have to be imported to be registered (or some indirection or deferred registration is required like LazyDispatchable)

Registration against domains fixes these problems but at the cost of having a list of backends that may or may not implement the function which need to be tested at runtime. Some caching may help here.

Explicit backend priorities

In order to support higher performance backends for specific functions and types, the user needs to be able to explicitly prioritize one backend higher than another. e.g. by calling the most recently registered first.

Implicit backend priorities

In uarray there are no implicit priorities; if two backends are given the same explicit priority (e.g. both are added with register_backend) then there is no guarantee which one will be tried first. Whereas precodita uses ABC type hierarchies to implicitly prioritize backends more specialized to a specific type.

Type dispatching

uarray uses a single function called with all relevant arguments (__ua_convert__) but precodita uses one function call per argument (and that function happens to be called __subclasscheck__). There are of course two more options here: one call with the types, or one call per argument with the arguments.

IMO one call is superior to one per argument, and calling with types is easier to cache so should be preferred if we are purely considering type dispatch.
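
E.g. a cache keyed on the tuple of argument types could look like this (a sketch; resolve() stands in for the full matching logic):

def resolve(func, relevant_args):
    # Stand-in for the full (slow) backend-matching logic.
    return func

_dispatch_cache = {}

def dispatch(func, relevant_args):
    # Fast path: a single dict lookup keyed on the argument types.
    key = (func, tuple(type(a) for a in relevant_args))
    impl = _dispatch_cache.get(key)
    if impl is None:
        impl = resolve(func, relevant_args)
        _dispatch_cache[key] = impl
    return impl(*relevant_args)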

Argument annotation

precodita’s extractor returns a list of arrays, which is an implicit form of annotation. uarray.Dispatchable on the other hand explicitly annotates arguments with two pieces of metadata: “coercible” and “type”. “coercible” really means “not an out argument”, which is important for dask-like generic backends. “type” may seem irrelevant for array libraries, but it’s actually used in the ndimage backend prototype, where there are arguments that might not actually be an array, because out=np.float32 is how you specify the dtype (unfortunately).

Another point of reference is the pytorch dispatcher where each function has a schema with type info and “alias info” which tells you which arguments may be modified during the function call. This is again mainly used for generic “fallback implementations”.

Argument normalization

Most dispatchers simply pass through the arguments from the function call directly. uarray, on the other hand, normalizes arguments by removing arguments that match the default value (compared by identity, i.e. using is, to reduce perf overhead). This has some small runtime overhead, but means you can add new arguments with default values to existing functions and the backends will still work, as long as you don’t pass a non-default value.
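
Roughly like this (a sketch of the idea, not uarray’s actual code):

_missing = object()

def strip_defaults(kwargs, defaults):
    # Drop any keyword argument that *is* (by identity) its default value,
    # so backends never see newly added arguments left at their default.
    return {name: value for name, value in kwargs.items()
            if value is not defaults.get(name, _missing)}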

Argument coercion

NumPy functions can be called with any array-like inputs and they will be coerced into np.array. uarray has support to opt-in to this behaviour for specific backends, however I don’t think it’s relevant here because __array_namespace__ wouldn’t allow this.

I had another long reply typed out, outlining where and why I disagree with many of the points. But I think it is more useful to try and continue with your good start of dissecting things, and to structure my reply along those lines.
First of all, though: precodita is a proof of concept, so it obviously only shows potential alternatives and is not a complete solution like uarray tries to be. Importantly, though, it shows off multiple orthogonal design alternatives – something that I also started to lose track of…

Type dispatching

Sub-points here, as you nicely distinguish, are:

  • Implicit priorities.
  • Explicit priorities (what I call backend-selection, since I do not want explicit priorities to apply to the type-dispatching part).

precodita tries to show that we can use a type-hierarchy to do this. It further gives a way to combine that with explicit priorities. I see no argument against this alternative, yet. I think precodita gets both of the above points better:

  • uarray has no proposed solution for “implicit priorities”, but you do list the problem it solves as a possible user-story (duck-type backends).
  • (explicit priorities) precodita uses type hierarchy > with-statement as priority, while uarray uses domain > with-statement, AFAIK. That makes the following obviously right in precodita, while it is unclear to me whether it can be for uarray:
    with pyfftw:  # "domain" is `scipy.fft`
        with mkl:  # "domain" is `scipy`
            scipy.fft.fft(...)  # uses mkl over pyfftw
    
    Probably orthogonal: I also use with backend: rather than with ns.set_backend(backend), which seems like a nicer API to me.

You say that precodita may perform less well. I disagree, although in the end I doubt it matters. precodita does exactly the same checks as __ua_convert__ does, but it does them for the user, which allows me to cut corners (and stay within C).
Yes, isinstance(obj, ABC) as a formalism may add a bit of overhead in some cases, but ABCs do cache the result (__subclasshook__ is called only once – and we could even add another cache).
My guess is that my approach is faster in practice, but I am not sure and it probably depends on usage. Also, factors such as knowing when backends definitely do not implement a function may have a bigger influence (uarray does this differently, but it seems orthogonal to the matching itself – this is part of the “implementation discovery”!).

In my opinionated summary, precodita shows off a few design possibilities with regard to type-dispatching:

  • We can do the matching of __ua_convert__ for the backend author. (Basically, implement __ua_convert__ for them – which to me seems nice.)
  • We can have “implicit priorities” if we rely on the type-hierarchy for matching.
  • We can have caching if we rely only on types for matching (which is a huge bonus IMO).

Further, but probably not overlapping:

  • We can improve on the with statement semantics with respect to “explicit priorities” (I am not sure if that requires the type-hierarchy or not, but it seems good to combine it with that.)
  • We can enforce that with statements will have no “type dispatching” semantics. (uarray probably could do this, so long as __ua_convert__ is correctly implemented by the backends – it is not clearly specified yet, though.)

Global priority/backend state (part of “backend registration”, but let’s split it out)

The global “priority” state seems identical: both have a “list of backends” and “priorities” in some form. Both are ultra-lightweight in this regard, they just store the priority a bit differently.
The backends themselves are super lightweight, how they provide/fetch their implementations is interesting though.

Implementation Discovery (other chunk of “backend registration”)

uarray discovers implementations by:

  • Having a scope in the “domains” as a quick filter/match whether a backend may have an impl.
  • Asking __ua_function__ to provide/call the implementation.

precodita (completely distinct from the type dispatching!) tries to formulate this using only decorators. Note that precodita could just as well ask each backend to have a get_implementation() or call_implementation() function doing the exact same thing as __ua_function__! So we can have type-hierarchy/type-based dispatching replace __ua_convert__ for matching, and retain __ua_function__ unmodified!

In precodita, I tried to make the point that a strict API based on decorators may be nicer. I agree it will add complexity in the dispatcher, but I think it reduces API complexity for the end-user and also may open up very useful introspection abilities. (Yes, overall we probably get quite a bit of more complexity, but that complexity will be opaque and only the maintainers of the dispatching system should have to worry about it.)

Note that as I point out, we can develop this idea further. For example, use:

@backend.add_implementation(module, qualname)
def myimpl():
    pass

So that we have control over ensuring that no unnecessary imports happen (for example, cupyx.scipy.fft imports scipy.fft and builds its own _implementations dictionary).

I do agree that precodita may need to grow a bit of additional API, e.g. a way to say: “If a module named skimage.morphology registers dispatchables, I need to inject my stuff there!”.
If you squint at that, you will see that this is actually exactly the same as __ua_domain__ = "numpy.skimage.morphology", which tells uarray that if skimage.morphology is used/imported, it needs to ask the backend about its implementations!

There may be slight differences about when that lookup happens, although I expect anything is possible here if we have a clear preference. On the up-side, the more we take charge of dealing with this, the more we can ensure that e.g. an mkl backend doesn’t “import the world”, and the more optimization tricks or introspection API we may be able to pull off.

Argument annotation/normalization

I think uarray does some great things there. I will probably want to review some in detail, but this is fully orthogonal and IMO secondary to the other things.

Argument coercion

I am OK with argument coercion only if it is purely a helper to backends, or if explicitly and locally requested by a user, such as:

scipy.fft.fft.invoke(cupy.ndarray)(numpy_array)

There are some interesting thoughts possible, such as fully supporting a frozen state/namespace to allow:

scipy.fft.fft.invoke(cupy.ndarray, priority_state=state)(numpy_array)
# or even:
ns = get_frozen_state_namespace(type=cupy.ndarray)  # or type=None
ns.scipy.fft.fft(numpy_array)

But I think these are additions that are again orthogonal to the dispatching approach.

The sub-domain always takes precedence in dispatching, so uarray would call pyfftw in your example. Note that a similar issue applies to precodita:

with pyfftw:  # supports type(x) == np.ndarray
    with mkl:  # supports array-like types
        f(...)  # Does f use `mkl` or `pyfftw`?

The root of the issue is having two competing methods of prioritizing backends.

This requires backend providers to import the backend library in order to sub-class the context manager. Whereas uarray only needs the generic API to interact with uarray directly. So e.g. CuPy can implement scipy.fft without any new dependencies.

Rather than end-user, do you mean backend-provider? If so, then again not wanting to force a new dependency rules out decorators. The backend-provider UX doesn’t change the principle though, which is domain vs. function registration; both of which could be done via protocol or decorators.

Something that has to be mentioned is array-like arguments. Many libraries already have a backward-compatibility requirement to support lists and other array-likes. With precodita you would need to install two backends, one for np.ndarray and a separate one for array-likes so that the np.ndarray backend doesn’t get demoted during dispatch for not being specific enough.

I strongly object to f.invoke(type)(...) because it means that scipy.signal.fftconvolve has to decide whether to use type-dispatching style or invoking style. Or from user code, it makes no sense to call scipy.fft.fft.invoke(cupy.ndarray) when they could just call cupy.fft.fft directly.

I had a long answer that tried a more constructive language, but frankly… I am out. So this will not be very friendly…

Your arguments against my suggested alternative, honestly, feel like arguments but not reasons:

  • with backend: does not override the type-hierarchy?
    • Yes, that is correct, but it is an API choice that I made; it is clearly not a requirement of the approach, and I could do the opposite.
    • Also: yes, this is also surprising, but I could imagine meta-backends (or backends registered for multiple types) handling it. So can you honestly argue that it is clearly worse than the “domain” surprise?
  • “You need to import the backend, not needing an import should be the goal”:
    • Yes, it is nice to have. But I don’t see that it has to be a goal.
    • uarray needs to import scipy.fft.set_backend, and scipy has to import uarray so the dependency is there also.
    • No matter what, you have either a try: import backend or a try: import scipy – you need some “import” unless you don’t care about registering the backend.
    • One argument I had was that providing decorators/machinery will help avoid the scipy.fft import. Unlike the backend system, those imports can be very heavy, and I think there could be a real advantage in making it easy to avoid them.
      So the potential advantage of providing helpful API to backend authors is still on the table.

Right now the discussion feels pretty futile. IMO, uarray is clearly not ready without modifications (probably big ones) or at least without seriously vetting what the best API and implementation choices are. I have given possible alternatives. Are we discussing whether those alternatives may be better, or are we just trying to quickly check them off to come back to uarray already being the best possible choice?

To make a slightly silly citation from “The Zen of Python”:

Now is better than never.
Although never is often better than right now.

At the moment it is my clear position that the second line applies. “never” is better than adopting uarray right now (unmodified), because uarray has simply not been vetted carefully enough, both implementation- and API-wise.

We should aim to find the nicest API for library authors and end-users and the best implementation (and especially a future-proof, and probably thus limited, one)…
To be frank: uarray feels designed by taking the __array_function__ approach and extending it until it is feature-complete. Unfortunately, I don’t see that there was ever a pause to ask whether that isn’t over-extending the approach beyond its limits.
Yeah, I can see that it is frustrating to have someone come along and say: “Please shelve uarray for now completely because to me it looks like it is overextending the __array_function__ approach”.
But I can’t change my opinion here. I think uarray must be shelved for now, until the design space and possibilities have been considered carefully. And sorry, but to me they clearly have not been – otherwise my arguments/observations would not have been new to anyone on this thread.
And, frankly, NEP 31 was shot down mid-air very long ago. That should have been a hint that at least some things around the with statement need to be changed…

uarray is great as something to learn from, to figure out what we need, want, and often probably steal concepts and ideas verbatim. But thinking that we can just adopt it now without ever carefully considering the alternatives seems a fallacy.
You need to convince me/us that there was careful thought about design alternatives. At least for me, you are making very little progress on that front. And yes, that is bound to be a bit more involved, but remember that the two main initial arguments that I read were:

  • a type-based (multiple dispatcher) cannot do array-likes
  • plum, etc. cannot do backend selection.

The first is wrong, and I have shown that the second is a limit of concrete implementations, not of type-based dispatchers as such.


To be honest – from my point of view – until now, most arguments explicitly given in favor of the uarray design over alternatives seem to have turned out not to be thought through.
The initial post also notes the design choices/axes from NEP 37… but it glosses over the fact that at least one of those continues explicitly arguing against uarray’s current design?!

Right now, I fail to see any willingness to take a step back, which was what I had hoped to achieve when starting the thread…
Frankly, I don’t even see a willingness to do something about any of the possible issues I pointed to. Is there a TODO list somewhere of things to consider doing, maybe starting with:

  • update the __ua_convert__ documentation to note that it is not OK to coerce other types even if it is fast;
  • tell people that they must only match based on types, to allow adding type-based caching?

And, unrelated to this thread, has there been any discussion at all of whether an end-user API that has all of (for each “domain”/“module”):

  • module.register_backend()
  • module.set_backend()
  • module.skip_backend() – still around even though I had previously convinced Hameer that it has bad scoping?!
  • module.set_global_backend()

can’t be made more slender, at least for now, when we are proposing to put this into a lot of modules?
