I wanted to bring this post to your attention. It discusses a plan for allowing scikit-image (and other packages) to support more than just NumPy arrays.
I created a repo that has a demo of dispatching to different backends with a uarray-based approach. Benchmark results and some discussion are given in the README there. I mostly autogenerated the uarray multimethods on the scikit-image side using a script provided in the repository above, but it does create an additional layer of indirection that can make things a little harder for new contributors to understand. The Dask backend there is a more user-friendly way of using `apply_parallel`, where a user would just have to register a `dask_backend` and never have to call `apply_parallel` (or the equivalent `dask.array.map_blocks`) directly.
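For readers less familiar with how such a backend works, here is a rough sketch of the idea. This is my own simplification, not the demo's actual code: the `depth` choice and the `footprint` keyword (scikit-image ≥ 0.19) are assumptions, and `apply_parallel` does essentially the same overlapped chunking internally.

```python
import dask.array as da
from skimage import morphology


def binary_erosion_dask(image, footprint=None):
    # Run the existing NumPy implementation chunk-wise, with a halo
    # (`depth`) so that pixels near chunk borders see enough of their
    # neighbours for the structuring element.
    depth = max(footprint.shape) // 2 if footprint is not None else 1
    return da.map_overlap(
        morphology.binary_erosion,
        image,
        depth=depth,
        dtype=bool,
        footprint=footprint,
    )
```

A backend built from functions like this is what the dispatch machinery routes Dask inputs to once it is registered or activated via the context manager.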
We should put together some corresponding demos of what becomes possible with the array_namespace approach in NumPy 1.22. That approach is relatively simple to understand, but doesn't help us with all of our functions that currently use Cython code.
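For a sense of what that approach looks like, here is a minimal sketch. The function name is made up for illustration, and note that in NumPy 1.22 only the experimental `numpy.array_api` arrays expose `__array_namespace__`; plain `np.ndarray` does not yet.

```python
def normalize(image):
    # Ask the input which array-API namespace it belongs to (NumPy,
    # CuPy, ...) and do all the work through that namespace, so the
    # output has the same array type as the input.
    xp = image.__array_namespace__()
    lo = xp.min(image)
    hi = xp.max(image)
    return (image - lo) / (hi - lo)
```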
Thanks, Greg, it is great to have a concrete piece of code to reason about. This may be worth mentioning on the main thread as well.
Cool, it is very useful to have something to try out and experiment with!
Just in case it helps anyone: I tried `uskimage-demo` (at least the CPU backends). I ended up using your `uarray-v3` branch though, and adding an empty `uskimage-demo/uskimage_demo/morphology/__init__.py`.
One thing that would interest me is playing around with the implicit dispatching (multiple-dispatch) side of it: if the input is already a Dask array, maybe we want to return a Dask array and automatically choose the Dask backend?
As an example (sorry, using `blobs` and `footprint` from the demo):

```python
skimage.morphology.binary_erosion(
    da.asarray(blobs, chunks=1000), footprint)

with skimage.set_backend(dask_numpy_backend, only=True):
    skimage.morphology.binary_erosion(
        da.asarray(blobs, chunks=1000), da.asarray(footprint, chunks=100))
```
The first takes much longer than the second one. So even though all inputs are dask arrays, we don't actually use the dask backend automatically. (Maybe you shouldn't, here!)
But for the more general case, I would be very interested in having an example that shows how, e.g., the user might pass a `cupy`/`dask` array and get out a `cupy`/`dask` array, even though skimage itself doesn't even have that backend?
This case of automatically dispatching to Dask or CuPy backends without needing to use a context manager is being discussed in Auto-register fft backend to avoid context manager · Issue #14266 · scipy/scipy · GitHub. It involves having the library set the global backend with `try_last=True`, so that any other registered backends get tried first. If scikit-image then called `register_backend` on those by default, the Dask backend would have been used in the first case you propose as well. I am not sure we should do that on the library side, though, or leave it up to users. As long as you have registered the backend, you can avoid the explicit use of the context manager, which is convenient.
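A sketch of the two usage patterns, reusing the names from the example above. Both `skimage.set_backend` and `skimage.register_backend` are hypothetical entry points from the demo branch, mirroring what `scipy.fft` exposes, not released scikit-image API.

```python
# Explicit: every call site needs the context manager.
with skimage.set_backend(dask_numpy_backend, only=True):
    result = skimage.morphology.binary_erosion(
        da.asarray(blobs, chunks=1000), footprint)

# Registered once, e.g. at import time.  If the library's default
# NumPy backend is installed with try_last=True, registered backends
# are tried first, so Dask inputs would dispatch to the Dask backend
# without any context manager at the call site.
skimage.register_backend(dask_numpy_backend)
result = skimage.morphology.binary_erosion(
    da.asarray(blobs, chunks=1000), footprint)
```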
A separate shortcoming of the Dask backend as implemented in `uskimage_demo` is that it currently assumes the Dask array chunks are NumPy arrays on the host. Given that Dask arrays can also have CuPy arrays as chunks, we should handle that case as well. I will see about making an update for that scenario.
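One possible way to handle that case, as a sketch only: it assumes CuPy and cuCIM are available, uses Dask's `_meta` attribute to infer the chunk type, and the `footprint` keyword name depends on the scikit-image version.

```python
import dask.array as da


def binary_erosion_dask(image, footprint):
    # Choose a per-chunk implementation based on the chunk type that
    # the Dask array carries in its metadata (`_meta`).
    try:
        import cupy
        gpu_chunks = isinstance(image._meta, cupy.ndarray)
    except ImportError:
        gpu_chunks = False

    if gpu_chunks:
        # cuCIM provides a drop-in CUDA implementation of the skimage API.
        from cucim.skimage.morphology import binary_erosion
        footprint = cupy.asarray(footprint)
    else:
        from skimage.morphology import binary_erosion

    depth = max(footprint.shape) // 2
    return da.map_overlap(
        binary_erosion, image,
        depth=depth, dtype=bool, footprint=footprint,
    )
```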
Another design difference between the skimage uarray branch and SciPy's uarray use is that SciPy currently has `set_backend`, `register_backend`, etc. in each subpackage (i.e. there is `scipy.fft.set_backend` and a proposed `scipy.ndimage.set_backend`). This makes some sense in that backends like `pyFFTW` or `mkl_fft` are FFT-specific, linalg backends are likely to be linear-algebra specific, etc.
In the scikit-image case, though, I currently just have a single top-level `skimage.set_backend` that sets the backend for all subpackages. I assumed this is likely more convenient, as there are probably few cases where one wants different backends for, say, filters vs. morphology in the same usage scenario, but I am interested to hear if there is disagreement on that point.
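To make the contrast concrete: the `scipy.fft` lines below use the real SciPy API with CuPy's drop-in FFT backend, while the skimage lines use the hypothetical top-level switch from my branch (`x`, `image`, `footprint`, and `dask_numpy_backend` stand in for the arrays and backend from the earlier examples).

```python
import cupyx.scipy.fft
import scipy.fft

# SciPy: the backend is scoped to one subpackage.
with scipy.fft.set_backend(cupyx.scipy.fft):
    y = scipy.fft.fft(x)  # only scipy.fft calls are redirected

# Demo branch: one top-level switch covers all subpackages.
with skimage.set_backend(dask_numpy_backend):
    smoothed = skimage.filters.gaussian(image, sigma=2)
    eroded = skimage.morphology.binary_erosion(image > 0.5, footprint)
```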
fwiw and from a high level (haven't had time to delve into the details of the proposals), I very much would like things to work implicitly with multiple dispatch by default, at least in the simple cases. It's great that `np.mean(my_dask_array)` Just Works now, and it would be equally great if `skimage.filters.gaussian` also Just Worked in the same way.
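Concretely, the `np.mean` line below already works because Dask arrays implement the `__array_function__` protocol; the commented-out skimage line is the aspirational part.

```python
import dask.array as da
import numpy as np

x = da.random.random((10_000, 10_000), chunks=1_000)
m = np.mean(x)  # dispatches to Dask; returns a lazy Dask result
# y = skimage.filters.gaussian(x, sigma=3)   # would ideally return a Dask array too
```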
Sorry about this, I found I had forgotten to commit a couple of the files in the more recent `uarray` branch! (I was running an in-place install, so things were working okay when I had tested it prior to posting.) I just tested that it now works for a clean install. There is still a separate issue about getting a more recent release of uarray uploaded to PyPI so pip can meet that requirement (we specify `uarray>=0.8.2`, but PyPI only has 0.6.0 at the moment).
@stefanv I read your comment Towards a Faster Python project for scientific Python - #2 by stefanv and was wondering whether there are any updates on this front from scikit-image's side? While we aren't quite taking on the "whole-sale dispatching" in SciPy, we are seeing success utilising the array API standard and "dispatching" to other backends, with both under quite active development.
Yes, we are currently most interested in following the approach used in NetworkX, and will be working on refactoring that into a package called `spatch` for further experimentation. Hopefully, there will be some discussions at EuroSciPy as well.
You can follow along at Spatch requirements for reuse in other libraries · Issue #1 · scientific-python/spatch · GitHub.