Add kernel-based interpolation routines?

I have implemented an interpolation algorithm for data that is sampled on a uniform grid. There is already scipy.interpolate.RegularGridInterpolator, but it tends to be very slow, for two reasons:

  1. for every evaluation, it has to search the sampling grid, which RGI treats as nonuniform.
  2. for cubic interpolation, it looks like every evaluation needs to touch a substantial portion of the input array.

My alternative uses convolutions with suitable kernels. This is nothing new; all image and video processing software does this when you request bicubic resampling. It’s available, for example, in OpenCV’s image transforms, but that API is not very practical for me.
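For illustration, evaluating at scattered points through OpenCV means packing the query coordinates into dense coordinate maps, roughly like this (a sketch with made-up data, not my actual code):

import numpy as np
import cv2

img = np.random.rand(60, 60).astype(np.float32)
xq = np.random.uniform(0, 59, 1000).astype(np.float32)  # query columns
yq = np.random.uniform(0, 59, 1000).astype(np.float32)  # query rows

# remap expects per-pixel coordinate maps, so scattered query points
# have to be packed into a 1xN "image" of coordinates:
vals = cv2.remap(img, xq[None, :], yq[None, :],
                 interpolation=cv2.INTER_CUBIC)[0]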

I’d like to contribute my code to scipy. Would this, in principle, be welcome? (I’m brand new here; I just created an account to post this.)

Notes

  • I use numba.njit for the performance-critical parts. I suppose that I will need to rewrite those in C?
  • I wrote my code as a part of my job. I’ll need to align with my employer.
  • I’d have to do quite a bit more work on the code to make it suitable for general-purpose use and inclusion into scipy.
  • A bicubic kernel is not equivalent to a 2D cubic spline interpolation, but it’s close enough for most applications that just require a continuously differentiable interpolation function.
  • I’d like to support a few common kernels (bicubic variants; Lanczos) and allow the user to supply a custom kernel.

Benchmarks

Test on a laptop i7 CPU; 60x60 input samples; 1000 random evaluations; scipy 1.14. Time per evaluation:

  • My 2D linear interpolation: 9 ns
  • RegularGridInterpolator, method='linear': 130 ns
  • 2D kernel interpolation with a 4x4 (e.g. bicubic) or 6x6 kernel: 130-150 ns
  • RegularGridInterpolator, method='cubic': 85 μs (!!)
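For reference, the RegularGridInterpolator side of this benchmark is essentially the following (a hypothetical reconstruction, not the exact benchmark code):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(0)
x = y = np.arange(60.0)
data = rng.random((60, 60))
pts = rng.uniform(0, 59, size=(1000, 2))  # random evaluation points

rgi = RegularGridInterpolator((x, y), data, method='cubic')
vals = rgi(pts)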

Hi Han-Kwang,

This looks very interesting, and potentially in scope!
Several questions though:

  • what is the mathematical content of your methods, is it interpolation (as in, for input data {x_j, y_j}, f(x_j) == y_j) or some smoothing or something else? What are the smoothness guarantees for the interpolator, is it C^1 or C^2 or something else?
  • you mention convolutions with suitable kernels — could you compare your methods to RBFInterpolator then?

I have implemented an interpolation algorithm for data that is sampled on a uniform grid. There is already scipy.interpolate.RegularGridInterpolator, but it tends to be very slow, for two reasons:

  1. for every evaluation, it has to search the sampling grid, which RGI treats as nonuniform.
  2. for cubic interpolation, it looks like every evaluation needs to touch a substantial portion of the input array.

Re: uniform grid: there’s a discussion at https://github.com/scipy/scipy/issues/10180, and the standing recommendation is to delegate to ndimage.map_coordinates (https://scipy.github.io/devdocs/tutorial/interpolate/ND_regular_grid.html#uniformly-spaced-data)
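Roughly, for a uniform grid the delegation amounts to rescaling physical coordinates to fractional array indices (a minimal sketch; the grid origin/spacing names are made up):

import numpy as np
from scipy import ndimage

x0 = y0 = 0.0
dx = dy = 0.5                       # grid spacing (illustrative values)
data = np.random.rand(60, 60)       # samples on the uniform grid
xq = np.array([3.2, 7.7])           # physical query coordinates
yq = np.array([1.1, 9.4])

# map physical coordinates to fractional array indices, then delegate:
coords = np.array([(xq - x0) / dx, (yq - y0) / dy])
vals = ndimage.map_coordinates(data, coords, order=3)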

Would be great if you could compare, too.

For a “substantial portion of the input array”, well, a degree-k spline interpolator in d dimensions touches exactly (k+1)**d input points.
The construction involves all input points though because of C^2 smoothness.

If you only require C^1 smoothness, you can use e.g. the “pchip” method of RGI — it currently uses a slow recursive implementation, and a large speedup can be expected if this is replaced by a construction of an NdPPoly instance from d pchip prescriptions, similar to how RGI constructs an NdBSpline for method=“cubic”. (This would be a very welcome pull request BTW :-).)

Notes

  • I use numba.njit for the performance-critical parts. I suppose that I will need to rewrite those in C?

Yes. Cython, C++ or C are acceptable, so is pythran; numba is not a scipy dependency though.

  • I wrote my code as a part of my job. I’ll need to align with my employer.

Absolutely! Please clarify the IP rights before contributing code.

Benchmarks

Test on a laptop i7 CPU; 60x60 input samples; 1000 random evaluations; scipy 1.14. Time per evaluation:

  • My 2D linear interpolation: 9 ns
  • RegularGridInterpolator, method='linear': 130 ns
  • 2D kernel interpolation with a 4x4 (e.g. bicubic) or 6x6 kernel: 130-150 ns
  • RegularGridInterpolator, method='cubic': 85 μs (!!)

It’s very likely there’s low-hanging fruit in optimizing RGI performance, and addressing those would be very welcome indeed (I’m saying this even though it’s of course orthogonal to your original proposal).

Cheers,

Evgeni

what is the mathematical content of your methods, is it interpolation (as in, for input data {x_j, y_j}, f(x_j) == y_j)

If you use a bicubic or Lanczos kernel, then yes. But kernels don’t need this property (I actually intend to use it with different kernels as well, but those are not suitable for general-purpose scipy).

What are the smoothness guarantees for the interpolator, is it C^1 or C^2 or something else?

C^1 for bicubic and Lanczos. You could make a C^2-smooth kernel with a bi-quartic, but I suspect that won’t be of practical use. This is what Lanczos kernels look like: File:Lanczos-kernels.svg on Wikimedia Commons (which I just uploaded).

There is an implementation detail: I approximate the kernel piecewise linearly, which is good enough for my application, but as part of scipy, I’d set up a quadratic spline, which will have some impact on the performance.

you mention convolutions with suitable kernels — could you compare your methods to RBFInterpolator then?

Not really; RBFInterpolator is for unstructured data samples, unless I’m missing something. For the 1D case with a length-4 kernel, with data the input sample array and kfunc(x) a kernel function defined on the interval [-2, 2], an evaluation at position x is something like:

i = int(x)   # index of the sample at or below x
h = x - i    # fractional part, in [0, 1)
result = sum(data[i+j] * kfunc(h - j) for j in range(-1, 3))

[ndimage.map_coordinates] Would be great if you could compare, too.
(…) If you only require C^1 smoothness, you can use e.g. “pchip” method of RGI

Oh I see… I should have known about that before. :-) There was no pointer to ndimage from the API docs of 1.14.

Benchmark (NxN input grid; mean of M evaluations), laptop i7 CPU:

Algorithm \ N,M:  60,1000  120,1000 60,100  240,1000
My linear:        9 ns     9 ns     40 ns   9 ns
RGI linear:       110 ns   117 ns   600 ns  134 ns
My 4x4 kernel:    133 ns   138 ns   180 ns  140 ns
My 6x6 kernel:    150 ns   155 ns   200 ns  163 ns
map_coords (k=1): 39 ns    39 ns    92 ns   39 ns
map_coords (k=2): 107 ns   227 ns   600 ns  710 ns
map_coords (k=3): 130 ns   245 ns   600 ns  730 ns
RGI cubic:        85 μs    88 μs    80 μs   110 μs
RGI pchip:        135 μs   135 μs   141 μs  145 μs

Pchip is not fast either. I’m surprised about the speed of map_coords(k=3), although the initialization overhead seems to scale badly with the size of the input array.

Cython, C++ or C are acceptable, so is pythran

I’ll have to study those options.

a degree-k spline interpolator in d dimensions touches exactly (k+1)**d input points. The construction involves all input points though because of C^2 smoothness. (…)
It’s very likely there’s low-hanging fruit in optimizing RGI performance, and addressing those would be very welcome indeed

I examined the RGI implementation. I can’t fully follow the algorithm, but it seems to have negligible overhead in __init__; all the work happens in the evaluation calls. The timings above show that, too. Given how complicated the RGI code is, I would not be comfortable touching it.

Sorry, it’s still not clear to me what is the mathematical object you are proposing to add. Could you write out a formula please?

Your benchmarks are intriguing! I suspect there are several things at play:

  • scipy’s interpolators, both from ndimage and interpolate, are general nD, while yours is 2D-specific, correct?
  • your implementation detail of using linear approximations.

So a guess is that once you generalize, the perf gap is going to shrink considerably (this is pure speculation though).

One other thing is that RGI does not assume equal spacing of the input data — and this is a feature. It’s not unexpected that there’s a performance price to pay.
The docs I linked upthread briefly discuss a CartesianGridInterpolator recipe. So I wonder if the recipe can be improved or extended — this would be very welcome for sure.
Also if you see how to make the connection to ndimage routines more prominent in the docs, please do send a PR!

My current implementation is 2D, but for scipy, I’d implement it as N-D; under the hood probably with separate implementations for 1D, 2D, and N-D.

The extra overhead related to nonequidistant sampling in RGI should be vanishingly small compared to the 85 μs evaluation time. A 2x bigger grid is just 1 extra step in a bisection search.

As for the math of kernel-based interpolation: the input sample array can be interpreted as representing a function that is nonzero only at discrete points (a sum of Dirac delta functions). This function is convolved with a kernel that is continuous and nonzero only on a small domain. For a 4x4 kernel (such as a bicubic kernel), a single evaluation is the weighted sum of 16 input samples.

The kernel could look like a Hann window for smoothing or like a windowed sinc function for interpolation. Higher-dimensional kernels would be the product of 1D kernels along the Cartesian axes, e.g. sinc(x)*sinc(y)*sinc(z).
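In formulas, matching the 1D snippet upthread (y_j the samples on an integer grid, K a kernel supported on [-a, a]):

    f(x) = \sum_{j = \lfloor x \rfloor - a + 1}^{\lfloor x \rfloor + a} y_j \, K(x - j)

and in 2D the kernel is the separable product K(x) K(y), so each evaluation sums (2a)^2 samples (16 for a 4x4 kernel).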

My linear interpolation is a separate case.

We do something similar with the interpolation functions in scikit-image, except that, because we do not generally deal with regularly spaced data, we have to evaluate a slightly different kernel at each position.

For regularly spaced data, a kernel approach should be fast. One could perhaps start with numpy.convolve or scipy.signal.convolve to see how far you can get without custom Cython / Pythran code. Using numba in this comparison, as is done in the benchmarks above, is likely to confuse matters, since that is not what is used inside of SciPy itself.

It’d be good to have some code to look at, to see how the kernels are constructed and what the API is.

I think you misunderstand; only the input data (the true values) has regular spacing. The interpolator evaluates at arbitrary points that are generally not on a grid. Discrete convolutions won’t help unless you want to resample the data to a shifted grid.

It’d be good to have some code to look at, to see how the kernels are constructed and what the API is.

I can’t show you my fast implementation until I have obtained permission from my employer, but here is a very basic algorithm to show how a function sampled at integer arguments is interpolated.

import numpy as np
import matplotlib.pyplot as plt

def kernel(x, a):
    # Lanczos kernel with half-width a; nonzero only for |x| < a.
    return np.sinc(x) * np.sinc(x/a) if abs(x) < a else 0.0

def truefunc(fi, fj):
    # Smooth test function to sample and to compare against.
    return np.cos(2*np.pi*0.15*fj) * np.cos(2*np.pi*0.20*(fi - 0.5*fj))

n = 12
samples = truefunc(np.arange(n)[:, None], np.arange(n))

def interp(samples, fi: float, fj: float):
    """Interpolate samples[fi, fj] at non-integer indices."""
    a = 3  # kernel half-width (typically 2 or 3)
    i, j = int(fi), int(fj)   # integer parts
    hi, hj = fi - i, fj - j   # fractional parts, in [0, 1)
    out = 0.0
    for ki in range(1-a, 1+a):
        for kj in range(1-a, 1+a):
            out += kernel(hi-ki, a) * kernel(hj-kj, a) * samples[i+ki, j+kj]
    return out

fig, ax = plt.subplots()
fis = np.linspace(2.5, n-3.5)
for fj in [3.2, 4.7]:
    fjs = np.full_like(fis, fj)
    z_true = truefunc(fis, fjs)
    z_interp = [interp(samples, fi, fj) for fi in fis]
    ax.plot(fis, z_true)
    ax.plot(fis, z_interp, 'x')
plt.show()

The kernel function should be normalized (which it isn’t here), and to avoid all those expensive sinc evaluations, the kernel itself should get a spline approximation.
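A minimal sketch of that tabulation idea (table size and names are illustrative; for scipy I’d use a quadratic spline instead of np.interp):

import numpy as np

a = 3
xs = np.linspace(-a, a, 4001)           # dense sampling of the kernel
table = np.sinc(xs) * np.sinc(xs/a)     # Lanczos-3 kernel values

def kernel_tab(x):
    # Piecewise-linear approximation of the kernel; outside [-a, a],
    # np.interp clamps to the endpoint values, which are 0 here.
    return np.interp(x, xs, table)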

The API might look like this (analogous to scipy.ndimage.map_coordinates):

from typing import Callable, Union
import numpy as np

class InterpKernel:
    @classmethod
    def from_func(cls, kernel_func: Callable, a: int):
        ...

    @classmethod
    def from_array(cls, kernel_vals: np.ndarray, a: int):
        ...

def map_coordinates_kernel(
        input, coordinates,
        kernel: Union[str, InterpKernel] = 'ncubic',
        mode='constant', cval=0.0,
        ):
    """
    kernel can be one of 'ncubic', 'lanczos2', 'lanczos3', or an
    InterpKernel instance.
    """

For what it’s worth, here is a new benchmark run, now separating the setup time from the evaluation time per point. Input array shape (N, N); the runtimes of 20k, 2k, and 20 evaluations are compared to estimate the fixed setup/call overhead and the time per evaluation.

Algorithm \ N:    60           120          240
My linear:        5μs,5ns      4μs,5ns      5μs,5ns
RGI linear:       93μs,70ns    78μs,76ns    78μs,88ns
My 4x4 kernel:    10μs,128ns   13μs,137ns   12μs,140ns
My 6x6 kernel:    10μs,144ns   9μs,154ns    11μs,166ns
map_coords (k=1): 8μs,32ns     9μs,35ns     9μs,34ns
map_coords (k=2): 61μs,55ns    200μs,60ns   700μs,54ns
map_coords (k=3): 61μs,75ns    190μs,73ns   700μs,74ns
RGI cubic:        200μs,80μs   400μs,91μs   1.0ms,110μs
RGI pchip:        380μs,136μs  900μs,153μs  2.5ms,152μs

It was quite difficult to set up the speed test and get reproducible numbers, but I think these timings are good to ±20%.
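The setup/evaluation split was estimated along these lines (a simplified sketch; the actual benchmark code is more involved):

import timeit

def split_timing(run, m_small=20, m_large=20000, repeat=7):
    """Estimate fixed setup overhead and per-evaluation time.

    run(m) must perform one full setup plus m evaluations; solving
    T(m) = setup + m*per_eval from two measurements gives both terms.
    """
    t_small = min(timeit.repeat(lambda: run(m_small), number=1, repeat=repeat))
    t_large = min(timeit.repeat(lambda: run(m_large), number=1, repeat=repeat))
    per_eval = (t_large - t_small) / (m_large - m_small)
    setup = t_small - m_small * per_eval
    return setup, per_eval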

Thanks for clarifying what you are proposing.

Given that we’re not using numba in SciPy, the kernels would need to be pre-defined — which is essentially what is happening in the scikit-image code I linked to. Once you implement the kernels in Cython or Pythran, we can take a look at how the speed compares. We should probably include skimage in there for completeness.

Once you implement the kernels in Cython or Pythran, we can take a look at how the speed compares.

I’m not familiar with either Cython or Pythran. I am familiar with C/C++, but I have never interfaced it with Python. What would you recommend?

For the record, I’m now trying to get familiar enough with Cython to implement the interpolator.

Update: I’m still working on this. It’s not ready yet for sharing with the community, but it looks like I have the basics working. It’s faster (and more accurate) than the numba implementation.

I found it rather difficult to make Cython emit efficient code, but interfacing with numpy arrays from C++ didn’t look very appealing either…

Oh, there is endless fun there, welcome aboard. Just wait until you want to do some LAPACK calls :-)

If you push it to your own fork, please tag me there if you have issues with Cython parts.