Use of runtime SIMD dispatch in SciPy

ilayn · April 2, 2026, 1:20pm

This has been briefly touched upon in Does scipy have a standard way of using simd? but I’d like to put more details in it from a maintenance angle.

Currently SciPy does not provide any optimizations other than -O3 optimization level. While this is often quite sufficient for our algorithms which are not trivial to divide into mundane but optimizable tasks, we do have some NumPy-like C/C++ functions that are often used and eat considerable time budget.

Two of these are, figuring out the structure of an array, (diagonal, triangular, etc.) and symmetricity/hermitianness check. These are sitting in the front of most scipy.linalg functions and dispatch to different algorithms depending on whether triangular or diagonal input is given. However, these checks add up mainly for n = [30, 300] sizes (where it matters), compared to the actual cost of the dispatched functionality.

Note that early-exit is a fundamental property for performance hence existing ufuncs or array comparison solutions cannot be used.Hence, SIMD plays nicely when enabled for these workloads.

But, turning on AVX2 or other features typically do not immediately lend itself to SIMD optimized code by the compiler and some manual intervention is needed. There are many reasons for this but essentially the compiler can’t reason about when or how to break out of certain loops.

On the other hand, there are very heavy dependencies such as Google Highway or SimSIMD projects that try to abstract away the SIMD code specifics. While this might be justified for projects like NumPy, for SciPy it is not that much of a good match for once, we don’t have enough code to be SIMD’d and also we don’t really maintain much of the compiled code. So if it is contained in a few places, maintenance cost is justified in my opinion.

To that end, I prepared a Pull Request that demonstrates this runtime switching SIMD in https://github.com/scipy/scipy/pull/24907 and the performance benefits.

Currently it does too many things at once and I’m going to separate it very soon but you can just look at the _linalg_simd_kernels.c file for the interface. In a nutshell;

Upon the library load time, it runs __builtin_cpu_supports('AVX2') to check whether current machine supports AVX2 features and depending on the results either falls back to scalar implementation or switches to AVX2 implementation via __attribute__((constructor)) mechanism.

But that’s my opinion and we would welcome any pushbacks and/or other contributions to the discussion.

We already had a bit of discussion in ( Ensure we use the correct level of SIMD instructions in the Meson build · Issue #17111 · scipy/scipy · GitHub ) about questions like “What about ARM?”, “Why not just bump SSE version?“. So more technical details are there in case interested.

ev-br · April 2, 2026, 2:05pm

The _linalg_simd_kernels.c source file from the PR looks not entirely terrible: the intent is clear, and it’s unlikely that we’ll want to modify it often. What I’m not sure about is how to troubleshoot it if something goes wrong. OTOH I’m happy to trust you that you know how, and that you’ll be taking point troubleshooting it during the initial roll-out :-). After which it’s probably not going to need much maintenance anyway.
As a small DX feature, it’d be nice to have a build switch, in a form of a meson option or something, to bypass the load-time dispatch selection and use a scalar version regardless of the CPU architecture. If only as a trouble-shooting implement.
Other than that, I don’t see much problem, and 2x - 4x speedups certainly look worth reaping.

My two cents,

Evgeni

ilayn · April 2, 2026, 9:50pm

We can always turn it off and the regular branch is built. I do have the same concern so I did not bake anything in. The functionality is basically the same with the regular function but more SIMD jargon sprinkled on top with a switch that is currently tied to architectural #ifdef but you can always set it to 0 unconditionally.

Regarding the maintainability, I will comment the file properly that you would know exactly what is going on because the only annoying part of SIMD are these cryptic nonsense names. The nice thing about SIMD commands is that they do exactly what they say, so much easier than a programming language.

Other than that, I’m pretty sure you can get up to speed very quickly (should you wish to). I used to think this was complicated like Assembly but after writing it some time now, I don’t think so. The difficult part is figuring out which command to use and also figuring out what functions are available. Reading the written code is quite OK.

The main switch is at the top. If you override has AVX2 flag, all code becomes invisible and only scalar version is compiled. But we can rename it something else if it is confusing.

rgommers · May 4, 2026, 8:28pm

After the split of the initial PR into two, ENH:linalg.bandwidth: Implement AVX2 SIMD enabled branches for GCC/Clang + X86_64 by ilayn · Pull Request #24994 · scipy/scipy · GitHub looks easier to digest if folks want to have another look at this topic.

Regarding the maintainability/performance tradeoff: this currently looks like 4x more lines of code for 4x more performance. That seems reasonable if the pattern stays limited to this function, or a few more such places. I think we should count on adding NEON for arm64 platforms, because that’s roughly equally important nowadays. That won’t will probably be another 300 lines of code or so, so it’d be 6x more code. Still okay probably. There are no other candidate platforms now that would warrant adding dedicated support - possibly RVV for RISC-V or SIMD128 for WASM, but those are a couple of years away from being high-value enough.

There is the Windows question though - it’s excluded now, but I think it can be added for clang-cl, mingw-w64, and also Intel compilers. Just not MSVC, which we don’t support today anyway. I think we should also decide that in case we do start supporting MSVC again (needs dropping both Fortran and Pythran, and maybe something else), we won’t add yet more code to compile AVX2 kernels with a different dispatch mechanism - at that point the maintainability is almost certainly impacted too much (and if we wanted this, we should use Google Highway instead).

What I’d like to establish is that a decision to merge this PR does not mean we can then go add hand-written SIMD kernels in other places. Because if this pattern would really grow, there’s a more serious impact to maintainability of the code base as a whole. So I’d suggest that, if the decision here is to go ahead for linalg.bandwidth, that doesn’t mean it’s fine for other functionality without a new explicit discussion.