Use of runtime SIMD dispatch in SciPy

This has been briefly touched upon in Does scipy have a standard way of using simd? but I’d like to put more details in it from a maintenance angle.

Currently SciPy does not provide any optimizations other than -O3 optimization level. While this is often quite sufficient for our algorithms which are not trivial to divide into mundane but optimizable tasks, we do have some NumPy-like C/C++ functions that are often used and eat considerable time budget.

Two of these are, figuring out the structure of an array, (diagonal, triangular, etc.) and symmetricity/hermitianness check. These are sitting in the front of most scipy.linalg functions and dispatch to different algorithms depending on whether triangular or diagonal input is given. However, these checks add up mainly for n = [30, 300] sizes (where it matters), compared to the actual cost of the dispatched functionality.

Note that early-exit is a fundamental property for performance hence existing ufuncs or array comparison solutions cannot be used.Hence, SIMD plays nicely when enabled for these workloads.

But, turning on AVX2 or other features typically do not immediately lend itself to SIMD optimized code by the compiler and some manual intervention is needed. There are many reasons for this but essentially the compiler can’t reason about when or how to break out of certain loops.

On the other hand, there are very heavy dependencies such as Google Highway or SimSIMD projects that try to abstract away the SIMD code specifics. While this might be justified for projects like NumPy, for SciPy it is not that much of a good match for once, we don’t have enough code to be SIMD’d and also we don’t really maintain much of the compiled code. So if it is contained in a few places, maintenance cost is justified in my opinion.

To that end, I prepared a Pull Request that demonstrates this runtime switching SIMD in https://github.com/scipy/scipy/pull/24907 and the performance benefits.

Currently it does too many things at once and I’m going to separate it very soon but you can just look at the _linalg_simd_kernels.c file for the interface. In a nutshell;

Upon the library load time, it runs __builtin_cpu_supports('AVX2') to check whether current machine supports AVX2 features and depending on the results either falls back to scalar implementation or switches to AVX2 implementation via __attribute__((constructor)) mechanism.

But that’s my opinion and we would welcome any pushbacks and/or other contributions to the discussion.

We already had a bit of discussion in ( Ensure we use the correct level of SIMD instructions in the Meson build · Issue #17111 · scipy/scipy · GitHub ) about questions like “What about ARM?”, “Why not just bump SSE version?“. So more technical details are there in case interested.

The _linalg_simd_kernels.c source file from the PR looks not entirely terrible: the intent is clear, and it’s unlikely that we’ll want to modify it often. What I’m not sure about is how to troubleshoot it if something goes wrong. OTOH I’m happy to trust you that you know how, and that you’ll be taking point troubleshooting it during the initial roll-out :-). After which it’s probably not going to need much maintenance anyway.
As a small DX feature, it’d be nice to have a build switch, in a form of a meson option or something, to bypass the load-time dispatch selection and use a scalar version regardless of the CPU architecture. If only as a trouble-shooting implement.
Other than that, I don’t see much problem, and 2x - 4x speedups certainly look worth reaping.

My two cents,

Evgeni

We can always turn it off and the regular branch is built. I do have the same concern so I did not bake anything in. The functionality is basically the same with the regular function but more SIMD jargon sprinkled on top with a switch that is currently tied to architectural #ifdef but you can always set it to 0 unconditionally.

Regarding the maintainability, I will comment the file properly that you would know exactly what is going on because the only annoying part of SIMD are these cryptic nonsense names. The nice thing about SIMD commands is that they do exactly what they say, so much easier than a programming language.

Other than that, I’m pretty sure you can get up to speed very quickly (should you wish to). I used to think this was complicated like Assembly but after writing it some time now, I don’t think so. The difficult part is figuring out which command to use and also figuring out what functions are available. Reading the written code is quite OK.

The main switch is at the top. If you override has AVX2 flag, all code becomes invisible and only scalar version is compiled. But we can rename it something else if it is confusing.