This has been briefly touched upon in Does scipy have a standard way of using simd? but I’d like to put more details in it from a maintenance angle.
Currently SciPy does not provide any optimizations other than -O3 optimization level. While this is often quite sufficient for our algorithms which are not trivial to divide into mundane but optimizable tasks, we do have some NumPy-like C/C++ functions that are often used and eat considerable time budget.
Two of these are, figuring out the structure of an array, (diagonal, triangular, etc.) and symmetricity/hermitianness check. These are sitting in the front of most scipy.linalg functions and dispatch to different algorithms depending on whether triangular or diagonal input is given. However, these checks add up mainly for n = [30, 300] sizes (where it matters), compared to the actual cost of the dispatched functionality.
Note that early-exit is a fundamental property for performance hence existing ufuncs or array comparison solutions cannot be used.Hence, SIMD plays nicely when enabled for these workloads.
But, turning on AVX2 or other features typically do not immediately lend itself to SIMD optimized code by the compiler and some manual intervention is needed. There are many reasons for this but essentially the compiler can’t reason about when or how to break out of certain loops.
On the other hand, there are very heavy dependencies such as Google Highway or SimSIMD projects that try to abstract away the SIMD code specifics. While this might be justified for projects like NumPy, for SciPy it is not that much of a good match for once, we don’t have enough code to be SIMD’d and also we don’t really maintain much of the compiled code. So if it is contained in a few places, maintenance cost is justified in my opinion.
To that end, I prepared a Pull Request that demonstrates this runtime switching SIMD in https://github.com/scipy/scipy/pull/24907 and the performance benefits.
Currently it does too many things at once and I’m going to separate it very soon but you can just look at the _linalg_simd_kernels.c file for the interface. In a nutshell;
Upon the library load time, it runs
__builtin_cpu_supports('AVX2')to check whether current machine supports AVX2 features and depending on the results either falls back to scalar implementation or switches to AVX2 implementation via__attribute__((constructor))mechanism.
But that’s my opinion and we would welcome any pushbacks and/or other contributions to the discussion.
We already had a bit of discussion in ( Ensure we use the correct level of SIMD instructions in the Meson build · Issue #17111 · scipy/scipy · GitHub ) about questions like “What about ARM?”, “Why not just bump SSE version?“. So more technical details are there in case interested.