BLAS and LAPACK functions with threading in SciPy

Hi Developers!

  1. This question is about running BLAS and LAPACK functions in parallel using threads in SciPy.

  2. I built OpenBLAS as a dynamic library for Windows with the number of threads set to 12. I then built SciPy against that OpenBLAS build and linked it using delvewheel for use at runtime.

  3. Initially I built SciPy against an OpenBLAS build with no thread count set (NUM_THREADS=0). According to my benchmark results, this SciPy wheel performed poorly in the SciPy functions that rely on the BLAS and LAPACK modules from OpenBLAS, compared with the prebuilt SciPy wheel available on PyPI.

  4. I see the same behavior even when I use the OpenBLAS that was built with 12 threads. Is there anything I am missing in the configuration? Is there something specific I have to configure to make sure SciPy uses the threaded OpenBLAS?

Thanks!!!

Here is what SciPy uses for its wheels: tools/build_openblas.sh on the main branch of the MacPython/openblas-libs repository on GitHub.

Are you sure it’s threading related? It should be easy to check whether your build uses 1 or 12 threads when performing an operation on a large enough 2-D array. The other most common cause is CPU architecture selection. The SciPy wheels are built for 5 different x86-64 CPU architectures. The default is to build for only one: the architecture of the build machine. If you built it in CI and run it locally on a different CPU, that may carry a large performance penalty.
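
One way to run that check is sketched below. It assumes the third-party threadpoolctl package is installed alongside NumPy and SciPy, and that threadpoolctl recognizes your OpenBLAS DLL by its file name (a renamed custom build may not be detected). The helper names `summarize_blas` and `main` and the matrix size are only illustrative. It prints the thread count and CPU architecture reported for each loaded BLAS library, then times one large matrix product with BLAS limited to a single thread and again with no limit.

```python
import time


def summarize_blas(pools):
    """Pick out (internal_api, num_threads, architecture) for each BLAS library."""
    return [
        (p.get("internal_api"), p.get("num_threads"), p.get("architecture"))
        for p in pools
        if p.get("user_api") == "blas"
    ]


def main(n=3000):
    import numpy as np
    import scipy.linalg.blas as blas  # importing this loads SciPy's BLAS library
    from threadpoolctl import threadpool_info, threadpool_limits

    # One entry per BLAS library loaded in the process (NumPy's and SciPy's
    # may be separate copies), with its thread count and detected CPU core.
    for entry in summarize_blas(threadpool_info()):
        print(entry)

    a = np.asfortranarray(np.random.default_rng(0).standard_normal((n, n)))
    for limit in (1, None):  # None leaves the library's own default in place
        with threadpool_limits(limits=limit, user_api="blas"):
            start = time.perf_counter()
            blas.dgemm(1.0, a, a)  # goes through SciPy's BLAS wrapper
            elapsed = time.perf_counter() - start
        print(f"limit={limit}: {elapsed:.2f} s")


if __name__ == "__main__":
    try:
        main()
    except ImportError as exc:
        print(f"Missing dependency: {exc}")
```

If the two timings are about the same, or the reported thread count is 1, the build is probably not running threaded. If the reported architecture does not match the CPU you are running on, the architecture selection is the more likely culprit.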