NumPy C function implementations

Hi Developers

As many contributors to SciPy are also part of NumPy, I have some questions about NumPy's C-level function implementations:

  1. We are currently trying to optimise certain NumPy function calls to get better performance.
  2. I can see that the dot and sqrt functions contribute most of the execution time. We used profilers such as cProfile and line_profiler to get these insights (a minimal profiling sketch is included after this list).
  3. Since the actual implementations of these modules are written in C, they cannot be traced by the Python profilers.
  4. Where would I find the actual implementations of these C function calls? It would help to be pointed to the exact implementation of these functions. Also, are there any tools to trace these C calls from the Python function calls?
    thanks
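
For reference, here is roughly how we collected the profile (a minimal sketch; the array sizes and the `compute` function are just placeholders for our real workload):

```python
import cProfile
import pstats

import numpy as np

def compute(a, b):
    # Placeholder workload: matrix multiplication followed by an element-wise sqrt.
    return np.sqrt(a @ b)

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

# cProfile only sees the Python-level call into the C-implemented ufunc/BLAS
# routine; the time spent inside the C/Fortran code shows up as one opaque entry.
cProfile.run("compute(a, b)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```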

Hi @Andressflintoff ,

I think a reasonable starting point would be to look at the documentation for, and output of, numpy.show_config(), numpy.show_runtime() and scipy.show_config(). Matrix multiplication (dot and friends) gets farmed out to a linear algebra library (BLAS/LAPACK). It is also worth noting that, depending on what you have installed, the BLAS/LAPACK libraries may actually be implemented in Fortran rather than C. In the ancient past (before conda, manylinux wheels, etc.), one had to compile NumPy oneself to link against fast implementations of these libraries, and simply installing from PyPI (via pip or easy_install) led to very slow matrix multiplication code being linked in.
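
For example (a quick sketch; the exact output depends entirely on how your NumPy/SciPy builds were configured):

```python
import numpy as np
import scipy

# Build-time configuration: which BLAS/LAPACK NumPy was compiled against.
np.show_config()

# Runtime information (available in newer NumPy versions): detected CPU
# features and the BLAS implementation actually loaded.
np.show_runtime()

# SciPy may be linked against a different BLAS/LAPACK than NumPy.
scipy.show_config()
```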

My knowledge of the internals in question is quite dated, so hopefully someone with more current knowledge will chime in, but at least you have a starting point.


By default NumPy is linked against OpenBLAS, which provides very fast implementations of many matrix operations.

If you are doing fairly simple ufuncs, you may also want to look at numexpr. Numba may help, as may writing custom kernels in pythran.
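
For example (a minimal sketch; numexpr must be installed separately), numexpr evaluates a whole expression in one pass over the data rather than one NumPy operation at a time:

```python
import numexpr as ne
import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# Plain NumPy: each operation allocates a temporary array (a**2, b**2, their sum).
plain = np.sqrt(a**2 + b**2)

# numexpr compiles the expression and evaluates it in blocks (and in parallel),
# without the intermediate temporaries.
fused = ne.evaluate("sqrt(a**2 + b**2)")

assert np.allclose(plain, fused)
```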

But, I’d be surprised if you could get significant speedup over existing implementations coming from BLAS (an easier route would be to use a faster BLAS, such as MKL).


Are the sqrt functions from the Universal C Runtime library (UCRT)?

Not necessarily. For many functions (and this one as well, I think), NumPy now has custom implementations to make better use of CPU features (SIMD: single instruction, multiple data).
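
If I remember correctly, the element-wise loops live under numpy/_core/src/umath in the NumPy source tree (numpy/core/src/umath in older versions). You can also confirm from Python that sqrt is a C-implemented ufunc with no Python source to step into:

```python
import inspect
import numpy as np

print(type(np.sqrt))  # <class 'numpy.ufunc'>

# ufuncs are implemented in C inside NumPy's extension modules, so there is
# no Python source file for a profiler (or inspect) to point at:
try:
    inspect.getsourcefile(np.sqrt)
except TypeError as exc:
    print("no Python source:", exc)
```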

For many individual calls, NumPy should perform well, with the caveat that NumPy doesn't parallelize (although linear algebra typically will, due to BLAS use).
NumPy also doesn't combine operations (“fusing” or “fission”), so other tools or packages may perform better because of that.

(Of course, I can't say: for any specific use case, overheads may matter more, or there may be unnecessary casting, etc., etc.)
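
As an illustration of that fusing point (a rough sketch; Numba is optional and the function name is just made up for the example):

```python
import numpy as np
from numba import njit

@njit(fastmath=True)
def fused_norm(a, b):
    # One pass over the data: the square, add, and sqrt are fused into a single
    # compiled loop, with no intermediate arrays.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = np.sqrt(a[i] * a[i] + b[i] * b[i])
    return out

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Pure NumPy makes three separate passes and allocates temporaries along the way.
reference = np.sqrt(a * a + b * b)
assert np.allclose(fused_norm(a, b), reference)
```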

I haven’t used it much, but e.g. py-spy tries to make it convenient to profile into the C level: GitHub - benfred/py-spy: Sampling profiler for Python programs
(That will help you see what gets called, but you will probably need to compile NumPy locally to have debug symbols.)
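
A rough sketch of how that might look (I'm not sure of the exact flags on every platform; your_script.py is just a stand-in for whatever you are profiling):

```python
# your_script.py -- the workload you want to sample
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
for _ in range(20):
    c = np.sqrt(a @ b)

# Then sample it from the command line, including native (C/Fortran) frames:
#   py-spy record --native -o profile.svg -- python your_script.py
# and open profile.svg to inspect the flame graph.
```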