The Faster CPython project is working on making CPython 5× faster. This is great, but it doesn’t help scientific Python as much as it does other use cases, since scientific Python is mostly compiled code that bypasses the CPython interpreter altogether.
I believe it is possible to achieve something similar for the scientific Python ecosystem, making code 2-5× more efficient. Specifically, I’m talking about single-threaded code with the same APIs running faster. I’m not talking about parallelism, nor am I suggesting massive re-architectures like the difference between Pandas’ eager execution and Polars’ lazy execution.
Some examples
To be clear from the start: the massive amount of volunteer work that goes into these projects is amazing. But given limited time, some things don’t happen, and I have observed significant potential for speedups. What follows is a grab bag of examples.
- Some scikit-image/SciPy image processing APIs are much slower than OpenCV equivalents. I’ve encountered APIs where there’s a 10× opportunity for speedup.
- I’ve also encountered a core scikit-image API that could be 4-8× faster depending on input types, with 1/20th the memory usage.
- A Numba implementation of a specific subset of invocations of NumPy’s argmax can be 10× faster (a sketch of this kind of specialized kernel appears after this list).
- The default compiler optimization setting for Python extensions on Linux is -O2, which means no auto-vectorization happens. Even with -O3, extra work is needed to get Cython code to auto-vectorize, and with plain C I suspect it’d be even harder. The simple example I implemented got a 2× speedup: Speeding up Cython with SIMD. (A sketch of passing the relevant flags in a build script appears after this list.)
- Documentation of many performance topics like the above is lacking.
- On Linux, the default architecture target for Python on x86-64 is the instruction set as of 2003: twenty years of additional CPU instructions, including SIMD extensions, are basically ignored. Red Hat Enterprise Linux 9, and possibly the latest Ubuntu, now target x86-64-v2, which is roughly the status quo as of 2008. I’m pretty sure it’s possible to do extension-level dispatch, rather than the more usual function-level dispatch, which would be one way to take advantage of newer CPUs while keeping backwards compatibility, at the cost of larger package downloads (see the sketch after this list).
- NumPy does do function-level SIMD dispatch, so it can take advantage of newer CPUs in some contexts, but my impression is that most other major libraries do not.
- Rust has some tricks that allow bounds checking to stay enabled while its performance impact is optimized away; I’ve managed to achieve the same with Numba, but failed to do so in Cython (possibly a gcc vs. clang difference, since Rust and Numba use LLVM like clang does, but it might also be down to the C code Cython generates). A sketch of the Numba pattern appears after this list.
- Many libraries do not have automatic benchmarking on PRs, which means performance regressions won’t be caught (an example of what such a benchmark might look like appears after this list).
- Measuring performance can be difficult in some contexts; for example, since no profiler for Numba existed, I ended up writing one (GitHub - pythonspeed/profila: A profiler for Numba).
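To make a few of these points concrete, here are some sketches. First, the kind of specialized kernel I mean for argmax: a minimal Numba version that assumes a contiguous 1-D float64 array and ignores NaN handling (which np.argmax does deal with). This is an illustration of the approach, not the exact code behind the 10× figure.

```python
# A minimal specialized argmax: contiguous 1-D float64 input, axis=None,
# and no NaN handling (np.argmax handles NaNs; this sketch does not).
import numpy as np
from numba import njit


@njit(cache=True)
def argmax_1d(arr):
    # Plain loop; Numba compiles it to machine code specialized for the
    # dtype and layout it first sees.
    best_idx = 0
    best_val = arr[0]
    for i in range(1, arr.shape[0]):
        if arr[i] > best_val:
            best_val = arr[i]
            best_idx = i
    return best_idx


if __name__ == "__main__":
    data = np.random.rand(10_000_000)
    argmax_1d(data)  # warm-up call so compilation isn't part of any timing
    assert argmax_1d(data) == np.argmax(data)
```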
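Next, the compiler flags: a hypothetical setup.py for a Cython extension that opts into -O3 and a newer instruction baseline. The module name is a placeholder, and the flags are GCC/Clang-specific (-march=x86-64-v2 needs a fairly recent compiler), so a real build script would add per-platform and per-compiler handling.

```python
# Hypothetical setup.py for a single Cython extension, overriding the
# default -O2 and the 2003-era x86-64 baseline. "fastmod.pyx" is a
# placeholder name.
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "fastmod",
        ["fastmod.pyx"],
        extra_compile_args=[
            "-O3",               # turn on the auto-vectorization passes
            "-march=x86-64-v2",  # assume a circa-2008 instruction baseline
        ],
    )
]

setup(
    name="fastmod",
    ext_modules=cythonize(extensions, compiler_directives={"language_level": "3"}),
)
```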
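The extension-level dispatch idea could look roughly like this in a package’s __init__.py: ship two builds of the same compiled extension and pick one at import time. This is only a sketch; the commented-out module names and the use of the third-party py-cpuinfo package are assumptions, and a real implementation would check the full x86-64-v2 feature set.

```python
# Hypothetical sketch of extension-level dispatch: choose between a
# baseline x86-64 build and an x86-64-v2 build of the same extension
# at import time.
import cpuinfo  # pip install py-cpuinfo

# A couple of the features added in x86-64-v2 (the full set also includes
# SSE3/SSSE3/SSE4.1, CMPXCHG16B, and LAHF-SAHF).
_V2_FLAGS = {"sse4_2", "popcnt"}


def cpu_supports_v2() -> bool:
    flags = set(cpuinfo.get_cpu_info().get("flags", []))
    return _V2_FLAGS <= flags


if cpu_supports_v2():
    # from ._impl_x86_64_v2 import *   # wheel built with -march=x86-64-v2
    print("would import the x86-64-v2 build")
else:
    # from ._impl_baseline import *    # wheel built with the 2003 baseline
    print("would import the baseline build")
```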
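For bounds checking, the general pattern (again a sketch, not my exact code) is to turn Numba’s bounds checking on and give the optimizer an explicit length check it can use to prove the per-element checks redundant. Whether the checks are actually eliminated has to be confirmed by inspecting the generated code, e.g. with the dispatcher’s .inspect_asm().

```python
# Bounds checking is off by default in Numba; enable it, then give LLVM
# enough information that it can hoist or drop the per-element checks.
import numpy as np
from numba import njit


@njit(boundscheck=True)
def scaled_sum(values, weights):
    n = values.shape[0]
    if weights.shape[0] != n:
        raise ValueError("length mismatch")
    total = 0.0
    for i in range(n):
        # Both accesses are in range given the check above, which is what
        # gives the optimizer a chance to remove the bounds checks.
        total += values[i] * weights[i]
    return total


if __name__ == "__main__":
    v = np.random.rand(1_000_000)
    w = np.random.rand(1_000_000)
    print(scaled_sum(v, w))
```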
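Finally, on benchmarking in CI: asv (airspeed velocity) is what several core projects already use, and a benchmark file is just a small Python class. Here’s a hypothetical example, with np.argmax standing in for whatever API a project actually cares about.

```python
# Hypothetical asv benchmark file (e.g. benchmarks/bench_argmax.py in an
# asv-configured repo); asv discovers methods named time_*.
import numpy as np


class ArgmaxSuite:
    """Time np.argmax across a couple of representative array sizes."""

    params = [10_000, 10_000_000]
    param_names = ["size"]

    def setup(self, size):
        self.data = np.random.default_rng(42).random(size)

    def time_argmax(self, size):
        np.argmax(self.data)
```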
These are all just examples of things I’ve encountered, and again, this is not a complaint: I have my own share of personal open source projects where many of the same issues apply.
In short, with the right resources, I believe a large number of performance improvements are achievable: by manually optimizing code, by upgrading core defaults, potentially by creating new shared infrastructure designed to enable performance, and by improving documentation and community knowledge.
How would this be structured?
The Faster CPython project is, I believe, a team of engineers funded by Microsoft. Structurally, they’re working on a single code base that needs to be optimized.
In contrast, what I’m describing above involves speedups across a broad range of projects, and isn’t just about writing software. So structurally it might be better implemented some other way.
As I understand it, this community is intended for exactly this sort of cross-project collaboration, so I’d love to hear people’s thoughts on the desirability of such a project and how it might be implemented.