Towards a Faster Python project for scientific Python

The Faster CPython project is working on making CPython 5× faster. This is great, but it doesn’t help scientific Python as much as it does other use cases, given that scientific Python mostly runs compiled code, bypassing CPython altogether.

I believe it is possible to achieve the same for the scientific Python ecosystem, making code 2-5× more efficient. In particular, I’m talking about single-threaded code with the same APIs running faster. I’m not talking about parallelism, nor am I suggesting massive re-architectures like the difference between Pandas’ eager execution and Polars’ lazy execution.

Some examples

Just as a starting point, the massive amount of volunteer work that goes into these projects is amazing. But given limited time, some things don’t happen, and I have observed significant potential for speedups. What follows is a random grab bag of examples.

  • Some scikit-image/SciPy image processing APIs are much slower than OpenCV equivalents. I’ve encountered APIs where there’s a 10× opportunity for speedup.
  • I’ve also encountered a core scikit-image API that could be 4-8× faster depending on input types, with 1/20th the memory usage.
  • A Numba implementation of (a specific invocation subset of) NumPy’s argmax can be 10× faster (see the sketch after this list).
  • The default compiler optimization level for Python extensions on Linux is -O2, which means no auto-vectorization happens. Even with -O3, extra work is needed to get auto-vectorization with Cython, and with plain C I suspect it’d be even more difficult. The simple example I implemented got a 2× speedup: Speeding up Cython with SIMD
  • Documentation of many performance topics like the above is lacking.
  • On Linux, the default architecture target on x86-64 for Python is the instruction set as of 2003. 20 years of additional CPU instructions, including SIMD, are basically ignored. Red Hat Enterprise Linux 9 and possibly the latest Ubuntu are now targeting x86-64-v2, which is the status quo as of 2008. I’m pretty sure it’s possible to do extension-level dispatch, rather than the more normal function-level dispatch, which would be one way to take advantage of newer CPUs while still having backwards compatibility, at the cost of larger package downloads.
  • NumPy does have runtime-dispatched SIMD implementations for some functions, so it can take advantage of newer CPUs in some contexts, but my impression is that most other major libraries do not.
  • Rust has some tricks that allow enabling bounds checking while optimizing away its performance impact; I’ve managed to achieve the same with Numba. I’ve failed to do so with Cython (it might be gcc vs. clang, since Rust and Numba use LLVM like clang, but it might also be the C code Cython generates).
  • Many libraries do not have automatic benchmarking on PRs, which means performance regressions won’t be caught.
  • Measuring performance can be difficult in some contexts; I ended up writing a profiler for Numba, for example (GitHub - pythonspeed/profila: A profiler for Numba) since none existed.
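
To make the argmax bullet above concrete, here is a minimal Numba sketch of the kind of specialization I mean. The function name is mine, it assumes a non-empty 1-D contiguous array, and whether it actually beats np.argmax depends on dtype, array size, and hardware:

```python
# Hypothetical sketch only: a Numba-compiled argmax specialized for non-empty
# 1-D contiguous arrays. Skipping NumPy's generality (axes, NaN-related
# ordering rules, arbitrary strides) is where the speedup can come from.
import numpy as np
from numba import njit

@njit(cache=True)
def argmax_1d(arr):
    best_idx = 0
    best_val = arr[0]
    for i in range(1, arr.shape[0]):
        if arr[i] > best_val:
            best_val = arr[i]
            best_idx = i
    return best_idx

x = np.random.default_rng(0).standard_normal(1_000_000)
assert argmax_1d(x) == np.argmax(x)
```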

These are all just examples of things I’ve encountered, and again, this is not a complaint; I have my own share of personal open source projects where many of the same issues apply.

In short, with the right resources, I believe a large number of performance improvements are achievable, by manually optimizing code, by upgrading core defaults, potentially by creating new shared infrastructure designed to enable performance, and by improving documentation and community knowledge.

How would this be structured?

The Faster CPython project is, I believe, a team of engineers funded by Microsoft. Structurally, they’re working on a single code base that needs to be optimized.

In contrast, what I’m describing above involves speedups across a broad range of projects, and isn’t just about writing software. So structurally it might be better implemented some other way.

As I understand it this community is intended for this sort of cross-project collaboration, so I’d love to hear people’s thoughts on the desirability of such a project, and how it might be implemented.

6 Likes

Putting on my skimage hat, we’d certainly love to be more performant. The question is always: at what cost (to the mostly volunteer development team).

We tend to rely on more foundational libraries in the ecosystem (NumPy, Cython, Pythran) to implement optimizations. We are also quite interested in wholesale dispatching of some sort, so that GPU-optimized implementations of skimage can be called using the skimage API.

The more generic enhancements you propose, like building for more modern chipsets, or improving NumPy, all sound feasible and worth pursuing.

3 Likes

Definitely understand the limits of what any group of volunteers can do, which is why this seems like a place where trying to get outside resources might be helpful.

For scikit-image specifically, I wonder if it can make use of scipy.ndimage delegating to cupy under the hood, as in ENH: ndimage: add array API standard support by ev-br · Pull Request #21150 · scipy/scipy · GitHub , and what the pain points are.
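
For illustration, a heavily hedged sketch of what that could look like from the user side, assuming the array API work in that PR is available and enabled (SciPy’s experimental array API support is currently gated behind the SCIPY_ARRAY_API=1 environment variable):

```python
# Hedged sketch only - assumes scipy.ndimage's array API / delegation support
# (as in the linked PR) is available and enabled via SCIPY_ARRAY_API=1.
# The hope is that a CuPy array passed to the existing API stays on the GPU.
import cupy as cp
from scipy import ndimage

img = cp.random.random((2048, 2048))
smoothed = ndimage.gaussian_filter(img, sigma=3)  # ideally runs on the GPU
```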

Thanks for bringing this up @itamarst! Agreed that there are lots of opportunities for improvement.

Desirability: high. I’d say that it’s one of the things users care about the most.

How it might be implemented: that’s the key question. It requires expertise that is hard to find, so currently things move slowly. A central project/team/repo which would focus on knowledge sharing, documenting relevant activities already happening and how to plug in with those, and identifying new opportunities and what they could bring would be a great start. Then there’s the question of finding funding - which in itself is a lot of work. I think the main thing to focus on there is larger structural items (e.g., making SIMD usage much easier), rather than individual algorithms.

Correct - it’s too hard currently, infrastructure like runtime dispatching and/or better packaging support are lacking.

This is no longer true. It’s the setuptools default only. Many projects for which performance really matters have moved away now to either Meson or CMake. And if one is serious about using compiled code, that’s the first thing to do.

The fused types trick there is unfortunately not great for widely used libraries, because it will inflate binary size by a lot - and the impact of Cython generating too large binaries is already way too high.

That is also a problem for SIMD usage in general, to a lesser extent, which is why runtime dispatch levels need to be carefully chosen.

True. This is a matter of CI maintenance overhead, which until recently was very high. I’m hoping https://codspeed.io/ will be helpful here - first experiences with it are quite positive, results seem accurate/stable with hosted CI.
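
As a sketch of how low the barrier can be, here is what a PR-level benchmark might look like with the pytest-benchmark-style `benchmark` fixture, which (as I understand it) CodSpeed’s pytest plugin also supports; the function being measured is just a stand-in:

```python
# Minimal sketch: a benchmark that CI (e.g. CodSpeed's pytest plugin or
# pytest-benchmark) can run on every PR. `erosion_like` is a stand-in for
# whatever library call you want to guard against regressions.
import numpy as np

def erosion_like(img):
    return np.minimum(img[:-1, :-1], img[1:, 1:])

def test_erosion_benchmark(benchmark):
    rng = np.random.default_rng(42)
    img = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)
    benchmark(erosion_like, img)
```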

A first step could be to set up a repository to collect user stories, pain points and (however rough) ideas; also to collect individual incremental wins within various projects/libraries across the ecosystem.
Such a rough list could be then turned into a centralized resource to scope and turn bits and pieces into specific plans and potentially fundable proposals.

I would like to push back on the notion that GPUs are The Solution to performance, or the related notion that optimizations to specific algorithms aren’t worth doing.

Speed has two distinct benefits: faster results, and reduced costs.

  • Parallelism—whether via multiple cores or via GPU usage—only gives you faster results, it does not reduce costs except in very limited situations.
  • Optimization—making code more efficient—gives you both faster results and reduced costs.

If the only answer to “how do we go faster” is parallelism, essentially you’re pushing the cost of faster results on to users. For companies that can mean much larger bills; for researchers, that means using up even more limited budgets, or doing fewer experiments. And this is extra bad for organizations working with smaller budgets in lower-income countries, who will have a harder time accessing expensive hardware or cloud services. On a broader level, it also means higher carbon emissions.

On the other hand, optimizing specific algorithms has massive leverage in the context of open source projects. If you make core, commonly used algorithms in e.g. scikit-image faster (and you can; from median filters to contrast adjustment there are huge opportunities to speed things up), this massively benefits many people and many organizations. And it does so without them having to spend extra money on GPUs.

I spent a few hours optimizing e.g. a 2D uint8 local median threshold and got a 15× speedup over scikit-image (which in practice may mean SciPy, I didn’t check the underlying implementation). That’s a 93% reduction in compute costs and some approximation of that in electricity usage.

And this isn’t hard: it’s 50 lines of code implementing an algorithm that I later discovered has been known since 1979 (https://www.uio.no/studier/emner/matnat/ifi/INF2310/v12/undervisningsmateriale/artikler/Huang-etal-median.pdf), half of which I was able to deduce from first principles after learning the first half in 5 minutes of googling. A sketch of the core idea follows below.
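
To give a flavor of that core idea (not my actual implementation, and shown in 1-D rather than 2-D for brevity): the running-histogram trick from the Huang et al. paper updates a 256-bin histogram incrementally as the window slides, instead of re-sorting every window.

```python
# Illustrative sketch of the Huang et al. (1979) running-histogram median,
# simplified to 1-D uint8 data and "valid" windows only. The 2-D version
# updates the histogram column-by-column instead of element-by-element.
import numpy as np

def sliding_median_uint8(values, window):
    assert values.dtype == np.uint8 and window % 2 == 1 and len(values) >= window
    n = len(values)
    out = np.empty(n - window + 1, dtype=np.uint8)

    hist = np.zeros(256, dtype=np.int64)
    for v in values[:window]:
        hist[v] += 1

    def median_from_hist():
        # The median is the first bin where the cumulative count reaches
        # half the window size.
        target = (window + 1) // 2
        count = 0
        for b in range(256):
            count += hist[b]
            if count >= target:
                return b

    out[0] = median_from_hist()
    for i in range(1, n - window + 1):
        hist[values[i - 1]] -= 1           # element leaving the window
        hist[values[i + window - 1]] += 1  # element entering the window
        out[i] = median_from_hist()
    return out
```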

And to be clear:

  • There’s nothing bad about parallelism or GPUs; they are very much worth enabling and can be extremely valuable. They just shouldn’t be the only focus (and even for those who do benefit from GPUs, optimization is also helpful).
  • I realize scikit-image/SciPy/etc maintainers don’t have time for this right now, which is why it doesn’t happen. I would like to figure out ways to enable this, by figuring out if there’s any way to get outside resources and have the necessary project infrastructure.

Codspeed is very useful, I used it to great effect on an open source pure-Python project and it both helped demonstrate speedups and prevented performance regressions.

Unfortunately I don’t think it’s a good tool for scientific computing. In particular, it works by measuring CPU instruction counts. Because of instruction-level parallelism in CPUs, you can often tweak low-level compiled code in very minor ways, and get your code to run e.g. 2× faster by enabling the CPU to run more instructions in parallel (on the same core, this isn’t about threading for those not familiar).

In this situation, a user’s experience is that the code is 2× faster, but codspeed will say the code is exactly the same speed because the instruction count hasn’t changed. Or it might even think the code is slower if you had to use more instructions to get that parallelism.

For pure Python code, changing your code won’t really impact instruction-level parallelism, SIMD usage, branch prediction and so on (at least until the JIT is being used in anger) so codspeed’s metric is fairly accurate. For compiled code, it’s going to be very misleading in many cases.

At this early stage, I’d suggest to brainstorm: collect pain points in the form of user stories and postpone debating the best ways of addressing each of them case by case. I’m sure there will be room for all of parallelization on CPU, using GPUs, algorithmic optimizations, purely software optimizations, you name it.

I like this idea a lot. The Faster CPython team has had pretty good success with the Faster CPython ideas repo, which we just use as an issue tracker to discuss specific issues. The idea is that once things get into “solutions”, we move over to the CPython tracker (though I wouldn’t say that rule is always perfectly followed). But it takes away some of the burden of formality of the main bug tracker and allows more brainstorming-like sessions - and it works really well asynchronously.

So, just a suggestion that maybe an issue tracker to coordinate this space might be helpful. I also invite anyone to post to (or link from) the Faster CPython ideas repo for anything where changes to CPython itself could be helpful. There have been a few things like that in there already (for example), but it would definitely help to have more.

2 Likes

Oh don’t get me wrong, I completely agree with you that optimizing specific algorithms is worth doing. “this PR makes code run faster without changing behavior or API” is pretty much always welcome.

What I had in mind was that this sort of work is much less in need of a cross-project coordination effort. There’s less of a maintenance/expertise bottleneck there, and pretty much every project will be happy with pull requests that add a performance optimization (plus a relevant benchmark in most cases). There are established ways of reviewing those, and that’s often not even that much effort (certainly less than new functionality on average). And there’s even more funding for that type of performance optimization work - it has been included in lots of CZI EOSS grants by scientific Python projects, for example.

This contrasts with new better practices and shared code infrastructure for things like SIMD accelerations or build optimizations, and with larger improvements in NumPy/Cython. Those are harder to move forward due to technical complexity and maintainer bottlenecks, and are harder to fund.

1 Like

Ah, thank you for the clarification, that totally makes sense.

So it sounds like a good next step might be, as @ev-br suggested, creating a repo to track ideas?

I am active on SciPy, though due to time constraints not so much with algorithmic contributions anymore; mostly I’m cleaning up historical warts by converting old code, etc. I think your post is very nicely articulated and makes a good case that summarizes the essentials.

I also share your sentiment about pushing things to the GPU. However, with SIMD you are implicitly making the same “parallelism” argument, just inside the CPU, and obviously much more efficiently this time. More cores can also optimize the total net energy usage (not power!) through load balancing, without unnecessarily boosting a smaller number of cores. That’s a different domain of expertise that I can’t comment on beyond this. That said, arguments based on power etc. can always be manipulated with made-up numbers by both sides, so it is very difficult to make a concrete case. Anyway, my point is that the statement “more cores = more waste” is not always true.

I would like to take this slightly further back in the chain: current scientific software is essentially stuck at interfacing with BLAS and LAPACK calls. My opinion is that this is a bit of learned helplessness, and we just gave up.

But when your BLAS call gets faster, almost everything else gets faster with almost no effort. I don’t mean to deny that there are individual corners of algorithms that would benefit from significant speedups, but BLAS/LAPACK improvements raise the baseline performance of everything built on top of them, in every scientific stack of every language - no exception.

However, this work is left to a very small number of exceptionally generous people, or alternatively to the proprietary MKL and Apple teams. I find this baffling, but that’s me; probably there are others I do not know about. But first we need to separate BLAS from LAPACK. I think BLAS is in relatively better shape via OpenBLAS, BLIS and so on, and more Rust initiatives are happening. LAPACK, on the other hand, is not.

We must (full emphasis on must) port LAPACK out of F77 if we, as the scientific community (of all languages), want fast computing. The LAPACK codebase is treated as scripture, and for good reason: the folks who wrote the algorithms have immense insight and expertise. But that does not mean they wrote the most optimal code, and F77 does not lend itself to anything modern hardware offers; you get only whatever loop optimization gfortran provides. While the algorithms are rock solid, the authors had no option other than writing for loops as if there were no tomorrow. I know this because I spent a good chunk of last year reading similar code for META: FORTRAN Code inventory · Issue #18566 · scipy/scipy · GitHub. LAPACK routines are often older than SciPy’s *PACK libraries.

I am confident that if there is a critical mass, folks will come out of the woodwork and start this initiative. But I don’t know how to kick it off just yet.

Long story short, while we can really make great progress with individual algorithm improvements, there is an even greater potential for progress sitting silently next to all of us, in the corner of our eyes.

1 Like

I think creating a repository akin to the “Faster CPython ideas” one would be great.

It is unclear to me which of the different options to make code run faster is best. Probably an “ensemble” approach is the right thing to do: push on all the fronts that we have.

Like Ralf said, no one will say no to a speedup that comes without changed behaviour and without making the code significantly more difficult to maintain. The fact that maintainers aren’t inundated with such PRs suggests to me that there isn’t that much low-hanging fruit left - a result, I suspect, of us having played this game for a while already. It doesn’t mean we should stop doing it, though.

I think making use of specialised hardware like a GPU makes sense. For operations that they are good at, they can be significantly more energy efficient than a CPU.

We will also see more “accelerators” (of different kinds) in everyday computers. I use “accelerator” as a placeholder for “things that aren’t your standard CPU core from 10 years ago”. For example, modern MacBooks already contain several kinds of cores (performance, efficiency, GPU, neural), and it seems like a good idea to be able to take advantage of them. This can be done via a BLAS that knows how to use them, via the MPS device in PyTorch, or some other way.
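
For example (a hedged sketch, not a recommendation of any particular approach): with PyTorch’s MPS backend, a matrix multiply can run on an Apple-silicon GPU with a one-line device change, falling back to the CPU when MPS isn’t available.

```python
# Hedged illustration of "some other way": route a matmul to Apple's GPU via
# PyTorch's MPS backend when available, otherwise fall back to the CPU.
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
c = a @ b  # executed on the MPS "accelerator" when available
```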

Because we have only started scratching the surface of this, there is hopefully lots of low-hanging fruit left. But it also means learning new tools and ideas, which takes time. And probably there will be false starts as well.

2 Likes

Without wanting to derail this discussion, is this optimization public or would you be willing to publish? I’d be happy to explore upstreaming this in an issue on scikit-image. :slight_smile:

1 Like

My personal experience is that there is still low-hanging fruit, as well as less-low-hanging fruit: places that could still be vastly faster.

Part of it, I think, is that Python culture holds that you get speed from compiled code, so if something is already compiled, many people assume there’s nothing more to do. The other part is that it’s often easier to keep a local version of the function, or to e.g. switch to OpenCV for a couple of functions to get the 10× speedup.

1 Like

@lagru or @stefanv or someone else with access, could you create a new repo under the GitHub org, e.g. scientific-python/ideas-for-speed or something? Thank you!

2 Likes

Thank you! I know @ilayn at least had an idea that should be added there? And I’ll start dropping writeups in as I have time.

2 Likes

It is more of a grandiose plan for now :sweat_smile: I don’t quite know where the weakest point of this wall is.

But quick wins also exist: say, certain bool-valued functions in NumPy quitting early as soon as a search condition is satisfied, instead of running all the way to the end of the array. These operations are extremely common in all codebases, and I think that is more suitable for the case you are making in your write-up. They stop wasting resources the moment they have the answer and save, truly, lots of flops.
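
A hedged sketch of the pattern (the function name is mine, and this is just to illustrate the early exit, not a proposed NumPy API): today np.any(arr > threshold) builds a full boolean temporary and then scans it, whereas a fused loop can stop at the first hit.

```python
# Sketch of the early-exit idea for a bool-valued search. np.any(arr > t)
# materializes a full boolean array first; this loop stops at the first match.
import numpy as np
from numba import njit

@njit(cache=True)
def any_greater_1d(arr, threshold):
    for i in range(arr.shape[0]):
        if arr[i] > threshold:
            return True  # stop as soon as the condition is satisfied
    return False
```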

I’ve put more details in APIs for discovering array structure as pertains to linear algebra operations · data-apis/array-api · Discussion #783 · GitHub