I have been working on a computational project that involves heavy use of NumPy for numerical operations. While it’s incredibly efficient, I am curious about ways to further optimize performance, especially when working with large datasets.
What are some best practices for leveraging NumPy’s built-in functions to minimize looping in Python?
Are there specific strategies for optimizing memory usage when dealing with large arrays?
Does using tools like Numba or Cython significantly enhance performance, and how do they integrate with NumPy?
How do you approach debugging performance bottlenecks in scientific Python projects?
It’s not very helpful to start looking into details in the abstract before you have a working prototype. There are too many techniques for too many use cases and too many specific problems, and getting bogged down in details prematurely is a great time sink.
From experience optimizing Python-based and other scientific workflows, I’d recommend roughly this:
Have a prototype, however rough. Don’t worry about performance just yet. If something is blatantly obvious (like exponential vs linear complexity), sure, use that knowledge; otherwise, don’t bother yet.
Make sure it works as expected. Collect a set of validation examples: asymptotics, limiting cases, known values, expected results; something you can check against to convince yourself that your prototype is not entirely incorrect. Depending on details, you may want to roughly separate these into two buckets: small, quick-to-run examples where you check against accurate results, and longer, heavier runs which need to be looked at by a human eye. (An example from one of my past workflows: when building a quantum MC simulation, checking against an exactly solvable small system is the first kind; checking that the error scales roughly as 1/\sqrt{N}, with N the number of steps, is the second.)
Make this collection into something semi-automated if you can. Don’t get bogged down in the fine details of acceptance testing vs unit testing vs whatnot: if you can reasonably make your acceptance suite run with a single command, great; if it’s a collection of scripts you run manually, that’s also OK. You’ll refine the framework as you go.
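For illustration, here is a minimal sketch of what such a semi-automated suite could look like for the quantum MC example above, written as pytest-style checks. Everything here (the mysim module, run_mc_simulation, the system name, the exact value) is a made-up placeholder for your own code:

```python
# Hypothetical entry point of the prototype; the module, function
# name, and signature are placeholders for whatever your code exposes.
from mysim import run_mc_simulation

EXACT = -1.0  # placeholder: the analytically known value for a small system


def test_exactly_solvable_system():
    # Bucket 1: quick check against an accurate known result.
    estimate = run_mc_simulation(system="two_site", n_steps=10_000)
    assert abs(estimate - EXACT) < 1e-2


def test_error_scaling():
    # Bucket 2: the error should shrink roughly like 1/sqrt(n_steps),
    # i.e. quadrupling the steps should roughly halve the error.
    errors = [abs(run_mc_simulation(system="two_site", n_steps=n) - EXACT)
              for n in (10_000, 40_000, 160_000)]
    assert errors[1] < 0.7 * errors[0]  # some slack for statistical noise
    assert errors[2] < 0.7 * errors[1]
```

With something like this in place, a single pytest invocation gives you the one-command acceptance run.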
Having constructed this set of examples, you have a rough idea of what’s bad in your prototype. Now it’s time to turn that rough idea into data: start profiling. This is key. You need data.
If your workflow involves disk, network, large memory, or databases: does it spend its time in I/O or in number crunching? If the latter, does everything fit into memory, or do you start swapping?
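Before reaching for a full profiler, even a crude phase timing answers the I/O-vs-compute question; in this sketch, load_inputs and crunch_numbers are stand-ins for your own phases:

```python
import time

t0 = time.perf_counter()
data = load_inputs("inputs.h5")    # stand-in for your I/O phase
t1 = time.perf_counter()
result = crunch_numbers(data)      # stand-in for your compute phase
t2 = time.perf_counter()

print(f"I/O: {t1 - t0:.1f} s, compute: {t2 - t1:.1f} s")
```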
At this stage things start depending on details, but the general idea is the same: identify a bottleneck, work on it, and ignore the rest.
At any rate, you need to profile. If your application is in Python, the standard library’s cProfile module is a good starting point; once you know where the bottleneck is at the function level, it’s sometimes useful to throw in line_profiler, which will point you to specific lines of code. Is there a part which dominates the profile? Great: eliminate that bottleneck. Maybe it’s better NumPy vectorization; if that doesn’t work, maybe you need a compiled extension; if you’re running out of memory, you’ll need to think about parallelizing. But the main point stands: only work on a bottleneck the profiler actually shows you.
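As a concrete starting point, cProfile can be driven from the command line (python -m cProfile -s cumulative your_script.py) or programmatically; in this sketch, main is a placeholder for your entry point:

```python
import cProfile
import pstats

from myproject import main  # placeholder for your entry point

# Record a function-level profile, then print the 20 entries with the
# largest cumulative time.
cProfile.run("main()", "profile.stats")
pstats.Stats("profile.stats").sort_stats("cumulative").print_stats(20)
```

line_profiler works similarly: decorate the suspect function with @profile and run the script under kernprof -l -v.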
Once you’ve eliminated the bottleneck (again, the profiler will tell you), rerun the acceptance tests. Once your rewrite is correct, ask: is the current state acceptable? If yes, you’re done; just stop optimizing. If not, go back to profiling and attack the next bottleneck.
I know what I’m saying is kind of vague. That’s because the specific details of what to optimize and how to optimize it are very, very, very problem-specific, and there’s no point dwelling on solutions to non-problems or to somebody else’s problems.
Once you’re down to a specific bottleneck, we might be able to offer more focused suggestions.
To summarize: have a prototype, have acceptance tests, use a profiler to identify bottlenecks, and iterate until the result is acceptable.
Oh, and do use some form of version control to keep track of iterations.
What are some best practices for leveraging NumPy’s built-in functions to minimize looping in Python?
I’d suggest the following: insist that embarrassingly parallel operations can be written without loops.
There are exceptions in both directions, and you may find cases where the hoops you need to jump through to write code without loops are not worth it (either due to performance or code complexity). But I think a good starting point is to assume that vectorization is possible, and to work until you have a solution to test, rather than jumping to the conclusion that it’s not worth the trouble.
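As a toy example of the kind of rewrite this means in NumPy, here is a pairwise squared-distance computation written with explicit loops and then with broadcasting; the loop body is embarrassingly parallel across (i, j) pairs, so it vectorizes cleanly:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((200, 3))


def pairwise_sq_dists_loop(p):
    # Explicit Python loops: slow, but a straightforward reference version.
    n = len(p)
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum((p[i] - p[j]) ** 2)
    return out


def pairwise_sq_dists_vec(p):
    # Broadcasting pushes both loops into compiled code.
    diff = p[:, None, :] - p[None, :, :]  # shape (n, n, 3)
    return (diff ** 2).sum(axis=-1)


assert np.allclose(pairwise_sq_dists_loop(points), pairwise_sq_dists_vec(points))
```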
Avoiding allocations can also be important for performance. One technique is to pre-allocate the memory and pass it to the function via the out parameter.
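Concretely, NumPy ufuncs accept an out argument, so a buffer allocated once can be reused across a hot loop; a small sketch:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.full(1_000_000, 2.0)
buf = np.empty_like(a)  # allocated once, outside the hot loop

for _ in range(100):
    np.multiply(a, b, out=buf)  # result written into buf, no fresh allocation
    np.add(buf, 1.0, out=buf)   # later steps can update buf in place
```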
Back in the day, it also used to be important to consider the memory layout of your (multi-dimensional) arrays, as routines often assumed C order (IIRC). These days, functions default to K order, which tries to adapt to the memory layout of the inputs.
Where non-contiguous memory is unavoidable, there may still be some performance benefit to copying the data to ensure it is contiguous before operating on it.
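For example, a column of a C-ordered 2-D array is a strided, non-contiguous view; np.ascontiguousarray makes an explicit contiguous copy, which can pay off if the data is traversed many times (measure rather than assume):

```python
import numpy as np

m = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

col = m[:, 0]                       # strided view: elements 8000 bytes apart
print(col.flags["C_CONTIGUOUS"])    # False

col_c = np.ascontiguousarray(col)   # explicit contiguous copy
print(col_c.flags["C_CONTIGUOUS"])  # True

# Repeated reductions over col_c touch memory sequentially, which CPU
# caches and NumPy's inner loops generally prefer.
total = col_c.sum()
```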