Can I volunteer to optimize scikit-image?

Hi all,

I’m Saurabh, based in San Francisco, with experience building performance tools at Nvidia and ML models at Meta.

For the past two years, I’ve been working on Codeflash—an automated performance optimizer for Python code. It generates and verifies high-quality code optimizations, ensuring both performance improvements and correctness.

I’d love to help make skimage faster by running Codeflash on the code base and submitting the best optimizations for your review. We’ve had great success with image processing code, as shown by our merged PRs in libraries like Albumentations, Kornia, and Roboflow. In fact, Codeflash is now optimizing all new code for Roboflow and Albumentations in their CI.

I’d be happy to volunteer my time to optimize skimage, including the benchmarks in the benchmarks directory. Please let me know if you have any specific areas you’d like me to focus on.

Can I attempt to optimize scikit-image? Please let me know what I should keep in mind to make this endeavor successful. I would also like to ask questions here if I run into any trouble.

Hi @misrasaurabh1, welcome to the forum, and thank you for your proposal.

We are always interested in improving scikit-image, including making performance improvements. However, we have so far steered clear of AI tools, because of the dubious licensing (and attribution) of AI-generated code.

My inclination would be to say: if you use AI to identify categories of improvements we can make (i.e., use it as a linter of sorts), that’s fine. But the code to improve on those categories, we would want to write ourselves.

If it feels like the temptation would be to submit the AI-generated patches, once you’ve run the tool, then I would encourage you rather not to do it in the first place.

Best regards,
Stéfan


Hi @stefanv and scikit-image team.

I traced all the code that the benchmarks call and looked into optimizing it; this way, I prioritized the code where performance matters most. I did this for both networkx and scikit-image, and I found a lot of solid optimizations on my first try.
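For context, finding these hot spots doesn't require anything exotic; here is a minimal sketch using plain cProfile (this is generic profiling, not Codeflash internals, and skimage.measure.regionprops_table is just an example entry point):

```python
import cProfile
import pstats

import numpy as np
from skimage import measure

# Build a small labeled test image as a stand-in benchmark workload.
rng = np.random.default_rng(0)
labels = measure.label(rng.random((512, 512)) > 0.5)

# Profile the workload and print the ten most expensive functions
# by cumulative time; these are the optimization candidates.
profiler = cProfile.Profile()
profiler.enable()
measure.regionprops_table(labels, properties=("area", "centroid"))
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```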

I opened this Pull Request for your review; it delivers a solid 48% speedup when calculating moments of 2- and 3-dimensional arrays. The changes look good to me.
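For anyone who wants to sanity-check a number like that, a rough timing sketch is below; I am using skimage.measure.moments as a stand-in here, since the exact functions touched are listed in the PR itself:

```python
import timeit

import numpy as np
from skimage import measure

rng = np.random.default_rng(0)

# Time the raw moments computation on 2D and 3D inputs; run this
# before and after the patch and compare the per-call times.
for shape in ((256, 256), (64, 64, 64)):
    img = rng.random(shape)
    t = timeit.timeit(lambda: measure.moments(img, order=3), number=50)
    print(f"{img.ndim}D {shape}: {t / 50 * 1e3:.2f} ms per call")
```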

I understand that you might not be open to AI-generated optimizations; let me know how you would like to incorporate the changes, since they look quite good to me. I am willing to adapt them to your needs. Once I understand your requirements for merging code, I can open many more optimizations that I’ve created for both networkx and scikit-image, which should speed them up substantially. I want to ensure I meet your standards and requirements, so just let me know. Regarding licensing and attribution: the code belongs to you, and we claim nothing on it. I am happy to sign a contributor license agreement if that would help.

I am excited about the optimizations we’ve found, and I hope that through a successful collaboration we can speed up these important Python libraries significantly!

I don’t think that addresses the concerns about licensing.

The concern is that your model’s output is a derived work of the materials it was trained on. (If your model was trained solely on public-domain material, please correct me.) This is material you presumably don’t hold copyright on, and therefore can’t grant us rights to. If we accepted this, we would therefore need to comply with the most restrictive license in your dataset.

(Note that I am not stating that I agree with this position; just that it is an open legal question, and reasonable for scikit-image maintainers to worry about.)


It’s a tricky issue, because there may be some cases where the LLM-generated code does not have licensing implications. I think your PR is likely such a case, since it mainly avoids repeated array lookups by assigning to temporaries.
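To make that concrete, here is a toy example of the kind of change I mean, hoisting a repeated array lookup into a local temporary (an illustration, not the actual code from the PR):

```python
import numpy as np

def center_of_mass_slow(m):
    # Indexes m[0, 0] twice inside the expression.
    return (m[1, 0] / m[0, 0], m[0, 1] / m[0, 0])

def center_of_mass_fast(m):
    # Hoist the repeated m[0, 0] lookup into a temporary:
    # the same result with fewer redundant array accesses.
    m00 = m[0, 0]
    return (m[1, 0] / m00, m[0, 1] / m00)

moments = np.array([[4.0, 2.0], [6.0, 0.0]])
assert center_of_mass_slow(moments) == center_of_mass_fast(moments)
```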

That said:

  • It is very difficult to know where to draw the line, and we don’t want to take the risk of polluting our license.
  • It requires bandwidth from the team to review these PRs, and from looking at the first one I think it (a) makes the code harder to read and (b) provides relatively modest improvement.

I appreciate you bringing the discussion here first and being open about the contribution being AI-generated. This is an evolving conversation, one we also had at the Scientific Python Developer Summit, where it was led by @matthew-brett, who may have more to add here. We may have a better and more nuanced approach in the future, but for now I think it best to steer clear.

Thank you for your thoughtful response; I see the reasons behind your concern.

I would want to find a solution here, since the gains are too large to ignore. Looking only at the code that your predefined benchmarks reference, I found 25 optimizations for scikit-image that should speed up your benchmarks, and 20 optimizations for networkx with respect to its benchmarks.

These speed up important things. For networkx:

For scikit-image:

The PRs above need some work to make them merge-ready (we are still improving Codeflash so that it creates merge-ready code), but the whole scientific Python community might benefit a lot from these optimizations. Since the runs above are just a first attempt at a small portion of the codebase (maybe 10%), Codeflash can probably generate hundreds more optimizations across the two projects by generating the benchmarks and regression tests internally.

I asked an expert IP lawyer, and it’s true that there is a grey area around LLM-generated code and OSS licenses. There are a few technical solutions to this.

  1. If the model we are using includes a code-referencing feature (see, e.g., this feature), we could turn it on when generating the optimizations. These features are designed to detect when a generated suggestion matches or overlaps with a particular item from the model’s training set; where that original item has open-source licensing terms, the feature provides notice, attribution, and licensing information alongside the output suggestion. If we turn this on and it provides any attribution info, I would include that info in the contribution. You could then reject the ones that have a license mismatch.
  2. I could use a scanning tool (e.g., something like Black Duck, or a free/OSS alternative) to scan the generated code and check whether it matches or resembles existing third-party OSS codebases.

None of these options are “foolproof” though.

One idea, if you are wholly against any line of AI-generated code, is that you could take inspiration from the generated optimizations above and re-implement them as you see fit. I can run code-scanning tools to be extra careful. This “re-implementing ideas from memory” is a safe way to deal with copyright issues and is an established practice. If you are interested in this approach, let me know and I can ask the lawyers how to do it safely. I can also recruit volunteers or hire contractors to help with this optimization effort.

I am passionate about making Python performant, which is why I have been building Codeflash. My goal is to make all Python code performant, and making the core libraries fast is how I can contribute the most to the whole Python ecosystem.

Hi,

It may be worth having a look at the write-up of our discussion on open source, licensing, and AI here: sp-ai-post/notes.md at main · matthew-brett/sp-ai-post · GitHub

As you’ll see there, the primary problem is not whether we’d be held legally responsible, but whether we are properly respecting the copyright of the authors who (unwittingly) contributed the training code.

There we also discuss the problem of inspecting AI-generated code before writing one’s own, which we likewise considered too high a risk of license leakage.

What you could do, though, is look at the optimizations yourself, understand them, and then explain them to someone else, for example in an issue, so that they could implement them without having seen the AI-generated code.

Cheers,

Matthew


It looks like it won’t be possible to merge these Codeflash optimizations into networkx or scikit-image. If you change your AI policy, or if you prioritize optimizations in the future, do take a look at the Pull Requests linked above; I will keep them open as reference solutions for the future. You can also run codeflash --all to optimize the rest of the library, or any other scientific Python library.