Hi,
Given the rise of various generative AI tools, it would be nice if we could come up
with a definitive policy on dealing with genAI assisted contributions to SciPy.
The subject has been discussed multiple times over multiple venues, so it’d be great
if we could codify our position in some form and place it somewhere in the docs
(where exactly is TBD: the SciPy Frequently Asked Questions page could be one such place; another option
is the GitHub PR template, which has some guidelines already).
For two recent discussions see, e.g.
- a long discussion in the NumPy list [1]
- a recent use case (which, in fact, sparked this email) [2]
[1] Policy on AI-generated code, NumPy-Discussion mailing list (python.org)
[2] MAINT: stats: remove `mvn` fortran calls from `multivariate_normal.cdf`, by ev-br, Pull Request #22298, scipy/scipy (GitHub)
Broadly, it seems that the concerns can be roughly classified into two buckets:
- Do we accept genAI assisted contributions at all, and if we do, what are the boundaries
of what’s acceptable?
- Copyright and license concerns.
Let’s consider these two in order.
- Do we accept genAI contributions?
It’s futile to attempt to block them wholesale.
People are going to use AI assistants anyway, along with various IDEs, autocompleters, etc.
So it seems reasonable to treat these assistants on par with other tools: just as
it does not matter whether a contributor uses VS Code or emacs or vi, it also does not matter
what engine powers their autocompleter. The discussion in [1] has a concise wording,
“Only submit code that you actually understand”, and I’d go with just this.
One thing we probably do not want to see is a fully automated submission, which is
both generated and submitted by a bot, with no human intervention or supervision at all.
So we could/should stress that, too.
- Copyright violations / incompatible licenses
Here I lack expertise and have more questions than answers. Here’s a possibly naive
attempt at asking hopefully relevant questions:
- Are we concerned that the code generated by some tool is under some incompatible
license from just the fact of being generated by the tool?
In this case, using the tool is similar to copying code from, say, Numerical Recipes.
Then we should either blacklist known bad tools (do we know of any?) or ask
contributors to declare the tools they used (so that we can remove offending code
at a future date if needed).
- Are we concerned that the training set of a tool included license-incompatible code?
And if it did, what does that mean for us: are we then transitively non-compliant?
If this is a concern, and we at some point discover that some code is license-incompatible,
the action does not depend on how it got into the codebase, genAI or not:
we discover it, we remove it.
And then the only question IMO is whether it’s helpful to us to ask contributors
to list the AI assistants they used.
To summarize,
- we should not block genAI assisted contributions;
- we should block fully automated contributions, and ask submitters to only submit
code they understand;
- TBD if we want to ask contributors to list what genAI assistants they used, if any.
- Let’s add explicit (short!) guidance to the docs, TBD where precisely.
Thoughts?