Let me reply with some questions, followed by the promised unpacking.
- Do we care about copyright? That is - is it acceptable to us, as open-source authors, that an LLM can easily generate code to which copyright should apply, so that, if we do nothing, copyright will in due course become very difficult to enforce, and will be practically void?
- Let’s say we do care about copyright - what cost are we prepared to pay, in order to defend copyright from becoming void? Clearly replies here will vary from high cost to low cost. I feel strongly about copyright, and am prepared to pay a high cost. Others may feel differently.
- If one is engaged in that trade-off, copyright against code quality and volume, then one has to ask what benefit we expect from AI, or conversely, what benefit we will lose by constraining the use of AI. That’s a difficult question, with many facets. Could the code have been written in a similar time without AI, other than for fairly trivial and mechanical processing that is unlikely to violate copyright? (The evidence so far suggests yes.) Will contributors using AI benefit in the same way from feedback and training, which has been our model for onboarding contributors? Is it in fact true that the committer of AI code understands the code and its context to the same degree as a contributor not using AI? If we are flexible and welcoming of vibe-coding and autocomplete - what kind of programmers will we encourage? And what will be the value / cost ratio of merging those contributions? And of supporting those contributors? And so on. These are the kinds of questions that Oscar is alluding to in his post that set off the current discussions.
At the top of your email:
Basically, I honestly don’t understand 1) how can a contributor check for potential copyright violations, and 2) how can an OSS project check the contributor’s check?
I’ll try and cover the first point below. For the second, as we’ve discussed before, we have to ask whether it is worth having a policy that may be effective, but is difficult to enforce. For example, it may be that contributors do change their behavior, in their desire to adapt to our norms, but that it would be hard to detect whether they have done this. I would argue that having the norm is worthwhile, even so. In general, it seems to me that we have a strong set of communities, and strong norms, which are, in fact, seldom broken by serious contributors.
Anyway, back to the personal copyright check. Let’s take the example you gave:
a contributor asks an LLM to translate a BSD-licensed Matlab code to Python, and submits it to SciPy
In that case, I would look carefully at the Matlab code, and at the Python code, and confirm that it was a faithful port, with no substantial code in it that did not come directly from the Matlab code. I would state that in my PR, to confirm that I had done the work. If I did find substantial code that didn’t come from the original, I’d do a GitHub search for that code, and I’d do a Google search for potential code that might have been used. I’d then check the results to see whether they matched the AI output, and report the outcome on the PR.
- a contributor uses an autocomplete functionality of their IDE to write a patch;
I’m not quite sure what you mean here - if you mean a simple one- or two-line fix, then I think the contributor would be reasonable in asserting something like “code is trivial, copyright unlikely to apply”. If the code is not trivial, then I’d do the GitHub / Google search as above, and report the results.
In asking the contributor to report their findings on potential copyright violations, we make it easier for maintainers to confirm them. I suspect we’ll gradually build expertise in detecting copyright violations.
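To make that GitHub search step a little more concrete, here is a rough sketch of how a contributor or maintainer might automate the first pass of the check. It uses GitHub’s code search REST API via the requests library; the snippet being searched for and the GITHUB_TOKEN environment variable are placeholders I’ve made up for illustration, not part of any existing tooling.

```python
"""Rough sketch: search public GitHub code for a distinctive snippet
taken from AI-generated code, as a first pass of a copyright check.

Assumptions (mine, for illustration): a personal access token in the
GITHUB_TOKEN environment variable, and the requests library installed.
GitHub code search matches on tokens rather than exact strings, so any
hit still needs a human to read the file and compare it with the output.
"""

import os

import requests

# A distinctive identifier or phrase from the AI-generated code,
# chosen by hand (hypothetical example).
SNIPPET = "levinson_durbin_toeplitz_solver"

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": f'"{SNIPPET}" language:python'},
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    timeout=30,
)
resp.raise_for_status()

# List candidate matches for manual review, to report on the PR.
for item in resp.json().get("items", []):
    print(item["repository"]["full_name"], item["path"], item["html_url"])
```

Obviously that only finds code that is public on GitHub, and only for snippets distinctive enough to search for, so I’d treat it as a starting point for the kind of check I described above, not a substitute for it.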
I suppose the underlying question is the obvious one - is it time to throw up our hands and accept that copyright has become moot in the age of coding LLMs? I would argue no - copyright is too important, and LLM coders are not good enough, for us to do that.