A policy on generative AI assisted contributions

Please do unpack more! Basically, I honestly don’t understand (1) how a contributor can check for potential copyright violations, and (2) how an OSS project can check the contributor’s check.

Consider a well-meaning, diligent and competent contributor, and consider three situations:

  • a contributor asks an LLM to translate BSD-licensed Matlab code to Python and submits it to SciPy (this was the original case that triggered the OP: Matt H did just this for some Alan Genz code);
  • a contributor uses the autocomplete functionality of their IDE to write a patch;
  • a contributor gives an LLM a link to an issue on GitHub and asks it to write a fix.

Suppose, further, that in all three cases the contributor reviews the generated code, modifies it as necessary, and can in full confidence check the box “I verified all the code, I understand what it does, and I believe it fixes the issue / adds the requested enhancement”. Suppose the resulting patch is of high quality and otherwise mergeable.

Questions:

  1. What actions does a contributor need to take in order to check for copyright violations?
  • If the problem is that the LLM’s training set could have contained license-incompatible code (which is the problem discussed upthread, IIUC), then the answer seems to be “there is no way; it’s a black box”.
  2. If the contributor essentially says “I pinky swear it’s compatible”, how does that help the OSS project?
  • Again, if the problem is with the LLM’s training set, then no action by the contributor shields the OSS project from being in violation of copyright.

In other words, I honestly fail to see how moving the burden of a copyright check onto the contributor helps either to weed out low-quality submissions or to handle LLM-assisted submissions that are otherwise of high quality.