Thank you @jni
Replying to the listserv from Gmail was not a good idea: an encoding issue made the reply hard to read, and it did not help my tone. I'll copy it here and hope that this version is less painful to read.
re: @matthew-brett's comments about copyright, and his write-up with @_pi (see also this talk from @_pi), I want to share a couple of posts from Matthew Butterick, and a related lawsuit:
https://matthewbutterick.com/chron/this-copilot-is-stupid-and-wants-to-kill-me.html
https://githubcopilotinvestigation.com/
https://githubcopilotlitigation.com/
I’m guessing some people will be familiar with these, but if you’re not, I hope you’ll read and consider the arguments.
Perhaps most relevant (from the “investigation” post):
> Just this week, Texas A&M professor Tim Davis gave numerous examples of large chunks of his code being copied verbatim by Copilot, including when he prompted Copilot with the comment /* sparse matrix transpose in the style of Tim Davis */.
So coding LLMs clearly memorize, but as currently engineered, most do not provide citations or links to licenses.
(They could, but AFAIK they don't. Butterick points this out as well.) Prior work suggests it is easy to extract code but hard to extract authorship (e.g., https://arxiv.org/pdf/2012.07805, page 10), and that larger models tend to memorize more (https://dl.acm.org/doi/pdf/10.1145/3597503.3639074). Likewise, "most LLMs fail to provide accurate license information, particularly for code under copyleft licenses" (https://arxiv.org/abs/2408.02487v1).
The example of the prompt "sparse matrix transpose in the style of Tim Davis" shows that it is not as simple as "more examples in training data = more memorization". For scientific software, the number of examples in the training data will always be far smaller than the number of examples of, say, CRUD apps. My guess is that, with a sufficiently specific prompt, attempts to generate scientific code are much more likely to violate an open-source license. (One could test this, of course, probably with an approach like this: https://arxiv.org/abs/2601.02671)
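To make the idea of such a test concrete, here is a minimal sketch of one way it could work: prompt a code LLM with a targeted comment (as in the Tim Davis example), then measure how much of the completion matches a known licensed source verbatim. The model call is stubbed out with fixed strings here; only the overlap measurement is real, and all names and the sample snippets are my own illustrations, not from any of the papers linked above.

```python
# Hedged sketch: measure verbatim overlap between a (hypothetical)
# model completion and a known licensed reference file.
from difflib import SequenceMatcher

def verbatim_overlap(completion: str, reference: str) -> float:
    """Fraction of the completion covered by its longest verbatim
    match inside the reference (1.0 = entirely copied)."""
    m = SequenceMatcher(None, completion, reference, autojunk=False)
    block = m.find_longest_match(0, len(completion), 0, len(reference))
    return block.size / max(len(completion), 1)

# Stand-ins for a real experiment: `reference` would be licensed code
# from the training data, and the completions would come from prompting
# a model with e.g. "/* sparse matrix transpose in the style of ... */".
reference = "for (p = 0; p < nz; p++) Ci[w[Aj[p]]++] = Ai[p];"
memorized = "for (p = 0; p < nz; p++) Ci[w[Aj[p]]++] = Ai[p];"
original = "rows, cols = matrix.shape; out = matrix.T.copy()"

print(verbatim_overlap(memorized, reference))  # identical -> 1.0
print(verbatim_overlap(original, reference))   # small incidental overlap
```

In a real study you would run many prompts per licensed file and report the distribution of overlap scores, flagging completions above some threshold for manual license review.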
I don’t want to be a random person getting sanctimonious on the mailing list.
But I really value the ethics of open source software, the amazing contributions of all the numpy developers, and of scientist-coders more broadly.
I get the appeal of coding LLMs. And I agree with Butterick that, as currently designed, they break the ethical compact of open source.
I would hate to see numpy and the ecosystem around it move in that direction.