Criteria for including new functionality into scikit-image

I’d like to get some input and clarification from the scikit-image community and @skimage-core on whether there is a consensus on how to decide if a contribution with new functionality should be accepted into our code-base. I haven’t found a clear statement (yet) in our resources on this but here is what I more or less picked up implicitly:

  • New contributions must be compatible with our license. They can’t be patented and the copyright holder must give consent to include it in our code base (implicit if this is the contributor itself?).
  • New functionality should have references in the scientific literature. Ideally, these sources are well cited.
  • New functionality should be well used in image processing, e.g. in other image processing frameworks or there should be a clear demand.

The latter two criteria can be sometimes hard to judge, especially with more specialized stuff (e.g. this pull request is raising this question for me: Add division of self overlapping polygons #6462). Has there ever been a case were these two lead to us denying a high-quality contribution?

I may be completely off here or just to dumb to find the answer in our resources but I thought it wouldn’t hurt to clarify. Happy to take notes and put them somewhere in our contributor or maintainer guide.

Hey Lars,
thanks for pointing that out. These are some reasons that led me to advocate for “a PR always solves an issue” before, mainly because we could discuss what PRs are feasible for us to maintain in the long haul.
This is specially useful right now, since we don’t have that many hands around to take care of the library, and we have plans and open PRs that will require a lot of work — also, it’s sad to ask a contributor to drop a PR in which they worked on a lot already.

IMHO we should discuss every new contribution before it’s coming into the package, so we don’t have doubts. Nevertheless, I agree with the points you suggest.

  • New contributions must be compatible with our license. They can’t be patented and the copyright holder must give consent to include it in our code base (implicit if this is the contributor itself?).

We should take extra care with this one, concerning data. Code is alright; BSD-compatible and we’re good to go. Data is tricky, though; I think all data in the library should be in public domain, and I had a conversation with @stefanv a while ago about that — long story short, CC0 has a breach that can be used for patent trolls. Fedora is considering CC0 as licença non grata: Fedora to disallow CC0-licensed code [LWN.net], [Fedora-legal-list] Change in classification of CC0 [LWN.net]. We should think about that.

Do you mean strictly in public domain and not just CC0?

Reminds me that I was supposed to document how to submit new data in more detail (under https://scikit-image.org/docs/stable/contribute.html#development-process)…

We would need a red banner for this! First line in the README…

Yeah. The aftermath of the talk Stéfan and I had was that CC0 and public domain are different things — hence, the Fedora case above :slightly_smiling_face:

1 Like

Thanks for bringing this up, @lagru.

This criterion is objective:

  • referenced in the literature: yes or no?
  • well cited: how many citations and how ‘diverse’ these citations (not just from the same authors/teams)?

I guess the tricky part is ‘how well’ is ‘well cited’ but, honestly, I think that a handful of us can determine this in a collegial manner: I don’t expect this to be controversial…

The third (implicit :wink:) criterion though,

is much more difficult to assess. But it should be somewhat correlated with the former criterion, and our assessment could include adding non-academic publications.

Thanks for raising this important topic, Lars!

2 and 3 are, indeed, often difficult to judge and I think there is no simple answer to that.
We have actually had a precedent where a not-so-famous algorithm was added to the library. At the time of my PR there was not even a preprint available, only an article draft somewhere on a unversity website. But the algorithm was very principled and showing, based on my experience, very good results in a wide range of practical cases (I did quite a comprehensive testing within the work projects). The preprint was added by the author later, if I am not mistaken, following the inquiry coming from us. Based on its low citation count, the algorithm almost got removed, but the final decision was to keep it thanks to its practical value. I have later seen our implementation successfully used in many repos across the web.

Considering the above, I believe we should keep the subjective element (our expert estimation of the potential value/impact of the algorithm) as a part of the equation regardless of the academic metrics.

2 Likes

You are arguing that the corresponding issue for each PR can be used to discuss these criteria and other stuff that is not relevant to the concrete implementation? I’m not opposed to the idea but this would certainly take a lot of discipline. +1 for the banner in the pull request template that @mkcor advocates. A there other advantages perhaps already reiterated inside a discussion somewhere? I’d like to know if this has been already discussed. Otherwise we should open another topic to give our community time to discuss and review.


Concerning the criteria: There are many positive cases in our library where functions were added that don’t contain literature references, e.g. remove_small_holes, local_maxima, … I’m assuming that today we would add many of these functions again.

Thank you @soupault for the fitting example. I agree with you and @mkcor that the latter two criteria have been and probably should remain guidelines. So instead maybe we should focus on how we reach the decision whether to include it. I feel like the following rule from our governance SKIP 1 would be a good fit (or already applies) here:

Code changes and major documentation changes require agreement by two core developers and no disagreement or requested changes by a core developer on the issue or pull-request page (lazy consensus).

A corresponding issue to each PR like @alexdesiqueira suggested would be where this process takes place.


Off topic: I kinda wish our decision making rules would at least include a recommendation on how much time should be given until a lazy consensus is assumed. I’m asking myself that question a lot lately because a lot of maintainers are currently less active. Most of the time I’m trying to give it at least one week…

1 Like

FYI we have some guidelines for SciPy in our docs that might interest you SciPy Core Developer Guide — SciPy v1.10.0.dev0+1697.74cbd2d Manual and Ways to Contribute — SciPy v1.10.0.dev0+1697.74cbd2d Manual

Another important point is StackOverflow which is not BSD/MIT compatible.

2 Likes

This may be worth reviewing as well:

2 Likes

A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness.

Thanks for sharing, @jarrodmillman!

1 Like

Thanks everyone for the feedback! I startet the linked PR to document what I understand to be the general consensus here:

1 Like

For Fedora it seems that CC0 is problematic only for code, not for content (“We plan to classify CC0 as allowed-content only, so that CC0 would no longer be allowed for code”). Could data be considered content? :thinking:

Yes. I understand they are using code as a subset of content in that sentence, and more than that, all content — including both code and data — would fall in that primary concern: somebody could be sued.

Re: licensing of data, I think that CC0 is perfectly acceptable since it translates to “put this into the public domain to the extent possible by law”—Alex, I’m curious why we would have thought differently? Note that Fedora still allows CC0 for data.

Re: minimal inclusion criteria, the “rule of thumb” is exactly that—a rough guideline. Sometimes, a paper is published that makes a tweak to a well cited paper, e.g., and in that case it should be included. Sometimes, contributions do not have matching papers, but are used widely. Unfortunately, it is a judgement call, so we cannot put any strict rules in place.

Re: decision making guidelines: we should update our governance to include decision timing. I think it is reasonable to simply move ahead with non-controversial decisions that all team members present are in agreement on, but to give two or three weeks for feedback on potentially hairy ones, and make sure that relevant parties (people who have worked on a specific piece of code, e.g., or who have expressed interest in a topic before) are pinged.

Alex, I’m curious why we would have thought differently?

@stefanv I said we should think about the implications of using CC0, not that we think differently about that:

(…) I think all data in the library should be in public domain, and I had a conversation with @stefanv a while ago about that — long story short, CC0 has a breach that can be used for patent trolls. Fedora is considering CC0 as licença non grata (…) We should think about that.

We having a conversation about something doesn’t mean we have a consensus or decision on anything :smile:

Note that Fedora still allows CC0 for data.

Being more explicit about that, you’re referring to this:

“We plan to classify CC0 as allowed-content only, so that CC0 would no longer be allowed for code,”

is that right? I think it puts more questions than anything. Not sure if Fedora distributes “data” in their packages, and how much of these data is CC0. Maybe I was wrong when assuming this:

I understand they are using code as a subset of content in that sentence, and more than that, all content — including both code and data — would fall in that primary concern: somebody could be sued.

If that’s the case, sorry @mkcor! Nevertheless, I am still reading content there as everything, byte-wise…
Maybe I’m digressing too much with that. Sorry everyone!

Yes, it seems very clear what is happening there: CC0 does not have a patent clause, therefore should not be used for code. Data is fine. I’m not sure why they’re OK with the BSD, since it also does not have a patent clause.

I see no relevance of the Fedora case to us using CC0 datasets.

Ok :relieved:
Thank you both for following up!