Terminology for parameters controlling parallel computation

lagru · March 4, 2024, 6:30pm

Over at scikit-image we have been trying to come up with a consistent name for parameters that control parallel computation, be it the number of threads or the number of processes.

The discussion hasn’t been settled yet in part because there doesn’t seem to be a “standard” convention in the ecosystem to follow: e.g., workers is used by SciPy while scikit-learn uses n_jobs.
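To make the comparison concrete, here is a minimal sketch of what a shared convention might pin down beyond the name itself: how a user-supplied count is interpreted. The helper `resolve_workers` is hypothetical and not part of any library. Its semantics (`None` means serial, `-1` means all available cores) are an assumption, loosely modelled on the `-1` shorthand that both SciPy's `workers` and joblib's `n_jobs` accept.

```python
import os


def resolve_workers(workers=None):
    """Turn a user-supplied worker count into a positive integer.

    Hypothetical helper; the accepted values are an assumption:
    - None -> 1 (run serially)
    - -1   -> all CPU cores reported by the OS
    - n>=1 -> n
    """
    if workers is None:
        return 1
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(workers, bool) or not isinstance(workers, int):
        raise TypeError(f"workers must be an int or None, got {type(workers).__name__}")
    if workers == -1:
        # os.cpu_count() may return None on some platforms.
        return os.cpu_count() or 1
    if workers < 1:
        raise ValueError(f"workers must be -1 or a positive integer, got {workers}")
    return workers
```

Whatever name is chosen, agreeing on this kind of interpretation is probably what lets users carry their expectations from one library to the next.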

As this seems somewhat similar in concept to SPEC 7: Seeding pseudo-random number generation, I’d be curious what people think of making an ecosystem-wide recommendation. In my mind, the scope would be naming conventions rather than a full-blown “parallel computing API”.

stefanv · March 4, 2024, 7:00pm

I’d love to see some coordination around this API. Perhaps @lesteve has thoughts?

drammock · March 5, 2024, 10:48pm

FWIW, MNE-Python follows sklearn in using n_jobs.

lesteve · March 6, 2024, 6:40am

n_jobs may not be the best parameter name, but changing it in scikit-learn, joblib, and the other projects that adopted the same name (MNE-Python, imbalanced-learn, xgboost, etc.) seems like a long-term endeavour, and the migration would cause some disruption.
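For illustration only, a rename of this kind is often handled with a transition period in which both names are accepted. The sketch below is an assumption about how a project could do that, not what scikit-learn or joblib actually do; the function name `resolve_parallel_kwarg` and the choice of `FutureWarning` are hypothetical.

```python
import warnings


def resolve_parallel_kwarg(*, workers=None, n_jobs=None):
    """Accept the new `workers` name while still honouring the old `n_jobs`.

    Hypothetical shim: returns the requested worker count, defaulting to 1.
    """
    if n_jobs is not None:
        if workers is not None:
            raise TypeError("Pass either `workers` or the deprecated `n_jobs`, not both.")
        warnings.warn(
            "`n_jobs` is deprecated; use `workers` instead.",
            FutureWarning,
            stacklevel=2,
        )
        workers = n_jobs
    return 1 if workers is None else workers
```

Even with a shim like this, every downstream call site still has to change eventually, which is where the disruption comes from.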

I think workers is fine. I would have a slight preference for n_workers or num_workers, but following SciPy's naming is probably better. @thomasjpfan seems to agree; see his SciPy 2023 talk "Can There Be Too Much Parallelism?" and "Parallelism in Numerical Python Libraries".

cc @ogrisel in case he has some comments.

stefanv · March 6, 2024, 5:42pm

Indeed, @thomasjpfan’s terrific blog post inspired our thinking on this topic. Ultimately, though, we’d prefer a workable option that the community is willing to adopt over the “optimal” naming.

scikit-image has to make a decision relatively soon, so we’re in a similar boat as we were in with SPEC 7: Seeding pseudo-random number generation.

The num or n prefix has also come to make sense to me, personally, to distinguish it from passing around worker objects.
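As a rough sketch of that distinction, the function below takes a count under a prefixed name and an already-constructed worker pool under a separate one. Everything here is hypothetical (`map_squares`, `num_workers`, `executor`); it only uses the standard library's `concurrent.futures` and is not any project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def _square(x):
    return x * x


def map_squares(values, *, num_workers=1, executor=None):
    """Square each value, either on a caller-supplied executor or a new pool.

    `num_workers` is a count; `executor` is a worker object. Keeping the
    names distinct makes it obvious which of the two a caller is passing.
    """
    if executor is not None:
        # The caller owns the executor's lifetime, so do not shut it down here.
        return list(executor.map(_square, values))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(_square, values))
```

With an unprefixed `workers`, a reader cannot tell from the call site alone whether `workers=pool` or `workers=4` is the intended usage.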

Schefflera-Arboricola · April 4, 2024, 4:12pm

Hi, I’ve been working on nx-parallel (a parallel backend for NetworkX), and we had a little discussion about what the default value of n_jobs (or workers) should be.

I have yet to add a config manager, so right now a user cannot modify any of the parallel configs. I think we decided to go with joblib’s convention (i.e. n_jobs) because nx-parallel depends heavily on joblib for all of its parallelization, so it would be fair to expect a user to know joblib if they want to play around with the parallel configurations in nx-parallel. Allowing users to set configs would hopefully also let them use other parallel backends (like threading, multiprocessing, dask, ray, etc.) through joblib, in which case an ecosystem-wide recommendation would be even more helpful. Having a SPEC for a whole parallel computing API would also be great! I’d really like to know how I can contribute to this.

Thank you