Dear community,
As mentioned in my previous post, I would like to respond to EuroSciPy 2024’s Call for Proposals. My presentation would clearly fall under Data Science > Reproducibility (while touching on various issues found in other tracks: HPC, scientific applications, machine learning, and education).
In a new tutorial for the scikit-image gallery of examples, we try to reproduce (a very small part of) segmentation results from published open research in mouse embryonic development, which brings into play advanced 3D bioimaging techniques (note that the PR is still under review). We started this project last year with then-Outreachy interns @ana42742 and @agdev.
I’m currently ‘scaling’ this work by running the image data analysis on an entire frame (as opposed to a small 3D slice). For this I’m using the Heap community cluster, and I’m so grateful to the Recurse Center! There are challenges at every step along the way: getting the data, setting up the environment, following the research-style documentation, and so on.
There is substantial literature on how to make research reproducible [1, 2, 3], so, hopefully, more and more researchers are empowered to share the artifacts required to reproduce a published study. I would like to report on my experience of using these artifacts: I must say that it’s been really joyful at times and really painful at others.
To be clear, my approach is not to point fingers but to praise the authors who go the extra mile to make their research open and reproducible. But how feasible is it, knowing that dependency versions are dropped over time (e.g., per SPEC 0) and new standards take over? We can’t expect research code to be maintained once the grant that funded it ends, right?
Many (interesting) questions have arisen for me while working on this. For example, the data are available in the KLB format, which apparently isn’t in wide use anymore; nowadays, a similar study would probably publish its data in the Zarr format, which would make it easier (or even possible at all) to load the data into various analysis tools. But is it fair to ask that the data be available in Zarr when the research was published in 2018?
I’d be curious to know of other attempts at reproducing published open data. I welcome any kind of feedback you might have. Thank you for reading.
/cc @lagru @jarrodmillman @jni @stefanv @soupault
PS: @ana42742 would you be interested in co-presenting?
[1] The Practice of Reproducible Research — http://www.practicereproducibleresearch.org
[2] Reproducibility Guide
[3] Guide for Reproducible Research — The Turing Way