Talk proposal for EuroSciPy 2024

Dear community,

As mentioned in my previous post, I would like to respond to EuroSciPy 2024’s Call for Proposals. My presentation would clearly fall under Data Science > Reproducibility (while touching on various issues found in other tracks: HPC, scientific applications, machine learning, and education).

In a new tutorial for the scikit-image gallery of examples, we try to reproduce (a very small part of) the segmentation results from published open research on mouse embryonic development, which brings into play advanced 3D bioimaging techniques (note that the PR is still under review). We started this project last year with then-Outreachy interns @ana42742 and @agdev.
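For context, and with the caveat that this is not the tutorial's actual pipeline, here is a minimal sketch (on synthetic data) of the kind of 3D segmentation workflow involved, using scikit-image:

```python
import numpy as np
from skimage import filters, measure

# Synthetic stand-in for a small 3D slice of the embryo dataset (planes, rows, columns).
rng = np.random.default_rng(0)
volume = rng.random((64, 256, 256)).astype(np.float32)

# Smooth in 3D, threshold globally, and label connected components.
smoothed = filters.gaussian(volume, sigma=2)
binary = smoothed > filters.threshold_otsu(smoothed)
labels = measure.label(binary)

print(f"Found {labels.max()} candidate regions")
```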

I’m currently ‘scaling’ this work by running the image data analysis on an entire frame (as opposed to a small 3D slice). For this I’m using the Heap community cluster, and I’m so grateful to the Recurse Center! There are challenges at every step along the way, from getting the data and setting up the environment to following the research-style documentation.

There is substantial literature on how to make research reproducible [1, 2, 3], so (hopefully) more and more researchers are empowered to share the artifacts required to reproduce a published study. I would like to report on my experience of using such artifacts: I must say that it has been a real joy at times and a real pain at others.

Obviously, my approach is not to point fingers but to praise the authors who go the extra mile to make their research open and reproducible. But how feasible is long-term reproducibility, knowing that dependency versions are dropped over time (e.g., SPEC 0) and new standards take over? We can’t expect research code to be maintained once the grant which funded it is over, right?

Many (interesting) questions have arisen for me while working on this. For example, the data are available in the KLB format, which apparently isn’t much in use anymore; nowadays, a similar study would probably publish its data in the Zarr format, which would make it easier (or even possible at all) to load the data into various analysis tools. But is it fair to ask that the data be available in Zarr, when the research was published in 2018?
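To make the question concrete, here is a rough sketch of what converting one KLB volume to Zarr might look like, assuming the pyklb bindings and zarr are installed (the file names below are made up):

```python
import pyklb  # Python bindings for the KLB file format
import zarr

# Hypothetical file names; the actual dataset layout differs.
klb_path = "embryo_timepoint.klb"
zarr_path = "embryo_timepoint.zarr"

# Read the whole KLB volume into memory as a NumPy array...
volume = pyklb.readfull(klb_path)

# ...and write it out as a chunked Zarr array, which tools like dask or
# napari can then read chunk by chunk.
zarr.save_array(zarr_path, volume, chunks=(64, 256, 256))
```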

I’d be curious to hear about other attempts at reproducing results from published open research and data. I welcome any kind of feedback you might have. Thank you for reading.

/cc @lagru @jarrodmillman @jni @stefanv @soupault

PS: @ana42742 would you be interested in co-presenting?

[1] http://www.practicereproducibleresearch.org
[2] Reproducibility Guide
[3] Guide for Reproducible Research — The Turing Way


@mkcor I think this will make for a very interesting presentation, especially because of the personal experience you describe. I like the open questions too, and would love to hear what the audience have to say about them.


Absolutely, that’s what I have in mind! Thank you very much for your encouraging feedback, Stéfan.


Dear all,

I have drafted my proposal here. I would be very grateful if you could have a look before I submit it, i.e., in ~24 hours (sorry for the super short + weekend notice!).

For lack of time, I haven’t mentioned my (current) experiment on a cluster. I may be going into too much detail elsewhere, although I’m far from reaching the 50,000-character upper limit! Also, while I’m not formally revealing any personal information, it can be inferred by visiting the links I shared as references… How serious do you think this is?

Thank you!
Marianne

That’s a great write-up; I would attend :slight_smile: I’ve left some inline comments.


@stefanv thank you for the corrections and valuable comments (great review)! Still in the process of addressing them…

Thumbs up for such a great write-up. Having contributed to this briefly, I can completely relate to the feelings of pain and euphoria that come with attempting this work. I am confident this would make for a good talk. Well done @mkcor @ana42742
