Hello everyone,
I’m currently in the early stages of designing a scientific workflow to automate the processing and visualization of experimental data coming from lab instruments (primarily CSVs and HDF5 files). My primary objective is to create a robust, modular, and reproducible pipeline using tools from the scientific Python ecosystem. However, I’m trying to narrow down the most effective approach and would appreciate guidance from those more experienced.
Right now, I’m working with pandas for initial data manipulation and matplotlib/seaborn for visualization, but I’m unsure whether these are the best tools in terms of scalability and long-term flexibility. I’ve also explored xarray, which seems promising for handling multi-dimensional datasets, especially those with labeled axes, but I’m not yet confident in how it fits into my pipeline.
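For context, here’s a stripped-down version of what my per-file processing looks like today (the column names are placeholders for illustration, not my real schema):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical instrument CSV with "time", "channel", and "signal" columns
# (placeholder names for illustration only).
df = pd.read_csv("run_001.csv")

# Basic cleaning: drop incomplete rows and sort by acquisition time.
df = df.dropna().sort_values("time")

# Quick-look plot, one line per channel.
sns.lineplot(data=df, x="time", y="signal", hue="channel")
plt.savefig("run_001_overview.png", dpi=150)
```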
Additionally, I’d like to incorporate some level of workflow orchestration and possibly integrate with dask for parallelization if the dataset size grows. My final goal is to have a script or notebook that can be adapted to different datasets without too much manual rewriting.
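What I have in mind (very roughly, and I’m not sure this is the right pattern) is a config-driven processing function, so switching datasets only means switching the config, with dask handling lazy, chunked reads if files get large:

```python
import dask.dataframe as dd

def process_run(path, config):
    """Load and clean one run; `config` holds dataset-specific column names.

    The function name, config keys, and threshold are made up for illustration.
    """
    ddf = dd.read_csv(path)  # lazy, chunked read; accepts glob patterns
    ddf = ddf.dropna(subset=[config["value_column"]])
    ddf = ddf[ddf[config["value_column"]] > config.get("threshold", 0)]
    return ddf

# Adapting to a new dataset would then just mean a new config dict:
config = {"value_column": "signal", "threshold": 0.05}
cleaned = process_run("runs/*.csv", config).compute()
```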
Has anyone here built a similar system, or can you share insights on how to structure such a workflow using the Scientific Python stack? Any advice on library choices, best practices for reproducibility (e.g., conda, poetry, or pipenv for environments), and general architecture would be incredibly helpful.
Thanks in advance!