Analyzing the Research Workflow in Python Research Scripts
Context
Every data science research script takes an input and produces an output, but in between lie multiple steps that transform the provided data. Given a sufficiently large sample of scripts, it should be possible to derive a workflow that most scientists adhere to and to compare it with the workflows suggested in the literature [1].
This workflow can then be used for further analyses, such as counting the function calls made in each stage of the workflow. To facilitate that, we need a mapping from popular data science functions to the identified stages.
The following short example (written in R for illustration) shows how function calls are annotated with their stage in the workflow:
sample <- read.csv("sample.csv", sep = ";") #import
plot(sample$var1 ~ sample$var2, pch = 20, col = "grey58", ylim = c(0, 1), xlim = c(0, 1)) #visualize
abline(lm(sample$var1 ~ sample$var2)) #visualize
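The same kind of annotation could be automated for Python scripts. The following is a minimal sketch, not a definitive implementation: it assumes a small hand-made mapping (STAGE_BY_FUNCTION, a hypothetical name with illustrative entries) and only resolves calls made through imported module names or aliases, such as pd.read_csv. Method calls on objects (for example on a DataFrame) are not resolved here and would need additional type information.

```python
import ast
from collections import Counter

# Hypothetical, hand-made mapping from fully qualified function names to
# workflow stages. The entries are illustrative assumptions only.
STAGE_BY_FUNCTION = {
    "pandas.read_csv": "import",
    "matplotlib.pyplot.scatter": "visualize",
    "matplotlib.pyplot.plot": "visualize",
}


def dotted_name(node):
    """Return 'a.b.c' for nested Name/Attribute nodes, otherwise None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def count_calls_per_stage(source):
    """Count the calls in `source` that map to a known workflow stage."""
    tree = ast.parse(source)

    # Collect the names bound by import statements, e.g. 'pd' -> 'pandas'.
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                if item.asname:
                    aliases[item.asname] = item.name
                else:
                    # 'import a.b' binds only the top-level name 'a'.
                    top = item.name.split(".")[0]
                    aliases[top] = top
        elif isinstance(node, ast.ImportFrom) and node.module:
            for item in node.names:
                aliases[item.asname or item.name] = f"{node.module}.{item.name}"

    # Resolve each call to a fully qualified name and look up its stage.
    counts = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is None:
                continue
            head, _, rest = name.partition(".")
            if head not in aliases:
                continue
            full_name = aliases[head] + ("." + rest if rest else "")
            stage = STAGE_BY_FUNCTION.get(full_name)
            if stage is not None:
                counts[stage] += 1
    return counts


example = '''
import pandas as pd
import matplotlib.pyplot as plt

sample = pd.read_csv("sample.csv", sep=";")
plt.scatter(sample["var2"], sample["var1"])
plt.plot([0, 1], [0, 1])
'''

stage_counts = count_calls_per_stage(example)
print(dict(stage_counts))
```

The script is only parsed, never executed, so sample.csv does not need to exist. Static parsing is chosen here because collected research scripts often cannot be run without their original data.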
Research Problem
In this work, we want to explore whether there is a common workflow across disciplines (such as Chemistry [2], Biology, Social Sciences [3], etc.) that high-quality papers adhere to. By surveying outstanding journals and conferences in those fields, we want to collect samples of how their authors structure their scripts. The derived workflow is then compared with the literature on proposed workflows to determine how much they overlap. Furthermore, we want to create a mapping from the functions used in these scripts to their respective stages in the derived workflow.
Tasks
- Identify a set of conferences/journals as a basis for a literature review.
- Collect recent publications from this set that apply data science in Python.
- Derive a multi-stage workflow and compare it to literature on data science workflows.
- Identify popular libraries that are used and map their functions to the stages of your workflow.
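One possible way to approach the workflow derivation in the tasks above, sketched under stated assumptions: each collected script has already been reduced to an ordered list of stage labels. Runs of repeated labels are collapsed, and the resulting paths and stage-to-stage transitions are counted across scripts. The stage names "clean" and "model" and the three sample sequences are hypothetical placeholders, not findings.

```python
from collections import Counter
from itertools import groupby


def collapse(stages):
    """Collapse runs of identical consecutive stages into a single entry."""
    return tuple(stage for stage, _ in groupby(stages))


def transition_counts(scripts):
    """Count stage-to-stage transitions over all collapsed scripts."""
    counts = Counter()
    for stages in scripts:
        path = collapse(stages)
        counts.update(zip(path, path[1:]))
    return counts


# Hypothetical stage sequences, one list per analyzed script.
scripts = [
    ["import", "import", "clean", "model", "visualize"],
    ["import", "clean", "clean", "visualize"],
    ["import", "clean", "model", "model", "visualize"],
]

transitions = transition_counts(scripts)
path_counts = Counter(collapse(stages) for stages in scripts)
most_common_path, support = path_counts.most_common(1)[0]

print(most_common_path, support)
print(transitions.most_common())
```

The most frequent collapsed path is a candidate for the derived workflow, and the transition counts indicate which orderings of stages are common enough to compare against the workflows proposed in the literature.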
Related Work and Further Reading
[1] Huber, F. Hands-on Introduction to Data Science with Python, v0.23. Zenodo, 2025. https://doi.org/10.5281/zenodo.10074474
[2] Davila-Santiago, E.; Shi, C.; Mahadwar, G.; Medeghini, B.; Insinga, L.; Hutchinson, R.; Good, S.; Jones, G. D. Machine learning applications for chemical fingerprinting and environmental source tracking using non-target chemical data. Environ. Sci. Technol. 2022, 56 (7), 4080–4090. https://doi.org/10.1021/acs.est.1c06655
[3] Di Sotto, S.; Viviani, M. Health Misinformation Detection in the Social Web: An Overview and a Data Science Approach. International Journal of Environmental Research and Public Health 2022, 19 (4), 2173. https://doi.org/10.3390/ijerph19042173
Contact
Ruben Dunkel