RepeatFS: a file system providing reproducibility through provenance and automation.

Academic Article

Abstract

  • MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation.
  • RESULTS: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences.
  • AVAILABILITY AND IMPLEMENTATION: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs.
  • SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Authors

  • Westbrook, Anthony
  • Varki, Elizabeth
  • Thomas, W Kelley

Status

Publication Date

  • June 9, 2021

Published In

  • Bioinformatics (journal)

Keywords

  • Automation
  • Computational Biology
  • Reproducibility of Results
  • Software
  • Workflow

Digital Object Identifier (doi)

Start Page

  • 1292

End Page

  • 1296

Volume

  • 37

Issue

  • 9