Modern scientific workflows can be very complex, involving many data sources, software components, and partial results. At the same time, many scientific workflows are not automated: they either require significant manual effort or depend on brittle, one-off scripts. As a result, scientists and data professionals struggle to manage experiments, collaborate, and reproduce results.
Data Workspaces (DWS) is an open source framework for managing scientific data and automating experiment workflows. Data Workspaces maintains the state of a science project, including data sets, intermediate data, results, and software. It supports reproducibility through snapshotting and lineage tracking and collaboration through a push/pull model inspired by version control systems for code.
The goal is to provide the reproducibility and collaboration benefits with minimal changes to your current projects and processes.
Capabilities
Data Workspaces lets you:
- Track and version all the different resources for your science project from one place.
- Automatically track the full history of your experimental results and generate relevant reports summarizing the results.
- Reproduce any prior experiment, including the source data, code, and configuration parameters used.
- Go back to a prior experiment as a “branching-off” point to explore additional permutations.
- Collaborate with others on the same project, sharing data, code, and results.
- Easily reproduce your environment on a new machine to parallelize work.
- Publish your environment for others to download and explore.
To get a sense of how DWS can benefit a project, we will use an example data analysis of historical temperature data from ocean buoys. This data has been captured for almost 50 years and is available online in text file format from various government and research organizations (e.g. the National Data Buoy Center). A Data Workspace for the analysis of this data uses the following resources:
- Source Data: The original data files collected from the National Data Buoy Center
- Code: Software to pre-process data and to run analytics
- Intermediate Data: Space to store intermediate results of the analyses
- Results: The results of the analyses
Under Data Workspaces, the workflow for this project would involve the steps described below. We first create an initial (empty) workspace and add the four kinds of resources to it. Note that resources such as source data and code may reside in local file systems or in their own local or remote repositories. For example, data can reside in a database, an NFS server, or an Amazon S3 bucket, and code can reside in its own version control repository. Once the resources are added to the workspace, they are transparently managed by DWS. The scientist can then run one or more experimental workflows on the data and take snapshots to precisely track the data, code, and parameters used to obtain a given result. Later, the state of the workspace (or of individual resources) can be restored to that of a previous snapshot. DWS can also generate reports of the snapshots and experimental results.
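The create, add, snapshot, and restore cycle described above can be illustrated with a small in-memory model. This is only a conceptual sketch and not the DWS API or implementation: the `ToyWorkspace` class, its method names, the resource names, the file contents, and the tags `run-1` and `run-2` are all hypothetical and exist purely to show the idea of recording a hash per resource and returning to an earlier state.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyWorkspace:
    """Hypothetical in-memory stand-in for a workspace (not the DWS API)."""

    # resource name -> {relative path: file content}
    resources: dict = field(default_factory=dict)
    # tag -> {"state": ..., "hashes": ..., "params": ...}
    snapshots: dict = field(default_factory=dict)

    def add_resource(self, name, files=None):
        """Register a resource, optionally with initial files."""
        self.resources[name] = dict(files or {})

    @staticmethod
    def _hash(files):
        """Short content hash over all paths and contents of one resource."""
        digest = hashlib.sha1()
        for path in sorted(files):
            digest.update(path.encode())
            digest.update(b"\0")
            digest.update(files[path].encode())
            digest.update(b"\0")
        return digest.hexdigest()[:7]

    def snapshot(self, tag, params):
        """Record the state, per-resource hashes, and parameters under a tag."""
        self.snapshots[tag] = {
            "state": {n: dict(f) for n, f in self.resources.items()},
            "hashes": {n: self._hash(f) for n, f in self.resources.items()},
            "params": dict(params),
        }
        return self.snapshots[tag]["hashes"]

    def restore(self, tag):
        """Reset every resource to the content saved under the given tag."""
        saved = self.snapshots[tag]["state"]
        self.resources = {n: dict(f) for n, f in saved.items()}


# Create an empty workspace and add the four kinds of resources.
ws = ToyWorkspace()
ws.add_resource("source-data", {"46026.txt": "placeholder buoy readings\n"})
ws.add_resource("code", {"analyze.py": "# placeholder analysis, version 1\n"})
ws.add_resource("intermediate-data")
ws.add_resource("results")

# First experiment: write a result, then snapshot.
ws.resources["results"]["summary.txt"] = "placeholder result 1\n"
first = ws.snapshot("run-1", {"buoy": 46026})

# Second experiment: the code changes, the source data does not.
ws.resources["code"]["analyze.py"] = "# placeholder analysis, version 2\n"
ws.resources["results"]["summary.txt"] = "placeholder result 2\n"
second = ws.snapshot("run-2", {"buoy": 46026})

# Go back to the first experiment as a branching-off point.
ws.restore("run-1")
```

In this toy model, the hash of `source-data` is identical across both snapshots while the hash of `code` differs, which is the kind of information that lets a snapshot pin down exactly what produced a result. A real workspace would delegate storage and versioning to the underlying repositories rather than copying file contents in memory.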
Further, a workspace can be published in a repository and shared with collaborators. DWS consolidates the different resources required to replicate the workspace on a different machine, and allows collaborators to share workspaces, experiments, and snapshots. The workspace, including dependencies, can be replicated and executed on a local machine as well as on a hosted service.
As the project progresses, various reports can be generated to show:
- The history of snapshots taken of the workspace (across all the collaborators) and the key metrics for each snapshot
- Detailed parameters and results of a given snapshot
- Detailed lineage data for the current state of the workspace or any past snapshot
For example, here is a snapshot history report from the Buoy Data Analysis project:
The report shows some previously taken snapshots of experiments. The tags are human-readable names that can be used to return to a specific state later. Each snapshot reports some metrics from the experiments. The “NaN”s indicate a failed state that was saved for later scrutiny.
Each snapshot keeps track of the state of the resources (as hashes), and also lineage information showing which resources were used in obtaining a result. Here is a visualization from the lineage report for snapshot ‘buoy-46026-final’ of the same project:
The nodes in this graph represent specific file paths within resources (e.g. data:/46026) and the edges represent step executions, which read from and write to resources. The sequences of letters and numbers following the resource names (e.g. 27f9344) are the hashes of the resource content that were captured during snapshots.
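One way to picture such a lineage graph is as a list of step executions, each with the resource paths it read and wrote. The sketch below is an assumption-laden illustration and not how DWS stores lineage: the `ResourceRef` and `StepExecution` types, the step names, the paths other than `data:/46026`, and all hash values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRef:
    """A node: a path within a resource plus the hash captured for it."""
    resource: str       # e.g. "data"
    path: str           # e.g. "/46026"
    content_hash: str   # placeholder hash value


@dataclass(frozen=True)
class StepExecution:
    """An edge bundle: one step run, with what it read and what it wrote."""
    name: str
    inputs: tuple
    outputs: tuple


def upstream(target, steps):
    """Return every ResourceRef that `target` transitively depends on."""
    # Map each output node to the step execution that produced it.
    producers = {out: step for step in steps for out in step.outputs}
    seen = set()
    stack = [target]
    while stack:
        ref = stack.pop()
        step = producers.get(ref)
        if step is None:
            continue  # a source node: nothing in the graph produced it
        for inp in step.inputs:
            if inp not in seen:
                seen.add(inp)
                stack.append(inp)
    return seen


# Hypothetical lineage for one analysis run (hashes are placeholders).
raw = ResourceRef("data", "/46026", "aaaaaaa")
clean = ResourceRef("intermediate-data", "/46026-clean", "bbbbbbb")
summary = ResourceRef("results", "/46026-summary", "ccccccc")

steps = [
    StepExecution("preprocess", inputs=(raw,), outputs=(clean,)),
    StepExecution("analyze", inputs=(clean,), outputs=(summary,)),
]

# Everything the final result was derived from.
ancestors = upstream(summary, steps)
```

Walking the graph backwards from a result yields every input that contributed to it, together with the content hash each input had at the time, which is what makes a past result traceable to exact versions of its data.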
Data Workspaces has been released as open source software and is available for use today. The project has a small user community centered around the Max Planck Institute for Software Systems and is looking for new users. We are happy to answer questions and discuss how Data Workspaces can be incorporated into your workflow.