Zoom*UserViews: Objectives

Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final, are derived by chaining and nesting multiple database searches and analytical tools. In many cases, the means by which the data were produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance.

Objectives

Zooming in on provenance through user views

We present a formal model of provenance for scientific workflows that is general (i.e., it can be used with existing workflow systems such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries encountered in a number of case studies. Our model not only takes into account the chained and nested structure of scientific workflows, but also allows users to see the workflow at different levels of abstraction by means of user views. User views can be used to vary the level of detail presented in response to provenance queries. Based on this model, we have developed a prototype, ZOOM*UserViews. We used this prototype in the first Provenance Challenge: we discussed the design and implementation of ZOOM in the context of the queries posed by the challenge, and showed how user views affect the level of granularity at which provenance information can be seen and reasoned about.
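To make the idea of a user view concrete, here is a minimal sketch, not the ZOOM*UserViews implementation. It assumes a run is recorded as (producer, data item, consumer) edges and that a user view is a mapping from workflow steps to composite modules; the names RUN_EDGES, USER_VIEW, visible_edges and provenance, as well as the example steps, are hypothetical.

```python
from collections import defaultdict

# Hypothetical run log: (producer step, data item, consumer step).
RUN_EDGES = [
    ("input", "d1", "format_a"),
    ("format_a", "d2", "align"),
    ("align", "d3", "format_b"),
    ("format_b", "d4", "build_tree"),
    ("build_tree", "d5", "output"),
]

# Hypothetical user view: steps grouped into composite modules.
# Steps that are not listed stay visible under their own name.
USER_VIEW = {
    "format_a": "Alignment",
    "align": "Alignment",
    "format_b": "TreeBuilding",
    "build_tree": "TreeBuilding",
}


def visible_edges(edges, view):
    """Keep only data items that cross a composite-module boundary."""
    lifted = [(view.get(p, p), d, view.get(c, c)) for p, d, c in edges]
    return [(p, d, c) for p, d, c in lifted if p != c]


def provenance(data_item, edges, view):
    """Return (modules, data items) upstream of data_item, as seen through view."""
    vis = visible_edges(edges, view)
    producer = {d: p for p, d, _ in vis}
    inputs = defaultdict(set)
    for _, d, c in vis:
        inputs[c].add(d)

    modules, data, stack = set(), set(), [data_item]
    while stack:
        d = stack.pop()
        if d in data or d not in producer:
            continue  # already visited, or hidden inside a composite module
        data.add(d)
        modules.add(producer[d])
        stack.extend(inputs[producer[d]])
    return modules, data


if __name__ == "__main__":
    # Coarse answer: intermediate items d2 and d4 are hidden by the view.
    print(sorted(provenance("d5", RUN_EDGES, USER_VIEW)[1]))  # ['d1', 'd3', 'd5']
    # Finest answer: the empty view exposes every step and data item.
    print(sorted(provenance("d5", RUN_EDGES, {})[1]))  # ['d1', 'd2', 'd3', 'd4', 'd5']
```

The same query over the same run returns a coarser or finer answer depending only on the view passed in, which is the effect described above.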

Constructing relevant user views

In this project, our goal is to help scientists construct user views so that reformatting tasks within the workflow are hidden and the tasks in which they are interested ("relevant" tasks) are seen. Provenance information can then be examined centered around relevant tasks. We have developed a notion of what a good user view is with respect to a given set of relevant tasks within a workflow, and an algorithm for generating a good user view.

This algorithm is also included in the Zoom*UserViews prototype.

Designing a workflow generator

Evaluating techniques involving scientific workflows is challenging since it requires realistic workflow specifications and runs on which to base the experiments. However, as with database schemas, scientific workflow specifications are often confidential and shared only with a small group of collaborators. Furthermore, there are no incentives to motivate scientists to share their specifications outside of publications (text), which describe the scientific result and only loosely describe the means by which results were obtained. It is consequently rare to find a publicly available, well-defined scientific workflow. We have therefore collected and analyzed roughly 30 workflow specifications to extract common patterns (such as looping or sequential execution) and use these patterns to develop a synthetic generator which allows the user to generate arbitrarily complex scientific workflow specifications.
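The pattern-based approach can be illustrated with a small sketch. This is not the project's generator: it assumes a specification is a directed graph, starts from a single source-to-sink edge, and repeatedly rewrites a randomly chosen edge using one of the two patterns named above (sequential execution or a loop). The function names, the node naming scheme and the seed parameter are hypothetical.

```python
import random


def generate_workflow(n_patterns, seed=0):
    """Return (nodes, edges) of a synthetic workflow specification.

    Each round picks an existing edge u -> v and splits it into
    u -> m -> v (sequential pattern). With the loop pattern, the new
    module m additionally gets a self-loop m -> m.
    """
    rng = random.Random(seed)  # seeded so that a generated workflow is reproducible
    nodes = ["source", "sink"]
    edges = [("source", "sink")]
    for i in range(n_patterns):
        u, v = edges.pop(rng.randrange(len(edges)))
        m = f"m{i}"
        nodes.append(m)
        edges += [(u, m), (m, v)]
        if rng.choice(["sequence", "loop"]) == "loop":
            edges.append((m, m))
    return nodes, edges


def reachable(start, edges):
    """Return the set of nodes reachable from start (used as a sanity check)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


if __name__ == "__main__":
    nodes, edges = generate_workflow(5, seed=42)
    print(len(nodes), "modules,", len(edges), "edges")
    print("sink reachable:", "sink" in reachable("source", edges))
```

Because every rewrite replaces an edge with a path through the new module, every module stays on some route from source to sink, and larger values of n_patterns yield arbitrarily large specifications.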

Computing workflow difference

Since a scientific workflow is an in-silico experiment, there are typically two phases to its use: an initial phase in which the workflow specification evolves, and a second phase in which the specification is stable but many runs are made using different inputs and parameters. In both phases, it is important to understand the difference between two runs (i.e., which parts of the executions were different, as well as the difference in parameters) in order to understand why results differed between the runs. We are developing a notion of edit operations between runs, and algorithms to find the smallest edit script between two runs. 'Go to to know more about this project!'
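As a simplified illustration of a smallest edit script, the sketch below treats each run as a linear sequence of (step, parameters) pairs and allows only insert and delete operations. This is an assumption made for the example: real runs are graphs, and the project's edit operations and algorithms are not described here. The sketch uses the classic longest-common-subsequence dynamic program, under which the script it returns is minimal for these two operations; the function name and the sample runs are hypothetical.

```python
def edit_script(run_a, run_b):
    """Return a minimal list of ("delete" | "insert", step) turning run_a into run_b."""
    n, m = len(run_a), len(run_b)
    # lcs[i][j] = length of the longest common subsequence of run_a[i:] and run_b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if run_a[i] == run_b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start, keeping matched steps and recording the rest.
    script, i, j = [], 0, 0
    while i < n and j < m:
        if run_a[i] == run_b[j]:
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append(("delete", run_a[i]))
            i += 1
        else:
            script.append(("insert", run_b[j]))
            j += 1
    script += [("delete", step) for step in run_a[i:]]
    script += [("insert", step) for step in run_b[j:]]
    return script


if __name__ == "__main__":
    # Hypothetical runs: the second changes a parameter and adds a filtering step.
    run_1 = [("load", "f1"), ("align", "gap=1"), ("tree", "nj")]
    run_2 = [("load", "f1"), ("align", "gap=2"), ("filter", "q>30"), ("tree", "nj")]
    for op, step in edit_script(run_1, run_2):
        print(op, step)
```

Because a step is compared together with its parameters, a changed parameter shows up as a delete followed by an insert of the same step, which covers both kinds of difference mentioned above.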

The workflow generator is available from the Tools menu of the prototype.

People

Current team

Previous members

  • Thunyarat (Bam) Amornpetchkul

Sponsors

This work is supported by the National Science Foundation* under Grant No. 0612177.

*Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.