Provenance for Reproducible Data Science

YouTube

Description

In science, results that peer scientists cannot reproduce have no value. Good practices for reproducible science are to publish the code used under open-source licenses, perform code reviews, capture computational environments in containers (e.g., Docker), use open data formats, use a data management system, and record the provenance of all actions.

The provenance of data describes in detail where that data came from. This includes information about ownership and about the actions and modifications performed on the data. With provenance information, data becomes traceable and users can be confident in the quality of the data. To specify and store provenance information, the W3C has standardized the provenance model PROV. Using PROV and associated implementations, users can record the provenance of data analytics processes. The resulting provenance information forms directed acyclic graphs, which can be analysed to gain insight into those processes.
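
To make the graph view concrete, here is a minimal sketch using only the Python standard library. It is not the PROV implementation presented in the talk. The relation names (used, wasGeneratedBy, wasAssociatedWith) come from the PROV data model, while every identifier (raw.csv, clean_data, alice, and so on) and both helper functions are hypothetical.

```python
from collections import defaultdict

# PROV-style relations as (subject, relation, object) triples.
# All identifiers below are hypothetical and only illustrate the model.
RELATIONS = [
    ("clean_data", "used", "raw.csv"),               # activity used entity
    ("clean.csv", "wasGeneratedBy", "clean_data"),   # entity generated by activity
    ("train_model", "used", "clean.csv"),
    ("model.pkl", "wasGeneratedBy", "train_model"),
    ("clean_data", "wasAssociatedWith", "alice"),    # activity associated with agent
    ("train_model", "wasAssociatedWith", "alice"),
]


def build_graph(relations):
    """Map each node to the nodes it points back to (its direct origins)."""
    graph = defaultdict(set)
    for subject, _relation, obj in relations:
        graph[subject].add(obj)
    return graph


def lineage(graph, node):
    """Return every node reachable from `node`, i.e. everything it depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(RELATIONS)
print(sorted(lineage(graph, "model.pkl")))
```

Because every edge points from a result back to what it was derived from, walking the graph from model.pkl collects the full chain of activities, input data, and responsible agents. This is the kind of traceability question that provenance analysis answers.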

The talk covers

Introduction to provenance and PROV
Modelling provenance for data processing
Python APIs for provenance recording
Provenance recording for Jupyter notebooks
Storing provenance in graph databases
Analysis of provenance information
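
As an illustration of what recording provenance from Python code can look like, the following sketch wraps a function in a decorator that logs each call as an activity. It is an assumption-laden toy, not the API covered in the talk: `record_provenance`, `PROVENANCE_LOG`, and `normalise` are hypothetical names, and the records are kept in an in-memory list rather than a PROV document or a graph database.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store; a real setup would write PROV records
# to a file or database instead.
PROVENANCE_LOG = []


def record_provenance(func):
    """Log each call of `func` as an activity with its inputs and output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "used": [repr(a) for a in args]
                    + [f"{k}={v!r}" for k, v in kwargs.items()],
            "generated": repr(result),
            "startedAtTime": started.isoformat(),
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def normalise(values):
    """Scale the values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


normalise([1, 3])
print(PROVENANCE_LOG[-1]["activity"], PROVENANCE_LOG[-1]["generated"])
```

The decorator keeps the recording logic out of the analysis code itself, so existing functions can be instrumented without changing their bodies.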
