Data Provenance for SQL

We explore new ways to derive the provenance (or lineage) of data items that flow through programs or queries. Once this provenance information has been derived, we know

  1. exactly which input items led the program (or query) to emit which output items (Why and Where Provenance), as well as

  2. which program parts were involved in the computation of each individual output item (How Provenance).
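
To make these two kinds of information concrete, here is a minimal Python sketch over in-memory rows. It is not the technique developed in this project. The names `Tracked`, `source`, `where`, and `join`, the sample tables, and the string labels for query parts are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A row plus the provenance collected for it (illustrative only)."""
    value: tuple       # the row itself
    inputs: frozenset  # ids of the input rows that contributed (item 1)
    parts: frozenset   # labels of the query parts involved (item 2)


def source(table_name, rows):
    """Tag each input row with an id such as 'orders:0'."""
    return [Tracked(row, frozenset({f"{table_name}:{i}"}), frozenset())
            for i, row in enumerate(rows)]


def where(rows, predicate, label):
    """Keep rows satisfying the predicate and record the filter's label."""
    return [Tracked(r.value, r.inputs, r.parts | {label})
            for r in rows if predicate(r.value)]


def join(left, right, condition, label):
    """Nested-loop join that merges the provenance of both sides."""
    return [Tracked(l.value + r.value,
                    l.inputs | r.inputs,
                    l.parts | r.parts | {label})
            for l in left for r in right
            if condition(l.value, r.value)]


# Assumed sample data: customers(id, name) and orders(cid, amount).
customers = source("customers", [(1, "Ada"), (2, "Bob")])
orders = source("orders", [(1, 30), (2, 5), (1, 70)])

# Roughly: SELECT * FROM customers c JOIN orders o ON c.id = o.cid
#          WHERE o.amount > 20
big = where(orders, lambda o: o[1] > 20, "WHERE amount > 20")
result = join(customers, big, lambda c, o: c[0] == o[0], "JOIN ON id = cid")

for row in result:
    print(row.value, sorted(row.inputs), sorted(row.parts))
```

Each output row carries the ids of the input rows it depends on and the labels of the query parts that processed it. A sketch like this only covers filters and joins written by hand in Python. Handling the constructs of a real SQL dialect is the harder problem the project addresses.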

Our exploration started with the analysis and instrumentation of Python programs used in Scientific Data Processing (in the context of the ScienceCampus Tübingen). We are now adapting the resulting techniques so that they apply to the derivation of data provenance for relational queries, SQL in particular. This has the potential to yield very fine-grained provenance information for substantially larger SQL dialects than have been considered so far.