PRESENTER: Hongrae Lee
TIME: Thu Nov 20, 2pm
LOCATION: room 304
TITLE: Efficient Provenance Storage
AUTHORS: Adriane P. Chapman, H. V. Jagadish, and Prakash Ramanan
ABSTRACT:
As the world is increasingly networked and digitized, the data we store
has more and more frequently been chopped, baked, diced and stewed. In
consequence, there is an increasing need to store and manage provenance
for each data item stored in a database, describing exactly where it
came from, and what manipulations have been applied to it. Storage of
the complete provenance of each data item can become prohibitively
expensive. In this paper, we identify important properties of provenance
that can be used to considerably reduce the amount of storage required.
We identify three techniques for doing so: a family of factorization
processes and two methods based on inheritance. We have used the techniques
described in this work to significantly reduce the provenance storage
costs associated with constructing MiMI [22], a warehouse of data
regarding protein interactions, as well as two provenance stores, Karma
[31] and PReServ [20], produced through workflow execution. In these
real provenance sets, we were able to reduce the size of the provenance
by up to a factor of 20. Additionally, we show that this reduced store
can be queried efficiently and further that incremental changes can be
made inexpensively.
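
ILLUSTRATION:
The abstract names factorization and inheritance without showing them.
The sketch below is only a rough illustration of the two general ideas,
not the algorithms from the paper. All names (factorize, lookup,
inherited_provenance) and the representation of a provenance record as a
tuple of step strings are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a provenance record is an ordered tuple of steps.
Provenance = Tuple[str, ...]


def factorize(item_prov: Dict[str, Provenance]) -> Tuple[List[Provenance], Dict[str, int]]:
    """Store each distinct provenance record once; items keep an integer reference."""
    table: List[Provenance] = []        # one entry per distinct record
    index: Dict[Provenance, int] = {}   # record -> position in table
    refs: Dict[str, int] = {}           # item id -> position in table
    for item, prov in item_prov.items():
        if prov not in index:
            index[prov] = len(table)
            table.append(prov)
        refs[item] = index[prov]
    return table, refs


def lookup(table: List[Provenance], refs: Dict[str, int], item: str) -> Provenance:
    """Recover the full provenance record of an item from the factored store."""
    return table[refs[item]]


def inherited_provenance(
    node: str,
    parent: Dict[str, str],
    explicit: Dict[str, Provenance],
) -> Optional[Provenance]:
    """Return the record stored at the node or at its nearest ancestor, if any."""
    current: Optional[str] = node
    while current is not None:
        if current in explicit:
            return explicit[current]
        current = parent.get(current)   # None once the root is passed
    return None


# Three items, two of which share identical provenance.
records = {
    "p1": ("import:sourceA", "merge"),
    "p2": ("import:sourceA", "merge"),
    "p3": ("import:sourceB",),
}
table, refs = factorize(records)

# A small hierarchy where only the root carries an explicit record.
parent = {"db/t/row1": "db/t", "db/t": "db"}
explicit = {"db": ("import:sourceA",)}
row_prov = inherited_provenance("db/t/row1", parent, explicit)
```

In this toy store, the shared record for p1 and p2 is kept once, and
db/t/row1 needs no record of its own because it resolves to the one
stored at db. How much space such schemes save on real data depends on
how much provenance is actually shared.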