Provenance Examples

Provenance Files Structure

All generated provenance files are uniquely named: each provenance file shares the unique identifier of the data file it describes. For example, the data file Structured-File-Lens-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq is accompanied by the provenance file Structured-File-Lens-provenance-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq. There are also provenance files whose unique identifiers do not match any data file. In our example, this is Structured-File-Lens-provenance-803d180d-7ae7-4ce9-8ec4-ef8de44a800d.nq. These files are generated by the provenance engine for upper-level activities such as main-activity, and they can be used to join all datasets generated within the same execution launched by a single trigger. For a further breakdown, see the example SPARQL queries below.

Example SPARQL Queries

Load trial data in Stardog

The following examples all use a Stardog database to execute the SPARQL queries; however, any triplestore of your choice can be used. To launch a trial version of Stardog, see their getting-started page.

To load the demo data into Stardog, run the following command. In this example, the project has been downloaded into the /data/Stardog/datalens-examples directory on the local filesystem, and the command mounts the output files into the Docker container's /data directory.

docker run \
  -d \
  -v /data/Stardog/datalens-examples/sflens-test-files/output-files/names-simple:/data \
  -p 5820:5820 \
  -e STARDOG_SERVER_JAVA_ARGS="-Xmx3g -Xms3g -XX:MaxDirectMemorySize=1g" \
  stardog/stardog:latest

Using the Stardog Studio software, you can create a database and run the following command to load the example files:

load <file:///data/Structured-File-Lens-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq> ;
load <file:///data/prov/Structured-File-Lens-provenance-803d180d-7ae7-4ce9-8ec4-ef8de44a800d.nq> ;
load <file:///data/prov/Structured-File-Lens-provenance-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq>

The whole dataset contains 39 triples in 2 graphs: http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d and http://www.data-lens.co.uk/ontology#ProvenanceGraph. The data itself is grouped only in the first graph; to confirm this, you can use the following SPARQL query:

prefix dlo: <http://www.data-lens.co.uk/ontology#>

SELECT ?s ?p ?o ?g
WHERE {
  { ?s ?p ?o }
  UNION
  {
    GRAPH ?g { ?s ?p ?o }
    FILTER(?g != dlo:ProvenanceGraph)
  }
}

With the output being:

| s | p | o | g |
| --- | --- | --- | --- |
| http://example.com/10001 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://example.com/Employee | http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d |
| http://example.com/10002 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://example.com/Employee | http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d |
| http://example.com/10001 | http://example.com/hasName | Alice | http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d |
| http://example.com/10002 | http://example.com/hasName | Bob | http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d |

Get provenance metadata

To obtain the metadata, it is enough to bring together the relations, starting from the quad:

For example:
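As a minimal sketch of such a query, assuming the provenance is expressed with PROV-O (prov:wasGeneratedBy, prov:startedAtTime, prov:endedAtTime, prov:used, prov:wasAssociatedWith) and that the agent properties (applicationName, friendlyName, version) live in the Data Lens ontology namespace; the exact property IRIs are assumptions and should be checked against the loaded provenance file:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dlo:  <http://www.data-lens.co.uk/ontology#>

SELECT ?generatedInProcess ?start ?end ?label ?inputFile ?agent
       ?applicationName ?friendlyName ?version
WHERE {
  GRAPH dlo:ProvenanceGraph {
    # start from the named graph (quad) holding the data
    <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>
        prov:wasGeneratedBy ?generatedInProcess .
    ?generatedInProcess prov:startedAtTime ?start ;
                        prov:endedAtTime   ?end ;
                        rdfs:label         ?label ;
                        prov:used          ?inputFile ;
                        prov:wasAssociatedWith ?agent .
    ?agent dlo:applicationName ?applicationName ;
           dlo:friendlyName    ?friendlyName ;
           dlo:version         ?version .
  }
}
```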

This obtains the following result:

| Field | Value |
| --- | --- |
| generatedInProcess | http://www.data-lens.co.uk/d07405d4-b388-4b35-b0e7-563185ecb5a7 |
| start | 2020-04-09T11:16:18.837+01:00 |
| end | 2020-04-09T11:16:19.507+01:00 |
| label | main-execution |
| inputFile | file:///var/local/input-file.csv |
| agent | http://www.data-lens.co.uk/740926a9-503f-43d6-993d-55ad0377b1e1 |
| applicationName | Structured-File-Lens |
| friendlyName | Structured-File-Lens |
| version | 1.3.2.0 |

Get data based on provenance metadata

Find data generated on 9 April 2020

The search should start by filtering all processes that ended on that day, using a SPARQL query:
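A minimal sketch of such a filter, assuming the provenance records process end times with PROV-O's prov:endedAtTime (the timezone offset matches the timestamps shown earlier):

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix dlo:  <http://www.data-lens.co.uk/ontology#>

SELECT ?process
WHERE {
  GRAPH dlo:ProvenanceGraph {
    ?process prov:endedAtTime ?end .
    # keep only processes that ended on 9 April 2020
    FILTER(?end >= "2020-04-09T00:00:00+01:00"^^xsd:dateTime &&
           ?end <  "2020-04-10T00:00:00+01:00"^^xsd:dateTime)
  }
}
```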

In the next step, we find all entities and graphs generated by those processes by adding the lines:
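A sketch of the extended query, assuming generated graphs are linked to their processes via prov:wasGeneratedBy; the added pattern is marked with a comment:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix dlo:  <http://www.data-lens.co.uk/ontology#>

SELECT ?process ?graph
WHERE {
  GRAPH dlo:ProvenanceGraph {
    ?process prov:endedAtTime ?end .
    FILTER(?end >= "2020-04-09T00:00:00+01:00"^^xsd:dateTime &&
           ?end <  "2020-04-10T00:00:00+01:00"^^xsd:dateTime)
    # added lines: the named graphs generated by the matched processes
    ?graph prov:wasGeneratedBy ?process .
  }
}
```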

As the outcome, we have two graphs:

Finally, we can update the query to display the data within those graphs; this is achieved by putting the above query into a subquery. Using a CONSTRUCT query, the output can then be presented in the form of a graph rather than a simple comma-delimited file. The final query therefore is:
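A sketch of such a CONSTRUCT query, with the graph-selection query from the previous step wrapped in a subquery; the property IRIs are the same assumptions as above:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix dlo:  <http://www.data-lens.co.uk/ontology#>

CONSTRUCT { ?s ?p ?o }
WHERE {
  {
    # subquery: select the graphs generated on 9 April 2020
    SELECT ?graph
    WHERE {
      GRAPH dlo:ProvenanceGraph {
        ?process prov:endedAtTime ?end .
        FILTER(?end >= "2020-04-09T00:00:00+01:00"^^xsd:dateTime &&
               ?end <  "2020-04-10T00:00:00+01:00"^^xsd:dateTime)
        ?graph prov:wasGeneratedBy ?process .
      }
    }
  }
  # outer query: pull the data triples out of the selected graphs
  GRAPH ?graph { ?s ?p ?o }
}
```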

Get data augmented with provenance

Find data generated on 9 April 2020 - Produce the augmented dataset with the process start time and the application version, using the RDF reification approach (Ref. RDF Reification).

We can reuse the SPARQL query designed in the previous example. Reification is a different approach from the one we used to generate the provenance: instead of grouping data in named graphs and describing the graphs, we describe single triples using statements. For example, for the triple:
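As an illustration, take one of the data triples loaded earlier:

```turtle
<http://example.com/10001> <http://example.com/hasName> "Alice" .
```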

We add extra statement triples (including a metadata statement):
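A sketch of the reified form of that triple, using the standard rdf:Statement vocabulary; the metadata property names are hypothetical, and the statement IRI is a placeholder (the document later notes it is built from an MD5 hash of the bound subject, predicate, and object):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dlo: <http://www.data-lens.co.uk/ontology#> .

<urn:statement:1> a rdf:Statement ;
    rdf:subject   <http://example.com/10001> ;
    rdf:predicate <http://example.com/hasName> ;
    rdf:object    "Alice" ;
    # hypothetical metadata properties carrying the provenance values
    dlo:startedAtTime "2020-04-09T11:16:18.837+01:00" ;
    dlo:version       "1.3.2.0" .
```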

The final query then becomes:

Please note that in lines 36 and 37 we create the IRI of the statement using an MD5 hash of the bound subject, predicate, and object. The statement is optionally linked with the old provenance entity graph in line 13, and in lines 14-15 we add the metadata properties.

The final augmented data then looks like: