Provenance Examples
Provenance Files Structure
All generated provenance files are uniquely named: the provenance file’s unique identifier is the same as that of the corresponding data file. For example, the data file Structured-File-Lens-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq
is accompanied by the provenance file Structured-File-Lens-provenance-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq
. There are also provenance files whose unique identifiers do not match any data file. In our example, this is Structured-File-Lens-provenance-803d180d-7ae7-4ce9-8ec4-ef8de44a800d.nq
. These provenance files are generated by the provenance engine for upper-level activities such as main-activity. They can be used to join all datasets generated within the same execution launched by a single trigger. For a further breakdown, see the example SPARQL queries below.
Example SPARQL Queries
Load trial data in Stardog
The following examples all use a Stardog database to execute the SPARQL queries; however, any triplestore of your choice can be used. To launch a trial version of Stardog, see their getting started page.
To load the demo data into Stardog, run the following command. In this example, the project has been downloaded into the /data/Stardog/datalens-examples
directory on the local filesystem, and the command mounts the output files into the Docker container’s /data
directory.
docker run \
-d \
-v /data/Stardog/datalens-examples/sflens-test-files/output-files/names-simple:/data \
-p 5820:5820 \
-e STARDOG_SERVER_JAVA_ARGS="-Xmx3g -Xms3g -XX:MaxDirectMemorySize=1g" \
stardog/stardog:latest
Using Stardog Studio, you can create a database and run the following commands to load the example files:
load <file:///data/Structured-File-Lens-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq> ;
load <file:///data/prov/Structured-File-Lens-provenance-803d180d-7ae7-4ce9-8ec4-ef8de44a800d.nq> ;
load <file:///data/prov/Structured-File-Lens-provenance-f07bb63d-05a0-4a15-99d1-9a478263bdc2.nq>
The whole dataset contains 39 triples in 2 graphs: http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d
and http://www.data-lens.co.uk/ontology#ProvenanceGraph
. The data itself is grouped only in the first graph; to confirm this, you can use the following SPARQL query:
prefix dlo: <http://www.data-lens.co.uk/ontology#>
SELECT ?s ?p ?o ?g WHERE {
  {
    ?s ?p ?o
  } UNION {
    GRAPH ?g {
      ?s ?p ?o
    }
    FILTER(?g != dlo:ProvenanceGraph)
  }
}
With the output being a table of bindings with columns s, p, o and g.
Get provenance metadata
To obtain the metadata, it is enough to bring together the relations starting from the quad.
For example:
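A query along the following lines joins these relations. The prov: and rdfs: predicates are standard PROV-O and RDFS terms, but the dlo: property names for the agent metadata (applicationName, friendlyName, version) are assumptions for illustration and may differ from the actual Data Lens ontology:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dlo:  <http://www.data-lens.co.uk/ontology#>

SELECT ?generatedInProcess ?start ?end ?label ?inputFile
       ?agent ?applicationName ?friendlyName ?version
WHERE {
  GRAPH dlo:ProvenanceGraph {
    # the entity (data graph) and the activity that produced it
    ?entity prov:wasGeneratedBy ?generatedInProcess .
    ?generatedInProcess
            prov:startedAtTime     ?start ;
            prov:endedAtTime       ?end ;
            rdfs:label             ?label ;
            prov:used              ?inputFile ;
            prov:wasAssociatedWith ?agent .
    # agent metadata -- these dlo: property names are assumptions
    OPTIONAL {
      ?agent dlo:applicationName ?applicationName ;
             dlo:friendlyName    ?friendlyName ;
             dlo:version         ?version .
    }
  }
}
```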
Which obtains the following result:
generatedInProcess | start | end | label | inputFile | agent | applicationName | friendlyName | version |
---|---|---|---|---|---|---|---|---|
 | 2020-04-09T11:16:18.837+01:00 | 2020-04-09T11:16:19.507+01:00 | main-execution | file:///var/local/input-file.csv | | Structured-File-Lens | Structured-File-Lens | 1.3.2.0 |
Get data based on provenance metadata
Find data generated on 9 April 2020
The search should start by filtering all processes that ended on that day, using a SPARQL query:
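A sketch of such a filter, assuming the processes carry a prov:endedAtTime timestamp as shown in the metadata result above:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?process WHERE {
  ?process prov:endedAtTime ?end .
  # keep only processes that ended on 9 April 2020
  FILTER(?end >= "2020-04-09T00:00:00+01:00"^^xsd:dateTime &&
         ?end <  "2020-04-10T00:00:00+01:00"^^xsd:dateTime)
}
```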
In the next step, we find all entities and graphs generated by those processes by adding the lines:
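Assuming each generated graph is described as an entity that prov:wasGeneratedBy its process, the added lines would take roughly this shape (with ?g also added to the SELECT clause):

```sparql
# graphs (entities) generated by the matched processes
?g prov:wasGeneratedBy ?process .
```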
As the outcome, we have two graphs:
Finally, we can update the query to display the data within those graphs; this is achieved by putting the above query into a subquery. Using a CONSTRUCT
query, the output can then be presented in the form of a graph rather than a simple comma-delimited file. The final query therefore is:
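Under the same assumptions as the steps above, the final query might be sketched as:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT { ?s ?p ?o }
WHERE {
  # data from the selected graphs
  GRAPH ?g { ?s ?p ?o }
  # subquery: graphs generated by processes that ended on 9 April 2020
  {
    SELECT DISTINCT ?g WHERE {
      ?g       prov:wasGeneratedBy ?process .
      ?process prov:endedAtTime    ?end .
      FILTER(?end >= "2020-04-09T00:00:00+01:00"^^xsd:dateTime &&
             ?end <  "2020-04-10T00:00:00+01:00"^^xsd:dateTime)
    }
  }
}
```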
Get data augmented with provenance
Find data generated on 9 April 2020, and produce the augmented dataset with the process start time and the application version using the RDF reification approach (Ref. RDF Reification).
We can reuse the SPARQL query designed in the previous example. Reification is a different approach from the one we used to generate the provenance: instead of grouping data in named graphs and describing the graphs, we describe single triples using statements. For example, for a given triple:
We add extra statement triples (including a metadata statement):
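As an illustration, a triple ex:subject ex:predicate ex:object would be reified along these lines. The rdf: vocabulary is standard; the ex: names and the dlo:versionUsed metadata property are placeholders, not taken from the actual Data Lens output:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .
@prefix dlo: <http://www.data-lens.co.uk/ontology#> .

ex:statement1 a             rdf:Statement ;
              rdf:subject   ex:subject ;
              rdf:predicate ex:predicate ;
              rdf:object    ex:object ;
              # metadata statement attached to the reified triple
              dlo:versionUsed "1.3.2.0" .
```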
The final query then becomes:
Please note that in lines 36 and 37 we create the IRI of the statement using an MD5 hash of the bound subject, predicate and object. The statement is optionally linked with the old provenance entity graph in line 13, and in lines 14 and 15 we add the metadata properties.
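As a sketch of the statement-IRI construction described above, SPARQL's built-in MD5, CONCAT, STR and IRI functions can be combined as follows (the urn:statement: prefix is an arbitrary choice for illustration):

```sparql
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT {
  ?st a             rdf:Statement ;
      rdf:subject   ?s ;
      rdf:predicate ?p ;
      rdf:object    ?o .
}
WHERE {
  GRAPH ?g { ?s ?p ?o }
  # mint a stable statement IRI from an MD5 hash of the bound s, p and o
  BIND(IRI(CONCAT("urn:statement:",
                  MD5(CONCAT(STR(?s), STR(?p), STR(?o))))) AS ?st)
}
```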
The final augmented data then looks like: