Manually Create a Mapping File

Intro

This guide outlines all the elements required in a valid mapping file for use in a Lens, and explains how to create one of your own from scratch. Mapping files are written in RML, the RDF Mapping Language, which is defined as a superset of the W3C-recommended mapping language R2RML (which maps data in relational databases to RDF). RML is a generic, scalable mapping language defined to express rules that map data in heterogeneous structures and serialisations to the RDF data model. RML mappings are themselves RDF graphs written in Turtle syntax, so we save all our mapping files with the .ttl file extension. All sample files in this document can be found and downloaded here; start up your own Lens to test them for yourself.

 




Overview and Example

The mapping refers to the Logical Source to retrieve data from an input source. This consists of:

  1. A reference to the input source

  2. The Reference Formulation to specify how to refer to the data

  3. The iterator that specifies how to iterate over the data

Each logical source is mapped to RDF using a triples map. The triples map is a rule that maps each data element to a number of RDF triples. The rule has two main parts:

  1. A subject map that generates the subject of all RDF triples that will be generated from a data element. The subjects are often IRIs generated from the primary key or ID portion of the data.

  2. Multiple predicate-object maps that in turn consist of predicate maps and object maps.

Triples are produced by combining the subject map with a predicate map and object map, and applying these three to each element in the source data. For example, the complete rule for generating a set of triples might be:

  • Subjects: A template http://example.com/employee/{id} is used to generate subject IRIs from the id column or element.

  • Predicates: The constant IRI ex:name is used.

  • Objects: The value of the name element or column is used to produce an RDF literal.

By default, all RDF triples of the output dataset are in the default graph for Data Lens - this is represented by the Data Lens prefix followed by the unique identifier for that execution of the lens, for example, <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>. This UUID can be used in conjunction with the Provenance to ascertain specific meta-data about the execution of the Lens. In addition to the default graph, a specific Named Graph can be specified in the mapping whereby an additional triple is generated for each element.

[Figure: An overview of R2RML]

In this document, there will be a number of examples. To clearly discern which examples refer to what, they will use the following colour codes:

  • Full RML mapping files will be blue
  • Mapping RML snippets will be purple
  • Example input data will be yellow

 


Mapping File

This is an example of a mapping file which we will break down in more detail in the sections below. Its aim is to take the id column as the Subject, and the name column as the Object, with ex:hasName as the Predicate.
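The exact file is included with the sample downloads; the following is a minimal sketch of such a mapping, using the logical source name inputSourceFile.csv and the ex prefix (http://example.com/) assumed throughout this guide:

    @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
    @prefix rr:  <http://www.w3.org/ns/r2rml#> .
    @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
    @prefix ex:  <http://example.com/> .
    @base        <http://example.com/base/> .

    <#EmployeeMapping>
        # Where and how to read the input data
        rml:logicalSource [
            rml:source "inputSourceFile.csv" ;
            rml:referenceFormulation ql:CSV
        ] ;

        # Subject IRI built from the id column
        rr:subjectMap [
            rr:template "http://example.com/{id}"
        ] ;

        # One predicate-object map: ex:hasName with the name column as a literal
        rr:predicateObjectMap [
            rr:predicate ex:hasName ;
            rr:objectMap [
                rml:reference "name" ;
                rr:termType rr:Literal
            ]
        ] .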

Input File

Using this very basic CSV file as an example input file, we can see how it might be transformed into RDF using the Structured File Lens.

    id,name
    10001,Alice
    10002,Bob

 


Output File

Using the mapping file, and passing in the input source file, the following output NQuads file will be generated. The NQuads are represented as ?Subject ?Predicate ?Object ?NamedGraph, as defined through the mapping and as populated with the source data.

Subject:    <http://example.com/10001>
Predicate:  <http://example.com/hasName>
Object:     "Alice"
Graph:      <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>


 

Namespaces - Prefix and Base

Specifying a namespace prefix allows you to bind an IRI to an abbreviation for later use. As seen in the example, this is done in the following way:
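    # Any of the prefixes in the table below can be declared in the same way
    @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
    @prefix rr:  <http://www.w3.org/ns/r2rml#> .
    @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
    @prefix ex:  <http://example.com/> .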

The most common prefixes and ones that will be used within this document are:

Prefix   IRI                                           Description
rml      http://semweb.mmlab.be/ns/rml#                RML ontology
rr       http://www.w3.org/ns/r2rml#                   The R2RML ontology, which is extended by RML
ql       http://semweb.mmlab.be/ns/ql#                 The Query Language vocabulary, which is used together with RML
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#   The RDF Concepts Vocabulary
rdfs     http://www.w3.org/2000/01/rdf-schema#         The RDF Schema vocabulary
xsd      http://www.w3.org/2001/XMLSchema#             The XML Schema Definition namespace
schema   http://schema.org/                            The schema.org vocabulary
dbo      http://dbpedia.org/ontology/                  The DBpedia ontology
ex       http://example.com/                           An example prefix used for our RML rules

 

A Base IRI is used in resolving relative IRIs produced by the RML mapping. According to the R2RML spec, the base IRI must be a valid IRI - it should not contain question mark (“?”) or hash (“#”) characters and should end in a slash (“/”) character.
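For example, a base declaration such as the following could be added alongside the prefixes (the IRI itself is illustrative):

    @base <http://example.com/base/> .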

 


 

TriplesMap

A Triples Map defines rules to generate zero or more RDF triples sharing the same subject. A Triples Map consists of a Logical Source, a Subject Map and zero or more Predicate-Object Maps. The example mapping file snippets in this section will all correspond to the following input files:

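The original sample inputs are available with the downloadable files; the sketches below illustrate the shape of data the snippets in this section assume, using the id, firstname, lastname, and occupation fields referenced throughout this guide.

CSV / SQL / XLSX / ODS:

    id,firstname,lastname,occupation
    10001,Alice,Johnson,Tech
    10002,Bob,Smith,Sales

JSON (note the employees array and the nested role object; the exact nesting is illustrative):

    [
      {
        "employees": [
          { "id": "10001", "firstname": "Alice", "lastname": "Johnson", "role": { "occupation": "Tech" } },
          { "id": "10002", "firstname": "Bob", "lastname": "Smith", "role": { "occupation": "Sales" } }
        ]
      }
    ]

XML (note that id is an attribute rather than an element):

    <employees>
      <employee id="10001">
        <firstname>Alice</firstname>
        <lastname>Johnson</lastname>
        <role>
          <occupation>Tech</occupation>
        </role>
      </employee>
      <employee id="10002">
        <firstname>Bob</firstname>
        <lastname>Smith</lastname>
        <role>
          <occupation>Sales</occupation>
        </role>
      </employee>
    </employees>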

 

 

The exact same structure will be used in a SQL table to represent examples for the SQL Lens, using the table name employees.

Note how the id value has been described in this XML file as an attribute instead of an element.

 

LogicalSource - Structured Files

As previously mentioned, a Logical Source consists of:

  1. A reference to the input source

    1. As defined by - rml:source "inputSourceFile.csv"

    2. When using the Lenses, the input source file is renamed before processing to match what was specified in the logical source parameter when triggering the process, which in this case is inputSourceFile.csv. So ensure the TriplesMaps that correspond to the file you want to process use the same matching logical sources.

  2. The Reference Formulation clarifies which data format is parsed and how the references to the extracts of data are defined

    1. As defined by - rml:referenceFormulation ql:CSV

    2. The following reference formulations are predefined (but not exhaustive): ql:CSV, ql:JSONPath, ql:XPath, and ql:SaXPath for use in the Structured File Lens, and just ql:JSONPath for use in the RESTful Lens.

  3. The iterator that specifies how to iterate over the data

    1. RML needs an iterator to process data that does not have an explicit iteration pattern (as SQL tables and CSV files do); this is defined with the iterator property - rml:iterator

    2. JSON iterator defined using JSONPath - rml:iterator "$.[*].employees"

    3. XML iterator defined using XPath - rml:iterator "/employees"

 

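Sketches of the logical source for each format are shown below (the file names are illustrative; they must match the logical source name specified when triggering the Lens):

    # CSV - no iterator required
    rml:logicalSource [
        rml:source "inputSourceFile.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;

    # JSON - iterated with JSONPath
    rml:logicalSource [
        rml:source "inputSourceFile.json" ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.[*].employees"
    ] ;

    # XML - iterated with XPath, using the SAX parser
    rml:logicalSource [
        rml:source "inputSourceFile.xml" ;
        rml:referenceFormulation ql:SaXPath ;
        rml:iterator "/employees"
    ] ;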

The Logical Source of a mapping file designed for CSV input files does not have an iterator. This is due to CSVs having an explicit iteration pattern, and therefore it does not need to be defined.

As seen, this iterator uses JSONPath to specify how to reference the data source. This simple path gets all of the elements in the base array and returns everything in the employees object to then be later iterated over in the subject maps and predicate object maps.

When parsing XML within the Structured File Lens, there are two different ways to generate your output data. One approach utilises a SAX parser, which allows for incredibly fast XML processing. The second approach uses a DOM parser, which, while slower, does allow complex XPath iterators to be used.

Our recommendation is to use the SAX parser where possible; this is done by declaring the reference formulation ql:SaXPath, as above. However, if you are using complex XPath expressions in your subject or object maps that utilise parent elements (../), then please use the DOM parser by specifying rml:referenceFormulation ql:XPath.

 

LogicalSource - SQL Database

The logical source when defining a mapping file for use against a SQL DB in the SQL Lens differs slightly from structured files. There are currently two main ways to retrieve data from your databases - the first allows you to retrieve all data from a table, and the second enables queries to be defined for explicit data retrieval. First, you must specify two new prefix namespaces in your mapping:
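At minimum this includes the D2RQ namespace, which is used by the database source definition below; check the sample files for the full set of declarations:

    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .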

 

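A sketch of the select-all form of the logical source, following the employees table used in the examples:

    rml:logicalSource [
        rml:source <#DB_source> ;
        rr:sqlVersion rr:SQL2008 ;
        rr:tableName "employees"
    ] ;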

This first example demonstrates retrieving all the data from that specified database table (employees).

The rml:source references a database source that specifies the details and credentials of the database to target, as shown in the database source sketch below. This would usually be stored at the end of the mapping file.

The rr:sqlVersion is an identifier for a SQL version; this will always be rr:SQL2008.

The rr:tableName is the name of the table to retrieve all the data from.
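A sketch of the second form, using a custom query (the SQL itself is illustrative):

    rml:logicalSource [
        rml:source <#DB_source> ;
        rr:sqlVersion rr:SQL2008 ;
        rml:query """
            SELECT id, firstname, lastname, occupation
            FROM employees
            LIMIT {{{queryLimit}}} OFFSET {{{queryOffset}}}
        """ ;
        rml:referenceFormulation ql:CSV
    ] ;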

This second example demonstrates the ability to define a query within your mapping file. This is done using rml:query with the query surrounded in triple double-quotes """ [query] """. When using your own query, you must then specify the reference formulation to be CSV, as that is what the returned results will be parsed as. This is done with rml:referenceFormulation ql:CSV.

Also seen in this example is the use of a LIMIT and OFFSET. As described in the SQL Lens Documentation, there is the ability to automatically limit your queries to improve the performance of requests. For this to work, the query in the mapping file needs to follow the structure above: in place of the limit and offset numbers, the placeholders {{{queryLimit}}} and {{{queryOffset}}} are written with three sets of curly brackets.
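A sketch of the database source definition that both forms reference (the URL, driver, and credentials shown are placeholders):

    <#DB_source> a d2rq:Database ;
        d2rq:jdbcDSN    "jdbc:mysql://localhost:3306/employeesdb" ;
        d2rq:jdbcDriver "com.mysql.cj.jdbc.Driver" ;
        d2rq:username   "user" ;
        d2rq:password   "password" .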

And finally, the database source; this stores the details and credentials of your database. This excerpt usually exists at the end of your mapping file, see the full examples below for more details.

The d2rq:jdbcDSN specifies the URL of your database, and the d2rq:jdbcDriver specifies the JDBC driver for your DB type. See here for the correct configuration of your URL and driver.

The d2rq:username and d2rq:password allow you to specify your security authentication credentials.

All these properties are required when defining your database. Additional options can also be given by following the same structure, for example by adding:
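    # Illustrative only - a standard D2RQ property; check the SQL Lens documentation
    # for which additional options are supported
    d2rq:fetchSize "500" ;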

 

SubjectMap

The Subject Map consists of the URI pattern that defines how each triple's subject is generated, and optionally you can also define its type.

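Sketches of the subject map for each case (the Employee class IRI follows the examples in this guide):

    # CSV / JSON / SQL
    rr:subjectMap [
        rr:template "http://example.com/{id}" ;
        rr:class ex:Employee
    ] ;

    # XML - id is an attribute, so it is referenced with @
    rr:subjectMap [
        rr:template "http://example.com/{@id}" ;
        rr:class ex:Employee
    ] ;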

rr:template allows you to define the IRI of the subject by concatenating plain text with data elements from the source data. In the above example, we concatenate http://example.com/ with the id value of the element. More than one element can also be used, for example: rr:template "http://example.com/{firstname}_{surname}".

While optional, rr:class allows you to define its type, whereby in this case we are giving it type http://example.com/Employee. This property must be an IRI. Mappings where the class IRI is not constant, but instead needs to be computed based on the contents of the input source, can be achieved by defining a predicate-object map with predicate set to rdf:type and a non-constant object map, as seen in the following section.

Defining the subject map for XML documents is the same as for the other file types, except, as noted previously with our XML example input file, the id value is represented as an attribute in the XML. Using standard XPath, we can access this using the @ symbol as seen above.

 

PredicateObjectMap

A triples map specifies the rules for translating each row of a database table, each record of a CSV data source, each element of an XML data source, or each object of a JSON data source into zero or more RDF triples. The triples are generated using Predicate Object Maps, each of which consists of a Predicate Map and Object Map(s). A Predicate Map specifies how the triple's predicate is generated, and an Object Map specifies how the triple's object(s) are generated.

Within our Predicate Object Map, we must declare our predicate and our object. First, we assign our Predicate to be a constant-valued term map; this means that it ignores the logical iterator and will always generate the same RDF term. This is represented using the constant shortcut property rr:predicate followed by a mandatory IRI. As seen in the following example, we are using our namespace prefix ex, which results in the constant http://example.com/hasName.

Secondly, our object is defined through the use of an rr:objectMap. This allows us to specify the object as a template, as well as the ability to set a number of properties. A template-valued term map is a term map represented by a resource that has exactly one rr:template property, whose value must be a valid string template. A string template is a format string that can be used to build strings from multiple components; it can reference logical references by enclosing them in curly braces (“{” and “}”). As seen in the following example, we are simply referencing the firstname element/column whilst prepending the string http://example.employees.com/. When not specifying a term type, the template must equate to a valid IRI. Note: this predicate object map example is the same across the different data source types because the XPath, JSONPath, CSV, and SQL references are identical.
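A sketch of such a predicate object map:

    rr:predicateObjectMap [
        rr:predicate ex:hasName ;
        rr:objectMap [
            rr:template "http://example.employees.com/{firstname}"
        ]
    ] ;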

This simply results in the predicate being equal to <http://example.com/hasName> and the object, by referencing the firstname element of a record in the data, equaling, for example, <http://example.employees.com/Alice>.

An alternative to rr:template is rml:reference. This simply refers directly to a column in a database, a record in a CSV data source, an element in an XML data source, or an object in a JSON data source. A reference must be a valid identifier for the reference formulation (rml:referenceFormulation) specified; this can be an absolute path, or a path relative to the iterator specified in the logical source. For example, rr:template "{firstname}" simply becomes rml:reference "firstname". As such, you are not able to specify any string template outside of the reference.

 

Additional Properties

The next example shows the use of solely the reference within the rr:template. The following syntax rules apply to valid string templates:

  • Pairs of unescaped curly braces must enclose valid references according to the specified reference formulation.

  • Curly braces that do not enclose references must be escaped by a backslash character (“\”). This also applies to curly braces within references.

  • Backslash characters (“\”) must be escaped by preceding them with another backslash character, yielding “\\”. This also applies to backslashes within references.

  • There should be at least one pair of unescaped curly braces.

  • If a template contains multiple pairs of unescaped curly braces, then any pair should be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either reference; or in the IRI-safe versions of the data values, if the term type is rr:IRI.

In addition to this, as mentioned previously, from within the declaration of the object map we can specify a number of optional properties, including:

  • rr:termType: The term type determines the kind of generated RDF term. Its value must be an IRI and must be one of the following options: rr:IRI, rr:BlankNode, or rr:Literal. This defaults to rr:IRI; however, if you wish to make use of a data type or language tag, use rr:Literal.

  • rr:datatype: The data type determines the data type of the generated RDF term. Its value must be an IRI and can only be specified on a datatypeable term map, that is, a term map with a term type of rr:Literal that does not have a specified language tag. We recommend the use of the XML Schema datatypes, using the xsd namespace prefix.

  • rr:language: A specified language tag causes generated literals to be language-tagged plain literals, and as such this may only be specified on term maps of rr:Literal with no data type. The value must be a valid language tag.

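Sketches of the two variants, applied to the firstname reference used above:

    # Data Type
    rr:objectMap [
        rr:template "{firstname}" ;
        rr:termType rr:Literal ;
        rr:datatype xsd:string
    ] ;

    # Language
    rr:objectMap [
        rr:template "{firstname}" ;
        rr:termType rr:Literal ;
        rr:language "en-gb"
    ] ;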

With the data type variant, the resulting object would equal "Alice" of type xsd:string.

With the language variant, the resulting object would equal "Alice"@en-gb.

 

Named Graphs

Each triple generated from an RML mapping is placed into one or more graphs of the output dataset. As previously mentioned, by default, all RML mappings are put into a Data Lens graph - this is represented by the Data Lens prefix followed by the unique identifier for that execution of the Lens, for example, <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>. This UUID can be used in conjunction with the Provenance to ascertain specific meta-data about the execution of the Lens.

In addition to this default graph, specific IRI-named Named Graphs can be specified in the mapping. Any subject map or predicate-object map may have one or more associated graph maps. These are specified by using the constant shortcut property rr:graph.
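A sketch of a predicate object map carrying an additional graph map (the graph IRI matches the XML example later in this guide):

    rr:predicateObjectMap [
        rr:predicate ex:hasName ;
        rr:objectMap [ rml:reference "firstname" ] ;
        rr:graph ex:EmployeesGraph
    ] ;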

 

As outlined, this example predicate object map would result in two generated quads, one in the Data Lens named graph, and one in the user-specified http://example.com/EmployeesGraph named graph.

 


 

Examples

This section will provide you with full examples of mapping files, along with the input data and expected output triples. Copying these files and using them directly with your Lens will result in similar output data.

CSV

This basic CSV example is designed to take the id column as the subject, and set it to type http://example.com/Employee. Then, from each record, a combination of the firstname and the lastname is taken to be a literal string object for the http://example.com/name predicate. And finally, the occupation is set to be an “en-gb” language-tagged literal object.
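The exact file is included in the sample downloads; the following is a sketch of a mapping that behaves as described (the ex:occupation predicate follows the other examples in this guide):

    @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
    @prefix rr:  <http://www.w3.org/ns/r2rml#> .
    @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <http://example.com/> .
    @base        <http://example.com/base/> .

    <#EmployeeMapping>
        rml:logicalSource [
            rml:source "inputSourceFile.csv" ;
            rml:referenceFormulation ql:CSV
        ] ;

        rr:subjectMap [
            rr:template "http://example.com/{id}" ;
            rr:class ex:Employee
        ] ;

        rr:predicateObjectMap [
            rr:predicate ex:name ;
            rr:objectMap [
                rr:template "{firstname} {lastname}" ;
                rr:termType rr:Literal ;
                rr:datatype xsd:string
            ]
        ] ;

        rr:predicateObjectMap [
            rr:predicate ex:occupation ;
            rr:objectMap [
                rml:reference "occupation" ;
                rr:termType rr:Literal ;
                rr:language "en-gb"
            ]
        ] .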

XLSX / ODS (Spreadsheet)

An example for spreadsheet data would be exactly the same as above for the CSV data, with a slight change to the logicalSource, which instead references the spreadsheet input file and its corresponding reference formulation.

JSON

This basic JSON example follows a similar structure to the CSV example, whereby the id value is taken to be the subject. The ex:name object has not been assigned a termType, dataType, or language, and therefore it must be an IRI. The structure of this JSON is a little more complicated than the CSV; the rml:iterator and the ex:occupation object both show how JSONPath is used in the mappings. By following the concatenation of the two, $.[*].employees.[*].role.occupation, we reach the occupation value for each employee.
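A sketch of the JSON-specific parts of such a mapping (the exact split between iterator and reference is illustrative; together they form the path described above):

    rml:logicalSource [
        rml:source "inputSourceFile.json" ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.[*].employees"
    ] ;

    rr:predicateObjectMap [
        rr:predicate ex:occupation ;
        rr:objectMap [ rml:reference "[*].role.occupation" ]
    ] ;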

XML

This basic XML example also follows similar structures to the CSV and JSON examples, taking id as the subject, and the firstname and lastname as the ex:name predicate object. However, the subject value is an attribute of the XML, and the object map contains a user-defined named graph, ex:EmployeesGraph; note the additional NQuads generated in the output as a result. Similarly to JSON, the XPath concatenation of the iterator and the reference in the template, /employees/employee/role/occupation, reaches the occupation value for each employee. And finally, as no parent nodes are being used in any of the XPath queries, we are using ql:SaXPath as the rml:referenceFormulation.
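A sketch of the XML-specific parts (the iterator and reference split is illustrative):

    rml:logicalSource [
        rml:source "inputSourceFile.xml" ;
        rml:referenceFormulation ql:SaXPath ;
        rml:iterator "/employees/employee"
    ] ;

    rr:subjectMap [
        rr:template "http://example.com/{@id}"
    ] ;

    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [
            rr:template "{firstname} {lastname}" ;
            rr:termType rr:Literal
        ] ;
        rr:graph ex:EmployeesGraph
    ] ;

    rr:predicateObjectMap [
        rr:predicate ex:occupation ;
        rr:objectMap [ rml:reference "role/occupation" ]
    ] ;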

SQL

This basic SQL example is most similar to the CSV example, as the two share a similar data structure. We are just using the id, firstname and lastname to create our subject predicate objects, but this example's main purpose is to target a database and its tables. The source DB is referenced in the rml:source as <#DB_source>, which can then be seen at the end of the file. Here the table name is specified; this will then return all records and values from this table.

    id      firstname   lastname   occupation
    10001   Alice       Johnson    Tech
    10002   Bob         Smith      Sales

SQL with Custom Query

This SQL example builds on the previous one, but here we also include a custom SQL query. Note that we are using a MySQL DB, so the syntax may differ. Setting up the rml:source and rr:sqlVersion is the same as before, however now we are dropping the rr:tableName in favour of rml:query and rml:referenceFormulation ql:CSV. As well as the id, firstname and lastname subject predicate objects, we are also taking the data from the newly formulated namelength column. This is done in the exact same way as before, by setting a predicate, in this case ex:namelength, then defining the object: a reference to the value as a literal int.
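A sketch of what such a query could look like inside the logical source (the exact SQL is illustrative; the namelength column is derived in the query):

    rml:logicalSource [
        rml:source <#DB_source> ;
        rr:sqlVersion rr:SQL2008 ;
        rml:query """
            SELECT id, firstname, lastname,
                   CHAR_LENGTH(CONCAT(firstname, lastname)) AS namelength
            FROM employees
            LIMIT {{{queryLimit}}} OFFSET {{{queryOffset}}}
        """ ;
        rml:referenceFormulation ql:CSV
    ] ;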

    id      firstname   lastname   occupation
    10001   Alice       Johnson    Tech
    10002   Bob         Smith      Sales

 


 

Functions

The Lenses of Data Lens allow for functions to be included within the mappings. These functions allow for raw data to be transformed or filtered on its way to being translated to RDF. We support a wide array of functions, including the majority of the GREL string functions. For a full list of the supported functions included in the latest releases of the Structured File Lens, SQL Lens, and RESTful Lens, see here.

Including a function in your mapping file is fairly simple, as outlined below. When a function is used, a triple will not be generated if the subject or object is provided with a NULL value. Strings passed into functions may be sourced from an rml:reference (the entire value of a field in the source document), an rr:template (a string that includes values of fields in the source document, as well as manually specified strings), or the output of other functions (functions may therefore be nested).

 

Prefixes

In order to make use of functions, the first thing you must do is add the function namespace prefixes to the beginning of your mapping file.
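The declarations below show the namespaces commonly used for RML functions (fnml, fno, grel, and idlab-fn); check the supported-functions list for the exact IRIs used by your Lens version:

    @prefix fnml:     <http://semweb.mmlab.be/ns/fnml#> .
    @prefix fno:      <https://w3id.org/function/ontology#> .
    @prefix grel:     <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .
    @prefix idlab-fn: <http://example.com/idlab/function/> .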

 

Structure of a Function

To describe the structure of a function, we will use a simple example. This function takes an input string and converts it to all uppercase.
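A sketch of how such a function is described, using the GREL toUpperCase function and its valueParameter input (here fed from the firstname reference used throughout this guide; check the supported-functions list for exact function names):

    fnml:functionValue [
        # Which function to execute
        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [ rr:constant grel:toUpperCase ]
        ] ;
        # The input argument passed to the function
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter ;
            rr:objectMap [ rml:reference "firstname" ]
        ]
    ]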

The function described above is inserted into a mapping file from within a rr:subjectMap or rr:objectMap. For example:
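    # Illustrative placement inside an object map (ex:uppercaseName is an example predicate)
    rr:predicateObjectMap [
        rr:predicate ex:uppercaseName ;
        rr:objectMap [
            fnml:functionValue [
                rr:predicateObjectMap [
                    rr:predicate fno:executes ;
                    rr:objectMap [ rr:constant grel:toUpperCase ]
                ] ;
                rr:predicateObjectMap [
                    rr:predicate grel:valueParameter ;
                    rr:objectMap [ rml:reference "firstname" ]
                ]
            ] ;
            rr:termType rr:Literal
        ]
    ] ;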

Therefore, as you may have noticed, this allows you to create embedded functions, or functions within functions, simply by replacing the object map of an input argument with another function. This is demonstrated in the full example below.

 

Full Functions Example

This example shows a full mapping file containing functions within functions, along with an example input source CSV file and its expected RDF output.

 

Mapping

Firstly, the namespace prefixes are declared; these include the additional fnml, fno, and grel namespaces required for functions, as well as idlab-fn for additional functions.

Next, the logical source and subject map are declared as normal for a CSV input source file. However, the first predicate object map is where we see our first function within a function. The aim of this output is to return true when the value is equal to bob, case insensitive. Let’s break this down:

  1. The innermost nested function is executed first, in this example it exists from lines 36 to 45

    1. This function performs a toUpperCase on the input, as seen in the previous example

    2. Taking a reference from the firstname value in the source data

  2. The second most nested function is executed next, using the output from the inner function as an input, this spans from lines 28 to 52

    1. This function checks if the input value is equal to the constant and returns true, or false otherwise

    2. This takes the transformed input data from the first function as input

    3. The second parameter of the function is constant

  3. The predicate object map is declared like any other

    1. The predicate is defined as http://example.com/isBob

    2. The object map uses fnml:functionValue instead of rr:template or rml:reference, and as seen, you are still able to declare additional properties: rr:termType, rr:datatype, and rr:language.

    3. Additionally, a rr:graph can also be specified

Where the first predicate object map demonstrated an example of the transformation of data, the second predicate object map shows filtering. The objective of this function is to nullify any values represented by the declared constant value. When null is returned from a function, an RDF triple is not generated.

  1. This function takes the same shape as previous functions

    1. It aims to remove values equal to the constant

    2. Using a reference to the occupation column as input

    3. And the constant having a value of “None”

 

 

Output

As seen in the output, all ex:isBob objects are output with a datatype of boolean, with records 10002 and 10004 being equal to true. Also notice how no NQuad has been generated for the occupation of 10004; this is due to “None” being filtered out.

 

 

Adding a Function to a Subject Map Example
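A sketch of how the same structure can be applied to a subject map (the template and function choice are illustrative; the generated term is used as the subject IRI, which defaults to term type rr:IRI):

    rr:subjectMap [
        fnml:functionValue [
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant grel:toUpperCase ]
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                # The function is applied to the string generated by this template
                rr:objectMap [ rr:template "http://example.com/{id}" ]
            ]
        ] ;
        rr:class ex:Employee
    ] ;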

 

Adding your own custom Function

Additional functions can be made in a relatively short period of time, should your data require transformation or filtering in a way not currently possible. Contact us for further support.