Manually Create a Mapping File
Intro
This guide outlines all the elements required in a valid mapping file for use in a Lens, and explains how to create one from scratch. Mapping files are written in the RDF Mapping Language (RML), a generic, scalable mapping language for expressing rules that map data in heterogeneous structures and serializations to the RDF data model. RML is defined as a superset of R2RML, the W3C-recommended language for mapping data in relational databases to RDF. RML mappings are themselves RDF graphs written in Turtle syntax, so all mapping files are saved with the .ttl file extension. All sample files in this document can be found and downloaded here; start up your own Lens to test them for yourself.
Overview and Example
The mapping refers to the Logical Source to retrieve data from an input source. This consists of:
A reference to the input source
The Reference Formulation to specify how to refer to the data
The iterator that specifies how to iterate over the data
Each logical source is mapped to RDF using a triples map. The triples map is a rule that maps each data element to a number of RDF triples. The rule has two main parts:
A subject map that generates the subject of all RDF triples that will be generated from a data element. The subjects often are IRIs that are generated from the primary key or ID portions of the data.
Multiple predicate-object maps that in turn consist of predicate maps and object maps.
Triples are produced by combining the subject map with a predicate map and object map, and applying these three to each element in the source data. For example, the complete rule for generating a set of triples might be:
Subjects: A template http://example.com/employee/{id} is used to generate subject IRIs from the id column or element.
Predicates: The constant vocabulary IRI ex:name is used.
Objects: The value of the name element or column is used to produce an RDF literal.
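The rule described above might be written in RML roughly as follows. This is a minimal sketch rather than a complete mapping file: the name <#EmployeeMapping>, the @base value and the IRI bound to the ex prefix are assumptions for illustration, and the logical source is left out for brevity.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex:  <http://example.com/> .          # assumed example namespace
@base <http://example.com/base/> .            # assumed base IRI

<#EmployeeMapping>
    # Subject: an IRI built from the id column or element
    rr:subjectMap [
        rr:template "http://example.com/employee/{id}"
    ] ;
    # Predicate: a constant IRI; object: the name value as a literal
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [
            rml:reference "name" ;
            rr:termType rr:Literal
        ]
    ] .
```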
By default, all RDF triples of the output dataset are in the default graph for Data Lens. This is represented by the Data Lens prefix followed by the unique identifier for that execution of the Lens, for example, <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>. This UUID can be used in conjunction with the Provenance to ascertain specific meta-data about the execution of the Lens. In addition to the default graph, a specific Named Graph can be specified in the mapping, whereby an additional triple is generated for each element.
An overview of R2RML
In this document, there will be a number of examples. To clearly discern which examples refer to what, they will use the following colour codes:
Full RML mapping files will be blue | Mapping RML snippets will be purple | Example input data will be yellow |
Mapping File
This is an example of a mapping file which we will break down in more detail in the sections below. Its aim is to take the id column as a Subject, and the name column as an Object, with ex:hasName as a Predicate.
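A complete mapping file with this aim could look something like the sketch below. The file name inputSourceFile.csv, the subject template, the @base value and the triples map name are assumptions chosen for illustration; the object is emitted as a literal here, which is one possible reading of the description.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .            # assumed base IRI

<#TriplesMap1>
    a rr:TriplesMap ;

    # Where the data comes from and how it is referenced
    rml:logicalSource [
        rml:source "inputSourceFile.csv" ;    # assumed file name
        rml:referenceFormulation ql:CSV
    ] ;

    # One subject per row, built from the id column
    rr:subjectMap [
        rr:template "http://example.com/{id}"
    ] ;

    # One triple per row: <subject> ex:hasName "<name>"
    rr:predicateObjectMap [
        rr:predicate ex:hasName ;
        rr:objectMap [ rml:reference "name" ]
    ] .
```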
Input File
Using this very basic CSV file as an example input file, we can see how it might be transformed into RDF using the Structured File Lens.
id | name |
---|---|
10001 | Alice |
10002 | Bob |
Output File
Using the mapping file, and passing in the input source file, the following output NQuads file will be generated. The NQuads are represented as ?Subject ?Predicate ?Object ?NamedGraph, as defined through the mapping and as populated with the source data.
Subject | Predicate | Object | Graph |
---|---|---|---|
|
|
|
|
Namespaces - Prefix and Base
Specifying a namespace prefix allows you to bind an IRI to an abbreviation for later use. As seen in the example, this is done in the following way:
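As a small illustration of a prefix binding, the sketch below binds ex to http://example.com/ and then uses the abbreviation in a statement. The statement itself is only there to show the abbreviation in use and is not part of any mapping in this guide.

```turtle
# Bind the abbreviation "ex" to an IRI
@prefix ex:  <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# ex:hasName is now shorthand for <http://example.com/hasName>
ex:hasName a rdf:Property .
```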
The most common prefixes and ones that will be used within this document are:
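Prefix | IRI | Description |
---|---|---|
rml | http://semweb.mmlab.be/ns/rml# | RML ontology |
rr | http://www.w3.org/ns/r2rml# | The R2RML ontology, which is extended by RML |
ql | http://semweb.mmlab.be/ns/ql# | The Query Language vocabulary, which is used together with RML |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | The RDF Concepts Vocabulary |
rdfs | http://www.w3.org/2000/01/rdf-schema# | The RDF schema |
xsd | http://www.w3.org/2001/XMLSchema# | The XML Schema Definition namespace |
schema | http://schema.org/ | The schema.org vocabulary |
dbo | http://dbpedia.org/ontology/ | The DBpedia ontology |
ex | http://example.com/ | An example prefix used for our RML rules |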
A Base IRI is used in resolving relative IRIs produced by the RML mapping. According to the R2RML spec, the base IRI must be a valid IRI - it should not contain a question mark (“?”) or hash (“#”) characters and should end in a slash (“/”) character.
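A base IRI declaration might look like the sketch below. It assumes that the Lens takes its base IRI from the @base directive of the mapping file; the base value, the triples map name and the relative template are hypothetical.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .

# No "?" or "#" characters, and a trailing "/"
@base <http://example.com/base/> .

<#TriplesMap1>
    rr:subjectMap [
        # A relative template: for id 10001 this would resolve
        # against the base to <http://example.com/base/employee/10001>
        rr:template "employee/{id}"
    ] .
```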
TriplesMap
A Triples Map defines rules to generate zero or more RDF triples sharing the same subject. A Triples Map consists of a Logical Source, a Subject Map and zero or more Predicate-Object Maps. The example mapping file snippets in this section will all correspond to the following input files:
CSV / SQL / XLSX / ODS | JSON | XML |
---|---|---|
|
| |
The exact same structure will be used in a SQL Table to represent examples for the SQL Lens, using the table name | Note how the |
LogicalSource - Structured Files
As previously mentioned, a Logical Source consists of:
A reference to the input source
As defined by rml:source "inputSourceFile.csv". When using the Lenses, the input source file is renamed before processing to the name given in the logical source parameter when the process was triggered, which in this case is inputSourceFile.csv. Ensure that the TriplesMaps for the file you want to process use the matching logical source.
The Reference Formulation clarifies which data format is parsed and how the references to the extracts of data are defined
As defined by rml:referenceFormulation ql:CSV. The following reference formulations are predefined, although the list is not limited to them: ql:CSV, ql:JSONPath, ql:XPath, and ql:SaXPath for use in the Structured File Lens, and just ql:JSONPath for use in the RESTful Lens.
The iterator that specifies how to iterate over the data
CSV | JSON | XML |
---|---|---|
The Logical Source of a mapping file designed for CSV input files does not have an iterator. CSV files have a fixed, row-by-row iteration pattern, so it does not need to be defined. | As seen, this iterator uses JSONPath to specify how to reference the data source. This simple path gets all of the elements in the base array and returns everything in the | When parsing XML within the Structured File Lens, there are two different ways to generate your output data. The first uses a SAX parser, which allows for very fast XML processing. The second uses a DOM parser, which, while slower, allows complex XPath iterators to be used. We recommend using the SAX parser where possible, which is done by declaring the reference formulation as above. However, if you are using complex XPath expressions in your subject or object maps that utilise parent elements
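To make the three cases concrete, the sketch below shows one logical source per format. The file names, the triples map names, the @base value and both iterator expressions are assumptions for illustration and would need to match the structure of your own data.

```turtle
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@base <http://example.com/base/> .

# CSV: no iterator, each row is one record
<#CsvMapping> rml:logicalSource [
    rml:source "inputSourceFile.csv" ;
    rml:referenceFormulation ql:CSV
] .

# JSON: the iterator is a JSONPath expression selecting each record
<#JsonMapping> rml:logicalSource [
    rml:source "inputSourceFile.json" ;
    rml:referenceFormulation ql:JSONPath ;
    rml:iterator "$.employees[*]"             # hypothetical path
] .

# XML with the SAX-based parser: the iterator is a simple XPath expression
<#XmlMapping> rml:logicalSource [
    rml:source "inputSourceFile.xml" ;
    rml:referenceFormulation ql:SaXPath ;     # ql:XPath selects the DOM parser
    rml:iterator "/employees/employee"        # hypothetical path
] .
```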
LogicalSource - SQL Database
The logical source when defining a mapping file for use against a SQL DB in the SQL Lens differs slightly from structured files. There are currently two main ways to retrieve data from your databases - the first allows you to retrieve all data from a table, and the second enables queries to be defined for explicit data retrieval. First, you must specify two new prefix namespaces in your mapping:
Select All | Custom Query | Database Source Info |
---|---|---|
This first example demonstrates retrieving all the data from that specified database table The The The | This second example demonstrates the ability to define a query within your mapping file. This is done using Also seen in this example is the use of a | And finally, the source, this manages the storing of the details and credentials of your database. This excerpt usually exists at the end of your mapping file, see full examples below for more details. The The All these properties are required when defining your database. Additional options can also be given by following the same structure, for example by adding: |
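A select-all logical source and its database description might be sketched as follows. This assumes the D2RQ vocabulary that is commonly used with RML for describing databases; the table name, JDBC URL, driver class, credentials and SQL version are all placeholders, and the exact set of properties the SQL Lens expects may differ.

```turtle
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@base <http://example.com/base/> .

# Retrieve every row of one table
<#EmployeeMapping> rml:logicalSource [
    rml:source <#DB_source> ;
    rr:sqlVersion rr:SQL2008 ;
    rr:tableName "Employees"                  # hypothetical table name
] .

# Connection details, usually placed at the end of the mapping file
<#DB_source> a d2rq:Database ;
    d2rq:jdbcDSN "jdbc:mysql://localhost:3306/exampledb" ;   # placeholder URL
    d2rq:jdbcDriver "com.mysql.cj.jdbc.Driver" ;             # placeholder driver
    d2rq:username "user" ;
    d2rq:password "password" .
```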
SubjectMap
The Subject Map consists of the URI pattern that defines how each triple's subject is generated, and optionally you can also define its type.
CSV / JSON / SQL | XML |
---|---|
While optional, | Defining the subject map for XML documents is the same with the other file types, except as noted previously with our XML example input file, the |
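A subject map with an optional type could be sketched as below. The template, the class ex:Employee and the map names are illustrative assumptions; the second map shows how an XML attribute might be referenced with XPath's @ syntax.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/> .
@base <http://example.com/base/> .

# CSV / JSON / SQL: id is a column or field
<#EmployeeMapping>
    rr:subjectMap [
        rr:template "http://example.com/{id}" ;
        rr:class ex:Employee                  # optional rdf:type of the subject
    ] .

# XML: id is assumed here to be an attribute of the iterated element
<#EmployeeXmlMapping>
    rr:subjectMap [
        rr:template "http://example.com/{@id}" ;
        rr:class ex:Employee
    ] .
```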
PredicateObjectMap
A triples map specifies the rules for translating each row of a database, each record of a CSV data source, each element of an XML data source, or each object of a JSON data source into zero or more RDF triples. The triples are generated using Predicate Object Maps, each of which consists of Predicate Map(s) and Object Map(s). A Predicate Map specifies how the triple's predicate is generated and an Object Map specifies how the triple's object(s) are generated.
Within our Predicate Object Map, we must declare our predicate and our object. First, we assign our Predicate to be a constant-valued term map, which means that it will ignore the logical iterator specified by the query and will always generate the same RDF term. This is represented using the constant shortcut property rr:predicate followed by a mandatory IRI. As seen in the following example, we are using our namespace prefix ex, which results in the constant http://example.com/hasName.
Secondly, our object is defined through the use of an rr:objectMap. This allows us to specify the object as a template and to set a number of properties. A template-valued term map is a term map that is represented by a resource that has exactly one rr:template property, where its value must be a valid string template. A string template is a format string that can be used to build strings from multiple components. It can reference logical references by enclosing them in curly braces (“{” and “}”). As seen in the following example, we are simply referencing the firstname element/column whilst prepending the string http://example.employees.com/. When no term type is specified, the template must produce a valid IRI. Note: this predicate object map example is the same across the different data source types, because the XPath, JSONPath, CSV and SQL references are identical.
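The predicate object map described above could be sketched as follows; the triples map name and the @base value are assumptions, and the rest of the triples map is omitted.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    rr:predicateObjectMap [
        # Constant predicate: always <http://example.com/hasName>
        rr:predicate ex:hasName ;
        # Template-valued object: an IRI built from the firstname value
        rr:objectMap [
            rr:template "http://example.employees.com/{firstname}"
        ]
    ] .
```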
This simply results in the predicate being equal to <http://example.com/hasName> and the object, by referencing the firstname element of a record in the data, equaling, for example, <http://example.employees.com/Alice>.
An alternative to rr:template is rml:reference. This refers directly to a column in a database, a field of a record in a CSV data source, an element in an XML data source, or an object in a JSON data source. A reference must be a valid identifier for the specified reference formulation (rml:referenceFormulation); it can be an absolute path, or a path relative to the iterator specified at the logical source. For example, rr:template "{firstname}" simply becomes rml:reference "firstname". As such, you are not able to add any string content around the reference.
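For comparison, the same object written with rml:reference might look like this sketch. The triples map name and @base value are assumptions; note that, per R2RML, a reference-valued object map produces a literal unless a term type says otherwise.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    rr:predicateObjectMap [
        rr:predicate ex:hasName ;
        # The whole value of firstname, with nothing added around it
        rr:objectMap [ rml:reference "firstname" ]
    ] .
```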
Additional Properties
A template may consist solely of a reference, as in rr:template "{firstname}"; however, the following syntax rules apply to valid string templates:
Pairs of unescaped curly braces must enclose valid references according to the specified reference formulation.
Curly braces that do not enclose references must be escaped by a backslash character (“\”). This also applies to curly braces within references.
Backslash characters (“\”) must be escaped by preceding them with another backslash character, yielding “\\”. This also applies to backslashes within references.
There should be at least one pair of unescaped curly braces.
If a template contains multiple pairs of unescaped curly braces, then any pair should be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either reference; or in the IRI-safe versions of the data values, if the term type is rr:IRI.
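The template rules above can be illustrated with the following sketch. The predicates ex:name and ex:label and the map name are hypothetical. Because the mapping is written in Turtle, each backslash required by the template syntax is itself written as a double backslash inside the string.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    # Two references separated by a safe separator (a single space)
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [
            rr:template "{firstname} {lastname}" ;
            rr:termType rr:Literal
        ]
    ] ;
    # Literal curly braces around a reference: the template string is
    # \{{id}\} once Turtle has unescaped the backslashes
    rr:predicateObjectMap [
        rr:predicate ex:label ;
        rr:objectMap [
            rr:template "\\{{id}\\}" ;
            rr:termType rr:Literal
        ]
    ] .
```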
In addition, as mentioned previously, a number of optional properties can be specified within the declaration of the object map. These include:
rr:termType: The term type determines the kind of generated RDF term. Its value must be an IRI and must be one of the following options: rr:IRI, rr:BlankNode, or rr:Literal. For template-valued object maps this defaults to rr:IRI; if you wish to make use of a data type or language tag, use rr:Literal.
rr:datatype: The data type determines the data type of the generated RDF term. Its value must be an IRI and can only be specified on a datatypeable term map, which is a term map with a term type of rr:Literal that does not have a specified language tag. We recommend the use of the XML Schema datatypes, by using the xsd namespace prefix.
rr:language: A specified language tag causes generated literals to be language-tagged plain literals, and as such this may only be specified on term maps of type rr:Literal with no data type. The value must be a valid language tag.
Data Type | Language |
---|---|
The resulting object would equal | The resulting object would equal |
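The two options might be written as in the sketch below, using the R2RML property names rr:datatype and rr:language. The predicates ex:employeeId and ex:occupation, the referenced fields and the map name are hypothetical.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    # Data type: a typed literal such as "10001"^^xsd:integer
    rr:predicateObjectMap [
        rr:predicate ex:employeeId ;
        rr:objectMap [
            rml:reference "id" ;
            rr:termType rr:Literal ;
            rr:datatype xsd:integer
        ]
    ] ;
    # Language: a language-tagged literal such as "Tech"@en-gb
    rr:predicateObjectMap [
        rr:predicate ex:occupation ;
        rr:objectMap [
            rml:reference "occupation" ;
            rr:termType rr:Literal ;
            rr:language "en-gb"
        ]
    ] .
```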
Named Graphs
Each triple generated from an RML mapping is placed into one or more graphs of the output dataset. As previously mentioned, by default, all generated triples are put into a Data Lens graph. This is represented by the Data Lens prefix followed by the unique identifier for that execution of the Lens, for example, <http://www.data-lens.co.uk/803d180d-7ae7-4ce9-8ec4-ef8de44a800d>. This UUID can be used in conjunction with the Provenance to ascertain specific meta-data about the execution of the Lens.
In addition to this default graph, specific IRI-named Named Graphs can be specified in the mapping. Any subject map or predicate-object map may have one or more associated graph maps. These are specified by using the constant shortcut property rr:graph.
As outlined, this example predicate object map would result in two generated quads, one in the Data Lens named graph, and one in the user-specified http://example.com/EmployeesGraph named graph.
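A predicate object map with a user-specified named graph might be sketched as follows. The predicate, the referenced field and the map name are assumptions; ex:EmployeesGraph expands to http://example.com/EmployeesGraph.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    rr:predicateObjectMap [
        rr:predicate ex:hasName ;
        rr:objectMap [ rml:reference "firstname" ] ;
        # Constant shortcut: also place the triple in this named graph
        rr:graph ex:EmployeesGraph
    ] .
```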
Examples
This section will provide you with full examples of mapping files, along with the input data and expected output triples. Copying these files and using them directly with your Lens will result in similar output data.
CSV
This basic CSV example is designed to take the id column as the subject, and set it to type http://example.com/Employee. Then, from each record, it takes a combination of the firstname and the lastname to be a literal string object for the http://example.com/name predicate. Finally, the occupation is set to be an “en-gb” literal object.
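A mapping along these lines could be sketched as below. The file name, the subject template, the ex:occupation predicate, the space separator between the two names and the @base value are assumptions for illustration.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "inputSourceFile.csv" ;    # assumed file name
        rml:referenceFormulation ql:CSV
    ] ;
    # id as subject, typed as ex:Employee
    rr:subjectMap [
        rr:template "http://example.com/{id}" ;
        rr:class ex:Employee
    ] ;
    # firstname and lastname combined into one literal
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [
            rr:template "{firstname} {lastname}" ;
            rr:termType rr:Literal
        ]
    ] ;
    # occupation as an en-gb language-tagged literal
    rr:predicateObjectMap [
        rr:predicate ex:occupation ;
        rr:objectMap [
            rml:reference "occupation" ;
            rr:termType rr:Literal ;
            rr:language "en-gb"
        ]
    ] .
```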
XLSX / ODS (Spreadsheet)
An example for spreadsheet data would be exactly the same as the CSV example above, except for a slight change to the logicalSource, which would be as follows:
JSON
This basic JSON example follows a similar structure to the CSV example, whereby the id value is taken to be the subject. The ex:name object has not been assigned a termType, dataType, or language, so it must therefore be an IRI. The structure of this JSON is a little more complicated than the CSV: the rml:iterator and the ex:occupation object both show how JSONPath is used in the mappings. By following the concatenation of the two, $.[*].employees.[*].role.occupation, we reach the occupation value for each employee.
XML
This basic XML example also follows a similar structure to the CSV and JSON examples, taking id as the subject, and the firstname and lastname as the ex:name predicate object. However, the subject value is an attribute in the XML, and the object contains a user-defined named graph, ex:EmployeesGraph; note the additional NQuads generated in the output as a result. Similarly to JSON, the XPath concatenation of the iterator and the reference in the template, /employees/employee/role/occupation, reaches the occupation value for each employee. Finally, as no parent nodes are being used in any of the XPath queries, we are using ql:SaXPath as the rml:referenceFormulation.
SQL
This basic SQL example is most similar to the CSV example, as the two share a similar data structure. We are just using the id, firstname and lastname to create our subject, predicates and objects, but this example's main purpose is to show how to target a database and its tables. The source DB is referenced in the rml:source as <#DB_source>, which can then be seen at the end of the file. Here the table name is specified, which will then return all records and values from this table.
id | firstname | lastname | occupation |
---|---|---|---|
10001 | Alice | Johnson | Tech |
10002 | Bob | Smith | Sales |
SQL with Custom Query
This SQL example builds on the previous one by also including a custom SQL query. Note that we are using a MySQL DB, so the syntax may differ for other databases. Setting up the rml:source and rr:sqlVersion is the same as before; however, we are now dropping the rr:tableName in favour of rml:query and rml:referenceFormulation ql:CSV. As well as the id, firstname and lastname subject, predicates and objects, we are also taking the data from the newly formulated namelength column. This is done in exactly the same way as before, by setting a predicate, in this case ex:namelength, then defining the object as a reference to the value as a literal int.
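The custom-query variant might be sketched as follows. The SQL query, the table name, the subject template and the @base value are hypothetical, and <#DB_source> is assumed to be described elsewhere in the same file as in the previous example.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    rml:logicalSource [
        rml:source <#DB_source> ;             # database description defined elsewhere
        rr:sqlVersion rr:SQL2008 ;
        # rml:query replaces rr:tableName (hypothetical MySQL query)
        rml:query "SELECT id, firstname, lastname, CHAR_LENGTH(firstname) AS namelength FROM Employees" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "http://example.com/{id}"
    ] ;
    # The computed column is referenced like any other column
    rr:predicateObjectMap [
        rr:predicate ex:namelength ;
        rr:objectMap [
            rml:reference "namelength" ;
            rr:termType rr:Literal ;
            rr:datatype xsd:int
        ]
    ] .
```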
id | firstname | lastname | occupation |
---|---|---|---|
10001 | Alice | Johnson | Tech |
10002 | Bob | Smith | Sales |
Functions
The Lenses of Data Lens allow for functions to be included within the mappings. These functions allow for raw data to be transformed or filtered on its way to being translated to RDF. We support a wide array of functions, including the majority of the GREL string functions. For a full list of the supported functions included in the latest releases of the Structured File Lens, SQL Lens, and RESTful Lens, see here.
Including a function in your mapping file is fairly simple, as outlined below. When a function is used, a triple will not be generated if the subject or object is provided with a NULL value. Strings passed into functions may be sourced from an rml:reference (the entire value of a field in the source document), an rr:template (a string that includes values of fields in the source document, as well as manually specified strings), or the output of another function (functions may therefore be nested).
Prefixes
In order to make use of functions, you must first add the following prefixes to the beginning of your mapping file:
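As a rough guide, function-enabled RML mappings commonly declare prefixes like those in the sketch below. These namespace IRIs are the ones generally used by the RML function tooling and are given here as an assumption; check the list of supported functions for the exact prefixes your Lens release expects.

```turtle
# Function Ontology, its RML binding, and the GREL function library
@prefix fno:  <https://w3id.org/function/ontology#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .
```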
Structure of a Function
To describe the structure of a function, we will use a simple example. The purpose of this function is to take an input string and convert it to all uppercase.
The function described above is inserted into a mapping file from within an rr:subjectMap or rr:objectMap. For example:
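An uppercase function placed inside an object map might be sketched as follows, using the general RML and Function Ontology pattern. The namespace IRIs, the function grel:toUpperCase with its grel:valueParameter argument, the predicate ex:upperName and the map name are assumptions based on the GREL function library; the exact form supported by the Lens may differ.

```turtle
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix fno:  <https://w3id.org/function/ontology#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .
@prefix ex:   <http://example.com/> .
@base <http://example.com/base/> .

<#EmployeeMapping>
    rr:predicateObjectMap [
        rr:predicate ex:upperName ;           # hypothetical predicate
        rr:objectMap [
            # The object is the result of executing a function
            fnml:functionValue [
                # Which function to execute
                rr:predicateObjectMap [
                    rr:predicate fno:executes ;
                    rr:objectMap [ rr:constant grel:toUpperCase ]
                ] ;
                # The input argument: the firstname value of each record
                rr:predicateObjectMap [
                    rr:predicate grel:valueParameter ;
                    rr:objectMap [ rml:reference "firstname" ]
                ]
            ]
        ]
    ] .
```

The inner object map that supplies the input argument is the place where another fnml:functionValue could be substituted to nest functions.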
Therefore, as you may have noticed, this allows you to create embedded functions, or functions within functions, simply by replacing an object map within an input argument with another function. This is demonstrated in the full example below.
Full Functions Example
This example shows a full mapping file containing functions within functions, along with an example input source CSV file and its expected RDF output.
Mapping Firstly, the namespace prefixes are declared, these include the additional Next, the logical source and subject map are declared as normal for a CSV input source file. However, the first predicate object map, is where we see our first function within a function. The aim of this output is to return true when the value is equal to bob, case insensitive. Let’s break this down:
Where the first predicate object map demonstrated an example of the transformation of data, the second predicate object map shows filtering. The objective of this function is to nullify any values that match the declared constant value. When null is returned from a function, an RDF triple is not generated.
Output As seen in the output, all | |
|
Adding a Function to a Subject Map Example
Adding your own custom Function
Additional functions can be made in a relatively short period of time, should your data require transformation or filtering in a way not currently possible. Contact us for further support.