User Guide - RESTful Lens v1.1

Intro

This is the full User Guide for the RESTful Lens. It contains an in-depth set of instructions to fully set up, configure, and run the Lens so you can start ingesting data as part of an end-to-end system. For a guide to get the Lens up and running in the quickest and simplest possible way, see the Quick Start Guide. Once deployed, you can use any of our ready-made sample input, mapping, and expected output files to test your Lens. For a list of what has changed since the last release, visit the User Release Notes.

 




Creating the Mapping File

The first step in configuring the RESTful Lens is to create a mapping file. The mapping file creates the links between your source data and your target model (ontology). It can be created using our online Data Lens Mapping Tool, which provides an intuitive web-based UI. Log in here to get started, and select the option for RESTful Lens. The RESTful Lens ingests data from RESTful endpoints; the creation of mapping files differs slightly between file types, so be sure to select the correct options for your use case. Alternatively, the Mapping Tool can be deployed to your own infrastructure, which enables additional functionality such as the ability to update mapping files on a running Lens. To do this, follow these instructions.

However, if you wish to create your RML mapping files manually, there is a detailed step-by-step guide on creating one from scratch.

 


Configuring the Lens

All Lenses supplied by Data Lens are configurable through the use of Environment Variables. How to declare these environment variables will differ slightly depending on how you choose to run the Lens, so please see Running the Lens for more info. For a breakdown of every configuration option in the RESTful Lens, see the full list here.

Mandatory Configuration

For the Lens to operate, the following configuration options are required.

  • License - LICENSE

    • This is the license key required to operate the Lens. Request your new unique license key here.

  • Configuration JSON file URL - JSON_API_CONFIG_URL or JSON_REST_CONFIG_URL

    • Depending on the mode in which the Lens is set, one of these two variables is required. The JSON configuration URL refers to the external JSON resource (a static file or a RESTful service) which produces JSON content matching either the JSON:API standard or plain JSON. The default mode of the Lens is json-api, so unless the mode is changed the JSON_API_CONFIG_URL variable is the one required.

  • Mapping Directory URL - MAPPINGS_DIR_URL

    • This is the directory where your mapping file(s) are located. As with all directories, this can be either local or on a remote S3 bucket. Mapping files for the RESTful Lens can be created using our Mapping Config Web App and can be pushed directly to a running Lens.

  • Output Directory URL - OUTPUT_DIR_URL

    • This is the directory where all generated RDF files are saved to. This also supports local and remote URLs.

  • Provenance Output Directory URL - PROV_OUTPUT_DIR_URL

    • Out of the box, the RESTful Lens supports Provenance, and it is generated by default. Once generated, the Provenance is saved to output files separate from the transformed source data. This option specifies the directory where provenance RDF files are saved, which also supports local and remote URLs.

    • If you do not wish to generate Provenance, you can turn it off by setting the RECORD_PROVO variable to false. In this case, the PROV_OUTPUT_DIR_URL option is no longer required. For more information on Provenance configuration, see below.
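The rules above can be checked before starting a container. The following is a minimal sketch, assuming the variable names listed in this section and the ENDPOINT_MODE option described in the next section; the function name `missing_mandatory` is hypothetical and is not part of the Lens.

```python
from typing import List, Mapping


def missing_mandatory(env: Mapping[str, str]) -> List[str]:
    """Return the mandatory variables that are absent or empty in `env`."""
    required = ["LICENSE", "MAPPINGS_DIR_URL", "OUTPUT_DIR_URL"]

    # The default mode is json-api, so JSON_API_CONFIG_URL is needed
    # unless ENDPOINT_MODE selects rest-api.
    mode = env.get("ENDPOINT_MODE", "json-api")
    required.append(
        "JSON_REST_CONFIG_URL" if mode == "rest-api" else "JSON_API_CONFIG_URL"
    )

    # Provenance is generated by default; its output directory is only
    # optional when RECORD_PROVO is explicitly set to false.
    if env.get("RECORD_PROVO", "true").lower() != "false":
        required.append("PROV_OUTPUT_DIR_URL")

    return [name for name in required if not env.get(name)]
```

Calling `missing_mandatory(os.environ)` in a start-up script would list anything still to be set; an empty list means the mandatory options described here are all present.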

Lens Mode and JSON Configuration File

The RESTful Lens can operate in two modes, and there are therefore two types of JSON configuration file that can be used with it. Which one applies depends on whether the Lens is working with a standard RESTful endpoint or with one that conforms to the JSON:API specification. The endpoint mode is set by specifying either json-api or rest-api under the ENDPOINT_MODE config option; by default the Lens is set to JSON:API mode. Specifying a JSON config file is mandatory, and is done by providing a value for the JSON_API_CONFIG_URL or JSON_REST_CONFIG_URL variable respectively.

JSON Rest Configuration

{
  "url": "https://test-rest-bucket.s3.eu-west-2.amazonaws.com/imdb-titles-100.json",
  "method": "GET",
  "headers": [
    {"Content-Type": "application/json"}
  ],
  "params": [
    {"foo1": "bar1"},
    {"foo2": "bar2"}
  ]
}

The JSON Rest config file must include all the data required to make a standard RESTful call to the endpoint. The url and method entries are standard JSON key-value pairs. The headers and params entries, if required, are both arrays which contain a single object for each key-value pair.
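Because headers and params are arrays of single-pair objects rather than plain objects, it can be convenient to generate the file. The following is a minimal sketch that reproduces the structure described above; the helper name `build_rest_config` is hypothetical, and omitting the headers and params keys when they are not needed is an assumption based on the "if required" wording.

```python
import json
from typing import Dict, Optional


def build_rest_config(url: str, method: str = "GET",
                      headers: Optional[Dict[str, str]] = None,
                      params: Optional[Dict[str, str]] = None) -> dict:
    """Build a JSON Rest config with one object per header/param pair."""

    def as_pairs(mapping: Dict[str, str]) -> list:
        # {"a": "1", "b": "2"} becomes [{"a": "1"}, {"b": "2"}]
        return [{key: value} for key, value in mapping.items()]

    config = {"url": url, "method": method}
    if headers:
        config["headers"] = as_pairs(headers)
    if params:
        config["params"] = as_pairs(params)
    return config


config = build_rest_config(
    "https://test-rest-bucket.s3.eu-west-2.amazonaws.com/imdb-titles-100.json",
    headers={"Content-Type": "application/json"},
    params={"foo1": "bar1", "foo2": "bar2"},
)
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and referenced through JSON_REST_CONFIG_URL.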

JSON:API Configuration

{
  "startURL": "http://example.com/articles",
  "includeFields": [
    "type",
    "title",
    "author.type",
    "author.id",
    "author.firstName",
    "author.lastName",
    "author.twitter",
    "comments.type",
    "comments.id",
    "comments.body"
  ]
}

The JSON:API config file needs to include the start URL of the JSON:API service and the list of fields that you want to take from the service to convert into RDF. The Lens will go through the JSON:API service to get the relevant fields. It starts its search on the first page referenced, and if it cannot locate a required field on the current page, it uses the self-reference for the relevant lower-level object to try to find that field. For example, if we take the author value, the author.type and author.id fields may be included on the initial http://example.com/articles page, whereas the author.firstName, author.lastName and author.twitter fields may not be.

For a specific article, if the author.type is “Person” and author.id is “9”, the Lens will look for an object with those values in its type and id fields and follow the self-link to the relevant page for the author, http://example.com/people/9. From here it will then look for the missing additional fields author.firstName, author.lastName and author.twitter. If the author object itself contained objects such as author.book, with fields author.book.type, author.book.id, author.book.price and author.book.title, and these additional fields were specified in the JSON config, then the Lens will follow the same process, using the id and type of the book object to go to the relevant self-link for the book and search for the remaining fields there. The Lens is able to follow this process for any depth of object. This can be seen in example 1 within the examples repository, where the config file is called multipage-config.json.
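The lookup described above can be illustrated with a toy resolver. This is only a sketch of the idea, not the Lens's implementation: the function names, the in-memory `PAGES` dictionary standing in for HTTP requests, and the sample attribute values are all hypothetical, and only to-one relationships are handled.

```python
from typing import Any, Callable, Dict, Optional

Doc = Dict[str, Any]


def find_included(document: Doc, type_: str, id_: str) -> Optional[Doc]:
    """Find the object on this page with the given type and id."""
    for resource in document.get("included", []):
        if resource.get("type") == type_ and resource.get("id") == id_:
            return resource
    return None


def resolve(document: Doc, resource: Doc, path: str,
            fetch: Callable[[str], Doc]) -> Any:
    """Resolve a dotted field such as "author.firstName" for one resource."""
    head, _, rest = path.partition(".")
    self_url = resource.get("links", {}).get("self")

    if not rest:  # leaf field: type, id, or an attribute
        if head in ("type", "id"):
            return resource.get(head)
        attributes = resource.get("attributes", {})
        if head not in attributes and self_url:
            # Not on the current page: follow the self-link and look there.
            attributes = fetch(self_url)["data"].get("attributes", {})
        return attributes.get(head)

    # `head` names a lower-level object; identify it by its type and id.
    relationship = resource.get("relationships", {}).get(head)
    if relationship is None and self_url:
        document = fetch(self_url)
        relationship = document["data"].get("relationships", {}).get(head)
    if relationship is None:
        return None
    identifier = relationship["data"]
    related = find_included(document, identifier["type"], identifier["id"])
    return resolve(document, related or identifier, rest, fetch)


# Hypothetical pages: the author's names only exist on the author's own page.
PAGES = {
    "http://example.com/people/9": {
        "data": {
            "type": "Person", "id": "9",
            "attributes": {"firstName": "Ada", "lastName": "Lovelace"},
            "links": {"self": "http://example.com/people/9"},
        }
    }
}
ARTICLES = {
    "data": [{
        "type": "articles", "id": "1",
        "attributes": {"title": "Example title"},
        "relationships": {"author": {"data": {"type": "Person", "id": "9"}}},
    }],
    "included": [{
        "type": "Person", "id": "9",
        "links": {"self": "http://example.com/people/9"},
    }],
}

article = ARTICLES["data"][0]
first_name = resolve(ARTICLES, article, "author.firstName", PAGES.__getitem__)
```

Here `title` and `author.type` are found on the first page, while `author.firstName` is only found after following the self-link for the author.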

AWS Configuration

If you wish to use cloud services such as Amazon Web Services, you need to specify an AWS Access Key, Secret Key, and Region through AWS_ACCESS_KEY, AWS_SECRET_KEY, and S3_REGION respectively. Providing your AWS credentials gives the Lens permission to access, download, and upload remote files in S3 Buckets. The S3 Region option specifies the AWS region in which your files and services reside. Please note that all services must be in the same region, including if you choose to run the Lens in an EC2 instance or with the use of Lambdas.

Kafka Configuration

One of the many ways to interface with the Lens is through the use of Apache Kafka. With the RESTful Lens, a Kafka Message can be used for managing the output of data to and from the Lens. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Lens. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.

The Kafka Broker is what tells the Lens where to look for your Kafka Cluster, so set this property as follows: <kafka-ip>:<kafka-port>. The recommended port is 9092.

All other Kafka configuration variables can be found here; all of them have default values that can be overridden.

Provenance Configuration

As previously mentioned, Provenance is generated by default. It can be turned off by setting the RECORD_PROVO variable to false; otherwise PROV_OUTPUT_DIR_URL is required. If you wish to store this Provenance remotely in an S3 Bucket, then you are required to specify your region, access key, and secret key through PROV_S3_REGION, PROV_AWS_ACCESS_KEY, and PROV_AWS_SECRET_KEY respectively.

If you wish to manage the Provenance output files through Kafka, then you can choose to use the same brokers and topic names as with the previously specified data files, or an entirely different cluster. All Provenance configuration can be found here.

Logging Configuration

Logging in the RESTful Lens works the same way as in all other Lenses and, like most functionality, is configurable through environment variables; the list of overridable options and their descriptions can be found here. When running the Lens from the command line using the instructions below, the Lens will automatically log to your terminal instance. In addition, the archives of logs are saved within the Docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type; however, this can be overridden. If you are running a Lens in an AWS cloud environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.

Optional Configuration

There is also a further selection of optional configurations for given situations, see here for the full list.

Directories in Lenses

The Lenses are designed to support files and directories from an array of sources. This includes both local URLs and remote URLs including cloud-based technologies such as AWS S3. The location should be expressed as a URL string (Ref. RFC-3986).

  • To use a local URL for directories and files, both the file:///var/local/data-lens-output/ and /var/local/data-lens-output/ formats are supported.

  • To use a remote http(s) URL for files, https://example.com/input-file.json is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must specify an AWS access key and secret key.
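The three location forms listed above can be told apart by their URL scheme. The following is a minimal sketch using Python's standard `urllib.parse`; the function names `classify_location` and `split_s3` are hypothetical helpers for illustration and do not reflect how the Lens parses locations internally.

```python
from typing import Tuple
from urllib.parse import urlparse


def classify_location(location: str) -> str:
    """Classify a location string as "local", "http", or "s3"."""
    scheme = urlparse(location).scheme
    if scheme in ("", "file"):
        # Both /var/local/... and file:///var/local/... are local.
        return "local"
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    raise ValueError(f"Unsupported location: {location}")


def split_s3(location: str) -> Tuple[str, str]:
    """Split s3://<bucket-name>/<directory>/<file-name> into bucket and key."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URL: {location}")
    return parsed.netloc, parsed.path.lstrip("/")
```

For instance, `split_s3("s3://example/folder/")` separates the bucket name `example` from the key prefix `folder/`.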

Accessing the configuration of a running Lens

Once a Lens has started and is operational, you can view the current config by calling one of the Lens' built-in APIs; this is explained in more detail below. Please note that in order to change any config variable on a running Lens, it must be shut down and restarted.

 


 

Running the Lens

All of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, including in the cloud or on-premise. This is achieved through the use of Docker Containers.

Local Docker Image

To run the Lens locally, first please ensure you have Docker installed. Then, by running a command with the following structure, Docker will start the container and run the Lens from your downloaded image. If you would prefer to try the Lens with ready-made example files, we have created a repository containing a selection, which you can clone to your local file system. In the following docker run example, we have used the files from example 1 pulled to the local directory /opt/Projects/DataLens/.

For UNIX based machines (macOS and Linux):

docker run \
  -e MAPPINGS_DIR_URL=file:///data/example1/mapping/ \
  -e JSON_API_CONFIG_URL=file:///data/example1/input/multipage-config.json \
  -e OUTPUT_DIR_URL=file:///data/example1/output \
  -e RECORD_PROVO=false \
  -e LENS_RUN_STANDALONE=true \
  -e LICENSE \
  -v /opt/Projects/DataLens/datalens-examples/restful-lens/example1:/data/example1 \
  -p 8080:8080 \
  lens-restful:latest

For Windows

The above examples demonstrate how to override configuration options using environment variables in your Lens. Line 7 shows an environment variable saved on the machine being passed in, whereas lines 2-6 simply show a string value being passed in. Given that the Lens runs on port 8080, line 9 exposes and binds that port on the host machine so that the APIs can be triggered. The -v flag seen on line 8 mounts the working directory into the container; when the host directory of a bind-mounted volume doesn’t exist, Docker will automatically create this directory on the host for you. Finally, line 10 is the name and version of the Docker image you wish to run.

For more information on running Docker Images, see the official Docs.
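If you script your deployments, the same command can be assembled as an argument list, which avoids shell-specific quoting and line continuation. The following is a minimal sketch using only the flags shown in the example above (-e, -v, -p); the function name `docker_run_args` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence


def docker_run_args(image: str, env: Dict[str, str],
                    passthrough: Sequence[str] = (),
                    mounts: Dict[str, str] = None,
                    port: int = 8080) -> List[str]:
    """Assemble a `docker run` argument list for the Lens."""
    args = ["docker", "run"]
    for name, value in env.items():
        # Explicit value, as on lines 2-6 of the example.
        args += ["-e", f"{name}={value}"]
    for name in passthrough:
        # Value taken from the host environment, as on line 7.
        args += ["-e", name]
    for host_dir, container_dir in (mounts or {}).items():
        args += ["-v", f"{host_dir}:{container_dir}"]
    args += ["-p", f"{port}:{port}", image]
    return args


args = docker_run_args(
    "lens-restful:latest",
    env={
        "MAPPINGS_DIR_URL": "file:///data/example1/mapping/",
        "JSON_API_CONFIG_URL": "file:///data/example1/input/multipage-config.json",
        "OUTPUT_DIR_URL": "file:///data/example1/output",
        "RECORD_PROVO": "false",
        "LENS_RUN_STANDALONE": "true",
    },
    passthrough=["LICENSE"],
    mounts={
        "/opt/Projects/DataLens/datalens-examples/restful-lens/example1": "/data/example1"
    },
)
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where Docker is installed.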

Docker on AWS

The deployment approach we recommend at Data Lens is to use Amazon Web Services, both to store source and RDF data and to host and run your Lenses and Writer.

The aim is to deploy the Lens and other services using AWS by setting up the following architecture:

For more information on the Architecture and Deployment of an Enterprise System, see our guide.

 


 

Ingesting Data

The RESTful Lens supports a number of ways in which to trigger the Lens to commence the ingestion of your data files.

RESTful API Endpoint

First, the easiest way to start the data ingestion into the RESTful Lens is to use the built-in APIs. Sending a GET request to the process endpoint, <lens-ip>:<lens-port>/process (for example, http://127.0.0.1:8080/process), will trigger the Lens. The completed process is confirmed with a success report.
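The request can also be sent from a script. The following is a minimal sketch using Python's standard library, assuming the Lens is reachable at the host and port given and that the success report is JSON (as described under Output Data); the function names and the timeout value are hypothetical.

```python
import json
from urllib.request import urlopen


def process_url(host: str = "127.0.0.1", port: int = 8080) -> str:
    """Build the URL of the process endpoint."""
    return f"http://{host}:{port}/process"


def trigger_process(host: str = "127.0.0.1", port: int = 8080,
                    timeout: float = 300.0) -> dict:
    """Send GET /process and return the parsed JSON report.

    urlopen raises urllib.error.HTTPError for error statuses such as 400.
    """
    with urlopen(process_url(host, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `trigger_process()` against a running Lens blocks until the ingestion has finished or the timeout is reached.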

Cron Job

In addition to the RESTful service, there is also a built-in Quartz Time Scheduler. This uses a user-configurable Cron Expression to set up a time-based job scheduler which will schedule the Lens to ingest your specified data from your configured endpoint periodically at fixed times, dates, or intervals.

For example, the cron expression */50 * * ? * * * triggers the Lens in 50-second increments starting at :00 seconds after the minute, that is, at seconds :00 and :50 of every minute. A more detailed explanation can be found on the Quartz Scheduler website.
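In Quartz, an increment such as */50 is evaluated within the range of its field (0 to 59 for seconds) and restarts every minute, so the gaps between runs are not all equal. The following is a minimal sketch that expands the seconds field only; the function name is hypothetical and it handles just the start/step form.

```python
from typing import List


def seconds_field_fire_times(field: str) -> List[int]:
    """Expand a Quartz seconds field of the form "*/step" or "start/step"."""
    start, step = field.split("/")
    # "*" before the slash is equivalent to starting at 0.
    first = 0 if start == "*" else int(start)
    return list(range(first, 60, int(step)))


print(seconds_field_fire_times("*/50"))  # [0, 50]
```

With */50 the Lens therefore fires at :00 and :50 of each minute, giving alternating gaps of 50 and 10 seconds. A step that divides 60 evenly, such as */30, gives a uniform interval.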

 


 

Output Data

The data files created and output from the Lens are the same regardless of how it was triggered or how the data was ingested; however, the way in which this information is communicated back to you is slightly different for each method.

Endpoint

Once the ingestion and transformation has successfully been processed after being triggered via the Process endpoint, the response returned from the Lens is in the form of a JSON. Within the JSON response is the outputFileLocations element; this element contains a list of all the URLs of generated RDF files. Often this would be a single file, however multiple files will be generated and listed when the ingested JSON document contains references to other documents (Ref. JSON:API Pagination).
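Reading the generated file URLs from the response could look like the following minimal sketch. Only the outputFileLocations element is taken from this guide; the function name, the sample body, and the file name in it are hypothetical, and a real response may contain further elements.

```python
import json
from typing import List


def output_file_locations(response_body: str) -> List[str]:
    """Extract the list of generated RDF file URLs from a process response."""
    payload = json.loads(response_body)
    return list(payload.get("outputFileLocations", []))


# Hypothetical response body for illustration only.
sample_body = '{"outputFileLocations": ["file:///data/example1/output/example-output.nq"]}'
locations = output_file_locations(sample_body)
```

Each entry in `locations` is a URL in one of the forms described in Directories in Lenses.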

Sample output:

Cron Job

If the ingestion has been triggered via the job scheduler, your confirmation of success will come in the form of log messages. Where these can be found depends on your logging configuration.

Kafka

If you have a Kafka Cluster set up and running, then the successfully generated RDF file URL(s) will be pushed to your Kafka Queue. They will be pushed to the Topic specified in the KAFKA_TOPIC_NAME_SUCCESS config option, which defaults to “success_queue”. This will happen with both methods of triggering the Lens. One of the many advantages of using this approach is that this transformed data can now be ingested using our Lens Writer, which will publish the RDF to a Semantic Knowledge Graph (or selection of Property Graphs) of your choice.

Dead Letter Queue

If something goes wrong during the operation of the Lens, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with metadata about that ingestion, allowing the problem to be diagnosed and the data to be re-ingested later. If enabled, the provenance generated for the current ingestion will also be included as JSON-LD. This message will be in the form of a JSON with the following structure:

Data type

The RESTful Lens supports data transformation into two different types: NQuads and JSON-LD. By default, the resulting RDF is represented in the form of NQuads, however by overriding the configuration option OUTPUT_FILE_FORMAT you can change it simply by setting this as json-ld.

 


 

Provenance Data

Within the RESTful Lens, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of data over time, allowing you to see what the state of the data was at any moment. The model we use to record Provenance information is the W3C standard PROV-O model.

Provenance files are uploaded to the location specified in the PROV_OUTPUT_DIR_URL, then this file location is pushed to the Kafka Topic declared in PROV_KAFKA_TOPIC_NAME_SUCCESS. The provenance activities in the RESTful Lens are main-execution, kafkaActivity, and lens-iteration.

For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.

 


 

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Lens, there is a selection of built-in exposed endpoints for you to call.

| API | HTTP Request | URL Template | Description |
| --- | --- | --- | --- |
| Process | GET | /process | Manually triggers RESTful Lens ingestion process |
| Config | GET | /config | Displays all Lens configuration as JSON |
| Config | GET | /config?paths=<config-options> | Displays all Lens configuration specified in the comma-separated list |
| License | GET | /license | Displays license information |
| RML | GET | /rml | Displays the current RML mapping file, this is displayed as Turtle RDF serialisation |
| RML | PUT | /rml | Deploys a new mapping file into Lens specified in the request body |

 

Config

The config endpoint is a GET request that allows you to view the configuration settings of a running Lens. By sending GET http://<lens-ip>:<lens-port>/config (for example http://127.0.0.1:8080/config), you will receive the entire configuration represented as JSON, as seen in this small snippet below. All confidential values (such as AWS credentials) are replaced with the fixed string “REDACTED”.

Alternatively, you can specify exactly what config options you wish to return by providing a comma-separated list of variables under the paths parameter. For example, the request of GET http://<lens-ip>:<lens-port>/config?paths=lens.config.outputDirUrl,logging.loggers would return the following.
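Building the filtered request URL programmatically could look like the following minimal sketch, which uses Python's standard `urllib.parse`. The function name `config_url` is hypothetical; the commas are deliberately left unencoded so that the URL matches the form shown above.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def config_url(host: str = "127.0.0.1", port: int = 8080,
               paths: Optional[Sequence[str]] = None) -> str:
    """Build the /config URL, optionally restricted to the given options."""
    url = f"http://{host}:{port}/config"
    if paths:
        # safe="," keeps the comma-separated list readable in the query string.
        url += "?" + urlencode({"paths": ",".join(paths)}, safe=",")
    return url


url = config_url(paths=["lens.config.outputDirUrl", "logging.loggers"])
```

The resulting `url` can be opened with any HTTP client to retrieve just those two options.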

License

The license endpoint is a GET request that allows you to view information about your license key that is in use on a running lens. By sending GET http://<lens-ip>:<lens-port>/license (for example: http://127.0.0.1:8080/license), you will receive a JSON response containing the following values.

Process

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to ingest your source data. When an execution of the Lens fails after being triggered in this way, the response will be a status 400 Bad Request as follows.

RML

The RML endpoint is all about the mapping file that you created using the Mapping Config Web App. It consists of a GET and a PUT endpoint, allowing you to retrieve the master mapping file currently in use on the Lens, as well as to replace the master mapping file with a new one.

By sending GET http://<lens-ip>:<lens-port>/rml you will receive a response containing the contents of the mapping file written in RDF/Turtle. By sending PUT http://<lens-ip>:<lens-port>/rml with a Turtle mapping file in the body of the request, the Lens will upload it to the file location specified in the MAPPINGS_DIR_URL and MASTER_MAPPING_FILE options in the configuration and replace the existing file. The mapping file should be in RDF/Turtle format and the declared HTTP Content-Type should be text/turtle. A successful upload is indicated by an empty response with HTTP status OK (Ref. RFC-7231), and the new mapping is functional immediately.
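A PUT request of this kind could be prepared as in the following minimal sketch, which uses Python's standard `urllib.request`. The function names and the local file name in the usage note are hypothetical; only the /rml path, the PUT method, and the text/turtle content type come from this guide.

```python
from urllib.request import Request, urlopen


def build_rml_upload(turtle: str, host: str = "127.0.0.1",
                     port: int = 8080) -> Request:
    """Prepare a PUT /rml request carrying a Turtle mapping file."""
    return Request(
        f"http://{host}:{port}/rml",
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="PUT",
    )


def upload_mapping(turtle: str, host: str = "127.0.0.1",
                   port: int = 8080) -> int:
    """Send the request and return the HTTP status (200 on success)."""
    with urlopen(build_rml_upload(turtle, host, port)) as response:
        return response.status
```

For example, `upload_mapping(open("mapping.ttl", encoding="utf-8").read())` would replace the master mapping file on a Lens running locally.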