User Guide - SQL Lens v2.0
Intro
This is the full User Guide for the SQL Lens, it contains an in-depth set of instructions to fully set up, configure, and run the Lens so you can start ingesting data as part of an end-to-end system. For a guide to get the Lens up and running in the quickest and simplest possible way, see the Quick Start Guide. Once deployed, you can utilise any of our ready-made sample input, mapping, and expected output files to test your Lens. For a list of what has changed since the last release, visit the User Release Notes.
Table of Contents
- 1 Intro
- 2 Table of Contents
- 3 Creating a Mapping File
- 4 Configuring the Lens
- 4.1 Configuration Manipulation
- 4.1.1 Accessing the Config
- 4.1.2 Editing config
- 4.1.3 Backup and Restore Config
- 4.2 Configuration Categories
- 4.2.1 Mandatory Configuration (Local Deployment)
- 4.2.2 Lens Directories Configuration
- 4.2.2.1 Directories in Lenses
- 4.2.3 SQL Configuration
- 4.2.4 AWS Configuration
- 4.2.5 Kafka Configuration
- 4.2.6 Provenance Configuration
- 4.2.7 Logging Configuration
- 4.2.8 Optional Configuration
- 4.1 Configuration Manipulation
- 5 Running the Lens
- 6 Ingesting Data / Triggering the Lens
- 6.1 RESTful API Endpoint
- 6.2 Cron Job
- 6.3 Kafka
- 7 Output Data
- 7.1 Endpoint
- 7.2 Cron Job
- 7.3 Kafka
- 7.3.1 Dead Letter Queue
- 8 Provenance Data
- 9 REST API Endpoints
- 9.1 Config
- 9.1.1 GET /config
- 9.2 Update Config
- 9.2.1 PUT /updateConfig
- 9.3 License
- 9.3.1 GET /license
- 9.4 Process
- 9.4.1 GET /process
- 9.5 RML
- 9.6 Custom Functions
- 9.1 Config
Creating a Mapping File
The first step in configuring the SQL Lens is to create a mapping file. The mapping file is what creates the links between your source database and your target model (ontology). To assist with the creation of your RML mapping files, please see our detailed step by step guide on creating one from scratch. Within this mapping file is also where you specify your SQL Database credentials, along with any SQL queries you may want to execute in order to retrieve your data.
Configuring the Lens
Each of the Lenses have a wide array of user configuration, all of which can be set and altered both before the startup of the Lens and during the running of a Lens. The former is done through the use of environment variables in your Docker container or ECS Task Definition, and latter is done through the use of an exposed endpoints, as seen below. For a breakdown of every configuration option in the SQL Lens, see the full list here.
Configuration Manipulation
Accessing the Config
Once a Lens has started and is operational, you can request to view the current configuration by calling the /config
endpoint. This is expanded upon below, including the ability to specify specific config properties.
Editing config
As explained below, the configuration on a running Lens can be edited through the /updateConfig
endpoint.
Backup and Restore Config
A useful feature in the Lens, is the ability to backup and restore your configuration. This is particularly beneficial when you’ve made multiple changes to the config on a running Lens, and want to be able to restore this without rerunning any update config commands. To backup your config, simply call the /uploadConfigBackup
endpoint, and all changes you’ve made to the config will be uploaded to the storage location specified in your CONFIG_BACKUP
env var.
To restore your configuration, this must be done on the startup of a Lens, therefore, by setting the CONFIG_BACKUP
config option as an environment variable in your startup script / task definition. This should be a remote directory such as S3, when being run as a task on ECS. If run locally, then ensure the container is mounted to a local volume, otherwise all container storage will be deleted when it is stopped.
Configuration Categories
Mandatory Configuration (Local Deployment)
License -
LICENSE
This is the license key required to operate the Lens when being run on a local machine outside of AWS Marketplace, request your new unique license key here.
Lens Directories Configuration
Lens Directory -
LENS_DIRECTORY
This is the directory where all Lens files are stored (assuming individual file dir config haven’t been edited). On Lens startup, if this has been declared, it will create folders at the specified location for mapping, output, yaml-mapping, provenance output, and config backup.
By default, this option is set to a local directory within the docker container (
file:///var/local/
) so isn't mandatory. As with all directories in the Lens, this can be either local or on a remote S3 bucket - we recommend using S3 when running the Lens on AWS (for example -s3://example-bucket/sqllens/
)
Mapping Directory URL -
MAPPINGS_DIR_URL
This is the directory where your mapping file(s) is located. All mapping files within this directory are downloaded and added to the store for processing.
Output directory URL -
OUTPUT_DIR_URL
This is the directory where all generated RDF files are saved to. This also supports local and remote URLs.
Provenance Output Directory URL -
PROV_OUTPUT_DIR_URL
Out of the box, the SQL Lens supports Provenance and it is generated by default. Once generated, the Provenance is saved to separate output files to the transformed source data. This option specifies the directory where provenance RDF files are saved to, which also supports local and remote URLs.
If you do not wish to generate Provenance, you can turn it off by setting the
RECORD_PROVO
variable to false. In this case, thePROV_OUTPUT_DIR_URL
option is no longer required. For more information on Provenance configuration, see below.
Config Backup -
CONFIG_BACKUP
The Lens supports functionality to backup your configuration in the scenario where you wish to reboot your Lens. Upon calling the upload config endpoint, your configuration settings will be backed up to the URL directory specified here. It must be a remote directory such as S3 to support rebooting of the Lens.
Directories in Lenses
The Lenses are designed to support files and directories from an array of sources. This includes both local URLs and remote URLs including cloud-based technologies such as AWS S3. The location should be expressed as a URL string (Ref. RFC-3986).
To use a local URL for directories and files, both the format of
file:///var/local/data-lens-output/
and/var/local/data-lens-output/
are supported.To use a remote http(s) URL for files,
https://example.com/input-file.csv
is supported.To use a remote AWS S3 URL for directories and files,
s3://example/folder/
is supported where the format iss3://<bucket-name>/<directory>/<file-name>
. If you are using an S3 bucket for any directory, and not running the Lens via the Marketplace, then you must specify an AWS access key and secret key.
SQL Configuration
The configuration concerned with connecting to your SQL Database credentials is taken care of in the mapping file, this also includes any SQL queries you may want to execute in order to retrieve your data. To aid with the performance of the SQL Lens, additional configuration values have been devised to split up the processing of larger data set.
SQL Limit -
SQL_LIMIT
The SQL Limit, when set to a value less than the total number of records in your database, forces the Lens to execute an iterative processing operation. The max number of records that is processed in any one iteration is determined by this configuration option. The Lens will batch process the records from the query and output multiple RDF files until the end of the database is reached.
This value must be an integer greater than zero. It defaults to zero, meaning that iterative queries are switched off.
SQL Offset -
SQL_OFFSET
The SQL Offset provides the ability to offset the start index of the iterative processing. This defaults to zero.
Concurrent Threads -
CONCURRENT_THREADS
When the Lens is executing an iterative query, these iterations will be run in parallel. This option allows you to specify the number of threads that will be used for concurrent iterative executions. If you specify a number of threads greater than what is available, the maximum number will be used.
Continue on Error -
CONTINUE_ON_ERROR
If there is an error during an iteration, this determines whether the execution will continue or halt. This defaults to false.
AWS Configuration
When running the Lens in ECS, these settings are not required as all credentials are taken directly from the EC2 instance running the Lens. If you wish to use AWS cloud services while running the Lens on-prem, you need to specify an AWS Access Key and Secret Key, and AWS Region. By providing your AWS credentials, this will give you permission for accessing, downloading, and uploading remote files to S3 Buckets. The S3 Region option specifies the region of where in AWS your files and services reside. To do this, the Lenses utilise the AWS Default Credential Provider Chain, allowing for a number of methods to be used. The simplest is by setting the environment variables for AWS_ACCESS_KEY_ID
, AWS_SECRET_ACCESS_KEY
, and AWS_REGION
.
Kafka Configuration
One of the many ways to interface with the Lens is through the use of Apache Kafka. With the SQL Lens, a Kafka Message Queue can be used for managing the output of data from the Lens, and to trigger a Lens transformation. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Lens. If you wish to use Kafka, you must switch it on by setting the variable LENS_RUN_STANDALONE
to false.
The Kafka Broker is what tells the Lens where to look for your Kafka Cluster, so set this property as follows: <kafka-broker>:<kafka-port>
. The recommended port is 9092
.
All other Kafka configuration variables can be found here, all of which have default values that can be overridden.
Provenance Configuration
As previously mentioned, Provenance is generated by default, this can be turned off by setting the RECORD_PROVO
variable to false, otherwise the prov output files will be stored at the dir specified by PROV_OUTPUT_DIR_URL
. If you wish to store this Provenance remotely in an S3 Bucket, then you are required to specify your region, access key, and secret key, as explained previously in the AWS Configuration section.
If you wish to manage the Provenance output files through Kafka, then you can choose to use the same brokers and topic names as with the previously specified data files, or an entirely different cluster. All Provenance configuration can be found here.
Logging Configuration
When running the Lens locally from the command line using the instructions below, the Lens will automatically log to your terminal instance. In addition to this, the archives of logs will be saved within the docker container at /var/log/datalens/archive/current/
and /var/log/datalens/json/archive/
for text and JSON logs respectively, where the current logs can be found at /var/log/datalens/text/current/
and /var/log/datalens/json/current/
. By default, a maximum of 7 log files will be archived for each file type, however this can be overridden. If running a Lens on cloud in an AWS environment, then connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply. Alternatively, configuring CloudWatch Logs is the easiest way to view your Len’s live logging.
By default, the Lens logs at INFO
level, this can be changed by overriding the LOG_LEVEL_DATALENS
option, however can only be done on Lens startup and will require a reboot if not.
Optional Configuration
There is also a further selection of optional configurations for given situations, see here for the full list.
Running the Lens
All of our Lenses are designed and built to be versatile, allowing them to be set up and run on a number of environments, including in cloud or on-premise. This is achieved through the use of Docker Containers. In addition to this, we now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to and run your Lens from.
Local Docker Image
To run the Lens' Docker image locally, please first ensure you have Docker installed. Once installed, execute a docker run command with the following structure, and Docker will start the container and run the Lens from your downloaded image.
For UNIX based machines (macOS and Linux):
docker run \
-e LICENSE=<<<REQUEST LICENSE>>> \
-e LENS_DIRECTORY=file:///data/sqllens/ \
-e RECORD_PROVO=false \
-p 8080:8080 \
-v /User/DataLens/sqlens/:/data/sqllens/ \
lens-sql:latest
For Windows:
docker run ^
-e LICENSE=<<<REQUEST LICENSE>>> ^
-e LENS_DIRECTORY=file:///data/sqllens/ ^
-e RECORD_PROVO=false ^
-p 8080:8080 ^
-v //c//User/DataLens/sqlens/:/data/sqllens/ ^
lens-sql:latest
The above examples demonstrate how to override configuration options using environment variables in your Lens. Given the Lens is run on port 8080, line 5 exposes and binds that port of the host machine so that the APIs can be triggered. The -v
flag seen on line 6 mounts the working directory into the container; when the host directory of a bind-mounted volume doesn’t exist, Docker will automatically create this directory on the host for you. And finally, line 7 is the name and version of the Docker image you wish to run. For more information of running Docker Images, see the official Docs.
SQL Lens via AWS Marketplace
To run the SQL Lens on AWS, we have full support for the AWS Marketplace. First subscribe to the SQL Lens, then use the CloudFormation template we have created to deploy a one click solution, starting up an ECS Cluster with all the required permissions and networking, with the Lens running within as a task. See here for more information about how the template works and what is being initialised.
For more information on the Architecture and Deployment of an Enterprise System, see our guide.
Alternatively, you can manually start the Lens by creating a Task Definition to be run within an ECS or EKS cluster, and using the Lens’s Image ID, exposing the port 8080, and ensuring there is a Task Role with at least the AmazonS3FullAccess
and AWSMarketplaceMeteringRegisterUsage
included.
Ingesting Data / Triggering the Lens
While the mapping file previously created handles the querying of your target databases, triggering the Lens to start the ingestion of data can be done in a number of ways.
RESTful API Endpoint
First, the easiest way to trigger the database ingestion in the SQL Lens is to use the built-in APIs. Using the process
endpoint will trigger the Lens, for example, using a GET request: <lens-ip>:<lens-port>/process
, for example, http://127.0.0.1:8080/process. The completed process is confirmed with a success report.
Cron Job
In addition to the RESTful service, there is also a built-in Quartz Time Scheduler. This uses a user-configurable Cron Expression to set up a time-based job scheduler which will schedule the Lens to ingest your specified data from your database(s) periodically at fixed times, dates, or intervals.
For example, 0 */30 * ? * * *
translates to triggering the Lens every 30 minutes starting at :00 or :30 minutes after the hour. A more detailed explanation can be found on the Quartz Scheduler website.
By default, cron executions are turned off. To turn it on and set a cron expression, use the /updateCron
endpoint as explained below.
Kafka
The Lens can also be triggered by sending a message to a Kafka queue. To utilise this you must ensure LENS_RUN_STANDALONE
is set to false
and the Kafka configuration is configured correctly for you brokers. Once set up, you can simply push an empty message to the topic specified by the KAFKA_TOPIC_NAME_SOURCE
.
Output Data
The Lenses can create output data for both Semantic Knowledge Graphs and Property Graphs. The data produced for Knowledge Graphs is RDF, whereby for Property Graphs it is in the form of two CSV files, one for nodes and one for edges.
When creating RDF data for Semantic Knowledge Graphs, the Lens supports a wide number of data types: NQuads, NTriples, JSON-LD, Turtle, Trig, and Trix. By default, the resulting RDF is represented in the form of NQuads or NTriples if provenance is off, however this can be changed by setting the configuration option OUTPUT_FILE_FORMAT
to either nquads
, ntriples
, jsonld
, turtle
, trig
, or trix
.
To create CSV output data for a Property Graph, you must turn on the Property Graph mode by setting PROPERTY_GRAPH_MODE
to true
and then selecting your graph provider by setting PG_GRAPH
to either neptune
, tigergraph
, neo4j
, or default
.
The RDF or CSV data files created and output from the Lens are the same and are not dependant on how it was triggered. The way in which this information is communicated back to you varies slightly for each method.
Endpoint
After being trigger via the Process endpoint, once ingestion and transformation of data has successfully been completed, the response returned from the Lens is in the form of a JSON object. This JSON response contains a list of all the URLs of generated RDF files. Multiple files are only generated when specifying an SQL Limit and Offset in your mapping query, or multiples of two files are created when using Property Graph mode.
Sample Knowledge Graph response:
{
"successfulIterations": 3,
"outputFileLocations": [
"file:///data/sqllens/output/SQL-Lens-31235bfd-a6fb-43aa-bb51-e3d41c481983.nq",
"file:///data/sqllens/output/SQL-Lens-5b27a2a4-c971-4fd4-b128-d131bf5c3981.nq",
"file:///data/sqllens/output/SQL-Lens-5ecd6e81-65ff-4dbc-aed8-d6d320b7d04a.nq"
],
"processingTime": 20
}
Sample Property Graph response:
Cron Job
If the ingestion has been triggered via the job scheduler, your confirmation of success will come in the form of log messages, these can be found dependant on your configuration.
Kafka
If you have a Kafka Cluster set up and running, then the successfully generated RDF file URL(s) will be pushed to you Kafka Queue. It will be pushed to the Topic specified in the KAFKA_TOPIC_NAME_SUCCESS
config option, which defaults to “success_queue”. This will happen with both methods of triggering the Lens. One of the many advantages of using this approach is that now this transformed data can be ingested using our Lens Writer which will publish the RDF to a Semantic Knowledge Graph or CSV to a Property Graph of your choice!
Dead Letter Queue
If something goes wrong during the operation of the Lens, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong along with meta-data about that ingestion, allowing for the problem to be diagnosed and later re-ingested. If enabled, the provenance generated for the current ingestion will also be included as JSON-LD. This message will be in the form of a JSON object with the following structure:
Provenance Data
Within the SQL Lens, time-series data is supported as standard, every time a Lens ingests some data we add provenance information. This means that you have a full record of data over time, allowing you to see what the state if the data was at any moment. The model we use to record Provenance information is the w3c standard PROV-O model.
Provenance files are uploaded to the location specified in the PROV_OUTPUT_DIR_URL
, then this file location is pushed to the Kafka Topic declared in PROV_KAFKA_TOPIC_NAME_SUCCESS
. The provenance activities in the SQL Lens are main-execution
, kafkaActivity
, queryProcessing
, tableProcessing
, and singleQueryExecution
.
For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.
REST API Endpoints
In addition to the Process Endpoint designed for triggering the ingestion of data into the Lens, there is a selection of built-in exposed endpoints for you to call.
API | HTTP Request | URL Template | Description |
---|---|---|---|
Process | GET |
| Manually triggers SQL Lens process. |
Config | GET |
| Displays configuration as JSON string |
GET |
| Displays all Lens configuration specified in the comma-separated list | |
Update Config | PUT |
| Update configuration options on a running Lens |
Upload Config Backup | PUT |
| Uploads the current configuration to the specified config backup location so that it can be restored at a later date. |
Test DB Connection | GET |
| Tests whether the database credentials you have provided successfully connect to your SQL DB. Please note that the password must be Base64 encoded. |
Query Database | GET |
| Runs a SQL query against the specified database. Returns a CSV if returnCSV is set to true, otherwise a json object is returned. Please note that the password must be Base64 encoded. In addition, you can provide JDBC parameters into your connection. This is done by passing any number of |
License | GET |
| Displays license information |
RML | GET |
| Displays the current RML mapping file, this is displayed in Turtle RDF serialisation |
PUT |
| Deploys a new mapping file into Lens specified in the request body | |
Functions | GET |
| Allows for the deployment of new custom functions to the Lens. It will download the files set in the |
Update Cron | PUT |
| Updates the cron expression used for scheduling. If called without a cron value, the Lens with turn off the scheduler. |
Restart Kafka | GET |
| Turns Kafka on or off depending on its current state. |
Config
GET /config
The config endpoint is a GET request that allows you to view the configuration settings of a running lens. By sending GET http://<lens-ip>:<lens-port>/config
(for example http://127.0.0.1:8080/config
), you will receive the entire configuration represented as a JSON object, as seen in this small snippet below. All confidential values (such as credentials) are are hidden. This endpoint is also useful as a means of Health Checking the Lens.
Alternatively, you can specify exactly what config options you wish to return by providing a comma-separated list of variables under the paths
parameter. For example, the request of GET http://<lens-ip>:<lens-port>/config?paths=lens.config.outputDirUrl,logging.loggers
would return the following.
Update Config
PUT /updateConfig
The configuration on running Lens can now be edited without having to restart. This is done through the update config endpoint. For example, by running the following /updateConfig?configEntry=friendlyName&configValue=GraphBuilder
we have changed the friendly name of the Lens to GraphBuilder. To see a list of the configuration entry names, consult the SQL Lens Configurable Options.
License
GET /license
The license endpoint is a GET request that allows you to view information about your license key that is in use on a running lens. By sending GET http://<lens-ip>:<lens-port>/license
(for example: http://127.0.0.1:8080/license
), you will receive a JSON object response containing the following values.
Process
GET /process
As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to start ingestion your database data. When an execution of the Lens fails after being triggered in this way, the response will be a status 400 Bad Request
and contain a response message similar to that sent to the dead letter queue as outlined above.
RML
The RML endpoint is all about the mapping file that you would have already created. It consists of a GET and a PUT endpoint, allowing you to get the current master mapping file currently in use on the Lens, and well as replacing the master mapping file with a new one.
GET /rml
By sending GET http://<lens-ip>:<lens-port>/rml
you will receive a response containing the contents of the mapping file written in RDF/Turtle.
PUT /rml
By sending PUT http://<lens-ip>:<lens-port>/rml
with a turtle mapping file in the body of the request, it will upload it to the file location specified in the MAPPINGS_DIR_URL
and MASTER_MAPPING_FILE
options in the configuration and replace the existing file. The mapping file should be in RDF/Turtle format and the declared HTTP Content-Type
should be text/turtle
. The successful upload is then indicated by an empty response with HTTP status OK
(Ref. RFC-7231) and will be functional immediately.
Custom Functions
If when designing your mapping file for your Lens, you require a function to be executed that cannot perform your required operation simply by using the built-in functions, it is possible to create and use your own. This can be done by setting your CUSTOM_FUNCTION_JAR_URL
and CUSTOM_FUNCTION_TTL_URL
config options to point at your jar and ttl files, and calling this endpoint to download and set these files. For further instructions on how to correctly carry out this process, please see our guide.