User Guide - Lens Writer v2.0
Intro
This is the full User Guide for the Lens Writer. It contains an in-depth set of instructions to fully set up, configure, and run the Writer so you can start writing data as part of an end-to-end system. For a guide to getting the Writer up and running in the quickest and simplest way possible, see the Quick Start Guide. Once deployed, you can utilise any of our ready-made sample output NQuads files to test your Writer. For a list of what has changed since the last release, visit the User Release Notes.
Table of Contents
- 1 Intro
- 2 Table of Contents
- 3 Configuring the Writer
- 4 Running the Writer
- 5 Ingesting Source Data (RDF and CSV)
- 5.1 Directories in the Writer
- 5.2 Endpoint
- 5.3 Kafka
- 5.3.1 Dead Letter Queue
- 5.4 Ingesting Mode
- 6 Types of Graph
- 6.1 Semantic Knowledge Graphs
- 6.1.1 Stardog
- 6.1.2 GraphDB
- 6.1.3 Blazegraph
- 6.1.4 AllegroGraph
- 6.1.5 RDFox
- 6.1.6 Amazon Neptune
- 6.1.7 And more…
- 6.2 Property Graphs
- 6.2.1 Neo4j
- 6.2.2 Amazon Neptune (Cypher or Gremlin)
- 6.2.3 Gremlin Tinkerpop Server
- 7 Provenance Data
- 8 REST API Endpoints
- 8.1 Config
- 8.2 Update Config
- 8.2.1 PUT /updateConfig
- 8.3 License
- 8.4 Process
Configuring the Writer
As with the Lenses supplied by Data Lens, the Lens Writer has a wide array of user configuration, all of which can be set and altered both before the startup of the Lens and while a Lens is running. The former is done through environment variables in your Docker container or ECS Task Definition, and the latter through the exposed endpoints, as seen below. For a breakdown of every configuration option in the Lens Writer, see the full list here.
Configuration Manipulation
Accessing the Config
Once the Writer has started and is operational, you can view the current configuration by calling the /config endpoint. This is expanded upon below, including the ability to request specific config properties.
Editing the Config
As explained below, the configuration on a running Writer can be edited through the /updateConfig endpoint.
Backup and Restore Config
A useful feature of the Writer is the ability to back up and restore your configuration. This is particularly beneficial when you’ve made multiple changes to the config on a running Writer and want to be able to restore them without rerunning any update config commands. To back up your config, simply call the /uploadConfigBackup endpoint, and all changes you’ve made to the config will be uploaded to the storage location specified in your CONFIG_BACKUP env var.
To restore your configuration, this must be done on the startup of a Lens, by setting the CONFIG_BACKUP config option as an environment variable in your startup script / task definition. This must, however, be a remote directory such as S3, as anything local will be deleted if a task or container is stopped.
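For illustration (assuming the Writer is reachable locally on port 8080 and that CONFIG_BACKUP points at a remote S3 prefix; the bucket path below is a placeholder):

# Back up the current configuration of a running Writer
curl -X PUT "http://127.0.0.1:8080/uploadConfigBackup"

# Restore it on the next startup by pointing CONFIG_BACKUP at the same remote location
docker run \
  --env CONFIG_BACKUP=s3://example-bucket/writer-config/ \
  --env LICENSE=$LICENSE \
  --env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test \
  --env GRAPH_DATABASE_TYPE=graphdb \
  -p 8080:8080 \
  lens-writer-api:latest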
Configuration Categories
Mandatory Configuration (Local Deployment)
License - LICENSE
This is the license key required to operate the Writer when it is run on a local machine outside of the AWS Marketplace; request your new unique license key here.
Graph Database Configuration
Graph Database Endpoint - GRAPH_DATABASE_ENDPOINT
This is the endpoint of the Graph Database you wish to upload your RDF to, and is therefore required for the Lens Writer to work.
Graph Database Type - GRAPH_DATABASE_TYPE
This is your Graph Database type. Some graphs support the default sparql type (e.g. AllegroGraph); however, certain graphs require a specific type declaration, including graphdb, stardog, blazegraph, neptune-sparql, and rdfox. If you are using a Property Graph, you can set a specific graph provider, including neo4j, neptune-cypher, and neptune-gremlin, or the traversal language cypher or gremlin. Please see the Types of Graph section for more info.
Graph Database Username and Password - GRAPH_DATABASE_USERNAME and GRAPH_DATABASE_PASSWORD
These are the username and password for your Graph Database. You can leave these fields blank if your Graph does not require any authentication.
AWS Configuration
When running the Writer in ECS, these settings are not required, as all credentials are taken directly from the EC2 instance running the Lens. If you wish to use AWS cloud services while running the Lens on-prem, you need to specify an AWS Access Key, Secret Key, and AWS Region. Providing your AWS credentials gives the Writer permission to access, download, and upload remote files in S3 Buckets, and the region option specifies where in AWS your files and services reside. To do this, the Lenses utilise the AWS Default Credential Provider Chain, allowing a number of methods to be used; the simplest is setting the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION. Please note that all services must be in the same region, including if you choose to run the Writer in an EC2 instance.
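As a minimal sketch of the simplest approach, assuming the keys are already exported on the host machine (the region value is a placeholder, and the remaining variables mirror the Docker example later in this guide):

# Supplying AWS credentials via environment variables when running on-prem
docker run \
  --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --env AWS_REGION=eu-west-2 \
  --env LICENSE=$LICENSE \
  --env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test \
  --env GRAPH_DATABASE_TYPE=graphdb \
  -p 8080:8080 \
  lens-writer-api:latest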
Kafka Configuration
One of the many ways to interface with the Writer is through the use of Apache Kafka. With the Lens Writer, a Kafka Message Queue can be used to manage the input of RDF data into the Writer. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Writer. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.
The Kafka Broker is what tells the Writer where to look for your Kafka Cluster, so set this property as follows: <kafka-ip>:<kafka-port>. The recommended port is 9092.
All other Kafka configuration variables can be found here, all of which have default values that can be overridden.
Provenance Configuration
Currently, the Lens Writer does not generate its own provenance meta-data, and so the RECORD_PROVO configuration option is set to false. However, any provenance previously generated is separate from this option and will still be ingested into your Knowledge Graph. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data.
Logging Configuration
Logging in the Lens Writer works the same way as the Lens, and like with most functionality is configurable through the use of environment variables; this list override-able options and their descriptions can be found here. When running the Lens Writer locally from the command line using the instructions below, the Writer will automatically log to your terminal instance. In addition to this, the archives of logs will be saved within the docker container at /var/log/datalens/archive/current/
and /var/log/datalens/json/archive/
for text and JSON logs respectively, where the current logs can be found at /var/log/datalens/text/current/
and /var/log/datalens/json/current/
. By default, a maximum of 7 log files will be archived for each file type, however this can be overridden. If running a Writer on cloud in an AWS environment, then connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.
By default, the Writer logs at INFO level. This can be changed by overriding the LOG_LEVEL_DATALENS option; however, this only takes effect at Lens startup, so changing the level on a running Writer requires a restart.
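For instance, the level can be raised by adding --env LOG_LEVEL_DATALENS=DEBUG to the docker run commands shown in the next section (DEBUG is shown purely as an illustration), and the archived logs can be inspected from the host:

# List the archived JSON log files inside a running container
# (replace lens-writer with your container's name or ID)
docker exec lens-writer ls /var/log/datalens/json/archive/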
Optional Configuration
There is also a further selection of optional configuration options for given situations; see here for the full list.
Running the Writer
The Writer and all of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, including in the cloud or on-premise. This is achieved through the use of Docker Containers. In addition to this, we now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to and run your Writer.
Local Docker Image
To run the Writer locally, first ensure you have Docker installed. Then, by running a command with the following structure, Docker will start the container and run the Writer from your downloaded image.
For UNIX based machines (macOS and Linux):
docker run \
--env LICENSE=$LICENSE \
--env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test \
--env GRAPH_DATABASE_TYPE=graphdb \
--env GRAPH_DATABASE_USERNAME=test \
--env GRAPH_DATABASE_PASSWORD=test \
-p 8080:8080 \
-v /var/local/:/var/local/ \
lens-writer-api:latest
For Windows:
docker run ^
--env LICENSE=%LICENSE% ^
--env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test ^
--env GRAPH_DATABASE_TYPE=graphdb ^
--env GRAPH_DATABASE_USERNAME=test ^
--env GRAPH_DATABASE_PASSWORD=test ^
-p 8080:8080 ^
-v /data/:/data/ ^
lens-writer-api:latest
The above examples demonstrate how to override configuration options using environment variables in your Lens Writer. Line 2 shows an environment variable saved on the machine being passed in, whereas lines 3-6 simply show string values being passed in. Given that the Writer runs on port 8080, line 7 exposes and binds that port on the host machine so that the APIs can be triggered. The -v flag seen on line 8 mounts the working directory into the container; when the host directory of a bind-mounted volume doesn’t exist, Docker will automatically create this directory on the host for you. Finally, line 9 is the name and version of the Docker image you wish to run.
For more information on running Docker images, see the official docs.
Lens Writer Via AWS Marketplace
To run the Lens Writer on AWS, we have full support for the AWS Marketplace. First subscribe to the Lens Writer, then use the CloudFormation template we have created to deploy a one-click solution, starting up an ECS Cluster with all the required permissions and networking, with the Lens running within it as a task. See here for more information about how the template works and what is being initialised.
For more information on the Architecture and Deployment of an Enterprise System, see our guide.
Alternatively, you can manually start the Lens by creating a Task Definition to be run within an ECS or EKS cluster, using the Lens’s Image ID, exposing port 8080, and ensuring there is a Task Role with at least the AmazonS3FullAccess and AWSMarketplaceMeteringRegisterUsage policies attached.
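As a minimal sketch (the account ID, role name, and image URI below are placeholders; real values come from your own AWS account and Marketplace subscription), the relevant parts of such a Task Definition might look like this, with the referenced Task Role having the two policies above attached:

{
  "family": "lens-writer",
  "taskRoleArn": "arn:aws:iam::123456789012:role/LensWriterTaskRole",
  "containerDefinitions": [
    {
      "name": "lens-writer",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/lens-writer-api:latest",
      "portMappings": [
        { "containerPort": 8080, "hostPort": 8080 }
      ],
      "environment": [
        { "name": "GRAPH_DATABASE_ENDPOINT", "value": "https://graphdb.example.com:443/repositories/test" },
        { "name": "GRAPH_DATABASE_TYPE", "value": "graphdb" }
      ]
    }
  ]
}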
Ingesting Source Data (RDF and CSV)
The Lens Writer is designed to ingest RDF NQuads (.nq) data files for Semantic Graphs, and CSV Nodes and Edges data files for Property Graphs. This ingestion can be done in a number of ways.
Directories in the Writer
When ingesting files, the Lens Writer is designed to support files from an array of sources. This includes both local URLs and remote URLs, including cloud-based technologies such as AWS S3. These locations should always be expressed as a URL string (Ref. RFC-3986).
- To use a local URL for directories and files, both the formats file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.
- To use a remote http(s) URL for files, https://example.com/input-rdf-file.nq is supported.
- To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported, where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, ensure your Writer has the necessary credentials / permissions to access S3.
Also included in the Writer is the ability to delete your source data files after they have been ingested into your Graph Database. This is done by setting the DELETE_SOURCE config value to true. Enabling this means that your S3 Bucket or local file store will not continuously fill up with RDF / CSV data generated from your Lenses.
Endpoint
The easiest way to ingest an RDF file into the Lens Writer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a data file to ingest, and in return you will be provided with the success status of the operation.
The structure and parameters for the GET request are as follows: http://<writer-ip>:<writer-port>/process?inputRdfURL=<input-rdf-file-url>, for example, http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq. Once an input file has been successfully processed after being ingested via the Process endpoint, the response returned from the Writer is a JSON document. Within the JSON response are elements containing the input data URL and details of the target Graph Database, for example:
{
"input": "/var/local/input/input-rdf-data.nq",
"graphDatabaseEndpoint": "https://graphdb.example.com:443/repositories/test",
"graphDatabaseProvider": "graphdb",
"databaseType": "SEMANTIC_GRAPH"
}
Now by logging in to your Graph Database and making the necessary queries, you will be able to see the newly inserted source data.
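For reference, the example request above can be issued with curl (assuming the Writer is running locally on port 8080):

curl "http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq"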
Kafka
The second, more versatile and scalable ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here; in short, to ingest source files into the Lens Writer you require a Producer. The topic this Producer publishes to must have the same name that you specified in the KAFKA_TOPIC_NAME_SOURCE config option (defaults to “success_queue”). Please ensure that this is the same as the success queue topic name in the Lenses you wish to ingest transformed data from. Once set up, if manually pushing data to Kafka, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-rdf-data.nq.
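As a minimal sketch using the console producer that ships with Apache Kafka (the broker address and bucket are placeholders, and the topic name is the KAFKA_TOPIC_NAME_SOURCE default):

# Publish a single file URL to the Writer's source topic
echo "s3://examplebucket/folder/input-rdf-data.nq" | \
  kafka-console-producer.sh --bootstrap-server <kafka-ip>:9092 --topic success_queue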
Dead Letter Queue
If something goes wrong during the operation of the Writer, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with meta-data about that ingestion, allowing the problem to be diagnosed and the data later re-ingested. This message will be in the form of a JSON.
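The exact field names are defined by the Writer itself and are not reproduced in this guide; purely as a hypothetical illustration, such a message could look something like:

{
  "input": "s3://examplebucket/folder/input-rdf-data.nq",
  "error": "Description of what went wrong during ingestion",
  "timestamp": "2021-01-01T12:00:00Z"
}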
Ingesting Mode
By default, each ingested dataset is treated as a solid, self-contained dataset and is loaded fully into the final graph store; different datasets are considered independent of each other. In this mode, a new dataset adds new values to already existing subjects and predicates. This default behaviour is controlled by the INGESTION_MODE parameter (see here). For example:
Existing data:
New data:
Final data:
In the update mode (value update of the parameter), the ingested data are used to replace the already existing data. This mode is fully supported by the RDF standard and is built into all Semantic Knowledge Graphs. Since property graphs such as Neo4j usually require a separate add-on to persist RDF graphs, this feature must be supported by the extension. Currently we have implemented this support in the neosemantics plug-in for Neo4j versions 3.5.x; the customised version of the plug-in can be downloaded from here.
The update mode is demonstrated in the example below.
Existing data:
New data:
Final data:
Please note the above only applies to Semantic Graphs; in Property Graphs the Writer defaults to an upsert pattern, updating the properties for a given node or edge.
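As a hypothetical illustration of the difference between the two modes (all IRIs and values below are invented for this sketch), suppose the graph already holds one value for a subject and predicate, and a new dataset supplies another:

Existing data:
<http://example.com/person/1> <http://example.com/hasEmail> "old@example.com" .

New data:
<http://example.com/person/1> <http://example.com/hasEmail> "new@example.com" .

Final data in the default mode (both values are kept):
<http://example.com/person/1> <http://example.com/hasEmail> "old@example.com" .
<http://example.com/person/1> <http://example.com/hasEmail> "new@example.com" .

Final data in the update mode (the new value replaces the existing one):
<http://example.com/person/1> <http://example.com/hasEmail> "new@example.com" .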
Types of Graph
The Lens Writer has support for the writing of RDF data to all Semantic Knowledge Graphs, as well as a wide selection of Property Graphs (with more being continuously added).
Semantic Knowledge Graphs
As previously stated, Data Lens and its Writer support writing to all Semantic Knowledge Graph databases. This includes, but is not exclusive to, Stardog, GraphDB, Blazegraph, AllegroGraph, MarkLogic, RDFox, and Amazon Neptune.
Stardog
Stardog has an Enterprise Knowledge Graph platform as well as a feature-rich IDE for querying and visualising your data. To import data into your Stardog graph you must set your GRAPH_DATABASE_TYPE to stardog and your GRAPH_DATABASE_ENDPOINT to the structure of https://stardog.example.com:443/test.
GraphDB
GraphDB is an enterprise-ready Semantic Graph Database, compliant with W3C Standards. To import data into your GraphDB graph you must set your GRAPH_DATABASE_TYPE to graphdb and your GRAPH_DATABASE_ENDPOINT to the structure of https://graphdb.example.com/repositories/test.
Blazegraph
Blazegraph DB is an ultra-high-performance graph database supporting Blueprints and RDF/SPARQL APIs. To import data into your Blazegraph you must set your GRAPH_DATABASE_TYPE to blazegraph and your GRAPH_DATABASE_ENDPOINT to the structure of https://blazegraph.example.com/blazegraph/namespace/test.
AllegroGraph
AllegroGraph is a modern, high-performance, persistent graph database with efficient memory utilisation. To import data into your AllegroGraph you must leave your GRAPH_DATABASE_TYPE as the default sparql and set your GRAPH_DATABASE_ENDPOINT to the structure of http://allegrograph.example.com:10035/repositories/test.
RDFox
RDFox is the first market-ready high-performance knowledge graph designed from the ground up with semantic reasoning in mind. To import data into your RDFox you must set your GRAPH_DATABASE_TYPE to rdfox and your GRAPH_DATABASE_ENDPOINT to the structure of http://rdfox.example.com:8090/datastores/test/sparql.
Amazon Neptune
Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. To import data into your Neptune database you must set your GRAPH_DATABASE_TYPE to neptune and your GRAPH_DATABASE_ENDPOINT to the structure of https://example.xyz123.us-east-1.neptune.amazonaws.com:8182. This endpoint will be provided to you by AWS when setting up your Neptune instance. Please ensure that this resides in the same region as the rest of your AWS services.
And more…
With support for all Knowledge Graphs, your Knowledge Graph of choice may not have been listed. If so, it is recommended to leave the GRAPH_DATABASE_TYPE as the default sparql and set your GRAPH_DATABASE_ENDPOINT to what is specified in your graph’s user docs. For help and assistance, contact us.
Property Graphs
A Property Graph is a directed, vertex-labelled, edge-labelled multigraph with self-edges, where edges have their own identity. In the Property Graph paradigm, the term node is used to denote a vertex, and relationship to denote an edge. Semantic Graphs and Property Graphs differ in their implementation and technologies, therefore importing into this type differs slightly. Fortunately, the Writer makes this as simple as with Semantic Graphs, requiring the same process of setting your Graph Database as before. Instead of RDF, the source files for ingesting data into your Property Graph are CSV Nodes and Edges files; these can be produced in the Lens by configuring it to be in Property Graph Mode.
Neo4j
Neo4j is a Property Graph database management system with native graph storage and processing. To import data into your Neo4j Graph, set your GRAPH_DATABASE_TYPE to neo4j and your GRAPH_DATABASE_ENDPOINT to the structure of bolt://example.com:7687, optionally with a database name appended to the URL after a forward slash, e.g. bolt://example.com:7687/test. Once imported, you can query your data with the Cypher query language using Neo4j’s products and tools.
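For instance, once data has been written, you could check it from the command line with cypher-shell (the address and credentials are placeholders, and the query is a generic node count rather than anything specific to the Writer's output model):

cypher-shell -a bolt://example.com:7687 -u test -p test "MATCH (n) RETURN count(n);"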
Amazon Neptune (Cypher or Gremlin)
As well as SPARQL, Amazon Neptune also supports openCypher and Gremlin. To import your data in these formats, please set your GRAPH_DATABASE_TYPE to neptune-cypher or neptune-gremlin respectively, and your GRAPH_DATABASE_ENDPOINT as before.
Gremlin Tinkerpop Server
Apache TinkerPop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). To import data into your TinkerPop Graph, set your GRAPH_DATABASE_TYPE to gremlin and your GRAPH_DATABASE_ENDPOINT to the structure of http://example.com:8182.
Provenance Data
Within the Lenses, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of data over time, allowing you to see what the state of the data was at any moment. The model we use to record provenance information is the W3C standard PROV-O model. Currently, the Lens Writer does not generate its own provenance meta-data; however, any provenance previously generated will still be ingested into your Knowledge Graph. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data. Having provenance pushed to a separate Kafka Topic allows a different Lens Writer to be set up, enabling you to push provenance to a separate Knowledge Graph from the RDF generated from your source data.
For more information on how the provenance is laid out, as well as how to query it from your Knowledge Graph, see the Provenance Guide.
REST API Endpoints
In addition to the Process Endpoint designed for ingesting data into the Writer, there is a selection of built-in exposed endpoints for you to call.
API | HTTP Request | URL Template | Description |
---|---|---|---|
Process | GET | /process?inputRdfURL=<input-rdf-file-url> | Tells the Writer to ingest the RDF file located at the specified URL location |
Config | GET | /config | Displays all Writer configuration as JSON |
 | GET | /config?paths=<comma-separated-list> | Displays the Writer configuration options specified in the paths parameter as a comma-separated list |
Check Connection | GET | | Tests whether the Writer can establish a connection with the specified Knowledge Graph credentials |
Update Config | PUT | /updateConfig?configEntry=<entry>&configValue=<value> | Updates configuration options on a running Writer |
Upload Config Backup | PUT | /uploadConfigBackup | Uploads the current configuration to the specified config backup location so that it can be restored at a later date |
License | GET | /license | Displays license information |
Restart Kafka | GET | | Turns the Writer’s Kafka connection on or off depending on its current state. |
Config
The config endpoint is a GET request that allows you to view the configuration settings of a running Writer. By sending GET http://<writer-ip>:<writer-port>/config (for example http://127.0.0.1:8080/config), you will receive the entire configuration represented as JSON.
Alternatively, you can specify exactly which config options you wish to return by providing a comma-separated list of variables under the paths parameter. For example, the request GET http://<writer-ip>:<writer-port>/config?paths=lens.config.tripleStore.endpoint,logging.loggers would return only those entries, as illustrated below.
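As an illustrative sketch only (the nesting mirrors the dotted paths in the request; the endpoint value reuses the GraphDB example from earlier in this guide, and the loggers content shown is hypothetical):

{
  "lens": {
    "config": {
      "tripleStore": {
        "endpoint": "https://graphdb.example.com:443/repositories/test"
      }
    }
  },
  "logging": {
    "loggers": {
      "datalens": "INFO"
    }
  }
}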
Update Config
PUT /updateConfig
The configuration on a running Writer can now be edited without having to restart it. This is done through the update config endpoint. For example, running /updateConfig?configEntry=friendlyName&configValue=GraphBuilder changes the friendly name of the Writer to GraphBuilder. To see a list of the configuration entry names, consult the Lens Writer Configurable Options.
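For example, issued with curl against a locally running Writer:

curl -X PUT "http://127.0.0.1:8080/updateConfig?configEntry=friendlyName&configValue=GraphBuilder"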
License
The license endpoint is a GET request that allows you to view information about the license key in use on a running Lens or Writer. By sending GET http://<writer-ip>:<writer-port>/license (for example: http://127.0.0.1:8080/license), you will receive a JSON response containing your license details.
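Again, for a locally running Writer this can be issued with curl:

curl "http://127.0.0.1:8080/license"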
Process
As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to ingest your source data. When an execution of the Writer fails after being triggered in this way, the response will be a 400 Bad Request status and contain a response message similar to that sent to the dead letter queue, as outlined above.