User Guide - Lens Writer v2.0

Intro

This is the full User Guide for the Lens Writer. It contains an in-depth set of instructions for fully setting up, configuring, and running the Writer so you can start writing data as part of an end-to-end system. For a guide to getting the Writer up and running in the quickest and simplest way possible, see the Quick Start Guide. Once deployed, you can use any of our ready-made sample output NQuads files to test your Writer. For a list of what has changed since the last release, visit the User Release Notes.

 


Configuring the Writer

As with the Lenses supplied by Data Lens, the Lens Writer has a wide array of user configuration, all of which can be set and altered both before the startup of the Lens and while a Lens is running. The former is done through environment variables in your Docker container or ECS Task Definition, and the latter through exposed endpoints, as seen below. For a breakdown of every configuration option in the Lens Writer, see the full list here.

Configuration Manipulation

Accessing the Config

Once the Writer has started and is operational, you can view the current configuration by calling the /config endpoint. This is expanded upon below, including the ability to request specific config properties.

Editing the Config

As explained below, the configuration on a running Writer can be edited through the /updateConfig endpoint.

Backup and Restore Config

A useful feature of the Writer is the ability to back up and restore your configuration. This is particularly beneficial when you’ve made multiple changes to the config on a running Writer and want to be able to restore them without rerunning any update config commands. To back up your config, simply call the /uploadConfigBackup endpoint, and all changes you’ve made to the config will be uploaded to the storage location specified in your CONFIG_BACKUP env var.

Restoring your configuration must be done on the startup of a Lens, by setting the CONFIG_BACKUP config option as an environment variable in your startup script / task definition. This must, however, be a remote directory such as S3, as anything stored locally will be deleted when a task or container is stopped. A minimal sketch of this workflow is shown below.
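As a minimal sketch, assuming the Writer is running locally on port 8080 and that an S3 location (s3://example-bucket/writer-config/ is a placeholder) has been set in CONFIG_BACKUP:

# Save the current (edited) configuration to the backup location
curl -X PUT "http://127.0.0.1:8080/uploadConfigBackup"

# On the next startup, supply the same location so the saved configuration is restored
docker run \
  --env CONFIG_BACKUP=s3://example-bucket/writer-config/ \
  --env LICENSE=$LICENSE \
  lens-writer-api:latest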

Configuration Categories

Mandatory Configuration (Local Deployment)

  • License - LICENSE

    • This is the license key required to operate the Writer when it is being run on a local machine outside of the AWS Marketplace; request your new unique license key here.

Graph Database Configuration

  • Graph Database Endpoint - GRAPH_DATABASE_ENDPOINT

    • This is the endpoint of the Graph Database you wish to upload your RDF to, and is therefore required for the Lens Writer to work.

  • Graph Database Type - GRAPH_DATABASE_TYPE

    • This is your Graph Database type. Some graphs will support the default sparql type (e.g. AllegroGraph); however, certain graphs require a specific type declaration. These include graphdb, stardog, blazegraph, neptune-sparql, and rdfox.

    • If you are using a Property Graph, you can set a specific Graph provider, including neo4j, neptune-cypher, neptune-gremlin, or the traversal language cypher or gremlin.

    • Please see the Types of Graph section for more info.

  • Graph Database Username and Password - GRAPH_DATABASE_USERNAME and GRAPH_DATABASE_PASSWORD

    • This is the username and password of your Graph Database. You can leave these fields blank if your Graph does not require any authentication.

AWS Configuration

When running the Writer in ECS, these settings are not required, as all credentials are taken directly from the EC2 instance running the Lens. If you wish to use AWS cloud services while running the Lens on-prem, you need to specify an AWS Access Key, Secret Key, and AWS Region. Providing your AWS credentials gives the Writer permission to access, download, and upload remote files in S3 Buckets. The S3 Region option specifies the AWS region where your files and services reside. The Lenses use the AWS Default Credential Provider Chain, allowing a number of methods to be used; the simplest is setting the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION. Please note that all services must be in the same region, including if you choose to run the Writer in an EC2 instance.
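For example, when running on-prem, these credentials can be passed to the Writer’s container as additional --env flags (the key values below are AWS’s documentation placeholders and the region is illustrative):

  --env AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
  --env AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
  --env AWS_REGION=eu-west-2 \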

Kafka Configuration

One of the many ways to interface with the Writer is through the use of Apache Kafka. With the Lens Writer, a Kafka Message Queue can be used for managing the input of RDF data into the Writer. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Writer. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.

The Kafka Broker is what tells the Writer where to look for your Kafka Cluster, so set this property as follows: <kafka-ip>:<kafka-port>. The recommended port is 9092.

All other Kafka configuration variables can be found here, all of which have default values that can be overridden.
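For example, the relevant variables can be supplied as --env flags at startup, either to run without Kafka or to point the Writer at a cluster (the broker variable name below is an assumption; use the name given in the full configuration list):

  # Run without Kafka
  --env LENS_RUN_STANDALONE=true \

  # Or connect to a Kafka Cluster (hypothetical variable name)
  --env LENS_RUN_STANDALONE=false \
  --env KAFKA_BROKERS=10.0.0.5:9092 \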

Provenance Configuration

Currently, the Lens Writer does not generate its own provenance meta-data, so the RECORD_PROVO configuration option is set to false. However, any provenance previously generated is separate from this option and will still be ingested into your Knowledge Graph. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data.

Logging Configuration

Logging in the Lens Writer works the same way as in the Lenses and, like most functionality, is configurable through environment variables; the list of override-able options and their descriptions can be found here. When running the Lens Writer locally from the command line using the instructions below, the Writer will automatically log to your terminal instance. In addition, archived logs are saved within the Docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type; however, this can be overridden. If running a Writer in the cloud in an AWS environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.

By default, the Writer logs at INFO level. This can be changed by overriding the LOG_LEVEL_DATALENS option; however, this only takes effect on Lens startup, so changing it on a running Writer requires a restart.
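For example, to get more verbose output, the level can be overridden when starting the container (DEBUG is an illustrative value):

  --env LOG_LEVEL_DATALENS=DEBUG \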

Optional Configuration

There is also a further selection of optional configuration options for specific situations; see here for the full list.

 


 

Running the Writer

The Writer and all of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, whether in the cloud or on-premise. This is achieved through the use of Docker Containers. In addition, we now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to and run your Writer.

Local Docker Image

To run the Writer locally, first ensure you have Docker installed. Then, by running a command with the following structure, Docker will start the container and run the Writer from your downloaded image.

For UNIX based machines (macOS and Linux):

docker run \
  --env LICENSE=$LICENSE \
  --env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test \
  --env GRAPH_DATABASE_TYPE=graphdb \
  --env GRAPH_DATABASE_USERNAME=test \
  --env GRAPH_DATABASE_PASSWORD=test \
  -p 8080:8080 \
  -v /var/local/:/var/local/ \
  lens-writer-api:latest

For Windows:

docker run ^
  --env LICENSE=%LICENSE% ^
  --env GRAPH_DATABASE_ENDPOINT=https://graphdb.example.com:443/repositories/test ^
  --env GRAPH_DATABASE_TYPE=graphdb ^
  --env GRAPH_DATABASE_USERNAME=test ^
  --env GRAPH_DATABASE_PASSWORD=test ^
  -p 8080:8080 ^
  -v /data/:/data/ ^
  lens-writer-api:latest

The above examples demonstrate how to override configuration options using environment variables in your Lens Writer. Line 2 shows an environment variable saved on the machine being passed in, whereas lines 3-6 show string values being passed directly. Given that the Writer runs on port 8080, line 7 exposes and binds that port on the host machine so that the APIs can be triggered. The -v flag seen on line 8 mounts the working directory into the container; if the host directory of a bind-mounted volume doesn’t exist, Docker will automatically create it on the host for you. Finally, line 9 is the name and version of the Docker image you wish to run.

For more information on running Docker images, see the official Docs.

Lens Writer Via AWS Marketplace

To run the Lens Writer on AWS, we have full support for the AWS Marketplace. First subscribe to the Lens Writer, then use the CloudFormation template we have created to deploy a one-click solution, starting up an ECS Cluster with all the required permissions and networking, with the Lens running within it as a task. See here for more information about how the template works and what is being initialised.

For more information on the Architecture and Deployment of an Enterprise System, see our guide.

Alternatively, you can manually start the Lens by creating a Task Definition to be run within an ECS or EKS cluster, using the Lens’s Image ID, exposing port 8080, and ensuring there is a Task Role with at least the AmazonS3FullAccess and AWSMarketplaceMeteringRegisterUsage policies included. A minimal sketch of such a Task Definition is shown below.
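A minimal sketch of the container portion of such a Task Definition (the image URI is a placeholder; your Marketplace subscription provides the actual Image ID, and the environment list would include the rest of your configuration):

{
  "containerDefinitions": [
    {
      "name": "lens-writer",
      "image": "<lens-writer-image-id>",
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "environment": [
        { "name": "GRAPH_DATABASE_ENDPOINT", "value": "https://graphdb.example.com:443/repositories/test" },
        { "name": "GRAPH_DATABASE_TYPE", "value": "graphdb" }
      ]
    }
  ]
}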

 


 

Ingesting Source Data (RDF and CSV)

The Lens Writer is designed to ingest RDF NQuads (.nq) data files for Semantic Graphs, and CSV Nodes and Edges data files for Property Graphs. This ingestion can be done in a number of ways.

Directories in the Writer

When ingesting files, the Lens Writer is designed to support files from an array of sources. This includes both local and remote URLs, including cloud-based technologies such as AWS S3. These locations should always be expressed as a URL string (Ref. RFC-3986).

  • To use a local URL for directories and files, both the format of file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-rdf-file.nq is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported, where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must ensure your Writer has the necessary credentials / permissions to access S3.

Also included in the Writer is the ability to delete your source data files after they have been ingested into your Graph Database. This is done by setting the DELETE_SOURCE config value to true. Enabling this means that your S3 Bucket or local file store will not continuously fill up with RDF / CSV data generated from your Lenses.
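For example, when starting the container:

  --env DELETE_SOURCE=true \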

Endpoint

The first, and easiest, way to ingest an RDF file into the Lens Writer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a data file to ingest, and in return you will be provided with the success status of the operation.

The structure and parameters for the GET request are as follows: http://<writer-ip>:<writer-port>/process?inputRdfURL=<input-rdf-file-url>, for example, http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq. Once an input file has been successfully processed after being ingested via the Process endpoint, the Writer returns a JSON response. Within the JSON response are elements containing both the input data URL and the URL of the target Graph Database, for example:

{
  "input": "/var/local/input/input-rdf-data.nq",
  "graphDatabaseEndpoint": "https://graphdb.example.com:443/repositories/test",
  "graphDatabaseProvider": "graphdb",
  "databaseType": "SEMANTIC_GRAPH"
}
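As a sketch, the same example request can be issued with curl (assuming the Writer is running locally on port 8080):

curl "http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq"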

Now by logging in to your Graph Database and making the necessary queries, you will be able to see the newly inserted source data.

Kafka

The second, and the more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here, but in short, to ingest source files into the Lens Writer you require a Producer. The topic name this Producer publishes to must be the same name that you specified in the KAFKA_TOPIC_NAME_SOURCE config option (defaults to “success_queue”). Please ensure that this is the same as the success queue topic name in the Lenses you wish to ingest transformed data from. Once set up, if manually pushing data to Kafka, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-rdf-data.nq.
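As a minimal sketch, a message can be pushed manually using the console producer shipped with Kafka (assuming a broker at localhost:9092 and the default topic name):

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic success_queue
> s3://examplebucket/folder/input-rdf-data.nq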

Dead Letter Queue

If something goes wrong during the operation of the Writer, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with meta-data about that ingestion, allowing the problem to be diagnosed and the data later re-ingested. This message will be in the form of a JSON with the following structure:

 

Ingestion Mode

By default, each ingested dataset is treated as a self-contained unit and loaded in full into the final graph store. Different datasets are considered independent of each other; in this mode, a new dataset adds new values to already existing subjects and predicates. This default behaviour is controlled by the INGESTION_MODE parameter (see here). For example:

Existing data:

New data:

Final data:

 
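As an illustrative sketch of this default (append) behaviour, using made-up example triples:

Existing data:  <http://example.com/alice> <http://example.com/phone> "111-1111" .
New data:       <http://example.com/alice> <http://example.com/phone> "222-2222" .
Final data:     both triples are present, i.e. alice now has the phone values "111-1111" and "222-2222".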

In update mode (the update value of the parameter), the ingested data is used to replace the already existing data. This mode is fully supported by the RDF standard and is built into all Semantic Knowledge Graphs. Since Property Graphs such as Neo4j usually require separate add-ons to persist RDF graphs, this feature must be supported by the extension; currently, we have implemented support in the neosemantics plug-in for Neo4j versions 3.5.x. The customised version of the plug-in can be downloaded from here.

The update mode is demonstrated in the example below.

Existing data:

New data:

Final data:

 
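With the same made-up triples as above, update mode replaces the existing value, so the final data contains only <http://example.com/alice> <http://example.com/phone> "222-2222" .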

Please note that the above only applies to Semantic Graphs; in Property Graphs, the Writer defaults to an upsert pattern, updating the properties of a given node or edge.

 


 

Types of Graph

The Lens Writer has support for the writing of RDF data to all Semantic Knowledge Graphs, as well as a wide selection of Property Graphs (with more being continuously added).

Semantic Knowledge Graphs

As previously stated, Data Lens and its Writer support writing to all Semantic Knowledge Graph databases. This includes, but is not exclusive to, Stardog, GraphDB, BlazeGraph, AllegroGraph, MarkLogic, RDFox, and Amazon Neptune.

Stardog

Stardog has an Enterprise Knowledge Graph platform as well as a feature-rich IDE for querying and visualising your data. To import data into your Stardog graph you must set your GRAPH_DATABASE_TYPE to stardog and your GRAPH_DATABASE_ENDPOINT to the structure of https://stardog.example.com:443/test.

GraphDB

GraphDB is an enterprise-ready Semantic Graph Database, compliant with W3C Standards. To import data into your GraphDB graph you must set your GRAPH_DATABASE_TYPE to graphdb and your GRAPH_DATABASE_ENDPOINT to the structure of https://graphdb.example.com/repositories/test.

Blazegraph

Blazegraph DB is an ultra-high-performance graph database supporting Blueprints and RDF/SPARQL APIs. To import data into your Blazegraph you must set your GRAPH_DATABASE_TYPE to blazegraph and your GRAPH_DATABASE_ENDPOINT to the structure of https://blazegraph.example.com/blazegraph/namespace/test.

AllegroGraph

AllegroGraph is a modern, high-performance, persistent graph database with efficient memory utilisation. To import data into your AllegroGraph, you must leave your GRAPH_DATABASE_TYPE at the default sparql and set your GRAPH_DATABASE_ENDPOINT to the structure of http://allegrograph.example.com:10035/repositories/test.

RDFox

RDFox is the first market-ready high-performance knowledge graph designed from the ground up with semantic reasoning in mind. To import data into your RDFox, you must set your GRAPH_DATABASE_TYPE to rdfox and your GRAPH_DATABASE_ENDPOINT to the structure of http://rdfox.example.com:8090/datastores/test/sparql.

Amazon Neptune

Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. To import data into your Neptune database you must set your GRAPH_DATABASE_TYPE to neptune-sparql and your GRAPH_DATABASE_ENDPOINT to the structure of https://example.xyz123.us-east-1.neptune.amazonaws.com:8182. This endpoint will be provided to you by AWS when setting up your Neptune instance. Please ensure that this resides in the same region as the rest of your AWS services.

And more…

With support for all Knowledge Graphs, your Knowledge Graph of choice may not have been listed. If so, it is recommended to leave the GRAPH_DATABASE_TYPE at the default sparql and set your GRAPH_DATABASE_ENDPOINT to what is specified in your graph’s user docs. For help and assistance, contact us.

 

Property Graphs

A Property Graph is a directed, vertex-labelled, edge-labelled multigraph with self-edges, where edges have their own identity. In the Property Graph paradigm, the term node is used to denote a vertex, and relationship to denote an edge. Semantic Graphs and Property Graphs differ in their implementation and technologies, therefore importing into this type differs slightly. Fortunately, the Writer makes this as simple as with Semantic Graphs, requiring the same process of setting your Graph Database as before. Instead of RDF, the source files for ingesting data into your Property Graph are CSV Nodes and Edges files; these can be produced in the Lens by configuring it to be in Property Graph Mode.

Neo4j

Neo4j is a Property Graph database management system with native graph storage and processing. To import data into your Neo4j Graph, set your GRAPH_DATABASE_TYPE to neo4j and your GRAPH_DATABASE_ENDPOINT to the structure of bolt://example.com:7687, with optionally a table name appended to the URL after a forward slash, e.g. bolt://example.com:7687/test. Once imported, using the Cypher query language, you can query your data using Neo4j’s products and tools.

Amazon Neptune (Cypher or Gremlin)

As well as SPARQL, Amazon Neptune also supports openCypher and Gremlin. To import your data in these formats, set your GRAPH_DATABASE_TYPE to neptune-cypher or neptune-gremlin respectively, and your GRAPH_DATABASE_ENDPOINT as before.

Gremlin Tinkerpop Server

Apache TinkerPop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). To import data into your TinkerPop Graph, set your GRAPH_DATABASE_TYPE to gremlin and your GRAPH_DATABASE_ENDPOINT to the structure of http://example.com:8182.

 


 

Provenance Data

Within the Lenses, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of data over time, allowing you to see what the state of the data was at any moment. The model we use to record provenance information is the W3C standard PROV-O model. Currently, the Lens Writer does not generate its own provenance meta-data; however, any provenance previously generated will still be ingested into your Knowledge Graph. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data. Having provenance pushed to a separate Kafka Topic allows a different Lens Writer to be set up, enabling you to push provenance to a separate Knowledge Graph from the RDF generated from your source data.

For more information on how the provenance is laid out, as well as how to query it from your Knowledge Graph, see the Provenance Guide.

 


 

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Writer, there is a selection of built-in exposed endpoints for you to call.

| API | HTTP Request | URL Template | Description |
| --- | --- | --- | --- |
| Process | GET | /process?inputRdfURL=<input-rdf-file-url> | Tells the Writer to ingest the RDF file located at the specified URL |
| Config | GET | /config | Displays all Writer configuration as JSON |
| Config | GET | /config?paths=<config-options> | Displays the Writer configuration options specified in the paths parameter as a comma-separated list |
| Check Connection | GET | /checkConnection | Tests whether the Writer can establish a connection with the specified Knowledge Graph credentials |
| Update Config | PUT | /updateConfig?configEntry=<config-entry>&configValue=<config-value> | Updates configuration options on a running Writer |
| Upload Config Backup | PUT | /uploadConfigBackup | Uploads the current configuration to the specified config backup location so that it can be restored at a later date |
| License | GET | /license | Displays license information |
| Restart Kafka | GET | /restartKafka | Turns the Writer’s Kafka connection on or off depending on its current state |

 

Config

The config endpoint is a GET request that allows you to view the configuration settings of a running Writer. By sending GET http://<writer-ip>:<writer-port>/config (for example http://127.0.0.1:8080/config), you will receive the entire configuration represented as a JSON, as seen in this small snippet below.

Alternatively, you can specify exactly what config options you wish to return by providing a comma-separated list of variables under the paths parameter. For example, the request of GET http://<writer-ip>:<writer-port>/config?paths=lens.config.tripleStore.endpoint,logging.loggers would return the following.

Update Config

PUT /updateConfig

The configuration on a running Writer can now be edited without having to restart it. This is done through the update config endpoint. For example, by running the following: /updateConfig?configEntry=friendlyName&configValue=GraphBuilder, we have changed the friendly name of the Writer to GraphBuilder. To see a list of the configuration entry names, consult the Lens Writer Configurable Options.
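As a sketch with curl (assuming the Writer is running locally on port 8080):

curl -X PUT "http://127.0.0.1:8080/updateConfig?configEntry=friendlyName&configValue=GraphBuilder"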

License

The license endpoint is a GET request that allows you to view information about your license key that is in use on a running Lens or Writer. By sending GET http://<writer-ip>:<writer-port>/license (for example: http://127.0.0.1:8080/license), you will receive a JSON response containing the following values.

Process

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Writer to ingest your source data. If an execution of the Writer fails after being triggered in this way, the response will be a 400 Bad Request status and will contain a response message similar to that sent to the dead letter queue, as outlined above.