User Guide - Lens Writer v1.4

Intro

This is the full User Guide for the Lens Writer. It contains an in-depth set of instructions to fully set up, configure, and run the Writer so you can start writing data as part of an end-to-end system. For a guide to getting the Writer up and running in the quickest and simplest way possible, see the Quick Start Guide. Once deployed, you can utilise any of our ready-made sample output NQuad files to test your Writer. For a list of what has changed since the last release, visit the User Release Notes.

 


Configuring the Writer

As with the Lenses supplied by Data Lens, the Lens Writer is also configurable through the use of Environment Variables. How to declare these environment variables will differ slightly depending on how you choose to run the Writer, so please see Running the Writer for more info. For a breakdown of every configuration option in the Lens Writer, see the full list here.

Mandatory Configuration

For the Writer to operate, the following configuration options are required.

  • License - LICENSE

    • This is the license key required to operate the Writer; request your new unique license key here.

  • Triple Store Endpoint - TRIPLESTORE_ENDPOINT

    • This is the endpoint of the Triple Store you wish to upload your RDF to, and is therefore required for the Lens Writer to work.

  • Triple Store Type - TRIPLESTORE_TYPE

    • This is the type of your Triple Store. Some graphs will support the default sparql type (e.g. AllegroGraph); however, certain graphs require a specific type declaration, namely graphdb, stardog, blazegraph, neptune, and neo4j. Please see the Types of Graph section for more info.

  • Triple Store Username and Password - TRIPLESTORE_USERNAME and TRIPLESTORE_PASSWORD

    • This is the username and password of your Triple Store. You can leave these fields blank if your Triple Store does not require any authentication.

AWS Configuration

If you wish to use cloud services such as Amazon Web Services, you need to specify an AWS Access Key, Secret Key, and AWS Region through AWS_ACCESS_KEY, AWS_SECRET_KEY, and S3_REGION respectively. Providing your AWS credentials gives the Writer permission to access, download, and upload remote files in S3 Buckets. The S3 Region option specifies the AWS region where your files and services reside. Please note that all services must be in the same region, including the EC2 instance if you choose to run the Writer in one.
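
As a sketch, when running the Writer as a Docker container (see Running the Writer below), these values can be passed in alongside the other environment variables; the values shown are placeholders, and on Windows the ^ continuation character is used in place of \:

--env AWS_ACCESS_KEY=<your-access-key> \
--env AWS_SECRET_KEY=<your-secret-key> \
--env S3_REGION=<aws-region> \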

Kafka Configuration

One of the many ways to interface with the Writer is through the use of Apache Kafka. With the Lens Writer, a Kafka Message Queue can be used for managing the input of RDF data into the Writer. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Writer. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.

The Kafka Broker is what tells the Writer where to look for your Kafka Cluster, so set this property as follows: <kafka-ip>:<kafka-port>. The recommended port is 9092.
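
For example, if your Kafka Cluster were reachable at the (hypothetical) address 10.0.0.5 on the recommended port, the broker value would be:

10.0.0.5:9092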

All other Kafka configuration variables can be found here, all of which have default values that can be overridden.

Provenance Configuration

Currently, the Lens Writer does not generate its own provenance meta-data, so the RECORD_PROVO configuration option is set to false. However, any provenance previously generated is separate from this option and will still be ingested into your triplestore. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data.

Logging Configuration

Logging in the Lens Writer works the same way as in the Lens and, like most functionality, is configurable through the use of environment variables; the list of overridable options and their descriptions can be found here. When running the Lens Writer locally from the command line using the instructions below, the Writer will automatically log to your terminal instance. In addition to this, archived logs are saved within the Docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type; however, this can be overridden. If running a Writer in the cloud in an AWS environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.
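
As a rough sketch, assuming the Writer is running as a local Docker container, the logs can be inspected from the host machine; the container name and log file names below are illustrative and will differ on your machine:

docker ps                                                                          # find the name or ID of the running Writer container
docker exec <container-name> ls /var/log/datalens/text/current/                    # list the current text log files
docker exec <container-name> tail -f /var/log/datalens/text/current/<log-file>     # follow a log file as it is written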

Neo4j Configuration

Neo4j is a Property Graph database management system with native graph storage and processing. As Neo4j is not a Semantic Knowledge Graph, storing RDF data in Neo4j in a lossless manner may require additional configuration options to be set. The defaults have been set, as seen here, for the most likely scenario, and also to allow imported RDF to be subsequently exported without losing a single triple in the process. As with all config, this can be overridden to suit your needs. For more information on how your data is represented in Neo4j see below.

Optional Configuration

There is also a further selection of optional configurations for given situations, see here for the full list.

Accessing the configuration of a running Writer

Once a Writer has started and is operational, you can view its current config by calling one of the Writer’s built-in APIs; this is explained in more detail below. Please note that in order to change any config variable on a running Writer, it must be shut down and restarted.

 


 

Running the Writer

The Writer and all of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, whether in the cloud or on-premises. This is achieved through the use of Docker Containers. In addition to this, we now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to and run your Writer.

Local Docker Image

To run the Writer locally, first ensure you have Docker installed. Then, by running a command with the following structure, Docker will start the container and run the Writer from your downloaded image.

For UNIX based machines (macOS and Linux):

docker run \
    --env LICENSE=$LICENSE \
    --env TRIPLESTORE_ENDPOINT=https://graphdb.example.com:443/repositories/test \
    --env TRIPLESTORE_TYPE=graphdb \
    --env TRIPLESTORE_USERNAME=test \
    --env TRIPLESTORE_PASSWORD=test \
    --env LENS_RUN_STANDALONE=true \
    -p 8080:8080 \
    -v /var/local/:/var/local/ \
    lens-writer-api:latest

For Windows:

docker run ^
    --env LICENSE=%LICENSE% ^
    --env TRIPLESTORE_ENDPOINT=https://graphdb.example.com:443/repositories/test ^
    --env TRIPLESTORE_TYPE=graphdb ^
    --env TRIPLESTORE_USERNAME=test ^
    --env TRIPLESTORE_PASSWORD=test ^
    --env LENS_RUN_STANDALONE=true ^
    -p 8080:8080 ^
    -v /data/:/data/ ^
    lens-writer-api:latest

The above examples demonstrate how to override configuration options using environment variables in your Lens Writer. Line 2 shows an environment variable saved on the machine being passed in, whereas lines 3-7 simply pass string values. Given that the Writer runs on port 8080, line 8 exposes and binds that port on the host machine so that the APIs can be triggered. The -v flag on line 9 mounts the working directory into the container; when the host directory of a bind-mounted volume doesn’t exist, Docker will automatically create it on the host for you. Finally, line 10 is the name and version of the Docker image you wish to run.

For more information on running Docker Images, see the official Docs.

Docker on AWS

The deployment approach we recommend at Data Lens is to use Amazon Web Services, both to store your source and RDF data and to host and run your Lenses and Writer.

The aim is to deploy the Writer and its supporting services on AWS by setting up the architecture described in the guide linked below.

For more information on the Architecture and Deployment of an Enterprise System, see our guide.

AWS Marketplace

We now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to a Writer. Then, using our CloudFormation Templates, you can deploy a one-click solution to run your Writer. See here for further details and instructions to get you started.

 


 

Ingesting RDF Data

The Lens Writer is designed to ingest RDF data, most commonly in the form of NQuads (.nq) files, and this can be done in a number of ways.

Directories in the Writer

When ingesting files, the Lens Writer is designed to support files from an array of sources. This includes both local and remote URLs, including cloud-based technologies such as AWS S3. These locations should always be expressed as a URL string (Ref. RFC-3986).

  • To use a local URL for directories and files, both the format of file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-rdf-file.nq is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must specify an AWS access key and secret key.

Also included in the Writer is the ability to delete your source NQuad input files after they have been ingested into your Triple Store. This is done by setting the DELETE_SOURCE config value to true. Enabling this means that your S3 Bucket or local file store will not continuously fill up with RDF NQuad data generated from your Lenses.

Endpoint

The first, and easiest, way to ingest an RDF file into the Lens Writer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of an RDF file to ingest, and in return you will be provided with the success status of the operation.

The structure and parameters for the GET request are as follows: http://<writer-ip>:<writer-port>/process?inputRdfURL=<input-rdf-file-url>, for example, http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq. Once an input RDF file has been successfully processed after being ingested via the Process endpoint, the Writer returns a JSON response. Within the JSON response are two elements containing the input data URL and the URL of the target Triple Store, for example:

{
    "input": "/var/local/input/input-rdf-data.nq",
    "tripleStoreEndpoint": "https://graphdb.example.com:443/repositories/test"
}
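
As a quick sketch, assuming the Writer is running locally on port 8080 as in the examples above, the same request can be issued from the command line with curl (quote the URL so that the ? is not interpreted by the shell):

curl "http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq"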

Now by logging in to your Triple Store and making the necessary queries, you will be able to see the newly inserted RDF data.

Kafka

The second, and the more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here; in short, to ingest RDF files into the Lens Writer you require a Producer. The topic this Producer publishes to must have the same name as the one you specified in the KAFKA_TOPIC_NAME_SUCCESS config option (defaults to “success_queue”). Please ensure that this is the same as the success queue topic name in the Lenses you wish to ingest transformed data from. Once set up, if manually pushing data to Kafka, each message sent from the Producer must consist solely of the URL of the file, for example s3://examplebucket/folder/input-rdf-data.nq.
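
For a quick manual test, the console producer bundled with Apache Kafka can be used to push a file URL onto the success queue topic. The broker address below is a placeholder, and older Kafka distributions use --broker-list in place of --bootstrap-server; each line typed at the > prompt is sent as a single message:

kafka-console-producer.sh --bootstrap-server <kafka-ip>:9092 --topic success_queue
>s3://examplebucket/folder/input-rdf-data.nq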

Dead Letter Queue

If something goes wrong during the operation of the Writer, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with meta-data about that ingestion, allowing the problem to be diagnosed and the message later re-ingested. This message is a JSON document describing the failure.
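
As a rough illustration only, where the field names below are assumptions rather than the definitive schema, a dead letter message might look something like this:

{
    "input": "s3://examplebucket/folder/input-rdf-data.nq",
    "message": "Description of what went wrong during this ingestion"
}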

 

Ingestion Mode

By default, each ingested dataset is treated as a self-contained set of triples and is loaded in full into the final graph store. Datasets are considered independent of one another, so in this mode the new dataset adds new values alongside any that already exist for the same subject and predicate. This default behaviour is controlled by the INGESTION_MODE parameter (see here). For example, if the existing data and the new data each contain a different value for the same subject and predicate, the final data will contain both values.
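
A minimal sketch in NQuads, using hypothetical subjects and values, of how the default mode behaves:

Existing data:

<http://example.com/person/1> <http://example.com/hasEmail> "alice@old.example.com" .

New data:

<http://example.com/person/1> <http://example.com/hasEmail> "alice@new.example.com" .

Final data (default mode, both values are retained):

<http://example.com/person/1> <http://example.com/hasEmail> "alice@old.example.com" .
<http://example.com/person/1> <http://example.com/hasEmail> "alice@new.example.com" .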

 

In update mode (set the parameter’s value to update), the ingested data replaces the values that already exist for the same subject and predicate. This mode is fully supported by the RDF standard and is built into all Semantic Knowledge Graphs. Since property graphs such as Neo4j usually require a separate add-on to persist RDF graphs, this feature must also be supported by that extension. We have currently implemented this support in the neosemantics plug-in for Neo4j versions 3.5.x; the customised version of the plug-in can be downloaded from here.
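
Continuing the same hypothetical example, in update mode the newly ingested value replaces the existing one:

Existing data:

<http://example.com/person/1> <http://example.com/hasEmail> "alice@old.example.com" .

New data:

<http://example.com/person/1> <http://example.com/hasEmail> "alice@new.example.com" .

Final data (update mode, only the new value remains):

<http://example.com/person/1> <http://example.com/hasEmail> "alice@new.example.com" .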

 


 

Types of Graph

While Data Lens is built with semantic data in mind, we support writing your data to all Semantic Knowledge Graphs, as well as to a selection of Property Graphs (with more coming soon).

Semantic Knowledge Graphs

As previously stated, Data Lens and its Writer support writing to all Semantic Knowledge Graph databases. This includes, but is not limited to, Stardog, GraphDB, Blazegraph, AllegroGraph, MarkLogic, and Amazon Neptune.

Stardog

Stardog has an Enterprise Knowledge Graph platform as well as a feature-rich IDE for querying and visualising your data. To import data into your Stardog graph you must set your TRIPLESTORE_TYPE to stardog and your TRIPLESTORE_ENDPOINT to the structure of https://stardog.example.com:443/test.

GraphDB

GraphDB is an enterprise-ready Semantic Graph Database, compliant with W3C Standards. To import data into your GraphDB graph you must set your TRIPLESTORE_TYPE to graphdb and your TRIPLESTORE_ENDPOINT to the structure of https://graphdb.example.com/repositories/test.

Blazegraph

Blazegraph DB is an ultra-high-performance graph database supporting Blueprints and RDF/SPARQL APIs. To import data into your Blazegraph you must set your TRIPLESTORE_TYPE to blazegraph and your TRIPLESTORE_ENDPOINT to the structure of https://blazegraph.example.com/blazegraph/namespace/test.

AllegroGraph

AllegroGraph is a modern, high-performance, persistent graph database with efficient memory utilisation. To import data into your AllegroGraph you must leave your TRIPLESTORE_TYPE as the default sparql and set your TRIPLESTORE_ENDPOINT to the structure of http://allegrograph.example.com:10035/repositories/test.

Amazon Neptune

Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. To import data into your Neptune database you must set your TRIPLESTORE_TYPE to neptune and your TRIPLESTORE_ENDPOINT to the structure of https://example.xyz123.us-east-1.neptune.amazonaws.com:8182. This endpoint will be provided to you by AWS when setting up your Neptune instance. Please ensure that this resides in the same region as the rest of your AWS services.

And more…

With support for all Semantic Knowledge Graphs, your Knowledge Graph of choice may not have been listed. If so, it is recommended to leave the TRIPLESTORE_TYPE as the default sparql and set your TRIPLESTORE_ENDPOINT to what is specified in your graph’s user docs. For help and assistance, contact us.

 

Property Graphs

A Property Graph is a directed, vertex-labelled, edge-labelled multigraph with self-edges, where edges have their own identity. In the Property Graph paradigm, the term node is used to denote a vertex, and relationship to denote an edge. Property Graphs and Semantic Knowledge Graphs differ in their implementation and technologies, so importing into this type of graph requires an additional layer of transformation. However, for you, it is as simple as setting up your triplestore as before. Currently we support Neo4j, and coming soon will be support for more, including but not exclusive to OrientDB and TigerGraph.

Neo4j

Neo4j is a Property Graph database management system with native graph storage and processing. To import data into your Neo4j graph you must install the dedicated Neo4j plug-in from neosemantics, set your TRIPLESTORE_TYPE to neo4j, and set your TRIPLESTORE_ENDPOINT to the structure of bolt://example.com:7687. The plug-in binaries and documentation can be found at the Neo4j GitHub web page. Please note that the plug-in configuration depends on the Neo4j version.
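
For example, relative to the docker run commands shown in Running the Writer, only the triple store variables need to change; the endpoint and credentials below are placeholders:

--env TRIPLESTORE_TYPE=neo4j \
--env TRIPLESTORE_ENDPOINT=bolt://example.com:7687 \
--env TRIPLESTORE_USERNAME=<neo4j-username> \
--env TRIPLESTORE_PASSWORD=<neo4j-password> \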

As previously stated, Neo4j is not a Semantic Knowledge Graph; therefore, storing RDF data in Neo4j requires a preliminary data transformation supported by the plug-in and controlled by additional transformation parameters, which have been exposed as configuration options. The defaults have been set, as seen here, to allow imported RDF to be subsequently exported without losing a single triple in the process. As with all config, this can be overridden to suit your needs. Once imported, you can query your data with the Cypher query language using Neo4j’s products and tools.

 


 

Provenance Data

Within the Lenses, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of your data over time, allowing you to see what the state of the data was at any moment. The model we use to record provenance information is the W3C standard PROV-O model. Currently, the Lens Writer does not generate its own provenance meta-data; however, any provenance previously generated will still be ingested into your triplestore. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Lens provenance is pushed to a separate queue from your generated output data. Having provenance pushed to a separate Kafka Topic allows a different Lens Writer to be set up, enabling you to push provenance to a separate triplestore from the one holding the RDF generated from your source data.

For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.

 


 

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Writer, there is a selection of built-in exposed endpoints for you to call.

API     | HTTP Request | URL Template                               | Description
--------|--------------|--------------------------------------------|--------------------------------------------------------------------------------------
Process | GET          | /process?inputRdfURL=<input-rdf-file-url>  | Tells the Writer to ingest the RDF file located at the specified URL location
Config  | GET          | /config                                    | Displays all Writer configuration as JSON
Config  | GET          | /config?paths=<config-options>             | Displays only the Writer configuration options specified in the comma-separated list
License | GET          | /license                                   | Displays license information

 

Config

The config endpoint is a GET request that allows you to view the configuration settings of a running Writer. By sending GET http://<writer-ip>:<writer-port>/config (for example http://127.0.0.1:8080/config), you will receive the entire configuration represented as JSON.
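
As a quick sketch, assuming the Writer is running locally on port 8080 as in the earlier examples:

curl "http://127.0.0.1:8080/config"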

Alternatively, you can specify exactly which config options you wish to return by providing a comma-separated list of variables in the paths parameter. For example, the request GET http://<writer-ip>:<writer-port>/config?paths=lens.config.tripleStore.endpoint,logging.loggers would return only those configuration options.
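
As an illustration only, assuming the response nests the keys according to their dot-separated paths (the exact layout and values depend on your Writer version and configuration):

{
    "lens": {
        "config": {
            "tripleStore": {
                "endpoint": "https://graphdb.example.com:443/repositories/test"
            }
        }
    },
    "logging": {
        "loggers": { }
    }
}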

License

The license endpoint is a GET request that allows you to view information about the license key in use on a running Lens or Writer. By sending GET http://<writer-ip>:<writer-port>/license (for example: http://127.0.0.1:8080/license), you will receive a JSON response containing your license details.

Process

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Writer to ingest your source data. When an execution of the Writer fails after being triggered in this way, the response will be a 400 Bad Request status and will contain a response message similar to that sent to the dead letter queue, as outlined above.