...

This is the full User Guide for the Lens Writer; it contains an in-depth set of instructions to fully set up, configure, and run the Writer so you can start writing data as part of an end-to-end system. For a guide to getting the Writer up and running in the quickest and simplest way possible, see the Quick Start Guide. Once deployed, you can utilise any of our ready-made sample output NQuad files to test your Writer. For a list of what has changed since the last release, visit the User Release Notes.

...

Table of Contents

...

As with the Lenses supplied by Data Lens, the Lens Writer is also configurable through the use of Environment Variables. How to declare these environment variables will differ slightly depending on how you choose to run the Writer, so please see Running the Lens Writer for more info. For a breakdown of every configuration option in the Lens Writer, see the full list here.

Mandatory Configuration

For the Lens Writer to operate, the following configuration options are required; a minimal sketch of supplying them as environment variables follows this list.

  • License - LICENSE

    • This is the license key required to operate the Lens; request your new unique license key here.

  • Triple Store Endpoint - TRIPLESTORE_ENDPOINT

    • This is the endpoint of the Triple Store to which you wish to upload your RDF, and it is therefore required for the Lens Writer to work.

  • Triple Store Type - TRIPLESTORE_TYPE

    • This is the type of your Triple Store. Some graphs support the default sparql type (e.g. AllegroGraph); however, certain graphs require a specific type declaration, namely graphdb, stardog, blazegraph, neptune, and neo4j. Please see the Types of Graph section for more info.

  • Triple Store Username and Password - TRIPLESTORE_USERNAME and TRIPLESTORE_PASSWORD

    • This is the username and password of your Triple Store. You can leave these fields blank if your Triple Store does not require any authentication.
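
As a minimal sketch of how these mandatory variables might be supplied when running the Writer as a Docker container, the following Python snippet uses the Docker SDK. The image name, endpoint, and credential values below are placeholders for illustration only and are not taken from this guide.

    import docker  # pip install docker

    client = docker.from_env()

    # Placeholder values -- substitute your own licence key, endpoint, and credentials.
    environment = {
        "LICENSE": "<your-licence-key>",
        "TRIPLESTORE_ENDPOINT": "http://localhost:7200/repositories/example",
        "TRIPLESTORE_TYPE": "graphdb",
        "TRIPLESTORE_USERNAME": "admin",
        "TRIPLESTORE_PASSWORD": "secret",
    }

    # "datalens/lens-writer" is a hypothetical image name used for illustration.
    container = client.containers.run(
        "datalens/lens-writer",
        environment=environment,
        detach=True,
    )
    print(container.short_id)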

...

One of the many ways to interface with the Writer is through the use of Apache Kafka. With the Lens Writer, a Kafka Message Queue can be used for managing the input of RDF data into the Writer. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Writer. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.

...

All other Kafka configuration variables can be found here; each has a default value that can be overridden.

...

Logging in the Lens Writer works the same way as the Lens, and like most functionality it is configurable through the use of environment variables; the list of overridable options and their descriptions can be found here. When running the Lens Writer locally from the command line using the instructions below, the Writer will automatically log to your terminal instance. In addition to this, archives of logs will be saved within the docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type, however this can be overridden. If running a Writer in the cloud in an AWS environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.
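
As a hedged example of inspecting these logs on a locally running container, the snippet below uses the Docker SDK to list the current text and JSON log directories named above; the container name is a placeholder, not a value defined in this guide.

    import docker  # pip install docker

    client = docker.from_env()
    # "lens-writer" is a placeholder container name -- use the name or ID of your running Writer.
    writer = client.containers.get("lens-writer")

    # Log locations taken from the paths described above.
    for path in ("/var/log/datalens/text/current/", "/var/log/datalens/json/current/"):
        exit_code, output = writer.exec_run(f"ls {path}")
        print(path, output.decode().strip())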

...

Neo4j is a Property Graph database management system with native graph storage and processing. As Neo4j is not a Semantic Knowledge Graph, storing RDF data in Neo4j in a lossless manner may require additional configuration options to be set. The defaults have been set, as seen here, for the most likely scenario, and also to allow imported RDF to be subsequently exported without losing a single triple in the process. As with all config, this can be overridden to suit your needs. For more information on how your data is represented in Neo4j see below.

Optional Configuration

There is also a further selection of optional configurations for given situations; see here for the full list.

Accessing the configuration of a running Writer

Once a Writer has started and is operational, you can request to view the current config by calling one of the Writer’s built-in APIs; this is explained in more detail below. Please note that in order to change any config variable on a running Writer, it must be shut down and restarted.
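
As a sketch only: the guide does not name the endpoint at this point, so the path and port below are hypothetical placeholders; check the REST API Endpoints section for the actual route of the config API.

    import requests

    # Hypothetical endpoint and port -- substitute the actual config endpoint of your running Writer.
    response = requests.get("http://localhost:8080/config")
    response.raise_for_status()
    print(response.json())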

...

The deployment approach we recommend at Data Lens is to use Amazon Web Services (AWS), both to store your source and RDF data and to host and run your Lenses and Writer. We have written a brief DevOps guide intended to support you in deploying Data Lens into AWS.

The aim is to deploy the Lens and other services using AWS by setting up the following architecture:

  • An Amazon Web Services Elastic Container Service (ECS)

...

The workflow the guide aims to achieve is as follows:

  1. A source data file is placed into the S3 bucket

  2. The Lambda is monitoring this bucket and notifies Kafka

  3. The Lens reads the message from Kafka and transforms the source data file into RDF

  4. The transformed data is passed to the Writer, which writes it to a Semantic Knowledge Graph or Property Graph

This is achieved by setting up the following architecture:

Info

For more information on the Architecture and Deployment of an Enterprise System, see our guide.
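
To illustrate steps 1 and 2 of the workflow above, the following is a minimal sketch of a Lambda handler that receives an S3 event and forwards the new object's s3:// URL to Kafka. The broker address and topic name are placeholders for illustration, not values defined in this guide.

    import json
    from kafka import KafkaProducer  # pip install kafka-python

    # Placeholder broker address and topic name for illustration only.
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    TOPIC = "source_data_queue"  # hypothetical input topic of the Lens

    def handler(event, context):
        """Triggered by an S3 PUT event; sends each new file's URL for the Lens to pick up."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            producer.send(TOPIC, f"s3://{bucket}/{key}".encode("utf-8"))
        producer.flush()
        return {"statusCode": 200, "body": json.dumps({"sent": len(event["Records"])})}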

...

Ingesting RDF Data

The Lens Writer is designed to ingest RDF data, most commonly in the form of NQuads (.nq) files, and this can be done in a number of ways.

...

  • To use a local URL for directories and files, both the format of file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-rdf-file.nq is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must specify an AWS access key and secret key.

Also included in the Writer is the ability to delete your source NQuad input files after they have been ingested into your Triple Store. This is done by setting the DELETE_SOURCE config value to true. Enabling this means that your S3 Bucket or local file store will not continuously fill up with RDF NQuad data generated from your Lenses.

Endpoint

First, the easiest way to ingest an RDF file into the Lens Writer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of an RDF file to ingest, and in return, you will be provided with the success status of the operation.
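
A minimal sketch of calling the process endpoint with Python is shown below; the host, port, and the query parameter name (url) are assumptions for illustration, so check the REST API Endpoints section for the exact signature.

    import requests

    # Assumed host/port and "url" parameter name -- verify against the REST API Endpoints section.
    resp = requests.get(
        "http://localhost:8080/process",
        params={"url": "s3://examplebucket/folder/input-rdf-data.nq"},
    )
    print(resp.status_code, resp.text)  # a non-2xx status (e.g. 400) indicates the ingestion failed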

...

The second, and more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here; in short, to ingest RDF files into the Lens Writer you require a Producer. The topic to which this Producer publishes must have the same name that you specified in the KAFKA_TOPIC_NAME_SUCCESS config option (defaults to “success_queue”). Please ensure that this is the same as the success queue topic name in the Lenses you wish to ingest transformed data from. Once set up, if manually pushing data to Kafka, each message sent from the Producer must consist solely of the URL of the file, for example s3://examplebucket/folder/input-rdf-data.nq.
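
As a sketch of manually pushing a file URL onto the success queue with a Kafka Producer (the broker address is a placeholder; the topic name is the documented default for KAFKA_TOPIC_NAME_SUCCESS):

    from kafka import KafkaProducer  # pip install kafka-python

    # Broker address is a placeholder; "success_queue" is the default KAFKA_TOPIC_NAME_SUCCESS value.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Each message body is solely the URL of the RDF file to ingest.
    producer.send("success_queue", b"s3://examplebucket/folder/input-rdf-data.nq")
    producer.flush()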

...

For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.

...

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Writer, there is a selection of built-in exposed endpoints for you to call.

...

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to ingest your source data. When an execution of the Writer fails after being triggered in this way, the response will be a status 400 Bad Request as follows.

...