Quick Start Guide – Document Lens v1.3+

This quick start guide gets the Document Lens up and running in the simplest possible way so you can start ingesting and transforming data straight away. For more in-depth instructions, see the User Guide.

In this guide we set up and run the Lens as a Docker image deployed to your local machine. However, we support a number of cloud deployment technologies, including full support for AWS.

1. Configuring the Lens

All Lenses supplied by Data Lens are configurable through environment variables. How you set these environment variables depends on how you choose to run the Lens; see Running the Lens for more information.

Mandatory Configuration

For the Lens to operate the following configuration options are required. For a breakdown of every configuration option in the Document Lens, see the full list here.

Environment Variable

Description

[LICENSE]

This is the license key required to operate the Lens. Request your new unique license key here.

[OUTPUT_DIR_URL]

This is the directory where all generated RDF files are saved. Both local and remote URLs are supported.

[PROV_OUTPUT_DIR_URL]

This is the directory where all generated provenance files are saved. Both local and remote URLs are supported. If you do not wish to generate provenance, you can turn it off by setting the [RECORD_PROVO] variable to false.

[LENS_RUN_STANDALONE]

Each of the Lenses is designed to be run as part of a larger end-to-end system, with the end result of data being uploaded into Semantic Knowledge Graphs or Property Graphs. As part of this process, Apache Kafka message queues are used for communication between services.

While not a compulsory config option, for this quick start, we are going to enable standalone mode by setting this value to true, so that the Lens won't attempt to connect to external services.
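As a minimal sketch, the four settings above could be collected in one file for Docker's --env-file option. The file name lens.env is an assumption made for illustration, and the paths are the ones used in the UNIX example of this guide. A bare LICENSE line (a name with no value) tells Docker to pass the value through from the host environment.

```shell
# Write the quick-start settings to an env file.
# The file name "lens.env" is an assumption, not something the Lens requires.
cat > lens.env <<'EOF'
LICENSE
OUTPUT_DIR_URL=/var/local/output/
PROV_OUTPUT_DIR_URL=/var/local/prov-output/
LENS_RUN_STANDALONE=true
EOF
```

The file could then be supplied with docker run --env-file lens.env in place of the individual --env flags.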

2. Running the Lens

All of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, including in the cloud or on-premise. This is achieved through the use of Docker containers.

Local Docker Image

For this quick start guide, we use the simplest method of deployment: running the Lens' Docker image locally. First, ensure you have Docker installed. Then run a command with the following structure, and Docker will start the container and run the Lens from your downloaded image. The next steps assume the Data Lens license string has been stored in the [LICENSE] environment variable.
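Because the commands that follow rely on the LICENSE variable being set on the host, it may help to check for it first. The sketch below is one possible way to do that; the function name require_license is hypothetical and not part of the Lens or Docker.

```shell
# Hypothetical helper: fail early if LICENSE is unset or empty on the host.
require_license() {
  if [ -z "${LICENSE:-}" ]; then
    echo "LICENSE is not set; export it before starting the Lens" >&2
    return 1
  fi
}

# Possible usage (not executed here):
#   require_license && docker run ...
```

On Windows, the equivalent check would be made against %LICENSE% in the command prompt.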

For UNIX-based machines (macOS and Linux), the command is the following.

docker run \
    --env LICENSE=$LICENSE \
    --env OUTPUT_DIR_URL=/var/local/output/ \
    --env LENS_RUN_STANDALONE=true \
    --env PROV_OUTPUT_DIR_URL=/var/local/prov-output/ \
    -p 8080:8080 \
    -v /var/local/:/var/local/ \
    lens-unstructured:latest

For Windows machines, the command is the following.

docker run ^
    --env LICENSE=%LICENSE% ^
    --env OUTPUT_DIR_URL="/data/output/" ^
    --env LENS_RUN_STANDALONE=true ^
    --env PROV_OUTPUT_DIR_URL="/data/prov-output/" ^
    -p 8080:8080 ^
    -v C:\data\:/data/ ^
    lens-unstructured:latest

The above examples demonstrate how to override configuration options using environment variables in your Lens. Line 2 passes in an environment variable saved on the host machine, whereas lines 3-5 pass literal string values. Because the Lens runs on port 8080, line 6 publishes that port and binds it to port 8080 on the host machine so that the APIs can be triggered. The -v flag on line 7 mounts a host directory (/var/local/ in the UNIX example, C:\data\ in the Windows example) into the container; when the host directory of a bind-mounted volume doesn't exist, Docker will automatically create it on the host for you. Finally, line 8 is the name and version of the Docker image you wish to run.

For more information on running Docker images, see the official Docs.

3. Ingesting Data / Triggering the Lens

The easiest way to ingest a file into the Document Lens is to use the built-in APIs. Using the process GET endpoint you can specify the URL of a file to ingest and, in return, you will be provided with the URL of the generated RDF data file.

The structure and parameters of the GET request are as follows: http://<lens-ip>:<lens-port>/process?inputFileURL=<input-file-url>. For example: http://127.0.0.1:8080/process?inputFileURL=file:///var/local/input-document.pdf.

Once an input file has been successfully processed via the process endpoint, the Lens returns a JSON response. Within the JSON response is the output element; this element contains the URL of the generated RDF file.
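The request structure above can be sketched in shell as follows. The helper name lens_process_url is hypothetical, and the host, port and file URL are the example values from this guide. The sketch does no percent-encoding, so an input URL containing spaces or characters such as & would need to be encoded first.

```shell
# Hypothetical helper: build the process URL from host, port and input file URL.
lens_process_url() {
  printf 'http://%s:%s/process?inputFileURL=%s\n' "$1" "$2" "$3"
}

url=$(lens_process_url 127.0.0.1 8080 file:///var/local/input-document.pdf)
echo "$url"

# With the Lens running, the request could then be sent with curl:
# curl --silent --fail "$url"
```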

Sample output:

{
    "input": "file:///var/local/input/input-document.pdf",
    "output": "/var/local/output/Document-Lens-44682bd6-3fbc-429b-988d-40dda8892328.nq"
}
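As a rough sketch, the output element could be pulled out of a response like the sample above with sed. This assumes a flat JSON object whose values contain no escaped quotes; a dedicated JSON tool would be more robust for anything beyond that. The variable names are illustrative only.

```shell
# Sample response copied from this guide (flattened onto one line).
response='{ "input": "file:///var/local/input/input-document.pdf", "output": "/var/local/output/Document-Lens-44682bd6-3fbc-429b-988d-40dda8892328.nq" }'

# Extract the value of "output"; assumes a flat object with no escaped quotes.
output_file=$(printf '%s\n' "$response" |
  sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')

echo "$output_file"
```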

To learn more about the content of the output file, see the Input/Output Data Example section of the User Guide.