User Guide – Document Lens v1.3+

Intro

This is the full User Guide for the Document Lens. It contains an in-depth set of instructions to fully set up, configure, and run the Lens so you can start ingesting data as part of an end-to-end system. For a guide to get the Lens up and running in the quickest and simplest possible way, see the Quick Start Guide. For a list of what has changed since the last release, visit the User Release Notes.



Configuring the Lens

All Lenses supplied by Data Lens are configurable through the use of Environment Variables. How to declare these environment variables will differ slightly depending on how you choose to run the Lens, so please see Running the Lens for more info.

Mandatory Configuration

For the Lens to operate, the following configuration options are required. For a breakdown of every configuration option in the Document Lens, see the full list here.

[LICENSE]

The license key required to operate the Lens; request your new unique license key here.

[OUTPUT_DIR_URL]

The directory where all generated RDF files are saved. Both local and remote URLs are supported.

[PROV_OUTPUT_DIR_URL]

The directory where all generated provenance files are saved. Both local and remote URLs are supported. If you do not wish to generate Provenance, you can turn it off by setting the [RECORD_PROVO] variable to false.
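As an illustration, the mandatory options could be declared as shell environment variables before launching the Lens. The values below are placeholders, not defaults; substitute your own license key and directories.

```shell
# Placeholder values; substitute your own license key and directories.
export LICENSE="<your-license-key>"
export OUTPUT_DIR_URL="/var/local/output/"
export PROV_OUTPUT_DIR_URL="/var/local/prov-output/"
# Alternatively, disable provenance generation entirely:
# export RECORD_PROVO=false
```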

AWS Configuration

If you wish to use cloud services such as Amazon Web Services, you need to specify an AWS Access Key, Secret Key, and Region through [AWS_ACCESS_KEY], [AWS_SECRET_KEY], and [S3_REGION], respectively. Providing your AWS credentials grants the Lens permission to access, download, and upload remote files in S3 Buckets. The S3 Region option specifies the AWS region in which your files and services reside. Please note that all services must be in the same region, including when you choose to run the Lens in an EC2 instance or with the use of Lambdas.
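For example, the AWS options might be set as follows. The credentials are placeholders and the region is only an example value.

```shell
# Placeholder credentials -- never commit real keys to source control.
export AWS_ACCESS_KEY="<your-access-key>"
export AWS_SECRET_KEY="<your-secret-key>"
export S3_REGION="eu-west-2"   # example region; all services must share it
```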

Information Extraction Configuration

The Document Lens extracts entities from documents using a combination of AI techniques and semantics. The first step is to extract text from PDF, DOCX, and TXT files. During the information extraction process, strings in the text (i.e., terms) can be associated with URIs describing entities. Entities can belong to three different sets:

  1. entities from the pre-built annotator index stored in a Redis database;

  2. entities imported via SPARQL using the terms loader;

  3. out-of-knowledge-base entities.

Annotator Index

The first set is always enabled. A Redis database can be set through the [ANNOTATOR_INDEX_URL] configuration option. We provide the following entity indices pre-built on the DBpedia Knowledge Graph.

DBpedia Full: 17,310,185 terms

s3://data-lens-indices/DBpedia-entity-mentions-v1.0.rdb

DBpedia Finance: 373,792 terms

s3://data-lens-indices/DBpedia-Finance-entity-mentions-v1.0.rdb

Terms Loader

The second set can be enabled by setting the [TERMS_LOADER_ENABLED] option to true. Here, terms and their respective URIs are loaded from a SPARQL endpoint into the index database. If the endpoint needs authentication, username and password can be set in the respective variables. See the Information Extraction section in the Configurable Options for an overview.

The SPARQL query must be saved in a text file and must return two variables named ?uri and ?label, which denote the entity URI and label (i.e., a term), respectively. For instance, the following query extracts all entities and their preferred English labels from a dataset described using the SKOS vocabulary.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label
WHERE {
  ?uri skos:prefLabel ?label
  FILTER(lang(?label) = 'en')
}

While the Lens is running, the terms may be reloaded by simply triggering the /reloadTerms endpoint. When this happens, any term previously loaded into the Lens index database is updated, meaning that changes in the triple store will be reflected in the index database. The Lens locally stores a backup of the original annotator index, therefore this action does not involve any database rollbacks nor downloading another annotator index file. Logically, the /reloadTerms endpoint cannot be triggered if the [TERMS_LOADER_ENABLED] option is set to false.

Out-of-Knowledge-Base Entities

Entities detected in text that are not found in the index database can still be returned to the user by setting the [ANNOTATOR_OUT_OF_KB_ENTITIES] variable to true. For each entity, a URI belonging to the desired namespace – stored in [ANNOTATOR_NAMESPACE] – will be generated.
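For instance, out-of-knowledge-base entities could be enabled as follows; the namespace URL is a placeholder for your own.

```shell
export ANNOTATOR_OUT_OF_KB_ENTITIES=true
# Placeholder namespace; generated URIs will belong to it.
export ANNOTATOR_NAMESPACE="http://example.com/entities#"
```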

Kafka Configuration

One of the many ways to interface with the Lens is through the use of Apache Kafka. With the Document Lens, a Kafka Message Queue can be used for managing both the input and the output of data to and from the Lens. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Lens. If you do not wish to use Kafka, please set the variable [LENS_RUN_STANDALONE] to true.

The Kafka Broker setting tells the Lens where to find your Kafka Cluster, so set this property in the form <kafka-ip>:<kafka-port>. The recommended port is 9092.
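As a sketch, the broker property might be declared as below. [KAFKA_BROKERS] is a hypothetical variable name used here for illustration only; check the full Kafka configuration list for the exact name in your Lens version.

```shell
# Hypothetical variable name -- verify against the Kafka configuration list.
# Format: <kafka-ip>:<kafka-port>; 9092 is the recommended port.
export KAFKA_BROKERS="10.0.0.5:9092"
```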

All other Kafka configuration variables can be found here, all of which have default values that can be overridden.

Provenance Configuration

As previously mentioned, Provenance is generated by default. This can be turned off by setting the [RECORD_PROVO] variable to false, otherwise [PROV_OUTPUT_DIR_URL] is required. If you wish to store this Provenance remotely in an S3 Bucket, then you are required to specify your region, access key, and secret key, through [PROV_S3_REGION], [PROV_AWS_ACCESS_KEY], and [PROV_AWS_SECRET_KEY], respectively.
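For example, provenance output to a remote S3 Bucket might be configured as follows, with placeholder values throughout.

```shell
export PROV_OUTPUT_DIR_URL="s3://example-bucket/prov-output/"
# Placeholder credentials and example region:
export PROV_S3_REGION="eu-west-2"
export PROV_AWS_ACCESS_KEY="<your-access-key>"
export PROV_AWS_SECRET_KEY="<your-secret-key>"
```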

If you wish to manage the Provenance output files through Kafka, then you can choose to use the same brokers and topic names as with the previously specified data files, or an entirely different cluster. All Provenance configuration can be found here.

Logging Configuration

Logging in the Document Lens works the same way as in all other Lenses and, like most functionality, is configurable through environment variables; the list of overridable options and their descriptions can be found here. When running the Lens locally from the command line using the instructions below, the Lens will automatically log to your terminal instance. In addition, log archives are saved within the Docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type; however, this can be overridden. If running a Lens in an AWS cloud environment, connect to your instance via SSH or PuTTY; the logging locations outlined above apply there as well.

Optional Configuration

There is also a further selection of optional configurations for given situations, see here for the full list.

Directories in Lenses

The Lenses are designed to support files and directories from an array of sources. This includes both local URLs and remote URLs including cloud-based technologies such as AWS S3. The location should be expressed as a URL string (Ref. RFC-3986).

  • To use a local URL for directories and files, both the format of file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-file.csv is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must specify an AWS access key and secret key.

Accessing the configuration of a running Lens

Once a Lens has started and is operational, you can request to view the current config by calling one of the Lens' built-in APIs; this is explained in more detail below. Please note that in order to change any config variable on a running Lens, it must be shut down and restarted.
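For instance, the running configuration can be fetched over HTTP. The host and port below assume a local Lens on the default port; substitute your own deployment values.

```shell
# Assumed local deployment; substitute your Lens host and port.
LENS_URL="http://127.0.0.1:8080"
echo "${LENS_URL}/config"
# curl -s "${LENS_URL}/config"   # returns the configuration as JSON
```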

 


 

Running the Lens

All of our Lenses are designed and built to be versatile, allowing them to be set up and run in a number of environments, whether in the cloud or on-premise. This is achieved through the use of Docker Containers. In addition, we have full support for the Amazon Web Services Marketplace, where you can directly subscribe to and run your Lens.

Local Docker Image

To run the Lens locally, first ensure you have Docker installed. Then, by running a command with the following structure, Docker will start the container and run the Lens from your downloaded image. In the next steps, we assume the Data Lens license string has been stored in the [LICENSE] environment variable.

For UNIX-based machines (macOS and Linux), the command is the following.

docker run \
  --env LICENSE=$LICENSE \
  --env OUTPUT_DIR_URL=/var/local/output/ \
  --env LENS_RUN_STANDALONE=true \
  --env PROV_OUTPUT_DIR_URL=/var/local/prov-output/ \
  -p 8080:8080 \
  -v /var/local/:/var/local/ \
  lens-unstructured:latest

For Windows machines, the command is the following.

docker run ^
  --env LICENSE=%LICENSE% ^
  --env OUTPUT_DIR_URL="/data/output/" ^
  --env LENS_RUN_STANDALONE=true ^
  --env PROV_OUTPUT_DIR_URL="/data/prov-output/" ^
  -p 8080:8080 ^
  -v C:\data\:/data/ ^
  lens-static:latest

The above examples demonstrate how to override configuration options using environment variables in your Lens. Line 2 shows an environment variable saved on the machine being passed in, whereas lines 3-5 show string values being passed directly. Given that the Lens runs on port 8080, line 6 exposes and binds that port on the host machine so that the APIs can be triggered. The -v flag on line 7 mounts the working directory into the container; when the host directory of a bind-mounted volume doesn't exist, Docker will automatically create it on the host for you. Finally, line 8 is the name and version of the Docker image you wish to run.

For more information on running Docker images, see the official Docs.

Docker on AWS

The deployment approach we recommend at Data Lens is to use Amazon Web Services, both to store source and RDF data and to host and run your Lenses and Writer.

The aim is to deploy the Lens and other services using AWS by setting up the following architecture:

  • An Amazon Web Services Elastic Container Service (ECS) cluster, hosting a single EC2 instance on which the Lens containers run

  • A Lambda Function

  • An S3 bucket

For more information on the Architecture and Deployment of an Enterprise System, see our guide.

AWS Marketplace

We now have full support for the Amazon Web Services Marketplace, where you can directly subscribe to a Lens. Then, using our CloudFormation Templates, you can deploy a one-click solution to run your Lens. See here for further details and instructions to get you started.

 


Ingesting Data

The Document Lens supports a number of ways to ingest your data files. All three supported file types, PDF, DOCX, and TXT, are ingested in the same way, with the only exception being PDF files, whose maximum number of extracted characters may be limited via the [PDF_CHAR_LIMIT] environment variable. Please note that this limit applies to the total number of characters extracted; it does not cause the Lens to iterate over the remaining text.

Endpoint

The easiest way to ingest a file into the Document Lens is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a file to ingest in the same way as previously outlined, and in return you will be provided with the URL of the generated RDF data file.

The structure and parameters for the GET request are as follows: http://<lens-ip>:<lens-port>/process?inputFileURL=<input-file-url>; for example: http://127.0.0.1:8080/process?inputFileURL=file:///var/local/input-document.pdf. The response is in the form of a JSON.
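The request can be assembled and sent from the command line. This sketch assumes a local Lens on the default port and an example input file; substitute your own values.

```shell
# Assumed host/port and input file; substitute your own values.
LENS_URL="http://127.0.0.1:8080"
INPUT_URL="file:///var/local/input-document.pdf"
REQUEST="${LENS_URL}/process?inputFileURL=${INPUT_URL}"
echo "$REQUEST"
# curl -s "$REQUEST"   # the JSON response contains the output RDF file URL
```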

Kafka

The second, and the more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here; in short, to ingest files into the Document Lens you require a Producer. The topic this Producer publishes to must have the same name that you specified in the [KAFKA_TOPIC_NAME_SOURCE] config option (defaults to “source_urls”). Once set up, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-document.pdf.
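As a sketch, such a message could be sent with Kafka's stock console producer. The broker address is a placeholder; the topic name matches the documented default.

```shell
TOPIC="source_urls"   # must match KAFKA_TOPIC_NAME_SOURCE
FILE_URL="s3://examplebucket/folder/input-document.pdf"
echo "$FILE_URL"      # the message body is the file URL and nothing else
# Send it to the cluster (uncomment and set your broker address):
# echo "$FILE_URL" | kafka-console-producer.sh \
#   --broker-list <kafka-ip>:9092 --topic "$TOPIC"
```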

S3 Lambda

If you wish to use Kafka, and you are also using S3 to store your source data, we have developed an AWS Lambda to aid with the ingestion of data into your Document Lens. The Lambda is designed to monitor a specific Bucket in S3; when a file arrives or is modified in a specific directory, a message containing the URL of the new or modified file is written to a specified Kafka Topic, and the file is subsequently ingested by the Lens. For instructions on how to set up the Lambda within your AWS environment, click here.

 


 

Output Data

The graph in the generated RDF file is structured in the following way. In the figure below, the left side of the diagram shows the individual of type dlo:Document (where dlo: is the Data Lens Ontology namespace http://www.data-lens.co.uk/ontology#) along with its author, extracted from the PDF metadata, and original location. A single document can be linked with many dlo:Mentions (right side), one of which is shown here along with the text excerpt it was anchored to, the DBpedia entity extracted, and the confidence value between 0 and 1 of the extraction. The white colour indicates terms defined in the Data Lens ontology, the grey colour indicates terms defined in the output file, whereas the green colour indicates individuals that belong to external namespaces.

The data files created and output by the Lens are the same regardless of how it was triggered or ingested; however, the way in which this information is communicated back to you differs slightly for each method.

Endpoint

Once an input file has successfully been processed after being ingested via the Process endpoint, the response returned from the Lens is in the form of a JSON. Within the JSON response is the output element; this element contains the URL of the generated RDF file.

Sample output:

Kafka

If you have a Kafka Cluster set up and running, then the URL of each successfully generated RDF file will be pushed to your Kafka Queue. It will be pushed to the Topic specified in the [KAFKA_TOPIC_NAME_SUCCESS] config option, which defaults to “success_queue”. One of the many advantages of this approach is that the transformed data can now be ingested by our Lens Writer, which will publish the RDF to a triple store of your choice.

Dead Letter Queue

If something goes wrong during the operation of the Lens, the system will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with metadata about that ingestion, allowing the problem to be diagnosed and the file later re-ingested. If enabled, the provenance generated for the current ingestion will also be included as JSON-LD. This message will be in the form of a JSON with the following structure:

 

 


 

Input/Output Data Example

The following is an excerpt from an input PDF file. In this example, the Document Lens extracts entities from the file which are found in the pre-built DBpedia Finance index, enriched with terms from the Financial Industry Business Ontology (FIBO).

The list of extracted entities can be seen in the table below. Note that the entities belong to either the DBpedia knowledge graph or the FIBO ontology. The entities ROE and 4Q2019 were not returned since they were not found in either of the indices. However, setting the option [ANNOTATOR_OUT_OF_KB_ENTITIES] to true with the [ANNOTATOR_THRESHOLD] low enough would also return out-of-knowledge-base entities.

In the following excerpt, part of the content of the output RDF file in N-Quads format is shown.

In order to retrieve the entities and their information, the RDF data above can be queried by performing the simple SPARQL query below.

 


 

Provenance Data

Within the Document Lens, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of the data over time, allowing you to see what the state of the data was at any moment. The model we use to record Provenance information is the W3C standard PROV-O model.

Provenance files are uploaded to the location specified in [PROV_OUTPUT_DIR_URL], and the file location is then pushed to the Kafka Topic declared in [PROV_KAFKA_TOPIC_NAME_SUCCESS].

The provenance activities in the Document Lens are the following:

  • main-execution

    1. input-file-download

    2. text-extraction

    3. concept-annotation

    4. semantification

    5. model-unification

    6. output-file-upload

  • kafkaActivity

For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.

 


 

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Lens, there is a selection of built-in exposed endpoints for you to call.

Process

GET /process?inputFileURL=<input-file-url>

Tells the Lens to ingest the file located at the specified URL location.

Reload Terms

GET /reloadTerms

Tells the Lens to reload the terms from the SPARQL endpoint specified at startup. The [TERMS_LOADER_ENABLED] configuration option must be set to true.

Config

GET /config

Displays all Lens configuration as JSON.

GET /config?paths=<config-options>

Displays only the configuration options specified in the comma-separated list.

License

GET /license

Displays license information.

Reload Terms

As previously mentioned, while the Lens is running, the terms may be reloaded by simply triggering the /reloadTerms endpoint. When this happens, any term previously loaded into the Lens index database is updated, meaning that changes in the triple store will be reflected in the index database. The Lens locally stores a backup of the original annotator index, therefore this action does not involve any database rollbacks nor downloading another annotator index file. To use this endpoint, the [TERMS_LOADER_ENABLED] configuration option must be set to true. Upon successful reloading, the following JSON response will be returned:

Config

The config endpoint is a GET request that allows you to view the configuration settings of a running Lens. By sending GET http://<lens-ip>:<lens-port>/config (for example http://127.0.0.1:8080/config), you will receive the entire configuration represented as a JSON, as seen in the small snippet below. All confidential values (such as AWS credentials) are replaced with the fixed string "REDACTED".

Alternatively, you can specify exactly what config options you wish to return by providing a comma-separated list of variables under the paths parameter. For example, the request of GET http://<lens-ip>:<lens-port>/config?paths=lens.config.outputDirUrl,logging.loggers would return the following.

License

The license endpoint is a GET request that allows you to view information about your license key that is in use on a running lens. By sending GET http://<lens-ip>:<lens-port>/license (for example: http://127.0.0.1:8080/license), you will receive a JSON response containing the following values.

Process

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to ingest your source data. When an execution of the Lens fails after being triggered in this way, the response will have a 400 Bad Request status and contain a response message similar to that sent to the dead letter queue, as outlined above.