...

This is the full User Guide for the Structured File Lens; it contains an in-depth set of instructions to fully set up, configure, and run the Lens so you can start ingesting data as part of an end-to-end system. For a guide to getting the Lens up and running in the quickest and simplest way possible, see the Quick Start Guide. For a list of what has changed since the last release, visit the User Release Notes.

...

Table of Contents

...

The first step in configuring the Structured File Lens is to create a mapping file. The mapping file is what creates the links between your source data and your target model (ontology). It can be created using our online Data Lens Mapping Tool, which provides an intuitive web-based UI. Log in here to get started, and select the option for the Structured File Lens. The Structured File Lens is capable of ingesting XML, CSV, and JSON files; the creation of mapping files differs slightly between file types, so be sure to select the correct options for your use case. Alternatively, the Mapping Tool can be deployed to your own infrastructure, which enables additional functionality such as the ability to update mapping files on a running Lens. To do this, follow these instructions.

However, if you wish to create your RML mapping files manually, there is a detailed step-by-step guide on creating one from scratch.

...

All Lenses supplied by Data Lens are configurable through the use of Environment Variables. How to declare these environment variables will differ slightly depending on how you choose to run the Lens, so please see Running the Lens for more info. For a breakdown of every configuration option in the Structured File Lens, see the full list here.

Mandatory Configuration

For the Lens to operate, the following configuration options are required.

  • License - LICENSE

    • This is the license key required to operate the lens, request your new unique license key here.

  • Mapping Directory URL - MAPPINGS_DIR_URL

This is the directory where your mapping file(s) are located. As with all directories, this can be either local or on a remote S3 bucket. Mapping files for the Structured File Lens can be created using our Mapping Config Web App and can be pushed directly to a running Lens.

  • Output Directory URL - OUTPUT_DIR_URL

    • This is the directory where all generated RDF files are saved to. This also supports local and remote URLs.

  • Provenance Output Directory URL - PROV_OUTPUT_DIR_URL

Out of the box, the Structured File Lens supports Provenance, and it is generated by default. Once generated, the Provenance is saved to output files separate from the transformed source data. This option specifies the directory where provenance RDF files are saved, and it also supports local and remote URLs.

    • If you do not wish to generate Provenance, you can turn it off by setting the RECORD_PROVO variable to false. In this case, the PROV_OUTPUT_DIR_URL option is no longer required. For more information on Provenance configuration, see below.

AWS Configuration

If you wish to use cloud services such as Amazon Web Services, you need to specify an AWS Access Key, Secret Key, and Region, through AWS_ACCESS_KEY, AWS_SECRET_KEY, and S3_REGION respectively. Providing your AWS credentials gives the Lens permission to access, download, and upload remote files in S3 Buckets. The S3 Region option specifies the AWS region where your files and services reside. Please note that all services must be in the same region, including if you choose to run the Lens in an EC2 instance or with the use of Lambdas.
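
As a minimal sketch, the snippet below uses the Docker SDK for Python to launch the Lens with the mandatory options plus AWS credentials supplied as environment variables. The image name, port, directory URLs, and region are placeholder values for illustration only; substitute the details of your own deployment.

Code Block (python)
import docker

client = docker.from_env()

# Launch the Lens container; image name, tag, and port are hypothetical examples.
client.containers.run(
    "datalens/structured-file-lens:latest",
    detach=True,
    ports={"8080/tcp": 8080},
    environment={
        # Mandatory configuration
        "LICENSE": "<your-license-key>",
        "MAPPINGS_DIR_URL": "s3://example/mappings/",
        "OUTPUT_DIR_URL": "s3://example/output/",
        "PROV_OUTPUT_DIR_URL": "s3://example/prov-output/",
        # AWS configuration, required here because the directories are S3 URLs
        "AWS_ACCESS_KEY": "<access-key>",
        "AWS_SECRET_KEY": "<secret-key>",
        "S3_REGION": "eu-west-2",
    },
)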

...

One of the many ways to interface with the Lens is through the use of Apache Kafka. With the Structured File Lens, a Kafka Message Queue can be used for managing both the input and the output of data to and from the Lens. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Lens. If you do not wish to use Kafka, please set the variable LENS_RUN_STANDALONE to true.
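
To sketch what input over Kafka looks like, the snippet below uses the kafka-python library to publish a file URL to the Lens's input topic. The broker address, topic name, and plain-string message format are assumptions for illustration; check the Kafka configuration list for the actual values your deployment uses.

Code Block (python)
from kafka import KafkaProducer

# Broker address and topic name are illustrative placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: v.encode("utf-8"),
)
producer.send("input_queue", "s3://example/folder/input-data.csv")
producer.flush()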

...

All other Kafka configuration variables can be found here, all of which have default values that can be overridden.

Provenance Configuration

As previously mentioned, Provenance is generated by default; this can be turned off by setting the RECORD_PROVO variable to false, otherwise PROV_OUTPUT_DIR_URL is required. If you wish to store this Provenance remotely in an S3 Bucket, then you are required to specify your region, access key, and secret key, through PROV_S3_REGION, PROV_AWS_ACCESS_KEY, and PROV_AWS_SECRET_KEY respectively.

If you wish to manage the Provenance output files through Kafka, then you can choose to use the same brokers and topic names as with the previously specified data files, or an entirely different cluster. All Provenance configuration can be found here.

Logging Configuration

Logging in the Structured File Lens works the same way as in all other Lenses, and like most functionality it is configurable through the use of environment variables; the list of overridable options and their descriptions can be found here. When running the Lens locally from the command line using the instructions below, the Lens will automatically log to your terminal instance. In addition to this, archives of logs will be saved within the Docker container at /var/log/datalens/archive/current/ and /var/log/datalens/json/archive/ for text and JSON logs respectively, while the current logs can be found at /var/log/datalens/text/current/ and /var/log/datalens/json/current/. By default, a maximum of 7 log files will be archived for each file type, however this can be overridden. If running a Lens in the cloud in an AWS environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply.

...

There is also a further selection of optional configurations for given situations, see here for the full list.

Directories in Lenses

...

  • To use a local URL for directories and files, both the format of file:///var/local/data-lens-output/ and /var/local/data-lens-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-file.csv is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, you must specify an AWS access key and secret key.
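
A quick way to see how these forms break down is to parse them. The short sketch below is illustrative only (the Lens performs its own URL resolution internally); it prints the scheme, bucket, and path components for each supported format.

Code Block (python)
from urllib.parse import urlparse

urls = [
    "file:///var/local/data-lens-output/",
    "/var/local/data-lens-output/",
    "https://example.com/input-file.csv",
    "s3://example/folder/input-file.csv",
]
for url in urls:
    parsed = urlparse(url)
    # For s3:// URLs: scheme is "s3", netloc is the bucket name,
    # and path is the <directory>/<file-name> part.
    print(parsed.scheme or "local path", parsed.netloc, parsed.path)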

...

Once a Lens has started and is operational, you can request to view the current config by calling one of the Lens' built-in APIs; this is explained in more detail below. Please note that in order to change any config variable on a running Lens, it must be shut down and restarted.

...

The deployment approach we recommend at Data Lens is to use Amazon Web Services, both to store source and RDF data and to host and run your Lenses and Writer. We have written a brief DevOps guide intended to support you in deploying Data Lens into the AWS environment.

The aim is to deploy the Lens and other services using AWS by setting up the following architecture:

  • An Amazon Web Services Elastic Container Service (ECS)

...

The workflow the guide aims to achieve is as follows:

  1. A source data file is placed into the S3 bucket

  2. The Lambda monitors this bucket and, when a file arrives, notifies Kafka

  3. The Lens reads the message from Kafka and transforms the source data file into RDF

  4. The transformed data is passed to the Writer, which writes it to a Semantic Knowledge Graph or Property Graph

Info

For more information on the Architecture and Deployment of an Enterprise System, see our guide.

...

Ingesting Data

The Structured File Lens supports a number of ways to ingest your data files. While all three supported file types, CSV, XML, and JSON, are ingested in the same way, there may be some additional parameters you wish to set for CSV and XML, as detailed below.

Endpoint

First, the easiest way to ingest a file into the Structured File Lens is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a file to ingest in the same way as previously outlined, and in return, you will be provided with the URL of the generated RDF data file.

The structure and parameters for the GET request are as follows: http://<lens-ip>:<lens-port>/process?inputFileURL=<input-file-url>, for example, http://127.0.0.1:8080/process?inputFileURL=file:///var/local/input-data.csv. The response is returned as JSON.
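
The same request can be made programmatically; the sketch below uses Python's requests library with the example host, port, and input file from above, so adjust them for your deployment.

Code Block (python)
import requests

# Example host, port, and input file from above; adjust for your deployment.
response = requests.get(
    "http://127.0.0.1:8080/process",
    params={"inputFileURL": "file:///var/local/input-data.csv"},
)
response.raise_for_status()
result = response.json()
# outputFileLocations lists the URL(s) of the generated RDF file(s).
print(result["outputFileLocations"])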

...

If you wish to use Kafka, and you are also using S3 to store your source data, we have developed an AWS Lambda to aid with the ingestion of data into your Structured File Lens. The Lambda is designed to monitor a specific Bucket in S3, and when a file arrives or is modified in a specific directory, a message is written to a specified Kafka Topic containing the URL of the new/modified file. This file will then be ingested by the Lens. For instructions on how to set up the Lambda within your AWS environment, click here.
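
Purely to illustrate the shape of this workflow (the supplied Lambda linked above is the supported route), a hand-rolled handler might look like the sketch below; the broker address and topic name are placeholders, not the values used by the supplied Lambda.

Code Block (python)
from kafka import KafkaProducer

# Placeholder broker address; replace with your cluster's address.
producer = KafkaProducer(
    bootstrap_servers="<broker-host>:9092",
    value_serializer=lambda v: v.encode("utf-8"),
)

def handler(event, context):
    # Triggered by an S3 ObjectCreated event: forward each file's URL to Kafka.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        producer.send("input_queue", f"s3://{bucket}/{key}")
    producer.flush()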

CSV Splitting

While ingesting CSV files is the same as with XML and JSON, there are a couple of points to note. A CSV file with a very large number of rows will be split into chunks and processed separately, by default every 100,000 rows; for example, a 250,000-row file is processed as three chunks. This allows for better performance and continuous output of RDF files. When processing using Kafka, messages are continuously pushed to the Success Queue; however, when using the Process endpoint, the response will only be returned once the entire file transformation has been completed. This chunk size can be overridden with the configuration option MAX_CSV_ROWS, or chunking can be turned off by setting it to 0 (not recommended).

...

Once an input file has successfully been processed after being ingested via the Process endpoint, the response returned from the Lens is a JSON document. Within the JSON response is the outputFileLocations element; this element contains a list of the URLs of all generated RDF files. Usually this will be a single file; however, multiple files will be generated and listed when ingesting large CSV files.

Sample output:

Code Block (json)
{
    "input": "file:///var/local/input/input-data.csv",
    "failedIterations": 0,
    "successfulIterations": 1,
    "outputFileLocations": [
        "/var/local/output/Structured-File-Lens-44682bd6-3fbc-429b-988d-40dda8892328.nq"
    ]
}

...

If you have a Kafka Cluster set up and running, then the successfully generated RDF file URL will be pushed to your Kafka Queue. It will be pushed to the Topic specified in the KAFKA_TOPIC_NAME_SUCCESS config option, which defaults to “success_queue”. One of the many advantages of using this approach is that the transformed data can then be ingested using our Lens Writer, which will publish the RDF to a Semantic Knowledge Graph (or a selection of Property Graphs) of your choice!
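
For instance, a downstream consumer of the success queue might look like the following sketch using kafka-python; the broker address is a placeholder, while the topic name is the default noted above.

Code Block (python)
from kafka import KafkaConsumer

# Broker address is a placeholder; "success_queue" is the default topic name.
consumer = KafkaConsumer(
    "success_queue",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)
for message in consumer:
    print("Generated RDF file:", message.value)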

...

For more information on how the provenance is laid out, as well as how to query it from your Triple Store, see the Provenance Guide.

...

REST API Endpoints

In addition to the Process Endpoint designed for ingesting data into the Lens, there is a selection of built-in exposed endpoints for you to call.

...

As previously outlined in the Ingesting Data via Endpoint section, using the process endpoint is one way of triggering the Lens to ingest your source data. When an execution of the Lens fails after being triggered in this way, the response will be a status 400 Bad Request as follows.

...