Common Data Lens Architectures
Ingesting XML/CSV/JSON files
Structured File Lens + Kafka + Lens Writer
Once you have your Structured File Lens and your Lens Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system for ingesting your structured files (CSV/XML/JSON) into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:
Source File System → Kafka → Structured File Lens → Kafka → Lens Writer → Triple Store
The first thing to determine is where your source data files are stored. Whether they live locally, remotely, or in an S3 Bucket, the Structured File Lens must be told where to retrieve these files from. Here we utilise message queues in the form of Apache Kafka. By setting up a Kafka Producer that publishes to the topic name specified in the Lens’ KAFKA_TOPIC_NAME_SOURCE config variable (defaults to “source_urls”), you can send file URLs directly to the Lens. Once set up, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-data.csv. Additionally, if you are using Kafka and S3 Buckets, you can use our AWS Lambda to automatically push a message to the Kafka Queue whenever a new file is uploaded to your S3 Bucket, which in turn automatically triggers the Lens to ingest and transform the data.
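As an illustration, a minimal Kafka Producer that sends one such message could look like the Python sketch below, using the kafka-python client; the broker address is a placeholder for your own Kafka cluster, and the topic name assumes the default “source_urls”.

```python
from kafka import KafkaProducer

# Connect to your Kafka cluster (broker address is a placeholder).
producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")

# The message body is nothing more than the URL of the file to ingest,
# sent to the topic configured in KAFKA_TOPIC_NAME_SOURCE (default "source_urls").
producer.send("source_urls", value=b"s3://examplebucket/folder/input-data.csv")
producer.flush()
```

The Lens, which consumes from this topic, will then retrieve the file from that URL and begin its transformation.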
Once the Structured File Lens has ingested and transformed your input source data, the generated RDF data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. The Lens will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). This allows the Lens Writer to pick up the newly published message from the Kafka Queue, ingest the generated RDF file, and publish it to the Triple Store specified in your Writer’s TRIPLESTORE_ENDPOINT config value.
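While developing or debugging this flow, it can be handy to watch the success topic yourself with a plain Kafka Consumer. The sketch below simply prints each message it sees, assuming a placeholder broker address and the default “success_queue” topic; the exact payload format is defined by the Lens, so treat it as opaque here.

```python
from kafka import KafkaConsumer

# Subscribe to the Lens' success topic (KAFKA_TOPIC_NAME_SUCCESS, default "success_queue").
consumer = KafkaConsumer(
    "success_queue",
    bootstrap_servers="kafka-broker:9092",  # placeholder broker address
    auto_offset_reset="earliest",
)

# Print each success message as it arrives; the Lens Writer consumes the same
# messages to locate and load the generated RDF.
for message in consumer:
    print(message.value)
```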
When provenance data is generated in the Structured File Lens, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Triple Store, an additional Lens Writer is required, configured so that its Kafka topic is directed to the separate provenance topic specified in the Lens.
Ingesting from an SQL Database
SQL Lens + Kafka + Lens Writer
Once you have your SQL Lens and your Lens Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system for ingesting data from your Relational SQL Databases into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:
Relational SQL Database → Cron Scheduler / API Endpoint → SQL Lens → Kafka → Lens Writer → Triple Store
As seen in the SQL Lens User Guide, the connection to your Database lies within the mapping files that you have created. The process for the Lens to start ingesting data from your DB can be triggered in two ways. One is to use the exposed API Endpoint: this is simply a GET request targeting the Lens, for example, http://<lens-ip>:<lens-port>/process. The other is to use a Cron Expression to set up a time-based job scheduler, which will schedule the Lens to ingest your specified data from your database(s) periodically at fixed times, dates, or intervals.
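For the first option, the trigger is just an HTTP GET. The Python sketch below uses the requests library, with a placeholder host and port standing in for your deployed SQL Lens.

```python
import requests

# Replace host and port with wherever your SQL Lens is deployed.
LENS_URL = "http://sql-lens:8080/process"

# A plain GET request against /process triggers the Lens to ingest
# from the database(s) defined in your mapping files.
response = requests.get(LENS_URL)
response.raise_for_status()
print(response.status_code, response.text)
```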
Once the SQL Lens has ingested and transformed your input source data, the generated RDF data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. The Lens will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). This allows the Lens Writer to pick up the newly published message from the Kafka Queue, ingest the generated RDF file, and publish it to the Triple Store specified in your Writer’s TRIPLESTORE_ENDPOINT config value.
When provenance data is generated in the SQL Lens, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Triple Store, an additional Lens Writer is required, configured so that its Kafka topic is directed to the separate provenance topic specified in the Lens.
Ingesting from a REST API
RESTful Lens + Kafka + Lens Writer
Once you have your RESTful Lens and your Lens Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system for ingesting data from your REST API endpoint into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:
REST API → Cron Scheduler / API Endpoint → RESTful Lens → Kafka → Lens Writer → Triple Store
As seen in the RESTful Lens User Guide, the connection to your REST API lies within the JSON_REST_CONFIG_URL configuration variable. The process for the Lens to start ingesting data from your API can be triggered in two ways. One is to use the Lens’ exposed API Endpoint: this is simply a GET request targeting the Lens, for example, http://<lens-ip>:<lens-port>/process. The other is to use a Cron Expression to set up a time-based job scheduler, which will schedule the Lens to ingest your API data periodically at fixed times, dates, or intervals.
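If you prefer to drive the cron-based approach from outside the Lens, one way to sketch it is with an external scheduler that fires the same GET request on a Cron Expression. The example below uses the APScheduler library, with a placeholder host, port, and schedule; it is only one possible way to realise such a scheduler.

```python
import requests
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

LENS_URL = "http://restful-lens:8080/process"  # placeholder host and port

def trigger_lens():
    # Each run is the same GET request against the Lens' /process endpoint.
    requests.get(LENS_URL).raise_for_status()

scheduler = BlockingScheduler()
# Example cron expression: run every day at 02:00.
scheduler.add_job(trigger_lens, CronTrigger.from_crontab("0 2 * * *"))
scheduler.start()
```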
Once the RESTful Lens has ingested and transformed your input source data, the generated RDF data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. If enabled, the Lens will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). This allows the Lens Writer to pick up the newly published message from the Kafka Queue, ingest the generated RDF file, and publish it to the Triple Store specified in your Writer’s TRIPLESTORE_ENDPOINT config value.
When provenance data is generated in the RESTful Lens, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Triple Store, an additional Lens Writer is required, configured so that its Kafka topic is directed to the separate provenance topic specified in the Lens.
Ingesting PDF/doc(x)/txt files
Document Lens + Kafka + Lens Writer
Once you have your Document Lens and your Lens Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system for ingesting your document files (PDF/doc(x)/txt) into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:
Source File System → Kafka → Document Lens → Kafka → Lens Writer → Triple Store
The first thing to determine is where your source data files are stored. Whether they live locally, remotely, or in an S3 Bucket, the Document Lens must be told where to retrieve these files from. Here we utilise message queues in the form of Apache Kafka. By setting up a Kafka Producer that publishes to the topic name specified in the Lens’ KAFKA_TOPIC_NAME_SOURCE config variable (defaults to “source_urls”), you can send file URLs directly to the Lens. Once set up, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-data.pdf. Additionally, if you are using Kafka and S3 Buckets, you can use our AWS Lambda to automatically push a message to the Kafka Queue whenever a new file is uploaded to your S3 Bucket, which in turn automatically triggers the Lens to ingest and transform the data.
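Our AWS Lambda ships separately, but purely to illustrate the pattern, a hand-rolled handler reacting to an S3 upload and forwarding the object’s URL to the source topic could look roughly like the sketch below. The broker address and topic name are placeholders, and the Lambda we provide may differ in its details.

```python
from kafka import KafkaProducer

# Placeholder broker address; in practice this would come from the Lambda's environment.
producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")

def lambda_handler(event, context):
    # An S3 "ObjectCreated" event may contain one or more records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Forward the uploaded file's URL to the Lens' source topic
        # (KAFKA_TOPIC_NAME_SOURCE, default "source_urls").
        producer.send("source_urls", value=f"s3://{bucket}/{key}".encode("utf-8"))
    producer.flush()
    return {"statusCode": 200}
```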
Once the Document Lens has ingested and extracted terms from your input source data, the generated RDF data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. If enabled, the Lens will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). This allows the Lens Writer to pick up the newly published message from the Kafka Queue, ingest the generated RDF file, and publish it to the Triple Store specified in your Writer’s TRIPLESTORE_ENDPOINT config value.
When provenance data is generated in the Document Lens, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Triple Store, an additional Lens Writer is required, configured so that its Kafka topic is directed to the separate provenance topic specified in the Lens.