Configurable Options – Document Lens v1.3+
The tables below list every configurable option in the Document Lens. For instructions on setting config variables, see the Quick Start Guide or the Full User Guide. Mandatory variables are highlighted in red.
Lens Configuration
Environment Variable | Default Value | Description | Version |
---|---|---|---|
FRIENDLY_NAME | Document-Lens | The name you wish to set your Lens up with. | v1.3+ |
LICENSE | | The License key required for running the Lens. | v1.3+ |
OUTPUT_DIR_URL | | The URL of the directory you wish the generated RDF to be output to. Can be local or remote, see here for more details. If the protocol is | v1.3+ |
S3_REGION | us-east-1 | The region in AWS where your files and services reside. Note: all services must be in the same region. | v1.3+ |
AWS_ACCESS_KEY | | Your access key for AWS. | v1.3+ |
AWS_SECRET_KEY | | Your secret key for AWS. | v1.3+ |
LENS_RUN_STANDALONE | false | Each Lens is designed to run as part of a larger end-to-end system in which the resulting data is uploaded to a Knowledge or Property Graph. As part of this process, Kafka is used to communicate between services, and this communication is enabled by default. To run the Lens standalone, without communicating with other services, set this property to true. | v1.3+ |
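
As a minimal sketch of how the variables in the table above might be resolved, the following Python helper reads them from an environment mapping and applies the documented defaults. The function name `load_lens_config` is hypothetical and not part of the Lens. Treating `LICENSE` and `OUTPUT_DIR_URL` as required because they have no documented default is also an assumption made for illustration.

```python
import os

# Defaults copied from the Lens Configuration table above.
LENS_DEFAULTS = {
    "FRIENDLY_NAME": "Document-Lens",
    "S3_REGION": "us-east-1",
    "LENS_RUN_STANDALONE": "false",
}

# Assumption: variables listed without a default must be supplied by the user.
NO_DEFAULT = ["LICENSE", "OUTPUT_DIR_URL"]


def load_lens_config(env=None):
    """Resolve settings from `env`, raising if a variable without a default is missing."""
    env = os.environ if env is None else env
    missing = [name for name in NO_DEFAULT if not env.get(name)]
    if missing:
        raise ValueError("Missing configuration: " + ", ".join(missing))
    config = {name: env.get(name, default) for name, default in LENS_DEFAULTS.items()}
    for name in NO_DEFAULT:
        config[name] = env[name]
    # Interpret the standalone flag as a boolean.
    config["LENS_RUN_STANDALONE"] = config["LENS_RUN_STANDALONE"].strip().lower() == "true"
    return config
```

Calling `load_lens_config()` with no argument reads from the process environment; passing a plain dictionary makes the helper easy to check in isolation.
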
Information Extraction Configuration
Environment Variable | Default Value | Description | Version |
---|---|---|---|
ANNOTATOR_INDEX_URL | s3://data-lens-indices/DBpedia-Finance-entity-mentions-v1.0.rdb | The Redis database containing the index of the entities that can be extracted from plain text. By default, the lens will utilise the index pre-built using entities from the DBpedia knowledge graph categorised as Finance. More details in the User Guide. | v1.3+ |
ANNOTATOR_NER_MODEL | ner.random-forest.tag.2.20190724.model | The machine-learning model used for Named Entity Recognition (NER) of entities in plain text. The model is saved in the Weka binary format. | v1.3+ |
ANNOTATOR_EL_MODEL | el.random-forest.31.20190405.model | The machine-learning model used for Entity Linking (EL), i.e. the step that associates a confidence value with each extracted entity. The model is saved in the Weka binary format. | v1.3+ |
ANNOTATOR_THRESHOLD | 0.7 | The minimum confidence value for the extracted entities. Allowed values are between 0 and 1. | v1.3+ |
ANNOTATOR_OUT_OF_KB_ENTITIES | false | Create a URI for those entities found in the text which were not found in the index. | v1.3+ |
ANNOTATOR_NAMESPACE | http://www.data-lens.co.uk/ | The namespace of the entity URIs returned in the RDF data. | v1.3+ |
SPACY_TIMEOUT | 50000 | The timeout in milliseconds for the spaCy NLP library to annotate part-of-speech tags in a single document, as part of the Named Entity Recognition phase. | v1.3+ |
PDF_CHAR_LIMIT | -1 | The maximum number of characters extracted from a PDF file. The default value of -1 means unlimited. Please note that this limit represents the total number of characters and does not cause iteration. | v1.3+ |
TERMS_LOADER_ENABLED | false | Enable the loading of terms from an external SPARQL endpoint. The terms are then added to the index of the entities that can be extracted from plain text. | v1.3+ |
TERMS_LOADER_ENDPOINT | | The external SPARQL endpoint for the terms loader. | v1.3+ |
TERMS_LOADER_USER | | The external SPARQL endpoint username for the terms loader. Set to an empty string if not needed. | v1.3+ |
TERMS_LOADER_PASS | | The external SPARQL endpoint password for the terms loader. Set to an empty string if not needed. | v1.3+ |
TERMS_LOADER_QUERY_FILE_URL | s3://data-lens-indices/terms-loader-query.sparql | The SPARQL query to execute to load the terms. The query must return two variables named | v1.3+ |
WHICH_PYTHON | /usr/local/bin/python3 | The path to the Python 3 interpreter. Do not modify if the Lens is executed via Docker container. | v1.3+ |
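
The numeric options above can be illustrated with a short sketch. The helpers below are hypothetical and are not the Lens implementation. They assume that `ANNOTATOR_THRESHOLD` is an inclusive lower bound on entity confidence, and that `PDF_CHAR_LIMIT` simply truncates the extracted text once, with -1 meaning no limit.

```python
def parse_threshold(value="0.7"):
    """Parse ANNOTATOR_THRESHOLD and check that it lies between 0 and 1."""
    threshold = float(value)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"ANNOTATOR_THRESHOLD must be between 0 and 1, got {threshold}")
    return threshold


def apply_char_limit(text, limit=-1):
    """Truncate `text` to `limit` characters; a negative limit means unlimited."""
    if limit < 0:
        return text
    return text[:limit]


def filter_by_confidence(entities, threshold):
    """Keep (entity, confidence) pairs whose confidence meets the threshold.

    Assumption: the comparison is inclusive, since the table calls the
    threshold the *minimum* confidence value.
    """
    return [(entity, conf) for entity, conf in entities if conf >= threshold]
```

With the default threshold of 0.7, an entity scored 0.9 would be kept and one scored 0.5 would be dropped.
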
Kafka Configuration
Environment Variable | Default Value | Description | Version |
---|---|---|---|
KAFKA_BROKERS | localhost:9092 | The Kafka Broker is what tells the Lens where to look for your Kafka Cluster. Set with the following structure | v1.3+ |
KAFKA_TOPIC_NAME_SOURCE | source_urls | The topic from which the Consumer reads messages containing the URLs of the input files to ingest. | v1.3+ |
KAFKA_TOPIC_NAME_DLQ | dead_letter_queue | The topic used to push messages containing reasons for failure within the Lens. These messages are represented as JSON. | v1.3+ |
KAFKA_TOPIC_NAME_SUCCESS | success_queue | The topic used for the messages sent containing the file URLs of the successfully transformed RDF data files. | v1.3+ |
KAFKA_GROUP_ID_CONFIG | consumerGroup1 | The identifier of the group this consumer belongs to. | v1.3+ |
KAFKA_AUTO_OFFSET_RESET_CONFIG | earliest | What to do when there is no initial offset in Kafka or if an offset is out of range. | v1.3+ |
KAFKA_MAX_POLL_RECORDS | 100 | The maximum number of records returned in a single call to poll. | v1.3+ |
KAFKA_TIMEOUT | 1000000 | Kafka consumer polling timeout. | v1.3+ |
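
Several of the variables above appear to correspond to standard Kafka consumer properties. The sketch below shows one way such a mapping could look. The property names (`bootstrap.servers`, `group.id`, `auto.offset.reset`, `max.poll.records`) are standard Kafka consumer settings, but the mapping itself and the function `kafka_consumer_properties` are assumptions for illustration, not the documented behaviour of the Lens.

```python
import os

# Environment variable -> (standard Kafka consumer property, documented default).
# Assumption: the Lens passes these values through to the Kafka consumer.
KAFKA_ENV_TO_PROPERTY = {
    "KAFKA_BROKERS": ("bootstrap.servers", "localhost:9092"),
    "KAFKA_GROUP_ID_CONFIG": ("group.id", "consumerGroup1"),
    "KAFKA_AUTO_OFFSET_RESET_CONFIG": ("auto.offset.reset", "earliest"),
    "KAFKA_MAX_POLL_RECORDS": ("max.poll.records", "100"),
}


def kafka_consumer_properties(env=None):
    """Build a consumer property dictionary from environment variables."""
    env = os.environ if env is None else env
    return {
        prop: env.get(name, default)
        for name, (prop, default) in KAFKA_ENV_TO_PROPERTY.items()
    }
```
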
Provenance Configuration
Environment Variable | Default Value | Description | Version |
---|---|---|---|
RECORD_PROVO | true | Parameter indicating whether any provenance meta-data should be generated. When set to true, environment variable | v1.3+ |
PROV_OUTPUT_DIR_URL | | The URL of the directory you wish the generated provenance files to be output to. Can be local or remote, see here for more details. If the protocol is | v1.3+ |
PROV_S3_REGION | us-east-1 | The region in AWS where you wish to upload the generated provenance files | v1.3+ |
PROV_AWS_ACCESS_KEY | | Your access key for AWS. | v1.3+ |
PROV_AWS_SECRET_KEY | | Your secret key for AWS. | v1.3+ |
PROV_KAFKA_BROKERS | localhost:9092 | The location of your Kafka Cluster for provenance. This can be the same as, or different from, your broker for the Lens. | v1.3+ |
PROV_KAFKA_TOPIC_NAME_DLQ | prov_dead_letter_queue | The topic used for your dead letter queue provenance messages. This can be the same as, or different from, your DLQ topic for the Lens. | v1.3+ |
PROV_KAFKA_TOPIC_NAME_SUCCESS | prov_success_queue | The topic used for the messages sent containing the file URLs of the successfully generated provenance files. This can be the same as, or different from, your success queue topic for the Lens. | v1.3+ |
SWITCHED_OFF_ACTIVITIES | empty string | A comma-separated list of activity IDs that you wish to exclude from the generated provenance file. The Lens contains the following processes: | v1.3+ |
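
Because `SWITCHED_OFF_ACTIVITIES` is a comma-separated list, a consumer of this setting has to split and clean it before use. The sketch below shows one plausible way to do that. The function name is hypothetical, and the activity IDs in the usage note are placeholders rather than real Lens activity IDs.

```python
def parse_switched_off_activities(value=""):
    """Split a comma-separated list of activity IDs.

    Surrounding whitespace is stripped and empty entries are ignored, so the
    default empty string yields an empty list (nothing is switched off).
    """
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, `parse_switched_off_activities("activity-a, activity-b")` returns the two placeholder IDs as separate list items, and the default empty string returns an empty list.
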
Logging Configuration
Environment Variable | Default Value | Description | Version |
---|---|---|---|
LOGGING_LEVEL | WARN | Global log level | v1.3+ |
LOGGING_LOGGERS_DATALENS | DEBUG | Log level for Data Lens loggers | v1.3+ |
LOGGING_LOGGERS_DROPWIZARD | INFO | Log level for Dropwizard loggers | v1.3+ |
LOGGING_APPENDERS_CONSOLE_TIMEZONE | UTC | Timezone for console logging | v1.3+ |
LOGGING_APPENDERS_TXT_FILE_THRESHOLD | ALL | Threshold for text logging | v1.3+ |
Log Format (not overridable) | %-6level [%d{HH:mm:ss.SSS}] [%t] %logger{5} - %X{code} %msg %n | Pattern for logging messages | v1.3+ |
Current Log Filename (not overridable) | /var/log/datalens/text/current/application_${applicationName}_${timeStamp}.txt.log | Pattern for log file name | v1.3+ |
LOGGING_APPENDERS_TXT_FILE_ARCHIVE | true | Archive log text files | v1.3+ |
Archived Log Filename Pattern (not overridable) | /var/log/datalens/text/archive/application_${applicationName}_${timeStamp}_to_%d{yyyy-MM-dd}.txt.log | Log file rollover frequency depends on pattern in following property. For example %d{yyyy-MM-ww} declares rollover weekly | v1.3+ |
LOGGING_APPENDERS_TXT_FILE_ARCHIVED_TXT_FILE_COUNT | 7 | Max number of archived text files | v1.3+ |
LOGGING_APPENDERS_TXT_FILE_TIMEZONE | UTC | Timezone for text file logging | v1.3+ |
LOGGING_APPENDERS_JSON_FILE_THRESHOLD | ALL | Threshold for JSON logging | v1.3+ |
Log Format (not overridable) | %-6level [%d{HH:mm:ss.SSS}] [%t] %logger{5} - %X{code} %msg %n | Pattern for logging messages | v1.3+ |
Current Log Filename (not overridable) | /var/log/datalens/json/current/application_${applicationName}_${timeStamp}.json.log | Pattern for log file name | v1.3+ |
LOGGING_APPENDERS_JSON_FILE_ARCHIVE | true | Archive log JSON files | v1.3+ |
Archived Log Filename Pattern (not overridable) | /var/log/datalens/json/archive/application_${applicationName}_${timeStamp}_to_%d{yyyy-MM-dd}.json.log | Log file rollover frequency depends on pattern in following property. For example %d{yyyy-MM-ww} declares rollover weekly | v1.3+ |
LOGGING_APPENDERS_JSON_FILE_ARCHIVED_FILE_COUNT | 7 | Max number of archived JSON files | v1.3+ |
LOGGING_APPENDERS_JSON_FILE_TIMEZONE | UTC | Timezone for JSON file logging | v1.3+ |
LOGGING_APPENDERS_JSON_FILE_LAYOUT_TYPE | json | The layout type for the JSON logger | v1.3+ |
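
A quick check of the level and threshold values before starting the Lens can catch typos early. The sketch below is a hypothetical helper, not part of the Lens. It assumes the accepted values are the standard Logback level names, which is the logging backend Dropwizard uses.

```python
# Standard Logback level names (assumed to be the values the Lens accepts).
LOGBACK_LEVELS = {"OFF", "ERROR", "WARN", "INFO", "DEBUG", "TRACE", "ALL"}

# Defaults copied from the Logging Configuration table above.
LOGGING_DEFAULTS = {
    "LOGGING_LEVEL": "WARN",
    "LOGGING_LOGGERS_DATALENS": "DEBUG",
    "LOGGING_LOGGERS_DROPWIZARD": "INFO",
    "LOGGING_APPENDERS_TXT_FILE_THRESHOLD": "ALL",
    "LOGGING_APPENDERS_JSON_FILE_THRESHOLD": "ALL",
}


def resolve_log_levels(env):
    """Return the effective level for each variable, rejecting unknown names."""
    resolved = {}
    for name, default in LOGGING_DEFAULTS.items():
        level = env.get(name, default).upper()
        if level not in LOGBACK_LEVELS:
            raise ValueError(f"{name} has unsupported level {level!r}")
        resolved[name] = level
    return resolved
```
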