
Loading Source Data

Guide objective

Follow this guide to learn how to configure and execute a source data load.

Agile Data Engine executes loads by running SQL commands in the target database. The parameters of a file load command instruct the target database management system on how to execute the load. These parameters include information such as:

  • Target entity name

  • Paths to source files

  • Authorization to access source files

  • Source file format

  • Various target database specific load options

The premise is that source data files are stored in cloud storage and the target database management system has access to them.


0. Prerequisites

0.1 Target database configuration

A target database has to be configured and a Runtime connected to it before proceeding with loading source data. See Configuring Target Databases for instructions.

0.2 Source data integration to cloud storage

Source data has to be stored in a cloud storage service supported by the target database:

  • AWS S3

  • Azure Blob Storage

  • Azure Data Lake Storage

  • Google Cloud Storage

Source data has to be in a file format supported by the target database, for example:

  • CSV

  • JSON

  • XML

  • PARQUET

  • AVRO

If you do not have any source data files in your cloud storage yet and there are no existing source system integrations that could produce such files, you can also extract sample files manually from a source system and upload them into cloud storage for initial file load development and testing.


1. Configuring source data access

1.1 Configuring the target database for cloud storage access

Agile Data Engine executes a file load by running an SQL command in the target database that specifies which files are to be loaded from a given cloud storage location into a given table. The database therefore needs access to that cloud storage. Access is usually configured for an identity that is referenced by an object in the database management system, and the use of this object is then granted to the user running file loads. Additionally, you might have to configure network settings if network access to the cloud storage is restricted.

Cloud storage access configuration is specific to the database product and the cloud storage service; find the details for your environment below and refer to the database product documentation.

Private Edition:
Note that with the Private Edition you can alternatively use the default storage, which does not require configuration. A default storage bucket/container is created per Runtime environment during the Agile Data Engine installation process. See chapter 1.2 Using Agile Data Engine default storage.

Snowflake

Create a storage integration and test connectivity between Snowflake and the cloud storage bucket/container according to Snowflake instructions.

In Snowflake, file loads are executed as COPY commands. You may use storage integrations (preferred) and/or external stages with Agile Data Engine file loads to configure access to source files in cloud storage. An external stage might be necessary if you need to transform data during a file load with an OVERRIDE_FILE_LOAD step (see also chapter 4.2 Customizing a file load).
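For illustration, a minimal sketch of a storage integration and of a COPY command that uses it; the integration name, container URL, role, table and file format options below are placeholders, not values generated by Agile Data Engine:

SQL
-- One-time setup: storage integration pointing at the source data container
CREATE STORAGE INTEGRATION taxidata_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://example.blob.core.windows.net/example-container/');

-- Allow the role used for file loads to reference the integration
GRANT USAGE ON INTEGRATION taxidata_integration TO ROLE ade_load_role;

-- A file load then runs as a COPY command that can reference the integration
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'azure://example.blob.core.windows.net/example-container/taxidata/'
STORAGE_INTEGRATION = taxidata_integration
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 COMPRESSION = 'GZIP');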

If the cloud storage is network restricted or you are using Private Link, refer to Snowflake documentation.

Amazon Redshift

Create an IAM role, associate it with the Redshift cluster and grant it access to the source data bucket. Test connectivity between Redshift and the bucket using the role. In the SaaS Edition, the Agile Data Engine support team will also grant the role permissions to an S3 bucket that will contain manifest files.

File loads are executed in Amazon Redshift as COPY commands. See AWS documentation for details about authorization parameters and access permissions.
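As an illustration, a Redshift COPY command authorized with an IAM role can look roughly like the following; the bucket, file, role ARN and table names are placeholders, and the actual command is generated by Agile Data Engine:

SQL
-- Illustrative COPY from S3 using the IAM role associated with the cluster
COPY staging.stg_ratecode_description_taxidata
FROM 's3://example-bucket/taxidata/ratecode_description.batch.1618916378963.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
FORMAT AS CSV
IGNOREHEADER 1
GZIP;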

See AWS documentation on controlling access from VPC endpoints with bucket policies if network restrictions are required.

Azure SQL Database

Create an external data source and test connectivity between the SQL Database and the storage container.

File loads are executed in Azure SQL Database as BULK INSERT commands. Use external data sources to configure access to source files in Azure Storage.
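As an illustration, an external data source and a BULK INSERT that references it might look roughly like this; the credential, container URL, file path and table names are placeholders:

SQL
-- One-time setup: credential and external data source for the storage container
CREATE DATABASE SCOPED CREDENTIAL example_storage_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

CREATE EXTERNAL DATA SOURCE example_data_source
WITH (TYPE = BLOB_STORAGE,
      LOCATION = 'https://example.blob.core.windows.net/example-container',
      CREDENTIAL = example_storage_credential);

-- A file load then runs as a BULK INSERT that can reference the external data source
BULK INSERT staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'taxidata/ratecode_description.batch.1618916378963.csv'
WITH (DATA_SOURCE = 'example_data_source',
      FORMAT = 'CSV',
      FIRSTROW = 2);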

See Configure Azure Storage firewalls and virtual networks for instructions on Azure Storage network settings.

Azure Synapse SQL

Configure authentication with a Managed Identity and test connectivity between Synapse and the storage container.

File loads are executed in Azure Synapse SQL as COPY commands. Configure authentication using a Managed Identity according to the Microsoft documentation to access source files in Azure Storage.
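For illustration, a Synapse COPY command authenticating with a Managed Identity can look roughly like this; the storage URL, table name and format options are placeholders:

SQL
-- Illustrative COPY INTO using a Managed Identity credential
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'https://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    COMPRESSION = 'GZIP',
    FIRSTROW = 2
);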

Grant access to trusted Azure services in the storage account network settings.

1.2 Using Agile Data Engine default storage

Edition: Private

Agile Data Engine Private Edition installation creates a default cloud storage bucket/container for incoming data per Runtime environment. Optionally, this can be used without extra configuration by uploading source data files there. Agile Data Engine will include an identity and a key/token in the data load commands so that the target database can access files in the default storage.
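The exact command depends on the target database product; purely as an illustration, in a Snowflake Runtime such a command could resemble the following, with all values below being placeholders:

SQL
-- Illustrative only: a load command with an embedded key/token for the default storage
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'azure://example.blob.core.windows.net/example-file-load-container/ratecode_description/'
CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=2020-08-04&ss=b&sig=<redacted>')
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 COMPRESSION = 'GZIP');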


2. Creating entities

2.1 Creating a package

Entities are created in packages. Please refer to the Designing packages guide to understand how the package structure should be designed.

Create a package for your staging entities:

  1. Open the Designer.

2. Give a package name and a package description, for example:

FIELD                 EXAMPLE VALUE
PACKAGE NAME:         STG_TAXIDATA
PACKAGE DESCRIPTION:  #staging #taxidata

3. Press Create package.

2.2 Creating a source entity

Source entities are of type METADATA_ONLY and their purpose is to describe the source data. Note that Agile Data Engine does not create METADATA_ONLY entities into the target database.

  • A source entity should be named after the dataset in the source system, e.g. a database table.

  • A source system is defined for source entities. See CONFIG_SYSTEMS on how to configure source systems.

  • Attributes are optional in a source entity.

  • Source entities can be used to import the source system data model as metadata into Agile Data Engine. See the entity import reference for more information on how to import entities.

SaaS Edition:

  • Source entities are mandatory.

  • Source entity name and source system name are referred to later on in the file loading process when calling the Notify API (chapter 5).

Private Edition:

  • Source entities are optional yet recommended.

  • Source entity name is referred to later on in the file loading process when naming the files (chapter 5).

  • If source entities are not defined, filenames have to correspond to staging table names.

Create a source entity for your source dataset:

  1. Open the staging package.

  2. In the Entities tab, press Create new entity.

  3. Enter details, for example:

FIELD                 EXAMPLE VALUE
ENTITY LOGICAL NAME:  RATECODE_DESCRIPTION
TYPE:                 SOURCE
SOURCE SYSTEM:        taxidata
PHYSICAL TYPE:        METADATA_ONLY
SCHEMA:               src

Note that if you have not configured a source system yet, create one in CONFIG_SYSTEMS.

4. Press Create.

2.3 Creating a stage entity

Stage entities (staging tables) are the target tables of file loads, where the data is loaded first. Staging table attributes and attribute data types have to match the incoming data files. A common approach is to import staging tables into the Designer instead of creating them manually. See the entity import reference for more information on how to import entities.
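To illustrate how the staging table relates to the source file, a rough sketch of a matching table definition follows; the attribute names and data types are hypothetical, and the actual table is created by Agile Data Engine from the entity definition:

SQL
-- Hypothetical staging table matching a two-column, gzipped CSV source file with a header row
CREATE TABLE staging.STG_RATECODE_DESCRIPTION_TAXIDATA (
    ratecode_id   INTEGER,
    description   VARCHAR(255)
);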

To create a staging table manually, follow these instructions:

  1. Open the staging package.

  2. In the Entities tab press Create new entity.

  3. Enter details, for example:

FIELD                 EXAMPLE VALUE
ENTITY LOGICAL NAME:  STG_RATECODE_DESCRIPTION_TAXIDATA
TYPE:                 STAGE
PHYSICAL TYPE:        TABLE
SCHEMA:               staging

4. Press Create.

5. In the Attributes tab add attributes by pressing Create attribute and entering attribute details.


3. Creating a load schedule

A load schedule is required for running (file) loads. Loads placed into load schedules form load workflows.

Inherited schedules

Since the file load is usually the first load in a workflow, it is sufficient to add it into a schedule. Loads further down the workflow will inherit the schedule through dependencies mapped in entity mappings if their schedules are set to NONE.

An inherited schedule can be overridden by defining another load schedule for a load. This is common, for example, with publish loads which could be dependent on entities loaded in several different workflows.

See Designing Workflows to learn more about workflows.

Create a load schedule for your source data:

  1. Open CONFIG_LOAD_SCHEDULES configuration package.

  2. In the CONFIGURATION tab, press Create new load schedule.

  3. Enter details, for example:

FIELD                 EXAMPLE VALUE
SCHEDULE NAME:        TAXIDATA
CRON EXPRESSION:      30 2 * * *
LOAD POOL NAME:
DAG GENERATION MODE:
DESCRIPTION:          #taxidata

4. Press Create.

5. Commit the changes from the PACKAGE DETAILS - Commit tab by pressing Commit package, entering a commit message and pressing Commit package again.


4. Creating a file load

4.1 Creating a file load from source to stage

Create a file load from the source entity into the staging table:

  1. Open the staging entity and go to the Loads tab.

  2. Create a new load by pressing Create incoming load.

  3. Enter details, for example:

FIELD             EXAMPLE VALUE
LOAD NAME:        load_stg_ratecode_description_taxidata_from_ratecode_description
LOAD TYPE:        LOAD_FILE
SCHEDULE:         TAXIDATA
CRON EXPRESSION:
DESCRIPTION:

4. Press Create.

5. Create an entity mapping by pressing Create entity mapping.

6. Find the source entity, e.g. RATECODE_DESCRIPTION, and press Select entity.

7. If you have created or imported attributes into your source entity, map them to the stage entity. If not, go to the next step.

8. Go back to the load view by clicking the name of the load, e.g. load_stg_ratecode_description_taxidata_from_ratecode_description.

The next steps depend on your target database product and choices made in chapter 1 Configuring source data access.

Private Edition & Default storage

If you are using the Private Edition and your source data is in the Agile Data Engine default storage, you can go directly to chapter 5 Queuing incoming files.

Snowflake

If you are using a storage integration:

9. Add a load option by pressing Add load option.

10. Select OPT_STORAGE_INTEGRATION and enter the name of your storage integration.

11. Press Add.

Amazon Redshift

The IAM role with access to source data is defined in the Agile Data Engine configuration and is placed in COPY commands automatically. In the SaaS Edition, this configuration is done by the SaaS support team. No additional steps are required.

Azure SQL Database

If you are using an external data source:

  1. Add a load option by pressing Add load option.

  2. Select OPT_STORAGE_INTEGRATION and enter the name of your external data source.

  3. Press Add.

Azure Synapse SQL

In Azure Synapse source data is accessed with a Managed Identity (see chapter 1 Configuring source data access). No additional steps are required.

4.2 Customizing a file load

There are several load options that can be used with file loads. Please refer to the reference guide for more details and examples. Note that load options are often specific to the target database product.

The OVERRIDE_FILE_LOAD load step type can be used to fully customize a file load, i.e. to manually define the load SQL.
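For instance, on Snowflake an overridden load could read from an external stage and transform the data during the COPY. A rough sketch, assuming a hypothetical external stage and illustrative table and column names (not SQL generated by Agile Data Engine):

SQL
-- Illustrative custom load SQL: COPY from an external stage with a simple transformation
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA (ratecode_id, description, stg_source_file)
FROM (
    SELECT $1, UPPER($2), METADATA$FILENAME
    FROM @staging.example_external_stage/taxidata/
)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 COMPRESSION = 'GZIP')
PATTERN = '.*ratecode_description.*[.]csv[.]gz';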


5. Queuing incoming files

The method for queuing incoming files depends on your Agile Data Engine edition:

  • SaaS: Use the Notify API

  • Private: Queue manifests (recommended) or single files

5.1 Queuing files with Notify API

Edition: SaaS

Notify API is used for queuing incoming source data files to be loaded into the target database in Agile Data Engine SaaS Edition. Notify API has to be called from an external service, e.g. an integration tool, when source files are in cloud storage ready to be loaded into the data warehouse. In addition, Agile Data Engine provides a Swagger UI that enables manual API calls for development and testing purposes.

Follow the steps below to queue incoming files. It is recommended to first test the process manually in the UI. Once the process is tested and working, it is easier to adapt it to an external service that will run the API requests. See the reference guide for further information and example solutions.

Authentication

All requests have to be authenticated. When calling the APIs from an external service, authentication is done with an API key and secret which are provided by the Agile Data Engine support team. Also, the public IP address of the service executing the requests has to be allowed by the support team.

To authenticate in the UI:

  1. Navigate to the UI from the RUNTIMES menu. Select the desired Runtime environment and press Notify API.

  2. Authorize the requests by pressing Authorize, then again Authorize under OAuth2.

Response: 401 Unauthorized

Error 401 when executing a request in the UI usually indicates that the authorization has expired. Go back to Authorize, press Logout and then Authorize again.

Create a manifest

Manifests give Agile Data Engine information about the files to be loaded such as file format details and a list of files as manifest entries.

To create a manifest in the UI:

  1. Use Create manifest, press Try it out.

  2. Enter details, for example:

FIELD              EXAMPLE VALUE
tenantName:        local
installationName:  local
environmentName:   local
sourceSystemName:  taxidata
sourceEntityName:  ratecode_description

Request body:

JSON
{
  "compression": "GZIP",
  "delim": "COMMA",
  "format": "CSV",
  "fullscanned": true,
  "skiph": 1
}

Notes

tenantName, installationName and environmentName are always local.

sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.

See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see Manifest.

3. Press Execute.

4. Response code 201 indicates a successful request. Copy the manifest id from the response body and keep it for the next steps.

Example response:

JSON
{
  "id": "c026bb41-e4c4-48bf-9ac6-06f3f2e44129",
  "state": "OPEN",
  "columns": [],
  "batch": 1618915514245,
  "fullscanned": true,
  "delim": "COMMA",
  "skiph": 1,
  "compression": "GZIP",
  "format": "CSV",
  "created": "2021-04-20T10:45:14.245846",
  "modified": "2021-04-20T10:45:14.268148"
}

Create manifest entries

Manifest entries list the files to be loaded. Entries can be created all at once (PUT) or one by one (POST); creating them all at once (PUT) is recommended since it is faster. You may list just one file or as many files per manifest as the target database can handle in one file load command.

Batch IDs and the Run ID Logic

Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.

Creating manifest entries (PUT) in the UI:

  1. Create manifest entries with Create multiple manifest entries.

  2. Enter details, for example:

FIELD              EXAMPLE VALUE
tenantName:        local
installationName:  local
environmentName:   local
sourceSystemName:  taxidata
sourceEntityName:  ratecode_description
id:                c026bb41-e4c4-48bf-9ac6-06f3f2e44129

Request body:

JSON
[
  {
    "batch": 1618916378963,
    "sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
  },
  {
    "batch": 1618916457230,
    "sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
  }
]

Notes

tenantName, installationName and environmentName are always local.

sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.

id is the manifest ID received in the response of Create manifest.

See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see ManifestEntry.

3. Press Execute.

4. Response code 201 indicates a successful request.

Example response:

JSON
[
  {
    "id": "d46c6bfd-fbfd-46f4-a8b8-44084006193f",
    "batch": 1618916378963,
    "sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
  },
  {
    "id": "d846a814-f7d6-4dcc-b04e-01ffbc075295",
    "batch": 1618916457230,
    "sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
  }
]

Notify

Notify changes the manifest state from OPEN to NOTIFIED. After that, the manifest is sent to processing and can no longer be changed.

Notifying a manifest in the UI:

  1. Use Notify ADE by manifest to be ready for processing, press Try it out.

  2. Enter details, for example:

FIELD              EXAMPLE VALUE
tenantName:        local
installationName:  local
environmentName:   local
sourceSystemName:  taxidata
sourceEntityName:  ratecode_description
id:                c026bb41-e4c4-48bf-9ac6-06f3f2e44129

Notes

tenantName, installationName and environmentName are always local.

sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.

id is the manifest ID received in the response of Create manifest.

3. Press Execute.

4. Response code 200 indicates a successful request.

The manifest is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.

5.2 Queuing single files

Edition: Private

This method can be used only in the Private Edition and only with the Agile Data Engine default storage for incoming data (unless using OPT_DATA_FILE_LOCATION). Chapter 5.3 Queuing manifests offers more flexibility and is therefore the recommended method for the Private Edition. However, queuing single files is quick and easy, so it is often a good starting point when developing and testing file loads. See the reference guide for single file loads for further information.

1. Upload a file named according to the naming convention to the file load bucket/container. Information about the file format etc. is given in the filename. For example:

azure://example.blob.core.windows.net/example-file-load-container/ratecode_description/table.ratecode_description.1618916378963.batch.1618916378963.fullscanned.true.delim.comma.skiph.1.csv.gz

2. In Private AWS, a notification file is automatically created in the notify bucket. In Private Azure, create a notification file in the notify container. The notification file is named like the data file, but its content is empty. For example:

azure://example.blob.core.windows.net/example-notify-container/ratecode_description/table.ratecode_description.1618916378963.batch.1618916378963.fullscanned.true.delim.comma.skiph.1.csv.gz

The file is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.

5.3 Queuing manifests

Edition: Private

Queuing incoming data files with manifests is the recommended method for Agile Data Engine Private Edition. Manifests can be used both with the default storage and with any other cloud storage bucket/container that the target database can access (see chapter 1 Configuring source data access).

A manifest is a JSON file that gives Agile Data Engine information about the files to be loaded such as file format details and a list of files as entries. See the reference guide for manifest loads for further information.

  1. Upload data files to cloud storage. For example:

azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz
azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz

2. Create a manifest according to the guidelines, for example:

JSON
{
    "entries": [
        {
            "mandatory": true,
            "batch": 1618916378963,
            "url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
        },
        {
            "mandatory": true,
            "batch": 1618916457230,
            "url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
        }
    ],
    "compression": "GZIP",
    "delim": "COMMA",
    "format": "CSV",
    "fullscanned": true,
    "skiph": 1
}

3. Upload the manifest to the file load bucket/container (default storage), for example:

azure://example.blob.core.windows.net/example-file-load-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json

4. In Private AWS, a notification file is automatically created in the notify bucket. In Private Azure, create a notification file in the notify container. The notification file is named like the manifest, but its content is empty. For example:

azure://example.blob.core.windows.net/example-notify-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json

The manifest is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.

Batch IDs and the Run ID Logic

Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.


6. Executing a file load

If you have not done so already, commit the CONFIG_LOAD_SCHEDULES configuration package and your staging package to deploy the changes into a Runtime environment.

Execute the file load:

  1. Navigate to Workflow Orchestration from the RUNTIMES menu by selecting the desired Runtime environment and pressing Workflow orchestration.

  2. Find your workflow, e.g. TAXIDATA, and enable it from the toggle switch.

  3. Open the workflow by pressing its name.

  4. Trigger the workflow by pressing Trigger DAG.

  5. Wait for the load to finish and check the logs from the Workflow Orchestration UI.

See Triggering a Workflow for further information.
