Loading Source Data
Guide objective
Follow this guide to learn how to configure and execute a file load.
Agile Data Engine executes loads by running SQL commands in the target database. In the case of file-based ingestion, ADE generates and orchestrates file load statements that instruct the database to load data from cloud storage.
Each file load command includes parameters that define how the database should perform the operation. These parameters typically include:
Target entity name (i.e. the table to load into)
Paths to source files in cloud storage
Authorization details required to access the files
Source file format, such as CSV, JSON, or Parquet
Target database-specific options, if any
This process assumes that source data files are already available in cloud storage and that the target database system has the necessary access to read them.
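As an illustration of how these parameters come together, the sketch below composes one possible load statement from them. It is not the exact SQL that ADE generates; the COPY INTO syntax, stage name and file paths are placeholders assuming a Snowflake-style target, and the real command depends on the target database product and configured load options.

# Illustrative sketch only: ADE generates and runs the load statement itself.
# This simply shows how the listed parameters could map to one Snowflake-style
# COPY INTO command. Entity, stage and file names are placeholders.
from typing import List

def build_file_load_sql(target_table: str, stage: str,
                        files: List[str], file_format: str) -> str:
    """Compose a file load statement from typical file load parameters."""
    file_list = ", ".join(f"'{name}'" for name in files)
    return (
        f"COPY INTO {target_table}\n"
        f"FROM @{stage}\n"          # stage/credential object grants cloud storage access
        f"FILES = ({file_list})\n"  # paths to the source files in cloud storage
        f"FILE_FORMAT = (TYPE = {file_format} SKIP_HEADER = 1 COMPRESSION = GZIP)"
    )

print(build_file_load_sql(
    target_table="staging.STG_RATECODE_DESCRIPTION_TAXIDATA",
    stage="ADE_STAGE",  # hypothetical external stage name
    files=["taxidata/ratecode_description.batch.1618916378963.csv.gz"],
    file_format="CSV",
))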
0. Prerequisites
0.1 Getting started
For a general overview of how data ingestion can be handled in Agile Data Engine, see Setting Up Data Ingestion. This guide specifically follows Option 1A: Ingesting Data from Cloud Storage with ADE and Notify API.
Also see Getting started with Notify API and Notify API reference documentation for a better understanding of the source file notification process.
0.2 Target database configuration
A target database must be configured and connected to a Runtime before you proceed with loading source data. See Target Database Setup for instructions.
0.3 Source data integration to cloud storage
Source data has to be stored in a cloud storage service supported by the target database:
AWS S3
Azure Blob Storage
Google Cloud Storage
Source data has to be in a file format supported by the target database, for example:
CSV
JSON
XML
PARQUET
AVRO
If you do not yet have source data files in cloud storage and no existing source system integration can produce them, you can extract sample files manually from a source system and upload them to cloud storage for initial file load development and testing.
1. Configuring the target database for cloud storage access
Agile Data Engine performs a file load by executing an SQL command in the target database. This command specifies which files should be loaded from a cloud storage location into a particular table. Therefore, the database must have access to the referenced cloud storage.
Cloud storage access is typically granted to the database via an identity (such as a role or service principal), which is linked to an object like a storage integration, external stage, or credential within the database system. This object is then used in the file load command and must be granted to the ADE user executing the load.
In addition, you may need to configure network-level access if your cloud storage is restricted by firewall rules or private networking.
Cloud storage access configuration depends on both the database product and the cloud provider. Refer to the sections below for environment-specific guidance, and consult your database documentation for further details.
2. Creating entities
2.1 Creating a package
Entities are created into packages. Please refer to the Designing packages guide to understand how the package structure should be designed.
Create a package for your staging entities:
1. Open the Designer.
2. Give a package name and a package description, for example:
Field | Example value |
---|---|
PACKAGE NAME | STG_TAXIDATA |
PACKAGE DESCRIPTION | #staging #taxidata |
3. Press Create package.
2.2 Creating a source entity
Source entities have the physical type METADATA_ONLY and their purpose is to describe the source data. Note that Agile Data Engine does not create METADATA_ONLY entities in the target database.
A source entity should be named after the dataset in the source system, e.g. a database table.
A source system is defined for source entities. See CONFIG_SYSTEMS on how to configure source systems.
Attributes are optional in a source entity.
Source entities can be used to import the source system data model as metadata into Agile Data Engine. See Import entities for more information.
Notify API
Source entities are mandatory for the file notification process as they are used to map the notified source files to a specific file load and target staging entity.
Source entity name and source system name are referred to later on in the file loading process when calling the Notify API (chapter 5).
Create a source entity for your source dataset:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
Field | Example value |
---|---|
ENTITY LOGICAL NAME | RATECODE_DESCRIPTION |
TYPE | SOURCE |
SOURCE SYSTEM | taxidata |
PHYSICAL TYPE | METADATA_ONLY |
SCHEMA | src |
If you have not configured a source system yet, create one in CONFIG_SYSTEMS first.
4. Press Create.
2.3 Creating a stage entity
Stage entities (staging tables) are the file load target tables into which the data is first loaded. Staging table attributes and attribute data types have to match the incoming data files. A common approach is to import staging tables into the Designer instead of creating them manually. See the entity import reference for more information on how to import entities.
To create a staging table manually, follow these instructions:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
Field | Example value |
---|---|
ENTITY LOGICAL NAME | STG_RATECODE_DESCRIPTION_TAXIDATA |
TYPE | STAGE |
PHYSICAL TYPE | TABLE |
SCHEMA | staging |
4. Press Create.
5. In the Attributes tab add attributes by pressing Create attribute and entering attribute details.
3. Creating a load schedule
A load schedule is required for running (file) loads. Loads placed into load schedules form load workflows.
Inherited schedules
Since the file load is usually the first load in a workflow, it is sufficient to add it into a schedule. Loads further down the workflow will inherit the schedule through dependencies mapped in entity mappings if their schedules are set to NONE.
An inherited schedule can be overridden by defining another load schedule for a load. This is common, for example, with publish loads which could be dependent on entities loaded in several different workflows.
See Designing Workflows to learn more about workflows.
Create a load schedule for your source data:
1. Open the CONFIG_LOAD_SCHEDULES configuration package.
2. In the CONFIGURATION tab, press Create new load schedule.
3. Enter details, for example:
Field | Example value |
---|---|
SCHEDULE NAME | TAXIDATA |
CRON EXPRESSION | 30 2 * * * |
LOAD POOL NAME | |
DAG GENERATION MODE | |
DESCRIPTION | #taxidata |
4. Press Create.
5. Commit the changes from the PACKAGE DETAILS - Commit tab by pressing Commit package, entering a commit message and pressing Commit package again.
4. Creating a file load
4.1 Creating a file load from source to stage
Create a file load from the source entity into the staging table:
1. Open the staging entity and go to the Loads tab.
2. Create a new load by pressing Create incoming load.
3. Enter details, for example:
Field | Example value |
---|---|
LOAD NAME | load_stg_ratecode_description_taxidata_from_ratecode_description |
LOAD TYPE | LOAD_FILE |
SCHEDULE | TAXIDATA |
4. Press Create.
5. Create an entity mapping by pressing Create entity mapping.
6. Find the source entity, e.g. RATECODE_DESCRIPTION, and press Select entity.
7. If you have created or imported attributes into your source entity, map them to the stage entity. If not, go to the next step.
8. Go back to the load view by clicking the name of the load, e.g. load_stg_ratecode_description_taxidata_from_ratecode_description.
The next steps depend on your target database product and the choices made in chapter 1 Configuring the target database for cloud storage access.
4.2 Customizing a file load
There are several load options that can be used with file loads. Refer to the reference documentation for more details and examples. Note that load options are often database product-specific.
The OVERRIDE_FILE_LOAD load step type can be used to fully customize a file load, i.e. to manually define the load SQL.
If using Snowflake, see also Customizing File Loads in Snowflake.
5. Queuing files with Notify API
Notify API is used for queuing incoming source data files to be loaded into the target database in Agile Data Engine. Notify API is called from an external service, e.g. an integration tool, when source files are in cloud storage ready to be loaded into the data warehouse. In addition, ADE provides a Swagger UI that enables manual API calls for development and testing purposes.
Follow the steps below to queue incoming files. It is recommended to first test the process manually in the UI. Once the process is tested and working, it is easier to adapt it to an external service that will run the API requests. See the reference guide for further information and example solutions.
5.1 Authentication
All requests must be authenticated. When calling the APIs from an external service, authentication is done with an API key and secret provided by ADE support. Additionally, the public IP address of the service executing the requests must first be allowed; request this through the support portal.
To authenticate in the Swagger UI:
1. In ADE, navigate to the Runtime menu, select the desired environment and click Notify API Documentation.
2. Authorize the requests by pressing Authorize, then again Authorize under OAuth2.
Response: 401 Unauthorized
Error 401 when executing a request in the UI usually indicates that the authorization has expired. Go back to Authorize, Logout and Authorize again.
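The Swagger UI handles these credentials for you. When an external service calls the API, the credentials must be sent with every request. Below is a minimal sketch assuming the Python requests library, a placeholder base URL and assumed header names; verify the exact authentication scheme against the Notify API reference and the credentials provided by ADE support.

import requests

# Placeholders: the real base URL and credentials are environment-specific and
# provided by ADE support. The header names used for the API key and secret are
# assumptions here; check the Notify API reference for the exact scheme.
NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"
API_KEY = "<api key from ADE support>"
API_KEY_SECRET = "<api key secret from ADE support>"

session = requests.Session()
session.headers.update({
    "X-API-KEY": API_KEY,                # assumed header name
    "X-API-KEY-SECRET": API_KEY_SECRET,  # assumed header name
    "Content-Type": "application/json",
})
# Reuse this session for the manifest requests shown in the following sections.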
5.2 Create a manifest
Manifests give Agile Data Engine information about the files to be loaded, such as file format details and a list of files as manifest entries.
To create a manifest in the UI:
1. Use Create manifest, press Try it out.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
Request body:
{
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see Manifest.
3. Press Execute.
4. Response code 201 indicates a successful request. Copy the manifest id from the response body and keep it for the next steps.
Example response:
{
"id": "c026bb41-e4c4-48bf-9ac6-06f3f2e44129",
"state": "OPEN",
"columns": [],
"batch": 1618915514245,
"fullscanned": true,
"delim": "COMMA",
"skiph": 1,
"compression": "GZIP",
"format": "CSV",
"created": "2021-04-20T10:45:14.245846",
"modified": "2021-04-20T10:45:14.268148"
}
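The same request can also be made programmatically. The sketch below uses the Python requests library; the base URL and path template are assumptions derived from the parameters above, so check the exact endpoint in the Notify API reference or the Swagger UI.

import requests

# Placeholder values; see chapter 5.1 for the authentication headers.
NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"
headers = {"Content-Type": "application/json"}  # add the authentication headers here

# Assumed path template built from the documented parameters; verify it in the
# Notify API reference / Swagger UI.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description/manifests")

manifest = {
    "compression": "GZIP",
    "delim": "COMMA",
    "format": "CSV",
    "fullscanned": True,
    "skiph": 1,
}

response = requests.post(NOTIFY_API_BASE + path, json=manifest, headers=headers)
response.raise_for_status()          # expect 201 Created
manifest_id = response.json()["id"]  # needed for the entry and notify requests
print(manifest_id)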
5.3 Create manifest entries
Manifest entries list the files to be loaded. Entries can be created either all at once (PUT) or one by one (POST); PUT is recommended since it is faster. You may list just one file or as many files per manifest as the target database can handle in one file load command.
Batch IDs and the Run ID Logic
Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.
Creating manifest entries (PUT) in the UI:
1. Create manifest entries with Create multiple manifest entries.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
id | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Request body:
[
{
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see ManifestEntry.
3. Press Execute.
4. Response code 201 indicates a successful request.
Example response:
[
{
"id": "d46c6bfd-fbfd-46f4-a8b8-44084006193f",
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"id": "d846a814-f7d6-4dcc-b04e-01ffbc075295",
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
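A programmatic version of the same PUT request could look roughly as follows; as before, the base URL and path template are assumptions to be verified against the Notify API reference or the Swagger UI.

import requests

NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"  # placeholder
headers = {"Content-Type": "application/json"}                 # add authentication headers (chapter 5.1)
manifest_id = "c026bb41-e4c4-48bf-9ac6-06f3f2e44129"           # from Create manifest

# Assumed path template; verify the exact endpoint in the Notify API reference.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description"
        f"/manifests/{manifest_id}/entries")

entries = [
    {
        "batch": 1618916378963,
        "sourceFile": "azure://example.blob.core.windows.net/example-container/"
                      "taxidata/ratecode_description.batch.1618916378963.csv.gz",
    },
    {
        "batch": 1618916457230,
        "sourceFile": "azure://example.blob.core.windows.net/example-container/"
                      "taxidata/ratecode_description.batch.1618916457230.csv.gz",
    },
]

# PUT creates all entries in one request, which is the recommended approach.
response = requests.put(NOTIFY_API_BASE + path, json=entries, headers=headers)
response.raise_for_status()  # expect 201 Created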
5.4 Notify
Notify changes the manifest state from OPEN to NOTIFIED. After that the manifest is sent for processing and can no longer be changed.
Notifying a manifest in the UI:
1. Use Notify ADE by manifest to be ready for processing, press Try it out.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
id | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
3. Press Execute.
4. Response code 200 indicates a successful request.
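A programmatic notify call could look roughly like this; the path and HTTP method are assumptions, so verify them in the Notify API reference or the Swagger UI.

import requests

NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"  # placeholder
headers = {"Content-Type": "application/json"}                 # add authentication headers (chapter 5.1)
manifest_id = "c026bb41-e4c4-48bf-9ac6-06f3f2e44129"           # from Create manifest

# Assumed path and method; verify them in the Notify API reference / Swagger UI.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description"
        f"/manifests/{manifest_id}/notify")

response = requests.post(NOTIFY_API_BASE + path, headers=headers)
response.raise_for_status()  # expect 200 OK; the manifest state changes to NOTIFIED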
The manifest is now queued and you are ready to test the file load; continue to chapter 6 Executing a file load.
6. Executing a file load
If you have not done so already, commit the CONFIG_LOAD_SCHEDULES configuration package and your staging package to deploy the changes into a Runtime environment.
Execute the file load:
1. Navigate to Workflow Orchestration of the desired environment from the Runtime menu.
2. Find your workflow, e.g. TAXIDATA, and enable it from the toggle switch.
3. Open the workflow by pressing its name.
4. Trigger the workflow by pressing Trigger DAG.
5. Wait for the load to finish and check the logs in the Workflow Orchestration UI.
See Triggering a Workflow for further information.