Loading Source Data
Guide objective
Follow this guide to learn how to configure and execute a file load.
Agile Data Engine executes loads by running SQL commands in the target database. In the case of file-based ingestion, ADE generates and orchestrates file load statements that instruct the database to load data from cloud storage.
Each file load command includes parameters that define how the database should perform the operation. These parameters typically include:
Target entity name (i.e. the table to load into)
Paths to source files in cloud storage
Authorization details required to access the files
Source file format, such as CSV, JSON, or Parquet
Target database-specific options, if any
This process assumes that source data files are already available in cloud storage and that the target database system has the necessary access to read them.
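As an illustration of how these parameters come together, the sketch below composes one possible load statement from them. It is not the exact SQL that ADE generates; the COPY INTO syntax, stage name and file paths are placeholders assuming a Snowflake-style target, and the real command depends on the target database product and configured load options.

# Illustrative sketch only: ADE generates and runs the load statement itself.
# This simply shows how the listed parameters could map to one Snowflake-style
# COPY INTO command. Entity, stage and file names are placeholders.
from typing import List

def build_file_load_sql(target_table: str, stage: str,
                        files: List[str], file_format: str) -> str:
    """Compose a file load statement from typical file load parameters."""
    file_list = ", ".join(f"'{name}'" for name in files)
    return (
        f"COPY INTO {target_table}\n"
        f"FROM @{stage}\n"          # stage/credential object grants cloud storage access
        f"FILES = ({file_list})\n"  # paths to the source files in cloud storage
        f"FILE_FORMAT = (TYPE = {file_format} SKIP_HEADER = 1 COMPRESSION = GZIP)"
    )

print(build_file_load_sql(
    target_table="staging.STG_RATECODE_DESCRIPTION_TAXIDATA",
    stage="ADE_STAGE",  # hypothetical external stage name
    files=["taxidata/ratecode_description.batch.1618916378963.csv.gz"],
    file_format="CSV",
))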
0. Prerequisites
0.1 Getting started
For a general overview of how data ingestion can be handled in Agile Data Engine, see Setting Up Data Ingestion. This guide specifically follows Option 1A: Ingesting Data from Cloud Storage with ADE and Notify API.
Also see Getting started with Notify API and Notify API reference documentation for a better understanding of the source file notification process.
0.2 Target database configuration
A target database must be configured and connected to a Runtime before you proceed with loading source data. See Target Database Setup for instructions.
0.3 Source data integration to cloud storage
Source data has to be stored in a cloud storage service supported by the target database:
AWS S3
Azure Blob Storage
Google Cloud Storage
Source data has to be in a file format supported by the target database, for example:
CSV
JSON
XML
PARQUET
AVRO
If you do not yet have source data files in cloud storage and no existing source system integration can produce them, you can extract sample files manually from a source system and upload them to cloud storage for initial file load development and testing.
1. Configuring the target database for cloud storage access
Agile Data Engine performs a file load by executing an SQL command in the target database. This command specifies which files should be loaded from a cloud storage location into a particular table. Therefore, the database must have access to the referenced cloud storage.
Cloud storage access is typically granted to the database via an identity (such as a role or service principal), which is linked to an object like a storage integration, external stage, or credential within the database system. This object is then used in the file load command and must be granted to the ADE user executing the load.
In addition, you may need to configure network-level access if your cloud storage is restricted by firewall rules or private networking.
Cloud storage access configuration depends on both the database product and the cloud provider. Refer to the sections below for environment-specific guidance, and consult your database documentation for further details.
2. Creating entities
2.1 Creating a package
Entities are created into packages. Please refer to the Designing packages guide to understand how the package structure should be designed.
Create a package for your staging entities:
1. Open the Designer.
2. Give a package name and a package description, for example:
Field | Example value |
---|---|
PACKAGE NAME | STG_TAXIDATA |
PACKAGE DESCRIPTION | #staging #taxidata |
3. Press Create package.
2.2 Creating a source entity
Source entities have the physical type METADATA_ONLY and their purpose is to describe the source data. Note that Agile Data Engine does not create METADATA_ONLY entities in the target database.
A source entity should be named after the dataset in the source system, e.g. a database table.
A source system is defined for source entities. See CONFIG_SYSTEMS on how to configure source systems.
Attributes are optional in a source entity.
Source entities can be used to import the source system data model as metadata into Agile Data Engine. See Import entities for more information.
Notify API
Source entities are mandatory for the file notification process as they are used to map the notified source files to a specific file load and target staging entity.
Source entity name and source system name are referred to later on in the file loading process when calling the Notify API (chapter 5).
Create a source entity for your source dataset:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
Field | Example value |
---|---|
ENTITY LOGICAL NAME | RATECODE_DESCRIPTION |
TYPE | SOURCE |
SOURCE SYSTEM | taxidata |
PHYSICAL TYPE | METADATA_ONLY |
SCHEMA | src |
If you have not configured a source system yet, create one in CONFIG_SYSTEMS first.
4. Press Create.
2.3 Creating a stage entity
Stage entities (staging tables) are the file load target tables into which the data is first loaded. Staging table attributes and attribute data types have to match the incoming data files. A common approach is to import staging tables into the Designer instead of creating them manually. See the entity import reference for more information on how to import entities.
To create a staging table manually, follow these instructions:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
Field | Example value |
---|---|
ENTITY LOGICAL NAME | STG_RATECODE_DESCRIPTION_TAXIDATA |
TYPE | STAGE |
PHYSICAL TYPE | TABLE |
SCHEMA | staging |
4. Press Create.
5. In the Attributes tab add attributes by pressing Create attribute and entering attribute details.
3. Creating a load schedule
A load schedule is required for running (file) loads. Loads placed into load schedules form load workflows.
Inherited schedules
Since the file load is usually the first load in a workflow, it is sufficient to add it into a schedule. Loads further down the workflow will inherit the schedule through dependencies mapped in entity mappings if their schedules are set to NONE.
An inherited schedule can be overridden by defining another load schedule for a load. This is common, for example, with publish loads which could be dependent on entities loaded in several different workflows.
See Designing Workflows to learn more about workflows.
Create a load schedule for your source data:
1. Open the CONFIG_LOAD_SCHEDULES configuration package.
2. In the CONFIGURATION tab, press Create new load schedule.
3. Enter details, for example:
Field | Example value |
---|---|
SCHEDULE NAME | TAXIDATA |
CRON EXPRESSION | 30 2 * * * |
LOAD POOL NAME | |
DAG GENERATION MODE | |
DESCRIPTION | #taxidata |
4. Press Create.
5. Commit the changes from the PACKAGE DETAILS - Commit tab by pressing Commit package, entering a commit message and pressing Commit package again.
4. Creating a file load
4.1 Creating a file load from source to stage
Create a file load from the source entity into the staging table:
1. Open the staging entity and go to the Loads tab.
2. Create a new load by pressing Create incoming load.
3. Enter details, for example:
Field | Example value |
---|---|
LOAD NAME | load_stg_ratecode_description_taxidata_from_ratecode_description |
LOAD TYPE | LOAD_FILE |
SCHEDULE | TAXIDATA |
4. Press Create.
5. Create an entity mapping by pressing Create entity mapping.
6. Find the source entity, e.g. RATECODE_DESCRIPTION, and press Select entity.
7. If you have created or imported attributes into your source entity, map them to the stage entity. If not, go to the next step.
8. Go back to the load view by clicking the name of the load, e.g. load_stg_ratecode_description_taxidata_from_ratecode_description.
The next steps depend on your target database product and the choices made in chapter 1 Configuring the target database for cloud storage access.
4.2 Customizing a file load
There are several load options that can be used with file loads. Refer to the reference documentation for more details and examples. Note that load options are often database product-specific.
The OVERRIDE_FILE_LOAD load step type can be used to fully customize a file load, i.e. to manually define the load SQL.
If using Snowflake, see also Customizing File Loads in Snowflake.
5. Queuing files with Notify API
Notify API is used for queuing incoming source data files to be loaded into the target database in Agile Data Engine. Notify API is called from an external service, e.g. an integration tool, when source files are in cloud storage ready to be loaded into the data warehouse. In addition, ADE provides a Swagger UI that enables manual API calls for development and testing purposes.
Follow the steps below to queue incoming files. It is recommended to first test the process manually in the UI. Once the process is tested and working, it is easier to adapt it to an external service that will run the API requests. See the reference guide for further information and example solutions.
5.1 Authentication
All requests must be authenticated. When calling the APIs from an external service, authentication is done with an API key and secret provided by ADE support. Additionally, the public IP address of the service executing the requests must first be allowed; request this through the support portal.
To authenticate in the Swagger UI:
1. In ADE, navigate to the Runtime menu, select the desired environment and click Notify API Documentation.
2. Authorize the requests by pressing Authorize, then again Authorize under OAuth2.
Response: 401 Unauthorized
Error 401 when executing a request in the UI usually indicates that the authorization has expired. Go back to Authorize, Logout and Authorize again.
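The Swagger UI handles these credentials for you. When an external service calls the API, the credentials must be sent with every request. Below is a minimal sketch assuming the Python requests library, a placeholder base URL and assumed header names; verify the exact authentication scheme against the Notify API reference and the credentials provided by ADE support.

import requests

# Placeholders: the real base URL and credentials are environment-specific and
# provided by ADE support. The header names used for the API key and secret are
# assumptions here; check the Notify API reference for the exact scheme.
NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"
API_KEY = "<api key from ADE support>"
API_KEY_SECRET = "<api key secret from ADE support>"

session = requests.Session()
session.headers.update({
    "X-API-KEY": API_KEY,                # assumed header name
    "X-API-KEY-SECRET": API_KEY_SECRET,  # assumed header name
    "Content-Type": "application/json",
})
# Reuse this session for the manifest requests shown in the following sections.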
5.2 Create a manifest
Manifests give Agile Data Engine information about the files to be loaded, such as file format details and a list of files as manifest entries.
To create a manifest in the UI:
1. Use Create manifest, press Try it out.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
Request body:
{
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see Manifest.
3. Press Execute.
4. Response code 201 indicates a successful request. Copy the manifest id from the response body and keep it for the next steps.
Example response:
{
"id": "c026bb41-e4c4-48bf-9ac6-06f3f2e44129",
"state": "OPEN",
"columns": [],
"batch": 1618915514245,
"fullscanned": true,
"delim": "COMMA",
"skiph": 1,
"compression": "GZIP",
"format": "CSV",
"created": "2021-04-20T10:45:14.245846",
"modified": "2021-04-20T10:45:14.268148"
}
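The same request can also be made programmatically. The sketch below uses the Python requests library; the base URL and path template are assumptions derived from the parameters above, so check the exact endpoint in the Notify API reference or the Swagger UI.

import requests

# Placeholder values; see chapter 5.1 for the authentication headers.
NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"
headers = {"Content-Type": "application/json"}  # add the authentication headers here

# Assumed path template built from the documented parameters; verify it in the
# Notify API reference / Swagger UI.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description/manifests")

manifest = {
    "compression": "GZIP",
    "delim": "COMMA",
    "format": "CSV",
    "fullscanned": True,
    "skiph": 1,
}

response = requests.post(NOTIFY_API_BASE + path, json=manifest, headers=headers)
response.raise_for_status()          # expect 201 Created
manifest_id = response.json()["id"]  # needed for the entry and notify requests
print(manifest_id)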
5.3 Create manifest entries
Manifest entries list the files to be loaded. Entries can be created either all at once (PUT) or one by one (POST); PUT is recommended since it is faster. You may list just one file or as many files per manifest as the target database can handle in one file load command.
Batch IDs and the Run ID Logic
Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.
Creating manifest entries (PUT) in the UI:
1. Create manifest entries with Create multiple manifest entries.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
id | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Request body:
[
{
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see ManifestEntry.
3. Press Execute.
4. Response code 201 indicates a successful request.
Example response:
[
{
"id": "d46c6bfd-fbfd-46f4-a8b8-44084006193f",
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"id": "d846a814-f7d6-4dcc-b04e-01ffbc075295",
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
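A programmatic version of the same PUT request could look roughly as follows; as before, the base URL and path template are assumptions to be verified against the Notify API reference or the Swagger UI.

import requests

NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"  # placeholder
headers = {"Content-Type": "application/json"}                 # add authentication headers (chapter 5.1)
manifest_id = "c026bb41-e4c4-48bf-9ac6-06f3f2e44129"           # from Create manifest

# Assumed path template; verify the exact endpoint in the Notify API reference.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description"
        f"/manifests/{manifest_id}/entries")

entries = [
    {
        "batch": 1618916378963,
        "sourceFile": "azure://example.blob.core.windows.net/example-container/"
                      "taxidata/ratecode_description.batch.1618916378963.csv.gz",
    },
    {
        "batch": 1618916457230,
        "sourceFile": "azure://example.blob.core.windows.net/example-container/"
                      "taxidata/ratecode_description.batch.1618916457230.csv.gz",
    },
]

# PUT creates all entries in one request, which is the recommended approach.
response = requests.put(NOTIFY_API_BASE + path, json=entries, headers=headers)
response.raise_for_status()  # expect 201 Created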
5.4 Notify
Notify changes the manifest state from OPEN to NOTIFIED. After that the manifest is sent for processing and can no longer be changed.
Notifying a manifest in the UI:
1. Use Notify ADE by manifest to be ready for processing, press Try it out.
2. Enter details, for example:
Field | Example value |
---|---|
tenantName | local |
installationName | local |
environmentName | local |
sourceSystemName | taxidata |
sourceEntityName | ratecode_description |
id | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
3. Press Execute.
4. Response code 200 indicates a successful request.
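A programmatic notify call could look roughly like this; the path and HTTP method are assumptions, so verify them in the Notify API reference or the Swagger UI.

import requests

NOTIFY_API_BASE = "https://<your-notify-api-host>/notify-api"  # placeholder
headers = {"Content-Type": "application/json"}                 # add authentication headers (chapter 5.1)
manifest_id = "c026bb41-e4c4-48bf-9ac6-06f3f2e44129"           # from Create manifest

# Assumed path and method; verify them in the Notify API reference / Swagger UI.
path = ("/tenants/local/installations/local/environments/local"
        "/source-systems/taxidata/source-entities/ratecode_description"
        f"/manifests/{manifest_id}/notify")

response = requests.post(NOTIFY_API_BASE + path, headers=headers)
response.raise_for_status()  # expect 200 OK; the manifest state changes to NOTIFIED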
The manifest is now queued and you are ready to test the file load; continue to chapter 6 Executing a file load.
6. Executing a file load
If you have not done so already, commit the CONFIG_LOAD_SCHEDULES configuration package and your staging package to deploy the changes into a Runtime environment.
Execute the file load:
1. Navigate to Workflow Orchestration of the desired environment from the Runtime menu.
2. Find your workflow, e.g. TAXIDATA, and enable it from the toggle switch.
3. Open the workflow by pressing its name.
4. Trigger the workflow by pressing Trigger DAG.
5. Wait for the load to finish and check the logs in the Workflow Orchestration UI.
See Triggering a Workflow for further information.