Loading Source Data
Guide objective
Follow this guide to learn how to configure and execute a source data load.
Agile Data Engine executes loads by running SQL commands in the target database. The parameters of a file load command instruct the target database management system how to execute the load. These parameters contain information such as:
Target entity name
Paths to source files
Authorization to access source files
Source file format
Various target database specific load options
The premise is that source data files are stored in cloud storage and the target database management system has access to them.
0. Prerequisites
0.1 Target database configuration
A target database has to be configured and a Runtime connected to it before proceeding with loading source data. See Configuring Target Databases for instructions.
0.2 Source data integration to cloud storage
Source data has to be stored in a cloud storage service supported by the target database:
AWS S3
Azure Blob Storage
Azure Data Lake Storage
Google Cloud Storage
Source data has to be in a file format supported by the target database, for example:
CSV
JSON
XML
PARQUET
AVRO
If there are no source data files in your cloud storage yet and no existing source system integrations that could produce them, you can extract sample files manually from a source system and upload them to cloud storage for initial file load development and testing.
1. Configuring source data access
1.1 Configuring the target database for cloud storage access
Agile Data Engine executes a file load by running an SQL command in the target database that specifies which files are to be loaded from a specified cloud storage into a specified table. Therefore, the database needs access to that cloud storage. Cloud storage access is usually configured for an identity that is referenced by an object in the database management system; use of this object is then granted to the user running file loads. Additionally, you might have to configure network settings if network access to the cloud storage is restricted.
Cloud storage access configuration is specific to the database product and the cloud storage service; find the details for your environment below and refer to the database product documentation.
Private Edition:
Note that with the Private Edition you can alternatively use the default storage, which does not require configuration. A default storage bucket/container is created per Runtime environment in the Agile Data Engine installation process. See chapter 1.2 Using Agile Data Engine default storage.
Snowflake
Create a storage integration and test connectivity between Snowflake and the cloud storage bucket/container according to Snowflake instructions.
In Snowflake file loads are executed as COPY commands. You may use storage integrations (preferred) and/or external stages with Agile Data Engine file loads to configure access to source files in cloud storage. An external stage might be necessary if you need to transform data during a file load with an OVERRIDE_FILE_LOAD step (also see chapter 4.2 Customizing a file load).
If the cloud storage is network restricted or if you are using Private Link, refer to the Snowflake documentation.
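As a rough sketch, a storage integration and the kind of COPY command run for a file load could look like the following; the integration name, storage URL, staging table and file name are illustrative, not values from this guide.

```sql
-- Illustrative storage integration granting Snowflake access to an Azure container
CREATE STORAGE INTEGRATION taxidata_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://example.blob.core.windows.net/example-container/');

-- The kind of COPY command a file load runs against the staging table
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA
  FROM 'azure://example.blob.core.windows.net/example-container/taxidata/'
  STORAGE_INTEGRATION = taxidata_int
  FILES = ('ratecode_description.batch.1618916378963.csv.gz')
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = GZIP);
```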
Amazon Redshift
Create an IAM role, associate it with the Redshift cluster and grant the role access to the source data bucket. Test connectivity between Redshift and the bucket using the role. In the SaaS Edition, the Agile Data Engine support team will also grant that role permissions to an S3 bucket that will contain manifest files.
File loads are executed in Amazon Redshift as COPY commands. See AWS documentation for details about authorization parameters and access permissions.
See AWS documentation on controlling access from VPC endpoints with bucket policies if network restrictions are required.
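As a sketch, the kind of COPY command generated for a file load could look like the following; the role ARN, bucket, staging table and file name are illustrative.

```sql
-- Illustrative Redshift COPY command for a gzipped CSV file with a header row
COPY staging.stg_ratecode_description_taxidata
FROM 's3://example-bucket/taxidata/ratecode_description.batch.1618916378963.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
FORMAT AS CSV
DELIMITER ','
IGNOREHEADER 1
GZIP;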
Azure SQL Database
Create an external data source and test connectivity between the SQL Database and the storage container.
File loads are executed in Azure SQL Database as BULK INSERT commands. Use external data sources to configure access to source files in Azure Storage.
See Configure Azure Storage firewalls and virtual networks for instructions on Azure Storage network settings.
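As a sketch, an external data source and the kind of BULK INSERT command used for a file load might look like the following; the data source name, credential, staging table and file path are illustrative, and the credential is assumed to be an existing database scoped credential.

```sql
-- Illustrative external data source pointing to the source data container
CREATE EXTERNAL DATA SOURCE example_ade_source
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://example.blob.core.windows.net/example-container',
    CREDENTIAL = example_sas_credential
);

-- The kind of BULK INSERT command a file load runs against the staging table
BULK INSERT staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'taxidata/ratecode_description.batch.1618916378963.csv'
WITH (
    DATA_SOURCE = 'example_ade_source',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
);
```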
Azure Synapse SQL
Configure authentication with a Managed Identity and test connectivity between Synapse and the storage container.
File loads are executed in Azure Synapse SQL as COPY commands. Configure authentication using a Managed Identity according to the Microsoft documentation to access source files in Azure Storage.
Grant access to trusted Azure services in the storage account network settings.
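As a sketch, the kind of COPY command used in Azure Synapse SQL could look like the following; the storage URL, staging table and file name are illustrative.

```sql
-- Illustrative Synapse COPY command authenticating with the workspace Managed Identity
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA
FROM 'https://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,
    COMPRESSION = 'GZIP'
);
```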
1.2 Using Agile Data Engine default storage
Edition: Private
Agile Data Engine Private Edition installation creates a default cloud storage bucket/container for incoming data per Runtime environment. Optionally, this can be used without extra configuration by uploading source data files there. Agile Data Engine inserts an identity and a key/token into the data load commands so that the target database can access files in the default storage.
2. Creating entities
2.1 Creating a package
Entities are created in packages. Refer to the Designing packages guide to understand how the package structure should be designed.
Create a package for your staging entities:
1. Open the Designer.
2. Give a package name and a package description, for example:
FIELD | EXAMPLE VALUE |
---|---|
PACKAGE NAME: | STG_TAXIDATA |
PACKAGE DESCRIPTION: | #staging #taxidata |
3. Press Create package.
2.2 Creating a source entity
Source entities are of type METADATA_ONLY and their purpose is to describe the source data. Note that Agile Data Engine does not create METADATA_ONLY entities into the target database.
A source entity should be named after the dataset in the source system, e.g. a database table.
A source system is defined for source entities. See CONFIG_SYSTEMS on how to configure source systems.
Attributes are optional in a source entity.
Source entities can be used to import the source system data model as metadata into Agile Data Engine. See the entity import reference for more information on how to import entities.
SaaS Edition:
Source entities are mandatory.
Source entity name and source system name are referred to later on in the file loading process when calling the Notify API (chapter 5).
Private Edition:
Source entities are optional yet recommended.
Source entity name is referred to later on in the file loading process when naming the files (chapter 5).
If source entities are not defined, filenames have to correspond to staging table names.
Create a source entity for your source dataset:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
ENTITY LOGICAL NAME: | RATECODE_DESCRIPTION |
TYPE: | SOURCE |
SOURCE SYSTEM: | taxidata |
PHYSICAL TYPE: | METADATA_ONLY |
SCHEMA: | src |
Note that if you have not configured a source system yet, create one in CONFIG_SYSTEMS.
4. Press Create.
2.3 Creating a stage entity
Stage entities (staging tables) are the file load target tables where the data is first loaded. Staging table attributes and attribute data types have to match the incoming data files. A common approach is to import staging tables into the Designer instead of creating them manually. See the entity import reference for more information on how to import entities.
To create a staging table manually, follow these instructions:
1. Open the staging package.
2. In the Entities tab, press Create new entity.
3. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
ENTITY LOGICAL NAME: | STG_RATECODE_DESCRIPTION_TAXIDATA |
TYPE: | STAGE |
PHYSICAL TYPE: | TABLE |
SCHEMA: | staging |
4. Press Create.
5. In the Attributes tab add attributes by pressing Create attribute and entering attribute details.
3. Creating a load schedule
A load schedule is required for running (file) loads. Loads placed into load schedules form load workflows.
Inherited schedules
Since the file load is usually the first load in a workflow, it is sufficient to add only the file load into a schedule. Loads further down the workflow inherit the schedule through the dependencies mapped in entity mappings if their schedules are set to NONE.
An inherited schedule can be overridden by defining another load schedule for a load. This is common, for example, with publish loads which could be dependent on entities loaded in several different workflows.
See Designing Workflows to learn more about workflows.
Create a load schedule for your source data:
1. Open the CONFIG_LOAD_SCHEDULES configuration package.
2. In the CONFIGURATION tab, press Create new load schedule.
3. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
SCHEDULE NAME: | TAXIDATA |
CRON EXPRESSION: | 30 2 * * * |
LOAD POOL NAME: | |
DAG GENERATION MODE: | |
DESCRIPTION: | #taxidata |
4. Press Create.
5. Commit the changes from the PACKAGE DETAILS - Commit tab by pressing Commit package, entering a commit message and pressing Commit package again.
4. Creating a file load
4.1 Creating a file load from source to stage
Create a file load from the source entity into the staging table:
1. Open the staging entity and go to the Loads tab.
2. Create a new load by pressing Create incoming load.
3. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
LOAD NAME: | load_stg_ratecode_description_taxidata_from_ratecode_description |
LOAD TYPE: | LOAD_FILE |
SCHEDULE: | TAXIDATA |
CRON EXPRESSION: | |
DESCRIPTION: | |
4. Press Create.
5. Create an entity mapping by pressing Create entity mapping.
6. Find the source entity, e.g. RATECODE_DESCRIPTION, and press Select entity.
7. If you have created or imported attributes into your source entity, map them to the stage entity. If not, go to the next step.
8. Go back to the load view by clicking the name of the load, e.g. load_stg_ratecode_description_taxidata_from_ratecode_description.
The next steps depend on your target database product and choices made in chapter 1 Configuring source data access.
Private Edition & Default storage
If you are using the Private Edition and your source data is in the Agile Data Engine default storage, you can go to chapter 5 Queuing incoming files.
Snowflake
If you are using a storage integration:
9. Add a load option by pressing Add load option.
10. Select OPT_STORAGE_INTEGRATION and enter the name of your storage integration.
11. Press Add.
Amazon Redshift
The IAM role with access to source data is defined in the Agile Data Engine configuration and is placed into COPY commands automatically. In the SaaS Edition this configuration is done by the SaaS support team. No additional steps are required.
Azure SQL Database
If you are using an external data source:
9. Add a load option by pressing Add load option.
10. Select OPT_STORAGE_INTEGRATION and enter the name of your external data source.
11. Press Add.
Azure Synapse SQL
In Azure Synapse source data is accessed with a Managed Identity (see chapter 1 Configuring source data access). No additional steps are required.
4.2 Customizing a file load
There are several load options that can be used with file loads. Please refer to the reference guide for more details and examples. Note that often load options are target database product specific.
The OVERRIDE_FILE_LOAD load step type can be used to fully customize a file load, i.e. to manually define the load SQL.
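For example, in Snowflake a customized load could transform data while copying from an external stage (see chapter 1.1). The sketch below is illustrative only: it uses plain SQL rather than any Agile Data Engine templating, and the stage name and column names are hypothetical.

```sql
-- Illustrative override load SQL: transform columns while copying from an external stage
COPY INTO staging.STG_RATECODE_DESCRIPTION_TAXIDATA (ratecode_id, ratecode_name)
FROM (
    SELECT $1, UPPER($2)
    FROM @taxidata_stage/taxidata/
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = GZIP)
PATTERN = '.*ratecode_description.*csv.gz';
```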
5. Queuing incoming files
The method for queuing incoming files depends on your Agile Data Engine edition:
SaaS: Use the Notify API
Private: Queue manifests (recommended) or single files
5.1 Queuing files with Notify API
Edition: SaaS
Notify API is used for queuing incoming source data files to be loaded into the target database in the Agile Data Engine SaaS Edition. Notify API has to be called from an external service, e.g. an integration tool, when source files are in cloud storage ready to be loaded into the data warehouse. In addition, Agile Data Engine provides a Swagger UI that enables manual API calls for development and testing purposes.
Follow the steps below to queue incoming files. It is recommended to first test the process manually in the UI. Once the process is tested and working, it is easier to adapt it to an external service that will run the API requests. See the reference guide for further information and example solutions.
Authentication
All requests have to be authenticated. When calling the APIs from an external service, authentication is done with an API key and secret which are provided by the Agile Data Engine support team. Also, the public IP address of the service executing the requests has to be allowed by the support team.
To authenticate in the UI:
Navigate to the UI from the RUNTIMES menu. Select the desired Runtime environment and press Notify API.
Authorize the requests by pressing Authorize, then again Authorize under OAuth2.
Response: 401 Unauthorized
Error 401 when executing a request in the UI usually indicates that the authorization has expired. Go back to Authorize, Logout and Authorize again.
Create a manifest
Manifests give Agile Data Engine information about the files to be loaded such as file format details and a list of files as manifest entries.
To create a manifest in the UI:
1. Use Create manifest, press Try it out.
2. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
tenantName: | local |
installationName: | local |
environmentName: | local |
sourceSystemName: | taxidata |
sourceEntityName: | ratecode_description |
Request body:
{
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see Manifest.
3. Press Execute.
4. Response code 201 indicates a successful request. Copy the manifest id from the response body and keep it for the next steps.
Example response:
{
"id": "c026bb41-e4c4-48bf-9ac6-06f3f2e44129",
"state": "OPEN",
"columns": [],
"batch": 1618915514245,
"fullscanned": true,
"delim": "COMMA",
"skiph": 1,
"compression": "GZIP",
"format": "CSV",
"created": "2021-04-20T10:45:14.245846",
"modified": "2021-04-20T10:45:14.268148"
}
Create manifest entries
Manifest entries list the files to be loaded. Entries can be created all at once (PUT) or one by one (POST); all at once (PUT) is recommended since it is faster. You may list just one file or as many files per manifest as the target database can handle in one file load command.
Batch IDs and the Run ID Logic
Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.
Creating manifest entries (PUT) in the UI:
1. Create manifest entries with Create multiple manifest entries.
2. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
tenantName: | local |
installationName: | local |
environmentName: | local |
sourceSystemName: | taxidata |
sourceEntityName: | ratecode_description |
id: | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Request body:
[
{
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
See the reference guide for detailed options regarding the Request body. Accepted values are also listed in the UI: Scroll down to Schemas and see ManifestEntry.
3. Press Execute.
4. Response code 201 indicates a successful request.
Example response:
[
{
"id": "d46c6bfd-fbfd-46f4-a8b8-44084006193f",
"batch": 1618916378963,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"id": "d846a814-f7d6-4dcc-b04e-01ffbc075295",
"batch": 1618916457230,
"sourceFile": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
]
Notify
Notify changes the manifest state from OPEN to NOTIFIED. After that, the manifest is sent to processing and can no longer be changed.
Notifying a manifest in the UI:
1. Use Notify ADE by manifest to be ready for processing, press Try it out.
2. Enter details, for example:
FIELD | EXAMPLE VALUE |
---|---|
tenantName: | local |
installationName: | local |
environmentName: | local |
sourceSystemName: | taxidata |
sourceEntityName: | ratecode_description |
id: | c026bb41-e4c4-48bf-9ac6-06f3f2e44129 |
Notes
tenantName, installationName and environmentName are always local.
sourceSystemName and sourceEntityName have to match what you have set for your source entity in chapter 2.2.
id is the manifest ID received in the response of Create manifest.
3. Press Execute.
4. Response code 200 indicates a successful request.
The manifest is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.
5.2 Queuing single files
Edition: Private
This method can be used only in the Private Edition and only with the Agile Data Engine default storage for incoming data (unless using OPT_DATA_FILE_LOCATION). Chapter 5.3 Queuing manifests offers more flexibility and is therefore the recommended method for the Private Edition. However, queuing single files is quick and easy, so it is often a good starting point when developing and testing file loads. See the reference guide for single file loads for further information.
1. Upload a file named according to the naming convention to the file load bucket/container. Information about the file format etc. is given in the filename. For example:
azure://example.blob.core.windows.net/example-file-load-container/ratecode_description/table.ratecode_description.1618916378963.batch.1618916378963.fullscanned.true.delim.comma.skiph.1.csv.gz
2. In Private AWS a notification file is automatically created in the notify bucket. In Private Azure, create a notification file in the notify container. A notification file is named like the data file but its content is empty. For example:
azure://example.blob.core.windows.net/example-notify-container/ratecode_description/table.ratecode_description.1618916378963.batch.1618916378963.fullscanned.true.delim.comma.skiph.1.csv.gz
The file is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.
5.3 Queuing manifests
Edition: Private
Queuing incoming data files with manifests is the recommended method for Agile Data Engine Private Edition. Manifests can be used both with the default storage and with any other cloud storage bucket/container that the target database can access (see chapter 1 Configuring source data access).
A manifest is a JSON file that gives Agile Data Engine information about the files to be loaded such as file format details and a list of files as entries. See the reference guide for manifest loads for further information.
1. Upload data files to cloud storage. For example:
azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz
azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz
2. Create a manifest according to the guidelines, for example:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
],
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
3. Upload the manifest to the file load bucket/container (default storage), for example:
azure://example.blob.core.windows.net/example-file-load-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
4. In Private AWS a notification file is automatically created in the notify bucket. In Private Azure, create a notification file in the notify container. A notification file is named like the manifest but its content is empty. For example:
azure://example.blob.core.windows.net/example-notify-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
The manifest is now queued and you are ready to test the file load; go to chapter 6 Executing a file load.
Batch IDs and the Run ID Logic
Batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic.
6. Executing a file load
If you have not done so yet, first commit the CONFIG_LOAD_SCHEDULES configuration package and your staging package to deploy the changes into a Runtime environment.
Execute the file load:
1. Navigate to Workflow Orchestration from the RUNTIMES menu by selecting the desired Runtime environment and pressing Workflow orchestration.
2. Find your workflow, e.g. TAXIDATA, and enable it from the toggle switch.
3. Open the workflow by pressing its name.
4. Trigger the workflow by pressing Trigger DAG.
5. Wait for the load to finish and check the logs from the Workflow Orchestration UI.
See Triggering a Workflow for further information.