Manifest load
Edition: Private
A manifest load can load multiple files in one file load command. A manifest is a JSON file that contains a list of file locations and parameters to guide the file load.
Manifests are uploaded into the Agile Data Engine default file load bucket/container and notifications per manifest are created into the notify bucket/container.
Data files can be stored in any cloud storage bucket/container accessible by the target database.
Manifest content
A generic format for the manifest content:
{
  "table": {string},
  "batch": {integer},
  "fullscanned": {boolean},
  "delim": {string},
  "skiph": {integer},
  "format": {string},
  "compression": {string},
  "entries": [
    {
      "mandatory": {boolean},
      "url": {string},
      "batch": {integer}
    },
    ...
  ],
  "columns": [
    {string},
    ...
  ]
}
Manifest parameter details and guidelines:
Parameter | Required/Optional | Alternatively in manifest name* | Value | Default | Applicable format | Description |
---|---|---|---|---|---|---|
table | OPTIONAL | x | source entity name | | all | Name of the SOURCE entity. If a SOURCE entity is not defined, name of the STAGE entity. |
batch | OPTIONAL | x | manifest batch id | | all | Identifier of the data batch from the process that generated the file. Note that batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic. |
fullscanned | OPTIONAL | x | true/false | false | all | Describes whether the manifest represents the full data from the source or just a part of it. When set to true, the target table is truncated before the file load. |
delim | OPTIONAL | x | comma/tab/semicolon/pipe/hash | comma | csv | Field delimiter of a CSV formatted data file. |
skiph | OPTIONAL | x | number of header lines to skip | 0 | csv | Number of header rows in a CSV formatted data file which should be skipped in the file load. |
format | CONDITIONAL** | x | csv/json/xml/parquet/avro | | all | Data file type. Note that supported formats depend on the target database product. |
compression | OPTIONAL | x | gzip/bz2/lzo | | all | Data file compression. Note that supported compression methods depend on the target database product. |
entries | REQUIRED | | array of entries | | all | Source file entries and their parameters listed in an array. |
columns | OPTIONAL | | array of column names | | csv | SNOWFLAKE: List of column names in CSV files. Can be used when the CSV file has fewer columns than the target staging table or when the column order is different. |
* Manifest parameters (other than the arrays) can either be defined in the manifest content or in the manifest name.
** Format has to be defined either in the manifest content or in the manifest name.
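To illustrate how the parameters above fit together, here is a minimal Python sketch that builds a manifest as a dictionary and serializes it to JSON; the entity name, file URL and parameter values are illustrative placeholders, not values required by Agile Data Engine:

import json

# Illustrative values only; replace with your own source entity and file locations.
manifest = {
    "table": "ratecode_description",   # SOURCE entity name
    "fullscanned": False,              # this manifest holds only part of the source data
    "format": "CSV",
    "delim": "COMMA",
    "skiph": 1,
    "compression": "GZIP",
    "entries": [
        {
            "mandatory": True,
            "url": "s3://example-bucket/taxidata/ratecode_description.csv.gz"
        }
    ]
}

print(json.dumps(manifest, indent=2))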
Manifest entry parameter details and guidelines:
Parameter | Required/Optional | Value | Default | Description |
---|---|---|---|---|
mandatory | REQUIRED | true/false | false | REDSHIFT: Specifies whether COPY should return an error if the file is not found. OTHER: Required due to legacy reasons. |
url | REQUIRED | entry filename and path | | Entry (file) to be loaded. The path depends on where source entries are located (e.g. S3 or Azure Storage) and how the file load and the target database are configured. |
batch | OPTIONAL | file batch id | | Identifier of the data batch from the process that generated the file. Note that batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic. |
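Batch ids can be placed either on the manifest level or on the entry level; a short sketch of the two alternatives, with illustrative ids and URLs (other manifest parameters omitted for brevity):

# All files belong to the same batch: one manifest-level batch id.
manifest_level_batch = {
    "batch": 1618916378963,
    "entries": [
        {"mandatory": True, "url": "s3://example-bucket/taxidata/part1.csv.gz"},
        {"mandatory": True, "url": "s3://example-bucket/taxidata/part2.csv.gz"}
    ]
}

# Each file is a separate batch: one batch id per entry.
entry_level_batch = {
    "entries": [
        {"mandatory": True, "batch": 1618916378963, "url": "s3://example-bucket/taxidata/part1.csv.gz"},
        {"mandatory": True, "batch": 1618916457230, "url": "s3://example-bucket/taxidata/part2.csv.gz"}
    ]
}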
Manifest naming convention
The manifest naming convention is similar to the file naming convention in single file loads:
manifest-table.{table_value}.{timestamp_value}.batch.{batch_value}.fullscanned.{fullscanned_value}.delim.{delim_value}.skiph.{skiph_value}.{format_value}.{compression_value}.json
See parameter details and guidelines in the table above. Parameter values can be given either in the manifest name or in the manifest content. The minimum requirement for a manifest name is:
manifest-table.{table_value}.{timestamp_value}.json
where:
{table_value} is the SOURCE entity name or the STAGE entity name if a SOURCE entity has not been defined.
{timestamp_value} is the manifest creation timestamp in milliseconds, which ensures that the file name is unique.
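A minimal sketch of composing a manifest name that follows this convention; the entity name is a placeholder and the timestamp is the current epoch time in milliseconds:

import time

table_value = "ratecode_description"        # SOURCE (or STAGE) entity name
timestamp_value = int(time.time() * 1000)   # creation timestamp in milliseconds

# Minimum required form: manifest-table.{table_value}.{timestamp_value}.json
manifest_name = f"manifest-table.{table_value}.{timestamp_value}.json"
print(manifest_name)   # e.g. manifest-table.ratecode_description.1618916457230.json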
Notes
Manifest loads are the recommended file load method in Agile Data Engine Private Edition. They offer more flexibility than single file loads.
Manifest loads can significantly improve load performance if the target database product is able to load multiple files in parallel.
A manifest can contain anything from a single entry up to as many entries (listed files) as the target database supports in one file load command.
File loads can be customized with OVERRIDE_FILE_LOAD. For example, a part of the file path could be defined in the OVERRIDE_FILE_LOAD step, and manifests would then contain URLs relative to that path (see the sketch after these notes).
SNOWFLAKE: External stages can be referred to in manifests with the URL prefix sfstage:// or @. See the example below.
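As an illustration of the OVERRIDE_FILE_LOAD note above, the sketch below assumes that the bucket and base path (for example s3://example-bucket/taxidata/) have been defined in a customized OVERRIDE_FILE_LOAD step, so the entry urls contain only the relative part; the OVERRIDE_FILE_LOAD configuration itself is not shown and all values are illustrative:

# Assumes the base path is supplied by a customized OVERRIDE_FILE_LOAD step.
manifest_with_relative_urls = {
    "format": "CSV",
    "delim": "COMMA",
    "compression": "GZIP",
    "skiph": 1,
    "entries": [
        {"mandatory": True, "batch": 1618916378963, "url": "ratecode_description.batch.1618916378963.csv.gz"},
        {"mandatory": True, "batch": 1618916457230, "url": "ratecode_description.batch.1618916457230.csv.gz"}
    ]
}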
Examples
Manifest CSV file load into a staging table in Private AWS
SOURCE and STAGE entities created in the Designer and deployed into a Runtime environment:
NAME | TYPE | SOURCE SYSTEM | PHYSICAL TYPE |
---|---|---|---|
RATECODE_DESCRIPTION | SOURCE | taxidata | METADATA_ONLY |
STG_RATECODE_DESCRIPTION_TAXIDATA | STAGE | | TABLE |
An incoming file load has to be configured for the STAGE entity where the SOURCE entity is mapped as the source. The file load has to be placed in a schedule.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "s3://example-bucket/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "s3://example-bucket/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
],
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Filename:
manifest-table.ratecode_description.1618916457230.json
File uploaded to:
s3://example-file-load-bucket/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
Notification file automatically created:
s3://example-notify-bucket/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
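For instance, the manifest in this example could be uploaded with the AWS SDK for Python (boto3); this is a minimal sketch that assumes suitable AWS credentials and that the bucket and key above are the configured file load location, with the notification file then created automatically as shown:

import boto3

s3 = boto3.client("s3")

# Upload the manifest JSON shown above into the file load bucket.
s3.upload_file(
    Filename="manifest-table.ratecode_description.1618916457230.json",
    Bucket="example-file-load-bucket",
    Key="manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json"
)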
Once deployed and enabled, the workflow is executed according to the schedule or it can be manually triggered from the Workflow Orchestration.
Manifest CSV file load into a staging table in Private Azure
SOURCE and STAGE entities created in the Designer and deployed into a Runtime environment:
NAME | TYPE | SOURCE SYSTEM | PHYSICAL TYPE |
---|---|---|---|
RATECODE_DESCRIPTION | SOURCE | taxidata | METADATA_ONLY |
STG_RATECODE_DESCRIPTION_TAXIDATA | STAGE | | TABLE |
An incoming file load has to be configured for the STAGE entity where the SOURCE entity is mapped as the source. The file load has to be placed in a schedule.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
],
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Filename:
manifest-table.ratecode_description.1618916457230.json
File uploaded to:
azure://example.blob.core.windows.net/example-file-load-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
Notification file uploaded to:
azure://example.blob.core.windows.net/example-notify-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
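For instance, the manifest in this example could be uploaded with the Azure Storage SDK for Python; this is a minimal sketch that assumes a credential with write access to the container and uses the account, container and blob path listed above:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://example.blob.core.windows.net",
    credential="<SAS token or other credential>"
)
blob = service.get_blob_client(
    container="example-file-load-container",
    blob="manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json"
)

# Upload the manifest JSON shown above into the file load container.
with open("manifest-table.ratecode_description.1618916457230.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)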
Once deployed and enabled, the workflow is executed according to the schedule or it can be manually triggered from the Workflow Orchestration.
Example manifest with JSON files in Google Storage
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "gs://example-bucket/taxidata/ratecode_description.batch.1618916378963.json.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "gs://example-bucket/taxidata/ratecode_description.batch.1618916457230.json.gz"
}
],
"compression": "GZIP",
"format": "JSON",
"fullscanned": false
}
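In a setup like this, the entries array could also be generated by listing the matching objects with the Google Cloud Storage client library; the bucket name and prefix below are the illustrative values from the manifest above, and the batch ids are left out for brevity (they could, for example, be parsed from the file names):

from google.cloud import storage

client = storage.Client()

# Build manifest entries from the gzipped JSON files under the taxidata/ prefix.
entries = []
for blob in client.list_blobs("example-bucket", prefix="taxidata/ratecode_description.batch."):
    entries.append({"mandatory": True, "url": f"gs://example-bucket/{blob.name}"})

manifest = {
    "format": "JSON",
    "compression": "GZIP",
    "fullscanned": False,
    "entries": entries
}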
Example manifest referencing a Snowflake external stage
Using an external stage is required when transforming data during a load in Snowflake.
In Agile Data Engine, file loads can be customized with OVERRIDE_FILE_LOAD.
An external stage can be referred to in manifests with the URL prefix sfstage:// or @.
The role that the Agile Data Engine user is using in Snowflake must be granted the appropriate privileges.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "@example_schema.example_stage/taxidata/ratecode_description.batch.1618916378963.json.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "@example_schema.example_stage/taxidata/ratecode_description.batch.1618916457230.json.gz"
}
],
"compression": "GZIP",
"format": "JSON",
"fullscanned": false
}
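As noted above, the same stage location can be referenced with either URL prefix; a small sketch of the two forms for one of the entries, assuming the sfstage:// form mirrors the @ form (stage and path are the illustrative names from the example):

stage = "example_schema.example_stage"
path = "taxidata/ratecode_description.batch.1618916378963.json.gz"

# Both are assumed to point at the same file in the Snowflake external stage.
url_with_at_prefix = f"@{stage}/{path}"
url_with_sfstage_prefix = f"sfstage://{stage}/{path}"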