Manifest load
Edition: Private
A manifest load can load multiple files in one file load command. A manifest is a JSON file that contains a list of file locations and parameters to guide the file load.
Manifests are uploaded into the Agile Data Engine default file load bucket/container and notifications per manifest are created into the notify bucket/container.
Data files can be stored in any cloud storage bucket/container accessible by the target database.
Manifest content
A generic format for the manifest content:
{
  "table": {string},
  "batch": {integer},
  "fullscanned": {boolean},
  "delim": {string},
  "skiph": {integer},
  "format": {string},
  "compression": {string},
  "entries": [
    {
      "mandatory": {boolean},
      "url": {string},
      "batch": {integer}
    },
    ...
  ],
  "columns": [
    {string},
    ...
  ]
}
Manifest parameter details and guidelines:
Parameter | Required/Optional | Alternatively in manifest name* | Value | Default | Applicable format | Description |
---|---|---|---|---|---|---|
table | OPTIONAL | x | source entity name | | all | Name of the SOURCE entity. If a SOURCE entity is not defined, name of the STAGE entity. |
batch | OPTIONAL | x | manifest batch id | | all | Identifier of the data batch from the process that generated the file. Note that batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic. |
fullscanned | OPTIONAL | x | true/false | false | all | Describes whether the manifest represents the full data from the source or just a part of it. When set to true, the target table is truncated before the file load. |
delim | OPTIONAL | x | comma/tab/semicolon/pipe/hash | comma | csv | Field delimiter of a CSV formatted data file. |
skiph | OPTIONAL | x | number of header lines to skip | 0 | csv | Number of header rows in a CSV formatted data file which should be skipped in the file load. |
format | CONDITIONAL** | x | csv/json/xml/parquet/avro | | all | Data file type. Note that supported formats depend on the target database product. |
compression | OPTIONAL | x | gzip/bz2/lzo | | all | Data file compression. Note that supported compression methods depend on the target database product. |
entries | REQUIRED | | array of entries | | all | Source file entries and their parameters listed in an array. |
columns | OPTIONAL | | array of column names | | csv | SNOWFLAKE: List of column names in CSV files. Can be used when the CSV file has fewer columns than the target staging table or when the column order is different. |
* Manifest parameters (other than the arrays) can either be defined in the manifest content or in the manifest name.
** Format has to be defined either in the manifest content or in the manifest name.
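To illustrate how the parameters above fit together, here is a minimal Python sketch that builds a manifest as a dictionary and serializes it to JSON; the entity name, file URL and parameter values are illustrative placeholders, not values required by Agile Data Engine:

import json

# Illustrative values only; replace with your own source entity and file locations.
manifest = {
    "table": "ratecode_description",   # SOURCE entity name
    "fullscanned": False,              # this manifest holds only part of the source data
    "format": "CSV",
    "delim": "COMMA",
    "skiph": 1,
    "compression": "GZIP",
    "entries": [
        {
            "mandatory": True,
            "url": "s3://example-bucket/taxidata/ratecode_description.csv.gz"
        }
    ]
}

print(json.dumps(manifest, indent=2))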
Manifest entry parameter details and guidelines:
Parameter | Required/Optional | Value | Default | Description |
---|---|---|---|---|
mandatory | REQUIRED | true/false | false | REDSHIFT: Specifies whether COPY should return an error if the file is not found. OTHER: Required due to legacy reasons. |
url | REQUIRED | entry filename and path | | Entry (file) to be loaded. The path depends on where source entries are located (e.g. S3 or Azure Storage) and how the file load and the target database are configured. |
batch | OPTIONAL | file batch id | | Identifier of the data batch from the process that generated the file. Note that batch ids can be given either on the manifest level (all files are part of the same batch) or on the manifest entry level (files are separate batches). Batch ids are essential with the Run ID Logic. |
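Batch ids can be placed either on the manifest level or on the entry level; a short sketch of the two alternatives, with illustrative ids and URLs (other manifest parameters omitted for brevity):

# All files belong to the same batch: one manifest-level batch id.
manifest_level_batch = {
    "batch": 1618916378963,
    "entries": [
        {"mandatory": True, "url": "s3://example-bucket/taxidata/part1.csv.gz"},
        {"mandatory": True, "url": "s3://example-bucket/taxidata/part2.csv.gz"}
    ]
}

# Each file is a separate batch: one batch id per entry.
entry_level_batch = {
    "entries": [
        {"mandatory": True, "batch": 1618916378963, "url": "s3://example-bucket/taxidata/part1.csv.gz"},
        {"mandatory": True, "batch": 1618916457230, "url": "s3://example-bucket/taxidata/part2.csv.gz"}
    ]
}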
Manifest naming convention
The manifest naming convention is similar to the file naming convention in single file loads:
manifest-table.{table_value}.{timestamp_value}.batch.{batch_value}.fullscanned.{fullscanned_value}.delim.{delim_value}.skiph.{skiph_value}.{format_value}.{compression_value}.json
See parameter details and guidelines in the table above. Parameter values can be given either in the manifest name or in the manifest content. The minimum requirement for a manifest name is:
manifest-table.{table_value}.{timestamp_value}.json
where:
{table_value} is the SOURCE entity name or the STAGE entity name if a SOURCE entity has not been defined.
{timestamp_value} is the manifest creation timestamp in milliseconds, which ensures that the file name is unique.
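A minimal sketch of composing a manifest name that follows this convention; the entity name is a placeholder and the timestamp is the current epoch time in milliseconds:

import time

table_value = "ratecode_description"        # SOURCE (or STAGE) entity name
timestamp_value = int(time.time() * 1000)   # creation timestamp in milliseconds

# Minimum required form: manifest-table.{table_value}.{timestamp_value}.json
manifest_name = f"manifest-table.{table_value}.{timestamp_value}.json"
print(manifest_name)   # e.g. manifest-table.ratecode_description.1618916457230.json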
Notes
Manifest loads are the recommended file load method in Agile Data Engine Private Edition. They offer more flexibility than single file loads.
Manifest loads can significantly improve load performance if the target database product is able to load multiple files in parallel.
A manifest can contain anything from a single entry up to as many entries (listed files) as the target database supports in one file load command.
File loads can be customized with OVERRIDE_FILE_LOAD. For example, a part of the file path could be defined in the OVERRIDE_FILE_LOAD step, and manifests would then contain URLs relative to that path (see the sketch after these notes).
SNOWFLAKE: External stages can be referred to in manifests with the URL prefix sfstage:// or @. See the example below.
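As an illustration of the OVERRIDE_FILE_LOAD note above, the sketch below assumes that the bucket and base path (for example s3://example-bucket/taxidata/) have been defined in a customized OVERRIDE_FILE_LOAD step, so the entry urls contain only the relative part; the OVERRIDE_FILE_LOAD configuration itself is not shown and all values are illustrative:

# Assumes the base path is supplied by a customized OVERRIDE_FILE_LOAD step.
manifest_with_relative_urls = {
    "format": "CSV",
    "delim": "COMMA",
    "compression": "GZIP",
    "skiph": 1,
    "entries": [
        {"mandatory": True, "batch": 1618916378963, "url": "ratecode_description.batch.1618916378963.csv.gz"},
        {"mandatory": True, "batch": 1618916457230, "url": "ratecode_description.batch.1618916457230.csv.gz"}
    ]
}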
Examples
Manifest CSV file load into a staging table in Private AWS
SOURCE and STAGE entities created in the Designer and deployed into a Runtime environment:
NAME | TYPE | SOURCE SYSTEM | PHYSICAL TYPE |
---|---|---|---|
RATECODE_DESCRIPTION | SOURCE | taxidata | METADATA_ONLY |
STG_RATECODE_DESCRIPTION_TAXIDATA | STAGE | | TABLE |
An incoming file load has to be configured for the STAGE entity where the SOURCE entity is mapped as the source. The file load has to be placed in a schedule.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "s3://example-bucket/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "s3://example-bucket/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
],
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Filename:
manifest-table.ratecode_description.1618916457230.json
File uploaded to:
s3://example-file-load-bucket/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
Notification file automatically created:
s3://example-notify-bucket/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
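For instance, the manifest in this example could be uploaded with the AWS SDK for Python (boto3); this is a minimal sketch that assumes suitable AWS credentials and that the bucket and key above are the configured file load location, with the notification file then created automatically as shown:

import boto3

s3 = boto3.client("s3")

# Upload the manifest JSON shown above into the file load bucket.
s3.upload_file(
    Filename="manifest-table.ratecode_description.1618916457230.json",
    Bucket="example-file-load-bucket",
    Key="manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json"
)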
Once deployed and enabled, the workflow is executed according to the schedule or it can be manually triggered from the Workflow Orchestration.
Manifest CSV file load into a staging table in Private Azure
SOURCE and STAGE entities created in the Designer and deployed into a Runtime environment:
NAME | TYPE | SOURCE SYSTEM | PHYSICAL TYPE |
---|---|---|---|
RATECODE_DESCRIPTION | SOURCE | taxidata | METADATA_ONLY |
STG_RATECODE_DESCRIPTION_TAXIDATA | STAGE | | TABLE |
An incoming file load has to be configured for the STAGE entity where the SOURCE entity is mapped as the source. The file load has to be placed in a schedule.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916378963.csv.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "azure://example.blob.core.windows.net/example-container/taxidata/ratecode_description.batch.1618916457230.csv.gz"
}
],
"compression": "GZIP",
"delim": "COMMA",
"format": "CSV",
"fullscanned": true,
"skiph": 1
}
Filename:
manifest-table.ratecode_description.1618916457230.json
File uploaded to:
azure://example.blob.core.windows.net/example-file-load-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
Notification file uploaded to:
azure://example.blob.core.windows.net/example-notify-container/manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json
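For instance, the manifest in this example could be uploaded with the Azure Storage SDK for Python; this is a minimal sketch that assumes a credential with write access to the container and uses the account, container and blob path listed above:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://example.blob.core.windows.net",
    credential="<SAS token or other credential>"
)
blob = service.get_blob_client(
    container="example-file-load-container",
    blob="manifest/ratecode_description/manifest-table.ratecode_description.1618916457230.json"
)

# Upload the manifest JSON shown above into the file load container.
with open("manifest-table.ratecode_description.1618916457230.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)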
Once deployed and enabled, the workflow is executed according to the schedule or it can be manually triggered from the Workflow Orchestration.
Example manifest with JSON files in Google Storage
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "gs://example-bucket/taxidata/ratecode_description.batch.1618916378963.json.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "gs://example-bucket/taxidata/ratecode_description.batch.1618916457230.json.gz"
}
],
"compression": "GZIP",
"format": "JSON",
"fullscanned": false
}
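In a setup like this, the entries array could also be generated by listing the matching objects with the Google Cloud Storage client library; the bucket name and prefix below are the illustrative values from the manifest above, and the batch ids are left out for brevity (they could, for example, be parsed from the file names):

from google.cloud import storage

client = storage.Client()

# Build manifest entries from the gzipped JSON files under the taxidata/ prefix.
entries = []
for blob in client.list_blobs("example-bucket", prefix="taxidata/ratecode_description.batch."):
    entries.append({"mandatory": True, "url": f"gs://example-bucket/{blob.name}"})

manifest = {
    "format": "JSON",
    "compression": "GZIP",
    "fullscanned": False,
    "entries": entries
}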
Example manifest referencing a Snowflake external stage
Using an external stage is required when transforming data during a load in Snowflake.
In Agile Data Engine, file loads can be customized with OVERRIDE_FILE_LOAD.
An external stage can be referred to in manifests with the URL prefix sfstage:// or @.
The role that the Agile Data Engine user is using in Snowflake must be granted the appropriate privileges.
Manifest content:
{
"entries": [
{
"mandatory": true,
"batch": 1618916378963,
"url": "@example_schema.example_stage/taxidata/ratecode_description.batch.1618916378963.json.gz"
},
{
"mandatory": true,
"batch": 1618916457230,
"url": "@example_schema.example_stage/taxidata/ratecode_description.batch.1618916457230.json.gz"
}
],
"compression": "GZIP",
"format": "JSON",
"fullscanned": false
}
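As noted above, the same stage location can be referenced with either URL prefix; a small sketch of the two forms for one of the entries, assuming the sfstage:// form mirrors the @ form (stage and path are the illustrative names from the example):

stage = "example_schema.example_stage"
path = "taxidata/ratecode_description.batch.1618916378963.json.gz"

# Both are assumed to point at the same file in the Snowflake external stage.
url_with_at_prefix = f"@{stage}/{path}"
url_with_sfstage_prefix = f"sfstage://{stage}/{path}"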