Satellite
Data Vault satellites contain descriptive data, i.e. the context of a business key or a relationship. Satellites:
Are always dependent on one hub or one link
Store the history
Are source system specific
Can be split into multiple satellites according to the rate of change or domain
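For illustration, a satellite built with the default attribute naming used in this article could look roughly like the sketch below. The entity name S_CUSTOMER, the descriptive attributes and the data types are made up for the example; Agile Data Engine generates the actual DDL from the entity metadata.
CREATE TABLE rdv.S_CUSTOMER (
    dv_id            CHAR(32)  NOT NULL, -- hash key of the business key, same value as in the parent hub
    dv_load_time     TIMESTAMP NOT NULL, -- load timestamp, versions the context
    dv_run_id        NUMBER,             -- run id (see Using Run ID Logic)
    dv_source        VARCHAR,            -- source system name
    dv_datahash      CHAR(32),           -- hash of the descriptive attributes, used for change comparison
    customer_name    VARCHAR,            -- descriptive attributes from the source
    customer_segment VARCHAR,
    PRIMARY KEY (dv_id, dv_load_time)    -- each row is a version of the context for a dv_id
);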
Creating a satellite
Follow these steps to create a satellite:
Select or create a package and create a new entity, choose type SAT and physical type TABLE.
Navigate to the source entity (staging table) and create an outgoing load selecting the satellite as target entity.
Create an entity mapping as follows:
Source attribute | Transformation | Target attribute | Notes |
---|---|---|---|
Business key attribute(s) in source entity. | HASH | dv_id | Map the source attribute(s) for the business key. Note that if you add a foreign key reference from the satellite dv_id to the hub or link dv_id beforehand, you can use the same mapping as in the primary key automatically by selecting Copy by Reference. |
- | CURRENT_TS | dv_load_time | Load timestamp, does not need any modifications by default. |
- | RUN_ID | dv_run_id | Run ID (see Using Run ID Logic), does not need any modifications by default. |
Source name attribute in source entity. | NONE / CUSTOM | dv_source | Best practice is to have the source name as a metadata attribute on staging entities and to map it to Data Vault entities without any transformations. Alternatively, a CUSTOM transformation can be used e.g. with a source name or an environment variable given in single quotes. However, it is best to avoid the CUSTOM transformation type if possible. |
All relevant descriptive attributes and the business key attributes. | HASH | dv_datahash | The datahash is calculated from all mapped source attributes and is used to compare records when inserting new satellite records. A good practice is to map dv_datahash as the final attribute, as you can then select Map all mapped. Also map the business key attributes to the datahash, as the default load logic expects it. |
All relevant descriptive attributes. | NONE | Multiple, depending on satellite. | All relevant descriptive attributes mapped to target attributes. |
Use the Export SQL feature to preview entity SQL & load SQL. Note that Agile Data Engine will automatically create current views (entity type SAT_C) for satellites.
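For orientation, the load SQL for the hypothetical S_CUSTOMER satellite sketched above could resemble something like the following. This is only an outline of the insert-if-changed logic described in the mapping table, not the exact SQL Agile Data Engine generates; the table, column names and elided hash formulas are illustrative.
INSERT INTO rdv.S_CUSTOMER (
    dv_id, dv_load_time, dv_run_id, dv_source, dv_datahash, customer_name, customer_segment
)
SELECT DISTINCT
    MD5(NVL(UPPER(CAST(src.customer_id AS VARCHAR)), '-1')) AS dv_id,        -- HASH of the business key
    CURRENT_TIMESTAMP                                        AS dv_load_time, -- CURRENT_TS
    ...                                                      AS dv_run_id,    -- RUN_ID provided by the load
    src.stg_source_system                                    AS dv_source,    -- NONE
    MD5(...)                                                 AS dv_datahash,  -- HASH over all mapped attributes
    src.customer_name,
    src.customer_segment
FROM
    staging.STG_CUSTOMER src
WHERE NOT EXISTS (
    -- insert only when the latest stored version has a different datahash
    SELECT 1
    FROM rdv.S_CUSTOMER tgt
    WHERE tgt.dv_id = MD5(NVL(UPPER(CAST(src.customer_id AS VARCHAR)), '-1'))
      AND tgt.dv_datahash = MD5(...)
      AND tgt.dv_load_time = (SELECT MAX(dv_load_time) FROM rdv.S_CUSTOMER x WHERE x.dv_id = tgt.dv_id)
);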
Naming conventions
Note that you can modify the default entity and attribute naming used in these examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions. See more in Customizing default attributes and transformations.
Notes
There are different practices regarding whether the business key attributes mapped to dv_id should also be stored as attributes of the satellite. Storing all fields coming from the source in a satellite when possible is a good practice, as it enables you to create new entities out of historical data in the future if needed. Having non-hashed key values available can also help when troubleshooting.
Consider which attributes should be part of dv_datahash as it is used to detect changes in the incoming data. Usually all mapped attributes from the source are included but metadata attributes (e.g. stg_source_system, stg_source_entity) can be omitted. Note that the default load logic expects that the business key attributes are also mapped to the datahash.
The primary key of a satellite table is dv_id + dv_load_time, i.e. each row represents a version (dv_load_time) of the descriptive attributes for a business key or a relationship (dv_id). Therefore, the incoming data in a single satellite load execution should not contain multiple rows per dv_id as that would result in duplicates over dv_id and dv_load_time. If there can be situations where there are multiple batches of source data staged and loaded at the same time, use Run ID Logic or OPT_SCAN_ONLY_THE_LATEST_BY_KEY to avoid duplicates. See examples below for more details.
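For example, a quick check like the following reveals staged business keys with more than one row; STG_CUSTOMER and customer_id are the hypothetical example names used above:
-- Business keys with more than one staged row would create duplicates over dv_id and dv_load_time
SELECT
    customer_id,
    COUNT(*) AS staged_rows
FROM staging.STG_CUSTOMER
GROUP BY customer_id
HAVING COUNT(*) > 1;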
If you want to change the default configuration of any entity type, including satellites and other Data Vault entities, it can be done using the CONFIG_ENTITY_DEFAULTS configuration package. Similarly, transformations can be edited with CONFIG_LOAD_TRANSFORMATIONS.
Examples
Satellite load using Run ID Logic
In this example we configure a satellite load with Run ID Logic so that each source data batch is inserted into the satellite in a separate transaction, resulting in a different dv_load_time per batch. This way duplicates over dv_id and dv_load_time are avoided in a situation where there can be multiple source data batches loaded per load execution. See Using Run ID Logic to better understand the concepts.
First, create a satellite entity with a load as instructed above. Then configure load options OPT_USE_RUN_IDS and OPT_NUM_RUN_IDS_PER_TRANSACTION as follows:
OPT_USE_RUN_IDS: true
OPT_NUM_RUN_IDS_PER_TRANSACTION: 1
With these settings Run ID Logic is enabled and each run id is loaded into the satellite in a separate transaction and in the correct order. This results in a change history per dv_id correctly ordered by dv_load_time.
Satellite load using OPT_SCAN_ONLY_THE_LATEST_BY_KEY
If staged source data contains multiple rows over dv_id, a regular satellite load will result in duplicates over dv_id and dv_load_time. If the duplicates are caused by multiple source data batches being staged at the same time, see the Run ID Logic example above. If you are not using Run ID Logic, or if there are duplicates even within one batch of source data, you can use OPT_SCAN_ONLY_THE_LATEST_BY_KEY to pick the latest row over dv_id. However, note that this option discards all but the latest version over dv_id.
This method can also be used for the initial load of a satellite if you only need to persist the latest version over dv_id. See Initial load for a satellite below for more about initial loads.
First, create a satellite entity with a load as instructed above. Then configure the load option as follows:
OPT_SCAN_ONLY_THE_LATEST_BY_KEY: true
Note that by default the source entity must have an attribute of either the DV_LOAD_TIME or the SDT_STAGE_CREATE_TIME technical attribute type. The target entity (satellite) must also have either a DV_HASHKEY or a DV_REFERENCING_HASHKEY attribute, but this is set by default.
This results in a load that partitions the source dataset by dv_id, orders it by a staging timestamp and selects the latest row:
INSERT INTO rdv.S_EXAMPLE (
...
)
SELECT DISTINCT
...
FROM (
SELECT
...
, ROW_NUMBER()
OVER (
PARTITION BY
-- Data vault hash key formula, for example:
MD5(NVL(UPPER(CAST(key_attribute AS VARCHAR)), '-1'))
ORDER BY
-- Default staging timestamp, for example:
stg_create_time DESC
)
AS
latest_key_entry
FROM
staging.STG_BARENTSWATCH_AIS_VOYAGE src_entity
) src
WHERE
latest_key_entry = 1
...
;
You can customize which attribute is used for ordering the source data with OPT_SCAN_ONLY_THE_LATEST_WITH_SOURCE_ATTRIBUTE_TYPE. Define a technical attribute type for an attribute in the source entity and then set that type using the load option.
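For example, using the same load option format as above (the value is a placeholder; use the technical attribute type you assigned to the ordering attribute in the source entity):
OPT_SCAN_ONLY_THE_LATEST_BY_KEY: true
OPT_SCAN_ONLY_THE_LATEST_WITH_SOURCE_ATTRIBUTE_TYPE: <technical attribute type of the ordering attribute>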
Note that you can also combine this load pattern with Run ID Logic when needed:
Load one batch per transaction and get the latest row per batch when there can be multiple rows per dv_id within one batch.
Or only keep the latest row per dv_id per load execution by omitting OPT_NUM_RUN_IDS_PER_TRANSACTION, but otherwise benefit from Run ID Logic.
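For example, to load one batch per transaction and keep only the latest row per dv_id within each batch, the options from the two examples above could be combined:
OPT_USE_RUN_IDS: true
OPT_NUM_RUN_IDS_PER_TRANSACTION: 1
OPT_SCAN_ONLY_THE_LATEST_BY_KEY: true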
Initial load for a satellite
Initial loads are needed when creating new entities and the task is to load all historical data into them. With Data Vault hubs and links this is simple, as they are just lists of unique values or combinations, and the default loads can be used. However, satellites are more complicated, especially if the full change history needs to be preserved and sorted correctly in the initial load. If you only need to persist the latest version over dv_id, see Satellite load using OPT_SCAN_ONLY_THE_LATEST_BY_KEY above.
If you have a limited number of source data batches, you can use the method described in Satellite load using Run ID Logic. However, this method becomes slow when there are many batches, as each batch is inserted in a separate transaction.
Note that you can specify multiple loads for an entity, i.e. you can add initial loads as needed for entities in addition to the default incremental loads. These initial loads can be disabled by default and only run when needed in separate workflows.
This example uses the QUALIFY clause, which is not available in all supported target database products. If you cannot use QUALIFY, configure an OVERRIDE load step instead.
First, add a new load for a satellite entity and create an entity mapping as instructed above. Then, edit the mapping and transformation for dv_load_time:
OPTION 1: Use a change timestamp from the source system if one is available and if you can trust it. Map the source change timestamp to dv_load_time.
OPTION 2: Use the current timestamp as a basis but order the versions over dv_id by a source attribute that defines the correct order, e.g. a batch id. Then generate unique load timestamps for each version from the current timestamp, separated by e.g. a millisecond, as in the example:
-- CUSTOM transformation for dv_load_time
dateadd(
    'millisecond',
    row_number() over (partition by dv_id order by stg_batch_id asc) - 1,
    current_timestamp
)
To discard unnecessary rows where data has not changed, we can add a QUALIFY clause using OPT_WHERE:
1=1
qualify nvl(dv_datahash !=
lag(dv_datahash, 1) over (partition by dv_id order by stg_batch_id asc), true)
Where:
1=1 is part of the WHERE clause and allows us to use OPT_WHERE for QUALIFY as a workaround.
Incoming rows are partitioned by dv_id and ordered by the chosen source attribute that defines the correct order, here stg_batch_id.
dv_datahash is compared to dv_datahash of the previous row using LAG and the row is qualified if it does not match, i.e. data has changed between versions.
The comparison returns null if a previous row cannot be found. NVL returns true to qualify first rows.
Since this is an initial load and the satellite should be empty, you can also set OPT_SKIP_DUPLICATE_PREVENTION = true or at least OPT_SKIP_DISTINCT_ONLY = true.
If you need to iterate in your development process and run this load multiple times, you might add a PRE step to truncate the table. However, this might be risky if the initial load is later triggered by accident.
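As a sketch, such a PRE step for the example entity could be as simple as:
TRUNCATE TABLE rdv.S_EXAMPLE;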
Example load SQL:
INSERT INTO rdv.S_EXAMPLE (
...
)
SELECT
...
FROM (
SELECT
...
, dateadd(
'millisecond',
row_number() over (partition by dv_id order by stg_batch_id asc) - 1,
current_timestamp
) AS dv_load_time
...
FROM
staging.STG_EXAMPLE src_entity
WHERE
1=1
QUALIFY
nvl(dv_datahash !=
lag(dv_datahash, 1) over (partition by dv_id order by stg_batch_id asc), true)
) src
Configuring a custom comparison hash transformation
CONFIG_LOAD_TRANSFORMATIONS can be used to customize load transformations. This example defines the COMPARISON_HASH transformation as an alternative to the default HASH transformation in Agile Data Engine:
Case-sensitive
Nulls converted to '-1'
Uses MD5 hash
Configuration JSON including examples for Snowflake & Synapse:
{
"packageName": "CONFIG_LOAD_TRANSFORMATIONS",
...
"attrTransformationTypes": [
{
"transformationType": "COMPARISON_HASH",
"shortDescription": "Comparison hash transformation",
"description": "Hash transformation for change comparison: Case-sensitive, no trimming, nulls converted to '-1'.",
"isShownInDesigner": true,
"ordinalPositionInDesigner": 21
},
...
],
"dbmsTransformationFormulas": [
{
"transformationType": "COMPARISON_HASH",
"dbmsProduct": "SNOWFLAKE",
"transformationFormula": "<@compress single_line=true>MD5(<#list sourceAttributes as attribute> <#if attribute.datatype?? && attribute.datatype == \"GEOGRAPHY\"> NVL(ST_ASWKT(${attribute.attributeName}), '-1') <#else> NVL(CAST(${attribute.attributeName} AS VARCHAR), '-1') </#if> <#if attribute_has_next> || '~' ||</#if></#list>)</@compress>"
},
{
"transformationType": "COMPARISON_HASH",
"dbmsProduct": "MS_SQL_DW",
"transformationFormula": "<@compress single_line=true>CONVERT(CHAR(32),HASHBYTES('MD5',<#list sourceAttributes as attribute> COALESCE(CAST(${attribute.attributeName} AS VARCHAR), '-1')<#if attribute_has_next> + '~' + </#if></#list>), 2)</@compress>"
},
...
]
}
More examples available on GitHub
Our latest default configuration contains a similar comparison hash transformation. You can find it and other examples in this GitHub repository.
Example SQL output in Snowflake using COMPARISON_HASH:
MD5(
NVL(CAST(attr1 AS VARCHAR), '-1') || '~' ||
NVL(CAST(attr2 AS VARCHAR), '-1') || '~' ||
...
)
Compared to using HASH (default):
MD5(
UPPER(NVL(CAST(attr1 AS VARCHAR), '-1')) || '~' ||
UPPER(NVL(CAST(attr2 AS VARCHAR), '-1')) || '~' ||
...
)