Hub
Data Vault hubs are unique lists of business keys which:
- Have the same meaning and grain throughout the business
- Bind data to business processes
- Are immutable, or change only rarely
- Should be understandable to humans
All business keys from every source system representing the same business entity are collected into the same hub whenever technically possible.
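For a concrete picture, a hub built with the default attribute names used in this guide could look roughly like the sketch below. This is illustrative only: the actual DDL is generated by Agile Data Engine, and the schema, entity name and data types shown here are assumptions.

-- Illustrative sketch only; schema, entity name and data types are assumptions.
CREATE TABLE rdv.H_CUSTOMER (
    dv_id         CHAR(32)     NOT NULL,  -- hash of the business key (HASH)
    dv_load_time  TIMESTAMP    NOT NULL,  -- load timestamp (CURRENT_TS)
    dv_run_id     INTEGER      NOT NULL,  -- run ID (RUN_ID)
    dv_source     VARCHAR(255) NOT NULL,  -- source system name
    business_key  VARCHAR(255) NOT NULL   -- human-readable business key (DV_BUSINESS_KEY)
);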
Creating a hub
Follow these steps to create a hub:
Select or create a package and create a new entity, choosing type HUB and physical type TABLE.
Navigate to the source entity (staging table) and create an outgoing load, selecting the hub as the target entity.
Create an entity mapping as follows:
Source attribute | Transformation | Target attribute | Notes |
---|---|---|---|
Business key attribute(s) in source entity. | HASH | dv_id | Map the source attribute(s) for the business key. |
- | CURRENT_TS | dv_load_time | Load timestamp, does not need any modifications by default. |
- | RUN_ID | dv_run_id | Run ID (see Using Run ID Logic), does not need any modifications by default. |
Source name attribute in source entity. | NONE / CUSTOM | dv_source | Best practice is to have the source name as a metadata attribute on staging entities and to map it to Data Vault entities without any transformations. Alternatively, a CUSTOM transformation can be used e.g. with a source name or an environment variable given in single quotes. However, it is best to avoid the CUSTOM transformation type if possible. |
Business key attribute(s) in source entity. | DV_BUSINESS_KEY | business_key | Create a new attribute in the hub for the business key. Map the same attributes as mapped to dv_id. The DV_BUSINESS_KEY transformation applies the same transformations as HASH but without the hashing. |
Use the Export SQL feature to preview the entity SQL and load SQL.
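To make the mapping above concrete, here is a rough, hand-written sketch of the kind of load statement it corresponds to on Snowflake. It is not the Export SQL output: the hash expression follows the default HASH example shown at the end of this page, and the entity names, the business key attribute customer_number and the duplicate-key check are simplified assumptions.

-- Illustrative sketch only; use Export SQL to see the actual generated statements.
INSERT INTO rdv.H_CUSTOMER (dv_id, dv_load_time, dv_run_id, dv_source, business_key)
SELECT DISTINCT
    MD5(UPPER(NVL(s.customer_number, '-1')))  AS dv_id,         -- HASH of the business key
    CURRENT_TIMESTAMP()                       AS dv_load_time,  -- CURRENT_TS
    0                                         AS dv_run_id,     -- placeholder; RUN_ID is supplied at run time
    s.stg_source_system                       AS dv_source,     -- source name metadata attribute
    UPPER(NVL(s.customer_number, '-1'))       AS business_key   -- DV_BUSINESS_KEY (same as HASH, unhashed)
FROM staging.STG_CUSTOMER s
WHERE NOT EXISTS (
    SELECT 1 FROM rdv.H_CUSTOMER h
    WHERE h.dv_id = MD5(UPPER(NVL(s.customer_number, '-1')))
);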
Naming conventions
Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions. See Customizing default attributes and transformations for more details.
Notes
Hubs are usually loaded from multiple different source entities or even from multiple attributes in a single source entity. Each source can be added as a separate incoming load for the hub.
The most common pattern is to load hubs from all sources where the business key is present to ensure referential integrity. However, sometimes this is not practical, e.g. due to performance issues. For example, an accounting system may provide a master list of distinct accounts which are then referred to in accounting documents. If the number of accounting documents is very large, it might make sense to only load the hub from the master account list.
Quite often, natural business keys that are present universally across all source systems cannot be found. Therefore, source system surrogate keys are commonly used as hub business keys. See the example below on how to map the source system name as part of the business key to avoid collisions between surrogate keys coming from different source systems.
Data Vault uses hash keys by default, and all key attributes are cast to string before hashing. If your business key contains date, time or other attributes that can be represented in multiple different string formats depending, for example, on target database default settings, it is advisable to cast these attributes explicitly to a given string format. Otherwise, changes to the target database default settings may change the output format of implicit casts and break the model. These casts can be configured as part of the hash key transformation (see the example below).
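For example, on Snowflake (illustrative; the column and table names here are made up):

-- The implicit cast follows the session's timestamp output format settings,
-- so the hashed string can silently change if account or session defaults change.
SELECT CAST(order_time AS VARCHAR) FROM orders;

-- An explicit format always yields the same string regardless of session settings.
SELECT TO_VARCHAR(order_time, 'YYYYMMDDTHH24MISSFF5') FROM orders;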
If you want to change the default configuration of any entity type, including hubs and other Data Vault entities, it can be done using the CONFIG_ENTITY_DEFAULTS configuration package. Similarly, transformations can be edited with CONFIG_LOAD_TRANSFORMATIONS.
Examples
Using source system name as part of business key
In this example we are loading hub H_CUSTOMER from source systems A and B with the following premises:
- There is no universal natural business key that could be used for customers → source system technical keys are the only option
- Customers in the systems can partially overlap, i.e. the same customer might be found in both systems
- Systems A and B do not share customer keys
- Both systems use running integer keys for customers → keys that do not have the same meaning can collide
In this case, we can create loads from both source systems to the same hub, using the source system name as part of the key.
Entity mappings in loads:
Source attribute | Transformation | Target attribute | Notes |
---|---|---|---|
stg_source_system & customer_key_a / customer_key_b | HASH | dv_id | Source system name and customer key mapped to the dv_id in loads from both source systems. |
- | CURRENT_TS | dv_load_time | Load timestamp, does not need any modifications by default. |
- | RUN_ID | dv_run_id | Run ID (see Using Run ID Logic), does not need any modifications by default. |
Source name attribute in source entity. | NONE / CUSTOM | dv_source | Best practice is to have the source name as a metadata attribute on staging entities and to map it to Data Vault entities without any transformations. Alternatively, a CUSTOM transformation can be used e.g. with a source name or an environment variable given in single quotes. However, it is best to avoid the CUSTOM transformation type if possible. |
stg_source_system & customer_key_a / customer_key_b | DV_BUSINESS_KEY | business_key | Source system name and customer key mapped to business_key in loads from both source systems. |
Note that these loads result in a hub where all distinct customer keys from both systems are on separate rows, even those that refer to the same customer. If a mapping connecting system A and B customer keys is available, you can create a same-as link to identify which keys belong to the same customer.
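Following the default HASH example at the end of this page, the resulting dv_id expressions would look roughly like this (attribute names as in the table above; the exact output depends on your transformation configuration):

-- Load from system A (illustrative):
MD5(UPPER(NVL(stg_source_system, '-1') || '~' || NVL(CAST(customer_key_a AS VARCHAR), '-1')))

-- Load from system B (illustrative):
MD5(UPPER(NVL(stg_source_system, '-1') || '~' || NVL(CAST(customer_key_b AS VARCHAR), '-1')))

Because the source system name is part of the hashed value, customer key 1001 from system A and customer key 1001 from system B produce different dv_id values and land on separate rows.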
Configuring a custom hash key transformation
CONFIG_LOAD_TRANSFORMATIONS can be used to customize load transformations. This example defines a HASH_KEY transformation as an alternative to the default HASH transformation in Agile Data Engine. The transformation:
- Is case-insensitive
- Trims source string attributes
- Converts nulls to '-1'
- Casts date and time attributes explicitly to given string formats
- Uses an MD5 hash
Configuration JSON including examples for Snowflake & Synapse:
{
"packageName": "CONFIG_LOAD_TRANSFORMATIONS",
...
"attrTransformationTypes": [
{
"transformationType": "HASH_KEY",
"shortDescription": "Hash key transformation",
"description": "Hash transformation for keys: Case-insensitive, trimmed, nulls converted to '-1', date/time formats explicitly cast as string, MD5-hashed.",
"isShownInDesigner": true,
"ordinalPositionInDesigner": 20
},
...
],
"dbmsTransformationFormulas": [
{
"transformationType": "HASH_KEY",
"dbmsProduct": "SNOWFLAKE",
"transformationFormula": "<@compress single_line=true>MD5(<#list sourceAttributes as attribute> <#if attribute.datatype?? && attribute.datatype == \"DATE\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDD'), '-1') <#elseif attribute.datatype?? && (attribute.datatype?lower_case?contains(\"varchar\") || attribute.datatype?lower_case?contains(\"char\"))> NVL(UPPER(TRIM(${attribute.attributeName})), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDDTHH24MISSFF5'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP_TZ\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDDTHH24MISSFF5TZHTZM'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"GEOGRAPHY\"> NVL(ST_ASWKT(${attribute.attributeName}), '-1') <#else> NVL(UPPER(CAST(${attribute.attributeName} AS VARCHAR)), '-1') </#if> <#if attribute_has_next> || '~' ||</#if></#list>)</@compress>"
},
{
"transformationType": "HASH_KEY",
"dbmsProduct": "MS_SQL_DW",
"transformationFormula": "<@compress single_line=true>CONVERT(CHAR(32),HASHBYTES('MD5',<#list sourceAttributes as attribute><#if attribute.datatype?? && attribute.datatype == \"DATE\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMdd'), '-1') <#elseif attribute.datatype?? && ( attribute.datatype?lower_case?contains(\"varchar\") || attribute.datatype?lower_case?contains(\"char\") )>COALESCE(UPPER(TRIM(${attribute.attributeName})), '-1')<#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMddTHHmmssffffff'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP_TZ\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMddTHHmmssffffffz'), '-1') <#elseif attribute.datatype?? && attribute.datatype?lower_case?contains(\"bool\")>COALESCE(CAST(CAST([${attribute.attributeName}] AS INT) AS VARCHAR), '-1')<#else>COALESCE(UPPER(CAST([${attribute.attributeName}] AS VARCHAR)), '-1') </#if><#if attribute_has_next> + '~' + </#if></#list>), 2)</@compress>"
},
...
]
}
More examples available in GitHub
Our latest default configuration contains a similar hash key transformation. You can find it and other examples in this GitHub repository.
Example SQL output in Snowflake using HASH_KEY:
MD5(
NVL(UPPER(TRIM(string_attribute)), '-1') || '~' ||
NVL(TO_VARCHAR(timestamp_attribute, 'YYYYMMDDTHH24MISSFF5'), '-1')
)
Compared to using HASH (default):
MD5(
UPPER(NVL(string_attribute, '-1') || '~' ||
NVL(CAST(timestamp_attribute AS VARCHAR), '-1'))
)
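For reference, the MS_SQL_DW (Synapse) HASH_KEY formula above expands to roughly the following for the same two attributes:

CONVERT(CHAR(32), HASHBYTES('MD5',
    COALESCE(UPPER(TRIM(string_attribute)), '-1') + '~' +
    COALESCE(FORMAT(timestamp_attribute, 'yyyyMMddTHHmmssffffff'), '-1')
), 2)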