Hub
Data Vault hubs are unique lists of business keys which:
- Have the same meaning and grain throughout the business
- Bind data to business processes
- Are immutable, or change only rarely
- Should be understandable to humans
All business keys from every source system representing the same business entity are collected into the same hub whenever technically possible.
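For a concrete picture, a hub built with the default attribute names used in this guide could look roughly like the sketch below. This is illustrative only: the actual DDL is generated by Agile Data Engine, and the schema, entity name and data types shown here are assumptions.

-- Illustrative sketch only; schema, entity name and data types are assumptions.
CREATE TABLE rdv.H_CUSTOMER (
    dv_id         CHAR(32)     NOT NULL,  -- hash of the business key (HASH)
    dv_load_time  TIMESTAMP    NOT NULL,  -- load timestamp (CURRENT_TS)
    dv_run_id     INTEGER      NOT NULL,  -- run ID (RUN_ID)
    dv_source     VARCHAR(255) NOT NULL,  -- source system name
    business_key  VARCHAR(255) NOT NULL   -- human-readable business key (DV_BUSINESS_KEY)
);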
Creating a hub
Follow these steps to create a hub:
Select or create a package and create a new entity, choosing type HUB and physical type TABLE.
Navigate to the source entity (staging table) and create an outgoing load, selecting the hub as the target entity.
Create an entity mapping as follows:
Source attribute | Transformation | Target attribute | Notes |
---|---|---|---|
Business key attribute(s) in source entity. | HASH | dv_id | Map the source attribute(s) for the business key. |
- | CURRENT_TS | dv_load_time | Load timestamp, does not need any modifications by default. |
- | RUN_ID | dv_run_id | Run ID (see Using Run ID Logic), does not need any modifications by default. |
Source name attribute in source entity. | NONE / CUSTOM | dv_source | Best practice is to have the source name as a metadata attribute on staging entities and to map it to Data Vault entities without any transformations. Alternatively, a CUSTOM transformation can be used e.g. with a source name or an environment variable given in single quotes. However, it is best to avoid the CUSTOM transformation type if possible. |
Business key attribute(s) in source entity. | DV_BUSINESS_KEY | business_key | Create a new attribute in the hub for the business key. Map the same attributes as mapped to dv_id. The DV_BUSINESS_KEY transformation applies the same transformations as HASH but without the hashing. |
Use the Export SQL feature to preview the entity SQL and load SQL.
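To make the mapping above concrete, here is a rough, hand-written sketch of the kind of load statement it corresponds to on Snowflake. It is not the Export SQL output: the hash expression follows the default HASH example shown at the end of this page, and the entity names, the business key attribute customer_number and the duplicate-key check are simplified assumptions.

-- Illustrative sketch only; use Export SQL to see the actual generated statements.
INSERT INTO rdv.H_CUSTOMER (dv_id, dv_load_time, dv_run_id, dv_source, business_key)
SELECT DISTINCT
    MD5(UPPER(NVL(s.customer_number, '-1')))  AS dv_id,         -- HASH of the business key
    CURRENT_TIMESTAMP()                       AS dv_load_time,  -- CURRENT_TS
    0                                         AS dv_run_id,     -- placeholder; RUN_ID is supplied at run time
    s.stg_source_system                       AS dv_source,     -- source name metadata attribute
    UPPER(NVL(s.customer_number, '-1'))       AS business_key   -- DV_BUSINESS_KEY (same as HASH, unhashed)
FROM staging.STG_CUSTOMER s
WHERE NOT EXISTS (
    SELECT 1 FROM rdv.H_CUSTOMER h
    WHERE h.dv_id = MD5(UPPER(NVL(s.customer_number, '-1')))
);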
Naming conventions
Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions. See Customizing default attributes and transformations for more details.
Notes
Hubs are usually loaded from multiple different source entities or even from multiple attributes in a single source entity. Each source can be added as a separate incoming load for the hub.
The most common pattern is to load hubs from all sources where the business key is present to ensure referential integrity. However, sometimes this is not practical, e.g. due to performance issues. For example, an accounting system may provide a master list of distinct accounts which are then referred to in accounting documents. If the number of accounting documents is very large, it might make sense to only load the hub from the master account list.
Quite often, natural business keys that are present universally across all source systems cannot be found. Therefore, source system surrogate keys are commonly used as hub business keys. See the example below on how to map the source system name as part of the business key to avoid collisions between surrogate keys coming from different source systems.
Data Vault uses hash keys by default, and all key attributes are cast to string before hashing. If your business key contains date, time or other attributes that can be represented in multiple different string formats depending, for example, on target database default settings, it is advisable to cast these attributes explicitly to a given string format. Otherwise, changes to the target database default settings may change the output format of implicit casts and break the model. These casts can be configured as part of the hash key transformation (see the example below).
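For example, on Snowflake (illustrative; the column and table names here are made up):

-- The implicit cast follows the session's timestamp output format settings,
-- so the hashed string can silently change if account or session defaults change.
SELECT CAST(order_time AS VARCHAR) FROM orders;

-- An explicit format always yields the same string regardless of session settings.
SELECT TO_VARCHAR(order_time, 'YYYYMMDDTHH24MISSFF5') FROM orders;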
If you want to change the default configuration of any entity type, including hubs and other Data Vault entities, it can be done using the CONFIG_ENTITY_DEFAULTS configuration package. Similarly, transformations can be edited with CONFIG_LOAD_TRANSFORMATIONS.
Examples
Using source system name as part of business key
In this example we are loading hub H_CUSTOMER from source systems A and B with the following premises:
- There is no universal natural business key that could be used for customers → source system technical keys are the only option
- Customers in the systems can partially overlap, i.e. the same customer might be found in both systems
- Systems A and B do not share customer keys
- Both systems use running integer keys for customers → keys that do not have the same meaning can collide
In this case, we can create loads from both source systems to the same hub, using the source system name as part of the key.
Entity mappings in loads:
Source attribute | Transformation | Target attribute | Notes |
---|---|---|---|
stg_source_system & customer_key_a / customer_key_b | HASH | dv_id | Source system name and customer key mapped to the dv_id in loads from both source systems. |
- | CURRENT_TS | dv_load_time | Load timestamp, does not need any modifications by default. |
- | RUN_ID | dv_run_id | Run ID (see Using Run ID Logic), does not need any modifications by default. |
Source name attribute in source entity. | NONE / CUSTOM | dv_source | Best practice is to have the source name as a metadata attribute on staging entities and to map it to Data Vault entities without any transformations. Alternatively, a CUSTOM transformation can be used e.g. with a source name or an environment variable given in single quotes. However, it is best to avoid the CUSTOM transformation type if possible. |
stg_source_system & customer_key_a / customer_key_b | DV_BUSINESS_KEY | business_key | Source system name and customer key mapped to business_key in loads from both source systems. |
Note that these loads result in a hub where all distinct customer keys from both systems are on separate rows, even those that refer to the same customer. If a mapping connecting system A and B customer keys is available, you can create a same-as link to identify which keys belong to the same customer.
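Following the default HASH example at the end of this page, the resulting dv_id expressions would look roughly like this (attribute names as in the table above; the exact output depends on your transformation configuration):

-- Load from system A (illustrative):
MD5(UPPER(NVL(stg_source_system, '-1') || '~' || NVL(CAST(customer_key_a AS VARCHAR), '-1')))

-- Load from system B (illustrative):
MD5(UPPER(NVL(stg_source_system, '-1') || '~' || NVL(CAST(customer_key_b AS VARCHAR), '-1')))

Because the source system name is part of the hashed value, customer key 1001 from system A and customer key 1001 from system B produce different dv_id values and land on separate rows.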
Configuring a custom hash key transformation
CONFIG_LOAD_TRANSFORMATIONS can be used to customize load transformations. This example defines a HASH_KEY transformation as an alternative to the default HASH transformation in Agile Data Engine. The transformation:
- Is case-insensitive
- Trims source string attributes
- Converts nulls to '-1'
- Casts date and time attributes explicitly to given string formats
- Uses an MD5 hash
Configuration JSON including examples for Snowflake & Synapse:
{
"packageName": "CONFIG_LOAD_TRANSFORMATIONS",
...
"attrTransformationTypes": [
{
"transformationType": "HASH_KEY",
"shortDescription": "Hash key transformation",
"description": "Hash transformation for keys: Case-insensitive, trimmed, nulls converted to '-1', date/time formats explicitly cast as string, MD5-hashed.",
"isShownInDesigner": true,
"ordinalPositionInDesigner": 20
},
...
],
"dbmsTransformationFormulas": [
{
"transformationType": "HASH_KEY",
"dbmsProduct": "SNOWFLAKE",
"transformationFormula": "<@compress single_line=true>MD5(<#list sourceAttributes as attribute> <#if attribute.datatype?? && attribute.datatype == \"DATE\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDD'), '-1') <#elseif attribute.datatype?? && (attribute.datatype?lower_case?contains(\"varchar\") || attribute.datatype?lower_case?contains(\"char\"))> NVL(UPPER(TRIM(${attribute.attributeName})), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDDTHH24MISSFF5'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP_TZ\"> NVL(TO_VARCHAR(${attribute.attributeName}, 'YYYYMMDDTHH24MISSFF5TZHTZM'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"GEOGRAPHY\"> NVL(ST_ASWKT(${attribute.attributeName}), '-1') <#else> NVL(UPPER(CAST(${attribute.attributeName} AS VARCHAR)), '-1') </#if> <#if attribute_has_next> || '~' ||</#if></#list>)</@compress>"
},
{
"transformationType": "HASH_KEY",
"dbmsProduct": "MS_SQL_DW",
"transformationFormula": "<@compress single_line=true>CONVERT(CHAR(32),HASHBYTES('MD5',<#list sourceAttributes as attribute><#if attribute.datatype?? && attribute.datatype == \"DATE\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMdd'), '-1') <#elseif attribute.datatype?? && ( attribute.datatype?lower_case?contains(\"varchar\") || attribute.datatype?lower_case?contains(\"char\") )>COALESCE(UPPER(TRIM(${attribute.attributeName})), '-1')<#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMddTHHmmssffffff'), '-1') <#elseif attribute.datatype?? && attribute.datatype == \"TIMESTAMP_TZ\"> COALESCE(FORMAT(${attribute.attributeName}, 'yyyyMMddTHHmmssffffffz'), '-1') <#elseif attribute.datatype?? && attribute.datatype?lower_case?contains(\"bool\")>COALESCE(CAST(CAST([${attribute.attributeName}] AS INT) AS VARCHAR), '-1')<#else>COALESCE(UPPER(CAST([${attribute.attributeName}] AS VARCHAR)), '-1') </#if><#if attribute_has_next> + '~' + </#if></#list>), 2)</@compress>"
},
...
]
}
More examples available in GitHub
Our latest default configuration contains a similar hash key transformation. You can find it and other examples in this GitHub repository.
Example SQL output in Snowflake using HASH_KEY:
MD5(
NVL(UPPER(TRIM(string_attribute)), '-1') || '~' ||
NVL(TO_VARCHAR(timestamp_attribute, 'YYYYMMDDTHH24MISSFF5'), '-1')
)
Compared to using HASH (default):
MD5(
UPPER(NVL(string_attribute, '-1') || '~' ||
NVL(CAST(timestamp_attribute AS VARCHAR), '-1'))
)
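For reference, the MS_SQL_DW (Synapse) HASH_KEY formula above expands to roughly the following for the same two attributes:

CONVERT(CHAR(32), HASHBYTES('MD5',
    COALESCE(UPPER(TRIM(string_attribute)), '-1') + '~' +
    COALESCE(FORMAT(timestamp_attribute, 'yyyyMMddTHHmmssffffff'), '-1')
), 2)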