This documentation provides practical examples for implementing a sensitive data vault model, inspired by the ADE blog post on data modeling for data protection. It focuses on segregating sensitive and non-sensitive data within the data warehouse to comply with regulations like the General Data Protection Regulation (GDPR), particularly concerning individuals' rights to access and erase their personal data.
In data vault implementations, hashed business keys are usually used as entity identifiers. However, this approach poses challenges when handling sensitive business keys, such as social security numbers: because hashing is deterministic, anyone who knows or can enumerate candidate business keys can recompute the hashes and re-identify individuals. To address this, the examples use random surrogate keys and key lookups when linking sensitive entities. This strategy allows sensitive data to be removed without compromising the integrity of the overall data model.
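To make the difference concrete, the following Databricks SQL sketch contrasts the two key strategies (the social security number literal is a made-up illustration):

```sql
-- Deterministic hash: anyone who knows or can enumerate candidate keys
-- can recompute this value and re-identify the row, even after the
-- source data has been deleted.
SELECT MD5('123-45-6789') AS hashed_business_key;

-- Random surrogate: derived from a freshly generated UUID, so the key has
-- no mathematical relationship to the business key and cannot be reversed.
SELECT MD5(UUID()) AS random_surrogate_key;
```

The connection between a random surrogate key and the person exists only as a row in the sensitive hub; deleting that row severs the connection permanently.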
The examples use the default transformation types defined in the GitHub repository agile-data-engine/ade-default-configuration. However, the approach is compatible with most default data vault transformations for hash keys, business keys, etc.
SENSITIVE_DV_GENERATED_ID
This transformation type is used to generate a hashed, randomized dv_id for hubs that contain sensitive business keys.
Databricks SQL
SQL
MD5(UUID())
Google BigQuery
SQL
TO_HEX(MD5(GENERATE_UUID()))
Snowflake
SQL
MD5(UUID_STRING())
SENSITIVE_DV_ID_LOOKUP
This transformation type is used in conjunction with FTL templates to look up the randomized dv_id from a sensitive hub when loading satellites and links that reference a sensitive hub. This transformation can also be used with status satellites when the driving key is the sensitive business key. The sensitive hub should be included as the second entity mapping in the load configuration to enable this lookup.
Databricks SQL, Google BigQuery, Snowflake
<#-- Gets attribute name for DV_HASHKEY from the second entity mapping -->
<#if model.load.sources?size != 2>
<#stop "Error: SENSITIVE_DV_ID_LOOKUP transformation requires two entity mappings.">
</#if>
<#assign referenceTable = loads.getSourceEntity(2)>
<#assign referenceTableKey = loads.getAttributeForType(referenceTable.attributes, "DV_HASHKEY")>
src_lookup.${referenceTableKey.name}
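For example, when the second entity mapping is a sensitive hub whose DV_HASHKEY attribute is named dv_id, the template renders to:

```sql
src_lookup.dv_id
```

The src_lookup alias refers to the sensitive hub joined into the load by the generated SQL, as shown in the satellite and link examples below.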
SENSITIVE_DV_LINK_ID
Link dv_ids are typically composed of all the attributes used in the link's hub references. This transformation is used to incorporate the looked-up, randomized hub key into the link's dv_id when the link references a sensitive hub. It can also be used with link status satellites.
Optionally, custom entity types can be configured, for example, for sensitive hubs and satellites containing sensitive attributes. This allows for the use of custom naming conventions, assignment to a separate default schema, and application of other default configurations tailored to entities that handle personally identifiable information (PII).
Examples
Sensitive hub
In this example, we will create a sensitive hub for customers, which will store identifiable customer keys:
Entity type
HUB
Entity name
H_CUSTOMER
Schema
sdv
Note that the schema is set to sdv, which differs from the default rdv (raw data vault) schema. This separation helps isolate sensitive information and, for example, simplifies the management of access rights.
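For example, in Snowflake, access rights to the two schemas can then be managed independently (the role names here are hypothetical illustrations):

```sql
-- Broad analyst access to the non-sensitive raw data vault
GRANT USAGE ON SCHEMA rdv TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA rdv TO ROLE analyst;

-- Restricted access to the sensitive data vault schema
GRANT USAGE ON SCHEMA sdv TO ROLE pii_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA sdv TO ROLE pii_reader;
```

Keeping all PII in a dedicated schema means such grants only need to be maintained in one place.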
A load and an entity mapping from staging is created as follows:
STG_CUSTOMER
Transformation
H_CUSTOMER
-
SENSITIVE_DV_GENERATED_ID
dv_id
-
CURRENT_TS
dv_load_time
-
RUN_ID
dv_run_id
stg_source_system
-
dv_source_system
stg_source_entity
-
dv_source_entity
-
PACKAGE_VERSION
dv_package_version
-
LOAD_NAME
dv_load_name
customerid
BUSINESS_KEY
business_key
An FTL template reference is added as an OVERRIDE load step:
Databricks SQL
SQL
INSERT INTO sdv.H_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
)
SELECT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT
MD5(UUID()) AS dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT DISTINCT
CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_h_customer_from_stg_customer_01' AS dv_load_name
, UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) AS business_key
FROM staging.STG_CUSTOMER src_entity
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
sdv.H_CUSTOMER trg
WHERE
trg.business_key = src.business_key
);
Google BigQuery
SQL
INSERT INTO sdv.H_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
)
SELECT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT
TO_HEX(MD5(GENERATE_UUID())) AS dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT DISTINCT
CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_h_customer_from_stg_customer_01' AS dv_load_name
, UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) AS business_key
FROM staging.STG_CUSTOMER src_entity
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
sdv.H_CUSTOMER trg
WHERE
trg.business_key = src.business_key
);
Snowflake
SQL
INSERT INTO sdv.H_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
)
SELECT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT
MD5(UUID_STRING()) AS dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, business_key
FROM (
SELECT DISTINCT
CURRENT_TIMESTAMP::timestamp_ntz AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_h_customer_from_stg_customer_01' AS dv_load_name
, UPPER(NVL(NULLIF(TRIM(CAST(src_entity.customerid AS VARCHAR)), ''), '-1')) AS business_key
FROM staging.STG_CUSTOMER src_entity
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
sdv.H_CUSTOMER trg
WHERE
trg.business_key = src.business_key
);
Satellite for a sensitive hub
In this example, we will create a satellite for the sensitive customer hub to store sensitive customer attributes:
Entity type
SAT
Entity name
S_CUSTOMER
Schema
sdv
A good practice is to separate sensitive and non-sensitive attributes into different satellites, placing sensitive data in the sdv schema and non-sensitive data in the standard rdv schema. This separation supports clearer data governance and simplifies tasks such as planning the erasure of identifiable information. However, it is not always straightforward to determine which attributes are sensitive or whether combinations of attributes could lead to re-identification. These decisions should always be made carefully and in collaboration with subject matter experts.
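As a sketch of how this model supports erasure requests, a hypothetical cleanup for a single customer deletes only from the sdv schema; the business key value below is illustrative:

```sql
-- Remove the sensitive attributes for one customer
DELETE FROM sdv.S_CUSTOMER
WHERE dv_id = (SELECT dv_id FROM sdv.H_CUSTOMER WHERE business_key = '12345');

-- Remove the key lookup row itself
DELETE FROM sdv.H_CUSTOMER
WHERE business_key = '12345';
```

After the hub row is gone, the randomized dv_id values remaining in rdv links and satellites can no longer be resolved back to the person, so the non-sensitive history stays intact and usable for analytics.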
A load and two entity mappings are added:
Source entity
Position
Type
STG_CUSTOMER
1
SOURCE (Driving Run ID Logic)
H_CUSTOMER
2
SOURCE
Attribute mapping from staging:
STG_CUSTOMER
Transformation
S_CUSTOMER
customerid
SENSITIVE_DV_ID_LOOKUP
dv_id
-
CURRENT_TS
dv_load_time
-
RUN_ID
dv_run_id
stg_source_system
-
dv_source_system
stg_source_entity
-
dv_source_entity
-
PACKAGE_VERSION
dv_package_version
-
LOAD_NAME
dv_load_name
first_name
last_name
postal_address
phone_number
HASH_DIFF
dv_datahash
first_name
-
first_name
last_name
-
last_name
postal_address
-
postal_address
phone_number
-
phone_number
An FTL template reference is added as an OVERRIDE load step:
Databricks SQL
SQL
INSERT INTO sdv.S_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
FROM (
SELECT
src_lookup.dv_id AS dv_id
, CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, stg_source_system AS dv_source_system
, stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_s_customer_from_stg_customer_01' AS dv_load_name
, MD5(COALESCE(first_name, '-1') || '~' || COALESCE(last_name, '-1') || '~' || COALESCE(postal_address, '-1') || '~' || COALESCE(phone_number, '-1')) AS dv_datahash
, first_name AS first_name
, last_name AS last_name
, postal_address AS postal_address
, phone_number AS phone_number
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM (
SELECT
t1.dv_id
,t1.dv_datahash
FROM (
SELECT
s.dv_id
,s.dv_datahash
,ROW_NUMBER()
OVER (
PARTITION BY
s.dv_id
ORDER BY
s.dv_load_time DESC
)
AS
dv_load_time_last
FROM
sdv.S_CUSTOMER s
) t1
WHERE
t1.dv_load_time_last = 1
) trg
WHERE
trg.dv_datahash = src.dv_datahash
AND
trg.dv_id = src.dv_id
);
Google BigQuery
SQL
INSERT INTO sdv.S_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
FROM (
SELECT
src_lookup.dv_id AS dv_id
, CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, stg_source_system AS dv_source_system
, stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_s_customer_from_stg_customer_01' AS dv_load_name
, TO_HEX(MD5(CONCAT(COALESCE(first_name, '-1') ,'~',COALESCE(last_name, '-1') ,'~',COALESCE(postal_address, '-1') ,'~',COALESCE(phone_number, '-1')))) AS dv_datahash
, first_name AS first_name
, last_name AS last_name
, postal_address AS postal_address
, phone_number AS phone_number
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM (
SELECT
t1.dv_id
,t1.dv_datahash
FROM (
SELECT
s.dv_id
,s.dv_datahash
,ROW_NUMBER()
OVER (
PARTITION BY
s.dv_id
ORDER BY
s.dv_load_time DESC
)
AS
dv_load_time_last
FROM
sdv.S_CUSTOMER s
) t1
WHERE
t1.dv_load_time_last = 1
) trg
WHERE
trg.dv_datahash = src.dv_datahash
AND
trg.dv_id = src.dv_id
);
Snowflake
SQL
INSERT INTO sdv.S_CUSTOMER (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_datahash
, first_name
, last_name
, postal_address
, phone_number
FROM (
SELECT
src_lookup.dv_id AS dv_id
, CURRENT_TIMESTAMP::timestamp_ntz AS dv_load_time
, <targetrunid> AS dv_run_id
, stg_source_system AS dv_source_system
, stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_s_customer_from_stg_customer_01' AS dv_load_name
, MD5(NVL(first_name, '-1') || '~' || NVL(last_name, '-1') || '~' || NVL(postal_address, '-1') || '~' || NVL(phone_number, '-1')) AS dv_datahash
, first_name AS first_name
, last_name AS last_name
, postal_address AS postal_address
, phone_number AS phone_number
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(NVL(NULLIF(TRIM(CAST(src_entity.customerid AS VARCHAR)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM (
SELECT
t1.dv_id
,t1.dv_datahash
FROM (
SELECT
s.dv_id
,s.dv_datahash
,ROW_NUMBER()
OVER (
PARTITION BY
s.dv_id
ORDER BY
s.dv_load_time DESC
)
AS
dv_load_time_last
FROM
sdv.S_CUSTOMER s
) t1
WHERE
t1.dv_load_time_last = 1
) trg
WHERE
trg.dv_datahash = src.dv_datahash
AND
trg.dv_id = src.dv_id
);
Link
In this example, we will create a link that references a sensitive hub:
Entity type
LINK
Entity name
L_CUSTOMER_STORE_TERRITORY
Schema
rdv
Note that unlike sensitive hubs and satellites, the link is assigned to the rdv schema, since it does not contain any sensitive attributes.
A load and two entity mappings are added:
Source entity
Position
Type
STG_CUSTOMER
1
SOURCE (Driving Run ID Logic)
H_CUSTOMER
2
SOURCE
Attribute mapping from staging:
STG_CUSTOMER
Transformation
L_CUSTOMER_STORE_TERRITORY
storeid
territoryid
SENSITIVE_DV_LINK_ID
dv_id
-
CURRENT_TS
dv_load_time
-
RUN_ID
dv_run_id
stg_source_system
-
dv_source_system
stg_source_entity
-
dv_source_entity
-
PACKAGE_VERSION
dv_package_version
-
LOAD_NAME
dv_load_name
customerid
SENSITIVE_DV_ID_LOOKUP
dv_id_customer
storeid
HASH_KEY
dv_id_store
territoryid
HASH_KEY
dv_id_territory
Note that the sensitive customerid should not be mapped to the link dv_id. Instead, the randomized sensitive hub dv_id is joined and appended to the link dv_id by the FTL template and transformation.
An FTL template reference is added as an OVERRIDE load step:
Databricks SQL
SQL
INSERT INTO rdv.L_CUSTOMER_STORE_TERRITORY (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
FROM (
SELECT
MD5(UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.storeid AS STRING)), ''), '-1') || '~' || COALESCE(NULLIF(TRIM(CAST(src_entity.territoryid AS STRING)), ''), '-1') || '~' || src_lookup.dv_id)) AS dv_id
, CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_l_customer_store_territory_from_stg_customer_01' AS dv_load_name
, src_lookup.dv_id AS dv_id_customer
, MD5(UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.storeid AS STRING)), ''), '-1'))) AS dv_id_store
, MD5(UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.territoryid AS STRING)), ''), '-1'))) AS dv_id_territory
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
rdv.L_CUSTOMER_STORE_TERRITORY trg
WHERE
trg.dv_id = src.dv_id
);
Google BigQuery
SQL
INSERT INTO rdv.L_CUSTOMER_STORE_TERRITORY (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
FROM (
SELECT
TO_HEX(MD5(UPPER(CONCAT(COALESCE(NULLIF(TRIM(CAST(src_entity.storeid AS STRING)), ''), '-1') ,'~', COALESCE(NULLIF(TRIM(CAST(src_entity.territoryid AS STRING)), ''), '-1') ,'~', src_lookup.dv_id)))) AS dv_id
, CURRENT_TIMESTAMP() AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_l_customer_store_territory_from_stg_customer_01' AS dv_load_name
, src_lookup.dv_id AS dv_id_customer
, TO_HEX(MD5(UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.storeid AS STRING)), ''), '-1')))) AS dv_id_store
, TO_HEX(MD5(UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.territoryid AS STRING)), ''), '-1')))) AS dv_id_territory
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(COALESCE(NULLIF(TRIM(CAST(src_entity.customerid AS STRING)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
rdv.L_CUSTOMER_STORE_TERRITORY trg
WHERE
trg.dv_id = src.dv_id
);
Snowflake
SQL
INSERT INTO rdv.L_CUSTOMER_STORE_TERRITORY (
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
)
SELECT DISTINCT
dv_id
, dv_load_time
, dv_run_id
, dv_source_system
, dv_source_entity
, dv_package_version
, dv_load_name
, dv_id_customer
, dv_id_store
, dv_id_territory
FROM (
SELECT
MD5(UPPER(NVL(NULLIF(TRIM(CAST(src_entity.storeid AS VARCHAR)), ''), '-1') || '~' || NVL(NULLIF(TRIM(CAST(src_entity.territoryid AS VARCHAR)), ''), '-1') || '~' || src_lookup.dv_id)) AS dv_id
, CURRENT_TIMESTAMP::timestamp_ntz AS dv_load_time
, <targetrunid> AS dv_run_id
, src_entity.stg_source_system AS dv_source_system
, src_entity.stg_source_entity AS dv_source_entity
, <packageversion> AS dv_package_version
, 'load_l_customer_store_territory_from_stg_customer_01' AS dv_load_name
, src_lookup.dv_id AS dv_id_customer
, MD5(UPPER(NVL(NULLIF(TRIM(CAST(src_entity.storeid AS VARCHAR)), ''), '-1'))) AS dv_id_store
, MD5(UPPER(NVL(NULLIF(TRIM(CAST(src_entity.territoryid AS VARCHAR)), ''), '-1'))) AS dv_id_territory
FROM staging.STG_CUSTOMER src_entity
JOIN sdv.H_CUSTOMER src_lookup ON (UPPER(NVL(NULLIF(TRIM(CAST(src_entity.customerid AS VARCHAR)), ''), '-1')) = src_lookup.business_key)
WHERE
src_entity.stg_batch_id IN (<loadablerunids>)
) src
WHERE
NOT EXISTS (
SELECT
1
FROM
rdv.L_CUSTOMER_STORE_TERRITORY trg
WHERE
trg.dv_id = src.dv_id
);
Status satellite for a link
In this example, we will create a status satellite for a link that references a sensitive hub:
Entity type
S_SAT
Entity name
STS_CUSTOMER_STORE_TERRITORY
Schema
bdv
A load and two entity mappings are added:
Source entity
Position
Type
STG_CUSTOMER
1
SOURCE (Driving Run ID Logic)
H_CUSTOMER
2
SOURCE
Attribute mapping from staging:
STG_CUSTOMER
Transformation
L_CUSTOMER_STORE_TERRITORY
storeid
territoryid
SENSITIVE_DV_LINK_ID
dv_id
-
CURRENT_TS
dv_load_time
-
RUN_ID
dv_run_id
stg_source_system
-
dv_source_system
stg_source_entity
-
dv_source_entity
-
PACKAGE_VERSION
dv_package_version
-
LOAD_NAME
dv_load_name
customerid
SENSITIVE_DV_ID_LOOKUP
dv_driving_key
-
DV_STATUS
status
In this example, the sensitive customer id is set as the driving key. See Status satellite for more details about choosing the correct driving key.
As in the link load, the sensitive customerid should not be mapped to the status satellite dv_id. Instead, the randomized sensitive hub dv_id is joined and appended to the status satellite dv_id by the FTL template and transformation.
Status satellite load logic consists of multiple steps and varies by extract type (delta vs. full). When using the FTL template sensitive_dv_status_satellite in a load step, you must specify the step_type parameter, which controls each phase of the load process. Valid parameter values are:
gatekeeper
drop_temp
temp_1
insert_0
insert_1
See the examples below for full and delta extract scenarios using this template.