Dimension
Dimensions provide context and categorisation to facts. Dimensions are defined for business concepts such as customers, products, facilities and time; they contain a unique dimension key which is referenced by facts. Sometimes dimension keys are also referenced by other dimensions, creating a snowflake model.
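For illustration only (these table and column names are hypothetical, not taken from this article), a dimension key referenced by a fact could look like this:

-- Hypothetical dimension holding the latest customer attributes
create table publish.D_CUSTOMER (
    dim_customer_key    varchar(32) not null  -- unique dimension key
  , customer_name       varchar(200)
  , customer_segment    varchar(50)
);

-- Hypothetical fact referencing the dimension key
create table publish.F_SALES (
    dim_customer_key    varchar(32)           -- references publish.D_CUSTOMER.dim_customer_key
  , sales_date          date
  , sales_amount        numeric(18,2)
);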
Context: Publish zone
This article discusses dimensional modeling from the publish zone perspective. The assumption is that source system data is stored in a data warehouse zone and then loaded into the publish zone for end use purposes (a multi-layered approach); that is, source system data is not loaded directly into a dimensional model.
See persistent staging and Data Vault for guides on data warehouse zone modeling.
Creating a dimension
Follow these steps to create a dimension:
Select or create a package and create a new entity, choose type DIM and physical type TABLE, VIEW or MATERIALIZED_VIEW (see considerations in the notes below).
Create an incoming load and an entity mapping for each source entity for the dimension. If you are using OVERRIDE load steps to define the load logic for the dimension, attribute-level entity mappings are not mandatory. However, it is a good practice to create attribute-level mappings for metadata purposes. Creating an attribute-level mapping is also a good way to create attributes with the correct data types for the dimension entity.
Define the load logic:
If your dimension has only one source entity, you might be able to use automatic load generation. In this case you need to define an attribute-level entity mapping including transformations. Use load options to modify the automatically generated logic and export SQL to preview the generated logic. Note that the DEPENDENCY entity mapping type can be used in some cases to allow automatic load generation with multiple dependencies.
OR
If your dimension has multiple source entities, define the load logic in SQL with OVERRIDE load steps. Note that you can use variables to make the logic more generic and save your own load templates. Also see examples below.
Use the Export SQL feature to preview entity SQL & load SQL.
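For illustration, OVERRIDE steps can reference variables such as <target_schema> and <target_entity_name> (the same variables appear in the load template example later in this article) to keep the logic reusable across entities. A minimal, hypothetical sketch:

-- Truncate step written with variables instead of hard-coded names
truncate table <target_schema>.<target_entity_name>;

-- Insert step; the select is dimension-specific and shown here only as a placeholder
insert into <target_schema>.<target_entity_name>
select
    ...
from
    dw.customer_current;  -- hypothetical source entity, replace with your own logic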
Naming conventions
Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions.
Notes
Physical type
As the source system data is stored in the data warehouse zone, implemented e.g. with the persistent staging or Data Vault modeling methodologies, you can create dimensions with physical type TABLE, VIEW or MATERIALIZED_VIEW with the following considerations:
Physical type | Considerations |
---|---|
Table | |
View | |
Materialized view | |
Keys
If you are using hash keys in the data warehouse zone, the easiest option is to use them in the publish zone as well. This way key lookups between facts and dimensions can be avoided and loads can be executed concurrently. Depending on how the data warehouse has been implemented, using non-hashed business keys in the publish zone can also be simple.
Note that in general hash keys will perform worse than non-hashed natural keys or integer keys. For example, BI tools often struggle with hash keys when the data volume is high.
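To illustrate why this avoids lookups (a hypothetical sketch; entity and attribute names are not from this article): when the dimension load and the fact load both derive the key with the same hash function over the same business key, the fact load does not need to join to the dimension to resolve keys, so the loads can run independently and concurrently.

-- Dimension load derives the key from the business key
insert into publish.D_CUSTOMER
select
    md5(upper(trim(c.customer_id)))  as dim_customer_key
  , c.customer_name
from dw.customer_current c;

-- Fact load derives the same key independently; no key lookup against D_CUSTOMER is needed
insert into publish.F_SALES
select
    md5(upper(trim(s.customer_id)))  as dim_customer_key
  , s.sales_date
  , s.sales_amount
from dw.sales_current s;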
Load patterns
Consider if change history is needed in the dimension. The most common use case is that end users only want to see the latest version of the data. Dimensions containing the change history are more complicated to load and to use. Change history should be stored in the data warehouse zone and only presented in the publish zone when required by the use case.
Consider if the dimension can be fully loaded every time or if an incremental load is needed. Commonly the amount of data in dimensions is limited, and with modern database management systems it is more efficient to load the full table in each load execution than to design a more complicated incremental load. Dimensions also often combine data from multiple sources, which makes incremental load logic difficult to define.
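For context, incremental logic typically filters the source with a high-water mark so that only rows changed since the previous execution are processed. A hypothetical sketch of such a delta selection (names are illustrative; the merge/update part of the load is omitted here):

-- Only pick source rows newer than what the dimension already contains
select
    md5(upper(trim(p.product_id)))  as dim_product_key
  , p.product_name
from dw.product_current p
where p.load_time > (select coalesce(max(d.update_time), '1900-01-01'::timestamp) from publish.D_PRODUCT d);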
Note that you can also define multiple loads per entity and schedule them freely. For example, you might define an incremental load for a large dimension which is used on a regular basis, and a separate full load which can be used when needed. Full loads are useful e.g. when:
Running an initial load, i.e. loading the entity for the first time with all historical data.
Recovering from an incident that requires data to be reloaded.
The incremental load logic trades perfect correctness for better performance, for example by not catching rare changes to old data. In such cases a full maintenance load scheduled less frequently than the incremental load keeps the data quality at an acceptable level.
Erasing sensitive data from the data warehouse (read more about handling sensitive data in Agile Data Engine blog).
Full loads executed with the truncate-insert pattern can cause problems with simultaneous end use as the table being loaded is empty after the truncate statement has been run and before the insert has finished. Depending on the features of the target database, various load patterns have been designed to avoid these problems. See the Using a logic view and an automatically generated load and Transactional truncate-insert template for Snowflake examples below.
Note that if you are using current views on top of the data warehouse zone (e.g. Data Vault) to query data into the publish zone and the underlying tables have a lot of data, the current view logic can cause performance issues. It is often better for performance to query the underlying data warehouse tables directly, apply the necessary filters and joins, and implement the current-row logic in the publish zone load itself. When you run into performance issues, investigate the query plan and see if such optimisations are possible.
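A hypothetical sketch of this optimisation (Data Vault entity and attribute names are illustrative): the current-row logic is inlined into the publish load instead of selecting from a current view, after filters have already reduced the data volume.

-- Apply filters and joins on the underlying tables first, then keep only the latest satellite row per key
select
    hub.dv_id               as dim_key
  , sat.customer_name
  , sat.customer_segment
from dv.h_customer hub
join (
    select
        s.*
      , row_number() over (partition by s.dv_id order by s.dv_load_time desc) as rn
    from dv.s_customer s
    where s.dv_load_time >= dateadd(year, -3, current_timestamp)  -- example filter applied before the current-row logic
) sat
  on sat.dv_id = hub.dv_id
 and sat.rn = 1;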
Examples
Full load for a non-historizing dimension
A non-historizing dimension (SCD type 1) is the most common scenario for a publish dimension: only the latest version of the data is needed. Often the row count is also low, so the dimension can be fully loaded (truncate-insert) every time.
The simplest option is to create the dimension entity, define a full load and entity mappings as instructed above, and add OVERRIDE steps as follows:
Load step | Example | Notes |
---|---|---|
truncate_table | (see the sketch below) | Generic truncate table statement. |
insert_into | (see the sketch below) | Insert statement containing the load logic for the dimension. If your target database supports syntax like INSERT OVERWRITE INTO… you can skip the truncate step. This example is loading a taxi zone dimension from Data Vault and using current views to get the latest data from satellites. |
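The example SQL referenced in the table is not reproduced here; the following is a hedged sketch of what the two steps could look like for a taxi zone dimension loaded from Data Vault current views (entity and attribute names are illustrative):

-- truncate_table step
truncate table publish.D_TAXI_ZONE;

-- insert_into step: latest taxi zone data from Data Vault current views
insert into publish.D_TAXI_ZONE (
    dim_taxi_zone_key
  , zone_name
  , borough
  , load_time
)
select
    hub.dv_id             as dim_taxi_zone_key
  , sat.zone_name
  , sat.borough
  , current_timestamp     as load_time
from dv.h_taxi_zone hub
join dv.s_taxi_zone_c sat   -- current view on the taxi zone satellite
  on sat.dv_id = hub.dv_id;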
Note that using the simple truncate-insert approach causes the entity to be empty for the duration of the insert statement execution as discussed above.
Alternatively, you could use one of the built-in load patterns described in Using a logic view and an automatically generated load, or a load template as presented in Transactional truncate-insert template for Snowflake. This template loads the data first into a temporary entity and then swaps the original table with it, which also allows you to run smoke tests on the data before the swap operation.
Note that you can also load a type 1 slowly changing dimension incrementally if needed. You could use the pattern described in Using a logic view and an automatically generated load with OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE or write your own load logic with OVERRIDE load steps.
Using a logic view and an automatically generated load
Agile Data Engine has a variety of preconfigured load patterns which can be enabled with load options and used with dimensions (and other types of entities):
Load method | Dimension type | Load pattern | Notes |
---|---|---|---|
Full | Non-historizing (SCD type 1) | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | Uses a MERGE statement. |
 | | | Uses a MERGE statement. |
Incremental | Historizing | | Compare to a persistent staging entity or a Data Vault satellite. |
 | Non-historizing (SCD type 1) | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | Uses a MERGE statement. |
 | | | Uses a MERGE statement. |
These load patterns can be used in loads that have one source entity mapping, yet dimensions typically join data from multiple sources. Still, you can use them by first defining the dimension logic as SQL in a view and then materializing the view into a table with one of the load patterns.
Use one of the load patterns with a logic view as follows:
Create a view for the dimension load logic, e.g. D_CUSTOMER_LOGIC.
Add entity mappings for all the source entities and define the load logic with SQL (a sketch of such a logic view is shown after the attribute mapping table below).
Create a table for the actual dimension, e.g. D_CUSTOMER.
Create a 1-to-1 mapping from the logic view to the table to add the attributes needed in the dimension. Also add/map the following attributes that collect metadata and are used by the load patterns (feel free to name them according to your conventions):
Source attribute | Transformation | Target attribute | Target attribute technical type | Notes |
---|---|---|---|---|
dim_key | NONE | dim_key | DV_HASHKEY | Primary key for the dimension |
- | CURRENT_TS | load_time | DV_LOAD_TIME | Indicates when a row was inserted |
- | CURRENT_TS | update_time | DV_UPDATE_TIME | Indicates when a row was updated |
All mapped source attributes | HASH | comparison_hash | DV_DATAHASH | Used for comparing changes between the target entity and the incoming data by some of the load patterns. |
Note that the load patterns have different requirements on what type of attributes are needed in the target entity for the load pattern to work (technical type). Please refer to the load pattern specific documentation and use SQL export to preview the logic.
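For reference, the SQL behind a logic view like D_CUSTOMER_LOGIC (steps 1–2 above) could look roughly like this; a hedged sketch where the Data Vault source entities and attributes are hypothetical and depend on your data warehouse model:

-- D_CUSTOMER_LOGIC: dimension logic combining multiple Data Vault entities
create or replace view publish.D_CUSTOMER_LOGIC as
select
    hub.dv_id               as dim_key
  , sat.customer_name
  , sat.customer_segment
  , cty.country_name
from dv.h_customer hub
join dv.s_customer_c sat                -- current view on the customer satellite
  on sat.dv_id = hub.dv_id
left join dv.l_customer_country_c lnk   -- current view on the customer-country link
  on lnk.dv_id_customer = hub.dv_id
left join dv.s_country_c cty
  on cty.dv_id = lnk.dv_id_country;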
Enable one of the load patterns by setting it as a load option, for example:
OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE = true
Finally, use the Export SQL feature to preview the load SQL, for example:
MERGE INTO publish.D_CUSTOMER trg USING (
SELECT DISTINCT
dim_key AS dim_key
...
, CURRENT_TIMESTAMP::timestamp_ntz AS load_time
, CURRENT_TIMESTAMP::timestamp_ntz AS update_time
, MD5(...) AS comparison_hash
FROM
publish.D_CUSTOMER_LOGIC src_entity
) src
ON (
trg.dim_key = src.dim_key
)
WHEN MATCHED AND trg.comparison_hash != src.comparison_hash
THEN UPDATE SET
dim_key = src.dim_key
...
, update_time = src.update_time
, comparison_hash = src.comparison_hash
WHEN NOT MATCHED
THEN INSERT (
dim_key
...
, load_time
, update_time
, comparison_hash
)
VALUES (
src.dim_key
...
, src.load_time
, src.update_time
, src.comparison_hash
)
;
Transactional truncate-insert template for Snowflake
As discussed above, the truncate-insert pattern causes the target table to be empty for the duration it takes to run the insert query. This example pattern is designed for Snowflake, and it avoids the problem by first loading the data into a temporary table and then swapping the primary table with the temporary table in an instant metadata operation. Moreover, this pattern allows you to configure SMOKE_BLACK steps to run tests on the temporary table data before executing the swap operation, interrupting the workflow if critical issues are detected.
Configure the load template in the CONFIG_LOAD_TEMPLATES configuration package and add to a load with Add from template:
{
...
"entityLoadLogicTemplates": [
...
{
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateName": "insert_overwrite_tmp_swap"
}
],
"entityLoadLogicTemplateLoadSteps": [
{
"entityLoadLogicTemplateLoadStepId": "f2a6ed3b-cab5-4a1d-a36a-b12173420c08",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "drop_tmp_if_exists",
"position": 1,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "drop table if exists <target_schema>.z_tmp_<target_entity_name>",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "1b7647df-56b9-4449-b6dc-deeffcbe07f2",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "create_tmp",
"position": 2,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "create transient table <target_schema>.z_tmp_<target_entity_name> clone <target_schema>.<target_entity_name> copy grants",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "7eb9ce9f-88f8-4600-907f-fb4dcaf00648",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "insert_to_tmp",
"position": 3,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "insert overwrite into <target_schema>.z_tmp_<target_entity_name>\n/* Insert logic here */",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "25e6fd7f-46c1-49fa-8845-f5b2f369ff30",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "swap_tmp_with_original",
"position": 4,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "alter table <target_schema>.z_tmp_<target_entity_name> swap with <target_schema>.<target_entity_name>",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "853b174a-08f3-427a-9d32-76d4594fd2c7",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "drop_tmp",
"position": 5,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "drop table <target_schema>.z_tmp_<target_entity_name>",
"readAffectedRows": false
}
],
"entityTypeEntityLoadTypeEntityLoadLogicTemplates": [
{
"entityTypeEntityLoadTypeEntityLoadLogicTemplateId": "85360f26-3d78-4dd0-95d1-dc5727fff39f",
"entityType": "DIM",
"entityLoadType": "TRANSFORM_PERSIST",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateOrdinalPosition": 100,
"isDefault": false
},
{
"entityTypeEntityLoadTypeEntityLoadLogicTemplateId": "2e67cc4f-38c4-4c47-ba10-c7337bf290cd",
"entityType": "FACT",
"entityLoadType": "TRANSFORM_PERSIST",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateOrdinalPosition": 100,
"isDefault": false
}
]
}
Resulting SQL load steps when used in an entity:
--- 1. drop_tmp_if_exists (OVERRIDE - SQL) ---
drop table if exists <target_schema>.z_tmp_<target_entity_name>;
--- 2. create_tmp (OVERRIDE - SQL) ---
create transient table <target_schema>.z_tmp_<target_entity_name> clone <target_schema>.<target_entity_name> copy grants;
--- 3. insert_to_tmp (OVERRIDE - SQL) ---
insert overwrite into <target_schema>.z_tmp_<target_entity_name>
/* Insert logic here */;
--- 4. swap_tmp_with_original (OVERRIDE - SQL) ---
alter table <target_schema>.z_tmp_<target_entity_name> swap with <target_schema>.<target_entity_name>;
--- 5. drop_tmp (OVERRIDE - SQL) ---
drop table <target_schema>.z_tmp_<target_entity_name>;
Notes
The template creates a new transient table and swaps it with the original table, effectively changing the table type to transient. Read more about working with transient tables in the Snowflake documentation and modify the template according to your requirements.
Also see the usage notes related to cloning. For example, clustering keys are replicated but automatic clustering is suspended for the clone by default.
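If the dimension table has clustering keys and relies on automatic clustering, consider resuming it after the swap. A possible additional load step (a sketch; verify the behaviour against your own Snowflake setup):

-- Resume automatic clustering on the swapped-in table (suspended by default for clones)
alter table <target_schema>.<target_entity_name> resume recluster;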