Dimension
Dimensions provide context and categorisation to facts. Dimensions are defined for business concepts such as customers, products, facilities and time; they contain a unique dimension key which is referenced by facts. Sometimes dimension keys are also referenced by other dimensions, creating a snowflake model.
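For illustration only (these table and column names are hypothetical, not taken from this article), a dimension key referenced by a fact could look like this:

-- Hypothetical dimension holding the latest customer attributes
create table publish.D_CUSTOMER (
    dim_customer_key    varchar(32) not null  -- unique dimension key
  , customer_name       varchar(200)
  , customer_segment    varchar(50)
);

-- Hypothetical fact referencing the dimension key
create table publish.F_SALES (
    dim_customer_key    varchar(32)           -- references publish.D_CUSTOMER.dim_customer_key
  , sales_date          date
  , sales_amount        numeric(18,2)
);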
Context: Publish zone
This article discusses dimensional modeling from the publish zone perspective. The assumption is that source system data is stored in a data warehouse zone and then loaded into the publish zone for end use purposes (a multi-layered approach); that is, source system data is not loaded directly into a dimensional model.
See persistent staging and Data Vault for guides on data warehouse zone modeling.
Creating a dimension
Follow these steps to create a dimension:
Select or create a package and create a new entity, choose type DIM and physical type TABLE, VIEW or MATERIALIZED_VIEW (see considerations in the notes below).
Create an incoming load and an entity mapping for each source entity for the dimension. If you are using OVERRIDE load steps to define the load logic for the dimension, attribute-level entity mappings are not mandatory. However, it is a good practice to create attribute-level mappings for metadata purposes. Creating an attribute-level mapping is also a good way to create attributes with the correct data types for the dimension entity.
Define the load logic:
If your dimension has only one source entity, you might be able to use automatic load generation. In this case you need to define an attribute-level entity mapping including transformations. Use load options to modify the automatically generated logic and export SQL to preview the generated logic. Note that the DEPENDENCY entity mapping type can be used in some cases to allow automatic load generation with multiple dependencies.
OR
If your dimension has multiple source entities, define the load logic in SQL with OVERRIDE load steps. Note that you can use variables to make the logic more generic and save your own load templates. Also see examples below.
Use the Export SQL feature to preview entity SQL & load SQL.
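For illustration, OVERRIDE steps can reference variables such as <target_schema> and <target_entity_name> (the same variables appear in the load template example later in this article) to keep the logic reusable across entities. A minimal, hypothetical sketch:

-- Truncate step written with variables instead of hard-coded names
truncate table <target_schema>.<target_entity_name>;

-- Insert step; the select is dimension-specific and shown here only as a placeholder
insert into <target_schema>.<target_entity_name>
select
    ...
from
    dw.customer_current;  -- hypothetical source entity, replace with your own logic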
Naming conventions
Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions.
Notes
Physical type
As the source system data is stored in the data warehouse zone, implemented e.g. with the persistent staging or Data Vault modeling methodologies, you can create dimensions with physical type TABLE, VIEW or MATERIALIZED_VIEW with the following considerations:
Physical type | Considerations |
---|---|
Table | |
View | |
Materialized view | |
Keys
If you are using hash keys in the data warehouse zone, the easiest option is to use them in the publish zone as well. This way key lookups between facts and dimensions can be avoided and loads can be executed concurrently. Depending on how the data warehouse has been implemented, using non-hashed business keys in the publish zone can also be simple.
Note that in general hash keys will perform worse than non-hashed natural keys or integer keys. For example, BI tools often struggle with hash keys when the data volume is high.
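To illustrate why this avoids lookups (a hypothetical sketch; entity and attribute names are not from this article): when the dimension load and the fact load both derive the key with the same hash function over the same business key, the fact load does not need to join to the dimension to resolve keys, so the loads can run independently and concurrently.

-- Dimension load derives the key from the business key
insert into publish.D_CUSTOMER
select
    md5(upper(trim(c.customer_id)))  as dim_customer_key
  , c.customer_name
from dw.customer_current c;

-- Fact load derives the same key independently; no key lookup against D_CUSTOMER is needed
insert into publish.F_SALES
select
    md5(upper(trim(s.customer_id)))  as dim_customer_key
  , s.sales_date
  , s.sales_amount
from dw.sales_current s;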
Load patterns
Consider if change history is needed in the dimension. The most common use case is that end users only want to see the latest version of the data. Dimensions containing the change history are more complicated to load and to use. Change history should be stored in the data warehouse zone and only presented in the publish zone when required by the use case.
Consider if the dimension can be fully loaded every time or if an incremental load is needed. Commonly the amount of data in dimensions is limited, and with modern database management systems it is more efficient to load the full table in each load execution than to design a more complicated incremental load. Dimensions also often combine data from multiple sources, which makes incremental load logic difficult to define.
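For context, incremental logic typically filters the source with a high-water mark so that only rows changed since the previous execution are processed. A hypothetical sketch of such a delta selection (names are illustrative; the merge/update part of the load is omitted here):

-- Only pick source rows newer than what the dimension already contains
select
    md5(upper(trim(p.product_id)))  as dim_product_key
  , p.product_name
from dw.product_current p
where p.load_time > (select coalesce(max(d.update_time), '1900-01-01'::timestamp) from publish.D_PRODUCT d);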
Note that you can also define multiple loads per entity and schedule them freely. For example, you might define an incremental load for a large dimension which is used on a regular basis, and a separate full load which can be used when needed. Full loads are useful e.g. when:
Running an initial load, i.e. loading the entity for the first time with all historical data.
Recovering from an incident that requires data to be reloaded.
The incremental load logic trades perfect correctness for better performance, for example by not catching rare changes to old data. In such cases a full maintenance load scheduled less frequently than the incremental load keeps the data quality at an acceptable level.
Erasing sensitive data from the data warehouse (read more about handling sensitive data in Agile Data Engine blog).
Full loads executed with the truncate-insert pattern can cause problems with simultaneous end use as the table being loaded is empty after the truncate statement has been run and before the insert has finished. Depending on the features of the target database, various load patterns have been designed to avoid these problems. See the Using a logic view and an automatically generated load and Transactional truncate-insert template for Snowflake examples below.
Note that if you are using current views on top of the data warehouse zone (e.g. Data Vault) to query data into the publish zone and the underlying tables have a lot of data, the current view logic can cause performance issues. It is often better for performance to query the underlying data warehouse tables directly, apply the necessary filters and joins, and implement the current-row logic in the publish zone load itself. When you run into performance issues, investigate the query plan and see if such optimisations are possible.
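A hypothetical sketch of this optimisation (Data Vault entity and attribute names are illustrative): the current-row logic is inlined into the publish load instead of selecting from a current view, after filters have already reduced the data volume.

-- Apply filters and joins on the underlying tables first, then keep only the latest satellite row per key
select
    hub.dv_id               as dim_key
  , sat.customer_name
  , sat.customer_segment
from dv.h_customer hub
join (
    select
        s.*
      , row_number() over (partition by s.dv_id order by s.dv_load_time desc) as rn
    from dv.s_customer s
    where s.dv_load_time >= dateadd(year, -3, current_timestamp)  -- example filter applied before the current-row logic
) sat
  on sat.dv_id = hub.dv_id
 and sat.rn = 1;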
Examples
Full load for a non-historizing dimension
A non-historizing dimension (SCD type 1) is the most common scenario for a publish dimension: only the latest version of the data is needed. Often the row count is also low, so the dimension can be fully loaded (truncate-insert) every time.
The simplest option is to create the dimension entity, define a full load and entity mappings as instructed above, and add OVERRIDE steps as follows:
Load step | Example | Notes |
---|---|---|
truncate_table | (see the sketch below) | Generic truncate table statement. |
insert_into | (see the sketch below) | Insert statement containing the load logic for the dimension. If your target database supports syntax like INSERT OVERWRITE INTO… you can skip the truncate step. This example is loading a taxi zone dimension from Data Vault and using current views to get the latest data from satellites. |
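The example SQL referenced in the table is not reproduced here; the following is a hedged sketch of what the two steps could look like for a taxi zone dimension loaded from Data Vault current views (entity and attribute names are illustrative):

-- truncate_table step
truncate table publish.D_TAXI_ZONE;

-- insert_into step: latest taxi zone data from Data Vault current views
insert into publish.D_TAXI_ZONE (
    dim_taxi_zone_key
  , zone_name
  , borough
  , load_time
)
select
    hub.dv_id             as dim_taxi_zone_key
  , sat.zone_name
  , sat.borough
  , current_timestamp     as load_time
from dv.h_taxi_zone hub
join dv.s_taxi_zone_c sat   -- current view on the taxi zone satellite
  on sat.dv_id = hub.dv_id;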
Note that using the simple truncate-insert approach causes the entity to be empty for the duration of the insert statement execution as discussed above.
Alternatively, you could use one of the built-in load patterns described in Using a logic view and an automatically generated load, or a load template as presented in Transactional truncate-insert template for Snowflake. This template loads the data first into a temporary entity and then swaps the original table with it, which also allows you to run smoke tests on the data before the swap operation.
Note that you can also load a type 1 slowly changing dimension incrementally if needed. You could use the pattern described in Using a logic view and an automatically generated load with OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE or write your own load logic with OVERRIDE load steps.
Using a logic view and an automatically generated load
Agile Data Engine has a variety of preconfigured load patterns which can be enabled with load options and used with dimensions (and other types of entities):
Load method | Dimension type | Load pattern | Notes |
---|---|---|---|
Full | Non-historizing (SCD type 1) | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | Uses a MERGE statement. |
 | | | Uses a MERGE statement. |
Incremental | Historizing | | Compare to a persistent staging entity or a Data Vault satellite. |
 | Non-historizing (SCD type 1) | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | Uses a temporary table which is swapped or transactionally renamed to the original entity depending on the target database product. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | OVERWRITE is only supported in Snowflake. |
 | | | Uses a MERGE statement. |
 | | | Uses a MERGE statement. |
These load patterns can be used in loads that have one source entity mapping, yet dimensions typically join data from multiple sources. Still, you can use them by first defining the dimension logic as SQL in a view and then materializing the view into a table with one of the load patterns.
Use one of the load patterns with a logic view as follows:
Create a view for the dimension load logic, e.g. D_CUSTOMER_LOGIC.
Add entity mappings for all the source entities and define the load logic with SQL (a sketch of such a logic view is shown after the attribute mapping table below).
Create a table for the actual dimension, e.g. D_CUSTOMER.
Create a 1-to-1 mapping from the logic view to the table to add the attributes needed in the dimension. Also add/map the following attributes that collect metadata and are used by the load patterns (feel free to name them according to your conventions):
Source attribute | Transformation | Target attribute | Target attribute technical type | Notes |
---|---|---|---|---|
dim_key | NONE | dim_key | DV_HASHKEY | Primary key for the dimension |
- | CURRENT_TS | load_time | DV_LOAD_TIME | Indicates when a row was inserted |
- | CURRENT_TS | update_time | DV_UPDATE_TIME | Indicates when a row was updated |
All mapped source attributes | HASH | comparison_hash | DV_DATAHASH | Used for comparing changes between the target entity and the incoming data by some of the load patterns. |
Note that the load patterns have different requirements on what type of attributes are needed in the target entity for the load pattern to work (technical type). Please refer to the load pattern specific documentation and use SQL export to preview the logic.
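For reference, the SQL behind a logic view like D_CUSTOMER_LOGIC (steps 1–2 above) could look roughly like this; a hedged sketch where the Data Vault source entities and attributes are hypothetical and depend on your data warehouse model:

-- D_CUSTOMER_LOGIC: dimension logic combining multiple Data Vault entities
create or replace view publish.D_CUSTOMER_LOGIC as
select
    hub.dv_id               as dim_key
  , sat.customer_name
  , sat.customer_segment
  , cty.country_name
from dv.h_customer hub
join dv.s_customer_c sat                -- current view on the customer satellite
  on sat.dv_id = hub.dv_id
left join dv.l_customer_country_c lnk   -- current view on the customer-country link
  on lnk.dv_id_customer = hub.dv_id
left join dv.s_country_c cty
  on cty.dv_id = lnk.dv_id_country;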
Enable one of the load patterns by setting it as a load option, for example:
OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE = true
Finally, use the Export SQL feature to preview the load SQL, for example:
MERGE INTO publish.D_CUSTOMER trg USING (
SELECT DISTINCT
dim_key AS dim_key
...
, CURRENT_TIMESTAMP::timestamp_ntz AS load_time
, CURRENT_TIMESTAMP::timestamp_ntz AS update_time
, MD5(...) AS comparison_hash
FROM
publish.D_CUSTOMER_LOGIC src_entity
) src
ON (
trg.dim_key = src.dim_key
)
WHEN MATCHED AND trg.comparison_hash != src.comparison_hash
THEN UPDATE SET
dim_key = src.dim_key
...
, update_time = src.update_time
, comparison_hash = src.comparison_hash
WHEN NOT MATCHED
THEN INSERT (
dim_key
...
, load_time
, update_time
, comparison_hash
)
VALUES (
src.dim_key
...
, src.load_time
, src.update_time
, src.comparison_hash
)
;
Transactional truncate-insert template for Snowflake
As discussed above, the truncate-insert pattern causes the target table to be empty for the duration it takes to run the insert query. This example pattern is designed for Snowflake, and it avoids the problem by first loading the data into a temporary table and then swapping the primary table with the temporary table in an instant metadata operation. Moreover, this pattern allows you to configure SMOKE_BLACK steps to run tests on the temporary table data before executing the swap operation, interrupting the workflow if critical issues are detected.
Configure the load template in the CONFIG_LOAD_TEMPLATES configuration package and add to a load with Add from template:
{
...
"entityLoadLogicTemplates": [
...
{
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateName": "insert_overwrite_tmp_swap"
}
],
"entityLoadLogicTemplateLoadSteps": [
{
"entityLoadLogicTemplateLoadStepId": "f2a6ed3b-cab5-4a1d-a36a-b12173420c08",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "drop_tmp_if_exists",
"position": 1,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "drop table if exists <target_schema>.z_tmp_<target_entity_name>",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "1b7647df-56b9-4449-b6dc-deeffcbe07f2",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "create_tmp",
"position": 2,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "create transient table <target_schema>.z_tmp_<target_entity_name> clone <target_schema>.<target_entity_name> copy grants",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "7eb9ce9f-88f8-4600-907f-fb4dcaf00648",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "insert_to_tmp",
"position": 3,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "insert overwrite into <target_schema>.z_tmp_<target_entity_name>\n/* Insert logic here */",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "25e6fd7f-46c1-49fa-8845-f5b2f369ff30",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "swap_tmp_with_original",
"position": 4,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "alter table <target_schema>.z_tmp_<target_entity_name> swap with <target_schema>.<target_entity_name>",
"readAffectedRows": false
},
{
"entityLoadLogicTemplateLoadStepId": "853b174a-08f3-427a-9d32-76d4594fd2c7",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"loadStepName": "drop_tmp",
"position": 5,
"stepType": "OVERRIDE",
"loadLanguageName": "SQL",
"textStep": "drop table <target_schema>.z_tmp_<target_entity_name>",
"readAffectedRows": false
}
],
"entityTypeEntityLoadTypeEntityLoadLogicTemplates": [
{
"entityTypeEntityLoadTypeEntityLoadLogicTemplateId": "85360f26-3d78-4dd0-95d1-dc5727fff39f",
"entityType": "DIM",
"entityLoadType": "TRANSFORM_PERSIST",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateOrdinalPosition": 100,
"isDefault": false
},
{
"entityTypeEntityLoadTypeEntityLoadLogicTemplateId": "2e67cc4f-38c4-4c47-ba10-c7337bf290cd",
"entityType": "FACT",
"entityLoadType": "TRANSFORM_PERSIST",
"entityLoadLogicTemplateId": "b5a5b5ee-2b8e-4094-9687-09908db61401",
"templateOrdinalPosition": 100,
"isDefault": false
}
]
}
Resulting SQL load steps when used in an entity:
--- 1. drop_tmp_if_exists (OVERRIDE - SQL) ---
drop table if exists <target_schema>.z_tmp_<target_entity_name>;
--- 2. create_tmp (OVERRIDE - SQL) ---
create transient table <target_schema>.z_tmp_<target_entity_name> clone <target_schema>.<target_entity_name> copy grants;
--- 3. insert_to_tmp (OVERRIDE - SQL) ---
insert overwrite into <target_schema>.z_tmp_<target_entity_name>
/* Insert logic here */;
--- 4. swap_tmp_with_original (OVERRIDE - SQL) ---
alter table <target_schema>.z_tmp_<target_entity_name> swap with <target_schema>.<target_entity_name>;
--- 5. drop_tmp (OVERRIDE - SQL) ---
drop table <target_schema>.z_tmp_<target_entity_name>;
Notes
The template creates a new transient table and swaps it with the original table, effectively changing the table type to transient. Read more about working with transient tables in the Snowflake documentation and modify the template according to your requirements.
Also see the usage notes related to cloning. For example, clustering keys are replicated but automatic clustering is suspended for the clone by default.
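If the dimension table has clustering keys and relies on automatic clustering, consider resuming it after the swap. A possible additional load step (a sketch; verify the behaviour against your own Snowflake setup):

-- Resume automatic clustering on the swapped-in table (suspended by default for clones)
alter table <target_schema>.<target_entity_name> resume recluster;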