Slowly Changing Dimension Type 2
A Slowly Changing Dimension Type 2 (SCD2) is a data warehousing technique used to track historical changes in dimension data. When an attribute changes, a new row is added with the updated data, while the old data is retained. This allows for a complete history of changes over time. Each record typically includes fields like valid_from
, valid_to
, validity_status
, and may also contain a unique identifier (scd_id
) to manage versioning and ensure data integrity.
Data Example: Tracking Historical Changes with SCD2
In this section, we’ll explore what happens when new data arrives for an existing id in a dimension table, and how historical changes are tracked using the SCD2 logic.
Initial Data Setup
Let’s start with a dimension table that holds data about taxi zones. The initial data looks like this:
sdc_id | id | datahash | valid_from | valid_to | status | location_id | zone |
---|---|---|---|---|---|---|---|
0b51a5 | 8c19f5 | af1348 | 2024-09-02 | 9999-12-31 | 1 | 263 | Yorkville West |
8f8eab | d6baf6 | 7f500c | 2024-09-02 | 9999-12-31 | 1 | 264 | NV |
fa0e6cf | e56954 | f911c6 | 2024-09-02 | 9999-12-31 | 1 | 265 | NA |
Here, the validity_status is set to 1, indicating that these are the valid records for each taxi zone.
New Data
Suppose we receive updated data for taxi zones 264 and 265, along with completely new data for taxi zone 266. In an SCD2 setup, we don’t remove the old rows. Instead, we update the validity information of the previous rows and add the new rows to reflect the changes:
sdc_id | id | datahash | valid_from | valid_to | status | location_id | zone |
---|---|---|---|---|---|---|---|
0b51a5 | 8c19f5 | af1348 | 2024-09-02 | 9999-12-31 | 1 | 263 | Yorkville West |
8f8eab | d6baf6 | 7f500c | 2024-09-02 | 2024-09-03 | 0 | 264 | NV |
e87333 | d6baf6 | b6cb8b | 2024-09-04 | 9999-12-31 | 1 | 264 | N/A |
fa0e6cf | e56954 | f911c6 | 2024-09-02 | 2024-09-03 | 0 | 265 | NA |
3e4061 | e56954 | 28a20d | 2024-09-04 | 9999-12-31 | 1 | 265 | N/A |
76fa2b | f76640 | 8582f3 | 2024-09-04 | 9999-12-31 | 1 | 266 | N/A |
Here’s what happened:
Rows 2 and 4: The validity of the old rows for taxi zones 264 and 265 has been updated (
valid_to
changed to 2024-09-03) and theirstatus
set to 0, indicating they are no longer the current rows.Rows 3 and 5: New rows were added for taxi zones 264 and 265 with updated data, valid from 2024-09-04.
Row 6: A new row was added for the new taxi zone 266.
Row 1: The first row remains unchanged because the data related to that taxi zone has not been updated.
Creating a Slowly Changing Dimension Type 2 Entity and Load
To create an SCD2 entity in ADE, it is recommended to create a new entity type with default attributes for SCD2. If you haven’t created it yet, go to the Create SCD2 Entity Type with Default Attributes section.
Once the entity type is created, follow these steps to create an SCD2 entity.
Create an Entity using DIM_SCD2
Create a new entity using the
DIM_SCD2
entity type and set the physical type to table.Create and incoming load and map the source entities. Note that only one entity can be mapped as a source to use the load template and the
target_entity_attribute_list
variables. In this example, we map a hub as a dependency and a satellite as a source.Set the attribute-level mappings for the satellite. For the
valid_from
attribute, you can use either a business timestamp from the data or a technical timestamp indicating when the data was added to the data warehouse, depending on your business requirements.
For thevalid_to
andvalidity_status
attributes, you need to create empty mappings; no source attribute or transformation type should be set. This ensures that thetarget_entity_attribute_list
variables work correctly.Use the following load logic to populate the SCD2 table with the new rows and update the
valid_to
andvalidity_status
attributes when necessary.Note! The template contains one section that must be edited manually. Additionally, this logic assumes that rows should only be added if the id appears in both the hub and satellite, and that the hub is named consistently. If the hub name is different, manual edits may be required.
CODEWITH source_data AS ( SELECT DISTINCT <target_entity_attribute_list_with_transform> FROM ( SELECT dv_id AS hub_dv_id FROM <source_entity_schema>.H_<target_entity_logical_name> ) hub JOIN <source_entity_schema>.<source_entity_name> src_entity ON hub.hub_dv_id = src_entity.dv_id WHERE -- select new, uninserted rows based on the insert timestamp -- this is used to replace Run ID logic; however, it is recommended to use the Run ID logic whenever possible src_entity.source_insert_time > nvl((SELECT MAX(source_insert_time) FROM <target_schema>.<target_entity_name>), to_timestamp('2000-01-01', 'yyyy-MM-dd')) ) MERGE INTO <target_schema>.<target_entity_name> trg USING ( SELECT -- -- NOTE! Copy paste the target_entity_attribute_list here and REPLACE VALID_TO AND VALIDITY_STATUS WITH THESE: -- , CASE WHEN LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) IS NULL THEN TO_TIMESTAMP('9999-12-31', 'yyyy-MM-dd') ELSE LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) END AS valid_to , CASE WHEN LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) IS NULL THEN 1 ELSE 0 END AS validity_status FROM ( -- select only those target table rows whose validity information needs to be updated SELECT <target_entity_attribute_list> FROM <target_schema>.<target_entity_name> trg WHERE trg.validity_status = 1 AND EXISTS ( SELECT 1 FROM source_data src WHERE trg.<target_entity_logical_name>_id = src.<target_entity_logical_name>_id ) UNION ALL SELECT DISTINCT <target_entity_attribute_list> FROM source_data src_entity ) ) src ON ( trg.<target_entity_logical_name>_id = src.<target_entity_logical_name>_id AND trg.datahash = src.datahash AND COALESCE(trg.valid_from,'3000-01-01') = COALESCE(src.valid_from,'3000-01-01') ) WHEN MATCHED THEN UPDATE SET trg.valid_to = src.valid_to , trg.validity_status = src.validity_status WHEN NOT MATCHED THEN INSERT ( <target_entity_attribute_list> ) VALUES ( <target_entity_attribute_list> ) ;
Create SCD2 Entity Type with Default Attributes
This section describes how to create a new entity type with default attributes for SCD2.
Create a New Entity Type
You can create the entity type with default attributes using either the new configuration UI or the old JSON file editor. Below, we use the new configuration UI.
Go to
CONFIG_ENTITY_DEFAULTS
package and createDIM_SCD2
entity type using the values below.Add default attributes to the
DIM_SCD2
entity type (ensure that the attributes already exist in Default Attributes; refer to the Create Default Attributes section if needed).In the Default Load Types tab, add
TRANSFORM_PERSIST
as a default load type for the table.Finally, go to the Package Details tab and commit your changes.
Create Default Attributes
The DIM_SCD2
entity type uses default attributes that do not exist in ADE by default, so they need to be created.
Go to CONFIG_ENTITY_DEFAULTS
package and open the Default Attributes tab. Create the following default attributes if they don’t already exist:
<entity_logical_name>_scd_id
<entity_logical_name>_id
datahash
valid_from
valid_to
validity_status
source_insert_time (used to replace the Run ID logic, though using Run ID logic is recommended whenever possible).