Skip to main content
Skip table of contents

Slowly Changing Dimension Type 2

A Slowly Changing Dimension Type 2 (SCD2) is a data warehousing technique used to track historical changes in dimension data. When an attribute changes, a new row is added with the updated data, while the old data is retained. This allows for a complete history of changes over time. Each record typically includes fields like valid_from, valid_to, validity_status, and may also contain a unique identifier (scd_id) to manage versioning and ensure data integrity.

Data Example: Tracking Historical Changes with SCD2

In this section, we’ll explore what happens when new data arrives for an existing id in a dimension table, and how historical changes are tracked using the SCD2 logic.

Initial Data Setup

Let’s start with a dimension table that holds data about taxi zones. The initial data looks like this:

sdc_id

id

datahash

valid_from

valid_to

status

location_id

zone

0b51a5

8c19f5

af1348

2024-09-02

9999-12-31

1

263

Yorkville West

8f8eab

d6baf6

7f500c

2024-09-02

9999-12-31

1

264

NV

fa0e6cf

e56954

f911c6

2024-09-02

9999-12-31

1

265

NA

Here, the validity_status is set to 1, indicating that these are the valid records for each taxi zone.

New Data

Suppose we receive updated data for taxi zones 264 and 265, along with completely new data for taxi zone 266. In an SCD2 setup, we don’t remove the old rows. Instead, we update the validity information of the previous rows and add the new rows to reflect the changes:

sdc_id

id

datahash

valid_from

valid_to

status

location_id

zone

0b51a5

8c19f5

af1348

2024-09-02

9999-12-31

1

263

Yorkville West

8f8eab

d6baf6

7f500c

2024-09-02

2024-09-03

0

264

NV

e87333

d6baf6

b6cb8b

2024-09-04

9999-12-31

1

264

N/A

fa0e6cf

e56954

f911c6

2024-09-02

2024-09-03

0

265

NA

3e4061

e56954

28a20d

2024-09-04

9999-12-31

1

265

N/A

76fa2b

f76640

8582f3

2024-09-04

9999-12-31

1

266

N/A

Here’s what happened:

  • Rows 2 and 4: The validity of the old rows for taxi zones 264 and 265 has been updated (valid_to changed to 2024-09-03) and their status set to 0, indicating they are no longer the current rows.

  • Rows 3 and 5: New rows were added for taxi zones 264 and 265 with updated data, valid from 2024-09-04.

  • Row 6: A new row was added for the new taxi zone 266.

  • Row 1: The first row remains unchanged because the data related to that taxi zone has not been updated.

Creating a Slowly Changing Dimension Type 2 Entity and Load

To create an SCD2 entity in ADE, it is recommended to create a new entity type with default attributes for SCD2. If you haven’t created it yet, go to the Create SCD2 Entity Type with Default Attributes section.

Once the entity type is created, follow these steps to create an SCD2 entity.

Create an Entity using DIM_SCD2

  1. Create a new entity using the DIM_SCD2 entity type and set the physical type to table.

    image-20240927-085221.png

  2. Create and incoming load and map the source entities. Note that only one entity can be mapped as a source to use the load template and the target_entity_attribute_list variables. In this example, we map a hub as a dependency and a satellite as a source.

    image-20240927-085413.png

  3. Set the attribute-level mappings for the satellite. For the valid_from attribute, you can use either a business timestamp from the data or a technical timestamp indicating when the data was added to the data warehouse, depending on your business requirements.
    For the valid_to and validity_status attributes, you need to create empty mappings; no source attribute or transformation type should be set. This ensures that the target_entity_attribute_list variables work correctly.

    image-20240927-090650.png

  4. Use the following load logic to populate the SCD2 table with the new rows and update the valid_to and validity_status attributes when necessary.

    Note! The template contains one section that must be edited manually. Additionally, this logic assumes that rows should only be added if the id appears in both the hub and satellite, and that the hub is named consistently. If the hub name is different, manual edits may be required.

    CODE
    WITH source_data AS (
        SELECT DISTINCT
            <target_entity_attribute_list_with_transform>
        FROM
            (
                SELECT dv_id AS hub_dv_id
                FROM <source_entity_schema>.H_<target_entity_logical_name>
            ) hub
        JOIN
            <source_entity_schema>.<source_entity_name> src_entity
        ON hub.hub_dv_id = src_entity.dv_id
    	WHERE
            -- select new, uninserted rows based on the insert timestamp
            -- this is used to replace Run ID logic; however, it is recommended to use the Run ID logic whenever possible
    		src_entity.source_insert_time > nvl((SELECT MAX(source_insert_time) FROM <target_schema>.<target_entity_name>), to_timestamp('2000-01-01', 'yyyy-MM-dd'))
    )
    MERGE INTO <target_schema>.<target_entity_name> trg USING (
      SELECT 
    	--
    	-- NOTE! Copy paste the target_entity_attribute_list here and REPLACE VALID_TO AND VALIDITY_STATUS WITH THESE:
    	--
        , CASE WHEN LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) IS NULL THEN TO_TIMESTAMP('9999-12-31', 'yyyy-MM-dd') ELSE LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) END AS valid_to
        , CASE WHEN LEAD(valid_from) OVER (PARTITION BY <target_entity_logical_name>_id ORDER BY valid_from) IS NULL THEN 1 ELSE 0 END AS validity_status
    	
      FROM (
        -- select only those target table rows whose validity information needs to be updated
        SELECT
            <target_entity_attribute_list>
        FROM
            <target_schema>.<target_entity_name> trg
            WHERE
                trg.validity_status = 1
                AND
                EXISTS (
                    SELECT
                        1
                    FROM
                        source_data src
                    WHERE
                        trg.<target_entity_logical_name>_id = src.<target_entity_logical_name>_id		
                )
        UNION ALL
        SELECT DISTINCT
            <target_entity_attribute_list>
        FROM
            source_data src_entity	
      )			
    ) src
    ON (
        trg.<target_entity_logical_name>_id = src.<target_entity_logical_name>_id
        AND
        trg.datahash = src.datahash
        AND
        COALESCE(trg.valid_from,'3000-01-01') = COALESCE(src.valid_from,'3000-01-01') 
    )
    WHEN MATCHED
        THEN UPDATE SET
            trg.valid_to = src.valid_to
            , trg.validity_status = src.validity_status
    WHEN NOT MATCHED
        THEN INSERT (
            <target_entity_attribute_list>
        )
        VALUES (
            <target_entity_attribute_list>
        )
    ;

Create SCD2 Entity Type with Default Attributes

This section describes how to create a new entity type with default attributes for SCD2.

Create a New Entity Type

You can create the entity type with default attributes using either the new configuration UI or the old JSON file editor. Below, we use the new configuration UI.

  1. Go to CONFIG_ENTITY_DEFAULTS package and create DIM_SCD2 entity type using the values below.

    image-20240927-063600.png
  2. Add default attributes to the DIM_SCD2 entity type (ensure that the attributes already exist in Default Attributes; refer to the Create Default Attributes section if needed).

    image-20240927-082224.png
  3. In the Default Load Types tab, add TRANSFORM_PERSIST as a default load type for the table.

    image-20240927-084145.png
  4. Finally, go to the Package Details tab and commit your changes.

Create Default Attributes

The DIM_SCD2 entity type uses default attributes that do not exist in ADE by default, so they need to be created.

Go to CONFIG_ENTITY_DEFAULTS package and open the Default Attributes tab. Create the following default attributes if they don’t already exist:

  • <entity_logical_name>_scd_id

    image-20240927-064614.png
  • <entity_logical_name>_id

    image-20240927-064703.png
  • datahash

    image-20240927-064832.png
  • valid_from

    image-20240927-064910.png
  • valid_to

    image-20240927-065024.png
  • validity_status

    image-20240927-065051.png
  • source_insert_time (used to replace the Run ID logic, though using Run ID logic is recommended whenever possible).

    image-20240927-065923.png
JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.