Fact

Facts consist of measurements, metrics and details of business processes such as sales, manufacturing and accounting. Facts are associated with dimensions which provide context and categorisation to facts.

Context: Publish zone

This article discusses dimensional modeling from the publish zone perspective. The assumption is that source system data is stored in a data warehouse zone and then loaded to the publish zone for end use purposes (multi-layered approach). I.e. source system data is not loaded directly to a dimensional model.

See persistent staging and Data Vault for guides on data warehouse zone modeling.

Creating a fact

Follow these steps to create a fact:

Select or create a package and create a new entity, choose type FACT and physical type TABLE, VIEW or MATERIALIZED_VIEW (see considerations in the notes below).
Create an incoming load and an entity mapping for each source entity for the fact. If you are using OVERRIDE load steps to define the load logic for the fact, attribute-level entity mappings are not mandatory. However, it is a good practice to create attribute-level mappings for metadata purposes. Creating an attribute-level mapping is also a good way to create attributes with the correct data types for the fact entity.
Define the load logic:

If your fact load has only one source entity, you might be able to use automatic load generation. In this case you need to define an attribute-level entity mapping including transformations. Use load options to modify the automatically generated logic and export SQL to preview the generated logic. Note that the DEPENDENCY entity mapping type can be used in some cases to allow automatic load generation with multiple dependencies.

OR

If your fact load has multiple source entities, define the load logic in SQL with OVERRIDE load steps. Note that you can use variables to make the logic more generic and save your own load templates. Also see examples below.

Use the Export SQL feature to preview entity SQL & load SQL.

Naming conventions

Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions.

Notes

Physical type

As the source system data is stored in the data warehouse zone implemented e.g. with the persistent staging or Data Vault modeling methodologies, you have the choice to create facts with physical type TABLE, VIEW or MATERIALIZED_VIEW with the following considerations:

Physical type	Considerations
Table	Data is materialised into a fact table, this creates a copy of the data in the target database. With a table you need to define the logic how the data is refreshed and whether it is a full load or an incremental load. See examples below. A table requires more storage, but offers the best query performance. Depending on the usage of the fact and the pricing model of the target database, a table might be the cheapest option if the fact is queried a lot as storage is usually cheap compared to compute. Create a table especially when the load is heavy and the fact is queried a lot. As facts are usually much larger than dimensions, consider table as the default physical type for facts.
View	Fact load logic is created into the target database as a view, i.e. not materialized. With a view you only need to define the load logic for the fact, not how the data is refreshed, also attribute definitions are optional (yet recommended). Therefore a view can be the easiest option especially during early development iterations. A view can be converted to a table later on in the development process. Views are also suitable for production use when the number of rows is small and query performance is good.
Materialized view	Materialized views store the results of the view definition query for better query performance. The use of materialized views and their data refresh policies differ per target database management system. Please refer to the documentation of your target database product for further details.

Keys

If you are using hash keys in the data warehouse zone, the easiest option is to use them in the publish zone as well. This way key lookups between facts and dimensions can be avoided and loads can be executed concurrently. Depending on how the data warehouse has been implemented, using non-hashed business keys in the publish zone can also be simple.

Note that in general hash keys will perform worse than non-hashed natural keys or integer keys. For example, BI tools often struggle with hash keys when the data volume is high.

Load patterns

Consider if the fact can be fully loaded every time or if an incremental load is needed. Commonly facts are large and require an incremental load but if the row count is limited, a full load could be efficient enough and easier to implement.

An incremental load aims to process only new or changed records which is preferable especially with large fact tables. The critical parts of an incremental load logic are:

Selecting changed rows from the source entities:
Agile Data Engine has a built in feature for tracking changes between a source-target entity pair called Run ID Logic. You can use Run ID Logic even with multiple source entity mappings and OVERRIDE load steps, see example here. Similarly, some database products have their own change tracking features which in many cases can be utilised in loads.
Alternatively if not using Run ID Logic, a simple method for filtering incoming data is to filter a change timestamp column, for example:
SQL
```
...
WHERE meta_update_time >= datediff('day', -1, current_date())
...
```
Existence check against the target entity:
New or changed rows from the source are inserted or merged (upsert) into the target fact table. Within the insert/merge statement there is usually an existence check which ensures that duplicates are not produced. The existence check can be a heavy operation when the target table is large and in some cases it can be narrowed down e.g. with Timeslot Logic, see example here.

Note that you can also define multiple loads per entity and schedule them freely. For example, you might define an incremental load for a large fact table which is used on a regular basis, and a separate full load which can be used when needed. Full loads are useful e.g. when:

Running an initial load, i.e. loading the entity for the first time with all historical data.
Recovering from an incident that requires data to be reloaded.
The incremental load logic compromises perfect correctness for better performance and, for example, does not catch rare changes to old data. In such cases there could be a full maintenance load scheduled less frequently than the incremental load that keeps the data quality on an acceptable level.

Also note that if your fact table is loaded incrementally from multiple sources and those loads are independent of each other, you can create separate loads for each source and schedule them freely.

Full loads executed with the truncate-insert pattern can cause problems with simultaneous end use as the table being loaded is empty after the truncate statement has been run and before the insert has finished. Depending on the features of the target database, various load patterns have been designed to avoid these problems. See the Using a logic view and an automatically generated load and Transactional truncate-insert template for Snowflake examples presented in the Dimension article.

Note that if you are using current views on top of the data warehouse zone (e.g. Data Vault) to query data to the publish zone and the underlying tables have a lot of data, the current view logic can cause performance issues. Performance-wise it is often a better solution to design the query using the underlying data warehouse tables, apply the necessary filters and joins and then implement the current logic in the publish zone load. When you run into performance issues, investigate the query plan and see if such optimisations are possible.

Examples

Full load (truncate-insert)

As discussed above, a full load is useful to have even if you are planning to load a fact incrementally on a regular basis. Also, if the fact is small enough, you can avoid more complicated incremental logic by fully loading it every time.

The most simple way is to create the fact entity and define a full load, entity mappings as instructed above, and OVERRIDE steps as follows:

Load step

Example

Notes

truncate_table

SQL

truncate table <target_schema>.<target_entity_name>

Generic truncate table statement.

insert_into

SQL

insert into <target_schema>.<target_entity_name>
select
    id as trip_key
    ,id_ratecode as ratecode_key
    ,id_taxi_zone_pickup as taxi_zone_pickup_key
    ,id_taxi_zone_dropoff as taxi_zone_dropoff_key
	,meta_insert_time
    ,meta_update_time
    ...
    ,'yellow' as taxi_company
from
    psa.psa_taxi_trip_yellow
union all
select
    id as trip_key
    ,id_ratecode as ratecode_key
    ,id_taxi_zone_pickup as taxi_zone_pickup_key
    ,id_taxi_zone_dropoff as taxi_zone_dropoff_key
	,meta_insert_time
    ,meta_update_time
    ...
    ,'green' as taxi_company
from
    psa.psa_taxi_trip_green

Insert statement containing the load logic for the fact.

If your target database supports syntax like INSERT OVERWRITE INTO… you can skip the truncate step.

This example is loading a taxi trip fact from two persistent staging tables.

Note that using the simple truncate-insert approach causes the table to be empty for the duration of the insert statement execution as discussed above.

Alternatively, you could apply one of the built-in load patterns as described in Using a logic view and an automatically generated load, or a load template as presented in Transactional truncate-insert template for Snowflake in the Dimension article. This template loads the data first into a temporary entity and then swaps the original table with it which also allows you to run smoke tests on the data before the swap operation.

Incremental load

Let’s define incremental loads for the same use case as in the full load example above. In the use case we have two source entities containing trip data from different taxi companies:

psa_taxi_trip_yellow
psa_taxi_trip_green

As these sources are not dependent on each other, we can create separate loads for the sources.

Create a fact table as instructed above, e.g. F_TAXI_TRIP.
Create a separate load from both source tables to the fact table.
Create entity mappings in both loads to add the attributes needed in the fact. Also add/map the following attributes that collect metadata and are used by the merge load pattern we are going to use (feel free to name them according to your conventions):

Source attribute	Transformation	Target attribute	Target attribute technical type	Notes
fact_key	NONE	fact_key	DV_HASHKEY	Primary key for the fact
-	CURRENT_TS	load_time	DV_LOAD_TIME	Indicates when a row was inserted
-	CURRENT_TS	update_time	DV_UPDATE_TIME	Indicates when a row was updated
All mapped source attributes	HASH	comparison_hash	DV_DATAHASH	Used for comparing changes between the target entity and the incoming data.

Enable the OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE load pattern:
NONE
```
OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE = true
```
Enable Run ID Logic with OPT_USE_RUN_IDS:
CODE
```
OPT_USE_RUN_IDS = true
```
OR

Define a custom filter for incoming data using OPT_WHERE, for example:
CODE
```
OPT_WHERE = meta_update_time >= datediff('day', -1, current_date())
```

Finally, use the Export SQL feature to preview the load SQL, for example:

SQL

MERGE INTO publish.F_TAXI_TRIP trg USING (
    SELECT DISTINCT
        id AS trip_key
        ...
        , CURRENT_TIMESTAMP::timestamp_ntz AS load_time
        , CURRENT_TIMESTAMP::timestamp_ntz AS update_time
        , MD5(...) AS comparison_hash
    FROM
        psa.PSA_TAXI_TRIP_YELLOW src_entity
    WHERE
        -- Run ID Logic and OPT_WHERE applied here:
        stg_batch_id IN (<loadablerunids>)
) src
ON (
    trg.trip_key = src.trip_key
)
WHEN MATCHED AND trg.comparison_hash != src.comparison_hash
    THEN UPDATE SET
        fact_key = src.fact_key
        ...
        , update_time = src.update_time
        , comparison_hash = src.comparison_hash
WHEN NOT MATCHED
    THEN INSERT (
        fact_key
        ...
        , load_time
        , update_time
        , comparison_hash
    )
    VALUES (
        src.fact_key
        ...
        , src.load_time
        , src.update_time
        , src.comparison_hash
    )
;

Note that if you need to join multiple source entities together to load the fact table, you can copy the automatically generated merge logic and use it in an OVERRIDE load step. Also see this example on how to use Run ID Logic in a load with multiple entity mappings.