Facts consist of measurements, metrics and details of business processes such as sales, manufacturing and accounting. Facts are associated with dimensions which provide context and categorisation to facts.
Context: Publish zone
This article discusses dimensional modeling from the publish zone perspective. The assumption is that source system data is stored in a data warehouse zone and then loaded to the publish zone for end use purposes (multi-layered approach). I.e. source system data is not loaded directly to a dimensional model.
Creating a fact
Follow these steps to create a fact:
Create an incoming load and an entity mapping for each source entity for the fact. If you are using OVERRIDE load steps to define the load logic for the fact, attribute-level entity mappings are not mandatory. However, it is a good practice to create attribute-level mappings for metadata purposes. Creating an attribute-level mapping is also a good way to create attributes with the correct data types for the fact entity.
Define the load logic:
If your fact load has only one source entity, you might be able to use automatic load generation. In this case you need to define an attribute-level entity mapping including transformations. Use load options to modify the automatically generated logic and export SQL to preview the generated logic. Note that the DEPENDENCY entity mapping type can be used in some cases to allow automatic load generation with multiple dependencies.
If your fact load has multiple source entities, define the load logic in SQL with OVERRIDE load steps. Note that you can use variables to make the logic more generic and save your own load templates. Also see examples below.
Use the Export SQL feature to preview entity SQL & load SQL.
Note that you can modify the default entity and attribute naming used in the examples in the CONFIG_ENTITY_DEFAULTS configuration package to fit your naming conventions.
As the source system data is stored in the data warehouse zone implemented e.g. with the persistent staging or Data Vault modeling methodologies, you have the choice to create facts with physical type TABLE, VIEW or MATERIALIZED_VIEW with the following considerations:
If you are using hash keys in the data warehouse zone, the easiest option is to use them in the publish zone as well. This way key lookups between facts and dimensions can be avoided and loads can be executed concurrently. Depending on how the data warehouse has been implemented, using non-hashed business keys in the publish zone can also be simple.
Note that in general hash keys will perform worse than non-hashed natural keys or integer keys. For example, BI tools often struggle with hash keys when the data volume is high.
Consider if the fact can be fully loaded every time or if an incremental load is needed. Commonly facts are large and require an incremental load but if the row count is limited, a full load could be efficient enough and easier to implement.
An incremental load aims to process only new or changed records which is preferable especially with large fact tables. The critical parts of an incremental load logic are:
Selecting changed rows from the source entities:
Agile Data Engine has a built in feature for tracking changes between a source-target entity pair called Run ID Logic. You can use Run ID Logic even with multiple source entity mappings and OVERRIDE load steps, see example here. Similarly, some database products have their own change tracking features which in many cases can be utilised in loads.
Alternatively if not using Run ID Logic, a simple method for filtering incoming data is to filter a change timestamp column, for example:SQL
... WHERE meta_update_time >= datediff('day', -1, current_date()) ...
Existence check against the target entity:
New or changed rows from the source are inserted or merged (upsert) into the target fact table. Within the insert/merge statement there is usually an existence check which ensures that duplicates are not produced. The existence check can be a heavy operation when the target table is large and in some cases it can be narrowed down e.g. with Timeslot Logic, see example here.
Note that you can also define multiple loads per entity and schedule them freely. For example, you might define an incremental load for a large fact table which is used on a regular basis, and a separate full load which can be used when needed. Full loads are useful e.g. when:
Running an initial load, i.e. loading the entity for the first time with all historical data.
Recovering from an incident that requires data to be reloaded.
The incremental load logic compromises perfect correctness for better performance and, for example, does not catch rare changes to old data. In such cases there could be a full maintenance load scheduled less frequently than the incremental load that keeps the data quality on an acceptable level.
Also note that if your fact table is loaded incrementally from multiple sources and those loads are independent of each other, you can create separate loads for each source and schedule them freely.
Full loads executed with the truncate-insert pattern can cause problems with simultaneous end use as the table being loaded is empty after the truncate statement has been run and before the insert has finished. Depending on the features of the target database, various load patterns have been designed to avoid these problems. See the Using a logic view and an automatically generated load and Transactional truncate-insert template for Snowflake examples presented in the Dimension article.
Note that if you are using current views on top of the data warehouse zone (e.g. Data Vault) to query data to the publish zone and the underlying tables have a lot of data, the current view logic can cause performance issues. Performance-wise it is often a better solution to design the query using the underlying data warehouse tables, apply the necessary filters and joins and then implement the current logic in the publish zone load. When you run into performance issues, investigate the query plan and see if such optimisations are possible.
Full load (truncate-insert)
As discussed above, a full load is useful to have even if you are planning to load a fact incrementally on a regular basis. Also, if the fact is small enough, you can avoid more complicated incremental logic by fully loading it every time.
The most simple way is to create the fact entity and define a full load, entity mappings as instructed above, and OVERRIDE steps as follows:
Generic truncate table statement.
Insert statement containing the load logic for the fact.
If your target database supports syntax like INSERT OVERWRITE INTO… you can skip the truncate step.
This example is loading a taxi trip fact from two persistent staging tables.
Note that using the simple truncate-insert approach causes the table to be empty for the duration of the insert statement execution as discussed above.
Alternatively, you could apply one of the built-in load patterns as described in Using a logic view and an automatically generated load, or a load template as presented in Transactional truncate-insert template for Snowflake in the Dimension article. This template loads the data first into a temporary entity and then swaps the original table with it which also allows you to run smoke tests on the data before the swap operation.
Let’s define incremental loads for the same use case as in the full load example above. In the use case we have two source entities containing trip data from different taxi companies:
As these sources are not dependent on each other, we can create separate loads for the sources.
Create a fact table as instructed above, e.g. F_TAXI_TRIP.
Create a separate load from both source tables to the fact table.
Create entity mappings in both loads to add the attributes needed in the fact. Also add/map the following attributes that collect metadata and are used by the merge load pattern we are going to use (feel free to name them according to your conventions):
Target attribute technical type
Primary key for the fact
Indicates when a row was inserted
Indicates when a row was updated
All mapped source attributes
Used for comparing changes between the target entity and the incoming data.
Enable the OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE load pattern:NONE
OPT_INSERT_UPDATE_BY_DATAHASH_AS_MERGE = true
Enable Run ID Logic with OPT_USE_RUN_IDS:CODE
OPT_USE_RUN_IDS = true
Define a custom filter for incoming data using OPT_WHERE, for example:CODE
OPT_WHERE = meta_update_time >= datediff('day', -1, current_date())
Finally, use the Export SQL feature to preview the load SQL, for example:
MERGE INTO publish.F_TAXI_TRIP trg USING ( SELECT DISTINCT id AS trip_key ... , CURRENT_TIMESTAMP::timestamp_ntz AS load_time , CURRENT_TIMESTAMP::timestamp_ntz AS update_time , MD5(...) AS comparison_hash FROM psa.PSA_TAXI_TRIP_YELLOW src_entity WHERE -- Run ID Logic and OPT_WHERE applied here: stg_batch_id IN (<loadablerunids>) ) src ON ( trg.trip_key = src.trip_key ) WHEN MATCHED AND trg.comparison_hash != src.comparison_hash THEN UPDATE SET fact_key = src.fact_key ... , update_time = src.update_time , comparison_hash = src.comparison_hash WHEN NOT MATCHED THEN INSERT ( fact_key ... , load_time , update_time , comparison_hash ) VALUES ( src.fact_key ... , src.load_time , src.update_time , src.comparison_hash ) ;
Note that if you need to join multiple source entities together to load the fact table, you can copy the automatically generated merge logic and use it in an OVERRIDE load step. Also see this example on how to use Run ID Logic in a load with multiple entity mappings.