Designing Packages
Guide objective
This guide helps you design Agile Data Engine packages according to good practices.
Packages are the unit of commit in Agile Data Engine, which means that all changes made in the Designer are taken into Runtime environments at the package level. Package design is at the core of the agile development and deployment concept of Agile Data Engine, so following the good practices presented in this article is crucial for a successful implementation.
Package design principles
Data warehouse layer specific packages
Packages should be kept data warehouse layer specific, so that entities from different layers (e.g. staging, Data Vault / EDW, publish) do not exist in the same package. This way it is clear what type of entities each package contains and which layer they belong to in the data warehouse architecture. It also helps prevent circular dependencies between packages, because entities in layer specific packages should mainly depend on entities in the previous layer and, sometimes, on entities in the same layer.
Note that if you are using Data Vault modelling methodology, Raw Data Vault and Business Data Vault entities can coexist in the same packages.
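The layer rule above can be checked mechanically once entities are listed together with their packages. The following is a minimal sketch, not an Agile Data Engine feature: the `LAYER_BY_ENTITY_TYPE` mapping, the function name and the input format are assumptions made for illustration, and the entity types are taken from the example tables in this guide. Raw and Business Data Vault entity types both map to the same layer, in line with the note above.

```python
from collections import defaultdict

# Hypothetical mapping from entity type to warehouse layer; adjust to your own model.
LAYER_BY_ENTITY_TYPE = {
    "SOURCE": "staging",
    "STAGE": "staging",
    "HUB": "data_vault",
    "LINK": "data_vault",
    "SAT": "data_vault",
    "S_SAT": "data_vault",
    "DIM": "publish",
    "FACT": "publish",
}


def find_mixed_layer_packages(entities):
    """Return {package: sorted layers} for packages that span more than one layer.

    `entities` is an iterable of (package_name, entity_type) pairs.
    """
    layers = defaultdict(set)
    for package, entity_type in entities:
        layers[package].add(LAYER_BY_ENTITY_TYPE[entity_type])
    return {package: sorted(found) for package, found in layers.items() if len(found) > 1}


entities = [
    ("STG_CRM", "SOURCE"),
    ("STG_CRM", "STAGE"),
    ("DV_CUSTOMER_H", "HUB"),
    ("DV_CUSTOMER_H", "SAT"),
    ("DV_CUSTOMER_H", "DIM"),  # deliberately misplaced publish entity
]
print(find_mixed_layer_packages(entities))
# {'DV_CUSTOMER_H': ['data_vault', 'publish']}
```

A check like this could run against an export of the package contents before a commit; how that export is produced is outside the scope of this sketch.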
Small and nimble package size
Agile Data Engine enables iterative development and deployment of a data warehouse. Continuous integration and deployment are based on packages, because a package is the unit of committing changes from the Designer.
Having a large number of entities in a single package will make parallel development more difficult and slow down continuous deployment. Especially when there are multiple developers or development teams working at the same time, large packages can lead to a situation where developers have to wait for other development in a single package to finish before they can proceed with deploying their changes.
Large packages with hundreds of entities can also cause technical issues with longer wait times when packages are being deployed. Generally, smaller packages are faster to deploy.
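To keep an eye on package growth, entity counts per package can be compared against a limit. This is only an illustrative sketch: the limit of 100 entities is an arbitrary assumption (the guide does not define an exact threshold), and the function name and the synthetic package names are hypothetical.

```python
from collections import Counter


def oversized_packages(entity_packages, max_entities=100):
    """Return {package: entity count} for packages with more than max_entities entities.

    `entity_packages` holds one package name per entity.
    `max_entities` is an assumed limit, not an Agile Data Engine setting.
    """
    counts = Counter(entity_packages)
    return {package: n for package, n in counts.items() if n > max_entities}


# Synthetic data: 250 entities in one package, 40 in another.
entity_packages = ["STG_ERP"] * 250 + ["STG_CRM"] * 40
print(oversized_packages(entity_packages))
# {'STG_ERP': 250}
```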
Source system specific source and staging packages
Corresponding source and stage entities should be kept in the same source system specific package.
If a source system has many source datasets, its packages should be split by domain or in some other logical way. Otherwise, a single source system package would grow too large. A domain in this case could be, for example, an internal module name or a similar category of the source system.
Domain specific Data Vault packages
Data Vault entities should be divided into logical domain specific packages. There are various approaches to this that might work better depending on the situation and the number of active developers or development teams (see examples).
However, some good practices are common to all approaches. At a minimum, a hub entity and all of its satellites should be in the same package. If the hub is driving a link entity, the link entity should be in the same package as well. If some Business Data Vault entities contain complex business logic or calculations, those entities can be placed in a separate domain specific package so that they can be developed and maintained separately from other entities.
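The co-location rule for hubs, satellites and driven links can be expressed as a simple check. The sketch below assumes a hypothetical `Entity` record with a `parent` field naming the hub that a satellite describes or that drives a link; this field and the `DV_CRM` package are made up for illustration and do not come from Agile Data Engine metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str  # e.g. "HUB", "SAT", "LINK"
    package: str
    # Hypothetical field: the hub a satellite describes, or the hub driving a link.
    parent: Optional[str] = None


def misplaced_entities(entities):
    """Return names of entities that are not in the same package as their parent hub.

    An entity whose parent is missing from `entities` is also reported.
    """
    package_of = {e.name: e.package for e in entities}
    return [
        e.name
        for e in entities
        if e.parent is not None and package_of.get(e.parent) != e.package
    ]


entities = [
    Entity("H_CUSTOMER", "HUB", "DV_CUSTOMER_H"),
    Entity("S_CUSTOMER_ERP", "SAT", "DV_CUSTOMER_H", parent="H_CUSTOMER"),
    Entity("S_CUSTOMER_CRM", "SAT", "DV_CRM", parent="H_CUSTOMER"),  # wrong package
    Entity("L_CUSTOMER_SALES_REP", "LINK", "DV_CUSTOMER_H", parent="H_CUSTOMER"),
]
print(misplaced_entities(entities))
# ['S_CUSTOMER_CRM']
```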
Use case specific or single-entity publish packages
There are various valid approaches to publish package design as well. A common approach is to split publish packages by use case. There can often be multiple use cases built into separate publish entities on top of the same Data Vault entities and domains. It makes sense to keep these use cases in separate packages as this supports the development and maintenance of them individually.
In some cases, even single-entity publish packages are used. This is the highest level of separation possible and it especially serves environments with dozens of developers from multiple separate teams developing and operating the same data warehouse.
Examples
Staging packages
Staging packages contain entities used for ingesting source system data into the staging area of a data warehouse:
SOURCE entities defined as METADATA_ONLY, which describe datasets in the source system
STAGE tables loaded from source files, and other STAGE entities (tables, views) where transformations are needed before loading to the Data Vault/EDW layer
In this example, staging packages are split by source system. Source systems with a large number of entities are additionally split by domain.
Package naming standard | Package name | Entity type | Entity name |
---|---|---|---|
STG_<source_system>_<domain> | STG_ERP_FINA | SOURCE | GL_ENTRY |
STG_<source_system>_<domain> | STG_ERP_FINA | STAGE | STG_GL_ENTRY_ERP |
STG_<source_system>_<domain> | STG_ERP_FINA | ... | ... |
STG_<source_system>_<domain> | STG_ERP_SALES | SOURCE | SO_HEADER |
STG_<source_system>_<domain> | STG_ERP_SALES | STAGE | STG_SO_HEADER_ERP |
STG_<source_system>_<domain> | STG_ERP_SALES | SOURCE | SO_LINE_ITEM |
STG_<source_system>_<domain> | STG_ERP_SALES | STAGE | STG_SO_LINE_ITEM_ERP |
STG_<source_system>_<domain> | STG_ERP_SALES | ... | ... |
STG_<source_system> | STG_CRM | SOURCE | LEAD |
STG_<source_system> | STG_CRM | STAGE | STG_LEAD_CRM |
STG_<source_system> | STG_CRM | SOURCE | OPPORTUNITY |
STG_<source_system> | STG_CRM | STAGE | STG_OPPORTUNITY_CRM |
STG_<source_system> | STG_CRM | ... | ... |
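The naming standards in the table above can be applied consistently with small helper functions. This is a minimal sketch with hypothetical function names. The `STG_<entity>_<source_system>` pattern for stage entity names is inferred from the example rows rather than stated as a rule, so treat it as an assumption.

```python
from typing import Optional


def staging_package_name(source_system: str, domain: Optional[str] = None) -> str:
    """Build STG_<source_system> or STG_<source_system>_<domain>."""
    parts = ["STG", source_system]
    if domain:
        parts.append(domain)
    return "_".join(part.upper() for part in parts)


def stage_entity_name(source_entity: str, source_system: str) -> str:
    """Build STG_<entity>_<source_system> (pattern inferred from the example rows)."""
    return f"STG_{source_entity}_{source_system}".upper()


print(staging_package_name("erp", "fina"))   # STG_ERP_FINA
print(staging_package_name("crm"))           # STG_CRM
print(stage_entity_name("gl_entry", "erp"))  # STG_GL_ENTRY_ERP
```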
Data Vault packages
Data Vault packages contain Data Vault entities, mainly hubs, links and satellites. The table below presents two approaches to designing Data Vault packages:
Package naming standard | Package name | Entity type | Entity name |
---|---|---|---|
DV_<domain>_<subset> | DV_CUSTOMER_H | HUB | H_CUSTOMER |
DV_<domain>_<subset> | DV_CUSTOMER_H | SAT | S_CUSTOMER_ERP |
DV_<domain>_<subset> | DV_CUSTOMER_H | SAT | S_CUSTOMER_CRM |
DV_<domain>_<subset> | DV_CUSTOMER_H | LINK | L_CUSTOMER_SALES_REP |
DV_<domain>_<subset> | DV_CUSTOMER_H | S_SAT | SS_CUSTOMER_SALES_REP |
DV_<domain>_<subset> | DV_CUSTOMER_H | ... | ... |
DV_<domain> | DV_HUMAN_RESOURCES | HUB | H_EMPLOYEE |
DV_<domain> | DV_HUMAN_RESOURCES | LINK | L_EMPLOYEE_SUPERVISOR |
DV_<domain> | DV_HUMAN_RESOURCES | SAT | S_EMPLOYEE_CRM |
DV_<domain> | DV_HUMAN_RESOURCES | SAT | S_EMPLOYEE_HRM |
DV_<domain> | DV_HUMAN_RESOURCES | HUB | H_TIME_ENTRY |
DV_<domain> | DV_HUMAN_RESOURCES | ... | ... |
DV_CUSTOMER_H contains the customer hub, all related satellites, all links that the customer business key drives, and all status satellites related to those links.
DV_HUMAN_RESOURCES contains all hubs and related entities belonging to the human resources domain. This is feasible while the implementation is small and the number of human resources related entities is limited. The package could later be split into hub specific packages.
Publish packages
Publish packages contain the tables and views through which data warehouse data is consumed, including facts, dimensions and flat entities. The table below shows an example of a single-entity package and a use case specific package:
Package naming standard | Package name | Entity type | Entity name |
---|---|---|---|
P_<schema>_<entity> | P_PUBLISH_D_CUSTOMER | DIM | D_CUSTOMER |
P_<schema>_<use_case> | P_EXT_PUBLISH_SHARED | DIM | D_SALES_REP_CONTACT_DETAILS |
P_<schema>_<use_case> | P_EXT_PUBLISH_SHARED | FACT | F_DELIVERIES |
P_<schema>_<use_case> | P_EXT_PUBLISH_SHARED | ... | ... |
Note that there can be multiple publish schemas and the schema name is included in the package name in this example.
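Because schema names can themselves contain underscores, a publish package name cannot be split into schema and suffix by position alone. The sketch below resolves this with a list of known schemas. The schema names `PUBLISH` and `EXT_PUBLISH` are inferred from the example package names, and the function name is hypothetical.

```python
# Schema names inferred from the example package names; replace with your own.
KNOWN_SCHEMAS = ("EXT_PUBLISH", "PUBLISH")


def split_publish_package(package_name, schemas=KNOWN_SCHEMAS):
    """Split P_<schema>_<suffix> into (schema, suffix) using a list of known schemas.

    The suffix is either an entity name or a use case name.
    """
    if not package_name.startswith("P_"):
        raise ValueError(f"not a publish package name: {package_name}")
    rest = package_name[len("P_"):]
    # Try longer schema names first so that a schema which is a prefix
    # of another schema cannot shadow it.
    for schema in sorted(schemas, key=len, reverse=True):
        prefix = schema + "_"
        if rest.startswith(prefix) and len(rest) > len(prefix):
            return schema, rest[len(prefix):]
    raise ValueError(f"unknown schema in package name: {package_name}")


print(split_publish_package("P_PUBLISH_D_CUSTOMER"))  # ('PUBLISH', 'D_CUSTOMER')
print(split_publish_package("P_EXT_PUBLISH_SHARED"))  # ('EXT_PUBLISH', 'SHARED')
```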