Designing Packages

Guide objective

This guide presents good practices for designing Agile Data Engine packages.

Packages are the unit of commit in Agile Data Engine, which means that all changes made in the Designer are deployed to Runtime environments at the package level. Package design is at the core of the agile development and deployment concept of Agile Data Engine, so following the good practices presented in this article is crucial for a successful implementation.

Package design principles

Data warehouse layer specific packages

Packages should be kept data warehouse layer specific, so that entities from different layers (e.g. staging, Data Vault / EDW, publish) do not exist in the same package. This makes it clear what type of entities each package contains and which layer of the data warehouse architecture they belong to. It also helps prevent circular dependencies between packages, because entities in layer specific packages should mainly depend on entities in the previous layer and only sometimes on entities in the same layer.

Note that if you are using Data Vault modelling methodology, Raw Data Vault and Business Data Vault entities can coexist in the same packages.
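The layer rule and the dependency direction described above can be checked mechanically. The following is a minimal sketch, assuming package contents and package-level dependencies have been exported into plain Python dictionaries. The dictionary layouts, the layer labels and the function names are hypothetical and are not part of Agile Data Engine.

```python
# Hypothetical export: package name -> list of (entity name, layer) pairs.
# Raw and Business Data Vault entities share the label "dv" here because
# they are allowed to coexist in the same package.
PACKAGES = {
    "STG_CRM": [("LEAD", "staging"), ("STG_LEAD_CRM", "staging")],
    "DV_CUSTOMER_H": [("H_CUSTOMER", "dv"), ("S_CUSTOMER_CRM", "dv")],
    "P_MIXED": [("D_CUSTOMER", "publish"), ("S_CUSTOMER_ERP", "dv")],
}

# Hypothetical export: package name -> packages it depends on.
DEPENDENCIES = {
    "P_MIXED": ["DV_CUSTOMER_H"],
    "DV_CUSTOMER_H": ["STG_CRM"],
    "STG_CRM": [],
}


def find_mixed_layer_packages(packages):
    """Return {package: sorted layers} for packages spanning more than one layer."""
    mixed = {}
    for package, entities in packages.items():
        layers = {layer for _, layer in entities}
        if len(layers) > 1:
            mixed[package] = sorted(layers)
    return mixed


def find_cycle(dependencies):
    """Return one dependency cycle as a list of package names, or None."""
    visiting, done = [], set()

    def visit(package):
        if package in done:
            return None
        if package in visiting:
            # The path from the first occurrence back to this package is a cycle.
            return visiting[visiting.index(package):] + [package]
        visiting.append(package)
        for upstream in dependencies.get(package, ()):
            cycle = visit(upstream)
            if cycle:
                return cycle
        visiting.pop()
        done.add(package)
        return None

    for package in dependencies:
        cycle = visit(package)
        if cycle:
            return cycle
    return None


print(find_mixed_layer_packages(PACKAGES))  # {'P_MIXED': ['dv', 'publish']}
print(find_cycle(DEPENDENCIES))             # None
```

A check like this could run against whatever metadata export is available before changes are committed; how the dictionaries are populated is left open on purpose.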

Small and nimble package size

Agile Data Engine enables iterative development and deployment of a data warehouse. Continuous integration and deployment are based on packages, because a package is the unit for committing changes from the Designer.

Having a large number of entities in a single package makes parallel development more difficult and slows down continuous deployment. Especially when multiple developers or development teams work at the same time, large packages can lead to a situation where developers have to wait for other work in the same package to finish before they can deploy their own changes.

Large packages with hundreds of entities can also cause technical issues in the form of longer wait times during deployment. Generally, smaller packages are faster to deploy.
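As a rough illustration, package sizes can be monitored with a simple threshold check. This is a sketch under stated assumptions: the entity counts are made-up sample data, and the limit of 50 is an arbitrary example value that each team would need to choose for itself; the guide does not prescribe a number.

```python
# Hypothetical team-specific limit; not a value recommended by this guide.
MAX_ENTITIES = 50


def oversized_packages(entity_counts, limit=MAX_ENTITIES):
    """Return the names of packages with more entities than the limit, sorted."""
    return sorted(name for name, count in entity_counts.items() if count > limit)


# Made-up sample data: package name -> number of entities.
counts = {"STG_ERP_SALES": 240, "STG_CRM": 12, "DV_CUSTOMER_H": 8}
print(oversized_packages(counts))  # ['STG_ERP_SALES']
```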

Source system specific source and staging packages

Corresponding source and stage entities should be kept in the same source system specific package.

If a source system has a large number of source datasets, its entities should be split into several packages by domain or in some other logical way; otherwise, a single package would grow too large. A domain in this case could be, for example, an internal module name or a similar category within the source system.

Domain specific Data Vault packages

Data Vault entities should be divided into logical domain specific packages. Various approaches exist, and which one works best depends on the situation and on the number of active developers or development teams (see examples).

However, some good practices are common to all approaches. At a minimum, a hub entity and all of its satellites should be in the same package. If the hub drives a link entity, the link should be in the same package as well. Business Data Vault entities that contain complex business logic or calculations can be placed in a separate domain specific package so that they can be developed and maintained separately from other entities.
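The co-location rule for hubs, satellites and links can be expressed as a small check. The sketch below assumes a hypothetical `Entity` record in which each satellite or link names the hub it belongs to or is driven by; neither the record nor the function is part of Agile Data Engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical record describing one Data Vault entity."""
    name: str
    entity_type: str            # "HUB", "SAT" or "LINK"
    package: str
    hub: Optional[str] = None   # parent hub of a satellite, or driving hub of a link


def misplaced_entities(entities):
    """Return names of satellites and links that are not in their hub's package."""
    hub_package = {e.name: e.package for e in entities if e.entity_type == "HUB"}
    return [
        e.name
        for e in entities
        if e.hub in hub_package and hub_package[e.hub] != e.package
    ]


entities = [
    Entity("H_CUSTOMER", "HUB", "DV_CUSTOMER_H"),
    Entity("S_CUSTOMER_ERP", "SAT", "DV_CUSTOMER_H", hub="H_CUSTOMER"),
    Entity("L_CUSTOMER_SALES_REP", "LINK", "DV_SALES", hub="H_CUSTOMER"),
]
print(misplaced_entities(entities))  # ['L_CUSTOMER_SALES_REP']
```

In this made-up sample, the link is reported because it sits in a different package than the hub that drives it.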

Use case specific or single-entity publish packages

There are various valid approaches to publish package design as well. A common approach is to split publish packages by use case. There can often be multiple use cases built into separate publish entities on top of the same Data Vault entities and domains. It makes sense to keep these use cases in separate packages as this supports the development and maintenance of them individually.

In some cases, even single-entity publish packages are used. This is the highest level of separation possible and it especially serves environments with dozens of developers from multiple separate teams developing and operating the same data warehouse.

Examples

Staging packages

Staging packages contain the entities used to ingest source system data into the staging area of the data warehouse:

SOURCE entities defined as METADATA_ONLY, which describe the datasets in the source system.
STAGE tables loaded from source files, and other STAGE entities (tables, views) when transformations are needed before loading to the Data Vault/EDW layer.

In this example, staging packages are split by source system, and source systems with a large number of entities are split further by domain.

Package naming standard	Package name	Entity type	Entity name
STG_<source_system>_<domain>	STG_ERP_FINA	SOURCE	GL_ENTRY
STG_<source_system>_<domain>	STG_ERP_FINA	STAGE	STG_GL_ENTRY_ERP
STG_<source_system>_<domain>	STG_ERP_FINA	...	...
STG_<source_system>_<domain>	STG_ERP_SALES	SOURCE	SO_HEADER
STG_<source_system>_<domain>	STG_ERP_SALES	STAGE	STG_SO_HEADER_ERP
STG_<source_system>_<domain>	STG_ERP_SALES	SOURCE	SO_LINE_ITEM
STG_<source_system>_<domain>	STG_ERP_SALES	STAGE	STG_SO_LINE_ITEM_ERP
STG_<source_system>_<domain>	STG_ERP_SALES	...	...
STG_<source_system>	STG_CRM	SOURCE	LEAD
STG_<source_system>	STG_CRM	STAGE	STG_LEAD_CRM
STG_<source_system>	STG_CRM	SOURCE	OPPORTUNITY
STG_<source_system>	STG_CRM	STAGE	STG_OPPORTUNITY_CRM
STG_<source_system>	STG_CRM	...	...
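The naming standard in the table above can be applied consistently with small helper functions. This is a minimal sketch: the function names are hypothetical, and the stage entity pattern STG_<source_entity>_<source_system> is inferred from the example rows rather than stated as a rule in this guide.

```python
from typing import Optional


def staging_package_name(source_system: str, domain: Optional[str] = None) -> str:
    """Build STG_<source_system> or STG_<source_system>_<domain>."""
    parts = ["STG", source_system]
    if domain:
        parts.append(domain)
    return "_".join(part.strip().upper() for part in parts)


def stage_entity_name(source_entity: str, source_system: str) -> str:
    """Build STG_<source_entity>_<source_system>, as inferred from the examples."""
    return f"STG_{source_entity.strip()}_{source_system.strip()}".upper()


print(staging_package_name("erp", "fina"))   # STG_ERP_FINA
print(staging_package_name("crm"))           # STG_CRM
print(stage_entity_name("GL_ENTRY", "ERP"))  # STG_GL_ENTRY_ERP
```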

Data Vault packages

Data Vault packages contain Data Vault entities, mainly hubs, links and satellites. The table below presents two approaches to designing Data Vault packages:

Package naming standard	Package name	Entity type	Entity name
DV_<domain>_<subset>	DV_CUSTOMER_H	HUB	H_CUSTOMER
DV_<domain>_<subset>	DV_CUSTOMER_H	SAT	S_CUSTOMER_ERP
DV_<domain>_<subset>	DV_CUSTOMER_H	SAT	S_CUSTOMER_CRM
DV_<domain>_<subset>	DV_CUSTOMER_H	LINK	L_CUSTOMER_SALES_REP
DV_<domain>_<subset>	DV_CUSTOMER_H	S_SAT	SS_CUSTOMER_SALES_REP
DV_<domain>_<subset>	DV_CUSTOMER_H	...	...
DV_<domain>	DV_HUMAN_RESOURCES	HUB	H_EMPLOYEE
DV_<domain>	DV_HUMAN_RESOURCES	LINK	L_EMPLOYEE_SUPERVISOR
DV_<domain>	DV_HUMAN_RESOURCES	SAT	S_EMPLOYEE_CRM
DV_<domain>	DV_HUMAN_RESOURCES	SAT	S_EMPLOYEE_HRM
DV_<domain>	DV_HUMAN_RESOURCES	HUB	H_TIME_ENTRY
DV_<domain>	DV_HUMAN_RESOURCES	...	...

DV_CUSTOMER_H contains the customer hub, all of its satellites, all links driven by the customer business key and all status satellites related to those links.
DV_HUMAN_RESOURCES contains all hubs and related entities belonging to the human resources domain. This is feasible while the implementation is small and the number of human resources related entities is limited. The package could later be split into hub specific packages.

Publish packages

Publish packages contain the tables and views that, from the data warehousing perspective, serve the end use of the data, including facts, dimensions and flat entities. The table below shows an example of a single-entity package and a use case specific package:

Package naming standard	Package name	Entity type	Entity name
P_<schema>_<entity>	P_PUBLISH_D_CUSTOMER	DIM	D_CUSTOMER
P_<schema>_<use_case>	P_EXT_PUBLISH_SHARED	DIM	D_SALES_REP_CONTACT_DETAILS
P_<schema>_<use_case>	P_EXT_PUBLISH_SHARED	FACT	F_DELIVERIES
P_<schema>_<use_case>	P_EXT_PUBLISH_SHARED	...	...

Note that there can be multiple publish schemas and the schema name is included in the package name in this example.