Well, I had to stop my studies on Kimball’s The Data Warehouse ETL Toolkit for now, and it is because I have to dig as deep as possible in the Data Vault subject, which, I must say, is a very very interesting topic.
I love data modeling and Data Vault seems to be a strong choice for many big companies, also, it’s philosophy attracts me, so I’m eager to learn more about this. Therefore, if you are also interested in this matter, stick around!
This is a quick recap about Data Vault. All that is written here has been taken directly from the wikipedia: https://en.wikipedia.org/wiki/Data_vault_modeling
What is Data Vault?
- Data vault modeling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public domain modeling method.
- Data vault modeling is a modeling method to provide long-term historical storage of data.
- Every row in a data vault must be accompanied by record source and load date attributes.
- Developed in 2000
- Model resilient to change, by separating structural information from descriptive attributes.
- Designed to enable parallel loading as much as possible
Data vault philosophy
Data vault’s philosophy is that all data is relevant data, even if it is not in line with established definitions and business rules. If data are not conforming to these definitions and rules then that is a problem for the business, not the data warehouse. The determination of data being “wrong” is an interpretation of the data that stems from a particular point of view that may not be valid for everyone, or at every point in time. Therefore the data vault must capture all data and only when reporting or extracting data from the data vault is the data being interpreted.
Simple data vault model with two hubs (blue), one link (green) and four satellites (yellow):
NOTE: PK in the image above are SK.
As you can see on the HUBs, the blue ones, they hold a PK, BK and Load_Date, so take hub_employee for instance. Notice that the Satellites Contact and Details both have the same first 3 columns: pk, load_date, record_source.
- PK is the key of the employee
- Load_date is when that particular row was inserted
- Record_source states where that data from that satellite is coming from
That way, every time you have new data for employees you can simply create a new satellite table and relate it to the hub using the PK between those tables.
- Hubs contain a list of unique business keys with low propensity to change.
- Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key.
- Descriptive attributes for the information on the Hubs are stored in Satellite tables.
- Hubs cannot have duplicate data (BKs must be unique in a hub).
- Hubs should have at least one satellite.
Hubs contains at least:
- A surrogate key, used to connect the other structures to this table
- A business key, this key can consist of multiple fields, like an unique identifier.
- The record source, to see where that date came from.
- Optionally, you can also have metadata fields.
- Links are basically many-to-many join tables.
- Links can link to other links, however it is considered bad practice, because it introduces dependencies between links that make parallel loading more difficult.
Links contains at least:
- The surrogate keys for the hubs they connect.
- They own surrogate key
Hubs and Links form the structure of the model, but have no temporal attributes and hold no descriptive attributes. These are stored in satellites.
- Satellites are somewhat like facts on a star schema, but they hold description and facts.
- They have their hub or link parent PK/BK/SK, whatever is being used to relate them.
- Start and end date of each row, much like a type 2 SCD.
- Usually the attributes are grouped in satellites by source system. However, descriptive attributes such as size, cost, speed, amount or color can change at different rates, so you can also split these attributes up in different satellites based on their rate of change.
The satellite table below is linked on the DRIVERS_LINK, and that link is between CARS_HUB and PERSONS_HUB, so this satellite goes by the name of DRIVER_INSURANCE_SATELLITE or S_DRIVER_INSURANCE. So, we have:
Here’s the satellite table attributes:
This satellite contains:
- attributes that are specific to the relationship between the car and the person driving it.
- an indicator whether this is the primary driver.
- the name of the insurance company for this car and person (could also be a separate hub).
- a summary of the number of accidents involving this combination of vehicle and driver.
- Also included is a reference to a reference table called R_RISK_CATEGORY containing the codes for the risk category
- Reference tables are a normal part of a healthy data vault model.
- They are there to prevent redundant storage of simple reference data that is referenced a lot.
- Reference tables are referenced from Satellites, but never bound with physical foreign keys.
- They can be historical or have no history, if no history then stick to natural keys and do not create SKs here.
- Expect to use lots of reference tables.
- First load HUBS, creating SKs for any new BK
- Resolve LINKS between HUBS and create SKs for any new associations.
- Add (relate) all satellites to HUBS.
- Add (relate) all satellites to all LINKS.
- Since HUBS are not related to each other you can load all the hubs in parallel.
- Since LINKs are not related to each other you can load all the links in parallel.
- Since SATELLITES can related only to hubs and links, you can load all satellites in parallel.
- Data is never deleted from the data vault.
Problems might occur when links relate to links. The same for link for multiples hubs. Always try to avoid this.
Data vault and dimensional modeling
- Data vault modelled layer is normally focused on store date.
- Data vault modelled layer is not optimized for query performance, not it’s easy to query.
- Hubs and satellites related to those hubs can be treated as dimensions, for the purpose of enabling quick prototype of a dimensional model.
- Links and satellites related to those links can be treated as fact tables, for the purpose of enabling quick prototype of a dimensional model.
Note that while it is relatively straightforward to move data from a data vault model to a (cleansed) dimensional model, the reverse is not as easy, given the denormalized nature of the dimensional model’s fact tables, fundamentally different to the third normal form of the data vault.