Data Vault Basics, by Dan Linstedt. The writing on this article reflects Data Valt’s creator, Dan Linstedt, view over its own creation.
Data Vault Basics
- In order to properly create and apply data vault modeling one must know and understand your business well.
- It is very easy to convert from 3rd normal form and from star schema to data vault, in this article we will focus on converting from 3rd normal form to data vault.
Data Vault Benefits
- Manage and enforce compliance to SOX, hippa, basill ii.
- Make it easier to find problems in your data that were not possible before.
- Reduce time for implementing changes.
- Merge new business unit into the organizations rapidly.
- Consolidate disparate data stores, ie: master data management (mdm)
- Scalable to hundreds of petabytes.
- Trace all data back to the source systems.
Data Vault Definition
Data Vault is a detail oriented, historical tracking set of normalized tables that support your business areas. It’s a breed between 3rd normal form and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.
In business terms, Data Vault is to adapt quickly, model the business accurately, and scale with the business needs. Data Vault is:
- a data integration architecture;
- A series of standards;
- And definitional elements or methods by way information is connected within an rdbms data store
- Extensive possibilities for data attribution.
- All data relationships are key driven.
- Relationships can be dropped and created on the fly.
- Data mining can discover new relationships between elements.
- Artificial intelligence can be utilized to rank the relevancy of the relationships to the user configured outcome.
Data Vault business case
Data Vault is based heavily on business process, so it is very important to see how the business model represents the data vault.
Below we show how to transit from one business case to a data vault physic model.
The business case: for this example, let’s examine a company with a marketing department that wants to build a sales campaign to sell slow moving products. The hope is that the customer will see the campaign, like the new low price of the product, and make a purchase from the company.
When the analyst re-iterates what they heard in the user interviews, the analyst says to the business users, that he heard them say: “marketing generates a sales invoice”. The business users quickly correct him (and the case model) and state that finance generates the invoices. Of course, for this example hopefully the customer then provides feedback to the company about the marketing campaign, and company friendliness.
In the corrected business case, we change the model from the representation above, to the representation below. Not a large change, just a difference in departments. Hopefully there is communication between finance and marketing (which is not shown in this case example).
Note: the model was correct because marketing department do not generates invoices, it is the finance department that does it!
- Make sure you have your business case adjusted to your reality.
- Work with your client to figure out which part of the business model to implement first.
- Focus for now is: campaigns, invoices, products, and customers. See the image below.
Upon further investigation, we discover that the business also wants:
- To track campaign effectiveness ratings and dates (length of time for the campaign)
- Dates and amounts on invoices and line-items
- Products and their availability dates, descriptions, stock quantities and defect reasons
- Customer address, contacts, and other demographic details
The next model that we build is based on the concepts of the data vault and ties directly back to the business descriptions of granularity of data, and keyed information. For the purposes of scope and this example, we will limit the logical data model to the area outlined in red.
The image below shows our first cut logical data vault data model.
- We take the business keys such as invoice number and product number and create HUB table for them
- Interactions between invoices and products are put in a LINK table named LINK_INVOICE_LINE_ITEM
- Line Items is represented as a SATELLITE but Dan states that it this is just because CUSTOMER is out of scope of this article
Here’s the result:
Here are the 5 “easy” steps of how to implement/develop your data vault model:
- Establish the business keys, hubs
- Establish the relationships between the business keys, links
- Establish description around the business keys, satellites
- Add standalone components like calendars and code/descriptions/lookups for decoding in data marts
- Tune for query optimization, add performance tables such as bridge tables and point-in-time structures
Northwind data model
We start with a simplified 3nf data model, often times the source data model represents the business as it stands today – at least it represents the business processes which collect the information.
1st step: Establish the business keys, hubs.
This can be challenging if the business keys are “smart-keys”, or composite keys made up of several identifiable relationships. Although, normalization to the nth degree is discouraged, it is good to identify and document the metadata for these composite keys.
2nd step: Establish the relationships between the business keys, links.
This process can be a little tricky sometimes – particularly if the data set says the business keys are “weak”, and can’t stand on their own, or are non-identifying (non-unique). In these cases, we can end up with one-legged link tables, which is something we work through to correct during the modeling process. Again, the link tables represent relationships, transactions, hierarchies, and define the grain of the data on the intersection.
Please note the importance of grain defined between order, employee, shipper and customer. The grain of this relationship (link table) is defined by the source system, however it is subtle and easy to miss.
The final image below shows a complete data vault, with all the hubs, links and satellites available. While the white-papers have defined point-in-time tables, and bridge tables, they are not required by the architecture
Data Vault and Data Compliance
The data vault provides a series of standard fields which track data changes by date, and where they came from – and the data vault is always modeled to hold the lowest possible grain of data from the source systems. That is to say: data is integrated, but not modified when it is stored in the data vault. The satellites are split by type of data and rate of change which allow reduced storage requirements, and increased traceability.
When following the standards in loading information into the data vault, the it staff automatically inherits “data compliance and traceability”. The business can point at the data mart and claim that it’s “wrong” today, and “right” tomorrow, but it can always show where the data came from, when it came in, and what it looked like before alteration/transformation, aggregation and cleansing – thus meeting the compliance initiatives becomes easier.