Data Vault Hands On – First approach

Well, so after much reading, it is time to get my hands on the job and start actually modeling some data vault models. Be aware that this is my very first attempt of turning a 3NF model in DV model.

For this I’m using SQL Server Northwind database, you can get it here: https://github.com/microsoft/sql-server-samples/blob/master/samples/databases/northwind-pubs/instnwnd.sql

The data model

Here’s the full Northwind 3NF model. Highlighted in green those tables that we are going to cover.

Note: for this current project I’m not considering the hierarchical relationship of Employees table.

So, the natural order would be creating hubs, links and sats, but before doing that I want first to create a simple ER diagram to reflect my understanding of the model. I am using diagrams.net for this.

As you can see only strong entities survived our ERD mapping. I’m mainly talking about the “order details” table, since it’s really a bridge table between orders and products. I believe that, in reality, the ERD precedes the 3NF model, but for this exercise I’m doing it anyway. Again, this is just for the sake of understanding how the business looks at the process.

Okay, now that we have our ERD and 3NF, and an understanding of the business, we can start data vault our model.

Creating the Data Vault

So, first step is to find all HUBS in our model, but what are HUBS?

HUB is a place where we can store business keys, they are like the representation of our business entities.

What about business keys? What is a business key?

Business key is an object that identify your business object, like IDs that identify something, like social security number identify a person, or a customer id identify a customer, a driver license number, a product id, etc.

Employees

By analyzing the employees table we find that EmployeeID is our business key. So, our HUB can look like this:

As you can see, our business key EmployeeID is not the PK of our HUB, the hash key of it that it is, meaning, we hash md5 the BK to use as our PK in our HUBS.

Besides the BK and it’s HK (hash key), we also have LOAD_DTS and REC_SRC as metadata in our HUB table.

LOAD_DTS: the load date indicates when the business key initially arrived in the data warehouse. The load date should be the same for all data that arrived in the same batch.

REC_SRC: the record source is hard-coded and applied to maintain traceability of the arriving data set. It is best to refrain from using a generalized record source, such as “SAP” for all SAP data. Instead, use the lowest level of granularity, for example “SAP.FINANCE.GL” to indicate the general ledger module in the financial application of SAP.

Now that we have created our first HUB and understand better how to model it, let’s create the remaining ones.

Other HUBS

So, the remaining HUBS required are: PRODUCTS, SHIPPERS, ORDERS, CUSTOMERS.

LINKS

We have identified our business keys and HUBs, now it’s time to create LINKs between them. LINKS represents relationships.

Here’s my approach on modeling LINKS for this model:

  1. We have a transaction between ORDERS and PRODUCTS, it means that many products can be on many orders, and that many orders can contain many products.

As you can see in the image, our LINK table take HK from all tables related, and that is important because it represent the grain on this relationship. We also add LOAD_DTS and REC_SRC as metadata fields.

Important to understand is that the PK HK_Products_Orders is a HK from the BK from Products and Orders. It is not a HK from the HK, it is a HK from the BK… hope it’s clear.

Moving on, the other LINK needs to represent the grain expected in the transaction of making an order, that’s why we need to include all other entities, aka HUBS, in that next LINK.

Once HUBS and LINKS are set, it’s time to get the SATELLITLES (SATS).

SATELLITES

Now that we have HUBS and LINKS put together, let’s work on our SATS. Here’s how SATS are composed:

  • They have composed PK: hash key from HUB + LOAD_DTS
  • SATS store all descriptive attributes/information

Best practice for designing SAT is to keep all descriptive attributes on the same grain on the same SAT, so if your attribute is in another grain, take it to a different SAT. Also, if your data changes in different pace, separate them in different SATS. You can also split SATS by data types and source, but that will depend on how you are approaching your model.

So, for Employees we have in orange attributes that change more slowly, whereas in green we have attributes that might be updated more often. ReportsTo is out of the scope of this article.

This is how our model should look like with our new SATS:

This is our final model:

Okay, but how do I create this model using SQL Server database?

First we need to create our HUB tables, then populate it.

Here’s how we can create our first HUB_EMPLOYEES table. Be advised that what I’m doing here is not best practice, in the real world you should have a stage area to load your data, and here we are not creating a staging area, I will address this topic somewhere in the future.

LOADING HUBS

The SQL code I used to create HUB_EMPLOYEES:

CREATE TABLE [raw].[HUB_EMPLOYEES] (

               [HK_EmployeeID] [varbinary](35) NOT NULL,

               [LOAD_DTS] [date] NOT NULL,

               [REC_SRC] [varchar](100) NOT NULL,

               [EmployeeID] [int] NOT NULL,

  CONSTRAINT [PK_HUB_EMPLOYEE] PRIMARY KEY CLUSTERED

 (

               [HK_EmployeeID] ASC

 ) ON [PRIMARY],

  CONSTRAINT [UK_HUB_EMPLOYEE] UNIQUE NONCLUSTERED

 (

               [EmployeeID] ASC

 ) ON [PRIMARY],

 ) ON [PRIMARY]

 ;

And here’s how populate our new HUB_EMPLOYEES:

INSERT INTO [raw].[HUB_EMPLOYEES]

  (

        [HK_EmployeeID]

       ,[LOAD_DTS]

       ,[REC_SRC]

       ,[EmployeeID]

  )

SELECT          HASHBYTES( ‘md5’, convert( varchar(35), STG.EmployeeID ) )

, GETDATE()

, ‘NORTHWIND’

, STG.EmployeeID

 FROM [dbo].[Employees] STG

 WHERE

 NOT EXISTS ( SELECT 1 FROM [RAW].[HUB_EMPLOYEES] HUB WHERE HUB.[EmployeeID] = STG.[EmployeeID] )

;

Note: be advised that in your INSERT INTO statement you will much likely have a where clause that filters the date stamp you want, that is to make sure you are only performing incremental load, and not full loads all the time.

You can repeat this same logic for all our hubs.

LOADING LINKS

Links represent our relationships, the connection between various entities. Therefore, link loading should occur after the hub loading.

For links you can use the following SQL code:

/*  LINK_EMP_CUST_SHIP_ORDERS  */

CREATE TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS](

            [HK_Emp_Cust_Ship_Orders] [varbinary](35) NOT NULL,

            [LOAD_DTS] [date] NOT NULL,

            [REC_SRC] [varchar](100) NOT NULL,

            [HK_EmployeeID] [varbinary](35) NOT NULL,

            [HK_OrderID] [varbinary](35) NOT NULL,

[HK_CustomerID] [varbinary](35) NOT NULL,

[HK_ShipperID] [varbinary](35) NOT NULL

CONSTRAINT [PK_LINK_EMP_CUST_SHIP_ORDERS] PRIMARY KEY CLUSTERED

(

            [HK_Emp_Cust_Ship_Orders] ASC

),

CONSTRAINT [UK_LINK_EMP_CUST_SHIP_ORDERS] UNIQUE NONCLUSTERED

(

            [HK_EmployeeID] ASC,

            [HK_OrderID] ASC,

[HK_CustomerID] ASC,

[HK_ShipperID] ASC

)

) ON [PRIMARY]

GO

/*  ESTABILISHING RELATIONSHIPS BETWEEN LINK AND HUBS  */

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS]  WITH CHECK ADD  CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_EMPLOYEES] FOREIGN KEY([HK_EmployeeID])

REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])

GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS]  WITH CHECK ADD  CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_ORDERS] FOREIGN KEY([HK_OrderID])

REFERENCES [raw].[HUB_ORDERS] ([HK_OrderID])

GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS]  WITH CHECK ADD  CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_CUSTOMER] FOREIGN KEY([HK_CustomerID])

REFERENCES [raw].[HUB_CUSTOMER] ([HK_CustomerID])

GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS]  WITH CHECK ADD  CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_SHIPPERS] FOREIGN KEY([HK_ShipperID])

REFERENCES [raw].[HUB_SHIPPERS] ([HK_ShipperID])

GO

Now, I will load my LINK, but be advised that this is not the recommended practice, I’m doing this way just so we can have the tables filled with data and you can see the end result. Here’s how I did it:

INSERT INTO [raw].[LINK_EMP_CUST_SHIP_ORDERS]

(

    [HK_Emp_Cust_Ship_Orders]

    ,[LOAD_DTS]

    ,[REC_SRC]

    ,[HK_EmployeeID]

    ,[HK_OrderID]

,[HK_CustomerID]

,[HK_ShipperID]

)

SELECT distinct

 HASHBYTES( ‘md5’, convert( varchar(35), e.[EmployeeID] ) + convert( varchar(35), o.[OrderID] ) + convert( varchar(35), c.[CustomerID] ) + convert( varchar(35), s.[ShipperID] ) )

,GETDATE()

,’NORTHWIND’

,HASHBYTES( ‘md5’, convert( varchar(35), e.[EmployeeID] ) )        — e.EmployeeID

,HASHBYTES( ‘md5’, convert( varchar(35), o.[OrderID] ) )                — o.OrderID

,HASHBYTES( ‘md5’, convert( varchar(35), c.[CustomerID] ) )        — c.CustomerID

,HASHBYTES( ‘md5’, convert( varchar(35), s.[ShipperID] ) )        — s.ShipperID

FROM Orders O

INNER JOIN Customers C ON C.CustomerID = O.CustomerID

INNER JOIN Employees E ON E.EmployeeID = O.EmployeeID

INNER JOIN Shippers  S ON S.ShipperID = O.ShipVia

order by 1,2,3,4

LOADING SATS

Now that we have HUBS and LINKS loaded, it’s time to create and load SATS.

Script to create and load SATS:
 

/* CREATING SATS */

CREATE TABLE [raw].[SAT_EMP_OTHERS](

               [HK_EmployeeID] [varbinary](35) NOT NULL,

               [LOAD_DTS] [date] NOT NULL,

               [REC_SRC] [varchar](100) NOT NULL,

               [Address]                [nvarchar](60) NULL,

               [City]                [nvarchar](60) NULL,

               [Region]                [nvarchar](60) NULL,

               [PostalCode]        [nvarchar](60) NULL,

               [Country]                [nvarchar](60) NULL,

               [HomePhone]        [nvarchar](60) NULL,

               [Extension]        [nvarchar](60) NULL,

               [Photo]                [image] NULL,

               [Notes]                [ntext] NULL,

               [PhotoPath]        [nvarchar](255) NULL,

  CONSTRAINT [PK_SAT_EMP_OTHERS] PRIMARY KEY CLUSTERED

 (

               [HK_EmployeeID] ASC,

               [LOAD_DTS] ASC

 ) ON [PRIMARY]

 ) ON [PRIMARY]

 GO

 ALTER TABLE [raw].[SAT_EMP_OTHERS]  WITH CHECK ADD  CONSTRAINT [FK_SatCustomers_HubCustomers] FOREIGN KEY([HK_EmployeeID])

 REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])

 GO

/* LOADING SATS */

INSERT INTO [raw].[SAT_EMP_OTHERS]

 (

[HK_EmployeeID]

,[LOAD_DTS]

,[REC_SRC]

,[Address]

,[City]

,[Region]

,[PostalCode]

,[Country]

,[HomePhone]

,[Extension]

,[Photo]                

,[Notes]                

,[PhotoPath]

 )

SELECT

HE.HK_EmployeeID

,’2022-04-12′ –‘GETDATE() REPLACEING GETDATE() HERE IN ORDER TO MIMIC THAT DATA WERE LOADED THE SAME DAY HUB DATA LOADED

,’NORTHWIND’

,E.Address

,E.City

,E.Region

,E.PostalCode

,E.Country

,E.HomePhone

,E.Extension

,E.Photo

,E.Notes

,E.PhotoPath

FROM Employees E

INNER JOIN RAW.HUB_EMPLOYEES HE ON HE.EmployeeID = E.EmployeeID

Here’s the load result:

This process repeats again for all SATS, in the end this is how our data vault model will look like in our SSMS diagram:

You can find all query script in this file:

https://github.com/p3dro-carvalho/p3dro-carvalho/blob/main/SQLQuery_loadingDV.sql

Well, it’s not a fully functional data-vault, I mean, considering all the automation required an whatnot, but it’s just a start. More to come in next posts! See you there.

References

https://www.udemy.com/course/modeling-data-warehouse-with-data-vault-for-beginners/

https://simplesqlbi.home.blog/2019/06/29/part-3-data-vault-for-beginners/

Publicado por Pedro Carvalho

Apaixonado por análise de dados e Power BI

Deixe uma resposta

Preencha os seus dados abaixo ou clique em um ícone para log in:

Logotipo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair /  Alterar )

Imagem do Twitter

Você está comentando utilizando sua conta Twitter. Sair /  Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair /  Alterar )

Conectando a %s

%d blogueiros gostam disto: