Informatica Powercenter 10.4.1 Basics – First Steps

Informatica PowerCenter is a 4-post series; see all posts by clicking here.

Before we start creating workflows and integrations, let's create our repository. Here's how.

Open PowerCenter Repository Manager.

Click to add a new repository.

Right-click your repository and add a domain.

Connect to your repository; Administrator / Administrator should do it.

Select your REP and then go Folder > Create.

After that, open your designer tool.

This is your Designer tool.

Select SOURCE ANALYZER:

Then create a new source. Go to Sources > Import from Database.

The following screen will appear; now click on the 3 dots (…).

Then you are going to add a new USER DSN source:

Now that you created your source, you can then use it to add a table to your “Source Analyzer”.

"Nome do proprietário" means "owner name" in English; it is the schema name, so in our example it's dbo. Select the table you want and click OK.

Now your source is created in the Source Analyzer; it will look like this:

Go to the Target Designer and repeat the same process. In the end your Target Designer will look like this.

Note: here the target table is T_ORDERS, a truncated version of the ORDERS table: same table, same structure, same fields, same data types, but no data in it.

Now that we have source and target we can go to the Mapping Designer and map our source and target.

Now, go to the Workflow Manager. In the Workflow Manager, go to the Workflow Designer.

Create your connections.

In our case it’s a relational connection using SQL Server as source and target.

Click new or add:

Here’s my source:

Here’s my target:

Now on your workflow designer, create your workflow.

Click on the SESSION icon and add it to your workflow.

Then click on the link task and link it.

Double-click the session "s_m_test" and adjust your source and target information.

Another option for this test is to enable the truncate-and-load option (truncate the target table before loading).

Now, right-click somewhere blank in the workflow and select Start Workflow.

The Workflow Monitor will open, and you can see the run succeeded. Notice the Task View tab. Now T_ORDERS is populated.

Reference

Install Informatica 10.4 on Windows | How to install Informatica PowerCenter 10.4 on windows

Informatica Powercenter 10.4.1 Basics – Install Client

Informatica PowerCenter is a 4-post series; see all posts by clicking here.

Now that the Informatica PowerCenter server is installed, let's proceed with the Informatica PowerCenter client. Here's how.

Go to the files you have downloaded and find this path:

\part1\informatica_1041_client_winem-64t\Client

That's it: next, next, finish, and your client will be installed.

Informatica Powercenter 10.4.1 Basics – Install Server

Informatica PowerCenter is a 4-post series; see all posts by clicking here.

Before installing the Informatica PowerCenter client (such as the PowerCenter Designer), you need to install the Informatica PowerCenter server. Here's how.

Go to the files you have downloaded and find this path:

\part2\informatica_1041_server_winem-64t\Server\

Double click install.exe.

On step 6, if you are using SQL Server like me, make sure that the TCP/IP protocol is enabled.

The user can be SA or any user with CONNECT, CREATE TABLE, and CREATE VIEW privileges. Read more about it here:

https://docs.informatica.com/data-integration/powercenter/10-4-1/installation-for-powercenter-and-data-quality/part-2–before-you-install-the-services/prepare-for-application-services-and-databases/model-repository-service/microsoft-sql-server-database-requirements.html

A key point here is to create a login named infa with password infa; it will be used here in the domain and later in the Repository Service.
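If you want to script that step, a minimal T-SQL sketch could look like the following; the infa_domain database name is my assumption, and the weak password only mirrors the tutorial, not anything you should use for real.

CREATE LOGIN infa WITH PASSWORD = 'infa', CHECK_POLICY = OFF;  -- tutorial-only password
GO
CREATE DATABASE infa_domain;  -- assumed name for the domain configuration repository DB
GO
USE infa_domain;
GO
CREATE USER infa FOR LOGIN infa;
GRANT CONNECT, CREATE TABLE, CREATE VIEW TO infa;  -- privileges mentioned in the docs above
GO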

Remember to restart your SQL Server services if you had to enable the TCP/IP protocol.

I’m using password Administrator@1 here.

Here I’m using password Administrator.

Also uncheck the second checkbox (it will appear checked by default).

Uncheck the default option.

Once installation is completed you will see this:

Click on the DESKTOP link to open your Informatica Powercenter page.

Use Administrator / Administrator to log in.

For some reason my REP wasn't working properly, so I deleted it and created a new one, like this:

More on how to create a new REP can be found in the Informatica documentation here:

https://docs.informatica.com/data-security-group/test-data-management/10-4-0/installation-guide/create-the-application-services/create-and-configure-the-powercenter-repository-service/create-the-powercenter-repository-service.html

And after that check this:

According to the documentation the REP should be in normal mode. More on this here:

https://docs.informatica.com/data-security-group/test-data-management/10-4-0/installation-guide/create-the-application-services/create-and-configure-the-powercenter-repository-service/after-you-create-the-powercenter-repository-service/run-the-powercenter-repository-service-in-normal-mode.html

The same process can be used to create an IS (Integration Service).

Here, again, my IS wasn't working, so I had to recreate it. Here's how I did it.

Here's the catch: you need to provide Administrator / Administrator as your username/password, not your DB credentials; at least that's how it worked for me.

More on IS here:

https://docs.informatica.com/data-security-group/test-data-management/10-4-0/installation-guide/create-the-application-services/create-and-configure-the-powercenter-integration-service/create-the-powercenter-integration-service.html

https://network.informatica.com/thread/48885

https://knowledge.informatica.com/s/article/115138?language=en_US

https://www.toolbox.com/tech/data-management/question/informatica-sf-34004-serivce-initialization-failed-error-071811/

Informatica Powercenter 10.4.1 Basics – Where to download?

Informatica PowerCenter is a 4-post series; see all posts by clicking here.

You can download INFORMATICA POWERCENTER at:

https://edelivery.oracle.com/osdc/faces/SoftwareDelivery

Search for: “Oracle Informatica PowerCenter and PowerConnect Adapters”

Select your software > click on top right “continue”.

On the new page, select your operating system > Continue.

On the new page > accept the terms.

On the new page select only the version you need (in our case, 10.4.1) > click Download.

After that, just follow the normal next-next-finish procedure using the downloaded software.

Data Vault Hands On – First approach

Well, after much reading, it's time to get my hands dirty and actually start building some data vault models. Be aware that this is my very first attempt at turning a 3NF model into a DV model.

For this I'm using the SQL Server Northwind database; you can get it here: https://github.com/microsoft/sql-server-samples/blob/master/samples/databases/northwind-pubs/instnwnd.sql

The data model

Here's the full Northwind 3NF model, with the tables we are going to cover highlighted in green.

Note: for this project I'm not considering the hierarchical relationship in the Employees table.

So, the natural order would be to create hubs, links, and sats, but before doing that I first want to create a simple ER diagram to reflect my understanding of the model. I am using diagrams.net for this.

As you can see, only strong entities survived our ERD mapping. I'm mainly talking about the "Order Details" table, since it's really a bridge table between orders and products. I believe that, in reality, the ERD precedes the 3NF model, but for this exercise I'm doing it anyway. Again, this is just for the sake of understanding how the business looks at the process.

Okay, now that we have our ERD and 3NF model, and an understanding of the business, we can start turning our model into a data vault.

Creating the Data Vault

So, the first step is to find all the HUBs in our model. But what are HUBs?

A HUB is where we store business keys; hubs are the representation of our business entities.

What about business keys? What is a business key?

A business key is an identifier of a business object: think of IDs that identify something, like a social security number identifying a person, a customer ID identifying a customer, a driver's license number, a product ID, etc.

Employees

By analyzing the employees table we find that EmployeeID is our business key. So, our HUB can look like this:

As you can see, our business key EmployeeID is not the PK of our HUB; its hash key is. In other words, we MD5-hash the BK and use that hash as the PK of our HUBs.

Besides the BK and its HK (hash key), we also have LOAD_DTS and REC_SRC as metadata in our HUB table.

LOAD_DTS: the load date indicates when the business key initially arrived in the data warehouse. The load date should be the same for all data that arrived in the same batch.

REC_SRC: the record source is hard-coded and applied to maintain traceability of the arriving data set. It is best to refrain from using a generalized record source, such as “SAP” for all SAP data. Instead, use the lowest level of granularity, for example “SAP.FINANCE.GL” to indicate the general ledger module in the financial application of SAP.

Now that we have created our first HUB and understand better how to model it, let’s create the remaining ones.

Other HUBS

So, the remaining HUBS required are: PRODUCTS, SHIPPERS, ORDERS, CUSTOMERS.

LINKS

We have identified our business keys and HUBs; now it's time to create LINKs between them. LINKs represent relationships.

Here’s my approach on modeling LINKS for this model:

  1. We have a transaction between ORDERS and PRODUCTS, meaning that many products can be on many orders and many orders can contain many products.

As you can see in the image, our LINK table takes the HK from every related hub, and that is important because it represents the grain of this relationship. We also add LOAD_DTS and REC_SRC as metadata fields.

It is important to understand that the PK HK_Products_Orders is a hash of the BKs of Products and Orders. It is not a hash of the hash keys; it is a hash of the BKs… hope that's clear.
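In SQL Server terms, that hash would be computed from the business keys themselves, something along these lines (the '|' separator is my own addition to keep distinct key combinations from colliding; the original post concatenates the values directly):

-- Hash of the concatenated BKs (ProductID, OrderID), not of the hubs' hash keys.
SELECT HASHBYTES( 'md5',
       CONVERT( varchar(35), od.ProductID ) + '|' + CONVERT( varchar(35), od.OrderID ) ) AS HK_Products_Orders
FROM [dbo].[Order Details] od;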

Moving on, the other LINK needs to represent the grain expected in the transaction of placing an order; that's why we need to include all the other entities, aka HUBs, in that next LINK.

Once HUBs and LINKs are set, it's time to get to the SATELLITES (SATs).

SATELLITES

Now that we have HUBS and LINKS put together, let’s work on our SATS. Here’s how SATS are composed:

  • They have a composite PK: the hash key from the HUB + LOAD_DTS
  • SATS store all descriptive attributes/information

A best practice for designing SATs is to keep all descriptive attributes of the same grain in the same SAT; if an attribute is at another grain, move it to a different SAT. Also, if your data changes at a different pace, separate it into different SATs. You can also split SATs by data type and source, but that will depend on how you are approaching your model.

So, for Employees, the attributes in orange change more slowly, whereas the attributes in green might be updated more often. ReportsTo is out of the scope of this article.

This is how our model should look with our new SATs:

This is our final model:

Okay, but how do I create this model using SQL Server database?

First we need to create our HUB tables, then populate them.

Here's how we can create our first HUB_EMPLOYEES table. Be advised that what I'm doing here is not best practice: in the real world you should have a staging area to load your data from, and here we are not creating one. I will address this topic sometime in the future.

LOADING HUBS

The SQL code I used to create HUB_EMPLOYEES:

CREATE TABLE [raw].[HUB_EMPLOYEES] (

               [HK_EmployeeID] [varbinary](35) NOT NULL,

               [LOAD_DTS] [date] NOT NULL,

               [REC_SRC] [varchar](100) NOT NULL,

               [EmployeeID] [int] NOT NULL,

  CONSTRAINT [PK_HUB_EMPLOYEE] PRIMARY KEY CLUSTERED

 (

               [HK_EmployeeID] ASC

 ) ON [PRIMARY],

  CONSTRAINT [UK_HUB_EMPLOYEE] UNIQUE NONCLUSTERED

 (

               [EmployeeID] ASC

 ) ON [PRIMARY],

 ) ON [PRIMARY]

 ;

And here's how to populate our new HUB_EMPLOYEES:

INSERT INTO [raw].[HUB_EMPLOYEES]
(
     [HK_EmployeeID]
    ,[LOAD_DTS]
    ,[REC_SRC]
    ,[EmployeeID]
)
SELECT
     HASHBYTES( 'md5', CONVERT( varchar(35), STG.EmployeeID ) )
    ,GETDATE()
    ,'NORTHWIND'
    ,STG.EmployeeID
FROM [dbo].[Employees] STG
WHERE NOT EXISTS ( SELECT 1
                   FROM [raw].[HUB_EMPLOYEES] HUB
                   WHERE HUB.[EmployeeID] = STG.[EmployeeID] )
;

Note: be advised that in a real INSERT INTO statement you will most likely have a WHERE clause that filters on the date stamp you want; that is to make sure you are only performing incremental loads, not full loads all the time.
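As a rough sketch of that idea, assuming a staging table with an extract timestamp column (the [stg].[Employees] table and its EXTRACT_DTS column are hypothetical; they don't exist in plain Northwind), the incremental filter could look like this:

-- Hypothetical incremental load: only pull staging rows extracted after the last hub load.
DECLARE @LastLoadDate date = ( SELECT MAX([LOAD_DTS]) FROM [raw].[HUB_EMPLOYEES] );

INSERT INTO [raw].[HUB_EMPLOYEES] ( [HK_EmployeeID], [LOAD_DTS], [REC_SRC], [EmployeeID] )
SELECT
     HASHBYTES( 'md5', CONVERT( varchar(35), STG.EmployeeID ) )
    ,GETDATE()
    ,'NORTHWIND'
    ,STG.EmployeeID
FROM [stg].[Employees] STG
WHERE STG.EXTRACT_DTS > @LastLoadDate   -- incremental window (assumed column)
  AND NOT EXISTS ( SELECT 1 FROM [raw].[HUB_EMPLOYEES] HUB
                   WHERE HUB.[EmployeeID] = STG.[EmployeeID] );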

You can repeat this same logic for all our hubs.
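For instance, a HUB_ORDERS load following the same pattern might look like this (this assumes a [raw].[HUB_ORDERS] table created with the same layout as HUB_EMPLOYEES, which the original post only shows in the diagrams):

-- Sketch only: assumes [raw].[HUB_ORDERS] exists with HK_OrderID, LOAD_DTS, REC_SRC, OrderID.
INSERT INTO [raw].[HUB_ORDERS] ( [HK_OrderID], [LOAD_DTS], [REC_SRC], [OrderID] )
SELECT
     HASHBYTES( 'md5', CONVERT( varchar(35), STG.OrderID ) )
    ,GETDATE()
    ,'NORTHWIND'
    ,STG.OrderID
FROM [dbo].[Orders] STG
WHERE NOT EXISTS ( SELECT 1 FROM [raw].[HUB_ORDERS] HUB
                   WHERE HUB.[OrderID] = STG.[OrderID] );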

LOADING LINKS

Links represent our relationships, the connections between the various entities. Therefore, link loading should occur after hub loading.

For links you can use the following SQL code:

/*  LINK_EMP_CUST_SHIP_ORDERS  */

CREATE TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS](
    [HK_Emp_Cust_Ship_Orders] [varbinary](35) NOT NULL,
    [LOAD_DTS]      [date]          NOT NULL,
    [REC_SRC]       [varchar](100)  NOT NULL,
    [HK_EmployeeID] [varbinary](35) NOT NULL,
    [HK_OrderID]    [varbinary](35) NOT NULL,
    [HK_CustomerID] [varbinary](35) NOT NULL,
    [HK_ShipperID]  [varbinary](35) NOT NULL,
    CONSTRAINT [PK_LINK_EMP_CUST_SHIP_ORDERS] PRIMARY KEY CLUSTERED
    (
        [HK_Emp_Cust_Ship_Orders] ASC
    ),
    CONSTRAINT [UK_LINK_EMP_CUST_SHIP_ORDERS] UNIQUE NONCLUSTERED
    (
        [HK_EmployeeID] ASC,
        [HK_OrderID]    ASC,
        [HK_CustomerID] ASC,
        [HK_ShipperID]  ASC
    )
) ON [PRIMARY]
GO

/*  ESTABLISHING RELATIONSHIPS BETWEEN LINK AND HUBS  */

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK
    ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_EMPLOYEES] FOREIGN KEY([HK_EmployeeID])
    REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])
GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK
    ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_ORDERS] FOREIGN KEY([HK_OrderID])
    REFERENCES [raw].[HUB_ORDERS] ([HK_OrderID])
GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK
    ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_CUSTOMER] FOREIGN KEY([HK_CustomerID])
    REFERENCES [raw].[HUB_CUSTOMER] ([HK_CustomerID])
GO

ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK
    ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_SHIPPERS] FOREIGN KEY([HK_ShipperID])
    REFERENCES [raw].[HUB_SHIPPERS] ([HK_ShipperID])
GO

Now I will load my LINK, but be advised that this is not the recommended practice; I'm doing it this way just so we can have the tables filled with data and you can see the end result. Here's how I did it:

INSERT INTO [raw].[LINK_EMP_CUST_SHIP_ORDERS]
(
     [HK_Emp_Cust_Ship_Orders]
    ,[LOAD_DTS]
    ,[REC_SRC]
    ,[HK_EmployeeID]
    ,[HK_OrderID]
    ,[HK_CustomerID]
    ,[HK_ShipperID]
)
SELECT DISTINCT
     HASHBYTES( 'md5', CONVERT( varchar(35), e.[EmployeeID] ) + CONVERT( varchar(35), o.[OrderID] )
                     + CONVERT( varchar(35), c.[CustomerID] ) + CONVERT( varchar(35), s.[ShipperID] ) )
    ,GETDATE()
    ,'NORTHWIND'
    ,HASHBYTES( 'md5', CONVERT( varchar(35), e.[EmployeeID] ) )   -- e.EmployeeID
    ,HASHBYTES( 'md5', CONVERT( varchar(35), o.[OrderID] ) )      -- o.OrderID
    ,HASHBYTES( 'md5', CONVERT( varchar(35), c.[CustomerID] ) )   -- c.CustomerID
    ,HASHBYTES( 'md5', CONVERT( varchar(35), s.[ShipperID] ) )    -- s.ShipperID
FROM Orders O
INNER JOIN Customers C ON C.CustomerID = O.CustomerID
INNER JOIN Employees E ON E.EmployeeID = O.EmployeeID
INNER JOIN Shippers  S ON S.ShipperID  = O.ShipVia
ORDER BY 1, 2, 3, 4

LOADING SATS

Now that we have HUBS and LINKS loaded, it’s time to create and load SATS.

Script to create and load SATS:
 

/* CREATING SATS */

CREATE TABLE [raw].[SAT_EMP_OTHERS](
    [HK_EmployeeID] [varbinary](35) NOT NULL,
    [LOAD_DTS]      [date]           NOT NULL,
    [REC_SRC]       [varchar](100)   NOT NULL,
    [Address]       [nvarchar](60)   NULL,
    [City]          [nvarchar](60)   NULL,
    [Region]        [nvarchar](60)   NULL,
    [PostalCode]    [nvarchar](60)   NULL,
    [Country]       [nvarchar](60)   NULL,
    [HomePhone]     [nvarchar](60)   NULL,
    [Extension]     [nvarchar](60)   NULL,
    [Photo]         [image]          NULL,
    [Notes]         [ntext]          NULL,
    [PhotoPath]     [nvarchar](255)  NULL,
    CONSTRAINT [PK_SAT_EMP_OTHERS] PRIMARY KEY CLUSTERED
    (
        [HK_EmployeeID] ASC,
        [LOAD_DTS] ASC
    ) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [raw].[SAT_EMP_OTHERS] WITH CHECK
    ADD CONSTRAINT [FK_SAT_EMP_OTHERS_HUB_EMPLOYEES] FOREIGN KEY([HK_EmployeeID])
    REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])
GO

/* LOADING SATS */

INSERT INTO [raw].[SAT_EMP_OTHERS]
(
     [HK_EmployeeID]
    ,[LOAD_DTS]
    ,[REC_SRC]
    ,[Address]
    ,[City]
    ,[Region]
    ,[PostalCode]
    ,[Country]
    ,[HomePhone]
    ,[Extension]
    ,[Photo]
    ,[Notes]
    ,[PhotoPath]
)
SELECT
     HE.HK_EmployeeID
    ,'2022-04-12'   -- replacing GETDATE() here to mimic data loaded on the same day the hub data was loaded
    ,'NORTHWIND'
    ,E.Address
    ,E.City
    ,E.Region
    ,E.PostalCode
    ,E.Country
    ,E.HomePhone
    ,E.Extension
    ,E.Photo
    ,E.Notes
    ,E.PhotoPath
FROM Employees E
INNER JOIN [raw].[HUB_EMPLOYEES] HE ON HE.EmployeeID = E.EmployeeID

Here’s the load result:

This process repeats for all the SATs; in the end, this is how our data vault model looks in the SSMS diagram:

You can find the full query script in this file:

https://github.com/p3dro-carvalho/p3dro-carvalho/blob/main/SQLQuery_loadingDV.sql

Well, it's not a fully functional data vault, I mean, considering all the automation required and whatnot, but it's a start. More to come in the next posts! See you there.

References

https://www.udemy.com/course/modeling-data-warehouse-with-data-vault-for-beginners/

https://simplesqlbi.home.blog/2019/06/29/part-3-data-vault-for-beginners/

Data Vault Summary of Official Standards Document Version 1.0

This text is a summary of the information I found relevant while reading the official Data Vault standards documentation. The documentation is a 41-page book, and you can find it here: https://www.amazon.com.br/gp/product/B009I8FW9Y/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1

Introduction

This text focuses on the standards for loading and modeling the Data Vault. However, Data Vault is now at version 2.0, so why am I reading this book? Well, according to my internet research this book is still the best way to get information about modeling a data vault: it cuts to the chase and focuses on this matter only, whereas the 2.0 books cover several other aspects that I'm not interested in right now, so I will save the 2.0 books for the near future.

Data Vault Modeling Standards

Entity types

Hubs: list of business keys

Links: list of relationships between business keys

Satellites (sats, for short): Descriptive information

Stand-alone: tables like calendar/time. They do not store history, and their keys are proliferated throughout the model. Also known as REFERENCE tables. Hubs, Links, and Sats should never be stand-alone entities.

Common field elements

Sequence ID (required) – if surrogate keys are used, it is the primary key of all tables.

Load Date Time Stamp (required) – an attribute present on Hubs, Links, and Sats; on Sats it should be part of the primary key.

Record Source (required) – the name of the source system that generated the information. Used to trace data back to the source.

Load End-dates (required) – this attribute represents the life of the record in a satellite.

HUBs rules

Definition of HUB: a list of uniquely identified business keys that have a very low propensity to change.

  • A hub must have at least 1 business key (BK).
  • A hub key should not be composite.
    • Exception: when two sources use the same key for different meanings.
  • A hub should support at least one Sat. Hubs with no Sats are probably bad design.
  • A hub BK must be either a system-created key or a true BK that is the single basis for finding information in the source system. BKs are often referred to as Natural Keys (NKs).
  • A hub can contain Surrogate Keys (SKs) if the database doesn’t work well with NKs.
  • A hub load-date-time attribute should not be part of the hub primary key structure.
  • A hub primary key cannot contain a record source.

LINKs rules

Definition of LINK: a list of uniquely identified composite relationships between hub keys. A link must combine two or more hubs, or it can use a single hub with two keys from that hub for hierarchical purposes. Hierarchical links should have at least one Sat to indicate the effectivity of the relationship.

  • A link must contain at least two hubs, or two linked primary keys.
  • A link can contain two keys imported from the same hub for a hierarchical or rolled-up relationship.
  • A link load-date-time attribute should not be part of the link primary key structure.
  • A link composite key must be unique.
  • A link may contain surrogate sequence key.
  • A link may contain 2 or more hub keys.
  • A link grain is determined by the number of imported hub or link parent keys.
  • A link is a transaction, or a hierarchy, or a relationship.
  • A link may have zero or more Sats attached. A hierarchical link must have at least one Sat.
  • A link must be at the lowest level of the grain.
  • A link must represent at most, 1 instance of a relationship at any given time.
  • A link cannot be associated to a reference structure (hub-sat, code-description).

SATELLITEs rules

Definition of SATELLITE: any data with a propensity to change, any descriptive data about a BK. Data in a Sat must be separated by type and rate of change.

  • A sat MUST have at least one hub or link PK imported.
  • A sat cannot be attached to more than one hub.
  • A sat MUST have load-date-time attribute as part of its PK.
  • A sat may contain a sequence identifier or an ordering identifier as part of its PK for uniqueness.
  • A sat must contain the record source for traceability.
  • A sat must contain at least one descriptive attribute about the hub or link it is related to.
  • A sat may contain aggregated attributes.
  • A sat purpose is to store data over time.
  • A sat may contain FK to reference tables.
  • A sat may contain FKs to a reference structure (a reference structure being a single hub with a sat; I assume it's not related to the data vault as a whole, like a reference table)
  • A sat may contain FK to calendar table, geography tables.
  • A sat may contain a load-end-date; it's not mandatory, but it's a best practice.

Naming conventions

Entities

  • Hubs, HUB_, H_
  • Links, LINK_, L_
    • Hierarchical Links, HLINK_, HIER_
  • Satellites, SAT_, S_

Fields

  • Record source, REC_SRC_, RSRC_
  • Sequence IDs, SEQ_ID_, SQN_
  • Date time stamps, DTS_
  • Date stamps, DT_
  • Time stamps, TM_
  • Load Date time stamps, LDDTS_
  • User, USR_
  • Occurrence number, OCNUM_
  • End date time stamps, LEDTS_

Avoiding Outer Joins

Insert an empty SAT record for every new hub key, meaning that even if your attributes are null, you at least have the hub key filled in. This allows queries to use equi-joins (inner joins) and avoid outer joins.
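A minimal sketch of that idea, reusing the hub and satellite from the hands-on post earlier in this series (the 'SYSTEM.DEFAULT' record source is my assumption):

-- Seed an "empty" satellite row for any hub key that has no satellite row yet,
-- so downstream queries can always inner-join hub -> sat.
INSERT INTO [raw].[SAT_EMP_OTHERS] ( [HK_EmployeeID], [LOAD_DTS], [REC_SRC] )
SELECT HUB.[HK_EmployeeID], GETDATE(), 'SYSTEM.DEFAULT'
FROM [raw].[HUB_EMPLOYEES] HUB
WHERE NOT EXISTS ( SELECT 1 FROM [raw].[SAT_EMP_OTHERS] SAT
                   WHERE SAT.[HK_EmployeeID] = HUB.[HK_EmployeeID] );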

Data Vault Loading Standards

The data warehouse should truly represent the source systems in their original format, and the only things that should be done to the data are basic defaulting and error reporting.

Batch Loading

For batch loading use a set of staging tables and load data to refresh; do NOT keep history in these staging tables, so every time you load data into your stage make sure it's truncated first. All processes must be consistent, repeatable, highly parallel, and scalable.

Batch Loading – Staging Definitions

The staging definitions should populate the required fields upon load. The definitions should follow the rules below.

Required fields

These fields should be populated on the way into the staging tables. Keep transformation to a minimum during the load of the stages in order to make loads as parallel and independent as possible. Keep the structure as CLOSE to the source system as possible.

  • Staging sequence number
  • Load date
  • Record source
  • Extract date (optional)
  • Delta flag / new row flag (optional)
  • Target table sequence number (optional)

Goals for stage table loads

  • Parallel – all stage loads should be independent of any other stage load
  • Run when data is ready on the source
  • Big network bandwidth, lots of memory, lots of disk space

Data type changes going into staging tables

  • Char/string to datetime
  • Char/string to numeric
  • Split over-loaded fields
  • Char to varchar, trim
  • Upper case all strings, or lower case

Best practice: if disk space is available, store both the original value in char mode (prior to the conversion) and the new value in the "domain aligned field".

Default values

These are just suggestions; you can use whatever values you want, so long as they are in agreement with your users.

  • Defaults for datetime: 1/1/1970 00:00:00.0000
  • Defaults for numeric: NULL
  • Defaults for char: (1)

Batch – HUB loading

Watch for duplicates across staging tables; also make sure the case chosen is consistent (upper/lower/etc.).

  • Gather a unique union of ALL DISTINCT keys in the staging tables that are targeted to that hub.
  • Check to see if your records already exist in your target table.
  • If your records already exist in the target table, filter them out from your load.
  • If your records don't already exist, assign the next sequence number and insert.

Batch – LINK loading

The goal of loading links is to have NO DUPLICATES across business key structures (sequence IDs). Each representation of an intersection or relationship must be unique.

  • Gather the business keys.
  • Default the NULL business keys to an unknown value or "zero record" in the hub (see the sketch after this list).
  • Go get the hub sequence ID for that individual business key. Repeat this for all hub keys.
  • Check to see if this collection of sequences exist in the link already, if so – filter them out.
  • If it doesn’t exist, generate a new sequence for the link and insert it.
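A rough sketch of the zero-record defaulting, reusing the hash-key style from the hands-on post instead of sequence IDs (Orders.ShipVia is nullable in Northwind; the 'UNKNOWN' value is an assumed zero-record key that would also have to exist in HUB_SHIPPERS, and the duplicate check against the link is omitted for brevity):

-- Default a NULL shipper business key to an assumed 'UNKNOWN' zero-record key
-- before hashing, so the link row can still be built.
SELECT DISTINCT
     HASHBYTES( 'md5', CONVERT( varchar(35), O.OrderID )
                     + COALESCE( CONVERT( varchar(35), O.ShipVia ), 'UNKNOWN' ) ) AS HK_Order_Shipper
    ,HASHBYTES( 'md5', CONVERT( varchar(35), O.OrderID ) )                        AS HK_OrderID
    ,HASHBYTES( 'md5', COALESCE( CONVERT( varchar(35), O.ShipVia ), 'UNKNOWN' ) ) AS HK_ShipperID
FROM [dbo].[Orders] O;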

Batch – SATELLITE loading

Satellite loading has two objectives: 1) load only delta changes; 2) split the work by type of data and rate of change. A rough SQL sketch of a delta-only load follows the list below.

  • Gather the link or hub surrogate key for the dependent satellite.
  • Join those SKs to a current load-date table, and compare the most recent data in the staging to the satellite.
  • Select only those rows that have changed.
  • Place the new load-date into the PK of the satellite.
  • Track the rows to be inserted, to make it easy to end-date old rows in the next pass, as well as easy to update the PIT and bridge tables.
  • Insert the new satellite records.
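As a sketch of what "only load delta changes" can look like with the Northwind tables from the hands-on post (only a few descriptive columns are compared here for brevity; a hash-diff column is the more common real-world approach, and Photo/Notes are skipped because image/ntext columns can't be compared directly):

-- Delta-only satellite load: insert a row only when at least one tracked
-- attribute differs from the most recent satellite row for that hub key.
INSERT INTO [raw].[SAT_EMP_OTHERS]
    ( [HK_EmployeeID], [LOAD_DTS], [REC_SRC], [Address], [City], [Region],
      [PostalCode], [Country], [HomePhone], [Extension], [PhotoPath] )
SELECT HE.HK_EmployeeID, GETDATE(), 'NORTHWIND',
       E.Address, E.City, E.Region, E.PostalCode, E.Country,
       E.HomePhone, E.Extension, E.PhotoPath
FROM [dbo].[Employees] E
INNER JOIN [raw].[HUB_EMPLOYEES] HE ON HE.EmployeeID = E.EmployeeID
OUTER APPLY ( SELECT TOP 1 S.*
              FROM [raw].[SAT_EMP_OTHERS] S
              WHERE S.HK_EmployeeID = HE.HK_EmployeeID
              ORDER BY S.LOAD_DTS DESC ) CUR
WHERE CUR.HK_EmployeeID IS NULL                                   -- no satellite row yet
   OR ISNULL(CUR.[Address], '')    <> ISNULL(E.[Address], '')     -- or something changed
   OR ISNULL(CUR.[City], '')       <> ISNULL(E.[City], '')
   OR ISNULL(CUR.[PostalCode], '') <> ISNULL(E.[PostalCode], '');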

Tracking and Traceability

  • Loading load dates
    • Every record in a batch cycle must contain exactly the same load-date and time UNLESS each record is tied by a load cycle ID. This way, the loaded group can be backed out and re-applied if necessary.
  • Load-end dates
    • Load end dates should be 1 second or 1 millisecond BEFORE the next active record, depending on the grain of the RDBMS engine. (I need to understand this better.) This provides a temporal view of the data, allowing queries to execute spatial-temporal analysis.
  • Last seen dates
    • Last seen dates are generated by the load process. They can be defined two ways, either as the last time we saw the data for the satellite, or the last time we saw the hub key on the feed.
  • Tracking satellite
    • The tracking satellite is a system generated, system managed satellite. It contains fields that are generated during the load cycles. It tracks the history of arrival and updates to the other tables like hub, links and satellites.
  • Record sources
    • Record sources are the metadata about where the data in that record came from. They can contain the source system, source application, and sometimes even the source function. The more granular, the better.

Reference

The Official Data Vault Standards Document (Version 1.0) (Data Warehouse Architecture) (English Edition)

Database Concepts, Third Normal Form (3NF)

As I was studying Data Vault concepts I found myself with a dire desire to understand more about the third normal form, 3NF. In this article I will present everything I've gathered so far about 3NF.

Why is 3NF important to Data Vault? Well, you certainly don't need to be a Jedi master of 3NF to perform Data Vault modeling, but understanding its characteristics and how the many different keys behave in an RDBMS can definitely help you create your model.

Before we can fully understand what 3NF is, we need to understand 1NF and 2NF.

Note: Please, see the reference at the bottom of this article for all material used to produce this piece.

What is 1NF?

In order to meet 1NF a table must meet the following criteria:

  1. The data are stored in a two-dimensional table.
  2. There are no repeating groups.
  3. It must have a primary key

What are repeating groups?

A repeating group is an attribute that has more than one value in each row of a table. A repeating group is directly analogous to a multivalued attribute in an ER diagram.

To achieve 1NF you need to rework your table, most likely creating new tables, so the data can be normalized.

Take the players table for this example. It has many multivalued attributes, which violates 1NF.

player_id   items
jane        1 amulets, 2 rings, 1 shields
john        12 arrows, 1 bow, 2 rings
mary        2 rings, 1 staff, 2 hood
lisa        1 rings, 2 dagger, 2 hood

In order to achieve 1NF we have to decompose this table into two tables, like these:

player_id
jane
john
mary
lisa

Here player_id is the primary key.

player_id   item_type   quantity
jane        amulets     1
jane        rings       2
jane        shields     1
john        arrows      12
john        bow         1
john        rings       2
mary        rings       2
mary        staff       1
mary        hood        2
lisa        rings       1
lisa        dagger      2
lisa        hood        2

Here the combination of player_id + item_type is the primary key.
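If you want to see the 1NF decomposition as actual SQL Server tables, a minimal sketch could be (names and data types are my own choices):

-- Two tables replacing the original multivalued "items" column.
CREATE TABLE players (
    player_id varchar(30) NOT NULL,
    CONSTRAINT PK_players PRIMARY KEY (player_id)
);

CREATE TABLE player_items (
    player_id varchar(30) NOT NULL,
    item_type varchar(30) NOT NULL,
    quantity  int         NOT NULL,
    CONSTRAINT PK_player_items PRIMARY KEY (player_id, item_type),
    CONSTRAINT FK_player_items_players FOREIGN KEY (player_id) REFERENCES players (player_id)
);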

What is 2NF?

In theoretical terms, second formal form relations are defined as follows:

  1. The relation is in first normal form.
  2. All non-key attributes are functionally dependent on the entire primary key. Meaning: no part-key dependency allowed.

Let’s enhance our last table, to add player rating:

player_id   item_type   item_quantity   player_rating
jane        amulets     1               Intermediate
jane        rings       2               Intermediate
jane        shields     1               Intermediate
john        arrows      12              Beginner
john        bow         1               Beginner
john        rings       2               Beginner
mary        rings       2               Advanced
mary        staff       1               Advanced
mary        hood        2               Advanced
lisa        rings       1               Advanced
lisa        dagger      2               Advanced
lisa        hood        2               Advanced

In the table above, item_quantity and player_rating are non-key attributes, meaning they do not compose or participate in the primary key of this table. We can therefore say that player_id + item_type determines item_quantity, the notation being:

  • { player_id , item_type } -> item_quantity

Meaning that, for every unique combination of player_id + item_type (i.e., for every primary key value) there is only one item_quantity. The next question is: are there other functional dependencies in this table? Yes, there are. Here they are:

  • player_id -> player_rating

But if both of these are valid functional dependencies, why is the first one fine and the second one a problem?

A functional dependency is a one-way relationship between two attributes such that, at any given time, for each unique value of attribute A, only one value of attribute B is associated with it.

And, as we can see here, in the functional dependency "player_id -> player_rating" the determinant is player_id alone, which is only part of the primary key { player_id , item_type }. That part-key (partial) dependency is exactly what 2NF forbids.

Now that we have identified all the functional dependencies (FDs), we can take the determinant of each of these FDs and create a separate table, a separate entity, using those determinants as primary keys. Like this:

Note on determinants: to clarify what our determinants are, we first need to know which functional dependencies we have, so here are the FDs:

  1. { player_id , item_type } -> item_quantity
  2. player_id -> player_rating

So we have the determinant { player_id , item_type } for FD #1 and player_id as our determinant for FD #2.

What is the definition of Determinant?

An attribute on which other attributes are functionally dependent.

player_id   player_rating
jane        Intermediate
john        Beginner
mary        Advanced
lisa        Advanced

player_id   item_type   item_quantity
jane        amulets     1
jane        rings       2
jane        shields     1
john        arrows      12
john        bow         1
john        rings       2
mary        rings       2
mary        staff       1
mary        hood        2
lisa        rings       1
lisa        dagger      2
lisa        hood        2

By doing that we ensure the tables are now in 2NF and that there are no part-key dependencies; in other words, every non-key attribute depends on the whole primary key.

What is the 3NF?

The theoretical definition of third normal form says:

  1. The relation is in second normal form.
  2. There are no transitive dependencies.

Another way to describe 3NF:

  • Each non-prime attribute in a table should depend on every candidate key.
  • It should never depend on part of a candidate key.
  • It should never depend on other non-prime attributes.

You probably know this already, the 3NF* mantra:

Every attribute in a table should depend on the key, the whole key, and nothing but the key.

So, let’s take this enhanced player table:

player_id   player_rating   player_skill_level
jane        Intermediate    6
john        Beginner        3
mary        Advanced        8
lisa        Advanced        9

We added player_skill_level, a scale that tells us how high the player's skill level is.

skill_level   1–3 = beginner, 4–6 = intermediate, 7–9 = advanced

Let's assume that John's skill level went up from 3 to 4 but, for some reason, player_rating wasn't properly updated, resulting in this table:

player_id   player_rating   player_skill_level
jane        Intermediate    6
john        Beginner        4
mary        Advanced        8
lisa        Advanced        9

As we can see, player_rating should be Intermediate, but it is showing Beginner due to an update anomaly.

It is important to say that it is easy for us, as humans, to see that there is an error in this table, but the table is perfectly fine regarding 2NF: it complies with 2NF, as all non-key attributes are functionally dependent on the entire primary key.

So the real question is, how are those non-key attributes dependent on the key?

player_skill_level is directly dependent on player_id. Like this:

  • player_id -> player_skill_level

player_rating, however, is dependent on player_id only indirectly, like this:

  • player_id -> player_skill_level -> player_rating
    • this is a transitive dependency, where player_rating depends on player_skill_level

A dependency like the one on player_rating is called a transitive dependency: a dependency of a non-key attribute on another non-key attribute.

To achieve 3NF we have to adjust our tables so they look like this:

player_id   player_skill_level
jane        6
john        4
mary        8
lisa        9

player_skill_level   player_rating
1                    Beginner
2                    Beginner
3                    Beginner
4                    Intermediate
5                    Intermediate
6                    Intermediate
7                    Advanced
8                    Advanced
9                    Advanced
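As SQL, the 3NF decomposition could be sketched like this (table and type choices are mine; the lookup table removes the transitive dependency by giving player_rating its own home):

-- Rating is now determined only by skill level, in its own table.
CREATE TABLE skill_ratings (
    player_skill_level int         NOT NULL,
    player_rating      varchar(20) NOT NULL,
    CONSTRAINT PK_skill_ratings PRIMARY KEY (player_skill_level)
);

CREATE TABLE player_skills (
    player_id          varchar(30) NOT NULL,
    player_skill_level int         NOT NULL,
    CONSTRAINT PK_player_skills PRIMARY KEY (player_id),
    CONSTRAINT FK_player_skills_ratings FOREIGN KEY (player_skill_level)
        REFERENCES skill_ratings (player_skill_level)
);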

*3NF can be described as:

Every non-key attribute in a table should depend on the key, the whole key, and nothing but the key.

*The enhanced version of 3NF, Boyce-Codd Normal Form (BCNF), can be described as:

Every attribute in a table should depend on the key, the whole key, and nothing but the key.

Note about Transitive Dependencies:

Transitive dependencies come from a functional dependency pattern such as: A -> B and B -> C; therefore A -> C. A transitive dependency therefore exists only when the determinant that is not the primary key is not a candidate key for the relation (remember: a relation here is the table itself).

When you have a transitive dependency in a 2NF relation, you should break the relation into two smaller relations, each of which has one of the determinants in the transitive dependency as its primary key. The attributes determined by the determinant become non-key attributes in each relation. This removes the transitive dependency and places the relation in third normal form.

What is Boyce-Codd Normal Form?

There is not much of a difference between 3NF and BCNF; in fact, in the real world most 3NF models are also BCNF models.

Note: If your relations are in third normal form and do not exhibit the special characteristics that BCNF, 4NF, and 5NF were designed to handle, then they are automatically in 5NF.

Before discussing BCNF, let's review the 3NF definition:

  • Each non-prime attribute in a table should depend on every candidate key.
  • It should never depend on part of a candidate key.
  • It should never depend on other non-prime attributes.

However, if you pay close attention to this 3NF definition, you will notice that when every attribute in a table is a prime attribute, problematic dependencies can still occur, and that, despite being accepted as 3NF, is a design flaw. That's why we have BCNF.

See this table:

release_year   popularity_ranking   movie_name        release_year_and_month
2008           1                    The Dark Knight   2008-07
2008           2                    Indiana Jones     2008-05
2008           3                    Kung Fu Panda     2008-06
2009           1                    Avatar            2009-12
2009           2                    Harry Potter      2009-07
2009           3                    Ice Age           2009-07

Here are the candidate keys:

  1. Movie_name
  2. { release_year , popularity_ranking }
  3. { release_year_and_month , popularity_ranking }

All attributes are prime attributes; there are no non-prime attributes.

The problem we have appears on the following functional dependency:

  • release_year_and_month -> release_year

This problem occurs because, if you happen to update the release_year_and_month column, you can end up with an update anomaly if the release_year column is not updated accordingly. Like this:

release_year   popularity_ranking   movie_name        release_year_and_month
2008           1                    The Dark Knight   2008-07
2008           2                    Indiana Jones     2008-05
2008           3                    Kung Fu Panda     2018-06
2009           1                    Avatar            2009-12
2009           2                    Harry Potter      2009-07
2009           3                    Ice Age           2009-07

However, even though there is an anomaly in this table, it still complies with 2NF and 3NF, and that's why BCNF enhances the 3NF definition.

To be in Boyce-Codd normal form, a relation must meet the following rules:

  1. The relation must be in third normal form.
  2. All determinants must be candidate keys.

What is a Determinant?

An attribute on which other attributes are functionally dependent.

What is a Candidate Key?

A candidate key is an attribute, or set of attributes, that uniquely identifies a row in the table. A more detailed way of putting it: a column or combination of columns (columns also meaning attributes) that could be used as the primary key of a relation (a relation here being the table itself).

What is a non-prime attribute?

A prime attribute is an attribute that belongs to at least one candidate key; therefore, non-prime attributes are attributes that don't belong to any candidate key.

As we know, release_year depends on release_year_and_month, and that's our problem, because release_year_and_month is a determinant (it determines release_year) but it is not a candidate key.

To fix this we can simply replace the release_year_and_month attribute with a release_month attribute.

release_year   popularity_ranking   movie_name        release_month
2008           1                    The Dark Knight   July
2008           2                    Indiana Jones     May
2008           3                    Kung Fu Panda     June
2009           1                    Avatar            December
2009           2                    Harry Potter      July
2009           3                    Ice Age           July

The table is now in BCNF, and here we can informally state the BCNF definition as:

Every attribute in a table should depend on the key, the whole key, and nothing but the key**.

** Keep in mind that “KEY” here refers to “every candidate key”.

References

Learn Database Normalization – 1NF, 2NF, 3NF, 4NF, 5NF

Learn Boyce-Codd Normal Form (BCNF)

https://www.sciencedirect.com/topics/computer-science/transitive-dependency

https://www.sciencedirect.com/topics/computer-science/third-normal-form

Database Keys Made Easy – Primary, Foreign, Candidate, Surrogate, & Many More

Jan L. Harrington, in Relational Database Design (Third Edition), 2009

Data Vault Basics from Dan Linstedt

Data Vault Basics, by Dan Linstedt. This article reflects the view of Data Vault's creator, Dan Linstedt, on his own creation.

Data Vault Basics

  • In order to properly create and apply data vault modeling, one must know and understand the business well.
  • It is very easy to convert from 3rd normal form and from star schema to data vault; in this article we will focus on converting from 3rd normal form to data vault.

Data Vault Benefits

  • Manage and enforce compliance with SOX, HIPAA, and Basel II.
  • Make it easier to find problems in your data that could not be found before.
  • Reduce the time needed to implement changes.
  • Merge new business units into the organization rapidly.
  • Consolidate disparate data stores, i.e., master data management (MDM).
  • Scalable to hundreds of petabytes.
  • Trace all data back to the source systems.

Data Vault Definition

Data Vault is a detail-oriented, historical-tracking set of normalized tables that support your business areas. It's a hybrid between 3rd normal form and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.

In business terms, Data Vault's purpose is to adapt quickly, model the business accurately, and scale with the business needs. Data Vault is:

  • a data integration architecture;
  • A series of standards;
  • And definitional elements or methods by which information is connected within an RDBMS data store;
  • Extensive possibilities for data attribution.
  • All data relationships are key driven.
  • Relationships can be dropped and created on the fly.
  • Data mining can discover new relationships between elements.
  • Artificial intelligence can be utilized to rank the relevancy of the relationships to the user configured outcome.

Data Vault business case

Data Vault is based heavily on business processes, so it is very important to see how the business model maps onto the data vault.

Below we show how to go from a business case to a physical data vault model.

The business case: for this example, let’s examine a company with a marketing department that wants to build a sales campaign to sell slow moving products. The hope is that the customer will see the campaign, like the new low price of the product, and make a purchase from the company.

When the analyst reiterates what he heard in the user interviews, he says to the business users that he heard them say: "marketing generates a sales invoice". The business users quickly correct him (and the case model) and state that finance generates the invoices. Of course, for this example, the customer then hopefully provides feedback to the company about the marketing campaign and company friendliness.

In the corrected business case, we change the model from the representation above, to the representation below. Not a large change, just a difference in departments. Hopefully there is communication between finance and marketing (which is not shown in this case example).

Note: the model was corrected because the marketing department does not generate invoices; the finance department does!

  • Make sure you have your business case adjusted to your reality.
  • Work with your client to figure out which part of the business model to implement first.
    • Focus for now is: campaigns, invoices, products, and customers. See the image below.

Upon further investigation, we discover that the business also wants:

  • To track campaign effectiveness ratings and dates (length of time for the campaign)
  • Dates and amounts on invoices and line-items
  • Products and their availability dates, descriptions, stock quantities and defect reasons
  • Customer address, contacts, and other demographic details

The next model that we build is based on the concepts of the data vault and ties directly back to the business descriptions of granularity of data, and keyed information. For the purposes of scope and this example, we will limit the logical data model to the area outlined in red.

The image below shows our first cut logical data vault data model.

  • We take business keys such as invoice number and product number and create HUB tables for them
  • Interactions between invoices and products are put in a LINK table named LINK_INVOICE_LINE_ITEM
  • Line Items is represented as a SATELLITE, but Dan states that this is just because CUSTOMER is out of the scope of this article

Here’s the result:

Here are the 5 “easy” steps of how to implement/develop your data vault model:

  1. Establish the business keys, hubs
  2. Establish the relationships between the business keys, links
  3. Establish description around the business keys, satellites
  4. Add standalone components like calendars and code/descriptions/lookups for decoding in data marts
  5. Tune for query optimization, add performance tables such as bridge tables and point-in-time structures

Northwind data model

We start with a simplified 3NF data model; oftentimes the source data model represents the business as it stands today, or at least it represents the business processes which collect the information.

1st step: Establish the business keys, hubs.

This can be challenging if the business keys are "smart keys", or composite keys made up of several identifiable relationships. Although normalization to the nth degree is discouraged, it is good to identify and document the metadata for these composite keys.

2nd step: Establish the relationships between the business keys, links.

This process can be a little tricky sometimes, particularly if the data set says the business keys are "weak" and can't stand on their own, or are non-identifying (non-unique). In these cases we can end up with one-legged link tables, which is something we work through to correct during the modeling process. Again, the link tables represent relationships, transactions, and hierarchies, and define the grain of the data at the intersection.

Please note the importance of the grain defined between order, employee, shipper, and customer. The grain of this relationship (link table) is defined by the source system; however, it is subtle and easy to miss.

The final image below shows a complete data vault, with all the hubs, links, and satellites available. While the white papers define point-in-time tables and bridge tables, they are not required by the architecture.

Data Vault and Data Compliance

The data vault provides a series of standard fields which track data changes by date and record where they came from, and the data vault is always modeled to hold the lowest possible grain of data from the source systems. That is to say: data is integrated, but not modified, when it is stored in the data vault. The satellites are split by type of data and rate of change, which allows reduced storage requirements and increased traceability.

When following the standards for loading information into the data vault, the IT staff automatically inherits "data compliance and traceability". The business can point at the data mart and claim that it's "wrong" today and "right" tomorrow, but IT can always show where the data came from, when it came in, and what it looked like before alteration/transformation, aggregation, and cleansing; thus meeting the compliance initiatives becomes easier.

References

http://danlinstedt.com/solutions-2/data-vault-basics/

Data Vault a glimpse

Well, I had to pause my studies on Kimball's The Data Warehouse ETL Toolkit for now, because I have to dig as deep as possible into the Data Vault subject, which, I must say, is a very, very interesting topic.

I love data modeling, and Data Vault seems to be a strong choice for many big companies; its philosophy also attracts me, so I'm eager to learn more about it. Therefore, if you are also interested in this matter, stick around!

This is a quick recap of Data Vault. Everything written here has been taken directly from Wikipedia: https://en.wikipedia.org/wiki/Data_vault_modeling

What is Data Vault?

  • Data vault modeling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public domain modeling method.
  • Data vault modeling is a modeling method to provide long-term historical storage of data.
  • Every row in a data vault must be accompanied by record source and load date attributes.
  • Developed in 2000
  • Model resilient to change, by separating structural information from descriptive attributes.
  • Designed to enable parallel loading as much as possible

Data vault philosophy

Data vault’s philosophy is that all data is relevant data, even if it is not in line with established definitions and business rules. If data are not conforming to these definitions and rules then that is a problem for the business, not the data warehouse. The determination of data being “wrong” is an interpretation of the data that stems from a particular point of view that may not be valid for everyone, or at every point in time. Therefore the data vault must capture all data and only when reporting or extracting data from the data vault is the data being interpreted.


Ontology

Simple data vault model with two hubs (blue), one link (green) and four satellites (yellow):

NOTE: the PKs in the image above are SKs (surrogate keys).

As you can see, the HUBs (the blue ones) hold a PK, a BK, and a Load_Date; take hub_employee, for instance. Notice that the satellites Contact and Details both share the same first 3 columns: PK, load_date, record_source.

  1. PK is the key of the employee
  2. Load_date is when that particular row was inserted
  3. Record_source states where that data from that satellite is coming from

That way, every time you have new data for employees you can simply create a new satellite table and relate it to the hub using the PK between those tables.

HUBS

  • Hubs contain a list of unique business keys with low propensity to change.
  • Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key.
  • Descriptive attributes for the information on the Hubs are stored in Satellite tables.
  • Hubs cannot have duplicate data (BKs must be unique in a hub).
  • Hubs should have at least one satellite.

Hubs contain at least:

  1. A surrogate key, used to connect the other structures to this table
  2. A business key; this key can consist of multiple fields, like a unique identifier.
  3. The record source, to see where that data came from.
  4. Optionally, you can also have metadata fields.

LINKS

  • Links are basically many-to-many join tables.
  • Links can link to other links, however it is considered bad practice, because it introduces dependencies between links that make parallel loading more difficult.

Links contain at least:

  • The surrogate keys for the hubs they connect.
  • Their own surrogate key

Satellites

Hubs and Links form the structure of the model, but have no temporal attributes and hold no descriptive attributes. These are stored in satellites.

  • Satellites are somewhat like facts in a star schema, but they hold both descriptions and facts.
  • They have their hub or link parent PK/BK/SK, whatever is being used to relate them.
  • Start and end date of each row, much like a type 2 SCD.
  • Usually the attributes are grouped in satellites by source system. However, descriptive attributes such as size, cost, speed, amount or color can change at different rates, so you can also split these attributes up in different satellites based on their rate of change.

Satellite example

The satellite table below is attached to DRIVERS_LINK, and that link is between CARS_HUB and PERSONS_HUB, so this satellite goes by the name DRIVER_INSURANCE_SATELLITE, or S_DRIVER_INSURANCE. So we have:

  1. S_driver_insurance
  2. L_drivers
  3. H_cars
  4. H_persons

Here’s the satellite table attributes:

This satellite contains:

  • attributes that are specific to the relationship between the car and the  person driving it.
  • an indicator whether this is the primary driver.
  • the name of the insurance company for this car and person (could also be a separate hub).
  • a summary of the number of accidents involving this combination of vehicle and driver.
  • Also included is a reference to a reference table called R_RISK_CATEGORY containing the codes for the risk category

Reference tables

  • Reference tables are a normal part of a healthy data vault model.
  • They are there to prevent redundant storage of simple reference data that is referenced a lot.
  • Reference tables are referenced from Satellites, but never bound with physical foreign keys.
  • They can be historical or have no history, if no history then stick to natural keys and do not create SKs here.
  • Expect to use lots of reference tables.

Loading practices

  1. First load HUBS, creating SKs for any new BK.
  2. Resolve LINKS between HUBS and create SKs for any new associations.
  3. Add (relate) all satellites to HUBS.
  4. Add (relate) all satellites to all LINKS.
  • Since HUBS are not related to each other, you can load all the hubs in parallel.
  • Since LINKS are not related to each other, you can load all the links in parallel.
  • Since SATELLITES relate only to hubs and links, you can load all satellites in parallel.
  • Data is never deleted from the data vault.

Problems might occur when links relate to other links. The same goes for a single link that spans many hubs. Always try to avoid this.

Data vault and dimensional modeling

  • The data-vault-modeled layer is normally focused on storing data.
  • The data-vault-modeled layer is not optimized for query performance, nor is it easy to query.
  • Hubs and satellites related to those hubs can be treated as dimensions, for the purpose of enabling quick prototype of a dimensional model.
  • Links and satellites related to those links can be treated as fact tables, for the purpose of enabling quick prototype of a dimensional model.

Note that while it is relatively straightforward to move data from a data vault model to a (cleansed) dimensional model, the reverse is not as easy, given the denormalized nature of the dimensional model's fact tables, which is fundamentally different from the third normal form of the data vault.

References

https://en.wikipedia.org/wiki/Data_vault_modeling

Data Warehouse ETL Toolkit – Chapter 03 – Part 1

Chapter 03 – Extracting

Integrating all the different data sources is usually one of the biggest challenges the ETL team faces, and without data there is no data warehouse. In this chapter Kimball teaches us how to interface with those various source systems, how to create your logical data map, and how to deal with some common source systems.

In this chapter Kimball also talks about how to deal with the subject of change, in changed data capture and deleted record capture, as we now know that our data warehouses are dynamically changing. This change subject is also revisited in many other chapters, such as the data cleaning chapter, the delivery chapters, and the operations chapter.

Part 1: The Logical Data Map

You need to plan your ETL from beginning to end even before starting your first ETL process. It is important to have a logical map: a clear understanding of which data will be needed to answer business needs and where that data comes from. Once you have that clear and documented, you may start building your physical implementation. So plan first, build later.

The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system.

Designing Logical Before Physical

You already know that you need a plan, but how do you create that plan? Here's how:

  1. Have a plan. You are the ETL team, so you need to understand what user needs, what kind of data source you have and document it all. This plan is of utmost importance, especially in the future as users and quality assurance (QA) team will want to validate what is happening with their data, so your plan needs to document everything in order to describe exactly what is done between the ultimate source system and the data warehouse.
  2. Identify data source candidates. Identify the likely candidate data sources you believe will help answer your user’s needs. Within these sources identify what elements you think will be central to your users. These data elements will be your subject in the data profiling step.
  3. Analyze source systems with a data-profiling tool. This data profiling step means that you will check your data source candidates and see if the data they have is adequate to fulfill your business needs. So, it is very important that whoever performs this step have technical knowledge of data, but also very good understand of business, and how this data will be used. Failure in this step can simple compromise the entire data warehouse project. If the data cannot support the business objectives, this is the time to find that out.
  4. Receive walk-through of data lineage and business rules. Here you should have the data lineage mapped out: what data comes from which data source, which extractions, transformations, and loads are performed, and what is going to be delivered.
  5. Receive walk-through of data warehouse data model. Just like the ETL team needs to understand all data lineage it also needs to understand how all that data will create the physical data model of the data warehouse. This includes understanding dimensional modeling concepts, table-by-table mapping, how dimensions, facts and any other table in the dimensional model work together to deliver the ETL solution. Also, never forget, the ETL system goal is to deliver data in the most effective way to end user tools.
  6. Validate calculations and formulas. It is helpful to make sure the calculations are correct before you spend time coding the wrong algorithms in your ETL process, so before creating any ETL process, double check everything, especially with your end users.

Inside the Logical Data Map

So, now you know you need a Logical Data Map, but how do you create one? This topic describes the logical data map’s structure.

The Logical Data Map contains: the data definition for all the data warehouse sources, the target data warehouse data model, and all details required to manipulate and transform that data from its original format to that of its final destination.

This is the Logical Data Map document:

I (Pedro) consider this spreadsheet, this Logical data map, pretty self-explanatory, however, some points are worth noticing:

  1. SCD (slowly changing dimension) type. For dimensions, this component indicates a Type-1, -2, or -3 slowly changing dimension approach. For example, within the customer dimension, the last name may require Type 2 behavior (retain history), while the first name may require Type 1 (overwrite). These SCD types are developed in detail in Chapter 5; see the sketch after this list for a quick illustration.
  2. Transformation. The exact manipulation required of the source data so it corresponds to the expected format of the target. This component is usually notated in SQL or pseudo-code.
  3. Table type. First dimensions, then facts.
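
To make the SCD types a bit more concrete, here is a minimal T-SQL sketch of Type 1 versus Type 2 handling, using a hypothetical DIM_CUSTOMER dimension and STG_CUSTOMER staging table of my own (this is not code from the book):

  -- Type 1: overwrite the attribute in place (no history kept)
  UPDATE dim
  SET dim.FIRST_NAME = src.FIRST_NAME
  FROM DIM_CUSTOMER dim
  JOIN STG_CUSTOMER src ON src.CUSTOMER_NK = dim.CUSTOMER_NK
  WHERE dim.CURRENT_FLAG = 'Y'
    AND dim.FIRST_NAME <> src.FIRST_NAME;

  -- Type 2: expire the current row, then insert the new version (history kept)
  UPDATE dim
  SET dim.ROW_END_DATE = GETDATE(), dim.CURRENT_FLAG = 'N'
  FROM DIM_CUSTOMER dim
  JOIN STG_CUSTOMER src ON src.CUSTOMER_NK = dim.CUSTOMER_NK
  WHERE dim.CURRENT_FLAG = 'Y'
    AND dim.LAST_NAME <> src.LAST_NAME;

  INSERT INTO DIM_CUSTOMER (CUSTOMER_NK, FIRST_NAME, LAST_NAME, ROW_START_DATE, CURRENT_FLAG)
  SELECT src.CUSTOMER_NK, src.FIRST_NAME, src.LAST_NAME, GETDATE(), 'Y'
  FROM STG_CUSTOMER src
  WHERE NOT EXISTS (SELECT 1 FROM DIM_CUSTOMER dim
                    WHERE dim.CUSTOMER_NK = src.CUSTOMER_NK
                      AND dim.CURRENT_FLAG = 'Y');

Chapter 5’s techniques are richer than this; the point is only that Type 1 overwrites while Type 2 expires the old row and inserts a new version.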

Pay very close attention to this document, the logical data map. If you read it closely you will notice that it reveals many hidden requirements for the ETL team that might otherwise have been overlooked.

Take a good look at Figure 3.1.

See that the data types between source and target for STATE get converted from 255 to 75 characters and there is nothing described in the TRANSFORMATION column about that, meaning that this conversion from 255 to 75 characters is made implicitly (an implied conversion) by the ETL tool. So, in the future, if values with more than 75 characters are created you will probably lose data. Besides that, your ETL can abort or fail the entire process with this kind of data overflow error. Implied conversions are common and notorious for sneaking up and destroying your processes. Make sure all your conversions are made explicitly.
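
As a hedged illustration of what an explicit conversion might look like, assuming a hypothetical STG_ORDERS staging table:

  -- Make the 255-to-75 character conversion explicit and catch overflows
  -- instead of letting the ETL tool truncate silently (names are illustrative)
  SELECT CASE
           WHEN LEN(STATE) > 75 THEN NULL   -- or route the row to a reject file
           ELSE CAST(STATE AS VARCHAR(75))
         END AS STATE
  FROM STG_ORDERS;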

The transformation column is a very important one. But don’t be completely focused on this column only; make sure you review the entire mapping before creating any ETL process. Most of the time transformations will be expressed using SQL, but this field can also be blank, meaning a straight load from source to target. Transformations can also be plain text describing what to do, like instructions to preload from a flat file, to base the load transformation on criteria outside of the database, or to reject known data anomalies into a reject file.

Pay close attention to the SCD types on your dimension columns. Be aware of exactly what kind of SCD you really need for each column of your dimensions. Also, make sure your users understand those SCDs and what to expect as a result in your final data warehouse by using them. Nonetheless, users quite often ask for SCDs to change once they start using the data warehouse, so make sure to run those change requests through your data warehouse project manager and the change management process.

Building the Logical Data Map

A key point of a data warehouse’s success is the cleanliness and cohesiveness of the data within it, and to achieve that we need a solid understanding of all the source systems that feed the data warehouse. The complete logical data mapping cannot exist until all source systems have been identified and analyzed. The result of such analysis is usually broken into two major phases:

  1. The data discovery phase
  2. The anomaly detection phase

Data Discovery Phase

By this time you already know what your target should look like, meaning how your end dimensional model should be delivered, so now what you have to do is find all the correct source systems. The ETL team is responsible for finding the correct, the true source. Thorough analysis can alleviate weeks of delays caused by developing the ETL process using the wrong source.

Collecting and Documenting Source Systems

Source systems are scattered around your company. It’s the ETL team’s responsibility to find them. Source systems can be referenced in pieces of documentation, like interview notes, reports, and the data modeler’s logical data mapping. Work with the system and business analysts to track down all source systems. It is the ETL team’s responsibility to keep track of the systems discovered and investigate their usefulness as a data warehouse source.

Keeping Track of the Source Systems

Now that you went through all this effort of assessing and finding your source systems, a good thing to do is document them.

The chart or spreadsheet above is named the source system tracking report. The maintenance of this list should be a collaborative effort between the ETL team and the data modeling team. If for some reason an entry becomes unusable, keep it; you might need it for future reference.

The source system tracking report can also indicate how long your project will take, or how many source systems you will tackle at once. Fields of notice:

  1. Subject area. Typically the name of the data mart that this system feeds.
  2. Business name. How the source system is commonly referred to by the users.
  3. Priority. A ranking used to determine future phases. The priority is usually set after the data warehouse bus matrix has been completed.

Determining the System-of-Record

First, know this: the system-of-record is, by definition, the originating source of data.

This is very important to know as part of the ETL team, because all the sources you ingest into your system should be the system-of-record. This is particularly important since most enterprises store data redundantly across many different systems. It is very common that the same piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in many versions of the same data. In pretty much all cases, data at the end of the lineage will not be the same as the data in the system-of-record.

Having the system of record data is especially important to build conformed dimensions in your data warehouse, so without it conforming dimensions will be nearly impossible. The further downstream you go from the originating data source, the more you increase the risk of extracting corrupt data.

Note: Derived Data. Should the ETL process accept calculated columns in the source system as the system-of-record, or are the base elements, the foundation of the derived data, desired?

The answer is: it depends. If you have additive calculated columns, go for it; but if your calculated columns are nonadditive measures that cannot be combined in queries by users, then you will need to either leave those calculations to be performed by users in the front room or create and handle the calculated columns yourself. But, of course, if you choose to create those nonadditive calculated columns yourself, it will fall under your responsibility to keep them up to date with the users’ business rules.

Analyzing the Source System: Using Findings from Data Profiling

Now that you know all your system-of-record source systems, it is time to analyze and understand what kind of data and model they have.

A good way to achieve that understanding is to check those source system’s Entity Relation (ER) diagrams. If you happen to not find an ER diagram, you might try to reverse engineer the database and see if you can get one. So, at this point use every tool you have at your disposal to understand your source data, another good thing to do is talk to a system guru, or data analyst, or business analyst that knows the system well.

Here are some characteristics that you need to keep an eye out for when investigating your ER diagrams:

  1. Unique identifiers and natural keys. Identifying unique identifiers is not that hard; they are usually the primary key. However, some other very important information is required, especially if you are dealing with slowly changing dimensions: the natural key. The natural key is what the business uses to uniquely describe the row.
  2. Data types. Do not let column names fool you. An order_number column may actually be a varchar, and the same usually happens with date or datetime columns, which very often come as strings instead of date types, so make sure to double check all your data types. Also, never forget leading zeroes, either on strings or integers; they might be important. Another important point is truncating data, even numbers: make sure you do not truncate, or, if you do truncate, that you have the users’ approval.
  3. Relationship between tables. Understanding how tables are related is vital to ensuring accuracy in joins while retrieving data. If you are lucky your ER diagram will present the relationship lines to you; if not, it’s a good practice to take all the heterogeneous source systems and map out the relationships yourself.
  4. Discrete relationships. You may find some lookup tables in your model. Carefully document the name of each group of rows and the associated tables and columns. This information will be needed while mapping many of the dimensions.
  5. Cardinality of relationships and columns. Knowing the cardinality of relationships is necessary to predict the result of your queries. In relational databases, all associated tables will be joined in one of the following ways:
    1. One-to-one. Used on super-type / sub-type scenarios and the practice of vertical table partitioning. One-to-one relationships can be identified by observing that the relationship is on the primary key of each table.
    2. One-to-many. This is the most common type of relationship found. It is easily identified by noting that a non-key attribute in a table refers to the primary key of another table.
    3. Many-to-many. This relationship usually involves three tables with two one-to-many relationships between them. More specifically, there are two tables with an associative table between them. The center or associative table has a compound primary key and two foreign keys, one to the primary key of one table and another to the primary key of the other table, as shown in the sketch after this list.
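
As an illustration of the many-to-many case, here is a hypothetical associative table between ORDERS and PRODUCTS and the join that resolves it (table and column names are my own, not from the book):

  -- Associative (bridge) table with a compound primary key and two foreign keys
  CREATE TABLE ORDER_PRODUCT (
      ORDER_ID   INT NOT NULL REFERENCES ORDERS (ORDER_ID),
      PRODUCT_ID INT NOT NULL REFERENCES PRODUCTS (PRODUCT_ID),
      QUANTITY   INT NOT NULL,
      PRIMARY KEY (ORDER_ID, PRODUCT_ID)
  );

  -- Resolving the many-to-many relationship takes two one-to-many joins
  SELECT o.ORDER_ID, p.PRODUCT_NAME, op.QUANTITY
  FROM ORDERS o
  JOIN ORDER_PRODUCT op ON op.ORDER_ID = o.ORDER_ID
  JOIN PRODUCTS p ON p.PRODUCT_ID = op.PRODUCT_ID;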

Data Content Analysis

You already know your ER diagrams and have all the correct sources to work with. Now it’s time to dig deeper and analyze the data content. What is inside each table? How is the data quality of each column? Pay close attention to anomalies in the data that can compromise your data warehouse, such as:

  1. NULL values. First, know this: in a relational database NULL is not equal to NULL. Joining two or more tables based on a column that contains NULL values will cause data loss! So make sure you check and deal with every NULL value you may find in your source, and try, as much as possible, to get rid of every NULL value. The ETL process normally replaces those NULL values with default values (see the sketch after this list).
  2. Dates in nondate fields. Dates can be found in many different formats. Make sure you check all your data columns, understand their pattern and conform each and every one of them, before loading into the data warehouse.
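
A minimal sketch of both anomalies, assuming hypothetical staging tables and SQL Server syntax:

  -- Joining on a nullable column silently drops rows (NULL never equals NULL),
  -- so replace NULLs with an agreed default value before or during the join
  SELECT o.ORDER_ID, c.CUSTOMER_NAME
  FROM STG_ORDERS o
  JOIN STG_CUSTOMERS c
    ON ISNULL(o.CUSTOMER_CODE, 'UNKNOWN') = ISNULL(c.CUSTOMER_CODE, 'UNKNOWN');

  -- Conforming a date stored as text in a nondate field (assuming a dd/mm/yyyy pattern)
  SELECT TRY_CONVERT(DATE, ORDER_DATE_TXT, 103) AS ORDER_DATE
  FROM STG_ORDERS;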

Collecting Business Rules in the ETL Process

ETL business rules are more technical, more complex, more detailed. So make sure you document them properly. See in the example below the difference between the data modeler’s business rules and the ETL team’s:

Status Code— Data Model team: The status is a four-digit code that uniquely identifies the status of the product. The code has a short description, usually one word, and a long description, usually one sentence.


Status Code— ETL team: The status is a four-digit code. However, there are legacy codes that were only three digits that are still being used in some cases. All three-digit codes must be converted to their four-digit equivalent code. The name of the code may have the word OBSOLETE embedded in the name. OBSOLETE needs to be removed from the name and these obsolete codes must have an obsolete flag set to ‘Y’. The description should always be in sentence case, regardless of the case used when entered into the source system.
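
Purely as an illustration of how the ETL team’s version of the rule might be coded (my own reading of the example; the padding logic, the column names, and the STG_STATUS table are assumptions, not the book’s code):

  SELECT
      RIGHT('0000' + STATUS_CODE, 4) AS STATUS_CODE,                       -- pad legacy 3-digit codes to 4 digits
      LTRIM(RTRIM(REPLACE(STATUS_NAME, 'OBSOLETE', ''))) AS STATUS_NAME,   -- strip the word OBSOLETE from the name
      CASE WHEN STATUS_NAME LIKE '%OBSOLETE%' THEN 'Y' ELSE 'N' END AS OBSOLETE_FLAG,
      UPPER(LEFT(STATUS_DESC, 1)) + LOWER(SUBSTRING(STATUS_DESC, 2, LEN(STATUS_DESC))) AS STATUS_DESC  -- sentence case
  FROM STG_STATUS;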

Even though those rules are of a technical nature, the rule makers are still the business. The ETL team cannot make up rules; it is up to the ETL architect to translate user requirements into usable ETL definitions. Also, as you discover undocumented data anomalies, make sure to share them with your users so new rules on how to deal with each of those anomalies can be set, properly documented, approved, and signed off.

Kimball does not state where to store those business rules. Some ideas I have: store them in your ETL tool as metadata, or store them in your logical data map as a new “business rules” column.

Integrating Heterogeneous Data Sources

Until now everything we’ve been talking about refers to source data, meaning we were talking about going to the data source and getting it. However, besides that, we also need to integrate data. By integrating data we mean conforming, so we are talking about conforming dimensional attributes and facts. You cannot have the same attribute with two different meanings, or two different labels that will lead your users to errors or misinterpretation of the data. The same happens for facts: you cannot have the same KPI being calculated differently.

Failing to conform will mean the failure of your data warehouse project. In the data warehouse, conformed dimensions are the cohesive design that unifies disparate data systems scattered throughout the enterprise. Nonetheless, the process of conforming dimensions and facts involves technical and political issues, as you will need to align with your end users on which data, or which meaning, should prevail over the others.

Important note: when you have dimensions being populated by several distinct sources, it is important to include the unique identifier from each of those sources in the target dimension in the data warehouse. Those identifiers should be viewable by end users to ensure peace of mind that the dimension reflects their data that they can tie back to in their transaction system.

Chapter 5 discusses loading dimensions in more detail, but here are some key techniques for loading conformed dimensions in a disparate source system environment:

  1. Identify the source systems. We have talked about this already, so make sure you have all your good source (system of record) mapped out.
  2. Understand the source systems (data profiling). We have talked about this already as well, so make sure you run your data profiling processes in each and every source you identified as required for your data warehouse project.
  3. Create record matching logic. Now that you know your sources and what kind of data and metadata you have in them, it is time to match what goes with what. Here you have to design which table, which data, talks to which data, meaning you have to be able to say “I can join this with that, I can do this other join here and there”, and so on. But beware: joining can sometimes be simple, but once you are dealing with heterogeneous data sources you will most likely need to infer how data should be joined (fuzzy relationship logic). See the sketch after this list.
  4. Establish survivorship rules. This means that if you have a customer table in your accounts receivable, production control, and sales systems, the business must decide which system has overriding power when attributes overlap.
  5. Establish nonkey attribute business rules. You will probably find attributes that you need to use because your users demand them, but those attributes do not come from the system-of-record; nevertheless, they will be used to enhance your final conformed dimension. When this happens, make sure you assign business rules to those attributes. Documentation and publication of the data lineage metadata is crucial to prevent doubt in the integrity of the data warehouse when users don’t see what they expect to see.
  6. Load conformed dimension. The final task of the data-integration process is to physically load the conformed dimension. This step is where you consider the slowly changing dimension (SCD) type and update late-arriving data as necessary. Consult Chapter 5 for details on loading your conformed dimensions.
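
Here is a small sketch of steps 3 and 4, assuming two hypothetical customer staging tables where the sales system wins over accounts receivable whenever attributes overlap (table and column names are my own):

  -- Step 3: match records across sources (an exact match on a cleaned natural key here;
  -- real heterogeneous sources often need fuzzy matching instead)
  SELECT ar.TAX_ID, ar.CUSTOMER_NAME AS AR_NAME, s.CUSTOMER_NAME AS SALES_NAME
  FROM STG_AR_CUSTOMER ar
  JOIN STG_SALES_CUSTOMER s
    ON UPPER(LTRIM(RTRIM(ar.TAX_ID))) = UPPER(LTRIM(RTRIM(s.TAX_ID)));

  -- Step 4: survivorship; the sales system wins whenever both sources provide the attribute
  SELECT COALESCE(s.TAX_ID, ar.TAX_ID)               AS TAX_ID,
         COALESCE(s.CUSTOMER_NAME, ar.CUSTOMER_NAME) AS CUSTOMER_NAME,
         COALESCE(s.EMAIL, ar.EMAIL)                 AS EMAIL
  FROM STG_SALES_CUSTOMER s
  FULL OUTER JOIN STG_AR_CUSTOMER ar
    ON ar.TAX_ID = s.TAX_ID;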

SOURCE:

Ralph Kimball, Joe Caserta- The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.

Data Warehouse ETL Toolkit – Chapter 02

Chapter 02 – ETL Data Structures

In this chapter Kimball starts once again talking about staging data, which he says (sic) in this context staging means writing to disk. He also reminds us that he recommends 4 staging points, one between each major step of the ETL (ECCD). This chapter focuses on describing general design principles for the staging area and the planning and design standards needed to set up your staging areas. Kimball also says that metadata plays a very important part in supporting the whole staging process.

To Stage or Not to Stage

Ultimately it’s the project’s choice whether data will be staged or not. Data can be staged to files or to tables, or it can be handled in memory only and not staged at all. It is always important to consider the balance between I/O and in-memory processing, as performance in retrieving and accessing data is usually of high importance to any ETL project.

There are two major points to consider when deciding whether you are going to stage or not:

  1. Getting data from source as fast as possible
  2. Recover from failure without restarting the ETL from the starting point

Consider the following reasons for staging data before it is loaded into the data warehouse:

  1. Recoverability; it’s good practice to stage data as soon as it has been extracted from the source and then again after each major transformation step. By implementing those recovery points your ETL process won’t have to hit the source again if the ETL fails. When staging data purely for recovery purposes, the data should be stored as sequential files on the file system rather than in a database. Staging data for recoverability purposes is especially important when the sources overwrite their own data.
  2. Backup; sometimes it is easier and cheaper to back up your staging points rather than your entire data warehouse. Save those recovery-point files in compressed zip folders, and if you ever need to reload the data warehouse you can simply uncompress the load files and reload them.
  3. Auditing; many times the lineage between the source and target is lost in ETL code, and auditing an ETL process is very important if you want to have proof that your data warehouse has the correct data after all the data manipulation and transformation. A great way to achieve this audit is by having multiple staging points saved, so auditors can simply compare the original data input, with the logical transformation rules, against the output file.

Make sure you plan your staging architecture well; it is not uncommon for staging areas to accumulate more data than the data warehouse itself.

Designing the Staging Area

In order to design your staging area you first need to understand that you can have data preserved in your staging area, also known as persistent staging, or you can have data being deleted, also known as a transient staging area, and, of course, you can have a mix of both. So keep that in mind when designing your staging area.

Also, in the grand scheme of things, it is important to have a policy in your company that grants the ETL team sole ownership over all the staging areas; therefore, be aware that:

  1. Data in the staging area should be owned by the ETL team. Do not grant access to data in the staging area to people outside the ETL team. The data in the staging area is not ready to be used by a user, nor is it indexed, nor does it fall under any SLA.
  2. Users are not allowed in the staging area for any reason. Users will most likely misuse the data in the staging area and reduce the perceived integrity of the data warehouse.
  3. Reports cannot access data from the staging area. In the staging area, tables and files can be created, dropped, and reprocessed, and changes can happen without the need to notify users, which would break any reports pointing to whatever data they might be using in the staging area.
  4. Only the ETL process can write to and read from the staging area. This is a big one. If you ever need information that doesn’t have a source, like an Excel domain file, for instance, you need to make that Excel file come from a source. You should never, ever manually input that table into your staging area.

Since the staging area is owned by the ETL team, it is also the ETL team’s responsibility to supply the DBA team and OS admins with an overall data storage measure of the staging area, so they can estimate the space allocations and parameter settings for the staging database, file systems, and directory structures, as shown in the next image.

As seen in the figure above, the details shown are for tables within a database, however, you can also use this volumetric worksheet for file-system planning.

Data Structures in the ETL System

This section describes important types of data structures we are likely to need in our ETL system.

Flat Files

In many cases, handling data in your staging area can be done much faster by using flat files. A DBMS requires overhead to maintain metadata about the data being processed, which in many cases is not necessary in a staging environment.

Kimball says that (sic) It has been the conventional wisdom that sorting, merging, deleting, replacing, and many other data-migration functions are much faster when they are performed outside the DBMS.

The use of flat files can benefit the most in the following scenarios:

  1. Staging source data for safekeeping and recovery. We talked about this already, but the idea is: you get your data from the source and save it in a flat file, and if for some reason you need to recover from a failure, or need it for future reference, your data extraction is saved in a flat file.
  2. Sorting data. Presorting optimizes performance, so carefully simulate the biggest sorting tasks both in your DBMS and with a dedicated sort package for your flat files. We think that most likely flat files will trump the DBMS on this task, and since much of your ETL revolves around integrating disparate data, merging data efficiently is a top priority that requires intensive sorting.
  3. Filtering. Filtering flat files is way more efficient than inserting all the data into a database table, indexing the table, and then applying a WHERE clause either to delete or select the data into another table. It might be faster to have the data in a flat file and grep (the Unix filtering utility) only the rows you need.
  4. Replacing/substituting text strings. Doing substring searching and replacing data in a database can require nested scalar functions and update statements. This type of sequential file processing is much faster at the file-system level than it is with a database.
  5. Referencing source data. Instead of querying the same table in the source system over and over, it’s more efficient to extract the reference table and stage it in the staging area once. From there, you can look up data with your ETL tool. Most ETL tools can read a look-up table into memory and store it there for the life of the process. Moreover, utilizing staged reference data keeps the queries that hit the source system simpler and more efficient because many of the table joins can be omitted.

Relational Tables

You can also store your staging data in Relational Tables in a DBMS. Here are some of the advantages of using relational databases as staging area:

  1. Apparent metadata. One of the drawbacks of using flat files is the lack of apparent metadata. By storing data in a relational database, the DBMS maintains the metadata automatically. Information such as column names, data types and lengths, and cardinality is inherent to the database system.
  2. Relational abilities. If you are receiving data from nonrelational systems, it might make sense to stage data in a normalized model before transforming into a dimensional model.
  3. Open repository. Data is more easily accessed and assessed in a SQL-compliant environment, making QA and auditing easier.
  4. DBA support. When data is staged in a database, the DBA team can relieve the ETL team of responsibilities such as space allocation, backup and recovery, archiving, and security; these are tasks that the ETL team must coordinate itself when the staging data is not in a DBMS.
  5. SQL interface. It is easier to handle data with SQL, that’s pretty much what Kimball means with this SQL interface. (sic) Besides enforcing referential integrity, having the ability to use native SQL is the primary reason to store staging data in a database environment.

Independent DBMS Working Tables

Independent DBMS working tables are, at least for me (Pedro), the most common scenario. Independent tables are orphans in the DBMS, meaning that they do not have any relationship whatsoever. Most of the time, the reason you create a staging table is to set the data down so you can manipulate it again using SQL. Just because independent tables are not necessarily normalized does not mean they can be treated like dump files. Each field of an independent file or table must have a purpose and a logical definition. For database tables, a proper index plan must be established and implemented on all independent tables.

Third Normal Form Entity/Relation Models

Don’t assume that the data-staging area must be normalized. Remember two of the goals for designing your ETL processes we describe at the beginning of this chapter: Make them fast and make them recoverable. Consider modeling decisions on a case-by-case basis, normalizing entities only as needed.

Nonrelational Data Sources

First, what are nonrelational data sources? Any sources of data that are not related to each other are nonrelational data sources, meaning that your ERP data source and your Excel file data source are not related to each other; that is, they do not enforce referential integrity between them. That is why it is of utmost importance to run all data integrity checks in your ETL process.

Sometimes people create special staging points to integrate all nonrelational data; after all, your data-integration tasks are easier to perform once all data is under a single roof (like in a DBMS, for example).

(sic) Bringing all of the disparate data sources into a single DBMS is a common practice, but is it really necessary? The power of ETL tools in handling heterogeneous data minimizes the need to store all of the necessary data in a single database. Figure 2.2 illustrates how an ETL tool can integrate many heterogeneous data sources right from their native data stores, integrating the data on the fly to migrate it to the data warehouse.

What happens is that integrity errors need to be handled by the ETL process automatically. You need to establish business rules for different data-quality failure scenarios and implement them in the ETL process. When erroneous data is passed through the process, sometimes you want to transform the data on the fly; load the data as is; load the data with an associated code or description describing the impact or condition of the data; or, if the data is unacceptable, reject the data completely and put it into a reject file for investigation. Hand-coded logic in the ETL process is almost always required to successfully integrate nonrelational data sources.

Note: use rejection files carefully. Any rows in the rejection file need to be dealt with BEFORE the next ETL cycle; failing to do so will leave your data warehouse out of sync with your production system.

Dimensional Data Models: The Handoff from the Back Room to the Front Room

Dimensional data models sit between the back room and the front room. If you are reading this, you probably already know the power of dimensional modeling, but if you are new to this know that dimensional models are by far the most popular data structures for end user querying and analysis. They are simple to create, they are extremely stable in the presence of changing data environments, they are intuitively understandable by end users, and they are the fastest data structure for general relational database querying. Dimensional models are also the foundation for constructing all forms of OLAP cubes, since an OLAP cube is really just a high-performance dimensional model implemented on special-purpose software.

For more detail on data models, please see Kimball’s Data Warehouse Toolkit book.

Fact Tables

Dimensional models are built around measurement processes. Measures are mostly numeric, and most measures can repeat over time, creating a time series.

A single measure creates a single fact table record, and each record in a fact table can represent only one fact. Besides storing facts in the fact table, we also store everything that gives that fact context; we call those dimensions, and while we would like to store those dimension attributes verbosely, the best practice is to normalize them out of the fact table, thereby creating dimension tables.

The grain of a fact table is the definition of what constitutes a unique fact table record. The grain is declared as a business term, not as a database term. The grain can also be formally stated in terms of the key of the fact table. This key declaration will include some, but usually not all, of the foreign key references to the dimensions attached to the fact table.
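
As an illustration (my own DDL, not the book’s), here is a fact table whose grain is one row per order line; the grain shows up in the combination of the degenerate order number and line number, while the dimension foreign keys give the fact its context:

  CREATE TABLE FACT_ORDER_LINE (
      ORDER_DATE_KEY  INT           NOT NULL,  -- FK to the date dimension
      PRODUCT_KEY     INT           NOT NULL,  -- FK to the product dimension
      CUSTOMER_KEY    INT           NOT NULL,  -- FK to the customer dimension
      ORDER_NUMBER    VARCHAR(20)   NOT NULL,  -- degenerate dimension
      ORDER_LINE_NUM  INT           NOT NULL,
      QUANTITY        INT           NOT NULL,  -- additive measure
      LINE_AMOUNT     DECIMAL(18,2) NOT NULL,  -- additive measure
      PRIMARY KEY (ORDER_NUMBER, ORDER_LINE_NUM)
  );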

The figure above shows several dimension tables related to a fact table.

Dimension Tables

Dimension tables hold the verbose descriptions of entities that are used to give a fact context. A single dimension table can hold many dimensional attributes; a Customer dimension, for example, can store customer name, customer age, customer gender, customer email, etc.

Dimensional attributes are mostly textual, or are numbers that take on discrete values. Dimension tables should always be built with a single primary key field that is a simple meaningless integer assigned by the ETL process. These keys are called surrogate keys. The many advantages of surrogate keys are described in the other Toolkit books. In this book Kimball describes how surrogate keys are created and administered in a number of important situations.

The primary surrogate keys in each dimension are paired with corresponding foreign keys in the fact table. When this primary-to-foreign key relationship is adhered to, we say that the tables obey referential integrity. Referential integrity is a constant requirement in all dimensional models. Failure to maintain referential integrity means that some fact table records are orphans that cannot be retrieved through constraints on the affected dimensions.
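
A minimal sketch, with assumed names, of a dimension keyed by a meaningless surrogate key and the foreign key constraint that enforces referential integrity from the fact table sketched earlier:

  CREATE TABLE DIM_CUSTOMER (
      CUSTOMER_KEY   INT IDENTITY(1,1) PRIMARY KEY,  -- meaningless surrogate key assigned by the ETL
      CUSTOMER_NK    VARCHAR(20)  NOT NULL,          -- natural key from the source system
      CUSTOMER_NAME  VARCHAR(100) NULL,
      CUSTOMER_EMAIL VARCHAR(100) NULL
  );

  ALTER TABLE FACT_ORDER_LINE
    ADD CONSTRAINT FK_FACT_ORDER_LINE_CUSTOMER
    FOREIGN KEY (CUSTOMER_KEY) REFERENCES DIM_CUSTOMER (CUSTOMER_KEY);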

Atomic and Aggregate Fact Tables

Often, business users don’t want to analyze transaction-level facts because the cardinality of each dimension is so extensive that any atomic-level report would be pages long—making it humanly impossible to examine. However, you need to store the atomic-level facts to produce the periodic snapshot facts required by the users.

It’s good practice to partition fact tables stored in the staging area because its resulting aggregates will most likely be based on a specific period— perhaps monthly or quarterly. Creating partitions alleviates the database from scanning the entire table and enables it to go directly to the subsection that contains the data it needs for that period to create the aggregate.
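
For example, in SQL Server a monthly partitioned staging fact table could look like the sketch below (names and boundary dates are illustrative, and the right partitioning strategy depends on your aggregation periods):

  -- Monthly range partitioning for a staged fact table (SQL Server syntax)
  CREATE PARTITION FUNCTION pf_monthly (DATE)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

  CREATE PARTITION SCHEME ps_monthly
    AS PARTITION pf_monthly ALL TO ([PRIMARY]);

  CREATE TABLE STG_FACT_ORDERS (
      ORDER_DATE   DATE          NOT NULL,
      ORDER_NUMBER VARCHAR(20)   NOT NULL,
      LINE_AMOUNT  DECIMAL(18,2) NOT NULL
  ) ON ps_monthly (ORDER_DATE);

With this layout, an aggregate built for a single month only touches that month’s partition instead of scanning the whole table.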

You may implement a hybrid structure where you have the large atomic data layer in a dimensional RDBMS schema, with progressively more aggregated structures above the atomic layer in the form of OLAP cubes. Some of the OLAP systems can drill down through the OLAP cubes and access the lowest-level atomic data in a single application.

Surrogate Key Mapping Tables

Surrogate key mapping tables are designed to map natural keys from the disparate source systems to their master data warehouse surrogate key. Mapping tables contain only the most current value of a surrogate key— used to populate a dimension—and the natural key from the source system. Since the same dimension can have many sources, a mapping table contains a natural key column for each of its sources.
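
A hypothetical mapping table for a customer dimension fed by a CRM and an ERP source might look like this (names are my assumptions, not the book’s):

  -- One natural-key column per source system, mapped to the current surrogate key
  CREATE TABLE MAP_CUSTOMER_KEY (
      CUSTOMER_KEY    INT NOT NULL,       -- current surrogate key in DIM_CUSTOMER
      CRM_CUSTOMER_ID VARCHAR(20) NULL,   -- natural key from the CRM source
      ERP_CUSTOMER_ID VARCHAR(20) NULL    -- natural key from the ERP source
  );

  -- Fact loads look up the surrogate key through the mapping table
  SELECT m.CUSTOMER_KEY, s.ORDER_NUMBER, s.LINE_AMOUNT
  FROM STG_ORDERS s
  JOIN MAP_CUSTOMER_KEY m ON m.ERP_CUSTOMER_ID = s.CUSTOMER_ID;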

Planning and Design Standards

The data-staging area must be managed and maintained as much as, if not more than, any other database in your environment. The data-staging area must be a controlled environment.

Only the data architect should be able to design or modify a table in the staging area. All physical changes must be owned by the DBA responsible for the database. Also, if any developer needs a table, a strong chance exists that it can be used by another developer. As a rule of thumb: If you build a table, expect that it will be used by someone else for reasons other than you originally intended.

Impact Analysis

We often want, or are usually requested by users, to change objects (tables or columns) in our staging area, so be especially aware that any change to data-staging objects can break processes that are crucial to the data warehouse itself. Allowing ad-hoc changes to data-staging objects can simply destroy your entire data warehouse project.

So, if there is no other way and a change must be made to your data-staging objects, impact analysis must be performed. Impact analysis examines the metadata associated with an object and determines what that change will impact in the structures that depend on it. You must perform impact analysis before any changes are made to a data-staging object.

Communication among the ETL project manager, source system DBA, and data warehouse modeling team is crucial to ensure that appropriate impact analysis is performed.

Metadata Capture

Several metadata elements associated with the data staging area are valuable to the data warehouse and must be presented to its end users. Types of metadata derived by the staging area include the following:

  1. Data lineage. Data lineage, also known as the logical data mapping, illustrates transformations applied to a data element between its original data source and its ultimate target in the data warehouse.
  2. Business definitions. Every table created in the data-staging area stems from a business definition.
  3. Technical definitions. Without proper documentation of the technical definitions of your staging tables, the tables might be recreated over and over, causing duplication of efforts and data explosion in the data-staging area. Technical definitions describe the physical attributes of data elements, including the structure, format, and location.
  4. Process metadata. Processes that load data-staging tables must record their statistics along with the statistics of the data warehouse table loads. Although information regarding staging table loads need not be presented to end users, the ETL team must know exactly how many records were loaded into each staging table, with success and failure statistics for each process. A measure of data freshness is useful both for ETL administrators and end users.

Make sure you capture all metadata needed to understand what is going on with your ETL system and data warehouse, so information such as number of rows inserted, updated, deleted, and rejected should be available for each process. Also, the process start time, end time, and duration.
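
One simple way to capture that process metadata is a small audit table that every load writes to; the structure below is my own sketch, not something prescribed by the book:

  CREATE TABLE ETL_PROCESS_LOG (
      PROCESS_NAME  VARCHAR(100) NOT NULL,
      START_TIME    DATETIME     NOT NULL,
      END_TIME      DATETIME     NULL,
      ROWS_INSERTED INT          NULL,
      ROWS_UPDATED  INT          NULL,
      ROWS_DELETED  INT          NULL,
      ROWS_REJECTED INT          NULL,
      STATUS        VARCHAR(20)  NULL   -- e.g. SUCCEEDED / FAILED
  );

  INSERT INTO ETL_PROCESS_LOG (PROCESS_NAME, START_TIME, END_TIME, ROWS_INSERTED, ROWS_UPDATED, ROWS_DELETED, ROWS_REJECTED, STATUS)
  VALUES ('s_m_load_dim_customer', '2024-01-31 02:00:00', '2024-01-31 02:07:12', 1200, 35, 0, 4, 'SUCCEEDED');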

The big categories of metadata include table structure metadata (25 percent), data-cleaning results (25 percent), and process results (50 percent). Metadata is developed in depth in Chapter 4 and Chapter 9.

Auditing Data Transformation Steps

In volatile environments, the source system data is constantly changing; it’s mandatory that the data warehouse have the ability to prove the data within it is accurate.

The ETL process must maintain a snapshot, a before picture of the data, before it goes through the data-cleansing routines.

Snapshots of extract data are stored in the data-staging area and made available for audit purposes. A before and after glimpse of the data, along with the metadata describing the data-cleansing logic, depicts exactly how the data within the data warehouse is derived and promotes confidence in the quality of its data.

SOURCE:

Ralph Kimball, Joe Caserta- The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.

Data Warehouse ETL Toolkit – Chapter 01

Hello everyone, I’m trying to improve my English skills, so from now on I will try to write in English as much as possible. I’m also reading Kimball and Caserta’s “The Data Warehouse ETL Toolkit”, and here I will post my chapter summaries. Hope it can help you like it’s helping me 🙂

Chapter 01 talks about three main structures that are: requirements, architecture and data warehouse mission.

Requirements are sub-divided into business needs, compliance requirements, data profiling, security requirements, data integration, data latency, archiving and lineage, end user delivery interface, available skills, legacy licenses.

Architecture is sub-divided into ETL Tools vs Hand Coding, The Back Room, The Front Room.

Data Warehouse Mission is sub-divided into what the data warehouse is, what the data warehouse is not.

Besides those three main topics, chapter 01 also discusses industry terms not being used consistently, resolving architectural conflict (the hybrid bus approach), how the data warehouse is changing, and the mission of the ETL team.

I’m not going over all topics and sub-topics, but I’d rather highlight here in this summary what I find most relevant, based on my current ETL understanding and perspective.

About Requirements:

On the Requirements topic Kimball lists several possible requirements that your company might need in order to implement the DW-ETL; however, he states that Business Needs is the most fundamental and important one.

Business Needs

Kimball says that Business Needs are (sic) the information content that end users need to make informed business decisions. It means that every time the ETL team is assessing data sources to ingest, they need to take into consideration how that data source can be used to support users’ “informed business decisions”.

Data Integration

Data Integration is an important topic further discussed in Chapter 05, but it is important to know that it is routine to have multiple data sources and, therefore, to implement some type of data integration, and by that Kimball means conformed dimensions and conformed facts. In other words, (sic) conforming facts means agreeing on common business metrics across separate databases so that these numbers can be compared mathematically by calculating differences and ratios. In the ETL system, data integration is a separate step identified in our data flow as the conform step. Physically, this step involves enforcing common names of conformed dimension attributes and facts, as well as enforcing common domain contents and common units of measurement.

Archiving and Lineage

I particularly find this topic very interesting. It says that every data warehouse will most likely need various copies of old data, either for comparison with new data to generate change capture or for reprocessing. Kimball’s recommendation is that we should archive data every time we have a major staging point. So, if you consider Kimball’s ETL approach of ECCD (extract, clean, conform, deliver) you will have 4 points of staging and archiving. So, if you are interested in checking the data quality of your ETL process, or just need to reprocess some old data, the archiving will help you, as Kimball says (sic): it is almost always less of a headache to read data back in from permanent media than it is to reprocess data through the ETL system at a later time. And, of course, it may be impossible to reprocess data according to the old processing algorithms if enough time has passed.

About Architecture:

On this topic Kimball says that architecture decisions should be made early and as a whole. The main point here is to decide between a proprietary ETL tool or hand coding your ETL. As of today I rarely see people hand coding their ETL, but I could be wrong. Either way, whichever you choose, make sure you stick to it; otherwise a change in such an architecture decision can lead to an entire overhaul of your ETL system. This topic also covers the back room and front room architecture, both of which are key aspects of Kimball’s approach to ETL.

Exception and Quality Handling

Kimball advocates that errors, exceptions, and issues that occur during, or as a result of, the ETL process should be handled by a system-wide, uniform mechanism for reporting all instances of exceptions, and not just as a series of random little ad-hoc alerts and comments placed in files. The same goes for quality issues. More on quality in chapter 04.

Recovery and Restart

Kimball does not go to great lengths on this matter, but he says that it is very important that all ETL jobs be planned in such a way that, if a job needs to run twice for some reason (a bug, an error, etc.), no data is lost, corrupted, or tampered with. This is of very high importance; otherwise your data warehouse will end up receiving incorrect data from your ETL, as it is certain that sooner or later some of your ETL jobs will either terminate abnormally or be mistakenly run more than once.

The Back Room – Preparing the Data

Figure 1. Back Room

The back room is where the ETL development team works; it is where data management happens, where the ETL team prepares the data. No query services should be provided to users in the back room, and by enforcing that we relieve the ETL team from:

  1. Providing detailed security at a row, column or application level
  2. Building query performance-enhancing indexes and aggregations
  3. Guaranteeing that all datasets are consistent with each other

Kimball writes about a restaurant comparison, and it’s a very good but lengthy one; we can summarize it as: the kitchen (back room) is where things are prepared, and the dining room (front room) is where dishes are served, well presented, to the final users.

The word staging is very important to Kimball as he uses it to define steps, moments, in which data is saved/archived as files or in tables for later use or recovery/restart purposes. He states that (sic) staging almost always implies a temporary or permanent physical snapshot of data. We assume that every ETL system supporting the data warehouse is structured with these four steps and that data is staged (written to disk) in parallel with the data being transferred to the next stage.

Figure 2. ECCD, the four staging steps of a data warehouse.

In the back room there are four steps, which are detailed in future chapters, they are: extracting, cleaning, conforming, delivering.

Note: although most transformations are performed in the cleaning and conforming steps, the best place to resolve certain legacy data format issues is the extract step. These include resolving repeating groups, REDEFINEs, and overloaded columns, and performing low-level data conversions, including converting bit encoding to character, EBCDIC to ASCII, and packed decimal to integer; more on this in chapter 3.

Note: figure 2 makes it look like you must do all of the ECCD (extract, clean, conform, deliver) steps serially, with well-defined boundaries between each pair of steps, but in practice there will be multiple simultaneous flows of data in the ETL happening at the same time, and frequently some of the cleaning steps are embedded in the logic that performs extraction.

Extracting: in this step the raw data coming from the various sources is written directly to disk, with minimal to no restructuring. This allows the initial load to be as simple and fast as possible, so as not to burden the source it is extracting data from. This is important in case the source needs to be re-ingested if something happens (like errors or interruptions). This initial data can be used multiple times and can be archived for future reference, or it can be stored to be used as a reference for change checks in the next ETL cycle and then discarded.

Cleaning: in this step data quality rules are applied: checks such as validating values (e.g., are zip codes in a valid format and range), ensuring consistency across values (e.g., are the zip code and the city consistent), removing duplicates, and applying and checking business rules (e.g., do premium customers have the correct premium status associated with them).
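
A couple of these checks, written as a hedged SQL sketch against a hypothetical STG_CUSTOMER table, could look like this:

  -- Illustrative cleaning-step checks (assuming US-style 5-digit zip codes and a premium flag column)
  SELECT *
  FROM STG_CUSTOMER
  WHERE ZIP_CODE NOT LIKE '[0-9][0-9][0-9][0-9][0-9]'         -- invalid zip format
     OR (CUSTOMER_TYPE = 'PREMIUM' AND PREMIUM_FLAG <> 'Y');  -- business-rule violation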

Data-cleaning transformations may involve human intervention and the exercise of judgment. It is sometimes important, depending on the level of transformation applied, to save the results of data cleaning semi-permanently, since they are important to track, irreversible, and, many times, complex.

The data exceptions found in the data-cleaning step should be reported back to the source in order to improve the overall data quality of the source.

Conforming: every time you have more than one data source you will most likely need data conformance. This includes all the text labels of your dimensions and numeric measures that can come in different units, like some numbers in kilometers and others in miles. Chapters 5 and 6 talk more about this.

Delivering: the back room prepares data to be ready for querying. The final and crucial back-room step is physically structuring the data into a set of schemas known as dimensional models (star schemas). Chapter 5 talks more about this.

ODS – Operational Data Store

According to Kimball the ODS has been absorbed by the data warehouse, but we know that some companies still use it.

The ODS is a hot extract from the operational system database, and it was particularly important when data warehouses were not able to provide the grain of data, or the speed, that users needed.

Since data warehouses can now work much faster, at a daily or even online pace, the ODS is no longer needed.

The Front Room – Data Access

The front room is where data is displayed and accessed by users, it is the data that the user gets to see.

The whole purpose of the ETL system is to feed the presentation layer of dimensionally modeled tables that are directly accessed by query tools, report writers, dashboards, and OLAP cubes. Chapters 05 and 06 talk more about modeling.

Data marts

Kimball takes special interest in talking about data marts. He says that people usually mistake the concept of a data mart, so for Kimball a data mart is (sic) a set of dimensional tables supporting a business process. In Kimball’s view, data marts are always measurement-intensive subject areas, and they are surrounded by descriptive entities. Other important points about data marts:

  1. Data marts should not duplicate data. A company can only have one ORDERS data mart, so all user tools and queries in the company’s various departments should access this data mart to have a single, consistently labeled version of ORDERS.
  2. Data marts should be as atomic as possible, containing all the information needed to go to the deepest level of drill-down. So here Kimball states that aggregated-only data marts are an incorrect assumption.

About Data Warehouse Mission

Since the data warehouse is a decision support system, the main criterion of success is whether the data warehouse effectively contributes to the most important decision-making processes in the organization.

The actual scenario is that we have been ingesting years and years of data into our transactional databases, but it has been really difficult to pull that data out and make use of it, so the ETL system must play a major role in handing the data to the final end-user applications in a usable form. And that is what the data warehouse is about.

What Data Warehouse Is
A data warehouse is the process of taking data, from one or many sources, and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision making.

In other words, (sic) a data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.

What Data Warehouse Is Not

As Kimball states, the concept of a data warehouse is often misunderstood. So, he says, a data warehouse is not a product, nor a language, nor a project, nor a data model, nor a copy of your transaction system.

A data warehouse includes several components, such as design suites, techniques, tools, and products. The ETL is a major component, but it alone does not make up a data warehouse.

The Mission of the ETL Team

At the highest level, the ETL team’s mission is to build the back room of the data warehouse, and the back room must support the following steps:

  1. Extracting data from the original sources
  2. Quality assuring and cleaning data
  3. Conforming the labels and measures in the data to achieve consistency across the original sources
  4. Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

SOURCE:

Ralph Kimball, Joe Caserta- The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.