The Drunkard’s Walk, a book worth reading twice, or more: A Data Analyst’s perspective

As you know if you follow my LinkedIn profile (which, if you don’t, you should), I have been reading Leonard Mlodinow’s “The Drunkard’s Walk”, a really great book if you want to get a grasp on how chance, or randomness, affects our lives and, even better, how we fail to perceive what is chance and what is not.

This book builds, chapter by chapter, on how probability and statistics came to be, and on how our understanding of the field still trips us up, since we have a really hard time letting go of our intuition.

Here are some key points that I think are worth knowing as a Data Analyst. Please note that these are only the ones I remember off the top of my head right after finishing the book. Either way, I highly recommend it to anyone who finds hunting for patterns in data interesting.

  • The probability that two events will both occur can never be greater than the probability that each will occur individually. This is known as The First Law of probability.
  • If two possible events, A and B, are independent, then the probability that both will occur is equal to the product of their individual probabilities. This is known as The Second Law of probability.
  • Girolamo Cardano and the Law of Sample Space
  • Law of Large Numbers
  • Pascal’s triangle
  • Gambler’s Fallacy
  • Bayes’ theorem
  • The bell curve, normal distribution, sample variance
  • Significance testing
  • Confirmation bias, Ambiguous evidence

These are all fairly deep topics to explore, so if you feel like knowing more about any of them I highly encourage you to read the book and/or look them up online.
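Just to pin the first two laws (and Bayes’ theorem) down in notation, here is my own quick summary, not a quote from the book:

P(A \text{ and } B) \le \min\big(P(A),\, P(B)\big) \quad \text{(first law)}

P(A \text{ and } B) = P(A)\,P(B) \quad \text{for independent } A, B \quad \text{(second law)}

P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)} \quad \text{(Bayes' theorem)}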

4 Insights you can gather from Michael Lewis’ Moneyball as a Data Analyst

Moneyball book cover

If you follow me on LinkedIn you already know that I’m, or actually was, reading Moneyball. I just finished the book today, so I thought it would be fun to share what I got from it using my perspective, in other words, a data analyst’s perspective.

It’s no surprise that this book is, or was, a best seller. It’s easy and compelling to read, and the story is engaging whether you like baseball, “think outside the box” approaches, underdogs beating favorites, or stories of someone who became relevant for something. That being said, Moneyball does not provide any hard data or numbers; it’s not a math or statistics book, it’s a story book, but one can still gather reliable insights from reading it.

So here are four insights gathered from Moneyball:

  1. It’s hard to challenge common wisdom

Billy Beane had to go from being the next baseball superstar to a baseball failure to understand that the way things were being done was wrong.

That being said, it’s easy to find people saying everywhere that we should think outside the box, but in truth they expect you to think inside their sandbox, and if you challenge their common sense they call you wrong, or worse. Besides, thinking is one thing, acting is another. If you think differently and start acting differently, it may take a while until people see that your way of doing things is better. Actually, even today you can find people who still do not take numbers and stats into account, because “they know better”.

  2. It takes time to build your stats properly

The guy who really put things together was Bill James, whose first article was published in 1977. Back then it wasn’t easy to gather all the data, and not all data was properly gathered; in fact, gathering, preparing and figuring out which data you actually need is an ongoing process that never ends. James’s studies from 1977 only started to be used by Billy Beane much later, around 1997, so there was a 20-year gap. And furthermore, after all those years of baseball stats being used by Billy Beane’s team since ’97, they still found new and improved ways to use and gather data in 2001, with Voros McCracken’s new way of measuring defense data.

So, knowing that, it is important, especially if you are a leader, a manager or a director, to understand that data literacy takes time. A lot of time. Maybe not 30 years, but it’s not a 6-month project that will get you all the sweet, sweet insights you need. It might give you some, but understand that sometimes what you really need has not even been thought of yet.

  3. It’s hard to trust data when it tells you something you disagree with

Moneyball introduces us to some players who are only playing in the baseball major league because of Billy Beane and Paul DePodesta’s stat sheets. All the other teams think Billy’s team is going to lose, but they end up winning many games, because of the stats. It doesn’t matter if a player looks good, runs fast, or throws a ball at 95 mph; all that matters is whether, by the baseball metrics, they can deliver wins, and that is what happens. Many times players are doing something that everyone else thinks they should do differently, but common wisdom is usually wrong.

Have you ever seen something like this happen? A manager checks a report that says he should be selling more in this area and not that one, or that some product should be selling more than another, but he insists on doing something different from what the data is telling him. It’s common, and very similar to point 1, because people often think they know better.

  4. The luck factor

In baseball you have many games throughout the season, but only a few in the playoffs. What happens is that the way your data applies changes, because your sample changes. That is why Billy’s team would end a season high in the rankings, but usually lost in the playoffs.

That shows us that not every KPI or metric should be applied to the same sample, or considered in the same context. Be aware of what you are measuring before taking a metric into account; you may end up with a correct number for the wrong situation.

Conclusion

There are many more insights one can gather from reading this book; those listed here are just some that came off the top of my head as I was writing this blog post. Something else worth noticing is how Billy dealt with his team members. It’s a long story, but it follows pretty much the idea you can get from reading another book, First, Break All the Rules, which I also highly recommend, especially if you are in a lead position.

You can buy Moneyball here. I don’t make a single dollar by recommending it, but I’m doing it anyway 🙂

SQL Soccer Questions

Recently I came across a SQL interview question, the “SQL Soccer Matches” question. Have you seen it already?

This is the kind of question you can find in an interview exam. To answer it I developed several approaches: using UNION, using several CTEs plus UNION, using CASE with SUM, and using an unpivot cross join.

Wanna see all those solutions? Check my DataCamp workbook: https://app.datacamp.com/workspace/w/21c61079-0787-42ae-b21d-ae7ff127d556

This is the question:

Compute the total number of points each team has scored after all the matches. The rules are as follows:

a. if a team wins a match (scores more goals than the other team), it gets 3 points.
b. if a team draws a match (scores exactly the same number of goals as the other team), it gets 1 point.
c. if a team loses a match (scores fewer goals than the other team), it gets no points.

Write an SQL query that selects the team_id, team_name and num_points of each team in the tournament.
Return the result table ordered by num_points in decreasing order.
In case of a tie, order the records by team_id.
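Here’s a sketch of the UNION approach, assuming a teams table (team_id, team_name) and a matches table (host_team, guest_team, host_goals, guest_goals), which is how this question is usually laid out; the workbook linked above has the other variations:

-- look at each match twice, once from each team's side, and score it
with points as (
    select host_team as team_id,
           case when host_goals > guest_goals then 3
                when host_goals = guest_goals then 1
                else 0 end as pts
    from matches
    union all
    select guest_team,
           case when guest_goals > host_goals then 3
                when guest_goals = host_goals then 1
                else 0 end
    from matches
)
select t.team_id,
       t.team_name,
       coalesce(sum(p.pts), 0) as num_points   -- teams with no matches still show up with 0
from teams t
left join points p on p.team_id = t.team_id
group by t.team_id, t.team_name
order by num_points desc, team_id;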

Solutions:

https://app.datacamp.com/workspace/w/21c61079-0787-42ae-b21d-ae7ff127d556

SQL Interview Questions

I’m going big on SQL these days. I’ve always loved SQL, but these last couple of days I’ve been going hard on learning and mastering it. There’s still a lot to do: many courses to take, many books to read, and many exercises to do.

Talking about exercises, this guy Nick Singh just launched a site with questions. It’s not super different from HackerRank and the like, but what I liked the most is that it’s well explained, well put together, nicely designed, and has many interesting questions.

So, check it out by clicking here https://datalemur.com?referralCode=F6dizRnl (this is a referral link; if you use it and sign up I may end up with some bonus or something).

[Solved, Spoilers] SQL Murder Mystery – Can you find out whodunnit?

SQL Murder Mystery

Here: https://mystery.knightlab.com/

So, a couple of friends of mine sent me this challenge, and it’s really fun. Go there and try it!

The plot is: someone killed someone else, and you, as the detective, need to find out who did it. Your weapons are SQL, tables and relationships. Off you go!

the database

My solution to this

In the code below you will find everything, with comments, that I used to come up with the solution to this mystery. I hope you enjoy it like I did.

What about you, have you done anything differently?

Note: this is SQLite code. Also, all queries are wrapped in multi-line comments using /**/. If you plan on running this code, make sure to remove the comments where needed.

-- INTRO
/*
Experienced SQL sleuths start here
A crime has taken place and the detective needs your help. 
The detective gave you the crime scene report, but you somehow lost it. 
You vaguely remember that the crime was a ​murder​ 
that occurred sometime on ​Jan.15, 2018​ and that it took place in ​SQL City​. 
Start by retrieving the corresponding crime scene report 
from the police department’s database.
*/

-- FIRST QUERY, CHECKING CRIME SCENE REPORT

/*
select * from crime_scene_report
where 1=1
and lower( type ) like '%murder%'
and lower( city ) like '%sql%city%'
and date = 20180115;
*/

-- CRIME SCENE REPORT RESULT
/*
Security footage shows that there were 2 witnesses. 
The first witness lives at the last house on "Northwestern Dr". 
The second witness, named Annabel, lives somewhere on "Franklin Ave".
*/

-- FIRST WITNESS 
/*
select * from person
where 1=1
and lower( address_street_name ) like '%northwestern%'
order by address_number desc
limit 1;
*/
-- FIRST WITNESS RESULT
/*
id		name			license_id	address_number	address_street_name	ssn
14887	Morty Schapiro	118009		4919			Northwestern Dr		111564949
*/

-- SECOND WITNESS 
/*
select * from person
where 1=1
and lower( address_street_name ) like '%franklin%'
and lower( name ) like '%annabel%'
*/

-- SECOND WITNESS RESULT
/*
id		name			license_id	address_number	address_street_name	ssn
16371	Annabel Miller	490173		103				Franklin Ave		318771143
*/

-- INTERVIEW FROM 1ST WITNESS
/*
select * from interview
where person_id = 14887
*/

-- 1ST WITNESS INTERVIEW REPORT
/*
I heard a gunshot and then saw a man run out. 
He had a "Get Fit Now Gym" bag. The membership number on the bag started with "48Z". 
Only gold members have those bags. The man got into a car with a plate that included "H42W".
*/

-- INTERVIEW FROM 2ND WITNESS
/*
select * from interview
where person_id = 16371
*/

-- 2ND WITNESS INTERVIEW REPORT
/*
I saw the murder happen, and I recognized the killer from my gym 
when I was working out last week on January the 9th.
*/

-- SEARCHING FOR A PERSON MEMBER OF GYM THAT MATCHES INTERVIEW REPORT - 1ST WITNESS
/*
select 
g.*, p.*, d.*
from get_fit_now_member g
left join person p on p.id = g.person_id
left join drivers_license d on d.id = p.license_id
where 1=1
and lower( g.membership_status ) like '%gold%'
and g.id like '48Z%'
and d.plate_number like '%H42W%'
*/

-- RESULT FROM 1ST WITNESS INTERVIEW
/*
person_id 67318
name Jeremy Bowers
*/

-- SEARCHING FOR A PERSON MEMBER OF GYM THAT MATCHES INTERVIEW REPORT - 2ND WITNESS
/*
select c1.*, c2.*, g1.*
from get_fit_now_check_in c1
inner join 
(
  select c.check_in_date, c.check_in_time, c.check_out_time
  from get_fit_now_check_in c
  left join get_fit_now_member g on g.id = c.membership_id
  where 1=1
	  -- date as text to search using like for anything on january 09th
  and cast(check_in_date as text) like '%0109' 
	  -- checking that our witness Annabelle person_id 16371 checked in 16h and left 17h
  and g.person_id = 16371
) c2 
 on c1.check_in_date  = c2.check_in_date 
and c1.check_in_time  >= c2.check_in_time 
and c1.check_out_time <= c2.check_out_time
left join get_fit_now_member g1 on g1.id = c1.membership_id 
-- here we also have Jeremy Bowers showing up. He might be the killer.
*/

/*
FINAL RESULT
Congrats, you found the murderer! 
But wait, there's more... If you think you're up for a challenge, try querying the 
interview transcript of the murderer to find the real villain behind this crime. 
If you feel especially confident in your SQL skills, 
try to complete this final step with no more than 2 queries. 
*/

-- QUERYING Jeremy Bowers, 67318, INTERVIEW
/*
select * from interview where person_id = 67318
*/

-- INTERVIEW OF JEREMY RESULT
/*
I was hired by a woman with a lot of money. 
I don't know her name but I know she's around 5'5" (65") or 5'7" (67"). 
She has red hair and she drives a Tesla Model S. 
I know that she attended the SQL Symphony Concert 3 times in December 2017.
*/

/*
QUERY TO FIND THE TRUE MASTER MIND BEHIND THE MURDER

EXPLANATION:
I START FROM FACEBOOK_EVENT_CHECKIN TO MAKE SURE THAT ALL 3 ROWS WILL ALWAYS BE 
VISIBLE (AS RIGHT JOIN DOESN'T WORK HERE).
THEN I MOVE TO PERSON, THEN INCOME AND DRIVERS LICENSE, 
SO I HAVE ALL TABLES AND ATTRIBUTES AT MY DISPOSAL.
THEN I FILTER EVERYTHING ACCORDING TO JEREMY INTERVIEW, FILTERING CAR, MODEL, ETC.
BY THE END OF ALL FILTERS ONLY ONE PERSON REMAINS, Miranda Priestly
*/

select
f.event_id, f.event_name, f.date, 
p.name, d.height, d.hair_color, d.gender, d.car_make, d.car_model, i.annual_income
from facebook_event_checkin f
left join person p on f.person_id = p.id
left join income i on i.ssn = p.ssn
left join drivers_license d on p.license_id = d.id
where 1=1
and lower( car_make ) like '%tesla%'
and lower( car_model ) like '%model%s%'
and hair_color = 'red'
and gender = 'female'
and (height >= 65 and height < 68)
and lower( f.event_name ) like '%sql%symphony%concert%'
and cast( f.date as text ) like '201712%'

/* 
RESULT:
Congrats, you found the brains behind the murder! 
Everyone in SQL City hails you as the greatest SQL detective of all time. 
Time to break out the champagne!
*/

Reference

https://mystery.knightlab.com/

Informatica Powercenter – XML Transformation (XML Source Qualifier, XML Parser, XML Generator)

I highly recommend reading the first three links on References, at the end of this post.

About XML

Here is what the official documentation says: “Each group in an XML source definition is analogous to a relational table, and the Designer treats each group within the XML Source Qualifier transformation as a separate source of data.”

The best way to understand how you can consume an XML file (that is, get data from the XML into your database) is to think about relational data. Look at this:

xml file

This is an XML file. You can see that we have Customers and Orders. That’s like 2 tables, so, when consuming this XML we will probably end up with those two tables.

xml customer

Inside CUSTOMERS we have 4 customers; those are the rows in our table. If we do our ETL from XML to the database, this is what we get:

customer from xml to database

Same thing happens for Orders:

xml order
order from xml to database

Did you notice that we have an extra “table” inside Customers, which is FULLADDRESS, and in Orders it’s SHIPINFO?

Therefore, if you want to go as fully normalized as you can, you may end up with 4 tables. Can you picture this?

how our xml elements relate

Customers can have many orders; we can find this information by looking again at our XML file:

customers can have zero or many orders
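To make that relationship concrete, a relational layout for those four “tables” could look something like this (a sketch only; the column names here are assumptions, since they depend on the actual sample XML):

CREATE TABLE CUSTOMER (
    CUSTOMERID   VARCHAR(10) PRIMARY KEY,
    CONTACTNAME  VARCHAR(100)
);
CREATE TABLE FULLADDRESS (
    CUSTOMERID   VARCHAR(10) REFERENCES CUSTOMER (CUSTOMERID),  -- child of CUSTOMER
    ADDRESS      VARCHAR(200),
    CITY         VARCHAR(50)
);
CREATE TABLE ORDERS (
    ORDERID      INT PRIMARY KEY,
    CUSTOMERID   VARCHAR(10) REFERENCES CUSTOMER (CUSTOMERID),  -- a customer can have zero or many orders
    ORDERDATE    DATE
);
CREATE TABLE SHIPINFO (
    ORDERID      INT REFERENCES ORDERS (ORDERID),               -- child of ORDERS
    SHIPADDRESS  VARCHAR(200),
    SHIPCITY     VARCHAR(50)
);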

XML Schema file – XSD

You will not have an XML schema file for your XML files every day, but if you do, it helps a lot. What’s an XSD file? Something like this:

xsd file

XML Hands on Powercenter

When using Powercenter you can import your XML source from an XSD or XML file; however, regardless of how you import your source, it’s always a good idea to double-check your XML definition.

Here’s how you can import your XML source file in Powercenter. Go to Sources > Import XML Definition > browse to your file.

sources import xml definition

Then you get your file, but before hitting OK or OPEN, you can check the Advanced Options. Honestly, I don’t know what each and every button in these advanced options does, but the documentation can tell you.

advanced option

Advanced option explained, from the official docs:

The next step is the XML Import Wizard.

powercenter xml wizard

The next screen of the wizard may be hard to make sense of, but you already know how XML maps to a relational database. What you need to tell the wizard here is how your XML file is laid out. In our case we know we have a hierarchy, because FULLADDRESS and SHIPINFO have a parent-child relationship to CUSTOMERS and ORDERS.

Then you finish the wizard and this is what you get:

your xml source

To adjust your XML views go > Edit XML Definition.

edit XML screen

These are our XML views. As far as I know, an XML view is a synonym for a table. Here we have two views: X_Order and X_Customer. This is where you adjust your views according to your expectations. What I mean is, if a field is in the wrong place, you can correct it; if a new view needs to be created, you can create it here; and if a view needs to be recreated from the ground up, you can also do that here.

By the way, as you may have noticed on the XML Wizard’s first screen, there is a “do not generate XML views” option. If we take that option, all views will require manual creation.

In the “getting started” link in the References you can better understand how to work with views.

Using XML Source Qualifier Transformation in Mapping

here’s our full mapping

In this mapping we have our XML source flowing data to two different tables, and also sending the same data to two Sorter transformations, then joining them in a sorted join, and then storing the result in a single table. I will not go into the details of sort and join since I have already written about them.

Using XML Parser Transformation in Mapping

xml parser transformation

I won’t go into much detail about the Parser transformation, but what it does is receive a DATAINPUT, which is a string containing XML data, and do the same thing the XML Source Qualifier does, like we already did in the previous examples.

The XML Parser transformation is an active transformation.

Using XML Generator Transformation in Mapping

xml generator

The XML Generator is the opposite of the XML Parser. Here the XML Generator receives relational data and generates an XML out of it. Say you have a database and want to extract a table in XML format to a flat file; for that you would use the XML Generator transformation.

The XML Generator transformation is an active transformation.

References

https://docs.informatica.com/data-integration/powercenter/10-4-0/xml-guide/preface.html

https://docs.informatica.com/data-integration/powercenter/10-5-1/getting-started/tutorial-lesson-6/using-xml-files.html

https://docs.informatica.com/data-integration/powercenter/10-5-1/transformation-guide/xml-transformations.html

https://exploreinformatica.com/download-sample-xsd-and-xml-file/

https://youtu.be/lyAaiginA4E – XML Transformation in informatica

https://youtu.be/iNfYRK5Qpy0 – Informatica PowerCenter- Working with XML Data-Informatica Training

https://youtu.be/Ux3Nxb0geus – XML Generator transformation scenario in INFORMATICA by manish

https://youtu.be/9408HdOYhAA – O que é XML – Quebrando mitos e lendas (pt-br)

https://youtu.be/KeLiQXqVgMI – XML Tutorial for Beginners | What is XML | Learn XML

Informatica Powercenter – Stored Procedure Transformation

Let’s start checking out what the official documentation has to say:

“A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate tasks that are too complicated for standard SQL statements. The Stored Procedure transformation is a passive transformation. You can configure a connected or unconnected Stored Procedure transformation.

A stored procedure is a precompiled collection of Transact-SQL, PL-SQL or other database procedural statements and optional flow control statements, similar to an executable script. Stored procedures are stored and run within the database. You can run a stored procedure with the EXECUTE SQL statement in a database client tool, just as you can run SQL statements. Unlike standard SQL, however, stored procedures allow user-defined variables, conditional statements, and other powerful programming features.”

  • Stored procedures allow user-defined variables, conditional statements, and other powerful programming features
  • If a stored procedure returns a result set rather than a single return value, the Stored Procedure transformation takes only the first value returned from the procedure.

Connected and Unconnected

Stored procedures can be either connected or unconnected.

Connected: use this if you want your Stored Procedure transformation to receive data from an input port and send it as an input parameter to the stored procedure, or to send the results of the stored procedure as an output parameter to another transformation.

Unconnected: here the Stored Procedure transformation either runs before or after the session, or is called by an expression in another transformation in the mapping.

The following table compares connected and unconnected transformations:

connected vs unconnected Store Procedure Transformation

Stored Procedure runs

  • Unconnected: If no transformation references the Stored Procedure transformation, you have the option to run the stored procedure once before or after the session.
  • Unconnected: If a transformation does reference the Stored Procedure transformation, then it will be run every time a row passes through that transformation.
Stored Procedure Type

The following list describes the options for running a Stored Procedure transformation:

  • Normal. The stored procedure runs where the transformation exists in the mapping on a row-by-row basis. This is useful for calling the stored procedure for each row of data that passes through the mapping, such as running a calculation against an input port. Connected stored procedures run only in normal mode.
  • Pre-load of the Source. Before the session retrieves data from the source, the stored procedure runs. This is useful for verifying the existence of tables or performing joins of data in a temporary table.
  • Post-load of the Source. After the session retrieves data from the source, the stored procedure runs. This is useful for removing temporary tables.
  • Pre-load of the Target. Before the session sends data to the target, the stored procedure runs. This is useful for verifying target tables or disk space on the target system.
  • Post-load of the Target. After the session sends data to the target, the stored procedure runs. This is useful for re-creating indexes on the database.

Some more points:

  • You can run more than one Stored Procedure transformation in different modes (connected or unconnected) in the same mapping.
  • You cannot run the same instance of a Stored Procedure transformation in both connected and unconnected mode in a mapping.
  • The Integration Service opens the database connection when it encounters the first stored procedure. The database connection remains open until the Integration Service finishes processing all stored procedures for that connection. The Integration Service closes the database connections and opens a new one when it encounters a stored procedure using a different database connection.
  • To run multiple stored procedures that use the same database connection, set these stored procedures to run consecutively.
Stored Procedure Ports

Stored Procedure Transformation Hands-On

Enough talking, let’s do some practice. First, the stored procedure itself; here’s the one we created:

our Stored Procedure
USE [PWC]
GO

/****** Object:  StoredProcedure [dbo].[SP_GET_ITEM_COUNT]    Script Date: 7/28/2022 7:47:31 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


CREATE PROCEDURE [dbo].[SP_GET_ITEM_COUNT] @ITEM_ID INT, @outRes as INT OUTPUT
AS SELECT @outRes = COUNT(*) FROM ORDER_ITEMS WHERE ITEM_ID = @ITEM_ID
return @outRes
GO

What is this stored procedure doing? It selects COUNT(*) from ORDER_ITEMS, filtered by ITEM_ID, and stores the result in @outRes.
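If you want to test it straight from a database client before going to Powercenter, a call would look something like this (ITEM_ID 10 is just an example value):

-- declare a variable to receive the output parameter, execute, then inspect it
DECLARE @result INT;
EXEC [dbo].[SP_GET_ITEM_COUNT] @ITEM_ID = 10, @outRes = @result OUTPUT;
SELECT @result AS ITEM_COUNT;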

Now, to the mapping.

our mapping

We have a pretty straightforward mapping. By now you should already be used to all the transformations and objects in the image above.

Key point here is our Stored Procedure transformation.

SP ports tab

Port-wise no surprises. We just unchecked Input port for outRes as we won’t make any use of that port.

SP properties tab

For properties we are using type Normal, as this is the only type we can use for a connected Stored Procedure.

Conclusion

Our hands-on example is very timid, and I believe the real power of the Stored Procedure transformation is in the unconnected types for pre- and post-load of source/target. However, from a certification point of view it’s vital that we understand how the SP transformation behaves, and I hope this post can help us with that!

References

https://docs.informatica.com/data-integration/powercenter/10-5/getting-started/tutorial-lesson-5/creating-a-mapping-with-fact-and-dimension-tables/creating-a-stored-procedure-transformation.html

https://docs.informatica.com/data-integration/powercenter/10-4-0/transformation-guide/stored-procedure-transformation/stored-procedure-transformation-overview.html

Informatica Powercenter – SQL Overrides

I did not really find any great resource on this topic, but the overall idea of a SQL override is to override a transformation’s default behavior by writing a SQL query instead.

Of course, SQL overrides work only on relational data. There is no SQL override for flat files.

For instance, take a Source Qualifier transformation. If we edit our Source Qualifier transformation we will find tabs for Transformation, Ports, Properties, Sources and Metadata Extensions.

source qualifier transformation’s ports tab
source qualifier transformation’s properties tab

In the SQ’s Properties tab we have the “SQL Query” option; it’s selected in the image above. See that it says “user-defined SQL statement”. So, if you write a query there you need to make sure that:

  • The SELECT statement must list the port names in the order in which they appear in the transformation.
  • The number of columns in the SELECT statement in the query must match the number of ports in the Source Qualifier transformation.
  • If you enter an SQL query, the Integration Service uses only the defined SQL statement. The SQL Query overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation.

All those yellow-highlighted properties in the image above will be overridden by your SQL Query.
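For instance, say the Source Qualifier has the ports CUSTOMER_ID, NAME and COUNTRY (hypothetical names). A valid override has to keep those same columns, in that same order:

SELECT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.NAME, CUSTOMERS.COUNTRY
FROM CUSTOMERS
WHERE CUSTOMERS.COUNTRY = 'Portugal'  -- the extra filtering logic now lives in the query itself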

The Default Query

We know that by using SQL Query we are overriding our default query, but what is our default query?

To check your default query, simply open the SQL Query editor within your transformation and hit generate query.

example of default query

Why should we use SQL Override?

We can make use of a SQL override every time we have a more complex scenario that would be easier and simpler to build via SQL than via Informatica Powercenter transformations; for instance, if you have multiple joins, CTEs, window functions, subqueries, etc. For those cases, a SQL Query override sounds like a better option.

We can also override our SQL queries when Powercenter’s default query is not using the best indexes; we can tweak the query to make sure the best indexes are used.

Also, if you already have a report or a dataset ready in your database and you need to migrate it to Powercenter, you can just grab the query that is producing your dataset and set it in Powercenter via the SQL Query override. That way you will not need to recreate the whole thing using Powercenter transformations, which can speed up your migration process.

What is the difference between SQL override and lookup override in Informatica?

The role of a SQL override is to limit the number of incoming rows entering the mapping pipeline, whereas a lookup override is used to limit the number of lookup rows, avoiding a whole-table scan and saving lookup time and cache.
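A lookup override follows the same idea, just on the Lookup transformation. A sketch (the table and column names here are hypothetical):

-- cache only the rows the mapping actually needs instead of the whole dimension table
SELECT CUSTOMER_ID, CUSTOMER_NAME, COUNTRY
FROM DIM_CUSTOMER
WHERE ACTIVE_FLAG = 1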

References

https://wingsoftechnology.com/sql-override-informatica/

https://exploreinformatica.com/sql-override%E2%80%8B-informatica%E2%80%8B/

https://diogonvidal.wixsite.com/powercenter/post/sql-override-no-powercenter

https://www.heyiamindians.com/why-we-use-sql-override-in-informatica/

Informatica Powercenter – User Defined Functions

A User Defined Function (UDF) is a way Powercenter gives us to create our own personal, reusable functions. We can use our UDFs in Expression transformations in order to reuse expression logic.

For instance, say you want to RTRIM(), LTRIM() and UPPER() several ports. Instead of writing these 3 functions every time, you can simply create a UDF and call it.

To create one, go to User-Defined Functions, right-click it > New.

This is User-Defined Function screen.

Some key points about UDF’s:

Type is either PUBLIC or PRIVATE:

  • Public. Callable from any user-defined function, transformation expression, link condition expression, or task expression.
  • Private. Callable from another user-defined function. Create a private function when you want the function to be part of a more complex function. The simple function may not be usable independently of the complex function.

Although you can place a user-defined function inside another user-defined function, a function cannot refer to itself. For example, the UDF RemoveSpaces includes a UDF TrimLeadTrailSpaces. TrimLeadTrailSpaces cannot include RemoveSpaces; otherwise, RemoveSpaces is invalid.

User-Defined Function practice

Here we created a User Defined Function to do pretty much the same job we did with our mapplet: adjust some basic text strings.

UDF code

Notice that our UDF is a public one.

Here’s our code. It checks if the input is null; if it is, it returns ‘THIS IS A NULL ROW’, otherwise it UPPERs and TRIMs the text.

IIF
(
ISNULL( TEXT_INPUT )
, 'THIS IS A NULL ROW'
, UPPER( LTRIM( RTRIM( TEXT_INPUT ) ) )
)

Here’s our UDF in use within our mapping:

UDF inside expression transformation

Notice that to call a UDF you use the “:UDF.” prefix; check the image to see the details.
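In text form, calling the UDF from an Expression transformation port looks something like this (the UDF name here is hypothetical; use whatever you named yours):

:UDF.CLEAN_TEXT( NAME )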

After running this mapping this is our result:

our UDF worked!

References

https://docs.informatica.com/data-integration/powercenter/10-4-0/designer-guide/working-with-user-defined-functions/working-with-user-defined-functions-overview.html

https://youtu.be/ej9DnVrSbro – How to Create User Defined Functions in PowerCenter

Informatica Powercenter – Mapplets

Here what official documentation says about Mapplets:

A mapplet is a reusable object that you create in the Mapplet Designer. It contains a set of transformations and lets you reuse the transformation logic in multiple mappings.

When you use a mapplet in a mapping, you use an instance of the mapplet. Any change made to the mapplet is inherited by all instances of the mapplet.

My take on this is: if you have a piece of mapping logic that you wish to reuse, instead of copying and pasting it every time you need that flow of transformations, you create a mapplet, so the next time you don’t need to copy-paste, you just use the mapplet you created.

Another important point about mapplets is that a mapplet is a set, a group, of transformations. What I mean is, if you can achieve your logic by using only one transformation, then you do not need a mapplet, you just create a reusable transformation. However, if your logic requires more than one transformation, then yes, a mapplet is what you are looking for.

Active or Passive

A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain one or more active transformations. Passive mapplets contain only passive transformations.

All transformation rules apply to the mapplet. For example, as with an active transformation, you cannot concatenate data from an active mapplet with a different pipeline.

Rules for creating a Mapplet

Use the following rules and guidelines when you add transformations to a mapplet:

  • If you use a Sequence Generator transformation, you must use a reusable Sequence Generator transformation.
  • If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal.
  • You cannot include PowerMart 3.5-style LOOKUP functions in a mapplet.

You cannot include the following objects in a mapplet:

  • Normalizer transformations
  • COBOL sources
  • XML Source Qualifier transformations
  • XML sources
  • Target definitions
  • Other mapplets

Best practice: although reusable transformations and shortcuts can be used in a mapplet, to protect the validity of the mapplet, use a copy of a transformation instead. Reusable transformations and shortcuts inherit changes to their original transformations, which might invalidate the mapplet and the mappings that use it.

Creating and Using Mapplet

We create mapplets pretty much the same way we create transformations, with some differences, of course.

where you can access your mappings

One very important difference is that mapplets require input and output. You can create them by going to Transformation -> Create -> Mapplet Input / Mapplet Output, or you can click the menu bar icons.

An important fact about mapplets is that their input or output ports do not have a datatype definition.

You can create mapplet ports manually, like we did in the image above, or you can drag a source into your mapplet, drag all the fields to your input/output, and then remove the source from your mapplet.

just drag ports from your source qualifier to your input/output ports
then delete it, and your ports will be ready to go

So, let’s say I want to do some text cleansing: apply lowercase, remove leading and trailing spaces, and if the value is null change it to ‘not informed’.

simple mapplet
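The cleansing logic inside the mapplet’s Expression transformation is something along these lines (a sketch; NAME is the incoming port):

IIF( ISNULL( NAME ), 'not informed', LOWER( LTRIM( RTRIM( NAME ) ) ) )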

Testing our mapplet

To test our mapplet we are creating a .csv flat file as source and loading its result into our MS SQL database.

Here’s our source csv file:

And here is our mapping, using our mapplet:

Notice what we did here;

  • we passed our ID and NAME from source directly to ID and NAME on target,
  • and our cleansed data, the one from our mapplet, receives the same NAME from source but stores its result in RESULT_CHECK on target. This way, when we query our database we will be able to see the difference between those two fields.

In your workflow session the mapplet’s transformations will be shown under the transformations group, but named with the mapplet name before the transformation name (yeah, I should have given my mapplet transformations better names, oops).

Here’s our result, it worked =)

our mapplet worked!

References

https://youtu.be/m7pGHQd8tLI – Reusable Transformation & Mapplets in Informatica – In depth Knowledge with Real Time Scenarios

https://youtu.be/B7uj_XvGG6M#Informatica #Mapplet Video Tutorial – Part 25 || Informatica #Free #Training Sessions

Informatica Powercenter – Update Strategy Transformation SCD1 SCD2

Something worth keeping in mind: for the Update Strategy to work properly you need to make sure that your target table has a primary key.

Here’s what the official documentation says about Update Strategy Information:

“The Update Strategy transformation is an active transformation. When you design a data warehouse, you need to decide what type of information to store in targets. As part of the target table design, you need to determine whether to maintain all the historic data or just the most recent changes.”

Note: the Update Strategy itself is pretty straightforward, but there is a lot around it.

Setting up Update Strategy Transformation

First, understand this: we are using two different tables with the same structure to play around with here: CustomerSCD_base and CustomerSCD.

CustomerSCD_base does not have a primary key and allows nulls everywhere, so we can do anything we want here.

CustomerSCD has CustomerID as primary key.

Our mapping in Powercenter is like this:

in our mapping we have CustomerSCD_base as source and CustomerSCD as target, and, of course, the Update Strategy as transformation.

By editing our transformation we can go to the Properties tab and check the “Update Strategy Expression”.

Here are all the options we can have for that field:
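For reference (from the Informatica docs), the constants accepted by the Update Strategy Expression are:

  • DD_INSERT (0)
  • DD_UPDATE (1)
  • DD_DELETE (2)
  • DD_REJECT (3)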

To better understand how it works let’s see some use cases.

Simple Insert

Data-wise, here’s our current situation:

1 row for source and same for target

That way, if I run this workflow nothing will happen, as our CustomerSCD primary key will not allow the same key to be inserted into the table again.

We can even see here in the Monitor that a row got rejected:

Let’s add a new row to our CustomerSCD_base and see what happens.

1 new row added to source

Now we have a new row in our base table. Let’s run our Powercenter workflow once more and see the result.

1 new row added to target

Now Powercenter was able to insert our new row from the base table into the SCD table. The Monitor shows that we had 1 rejected and 1 affected row.

Simple Update

Now I changed our transformation from INSERT to UPDATE; let’s see what happens.

update strategy transformation properties tab

Our result still shows the same data.

But Monitor shows 2 affected rows.

That’s because those 2 rows are eligible for update; however, no field was changed.

Let’s try something out of the ordinary and change the CustomerID, the primary key on our target table, and see what happens.

See that I changed it from ANATR to XPTO. Let’s run our workflow again and see what happens.

This time we only had 1 affected row, because Powercenter was not able to find a row with ‘XPTO’ as primary key in the target table.

A more practical example would be updating a user’s Phone. Let’s change ALFKI’s number to something else.

Let’s run workflow again and see…

phone number updated from source to target

You see, we have now updated our target table with the new Phone number. This kind of update in a dimension table is called Slowly Changing Dimension type 1, SCD1 for short.

Here is what Kimball says about SCD1:

With slowly changing dimension type 1, the old attribute value in the dimension row is overwritten with the new value; type 1 attributes always reflects the most recent assignment, and therefore this technique destroys history. Although this approach is easy to implement and does not create additional dimension rows, you must be careful that aggregate fact tables and OLAP cubes affected by this change are recomputed.

Kimball group

Simple Delete

You now already understand what happens, so I will not be repetitive. For deletion I will just do this one time.

So, changing transformation:

Running workflow and checking results:

Only ALFKI was deleted, since it’s the only key matching from source to target.

Going a bit more complex

So far we have seen everything there is to the Update Strategy transformation itself; however, the Update Strategy is commonly used together with other transformations, like Router, Filter, Expression, etc. We do that to create more complex scenarios.

For instance, in our “simple update” topic we updated one of our CustomerIDs in the source table. Remember:

We noticed that after this change from ANATR to XPTO the row remained useless. But normally, when you have an ETL process like this, you usually add new rows and update existing ones. Therefore, in this scenario we should have an INSERT+UPDATE behavior, following this set of rules:

  • if CustomerID does not exist, then INSERT
  • if CustomerID already exists AND some attribute changed, then UPDATE
  • if CustomerID already exists AND nothing changed, then do NOTHING

By doing this we should get XPTO inserted into our target table.

Update or Insert scenario – brute force

Let’s try to implement this logic:

  • if CustomerID does not exist, then INSERT
  • if CustomerID already exists AND some attribute changed, then UPDATE
  • if CustomerID already exists AND nothing changed, then do NOTHING

Here’s our current data set:

Here we have two rows in source (CustomerSCD_base) and two rows in our target table (CustomerSCD).

If we run a simple UPDATE, nothing will happen. So now I am creating an UPDATE/INSERT mapping.

top transformation update, bottom transformation inserts (or vice-versa)

This works. What Powercenter does here is run all rows from source through each of my transformations. So both of my rows are checked for INSERT and both for UPDATE. This works, but not if your source and target have many rows; I do not recommend you do this in your real-world environment. Here’s the result:

Now XPTO is added to CustomerSCD. If I run the same workflow again, nothing will be inserted, since there is no new CustomerID in our source table; only the update will happen.

So, how can we improve this scenario? How can we tell Powercenter to check if it’s a new row and insert only if it’s new, and update only those that already exist? We need to, somehow, check whether the CustomerID from the source table exists in the target table…

Update Insert Scenario – Slowly Changing Dimension type 1 – SCD1 – Manually

One more time, let’s try to implement this logic (right after the list you will also find the same logic sketched in plain SQL, just for comparison):

  • if CustomerID does not exist, then INSERT
  • if CustomerID already exists AND some attribute changed, then UPDATE
  • if CustomerID already exists AND nothing changed, then do NOTHING
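For comparison, here is that same logic expressed directly in T-SQL with a MERGE. This is a sketch, not part of the Powercenter solution; it only shows a few of the columns and ignores NULL handling:

MERGE CustomerSCD AS tgt
USING CustomerSCD_base AS src
   ON tgt.CustomerID = src.CustomerID
WHEN NOT MATCHED BY TARGET THEN
    -- rule 1: the CustomerID does not exist yet, so insert it
    INSERT (CustomerID, CompanyName, Country, Phone)
    VALUES (src.CustomerID, src.CompanyName, src.Country, src.Phone)
WHEN MATCHED AND (   tgt.CompanyName <> src.CompanyName
                  OR tgt.Country     <> src.Country
                  OR tgt.Phone       <> src.Phone) THEN
    -- rule 2: the CustomerID exists and something changed, so update it
    UPDATE SET CompanyName = src.CompanyName,
               Country     = src.Country,
               Phone       = src.Phone;
-- rule 3 (nothing changed) needs no clause: MERGE simply leaves those rows alone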

Here is our mapping goal.

informatica powercenter slowly changing dimension type 1 mapping
final complete mapping, scd1

How does that mapping work? How do we build it?

Lookup transformation: we are using a lookup here so we can check whether our source table’s CustomerID exists in our target table.

creating lookup transformation

When we create our lookup we have to choose TARGET as the lookup table; this means that we want to check our CustomerID from source against the CustomerID from target.

Once the lookup is created, drag your CustomerID from source to your Lookup transformation, and create a condition comparing CustomerID from source against CustomerID from target.

editing lookup

Expression transformation: we create an Expression transformation so we can have all the fields to compare and create our rule to set the flag for insert or update.

  • Drag all fields from the Source Qualifier to your Expression transformation
  • Also drag all fields from the Lookup transformation to your Expression transformation
    • note 1: there is no need to drag CustomerID1 from your Lookup transformation, since this information is already coming from your Source Qualifier.
    • note 2: only drag from the lookup those fields you want to check for changes in your Expression transformation. In this example we are checking all fields from target, so we had to drag them all.
  • Rename all fields you brought from your Lookup transformation and add the LKP_ prefix (this is optional; it just helps you spot which fields came from the lookup)
  • Add two new fields, UPDATE_FLAG and INSERT_FLAG
editing expression transformation

Now, inside your flag fields we write expressions to validate our set of rules.

insert_flag
IIF( ISNULL( LKP_CustomerID1 ), TRUE, FALSE )
update_flag
IIF( CustomerID = LKP_CustomerID1 AND
( 
CompanyName != LKP_CompanyName1
OR ContactName != LKP_ContactName1
OR ContactTitle != LKP_ContactTitle1
OR Address != LKP_Address1
OR City != LKP_City1
OR Region != LKP_Region1
OR PostalCode != LKP_PostalCode1
OR Country != LKP_Country1
OR Phone != LKP_Phone1
OR Fax != LKP_Fax1 
) , TRUE, FALSE )

The INSERT_FLAG code checks whether LKP_CustomerID1 is null or not. It returns TRUE if it is null, and it will be null every time a CustomerID from source does not exist on target.

The UPDATE_FLAG code checks whether the CustomerID from source equals the CustomerID from target (meaning the CustomerID already exists) AND whether any of the other fields changed; only then does it return TRUE.

The Router transformation: we create the Router transformation to segregate results, creating an output group for each result we want. We will have an output group for INSERT and another one for UPDATE. Here’s how we handle the Router transformation:

  • With the exception of the LKP_ fields, drag all fields from your Expression transformation to the Router (see the first image below)
  • Make sure you also drag INSERT_FLAG and UPDATE_FLAG
  • Edit your router transformation (double click it)
  • In GROUPS create two new groups, one for INSERT, another for UPDATE, and a DEFAULT one will be automatically created (see second image below)
  • For group INSERT set the Group Filter Condition to INSERT_FLAG (see second image below)
  • For group UPDATE set the Group Filter Condition to UPDATE_FLAG (see second image below)
express transformation to router
router groups

For the router’s INSERT group, just drag its results to your target table. You do not need to drag the INSERT_FLAG or UPDATE_FLAG port to your target.

router’s insert output ports to target

For the router’s UPDATE group, drag it to an Update Strategy transformation, and make sure your Update Strategy transformation is set to DD_UPDATE. To finish, drag the output ports from your Update Strategy to your target table.

router’s update output ports to target + update strategy properties

Save your mapping, create a workflow and run it. Just a couple of things to be aware of:

  1. Make sure your workflow session is set to DATA DRIVEN.

  2. On your workflow session Mapping tab, make sure your Lookup is set to your target database connection.

SCD1 – Running workflow and results

Now that our mapping and workflow are ready, let’s run it. But first, let’s check our dataset. Here it is:

initial dataset pre-run. Target table empty

Here is our dataset after the workflow for that mapping is executed for the first time:

4 rows inserted

Here are our Monitor results. We successfully inserted 4 new rows into our target table.

4 rows inserted

Here’s our result if we do nothing and just re-run the workflow. Nothing changed.

nothing changed
no rows affected

Let’s update a row from our source table.

changed source table customer ALFKI country name to Portugal

Now let’s run our workflow again. This time we expect 1 affected row in our Monitor, and also that the target table gets updated.

update worked as expected!
monitor results like expected

So far, so good. We are now able to keep our rows with the most recent data.

Slowly Changing Dimension type 1 – SCD1 – Using Powercenter Wizard

Pretty much everything we have done manually so far can be implemented via the Powercenter Wizard. Yes! Powercenter has wizards to create SCD1, SCD2 and SCD3. Let’s see how the Powercenter Wizard creates an SCD1, and then we will compare our manual mapping against the wizard’s mapping.

Go to Mappings > Wizard > Slowly Changing Dimensions…

mapping wizard
name your SCD
choose your source and name a new target
setting fields

Setting the fields is important. Here we are saying that CustomerID is our logical key field, which means that this is the primary key, and all the other fields are going to be used to check for changes.

Once we hit Finish, our mapping is created for us. Let’s do a detailed review, step by step, of what is happening here.

the whole picture

Source Qualifier and Lookup transformation

Source Qualifier to Lookup transformation: here the wizard created a lookup from the target to check whether what we are inserting already exists in the target or not.

We can validate that by checking the table name lookup (img below) and the condition (img above).

Something curious is that our key used in comparison is not checked for output. Maybe we should have done that in our manual attempt.

the wizard unchecked output port for source key

Expression Transformation

After the lookup we have the Expression transformation. Here the wizard brought in all ports from the lookup (minus CustomerID), and also all ports from the Source Qualifier.

ports coming from lookup and source qualifier

The wizard also created two new fields: ChangedFlag and NewFlag.

changedFlag code
IIF(NOT ISNULL(PM_PRIMARYKEY)
AND
(
DECODE(Address,PM_PREV_Address,1,0) = 0
OR
DECODE(City,PM_PREV_City,1,0) = 0
OR
DECODE(CompanyName,PM_PREV_CompanyName,1,0) = 0
OR
DECODE(ContactName,PM_PREV_ContactName,1,0) = 0
OR
DECODE(ContactTitle,PM_PREV_ContactTitle,1,0) = 0
OR
DECODE(Country,PM_PREV_Country,1,0) = 0
OR
DECODE(Fax,PM_PREV_Fax,1,0) = 0
OR
DECODE(Phone,PM_PREV_Phone,1,0) = 0
OR
DECODE(PostalCode,PM_PREV_PostalCode,1,0) = 0
OR
DECODE(Region,PM_PREV_Region,1,0) = 0
),TRUE,FALSE)
newFlag code
IIF(ISNULL(PM_PRIMARYKEY),TRUE,FALSE)

To sum up:

  • NewFlag returns TRUE if PM_PRIMARYKEY is null. PM_PRIMARYKEY is the key that comes from our target table. It will be null if our Lookup transformation condition returns null; in other words, it will be null if our CustomerID from source does not exist on target.
  • ChangedFlag returns TRUE if PM_PRIMARYKEY is not null (meaning that this key already exists in our target database) and the DECODE() of any field we marked for checking returns 0. What DECODE() does here is check whether the source value equals the target value: if it is equal it returns 1, otherwise it returns 0. By using DECODE() the wizard is checking whether any field changed.

Filter Transformation | UPDATE Changed record

The Filter transformations: after the Expression transformation the mapping moves on to the Filter transformations.
We have two filters: FIL_InsertNewRecord and FIL_UpdateChangedRecord.

Let’s first talk about FIL_UpdateChangedRecord.

FIL_UpdateChangedRecord

To compose the UpdateChangedRecord filter transformation we have ports coming from the Source Qualifier, the Lookup and the Expression.

  • ChangedFlag is the only port that comes from the Expression transformation.
  • PM_PRIMARYKEY is the only port that comes from the Lookup transformation.
  • All other ports come from the Source Qualifier.
    • note that CustomerID from source is not brought to this filter transformation.
FIL_UpdateChangedRecord – filter transformation edit

The only property set on the FIL_UpdateChangedRecord transformation is the Filter Condition, which is ChangedFlag. This means that this filter will pass through all rows where ChangedFlag returns TRUE (remember: ChangedFlag returns TRUE every time we have a record from source whose ID already exists in target and some of the other fields, like address or city, changed).

Update Strategy for Update Records

After the filter transformation we move to the Update Strategy transformation. No secrets here; we just ensure that we are using the DD_UPDATE function. The filter is already making sure that only the rows that matter are passed through, so here in the Update Strategy transformation we just make sure that those rows will be updated in our target. Notice that ChangedFlag is not passed along. Other than that, all ports, with the exception of CustomerID, will be updated in our target table.

Note that CustomerID is not changed. It means that our update will update a row based on PM_PRIMARYKEY and keep CustomerID intact.

update strategy using DD_UPDATE function

Filter Transformation | INSERT New record

The other branch in our pipeline, after leaving expression transformation, is the FIL_InsertNewRecord filter transformation.

FIL_InsertNewRecord

To compose the InsertNewRecord filter transformation we have ports coming from the Source Qualifier and the Expression only. For this filter no port comes from the Lookup transformation (it’s an insert).

  • NewFlag is the only port that comes from the Expression transformation.
  • All other ports come from the Source Qualifier.
    • note that CustomerID from source is brought to this filter transformation.

To be super clear: unlike the update version, this filter uses CustomerID from source, because this is an insert and it would not be possible to use a CustomerID from target, since it does not yet exist there. That’s why we are using CustomerID from source.

For this insert filter the only property set is the Filter Condition, which is NewFlag. This means that every time a row returns TRUE for NewFlag it will be passed along by this filter, and any row that returns FALSE for NewFlag will be dropped.

Remember: our logic behind NewFlag says that a row is TRUE every time its source ID does not exist in our target table.

Update Strategy for Insert Records

update strategy for insert records

Here in this Update Strategy transformation we just make sure that we are using the DD_INSERT function. All ports used come from the Filter transformation, with the exception of the NewFlag port, which is only there for filtering purposes, so there is no reason to pass it along to our Update Strategy transformation.

Sequence Generator transformation

This is a new one. We had not created this transformation in our first manual SCD1 example. The Sequence Generator transformation is being used here to create PM_PRIMARYKEY. As we can see, the sequence starts at 0, increases by 1, and the current value is 1, meaning that when inserting, the first PM_PRIMARYKEY registered will receive 1, the second 2, and so on.

sequence generator

Creating our target

So far the Wizard has created our mapping, but our target isn’t created inside our database yet. So I will create it inside our MS SQL database.

Select your target definition. Go to Targets > Generate/Execute SQL…

target > generate execute SQL
create your table with Primary Key

Workflow

Remember that your workflow session has to be Data Driven and that your lookup should be pointing to your target database.

Workflow results Monitor

This is our starting dataset. We have 5 rows in source and 0 in target.

start

Result after running the workflow for the first time: all rows were inserted from source to target.

first run result

Result after doing nothing and just re-running the workflow: nothing happened.

same thing

Result after updating a field in source and then running the workflow again: the update was transmitted from source to target.

updated, but before workflow execution
result after workflow execution

SCD2 – SCD type 2 – Slowly Changing Dimension type 2

Everything we did so far is great, but we are losing historical data. I mean, our customer moved from Germany to Portugal, but we simply lost track of where he lived before. We are updating data, but losing history. This leads us to a new logic paradigm:

  • if CustomerID does not exist, then INSERT
    • create two datetime columns to track the row’s lifespan: DT_START and DT_END
    • DT_START receives the current system date when the row is inserted into our database
    • DT_END receives a maximum value like 1/1/9999
    • create an ACTIVE_FLAG column that receives 1, indicating it’s the current value
  • if CustomerID already exists AND some attribute changed, then
    • INSERT a new row with the new data
      • this new row’s DT_START receives the current system date
      • this new row’s ACTIVE_FLAG column receives 1
    • the OLD row will:
      • update ACTIVE_FLAG to 0 (zero), indicating it’s inactive
      • update DT_END with the current system date
  • if CustomerID already exists AND nothing changed, then do NOTHING

( Also, something else… So far we have used an 11-column table. In real-world examples you can have much larger tables, so understand that sometimes you will only want to track SCD2 for specific columns. We can end up with something like country being tracked as SCD2 and all other columns as SCD1; this means that a new row will only be created if the country changes, otherwise it’s just a matter of updating the data in the other columns. )

What Kimball says:

Slowly changing dimension type 2 changes add a new row in the dimension with the updated attribute values. This requires generalizing the primary key of the dimension beyond the natural or durable key because there will potentially be multiple rows describing each member. When a new row is created for a dimension member, a new primary surrogate key is assigned and used as a foreign key in all fact tables from the moment of the update until a subsequent change creates a new dimension key and updated dimension row.

A minimum of three additional columns should be added to the dimension row with type 2 changes: 1) row effective date or date/time stamp; 2) row expiration date or date/time stamp; and 3) current row indicator.

KIMBALL GROUP

For this example we will be using a more simplified version of our customer table. This time we will have CustomerSCD2_base as source and CustomerSCD2 as target. They are empty for now.

Before moving forward with the mapping, let's better understand the behavior we expect. Let's use Microsoft Excel to help us envision this.

Let's say we have our dataset like this, with one customer. Source is on top, Target is on the bottom.

starting point

Then we change FRANS's country from ITALY to BRAZIL. Here's what we expect to happen, according to our logic paradigm. Source is on top, Target is on the bottom:

a new update/insert happened

So, once FRANS's country was updated, we added the change as a new row in our target table. We also added an ACTIVE_FLAG column to track which row is active, meaning current, and which rows are old.

To make it more real-world-like we can incorporate a couple more control columns. This is closer to how SCD2 looks in real data warehouses:

the same update/insert but with more control columns

So, besides what we already have on SCD1, we will also need:

  • a way to generate SK_Customer
    • SK_Customer is a surrogate key, which we will use as the true primary key of this table, since CustomerID will have duplicates here
  • a way to generate and insert/update our START_DT and END_DT
  • a way to generate our ACTIVE_FLAG

You can already tell that we will need to re-work and re-create our target table… But we will get to that soon.
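
We will let Powercenter create the table for us in a moment, but just to keep the structure in mind, here is a rough DDL sketch of the new target (datatypes are assumptions on my side; the surrogate key shows up as PM_PRIMARYKEY in the mapping):

CREATE TABLE CustomersSCD2 (
    PM_PRIMARYKEY INT          NOT NULL PRIMARY KEY,  -- surrogate key (SK_Customer), fed by the Sequence Generator
    CustomerID    NVARCHAR(10) NOT NULL,              -- natural key, will repeat across versions
    ContactName   NVARCHAR(60),
    Country       NVARCHAR(30),
    Phone         NVARCHAR(30),
    START_DT      DATETIME,                           -- when this version became current
    END_DT        DATETIME,                           -- 9999-01-01 while current, closed when a change arrives
    ACTIVE_FLAG   INT                                 -- 1 = current version, 0 = history
);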

Implementing SCD type 2 – SCD2 – Manually

informatica powercenter slowly changing dimension type 2 mapping
final complete mapping, scd2

To implement SCD type 2 we will create a mapping and do it manually. Let's go step-by-step and think, at each stage, about what we are doing and why.

Source and Target

First we need to understand our source, and we already have it ready. Here it is:

source

Then we need to have our target ready. For that I created a target definition in Powercenter and used Powercenter to create the table inside my database.

creating the target within Powercenter
the table that Powercenter created within our db
source and target queried results

Source Qualifier and Target

Create a new mapping and drag your source to it, also drag your target.

source and target

Lookup

Create a lookup from your target.

creating the lookup

Drag the comparison key between source and target and create that condition within your lookup transformation. Here we are checking whether CustomerID from the target equals CustomerID1 from the source.

lookup condition

Expression Transformation

Create an expression transformation and drag into it all fields from the source, plus some fields from the lookup.

From source:

  • we will always bring all fields, as they will be needed if we have to insert a new row.

From lookup:

  • we will always bring all fields we want to update, because we need them to compare against the source and check for changes.

all expression transformation fields

Besides the fields from source and lookup, we will also need:

  • INSERT_FLAG, this field returns true every time a row matches our logic to be inserted
  • UPDATE_FLAG, this field returns true every time a row matches our logic to be updated
  • GET_DATE, this field gets the current date. This is necessary to fill START_DT for new records and to update END_DT for the rows being expired
  • END_DT, this field is static and always returns 01/01/9999, as this is used every time we insert a new record
  • ACTIVE_FLAG_1, this field always returns 1. We will use it when inserting new rows to mark them as current
  • ACTIVE_FLAG_0, this field always returns 0. We will use it when updating rows that are no longer current.

INSERT_FLAG code

IIF( ISNULL( LKP_CustomerID1 ), TRUE, FALSE )

UPDATE_FLAG code

IIF( CustomerID = LKP_CustomerID1
AND
(
LKP_ContactName1 != ContactName
OR
LKP_Country1 != Country
OR
LKP_Phone1 != Phone
), TRUE, FALSE )
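
One caveat worth flagging: with Powercenter's default NULL handling, a comparison against a NULL port does not evaluate to TRUE, so the expression above can miss a change from NULL to a value (or vice versa). If your data can hold NULLs, a defensive variant along these lines (just a sketch) converts NULLs to empty strings before comparing:

IIF( CustomerID = LKP_CustomerID1
AND
(
-- compare each attribute with NULLs converted to empty strings
IIF( ISNULL( LKP_ContactName1 ), '', LKP_ContactName1 ) != IIF( ISNULL( ContactName ), '', ContactName )
OR
IIF( ISNULL( LKP_Country1 ), '', LKP_Country1 ) != IIF( ISNULL( Country ), '', Country )
OR
IIF( ISNULL( LKP_Phone1 ), '', LKP_Phone1 ) != IIF( ISNULL( Phone ), '', Phone )
), TRUE, FALSE )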

GET_DATE code is simply SYSDATE.

END_DT code is simply TO_DATE( '01/01/9999', 'MM/DD/YYYY' ).

Router Transformation

router transformation

Let’s understand our inputs from Expression to Router:

CustomerID, ContactName, Country and Phone come from the Source Qualifier. We pass these to our router because we will need them when we INSERT new rows.

We bring LKP_PM_PRIMARYKEY because we need this information: it is the SK (surrogate key) we need when updating rows. We need to know which row will become the old row; that's what LKP_PM_PRIMARYKEY tells us, namely which SK should receive END_DT and ACTIVE_FLAG_0.

  • INSERT_FLAG and UPDATE_FLAG will be used as filters on the router, just like we did on SCD1.
  • GET_DATE returns the current date (SYSDATE); we use it when inserting new rows. This port is also used to fill END_DT for the rows being expired.
  • END_DT is a fixed 01/01/9999 date that we also use for inserted rows.
  • ACTIVE_FLAG_1 is a static 1 that we pass when inserting new rows.
  • ACTIVE_FLAG_0 is a static 0 that we pass when updating rows.

Router’s groups, just like in SCD1:

These router groups act as filters: every time a row matches INSERT_FLAG it is redirected to the router's INSERT ports, and the same logic applies to UPDATE.
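
If you are wondering what the group filter conditions look like, they are simply the flag ports coming from the Expression transformation (they already evaluate to TRUE or FALSE), something like:

INSERT group filter condition:  INSERT_FLAG
UPDATE group filter condition:  UPDATE_FLAG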

Update Strategy | Insert

If a row matches the INSERT_FLAG it is redirected to the INSERT group in our router. Those ports are then connected to our Update Strategy transformation for insert.

We bring all fields that we need:

  • customerID, contactName, country, phone: those fields will bring data from source
  • get_date: will return current datetime (sysdate)
  • end_dt: will return 01/01/9999
  • active_flag_1: will return 1

So, by following this logic a new row will be registered in our target. However, since we are not using CustomerID as the primary key, we need a Sequence Generator for our surrogate key.

The only change I made was to set the Start Value of our Sequence Generator to 1 (the default is 0).

sequence generator

Other important aspects of the Sequence Generator: keep Cycle and Reset unchecked, since it is generating primary key numbers.

Also, make sure your Update Strategy is set to DD_INSERT.

update strategy

Update Strategy | Update Insert

The same logic applies to the Update Insert scenario. The difference here is that we get rows from our Router's UPDATE ports, as we will be inserting new rows for the rows that are changing. This is where the "keep history" magic happens.

Update Strategy | Update Update

In this final step we are just updating the old row, which is now becoming inactive. That's why we are using LKP_PM_PRIMARYKEY here: we know a change happened and we know which key to update.

Be aware that, here, GET_DATE goes to END_DT, as the old row's validity is now over, so END_DT receives the current date.

ACTIVE_FLAG here receives 0 (zero).
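
To recap, these are the Update Strategy expressions for the three branches (DD_INSERT and DD_UPDATE are standard Powercenter Update Strategy constants):

Insert branch (brand-new customer)      ->  DD_INSERT
Update Insert branch (new current row)  ->  DD_INSERT
Update Update branch (expire old row)   ->  DD_UPDATE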

Lookup, again, I know

Before we go to testing, edit your lookup transformation again and make sure that you do a SQL override, with code like this:

lookup edit > properties > sql override
SELECT CustomersSCD2.PM_PRIMARYKEY AS PM_PRIMARYKEY
,CustomersSCD2.CustomerID AS CustomerID
,CustomersSCD2.ContactName AS ContactName
,CustomersSCD2.Country AS Country
,CustomersSCD2.Phone AS Phone
,CustomersSCD2.START_DT AS START_DT
,CustomersSCD2.END_DT AS END_DT
,CustomersSCD2.ACTIVE_FLAG AS ACTIVE_FLAG
FROM CustomersSCD2
WHERE CustomersSCD2.ACTIVE_FLAG = '1'

This will make sure that only the active rows are in the scope of your mapping. Otherwise your mapping won't work.

So this SQL override ensures that you only get the current rows, those with ACTIVE_FLAG equal to 1, as it would not make much sense to compare against old records.

Running Workflow and testing SCD type 2 results

Let's run our SCD2 workflow and see if it works. Remember that your workflow session needs to be Data Driven and that all your mapping connections should point to the right databases.

First scenario: 4 rows on source, 0 rows on target.

Running the workflow for the first time. Notice that 4 rows were inserted.

(the primary key starts at 6 because I had truncated the table before, hehe)

If we run the workflow again, nothing happens. See:

no changes

If we change a customer's country and run our workflow, a history row needs to be created.

source updated, pre workflow execution

Our workflow has now inserted 1 row and updated 1 row.

Here's our history. Now we are tracking every change to our customer attributes.

Here’s another update on the same customer, changing country one more time, this time to Sweden.
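
If you want to double-check the history straight in the database, a simple query like this (using FRANS, the customer we changed) shows every version of that customer in order:

SELECT PM_PRIMARYKEY, CustomerID, Country, START_DT, END_DT, ACTIVE_FLAG
FROM   CustomersSCD2
WHERE  CustomerID = 'FRANS'
ORDER  BY START_DT;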

Conclusion

Hope you guys enjoyed this post; it took a couple of days to put together. Now we can move on knowing that SCD1 and SCD2 are clear to us.

Reference

https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/type-1/

https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/type-2/

https://youtu.be/-oubI0WP_gY – How to Implement SCD TYPE 2 using Informatica PowerCenter

https://youtu.be/8KGCa8ih4AI – Informatica Tutorial | Update Strategy transformation | Slowly Changing Dimension (SCD) Type 1 |

https://youtu.be/RJkCUtm5KTM – Slowly Changing Dimension Informatica interview questions scenario based Informatica tutorial

Informatica Powercenter – Union Transformation

If you are familiar with SQL you already know what UNION does. In fact, you might be wondering: is the Powercenter Union the same as SQL UNION or SQL UNION ALL? It is UNION ALL.
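
Just to make the distinction concrete in SQL terms (Store1Sales and Store2Sales are hypothetical tables):

-- UNION removes duplicate rows from the combined result
SELECT Product, Qty FROM Store1Sales
UNION
SELECT Product, Qty FROM Store2Sales;

-- UNION ALL keeps every row, duplicates included; this is what the Union transformation does
SELECT Product, Qty FROM Store1Sales
UNION ALL
SELECT Product, Qty FROM Store2Sales;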

Here’s the official documentation on Powercenter Union transformation:

The Union transformation is a multiple input group transformation that you use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION ALL statement, the Union transformation does not remove duplicate rows. The Union transformation is an active transformation.

The Data Integration Service processes all input groups in parallel. It concurrently reads sources connected to the Union transformation and pushes blocks of data into the input groups of the transformation. The Union transformation processes the blocks of data based on the order it receives the blocks from the Integration Service.

The Union transformation is developed using the Custom transformation.

Rules and Guidelines for Union Transformation

Here's what the official documentation tells us to use as rules and guidelines when working with a Union transformation:

  • You can create multiple input groups, but only one output group.
  • All input groups and the output group must have matching ports. The precision, datatype, and scale must be identical across all groups.
  • The Union transformation does not remove duplicate rows. To remove duplicate rows, you must add another transformation such as a Router or Filter transformation.
  • The Union transformation does not generate transactions.

My take on all these points is that we need to pay special attention to the second one. If you want to union sources you must ensure precision, scale and datatype match across all ports (by ports, you should already know we are talking about attributes/columns). But the attribute name should matter too, right?! We will see that in our tests.

Creating a Union Transformation

Let's first understand our data. We have sales from 3 different stores. Here is their data:

Three different files, same structure:

Our objective here is to union all the data and load it into our SQL database. We are going to use the Union transformation for this.

Create a mapping, add your sources, add your union transformation, add your target.

This is what we get, but as you can see the Union transformation has a NEWGROUP placeholder. You can't simply drag and drop all sources into the transformation; it won't work.

First, drag one of your sources into the transformation.

By doing that you create the output group, which acts as the pattern for the input groups.

Now edit your transformation and create two new input groups.

After doing that we will have three input groups and the same output group in our UNION transformation. Now just drag and drop each source into its group, and after that connect the output ports to your target.

After creating a workflow and executing it, here's the result: 21 rows, all loaded into the database.

Union removing duplicates

Everything we have done so far replicates the behavior of the UNION ALL statement in SQL, but what if we need to remove duplicates before loading the unioned data? You will need to add a Sorter transformation after your Union and before your target.

Make sure you select all the ports you wish to sort by. Remember that the port order is taken into consideration by Powercenter when sorting your data.

Also, select DISTINCT in properties tab.
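
The Union + Sorter (Distinct) combination is the Powercenter equivalent of wrapping a UNION ALL in a SELECT DISTINCT, roughly like this (hypothetical table and column names again):

SELECT DISTINCT Product, Qty, SELL_PRICE
FROM (
    SELECT Product, Qty, SELL_PRICE FROM Store1Sales
    UNION ALL
    SELECT Product, Qty, SELL_PRICE FROM Store2Sales
    UNION ALL
    SELECT Product, Qty, SELL_PRICE FROM Store3Sales
) AS all_stores;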

Extra: Handling decimals

Did you notice that we lost the decimal part of SELL_PRICE? That's because when I created our sources I forgot to set the precision/scale to keep 2 decimal places. So, what I did: I went to the Source Qualifiers and edited all my sources, edited all transformations and the target, and recreated the target table as well.

edited all source qualifier from all sources
also edited the union transformation
edited the sort transformation as well
on target designer I edited our target and used it to recreate our target table on SQL database
re-ran the workflow, now we have decimals
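
On the database side, the fix boils down to giving SELL_PRICE a datatype that keeps the cents, something like the sketch below (the table name and the other columns are hypothetical; DECIMAL(10,2) is just one reasonable choice, the important part is the scale of 2):

CREATE TABLE StoreSalesUnion (
    Product    NVARCHAR(50),
    Qty        INT,
    SELL_PRICE DECIMAL(10,2)   -- precision 10, scale 2: two decimal places are preserved
);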

References

https://docs.informatica.com/data-integration/powercenter/10-4-0/transformation-guide/union-transformation/union-transformation-overview.html

https://youtu.be/O3s8c9hppNI – Aula 04 – Criando transformações Union e Sorter no PowerCenter (Português BR)