Chapter 03 – Extracting
Integrating all the different data sources is usually one of the biggest challenges the ETL team faces, and without data there is no data warehouse. In this chapter Kimball teaches us how to interface with those various source systems, how to create your logical data map, and how to deal with some common source systems.
In this chapter Kimball also talks about how to deal with changed data capture and deleted record capture, since we now know that our data warehouses are dynamically changing. This subject of change is also revisited in many other chapters, such as the data cleaning chapter, the delivery chapters, and the operations chapter.
Part 1: The Logical Data Map
You need to plan your ETL from beginning to end even before starting your first ETL process. It is important to have a logical map: a clear understanding of which data will be needed to answer business needs and where that data comes from. Once you have that clear and documented, you may start building your physical implementation. So plan first, build later.
The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system.
Designing Logical Before Physical
You already know that you need a plan, but how to create that plan? Here’s how:
- Have a plan. You are the ETL team, so you need to understand what the users need and what kinds of data sources you have, and document it all. This plan is of utmost importance, especially later on, as users and the quality assurance (QA) team will want to validate what is happening with their data, so your plan needs to document everything in order to describe exactly what is done between the ultimate source system and the data warehouse.
- Identify data source candidates. Identify the likely candidate data sources you believe will help answer your users' needs. Within these sources, identify the elements you think will be central to your users. These data elements will be the subjects of the data-profiling step.
- Analyze source systems with a data-profiling tool. In this data-profiling step you check your data source candidates and see whether the data they hold is adequate to fulfill your business needs. It is therefore very important that whoever performs this step has technical knowledge of the data, but also a very good understanding of the business and how this data will be used. Failure in this step can simply compromise the entire data warehouse project. If the data cannot support the business objectives, this is the time to find that out.
- Receive walk-through of data lineage and business rules. Here you should have the data lineage mapped out: what data comes from which data source, which extractions, transformations, and loadings are performed, and what is going to be delivered.
- Receive walk-through of data warehouse data model. Just as the ETL team needs to understand all data lineage, it also needs to understand how all that data will form the physical data model of the data warehouse. This includes understanding dimensional modeling concepts, table-by-table mapping, and how dimensions, facts, and any other tables in the dimensional model work together to deliver the ETL solution. Also, never forget: the ETL system's goal is to deliver data in the most effective way to end user tools.
- Validate calculations and formulas. It is helpful to make sure the calculations are correct before you spend time coding the wrong algorithms in your ETL process, so before creating any ETL process, double-check everything, especially with your end users.
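The data-profiling step above can be sketched as a small Python pass over a candidate source extract. The table, column names, and the 10% NULL threshold here are hypothetical, chosen only to illustrate the idea:

```python
def profile_column(rows, column):
    """Compute simple profiling stats for one column of a source extract."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_pct": 100.0 * (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "max_len": max((len(str(v)) for v in non_null), default=0),
    }

# Hypothetical extract from a candidate source system
customers = [
    {"customer_id": 1, "state": "NY"},
    {"customer_id": 2, "state": None},
    {"customer_id": 3, "state": "CA"},
]
stats = profile_column(customers, "state")

# Example rule: flag the column for review if more than 10% of values are NULL
needs_review = stats["null_pct"] > 10.0
```

A real project would use a dedicated profiling tool, but even a sketch like this surfaces the questions (NULL rates, distinct counts, maximum lengths) that decide whether a candidate source can support the business objectives.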
Inside the Logical Data Map
So, now you know you need a logical data map, but how do you create one? This topic describes the logical data map's structure.
The Logical Data Map contains: the data definition for all the data warehouse sources, the target data warehouse data model, and all details required to manipulate and transform that data from its original format to that of its final destination.
This is the Logical Data Map document:
I (Pedro) consider this spreadsheet, the logical data map, pretty self-explanatory; however, some points are worth noticing:
- SCD (slowly changing dimension) type. For dimensions, this component indicates a Type-1, -2, or -3 slowly changing dimension approach. For example, within the customer dimension, the last name may require Type 2 behavior (retain history), while the first name may require Type 1 (overwrite). These SCD types are developed in detail in Chapter 5.
- Transformation. The exact manipulation required of the source data so it corresponds to the expected format of the target. This component is usually notated in SQL or pseudo-code.
- Table type. First dimensions, then facts.
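One row of the logical data map can be sketched as a record whose fields mirror the components above. The exact field names and the sample values are assumptions for illustration, not a prescribed layout:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogicalMapRow:
    """One row of the logical data map (fields follow the components above)."""
    target_table: str
    target_column: str
    target_datatype: str
    table_type: str           # 'dimension' or 'fact' -- dimensions listed first
    scd_type: Optional[int]   # 1, 2, or 3 for dimension columns; None for facts
    source_database: str
    source_table: str
    source_column: str
    source_datatype: str
    transformation: str       # SQL or pseudo-code; empty string = straight load

row = LogicalMapRow(
    target_table="customer_dim", target_column="last_name",
    target_datatype="varchar(50)", table_type="dimension", scd_type=2,
    source_database="crm", source_table="customers",
    source_column="lname", source_datatype="varchar(50)",
    transformation="",  # blank transformation: straight source-to-target load
)
```

Keeping the map in a structured form like this (rather than free text) makes it easy to validate, for example, that every dimension column has an agreed SCD type.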
Pay very close attention to this document, the logical data map. If you study it closely you will notice that it reveals many hidden requirements for the ETL team that might otherwise have been overlooked.
Take a good look at Figure 3.1.
Notice that the data types between source and target for STATE get converted from 255 to 75 characters, and nothing in the TRANSFORMATION column describes that, meaning this conversion from 255 to 75 characters is made implicitly (an implied conversion) by the ETL tool. So, in the future, if values with more than 75 characters are created, you will probably lose data. Besides that, your ETL can abort or fail the entire process with this kind of data-overflow error. Implied conversions are common and notorious for sneaking up and destroying your processes. Make sure all your conversions are made explicitly.
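Making that conversion explicit could look like the following sketch. The function name and the reject-on-overflow policy are illustrative assumptions; the point is that the 255-to-75 narrowing becomes a documented, testable step instead of something the ETL tool does silently:

```python
def convert_state(value: str, max_len: int = 75) -> str:
    """Explicitly enforce the target column width for STATE instead of
    relying on an implied conversion; reject overflows rather than
    silently truncating the data."""
    if len(value) > max_len:
        raise ValueError(
            f"STATE value of {len(value)} chars exceeds target width {max_len}"
        )
    return value

convert_state("NEW SOUTH WALES")   # fits within 75 chars: passes through unchanged
# convert_state("X" * 80)          # would raise ValueError instead of truncating
```

Whether to reject, truncate with approval, or route the row to a reject file is a business-rule decision; the essential part is that the choice is explicit in the transformation column.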
The transformation column is a very important one. But don't focus on this column only; make sure you review the entire mapping before creating any ETL process. Most times transformations will be expressed using SQL, but this field can also be blank, meaning a straight load from source to target. Transformations can also be plain text describing what to do, such as instructions to preload from a flat file, to base the load transformation on criteria outside of the database, or to reject known data anomalies into a reject file.
Pay close attention to the SCD types on your dimension columns. Be aware of exactly what kind of SCD you really need for each column of your dimensions. Also, make sure your users understand those SCDs and what results to expect in the final data warehouse from using them. Nonetheless, users quite often ask for SCDs to change once they start using the data warehouse; make sure to run those change requests through your data warehouse project manager and the change management process.
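The per-column SCD behavior from the Figure 3.1 example (last name as Type 2, first name as Type 1) can be sketched as follows. This is a simplified illustration, not the full Chapter 5 treatment (no surrogate key generation or effective dates), and the column names are hypothetical:

```python
def apply_scd(dim_row: dict, column: str, new_value: str, scd_type: int) -> list:
    """Apply a change to one dimension column using the agreed SCD type.
    Type 1 overwrites in place; Type 2 expires the old row and adds a
    new current version, retaining history."""
    if scd_type == 1:
        dim_row[column] = new_value
        return [dim_row]
    if scd_type == 2:
        old = dict(dim_row, current_flag="N")
        new = dict(dim_row, **{column: new_value}, current_flag="Y")
        return [old, new]
    raise ValueError(f"unsupported SCD type: {scd_type}")

customer = {"customer_key": 101, "first_name": "Jon", "last_name": "Smith",
            "current_flag": "Y"}

# Last name uses Type 2 (retain history): one expired row plus one current row
rows = apply_scd(dict(customer), "last_name", "Smyth", scd_type=2)
```

This is exactly why the SCD type belongs in the logical data map: the same incoming change produces very different target rows depending on the agreed type.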
Building the Logical Data Map
A key point of a data warehouse's success is the cleanliness and cohesiveness of the data within it, and to achieve that we need a solid understanding of all the source systems that feed the data warehouse. The complete logical data mapping cannot exist until all source systems have been identified and analyzed. The result of such analysis is usually broken into two major phases:
- The data discovery phase
- The anomaly detection phase
Data Discovery Phase
By this time you already know what your target should look like, meaning how your final dimensional model should be delivered, so now what you have to do is find all the correct source systems. The ETL team is responsible for finding the correct, true source. Thorough analysis can alleviate weeks of delays caused by developing the ETL process against the wrong source.
Collecting and Documenting Source Systems
Source systems are scattered around your company, and it is the ETL team's responsibility to find them. Clues to source systems can be found in pieces of documentation, such as interview notes, reports, and the data modeler's logical data mapping. Work with the system and business analysts to track down all source systems. It is the ETL team's responsibility to keep track of the systems discovered and investigate their usefulness as a data warehouse source.
Keeping Track of the Source Systems
Now that you have gone through all this effort of finding and assessing your source systems, a good thing to do is to document them.
The chart or spreadsheet above is called the source system tracking report. Maintaining this list should be a collaborative effort between the ETL team and the data modeling team. If for some reason an entry becomes unusable, keep it; you might need it for future reference.
The source system tracking report can also indicate how long your project will take, or how many source systems you will tackle at once. Fields of notice:
- Subject area. Typically the name of the data mart that this system feeds.
- Business name. How the source system is commonly referred to by the users.
- Priority. A ranking used to determine future phases. The priority is usually set after the data warehouse bus matrix has been completed.
Determining the System-of-Record
First, know the definition of the system of record: it is the originating source of the data.
This is very important for the ETL team to know, because all the source systems you ingest should be systems of record. This is particularly important since most enterprises store data redundantly across many different systems. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in many versions of the same data. In pretty much all cases, data at the end of the lineage will not be the same as the data in the system of record.
Having the system-of-record data is especially important for building conformed dimensions in your data warehouse; without it, conforming dimensions will be nearly impossible. The further downstream you go from the originating data source, the more you increase the risk of extracting corrupt data.
Note: Derived Data. Should the ETL process accept calculated columns in the source system as the system-of-record, or are the base elements, the foundation of the derived data, desired?
The answer is: it depends. If you have additive calculated columns, go for it; but if your calculated columns are nonadditive measures that cannot be combined in queries by users, then you will need to leave those calculations to be performed by users in the front room, or to create and handle these calculated columns yourself. Of course, if you create those nonadditive calculated columns yourself, it falls under your responsibility to keep them up to date with the users' business rules.
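The additive-versus-nonadditive distinction can be shown with a small numeric example. The order lines and the margin column are hypothetical; the point is that the stored ratio cannot simply be summed or averaged, while the stored amount can:

```python
# Two order lines from a hypothetical source system
lines = [
    {"qty": 2, "unit_price": 10.0, "extended_amount": 20.0, "margin_pct": 50.0},
    {"qty": 3, "unit_price": 20.0, "extended_amount": 60.0, "margin_pct": 25.0},
]

# Additive derived column: summing the stored values is safe
total_amount = sum(l["extended_amount"] for l in lines)   # 80.0

# Nonadditive derived column: averaging the stored ratio is misleading
naive_margin = sum(l["margin_pct"] for l in lines) / len(lines)   # 37.5

# The correct value must be recomputed from the additive base elements:
# weight each line's margin by its amount before dividing by the total
correct_margin = 100.0 * sum(l["extended_amount"] * l["margin_pct"] / 100.0
                             for l in lines) / total_amount      # 31.25
```

This is why a nonadditive calculated column accepted straight from the source cannot be safely rolled up by users; it must be recomputed from its base elements at the grain of the query.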
Analyzing the Source System: Using Findings from Data Profiling
Now that you know all your system-of-record source systems, it is time to analyze them and understand what kind of data and model they have.
A good way to achieve that understanding is to check those source systems' entity-relationship (ER) diagrams. If you cannot find an ER diagram, you might try to reverse engineer the database and see if you can generate one. At this point, use every tool you have at your disposal to understand your source data; another good thing to do is to talk to a system guru, data analyst, or business analyst who knows the system well.
Here are some characteristics you need to keep an eye out for when investigating your ER diagrams:
- Unique identifiers and natural keys. Identifying unique identifiers is not that hard; they are usually the primary key. However, some other very important information is required, especially if you are dealing with slowly changing dimensions: the natural key. The natural key is what the business uses to uniquely describe the row.
- Data types. Do not let column names fool you. An order_number column may actually be a varchar, and the same often happens with date or datetime columns: they very often come as strings instead of a date type, so make sure to double-check all your data types. Never forget leading zeros either, on strings or integers; they might be important. Another important point is truncating data, even numbers: make sure you do not truncate, or if you do, that you have the users' approval.
- Relationships between tables. Understanding how tables are related is vital to ensuring accuracy in joins while retrieving data. If you are lucky, your ER diagram will present the relationship lines to you; if not, it is good practice to gather all the source system's heterogeneous sources and map out the relationships.
- Discrete relationships. You may find some lookup tables in your model. Carefully document the name of each group of rows and the associated tables and columns. This information will be needed while mapping many of the dimensions.
- Cardinality of relationships and columns. Knowing the cardinality of relationships is necessary to predict the result of your queries. In relational databases, all associated tables will be joined in one of the following ways:
- One-to-one. Used on super-type / sub-type scenarios and the practice of vertical table partitioning. One-to-one relationships can be identified by observing that the relationship is on the primary key of each table.
- One-to-many. This is the most common type of relationship found. It is easily identified by noting that a non-key attribute in a table refers to the primary key of another table.
- Many-to-many. This relationship usually involves three tables with two one-to-many relationships between them. More specifically, there are two tables with an associative table between them. The center or associative table has a compound primary key and two foreign keys, one to the primary key of one table and another to the primary key of the other table.
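The many-to-many pattern described above (two tables with an associative table between them) can be sketched with in-memory data. The tables and names here are hypothetical, standing in for whatever the source system's ER diagram reveals:

```python
# Hypothetical many-to-many: students <-> courses via an associative table
students = {1: "Ana", 2: "Ben"}
courses = {10: "SQL", 20: "ETL"}

# The associative table has a compound primary key (student_id, course_id)
# made of two foreign keys, one into each of the related tables
enrollment = [(1, 10), (1, 20), (2, 10)]

# Resolving the relationship means joining through the associative table
pairs = [(students[s], courses[c]) for s, c in enrollment]
# pairs == [("Ana", "SQL"), ("Ana", "ETL"), ("Ben", "SQL")]
```

Recognizing this three-table shape in the ER diagram matters for extraction: joining the two outer tables directly, without the associative table, would either be impossible or would produce the wrong cardinality.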
Data Content Analysis
You already know your ER diagrams and have all the correct sources to work with. Now it's time to dig deeper and analyze the data content. What is inside each table? What is the data quality of each column? Pay close attention to anomalies in the data that can compromise your data warehouse, such as:
- NULL values. First, know this: in a relational database, NULL is not equal to NULL. Joining two or more tables based on a column that contains NULL values will cause data loss! So make sure you check and deal with every NULL value you may find in your source, and try, as much as possible, to get rid of every NULL value. The ETL normally replaces those NULL values with default values.
- Dates in nondate fields. Dates can be found in many different formats. Make sure you check all your data columns, understand their pattern and conform each and every one of them, before loading into the data warehouse.
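The NULL-join data loss described above, and the usual fix of substituting a default key, can be sketched like this. The tables and the -1 "Unknown" key are illustrative assumptions:

```python
orders = [
    {"order_id": 1, "customer_id": 100},
    {"order_id": 2, "customer_id": None},   # NULL foreign key in the source
]
customers = {100: "Acme Corp"}

# An inner join on customer_id silently drops the NULL-keyed order
joined = [(o["order_id"], customers[o["customer_id"]])
          for o in orders if o["customer_id"] in customers]
# joined == [(1, "Acme Corp")] -- order 2 was lost

# The usual ETL fix: replace NULLs with a default key pointing at a
# dedicated "Unknown" row in the dimension, so no fact rows are lost
customers[-1] = "Unknown"
for o in orders:
    if o["customer_id"] is None:
        o["customer_id"] = -1
joined_safe = [(o["order_id"], customers[o["customer_id"]]) for o in orders]
# joined_safe == [(1, "Acme Corp"), (2, "Unknown")]
```

In SQL the same loss happens because `NULL = NULL` evaluates to unknown, so NULL-keyed rows never satisfy the join predicate; the default-key substitution makes the loss visible and countable instead of silent.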
Collecting Business Rules in the ETL Process
ETL business rules are more technical, more complex, and more detailed than the data modeler's. So make sure you document them properly. See in the example below the difference between the data modeler's business rules and the ETL team's:
Status Code— Data Model team: The status is a four-digit code that uniquely identifies the status of the product. The code has a short description, usually one word, and a long description, usually one sentence.
Status Code— ETL team: The status is a four-digit code. However, there are legacy codes that were only three digits that are still being used in some cases. All three-digit codes must be converted to their four-digit equivalent code. The name of the code may have the word OBSOLETE embedded in the name. OBSOLETE needs to be removed from the name and these obsolete codes must have an obsolete flag set to ‘Y’. The description should always be in sentence case, regardless of the case used when entered into the source system.
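The ETL team's status-code rules above are concrete enough to implement directly. A sketch in Python follows; the legacy-code mapping table is a hypothetical stand-in for whatever conversion the business actually specifies:

```python
# Hypothetical mapping from legacy three-digit codes to four-digit equivalents
LEGACY_CODE_MAP = {"101": "0101", "205": "0205"}

def clean_status(code: str, name: str, description: str) -> dict:
    """Apply the ETL business rules for the status code described above."""
    if len(code) == 3:                       # legacy code: convert to 4 digits
        code = LEGACY_CODE_MAP.get(code, code.zfill(4))
    obsolete = "OBSOLETE" in name            # detect and strip the marker word
    name = name.replace("OBSOLETE", "").strip()
    description = description.capitalize()   # enforce sentence case
    return {"code": code, "name": name,
            "obsolete_flag": "Y" if obsolete else "N",
            "description": description}

clean_status("101", "OBSOLETE Backorder", "ITEM IS ON BACKORDER.")
# -> {'code': '0101', 'name': 'Backorder', 'obsolete_flag': 'Y',
#     'description': 'Item is on backorder.'}
```

Note how each clause of the written rule maps to one line of code; this traceability is exactly why the rules must be documented, approved, and signed off by the business before coding.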
Even though these rules are of a technical nature, the rule makers are still the business. The ETL team cannot make up rules; it is up to the ETL architect to translate user requirements into usable ETL definitions. Also, as you discover undocumented data anomalies, make sure to share them with your users so new rules on how to deal with each anomaly can be set, properly documented, approved, and signed off.
Kimball does not state where to store those business rules. Some ideas I have: store them in your ETL tool as metadata, or store them in your logical data map as a new column, "business rules".
Integrating Heterogeneous Data Sources
Until now, everything we have talked about refers to source data, meaning going to the data source and getting it. Beyond that, however, we also need to integrate data. By integrating data we mean conforming it, so we are talking about conforming dimensional attributes and facts. You cannot have the same attribute with two different meanings, or two different labels, which would lead your users into errors or misinterpretation of the data. The same goes for facts: you cannot have the same KPI calculated differently.
Failing to conform will mean the failure of your data warehouse project. In the data warehouse, conformed dimensions are the cohesive design that unifies disparate data systems scattered throughout the enterprise. Nonetheless, the process of conforming dimensions and facts involves technical and political issues, as you will need to align with your end users on which data, or which meaning, should prevail over the others.
Important note: when you have dimensions populated by several distinct sources, it is important to include the unique identifier from each of those sources in the target dimension in the data warehouse. Those identifiers should be viewable by end users, for peace of mind that the dimension reflects data they can tie back to their transaction systems.
Chapter 5 discusses loading dimensions in more detail, but here are some key techniques for loading conformed dimensions in a disparate source system environment:
- Identify the source systems. We have talked about this already, so make sure you have all your good sources (systems of record) mapped out.
- Understand the source systems (data profiling). We have talked about this already as well, so make sure you run your data-profiling processes on each and every source you identified as required for your data warehouse project.
- Create record matching logic. Now that you know your sources and what kind of data and metadata they hold, it is time to match what goes with what. Here you have to determine which table, which data, talks to which: you have to be able to say "I can join this with that; I can do this other join here and there," and so on. But beware: joining can sometimes be simple, but when you are dealing with heterogeneous data sources you will most likely need to infer how data should be joined (fuzzy relationship logic).
- Establish survivorship rules. This means that if you have a customer table in your accounts receivable, production control, and sales systems, the business must decide which system has overriding power when attributes overlap.
- Establish nonkey attribute business rules. You will probably find attributes that you need to use because your users demand them, but that do not come from the system of record; those attributes will nevertheless be used to enhance your final conformed dimension. When this happens, make sure you assign business rules to those attributes. Documentation and publication of the data lineage metadata is crucial to prevent doubt in the integrity of the data warehouse when users don't see what they expect to see.
- Load conformed dimension. The final task of the data-integration process is to physically load the conformed dimension. This step is where you consider the slowly changing dimension (SCD) type and update late-arriving data as necessary. Consult Chapter 5 for details on loading your conformed dimensions.
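The record-matching and survivorship steps above can be sketched together. The three systems, the tax_id match key, and the per-attribute precedence lists are hypothetical; in practice the business decides the precedence and matching may need fuzzy logic rather than an exact key:

```python
# Hypothetical rows for the same real-world customer in three systems,
# already matched on a shared natural key (tax_id)
sources = {
    "accounts_receivable": {"tax_id": "12-345", "name": "ACME CORP",        "phone": None},
    "production_control":  {"tax_id": "12-345", "name": "Acme Corporation", "phone": None},
    "sales":               {"tax_id": "12-345", "name": "Acme",             "phone": "555-0100"},
}

# Survivorship rules set by the business: which system wins, per attribute
SURVIVORSHIP = {"name":  ["production_control", "sales", "accounts_receivable"],
                "phone": ["sales", "accounts_receivable", "production_control"]}

def conform(records: dict, rules: dict) -> dict:
    """Build one conformed dimension row, taking each attribute from the
    first system in its precedence list that has a non-NULL value."""
    row = {"tax_id": next(iter(records.values()))["tax_id"]}
    for attr, precedence in rules.items():
        row[attr] = next((records[s][attr] for s in precedence
                          if records[s][attr] is not None), None)
    return row

conformed = conform(sources, SURVIVORSHIP)
# conformed == {'tax_id': '12-345', 'name': 'Acme Corporation', 'phone': '555-0100'}
```

Note that the source-system identifiers survive the process (here via tax_id; a real design would carry each system's unique identifier too), which is what lets end users tie the conformed row back to their transaction systems.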
Ralph Kimball and Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.