Chapter 02 – ETL Data Structures
In this chapter Kimball starts once again by talking about staging data, which in this context means writing to disk. He also reminds us that the authors recommend four staging points, one after each major step of the ETL process (ECCD: Extract, Clean, Conform, Deliver). This chapter focuses on general design principles for the staging area, plus the planning and design standards needed to set up your staging areas. Kimball also says that metadata plays a very important part in supporting the whole staging process.
To Stage or Not to Stage
Ultimately it is the project team's choice whether data will be staged or not. Data can be staged to files or to tables, or it can be handled in memory only and not staged at all. It is always important to weigh I/O cost against in-memory processing, since performance in retrieving and accessing data is usually of high importance to any ETL project.
There are two major points to consider when deciding whether or not to stage:
- Getting data from source as fast as possible
- Recover from failure without restarting the ETL from the starting point
Consider the following reasons for staging data before it is loaded into the data warehouse:
- Recoverability; it's good practice to stage data as soon as it has been extracted from the source and then again after each major transformation step. By implementing those recovery points, your ETL process won't have to intrude on the source again if the ETL fails. When staging data purely for recovery purposes, the data should be stored as sequential files on the file system rather than in a database. Staging data for recoverability is especially important when source systems overwrite their own data.
- Backup; sometimes it is easier and cheaper to back up your staging points than your entire data warehouse. Save the files from those recovery points in compressed archives, and if you ever need to reload the data warehouse you can simply uncompress the load files and reload them.
- Auditing; many times the lineage between source and target is lost in ETL code, and auditing an ETL process is very important if you want proof that your data warehouse holds the correct data after all the manipulation and transformation. A great way to support such an audit is to keep multiple staging points saved, so auditors can simply compare the original input data and the logical transformation rules against the output file.
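As a minimal sketch of the recoverability and backup points above, the snippet below stages an extract to a compressed flat file and reads it back. The function names and the CSV-over-gzip layout are my own illustrative choices, not something prescribed by the book:

```python
import csv
import gzip

def stage_extract(rows, header, path):
    """Write an extract to a compressed flat file so the ETL can
    restart from this point without hitting the source again."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def reload_staged(path):
    """Read a staged file back, e.g. after a downstream failure
    or for a later data warehouse reload."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, [tuple(row) for row in reader]
```

Because the staged copy is both the recovery point and the backup, the same file serves auditors as the "original input" side of a before/after comparison.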
Make sure you plan your staging architecture well; it is not uncommon for staging areas to accumulate more data than the data warehouse itself.
Designing the Staging Area
To design your staging area, you first need to understand that data can be preserved in it (a persistent staging area), deleted after each run (a transient staging area), or a mix of both. Keep that in mind when designing your staging area.
Also, in the grand scheme of things, it is important to have a policy in your company that grants the ETL team sole ownership over all the staging areas. Therefore, be aware that:
- Data in the staging area should be owned by the ETL team. Do not grant access to data in the staging area to people outside the ETL team. The data in the staging area is not ready to be used by a user, nor is it indexed, nor does it fall under any SLA.
- Users are not allowed in the staging area for any reason. Users will most likely misuse the data in the staging area and reduce the perceived integrity of the data warehouse.
- Reports cannot access data from the staging area. In the staging area, tables and files can be created, dropped, and reprocessed, and changes can happen without any need to notify users, which would break reports pointing to whatever data they might be using in the staging area.
- Only the ETL process can write to and read from the staging area. This is a big one. If you ever need information that doesn't have a source, like an Excel domain file, for instance, you need to make that Excel file come from a source. You should never manually input that table into your staging area.
Since the staging area is owned by the ETL team, it is also the ETL team's responsibility to supply the DBA team and OS administrators with an overall data storage measure for the staging area, so they can estimate the space allocations and parameter settings for the staging database, file systems, and directory structures, as shown in the next image.
As seen in the figure above, the details shown are for tables within a database; however, you can also use this volumetric worksheet for file-system planning.
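The volumetric worksheet boils down to simple arithmetic. The sketch below is a hypothetical version of that calculation; the column names (row bytes, initial rows, rows per load) are my assumptions about what such a worksheet tracks, not the book's exact layout:

```python
def estimate_table_size(row_bytes, initial_rows, rows_per_load,
                        loads_per_month, months=12):
    """Rough volumetric estimate for one staging table:
    initial size plus expected growth over the planning
    horizon, returned in bytes."""
    total_rows = initial_rows + rows_per_load * loads_per_month * months
    return total_rows * row_bytes

# Example: 200-byte rows, 1M initial rows, 50k rows per daily
# load, ~22 business-day loads per month, one-year horizon.
size_bytes = estimate_table_size(200, 1_000_000, 50_000, 22)
print(size_bytes / 1024 ** 3)  # approximate size in GiB
```

Summing these estimates across all staging tables gives the DBA team the overall number they need for space allocation.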
Data Structures in the ETL System
This section describes important types of data structures we are likely to need in our ETL system.
In many cases, handling data in your staging area can be done much faster using flat files. A DBMS incurs overhead to maintain metadata about the data being processed, which in many cases is not necessary in a staging environment.
Kimball says (sic): "It has been the conventional wisdom that sorting, merging, deleting, replacing, and many other data-migration functions are much faster when they are performed outside the DBMS."
The use of flat files can benefit the most in the following scenarios:
- Staging source data for safekeeping and recovery. We talked about this already, but the idea is: you get your data from the source and save it in a flat file, so if you ever need to recover from failure, or want the extraction for future reference, it is sitting there in a flat file.
- Sorting data. Presorting optimizes performance, so carefully benchmark your biggest sorting tasks both in your DBMS and with a dedicated sort package on your flat files. Most likely flat files will outperform the DBMS on this task, and since much of your ETL revolves around integrating disparate data, merging data efficiently is a top priority that requires intensive sorting.
- Filtering. Filtering flat files is far more efficient than inserting all data into a database table, indexing the table, and then applying a WHERE clause to delete or select the data into another table. It is often faster to keep the data in a flat file and grep (the Unix filtering utility) only the rows you need.
- Replacing/substituting text strings. Substring searching and replacing data in a database can require nested scalar functions and UPDATE statements. This type of sequential file processing is much faster at the file-system level than within a database.
- Referencing source data. Instead of querying the same table in the source system over and over, it’s more efficient to extract the reference table and stage it in the staging area once. From there, you can look up data with your ETL tool. Most ETL tools can read a look-up table into memory and store it there for the life of the process. Moreover, utilizing staged reference data keeps the queries that hit the source system simpler and more efficient because many of the table joins can be omitted.
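The sort-and-filter operations described above can be sketched as plain streaming functions over delimited lines, mimicking what `sort` and `grep` do at the file-system level. This is an illustrative sketch, not a replacement for a dedicated sort package:

```python
def filter_rows(lines, predicate):
    """grep-style filtering: stream rows and keep only the
    matches, with no table load, no index, no WHERE clause."""
    return [line for line in lines if predicate(line)]

def sort_rows(lines, key_col, sep="|"):
    """Presort a delimited flat file on one column, e.g. to
    prepare two extracts for an efficient merge."""
    return sorted(lines, key=lambda line: line.split(sep)[key_col])

lines = ["3|gadget", "1|widget", "2|sprocket"]
sorted_lines = sort_rows(lines, 0)
widgets = filter_rows(lines, lambda line: "widget" in line)
```

For very large files a real external sort utility would be used instead, but the pipeline shape (presort, then merge or filter) is the same.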
You can also store your staging data in Relational Tables in a DBMS. Here are some of the advantages of using relational databases as staging area:
- Apparent metadata. One of the drawbacks of using flat files is the lack of apparent metadata. By storing data in a relational database, the DBMS maintains the metadata automatically. Information such as column names, data types and lengths, and cardinality is inherent to the database system.
- Relational abilities. If you are receiving data from nonrelational systems, it might make sense to stage it in a normalized model before transforming it into a dimensional model.
- Open repository. Data is more easily accessed and assessed in a SQL-compliant environment, making QA and auditing easier.
- DBA support. When data is staged in a database, the DBA team can relieve the ETL team of responsibilities such as space allocation, backup and recovery, archiving, and security, all tasks that the ETL team must coordinate itself when the staging data is not in a DBMS.
- SQL interface. It is easier to handle data with SQL; that is essentially what Kimball means by the SQL interface. (sic) "Besides enforcing referential integrity, having the ability to use native SQL is the primary reason to store staging data in a database environment."
Independent DBMS Working Tables
Independent DBMS working tables are, at least for me (Pedro), the most common scenario. Independent tables are orphans in the DBMS, meaning they have no relationships whatsoever. Most of the time, the reason you create a staging table is to set the data down so you can manipulate it again using SQL. Even though independent tables are not necessarily normalized, they must not be treated like dump files. Each field of an independent file or table must have a purpose and a logical definition, and for database tables a proper index plan must be established and implemented on all independent tables.
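A minimal sketch of such an independent working table, using an in-memory SQLite database. The table and column names are hypothetical; the point is that the table has no foreign keys, yet every column has a defined purpose and the lookup column is deliberately indexed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Independent working table: no relationships to other tables,
# but not a dump file either: each column has a logical definition.
conn.execute("""
    CREATE TABLE stg_customer_scrub (
        customer_nk   TEXT NOT NULL,   -- natural key from the source
        customer_name TEXT,            -- name as received, pre-cleansing
        load_batch_id INTEGER NOT NULL -- which ETL run loaded this row
    )
""")
# The index plan: we will look rows up by natural key, so index it.
conn.execute(
    "CREATE INDEX ix_stg_customer_nk ON stg_customer_scrub (customer_nk)"
)
conn.execute("INSERT INTO stg_customer_scrub VALUES ('C001', 'Acme', 1)")
row = conn.execute(
    "SELECT customer_name FROM stg_customer_scrub WHERE customer_nk = 'C001'"
).fetchone()
```

The same discipline (documented columns plus an explicit index plan) applies whatever DBMS the staging area actually runs on.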
Third Normal Form Entity/Relation Models
Don’t assume that the data-staging area must be normalized. Remember two of the goals for designing your ETL processes we describe at the beginning of this chapter: Make them fast and make them recoverable. Consider modeling decisions on a case-by-case basis, normalizing entities only as needed.
Nonrelational Data Sources
First, what are nonrelational data sources? Any sources of data that are not related to each other are nonrelational data sources: your ERP data source and your Excel file data source, for example, are not related to each other, meaning they do not enforce referential integrity between them. That is why it is of utmost importance to run all data integrity checks in your ETL process.
Sometimes people create special staging points to integrate all nonrelational data; after all, your data-integration tasks are easier to perform once all the data is under one roof (in a DBMS, for example).
(sic) "Bringing all of the disparate data sources into a single DBMS is a common practice, but is it really necessary? The power of ETL tools in handling heterogeneous data minimizes the need to store all of the necessary data in a single database." Figure 2.2 illustrates how an ETL tool can integrate many heterogeneous data sources right from their native data stores, integrating the data on the fly to migrate it to the data warehouse.
What happens is that integrity errors need to be handled by the ETL process automatically. You need to establish business rules for different data-quality failure scenarios and implement them in the ETL process. When erroneous data passes through the process, you might want to transform the data on the fly; load the data as is; load the data with an associated code or description describing its condition; or, if the data is unacceptable, reject it completely and put it into a reject file for investigation. Hand-coded logic in the ETL process is almost always required to successfully integrate nonrelational data sources.
Note: use rejection files carefully. Any rows in the rejection file need to be dealt with BEFORE the next ETL cycle; failing to do so will leave your data warehouse out of sync with your production system.
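The four dispositions described above (transform on the fly, load as is, load with a condition code, or reject) can be sketched as a rule-dispatch function. The rules themselves (`negative_quantity`, `missing_region`) are hypothetical examples for a sales feed, not rules from the book:

```python
def route_row(row, rules):
    """Apply data-quality rules to one incoming row. Returns
    (disposition, row); disposition is 'load', 'load_flagged',
    or 'reject'. The first rule that fires decides."""
    for rule in rules:
        outcome = rule(row)
        if outcome is not None:
            return outcome
    return ("load", row)          # clean row: load as is

def negative_quantity(row):
    """Unacceptable data: send to the reject file for investigation."""
    if row["qty"] < 0:
        return ("reject", row)

def missing_region(row):
    """Fixable data: transform on the fly and tag the condition."""
    if not row.get("region"):
        return ("load_flagged", dict(row, region="UNKNOWN"))
```

Rows routed to `reject` would be written to the rejection file, which, per the note above, must be emptied before the next ETL cycle.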
Dimensional Data Models: The Handoff from the Back Room to the Front Room
Dimensional data models sit between the back room and the front room. If you are reading this, you probably already know the power of dimensional modeling, but if you are new to it, know that dimensional models are by far the most popular data structures for end user querying and analysis. They are simple to create, they are extremely stable in the presence of changing data environments, they are intuitively understandable by end users, and they are the fastest data structure for general relational database querying. Dimensional models are also the foundation for constructing all forms of OLAP cubes, since an OLAP cube is really just a high-performance dimensional model implemented on special-purpose software.
For more detail on data models, please see Kimball’s Data Warehouse Toolkit book.
Dimensional models are built around measurement processes. Measures are mostly numeric, and most measures repeat over time, creating a time series.
A single measure creates a single fact table record; each record in a fact table represents exactly one fact. Besides storing the fact itself, we also store everything that gives that fact context; we call these dimensions. While we would like to store those dimension attributes verbosely, the best practice is to normalize them out of the fact table, thereby creating dimension tables.
The grain of a fact table is the definition of what constitutes a unique fact table record. The grain is declared as a business term, not as a database term, though it can be formally stated in terms of the key of the fact table. This key declaration will include some, but usually not all, of the foreign key references to the dimensions attached to the fact table.
The figure above shows several dimension tables related to a fact table.
Dimension tables hold the verbose descriptions that describe entities and are used to contextualize a fact. A single dimension table can hold many dimensional attributes; a Customer dimension, for example, can store customer name, age, gender, email, and so on.
Dimensional attributes are mostly textual or are numbers that take on discrete values. Dimension tables should always be built with a single primary key field that is a simple meaningless integer assigned by the ETL process. These keys are called surrogate keys. The many advantages of surrogate keys are described in the other Toolkit books. In this book Kimball describes how surrogate keys are created and administered in a number of important situations.
The primary surrogate keys in each dimension are paired with corresponding foreign keys in the fact table. When this primary-to-foreign key relationship is adhered to, we say that the tables obey referential integrity. Referential integrity is a constant requirement in all dimensional models. Failure to maintain referential integrity means that some fact table records are orphans that cannot be retrieved through constraints on the affected dimensions.
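The orphan problem described above can be made concrete with a small referential-integrity check: scan the fact records and flag any whose dimension foreign key has no matching surrogate key. The field name `customer_key` is a hypothetical example:

```python
def find_orphan_facts(fact_rows, dim_surrogate_keys):
    """Return fact records whose dimension foreign key has no
    matching surrogate key in the dimension table. Such orphans
    cannot be retrieved through constraints on that dimension."""
    valid = set(dim_surrogate_keys)
    return [f for f in fact_rows if f["customer_key"] not in valid]

facts = [
    {"customer_key": 1,  "amount": 10.0},
    {"customer_key": 99, "amount": 5.0},   # no such dimension row
]
orphans = find_orphan_facts(facts, [1, 2, 3])
```

In a real warehouse this check would run as part of the load (or be enforced by the DBMS itself), but the logic is exactly this set-membership test.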
Atomic and Aggregate Fact Tables
Often, business users don’t want to analyze transaction-level facts because the cardinality of each dimension is so extensive that any atomic-level report would be pages long—making it humanly impossible to examine. However, you need to store the atomic-level facts to produce the periodic snapshot facts required by the users.
It’s good practice to partition fact tables stored in the staging area because its resulting aggregates will most likely be based on a specific period— perhaps monthly or quarterly. Creating partitions alleviates the database from scanning the entire table and enables it to go directly to the subsection that contains the data it needs for that period to create the aggregate.
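The atomic-to-aggregate rollup above can be sketched as a grouping pass over the atomic facts, keyed by the same period (here month) that the staging partitions use. The record layout is a hypothetical example:

```python
from collections import defaultdict

def monthly_aggregate(atomic_facts):
    """Roll atomic transaction facts up to monthly totals. With
    date-partitioned staging, each month's partition can be
    scanned in isolation to produce its aggregate rows, instead
    of scanning the whole fact table."""
    totals = defaultdict(float)
    for fact in atomic_facts:
        month = fact["date"][:7]   # 'YYYY-MM', the partition key
        totals[month] += fact["amount"]
    return dict(totals)

facts = [
    {"date": "2004-01-15", "amount": 10.0},
    {"date": "2004-01-20", "amount": 5.0},
    {"date": "2004-02-03", "amount": 7.5},
]
```

The atomic rows stay in staging so that differently grained snapshots can be rebuilt later without going back to the source.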
You may implement a hybrid structure where you have the large atomic data layer in a dimensional RDBMS schema, with progressively more aggregated structures above the atomic layer in the form of OLAP cubes. Some of the OLAP systems can drill down through the OLAP cubes and access the lowest-level atomic data in a single application.
Surrogate Key Mapping Tables
Surrogate key mapping tables are designed to map natural keys from the disparate source systems to their master data warehouse surrogate key. Mapping tables contain only the most current value of a surrogate key— used to populate a dimension—and the natural key from the source system. Since the same dimension can have many sources, a mapping table contains a natural key column for each of its sources.
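A minimal sketch of such a mapping table, keeping only the current surrogate key per natural key and allowing a separate natural-key entry per source system. Matching the same real-world entity across different sources would need additional logic not shown here:

```python
class KeyMap:
    """Surrogate key mapping table: (source system, natural key)
    resolves to the single current warehouse surrogate key."""
    def __init__(self):
        self._map = {}        # (source, natural_key) -> surrogate
        self._next_key = 1    # meaningless integers, ETL-assigned

    def resolve(self, source, natural_key):
        """Return the surrogate key for a natural key, assigning
        a fresh surrogate the first time the key is seen."""
        k = (source, natural_key)
        if k not in self._map:
            self._map[k] = self._next_key
            self._next_key += 1
        return self._map[k]
```

During the fact load, every incoming natural key passes through `resolve`, so fact rows always carry warehouse surrogate keys, never source keys.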
Planning and Design Standards
The data-staging area must be managed and maintained as much as, if not more than, any other database in your environment. The data-staging area must be a controlled environment.
Only the data architect should be able to design or modify a table in the staging area. All physical changes must be owned by the DBA responsible for the database. Also, if any developer needs a table, a strong chance exists that it can be used by another developer. As a rule of thumb: If you build a table, expect that it will be used by someone else for reasons other than you originally intended.
We often want, or are asked by users, to change objects (tables or columns) in our staging area, so be especially aware that any change to data-staging objects can break processes that are crucial to the data warehouse itself. Allowing ad-hoc changes to data-staging objects can destroy your entire data warehouse project.
So, if there is no other way and a change must be made to your data-staging objects, impact analysis has to be performed first. Impact analysis examines the metadata associated with an object to determine what a change will affect in the structures and processes that depend on it. You must perform impact analysis before any change is made to a data-staging object.
Communication among the ETL project manager, source system DBA, and data warehouse modeling team is crucial to ensure that appropriate impact analysis is performed.
Several metadata elements associated with the data staging area are valuable to the data warehouse and must be presented to its end users. Types of metadata derived by the staging area include the following:
- Data lineage. Data lineage, also known as the logical data mapping, illustrates transformations applied to a data element between its original data source and its ultimate target in the data warehouse.
- Business definitions. Every table created in the data-staging area stems from a business definition.
- Technical definitions. Without proper documentation of the technical definitions of your staging tables, the tables might be recreated over and over, causing duplication of efforts and data explosion in the data-staging area. Technical definitions describe the physical attributes of data elements, including the structure, format, and location.
- Process metadata. Processes that load data-staging tables must record their statistics along with the statistics of the data warehouse table loads. Although information regarding staging table loads need not be presented to end users, the ETL team must know exactly how many records were loaded into each staging table, with success and failure statistics for each process. A measure of data freshness is useful both for ETL administrators and end users.
Make sure you capture all the metadata needed to understand what is going on in your ETL system and data warehouse: information such as the number of rows inserted, updated, deleted, and rejected should be available for each process, along with the process start time, end time, and duration.
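The process-metadata capture described above can be sketched as a small wrapper that times each ETL step and records its row counts. The step/counter names are hypothetical; a real system would persist the log to a metadata table rather than a list:

```python
import time

def run_step(name, step, log):
    """Run one ETL step and record its process metadata: start
    and end time, duration, and row counts by outcome. The step
    callable is expected to return its own row counts as a dict."""
    start = time.time()
    counts = step()
    end = time.time()
    log.append({
        "step": name,
        "start": start,
        "end": end,
        "duration_sec": end - start,
        **counts,        # e.g. inserted / updated / rejected
    })

log = []
run_step("load_stg_orders",
         lambda: {"inserted": 120, "updated": 8, "rejected": 2},
         log)
```

From records like these, both the ETL team's success/failure statistics and the end users' data-freshness measure can be derived.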
The big categories of metadata include table structure metadata (25 percent), data-cleaning results (25 percent), and process results (50 percent). Metadata is developed in depth in Chapter 4 and Chapter 9.
Auditing Data Transformation Steps
In volatile environments, where source system data is constantly changing, it is mandatory that the data warehouse be able to prove that the data within it is accurate.
The ETL process must maintain a before picture of the data, a snapshot taken before the data goes through the data-cleansing routines.
Snapshots of extract data are stored in the data-staging area and made available for audit purposes. A before and after glimpse of the data, along with the metadata describing the data-cleansing logic, depicts exactly how the data within the data warehouse is derived and promotes confidence in the quality of its data.
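One cheap way to support this before/after comparison is to fingerprint each snapshot, so an auditor can quickly tell whether (and where) cleansing changed the data. This is my own illustrative technique, not one the book prescribes:

```python
import hashlib
import json

def snapshot_digest(rows):
    """Fingerprint a snapshot of staged rows so before/after
    copies can be compared cheaply during an audit. Row order
    is normalized so the digest is deterministic."""
    payload = json.dumps(sorted(rows, key=str), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

before = [{"id": 1, "name": " acme "}]
after  = [{"id": 1, "name": "ACME"}]   # result of a cleansing rule
changed = snapshot_digest(before) != snapshot_digest(after)
```

When the digests differ, the auditor drills into the full before and after snapshots plus the documented cleansing rules to verify that the change was the intended one.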
Ralph Kimball and Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.