Database Concepts, Third Normal Form (3NF)

As I was studying Data Vault concepts I found myself in a dire desire of understanding more of the third normal form, 3NF. Here in this article I will present you everything I’ve gathered so far about the 3NF.

Why is the 3NF important to Data Vault? Well, you certainly don’t need to be a jedi master of the 3NF to perform Data Vault modeling, but understanding its characteristics and how the many different keys behave in a RDBMS can definitely help creating your model.

Before we can fully understand what 3NF is, we need to understand 1NF and 2NF.

Note: Please, see the reference at the bottom of this article for all material used to produce this piece.

What is 1NF?

In order to meet 1NF a table must meet the following criteria:

  1. The data are stored in a two-dimensional table.
  2. There are no repeating groups.
  3. It must have a primary key

What are repeating groups?

A repeating group is an attribute that has more than one value in each row of a table. A repeating group is directly analogous to a multivalued attribute in an ER diagram.

To achieve 1NF you need to handle your table, much likely creating new tables, so data can be normalized.

Take the table players, for this example. It as many multivalued attributes, which inflicts 1NF.

player_iditems
jane1 amulets, 2 rings, 1 shields
john12 arrows, 1 bow, 2 rings
mary2 rings, 1 staff, 2 hood
lisa1 rings, 2 dagger, 2 hood

In order to achieve 1NF we have to decompose this table in two tables, like these:

player_id
jane
john
mary
lisa

Here played being the primary key.

player_idItem_typeQuantity
janeamulets1
janerings2
janeshields1
johnarrows12
johnbow1
johnrings2
maryrings2
marystaff1
maryhood2
lisarings1
lisadagger2
lisahood2

Here the combination of player + items being the primary key.

What is 2NF?

In theoretical terms, second formal form relations are defined as follows:

  1. The relation is in first normal form.
  2. All non-key attributes are functionally dependent on the entire primary key. Meaning: no part-key dependency allowed.

Let’s enhance our last table, to add player rating:

player_iditem_typeitem_quantityplayer_rating
janeamulets1Intermediate
janerings2Intermediate
janeshields1Intermediate
johnarrows12Beginner
johnbow1Beginner
johnrings2Beginner
maryrings2Advanced
marystaff1Advanced
maryhood2Advanced
lisarings1Advanced
lisadagger2Advanced
lisahood2Advanced

On the table above item_quantity and player_rating are non-key attributes, meaning, they do not compose or participate on the primary key for this table. We can therefore say that: player_id + item_type determines item_quantity, the terminology being:

  • { player_id , item_type } -> item_quantity

Meaning that, for every unique combination of player_id + item_type (aka, for every primary key) there is only one item_quantity. The next question is, are there other functional dependencies on this table? Yes, there are. Here are they:

  • player_id -> player_rating

But why is the first functional dependency correct and not the second one?

Because functional dependency is a one-way relationship between two attributes such that at any given time, for each unique value of attribute A, only one value of attribute B is associated with it.

And, as we can see here, on the functional dependency “player_id -> player_rating” there are no uniqueness on this relationship.

Now that we have identified all the functional dependencies (FD) we can use the determinant of each of this FDs, and create a separate table, a separate entity, using those determinants as primary key. Like these:

Note on determinant: To clarify what are our determinants here we first need to know which are the Functional Dependencies we got, so here are the FDs:

  1. { player_id , item_type } -> item_quantity
  2. player_id -> player_rating

So we have determinant { player_id , item_type } for our FD #1 and player_id as our determinante for FD #2.

What is the definition of Determinant?

An attribute on which other attributes are functionally dependent on.

player_idplayer_rating
janeIntermediate
johnBeginner
maryAdvanced
lisaAdvanced
player_iditem_typeitem_quantity
janeamulets1
janerings2
janeshields1
johnarrows12
johnbow1
johnrings2
maryrings2
marystaff1
maryhood2
lisarings1
lisadagger2
lisahood2

By doing that we assure it’s now on 2NF, and that there are no part-key dependencies, in other words, every attribute depends on the whole primary key.

What is the 3NF?

The theoretical definition of third normal form says:

  1. The relation is in second normal form.
  2. There are no transitive dependencies.

Another way to describe 3NF:

  • Each non-prime attribute in a table should depend on every candidate key.
  • It should never depend on part of a candidate key.
  • It should never depend on other non-prime attributes.

Know this already, the 3NF* mantra:

Every attribute in a table should depend on the key, the whole, and nothing but the key.

So, let’s take this enhanced player table:

player_idplayer_ratingplayer_skill_level
janeIntermediate6
johnBeginner3
maryAdvanced8
lisaAdvanced9

We added player_skill_level, a scale so we know how high is the player skill level.

skill_level        
123456789
 beginner  intermediate  advanced 

Let’s assume that JOHN skill level went up from 3 to 4, but, for some reason, the player_rating wasn’t properly updated, resulting on this table:

player_idplayer_ratingplayer_skill_level
janeIntermediate6
johnBeginner4
maryAdvanced8
lisaAdvanced9

As we can see, player_rating should be Intermediate, but it is showing Beginner due to an update anomaly.

It is important to say that it is easy for us, as humans, to understand that there is an error on this table, but this table is perfectly fine regarding 2NF, it is in compliance with 2NF as all non-keys attributes are functionally dependent on the entire primary key.

So the real question is, how are those non-key attributes dependent on the key?

Player_rating is dependent on player_id. Like this:

  • player_id -> player_rating

Player_skill_level is also dependent on player_id but indirectly, like this:

  • player_id -> player_skill_level -> player_rating
    • this is a transitive dependency, where player_rating depends on player_skill_level

A dependency like the one of player_rating is called a transitive dependency. A dependency of a non-key attribute on another non-key attribute.

To achieve 3NF we have to adjust our tables so they look like this:

player_idplayer_skill_level
jane6
john4
mary8
lisa9
player_skill_levelplayer_rating
1Beginner
2Beginner
3Beginner
4Intermediate
5Intermediate
6Intermediate
7Advanced
8Advanced
9Advanced

*3NF can be described as:

Every non-key attribute in a table should depend on the key, the whole, and nothing but the key.

*3NF enhanced version, Boyce-Codd Normal Form (BCNF), can be described as:

Every attribute in a table should depend on the key, the whole, and nothing but the key.

Note about Transitive Dependencies:

Transitive dependencies comes from a functional dependency pattern, such as: A -> B and B -> C; therefore A -> C. A transitive dependency therefore exists only when the determinant that is not the primary key is not a candidate key for the relation (remember: relation here is the relation between attributes).

When you have a transitive dependency in a 2NF relation, you should break the relation into two smaller relations, each of which has one of the determinants in the transitive dependency as its primary key. The attributes determined by the determinant become non-key attributes in each relation. This removes the transitive dependency and places the relation in third normal form.

What is Boyce-Codd Normal Form?

There is not much of a difference between 3NF and BCNF, in fact, in the real world most 3FN models are in fact BCNF models.

Note: If your relations are in third normal form and do not exhibit the special characteristics that BCNF, 4NF, and 5NF were designed to handle, then they are automatically in 5NF.

Before discussing about BCNF let’s review 3NF definition:

  • Each non-prime attribute in a table should depend on every candidate key.
  • It should never depend on part of a candidate key.
  • It should never depend on other non-prime attributes.

However, if you pay close attention to this 3NF definition, you will notice that if every attribute on a table is a prime attribute, then transient dependencies are going to happen, and that, regardless of being accepted as 3NF, is a design failure. That’s why we have BCNF.

See this table:

release_yearpopularity_rankingmovie_namerelease_year_and_month
20081The Dark Knight2008-07
20082Indiana Jones2008-05
20083Kung Fu Panda2008-06
20091Avatar2009-12
20092Harry Potter2009-07
20093Ice Age2009-07

Here are the candidate keys:

  1. Movie_name
  2. { release_year , popularity_ranking }
  3. { release_year_and_month , popularity_ranking }

All keys are prime attributes, there are no non-prime attributes.

The problem we have appears on the following functional dependency:

  • release_year_and_month -> release_year

This problem occurs because if you happen to update the release_year column you can end with an update anomaly if your column release_year do not update accordingly. Like this:

release_yearpopularity_rankingmovie_namerelease_year_and_month
20081The Dark Knight2008-07
20082Indiana Jones2008-05
20083Kung Fu Panda2018-06
20091Avatar2009-12
20092Harry Potter2009-07
20093Ice Age2009-07

However, even knowing that there is an anomaly in this table, according to 2NF and 3NF the table complies, and that’s why BCNF enhance the 3NF definition.

To be in Boyce-Codd normal form, a relation must meet the following rules:

  1. The relation must be in third normal form.
  2. All determinants must be candidate keys.

What is a Determinant?

An attribute on which other attributes are functionally dependent on.

What is a Candidate Key?

A candidate key is an attribute, or set of attributes, that uniquely identify a row in the table. More detailed way of putting it: a column or combination of columns (columns also means attributes) that can be used as the primary key of a relation (relation here is about attribute to attribute relation).

What is a non-prime attribute?

A prime attribute is an attribute that belongs to at least one candidate key, therefore, non-prime attributes are attributes that doesn’t belong to any candidate key.

As we know, release_year depends on release_year_and_month, that’s our problem, because release_year_and_month is a determinant, since it determines release_year, but elease_year_and_month is not a candidate key.

To fix this we can simple replace the release_year_and_month attribute for release_month attribute.

release_yearpopularity_rankingmovie_namerelease_month
20081The Dark KnightJuly
20082Indiana JonesMay
20083Kung Fu PandaJune
20091AvatarDecember
20092Harry PotterJuly
20093Ice AgeJuly

The table is now in BCNF, and here we can informally state that BCNF definition as:

Every attribute in a table should depend on the key, the whole, and nothing but the key**.

** Keep in mind that “KEY” here refers to “every candidate key”.

References

Learn Database Normalization – 1NF, 2NF, 3NF, 4NF, 5NF

Learn Boyce-Codd Normal Form (BCNF)

https://www.sciencedirect.com/topics/computer-science/transitive-dependency

https://www.sciencedirect.com/topics/computer-science/third-normal-form

Database Keys Made Easy – Primary, Foreign, Candidate, Surrogate, & Many More

Jan L. Harrington, in Relational Database Design (Third Edition), 2009

Publicado por Pedro Carvalho

Apaixonado por análise de dados e Power BI

Deixe uma resposta

Preencha os seus dados abaixo ou clique em um ícone para log in:

Logo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair /  Alterar )

Imagem do Twitter

Você está comentando utilizando sua conta Twitter. Sair /  Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair /  Alterar )

Conectando a %s

%d blogueiros gostam disto: