Normalization

Introduction

When you’re involved in data modeling, you should know the term  normalization. Normalization refers to the process where you apply a set of rules – called normal forms –  to avoid data redundancy as well as update, insert or delete anomalies.  If the data is not normalized, many problems can occur while loading an integrated conceptual model into the DBMS. In this article, I’ll be showing how to structure a data set by applying the first three normal forms.

 

Data Normalization

Let’s suppose you started working as a data engineer at a car dealer company. One of your tasks is to improve the structure of a database – which is an excel spreadsheet – used by the company to hold their data. You just viewed carefully the spreadsheet – figure 1 – and you noticed a few things:

       Figure 1. Dealers table. The column DealerId is the primary key of the table

 

First at all, a table should have a single purpose. The above table has more than one purpose:

  • Identify all company’s dealers.
  • Stores the customer’s dealers.
  • Show the office’s dealers information, such as telephones and fax number.

This kind of structure generates anomalies while inserting, deleting or updating the data. This is called data anomalies. By applying normalization, you reduce the anomalies that appears in the data. There are three kinds of anomalies that you must know:

  • Insert anomaly – it is the inability to add data to the database due to absence of other data. Let’s suppose you want to insert a new office to the above table. As the primary key of this table is the DealerId column, then you need to insert an entire row to add a new desired office.
  • Delete anomaly – it is the unintended loss of data due to deletion of other data. For example, what could happen if Chris Hernandez quits the company? In this case,  the California office will be deleted, along with Chris’s customers.
  • Update anomaly – it is a data inconsistency that results from data redundancy and a partial update. You can notice there are two dealers working in the New York office. In case the New York office decided to change its fax number, you need to update the fax number in each record  related  to the New York office. This produces data redundancy, and remember that one of the purposes of normalization is to reduce the data redundancy.

 

In order to avoid these anomalies and structure data correctly we need to use normalization, which at least should have 3 normal forms. These thee normal forms will be explained below.

 

First Normal Form (1NF)

The first normal form says:

  • Each table needs to contain a primary key which uniquely identifies each row.
  • Each column needs to contain atomic values, i.e. values that cannot be divided into more values; and there can’t also be a group of columns belonging to the same domain.

Looking again at our current model, the values of the Customers column aren’t atomic, because there’s more than one value in at least one cell – Sam Hudson has three customers and all are just separated by comma. Also, DealerOfficeTelephone, DealerOfficeTelephone2 and DealerOfficeTelephone3 columns remain in the same column domain: dealer’s telephone. This design, for example, limits the number of office telephones just to three, because there’s only three columns related to office telephone.

The way to achieve the first normal form is by moving the grouping columns to a new table and create a foreign key referencing to the table where the grouping columns were moved. In this case, there are two grouping columns: DealerOfficeTelephones and Customers. Both domains are going to be in their respective table: DealerTelephones and Customers.

By the way, asides showing the data in a table, I’ll be modeling the state of the database – after apply a normal form – by using a database diagram tool. I personally suggest to use the SqlDBM tool because of the simplicity and intuitive design, so it’s perfectly useful for every beginner and expert – I’ll share the SqlDBM site to start using.

 

        Figure 2. This is the result of applying the first normal form to our model

 

There are two new tables: DealerTelephones and Customers. Each table contains a foreign key – DealerId – referencing to the primary key of Dealers table. Also, the DealerId column at Customers table, asides to be a foreign key, it’s part of the table’s primary key, but why? By applying this kind of constraint – two columns together being table’s primary key – you enforce that a customer can be related to more than one dealer, and a dealer can be related to more than one customer.

 

Second Normal Form (2NF)

The second normal form says:

  • The table must be in the first normal form.
  • All the non – key columns must depend only on the table’s primary key.

You already know that all tables are in first normal form, but you don’t know if all non-key columns are dependent to the table’s primary key. First, you must know that a non-key column – as its name indicates – is a column that’s not the table’s primary key, or it’s not part of the primary key – in case the table has a compound primary key –. In addition, to understand the dependence between columns, let’s say there are two columns: column A and column B. The column B is dependent on column A when once the value of column A is known, the value of the column B is also known.

By knowing the prior concepts, you see there’s an issue with the Customers table. The Customers’ primary key is compound: DealerId column along CustomerId column. All non- key columns must depend on the entire primary key, it cannot depend partially on the primary key. Check carefully that CustomerName, CustomerCity and ZipCode columns only depend to CustomerId, but not to DealerId. Once you know the CustomerId, you’ll know the value of CustomerName, CustomerCity and ZipCode, but if you know the DealerId, you’ll not know exactly the CustomerName, CustomerCity and ZipCode columns, because remember that a dealer can be related to more than one customer. This situation alerts you that the non-key columns just depend to a part of the primary key, and not to the entire primary key.

Also, remember that a table should have only a single purpose. However, now the Customers table has two purposes: it holds customer information, such as the name, the city where the customer lives and so on, and identify the dealer’s customers.

The way to achieve the second normal form is by moving the DealerId column from the Customers table, but where? Since a customer can be related to one or many dealer, and a dealer can be related to one or many customer, the relationship between both entities is many to many, and this says to us that a new table needs to be created. This table will contain two columns: DealerId and CustomerId. Both columns are foreign keys and compound the primary key. To gain more details, let’s see the following diagram:

           Figure 3. The current state of the database after applied the second normal form

 

Third Normal Form (3NF)

The third normal form says:

  • The table must be in the second normal form.
  • All non – key columns must not depend transitively on the primary key.

“All non – key columns must not depend transitively on the primary key”,  such confusing sentence, but let’s see what it means. You already know what  a non- key column and the dependence between columns are, but what about a transitive dependence? To explain this new concept, suppose you have three numbers: A, B and C. If A is equal to B, and B is equal to C, then clearly A is equal to C; this is what transitive is called for. In database terms, instead of numbers, we have columns. Once you know the value of column A, you know the value of the column B, and, once you know the value of the column B, you know the value of the column C. There’s a transitive dependence between columns A and C through column B.

Let’s look to the Customers table, specifically the columns CustomerCity and ZipCode. Both columns are related by dependence: once you know the city’s zip code, you’ll know the customer city  and where the customer is from; this means the CustomerCity column is dependent on ZipCode column. Remember that all columns depend on the table’s primary key – we issued this by applying the second normal form –, so once you know customer ID, you’ll know also the customer’s city zip code. Notice there’s a transitive dependence between columns CustomerCity and CustomerId through ZipCode column.

To avoid the transitive dependence, the column which serves as intermediary between the two transitive columns needs to be moved into a new table, along with the columns that are dependent to it. The ZipCode column is considered as the intermediary column, and the CustomerCity relies on the ZipCode column, so both columns will be moved into a new table – see the image below.

                                                                                Figure 4. ZipCode table

As a zip code can be related to one or more customers – this is, a lot of customers can be from the same city –, there’ll be a foreign key in the Customers table referencing to the ZipCode’s table primary key. By doing this, our data set is normalized – see the image below.

 

 

                                                   Figure 5. Car dealer Company database after applied normalization

Conclusion

Although we haven’t reached all the normal forms, for an OLTP database, the third normal form is enough for a good database performance. Remember that data normalization helps to reduce data duplication, enforce data integrity, increase database performance and make the queries  less complex.

I suggest you practice data normalization to enhance your database design skills. Just use a database modeling tool, for example, SqlDBM like I used in this article in order to apply each normal form.

 

Author: Alfonso Hernandez