Database Normalization – 1NF HOW-TO

In this post we are going to tackle the concept of database normalization and how it can be applied. Before we get into that however, let’s look at the intent of data normalization. In short, normalization, as it relates to the field of relational database design, is an effort to reduce or eliminate redundant data. This is done by breaking the data down into well-defined constructs and then associating them to each other through the use of defined relationships.

So what is redundant data? By definition, it is the unnecessary duplication of data. Why should we strive to eliminate it? First, it will reduce the overall storage requirement of the data. Secondly, it will make querying the data much easier. Thirdly, it makes it easy to change data. We’ll see an example of these benefits later in this series on normalization.

Notice that I did not say that normalization will guarantee data integrity. This is also important, however, strictly speaking, normalizaiton promotes data integrity but does not necessarily govern it. This is, in my experience, left to the DBA to define and implement.

So what is an example of redundant data? Take a look at the table below. It’s depicts a fictional table used to store student registration data. Can you see the redundant data?

Student Class1 Instructor1 Class2 Instructor2
Smith, John Biology Darwin, Charles Math Pascal, Blaise
Ericson, Robert Biology Darwin, Charles Chemistry Bohr, Niels
Edwards, Abby Physics Newton, Isaac Biology Darwin, Charles
Richmon, Stanley Computer Science Boyer, Robert Archaeology Howard, Carter

This table design violates first-normal-form, also referred to as 1NF. 1NF dictates that a table must not have any repeating groups. A repeating group occurs when a given instance, in this case STUDENT, has an attribute, in this case CLASS and INSTRUCTOR that contains multiple values. In this example, the columns CLASS1 and CLASS2 are representitive of a repeating group. They both convey an attribute type of CLASS with multiple values for each STUDENT. The values are stored as CLASS1 and CLASS2. The same is true of the INSTRUCTOR attribute. More than likely, when you see any column with a numeral such as this example, it’s a safe bet that it’s a repeating group. A novice DBA might redesign the table to look like this in an effort to acheive 1NF:

Student Class Instructor
Smith, John Biology | Math Darwin, Charles | Pascal, Blaise
Ericson, Robert Biology | Chemistry Darwin, Charles | Bohr, Niels
Edwards, Abby Physics | Biology Newton, Isaac | Darwin, Charles
Richmon, Stanley Computer Science | Archaeology Boyer, Robert | Howard, Carter

Close but no cigar. 1NF also dictates that a column (attribute) must be atomic. That is, it can contain only one value. This second example violates that because the STUDENT, CLASS and INSTRUCTOR columns contain multiple values. Below is the same data in 1NF:

Student Table

StudentId LastName FirstName
1 Smith John
2 Ericson Robert
3 Edwards Abby
4 Richmon Stanley

Class Table

ClassId ClassName
1 Archaeology
2 Biology
3 Chemistry
4 Computer Science
5 Math
6 Physics

Instructor Table

InstructorId LastName FirstName
1 Bohr Niels
2 Boyer Robert
3 Carter Howard
4 Darwin Charles
5 Newton Isaac
6 Pascal Blaise

Notice how we’ve broken down our original table into three, well-defined tables? Each table contains the same data as our original table in terms of the entity or instance they represent. The repeating group is gone and each column in the three tables are atomic. What’s missing though is some construct (table) that relates the three to each other. For that, we use what is known as a relation table. A relation table is one which relates an entity to another entity. Below is the relation table that would be used to associate the three entities to each other:

Class Relation Table

StudentId ClassId InstructorId
1 2 4
1 5 6
2 2 4
2 3 1
3 6 5
3 2 4
4 4 2
4 1 3

As you can see, the table above relates each of the three entities to each other. We can tell that Smith, John (StudentId = 1) is enrolled in Biology (ClassId = 2) and Math (ClassId = 5). We know that Biology (ClassId = 2) is taught by Charles Darwin (InstructorId = 4) and that Math (ClassId = 5) is taught by Blaise Pascal (InstructorId = 6).

You will find that normalized data have many tables but few columns. This is a by-product of the normalization process. For those of you who are astute readers, you may be pondering the fact that the relation table (Class Relation Table) seems to have some duplicative data. You are correct. The StudentId, ClassId and InstructorId do indeed occur multiple times in this table. That’s fine though. Remember that redundant data is defined as the unnecessary duplication of data. In the case of the relation table, the duplication is necessary (and perfectly legal) to relate the three entities to each other.

Come back in a few days. We’ll continue our discussion of the normalization process by exploring the second level of normalization, namely 2NF. Feel free to comment on this post with your thoughts, suggestions or critiques.

Minh

Leave a Reply