While researching data warehousing, I noticed that it is quite hard to find a good definition for the relational normal forms commonly used in data modelling for operational databases. The 3rd normal form is widely used, and one of those things “you know when you see it”, but what is the actual definition? And what about all the other normal forms?
As usual, a very good source is wikipedia (including the linked pages defining each form). However I personally find the “list of objectives” as quoted from Codd unhelpful:
- objective 1 which describes the “anomalies” that normal forms can resove is interesting, but not really helpful
- the other objectives (2, 3, 4) are debatable; eg are normal forms really more resistent to model changes, or easier to understand?
The Wikipedia pages which define each form are also very mathematical and academic.
There is an easier-to-read article from William Kent.
The remainder of this article is just my summary of Hillyer and Kent; I recommend that you read the original articles which are far better written than this one.
Goals of the Normal Form
I prefer this definition of the goals of normal forms:
- eliminating redundant data (avoiding having the same data in more than one table)
- eliminating unrelated data within the same table (splitting tables where columns are not directly coupled)
The only difference in the various normal forms is how extreme they interpret these issues; the higher the number the more strictly the two points above are implemented.
Key and Column Dependencies (First Three Normal Forms)
Most people using or even designing database tables only need to know the first three normal forms. In my opinion, the easiest way to understand them is the definitions from Kent:
- first form: “every column depends on the key”
- second form: “every column depends on the whole key”
- third form: “every column depends on nothing but the key”
Or as Hillyer puts it:
The goal of database normalization is to ensure that every non-key column in every table is directly dependent on the key, the whole key and nothing but the key
Databases are surprisingly often not even in first normal form. Inexperienced designers often “multiplex” different types into the same table, often resulting in lots of nullable columns in a table. However this excess of nullable columns makes writing queries dangerous, and the rules about which data can be expected to exist for specific types cannot be represented in the table schema.
First Normal Form
Requirement: “every column depends on the key”.
This means in practice:
- Define a unique primary key for each table. Optionally, use a surrogate key - but still identify the natural key and add a uniqueness constraint on those fields
- Dont mix completely unrelated data in the same table - one table per logical purpose.
Alternatively, it can be stated that “within a table, every record has the same columns with the same required/optional properties”. That means that two quite different datatypes cannot be mixed in the same table, as each datatype has at least one mandatory field and thus every row requires all types to be present which is obviously not sensible.
There are some NoSQL databases which support variable numbers of fields (eg bigtable-like and document-like) - but relational systems do not. However the “normal forms” don’t really apply to these.
Applying the first normal form to unstructured data results in more tables. The consequence is a need to:
- Define references between tables with a foreign key
- Define link tables to represent many-to-many relations where needed.
That lifetimes of independent objects are now separated; when two records are unrelated then one can be deleted without affecting the other; this is not the case when a row mixes unrelated data together. In fact, this fact can be used in reverse to determine whether data belongs in the same table or not: if it makes sense to delete a set of columns without deleting the whole record (ie if the key still makes sense without the columns), then that set of columns are in fact independent and should be in a separate table.
Using first normal form also reduces the need for lots of nullable columns on a record. A table where many columns are nullable is a hint that perhaps some columns are not related. As a practical measure, writing code or queries that work correctly with nullable columns is tricky - it is nice to minimise them.
One Value Per Column
A somewhat unrelated requirement is also thrown into the first normal form:
- Don’t combine different data items in same column
Sometimes the last point is clear: a single column should generally not have a value like “Cheese, Milk, Celery” - those are three items. However in practice, it depends what users of that column are going to do with it; when the column contents is really a set of values which can be queried or updated independently, then combining them in one column is not accepable in 1st normal form. However if the column contents is only going to be treated as an atomic unit, being updated only via a complete replacement, then it can be considered a single value. Examples:
- Is “22 smith st, myburb, mycity” one item or not? It depends on what you do with the data..
- Is “john smith” one item or not? It depends on whether you search for lastname or not..
There are also extended-sql databases which support array-typed columns (eg postgres, bigquery) - but in general, relational systems do not. The “normal form guidelines” still apply to such databases, but the “atomic value” rule should probably be viewed in a more relaxed manner, as the query language is also extended to support querying of these.
Note that defining cols “member1”, “member2”, “member3” .. “memberN” to hold sets of size up to N is NOT acceptable - at least not in the spirit of normal form, as it requires very complex SQL to correctly query. This is not the case for native array columns, where query syntax has been extended to elegantly handle this.
Performance Implications of First Normal Form
There is no performance benefit to “multiplexing types”, ie combining logically separated columns together into a single table with lots of columns.
There is a performance benefit to “multiplexing values”, ie storing a collection of values within a single column, as long as you don’t actually perform operations on members of that set. However when you do, the performance hit is much worse than having a separate table.
Second Normal Form
Requirement: “every column depends on the whole key”.
This requires moving columns that only depend on part of the primary key to a separate table. This reduces duplicated data when there are two rows with the same partial-primary-key.
Or in other words, it forces the creation of tables with a hierarchical general/detail style relationship, rather than duplicating the general data in each detail record.
Hillyer’s example is a book database. Mixing author and book info in the same record is awkward - an author with multiple books results in duplicated data - thus not second normal form. An author with no books cannot be represented.
Flattening of such relations (breaking 2nd normal) can provide performance benefits, due to avoiding a join. However it wastes space (duplicating facts on each detail row) and is ugly if a generic fact needs to be updated.
Third Normal Form
Requirement: “every column depends on nothing but the key”:
This requires moving columns that depend on the values of other columns. If two columns can be “inconsistent with each other”, ie updating col X forces an update for col Y, then a new table is needed. The new table will include X in the key, and have Y as a value to properly represent the dependency of Y on X.
Breaking 3rd normal form can provide a performance benefit - avoiding a join. However when the number of valid (X,Y) pairs is low enough that the whole table can be easily cached in memory (or some values of X are far more common than others) then the cost of the join may be very small.
Normal Form Alternative Definitions
The article from Kent talks about forms in terms of “facts”, which is an interesting alternative approach to understanding normal forms.
The first normal form definition adds nothing to the discussion above.
Second Normal Form as Facts
Requirement: A non-key field must provide a fact about the whole key.
When some column depends only on part of the key, then two records which share the same partial key will end up duplicating that fact. Obviously, this can only happen when the key is composite (there can neve rbe a partial value for a single-column key!).
This is of course the same as the previous statement “every column depends on the whole key”.
To make the table second-normal-form-compliant, it is necessary to create another table with structure
This effectively means there is an implicit tree of generic/specific entities that has been squashed into a single layer. Create a table for the generic data, and put the relevant facts there.
Third Normal Form as Facts
Each non-key field must provide a fact about only the whole key (ie “depend on nothing but the key”) also elegantly expressed as having a non-key field Y which provides a fact about another non-key field X.
The fix is to create a table TX for expressing
X->Y, then from TA reference a row in TX (ie store a foreign-key to TX).
This can also be considered as a variant of 2nd form problem: consider extending the kex of TA to include column X. That solves the 3rd-form problem for column Y, but creates a second-form problem: there are then other columns that do NOT depend on X.
Breaking third form leads to duplicated data - there may be multiple identical (X1,Y1) pairings in the database. Externalizing these means storing (S1, X1, Y1) once then referencing it with just S1 (one value instead of two). It can also lead to inconsistencies when X is updated in a row, but Y is not.
Fourth Normal Form
This form deals with the representation of 1:N and N:N relations.
When an entity A has a 1:N relation (ie has a set) of property X and a separate set of property Y, then it is technically possible to represent this multi-valued relation in a relational DB without native sets as a single table that contains both X and Y in separate columns of the same row. However this should not be done, because
- the columns for A and B must be nullable, as the parent object might only have one of them
- logic updating A and B is more complex; “deleting” an instance of A might require deleting a row (when B is null) or might require “nulling out” A (when B is not null).
The solution is simple: create separate tables for A and B.
I think this is actually pretty natural for most DB designers. Once a set has been externalized as a 1:N or N:N table, then it is fairly obvious when columns do not belong together.
In fact, unless I have misunderstood something, it seems that a table which combines unrelated properties X and Y is in fact breaking first normal form. Or is it that the description of first-normal-form is in fact pulling in a constraint from the fourth normal form? Regardless, it is pretty sensible and obvious not to combine uncorrelated data in the same table. Kent’s examples are good on this topic.
Fifth Normal Form
Ensure that the foreign-keys used to represent these relations are minimal (don’t include unneeded components).
As far as I can tell, this applies to cases where a table T has a 1:N or N:M relation to two or more other tables X and Y. In general, fourth normal form applies here. However in some cases there might be interactions between the different relations that rule out some combinations, ie not every (
T->X) and (
T->Y) combination is allowed. This situation requires the fifth normal form to properly express these constraints (ie prevent the invalid combinations).
Normal Forms as Anomalies
Codd defines the objectives of the normal forms as a way to remove a series of “anomalies” (inconsistencies) in data representation in relational systems. I don’t find this particularly helpful myself, but here they are:
Insertion anomaly: a row represents multiple logical objects, but has only one key - and thus one of those logical objects is a “primary” object, with the rest being secondary objects. This makes it impossible to represent secondary objects on their own - which may be valid if there really is a “part-of” relation between the primary and secondary objects, ie the secondary objects have no lifecycle independent of their “parent”. However it is an invalid schema if the secondary objects can exist alone - the model does not support that.
Deletion anomaly: very similar to insertion anomaly. When a row represents multiple logical objects, and secondary object lifecycles are linked to the parent then all is fine. But if the secondaries might outlive the primary, the model is invalid - deleting the “parent” cannot be done as it would leave a row without a valid primary key.
Update anomaly: occurs when data is duplicted; updating a single logical object then requires updating multiple rows. Note that the same constant occurring in multiple rows is not a problem: multiple “messages” with the same loglevel string, eg “error”, are acceptable as the loglevel is a per-message attribute rather than a reference to a shared logical entity. Updating the log-level for one message is separate from updating the level on another object. However multiple records referencing the same person object is wrong - updating the personal details should be reflected on all related data at once.