Моделирование бизнес-процессов / Моделирование бизнес-процессов / ER-диаграмы / 0849315484 Entity-Relationship Diagrams
.pdfHow about our second FD, School→ Location? There are only three
schools in the example and you may note that for every school, there is only one location, so no FD violation.
Now, we want to point out something interesting. If we define a functional dependency X → Y and we define a functional dependency Y → Z, then we
know by inference that X → Z. Here, we defined SSN → School. We also defined School → Location, so we can infer that SSN → Location
although that FD was not originally mentioned. The inference we have illustrated is called the transitivity rule of FD inference. Here is the transitivity rule restated:
Given X → Y
Given Y → Z
Then X → Z
To see that the FD SSN→ Location is true in our data, you can note that
given any value of SSN, you always find a unique location for that person. Another way to demonstrate that the transitivity rule is true is to try to invent a row where it is not true and then see if you violate any of the defined FDs.
We defined these FD's:
Given: SSN → Name
SSN → School
School → Location
We are claiming by inference using the transitivity rule that SSN→
Location. Suppose that we add another row with the same SSN and try a different location:
SSN |
Name |
School |
Location |
|
|
|
|
101 |
David |
Alabama |
Tuscaloosa |
102 |
Chrissy |
MSU |
Starkville |
103 |
Kaitlyn |
LSU |
Baton Rouge |
104 |
Stephanie |
MSU |
Starkville |
105 |
Lindsay |
Alabama |
Tuscaloosa |
106 |
Chloe |
Alabama |
Tuscaloosa |
106 |
Chloe |
MSU |
Starkville |
|
|
|
|
Now, we have satisfied SSN→ Name but violated SSN→ Location. Can we do this? We have no value for School, but we know that if School = "Alabama" as defined by SSN → School, then we would have the following rows:
SSN Name School Location
106 Chloe Alabama Tuscaloosa
106 Chloe Alabama Starkville
However, this is a problem. We cannot have Alabama and Starkville in the same row because we also defined School → Location. So in creating
our counterexample, we came upon a contradiction to our defined FDs. Hence, the row with Alabama and Starkville is bogus. If you had tried to create a new location like this:
SSN Name School Location
106 Chloe Alabama Tuscaloosa
106 Chloe FSU Tallahassee
You violate the FD, SSN→ School — again, a bogus row was created. By
being unable to provide a counterexample, you have demonstrated that the transitivity rule holds. You may prove the transitivity rule more formally (see Elmasri and Navathe, 2000, p. 479).
There are other inference rules for functional dependencies. We will state them and give an example, leaving formal proofs to the interested reader (see Elmasri and Navathe, 2000).
The Reflexive Rule
If X is a composite, composed of A and B, then X→ A and X→ B. Example: X = Name, City. Then we are saying that X → Name and X → City.
Example: |
|
|
|
Name |
City |
|
|
|
|
David |
Mobile |
|
Kaitlyn |
New Orleans |
|
Chrissy |
Baton Rouge |
The rule, which seems quite obvious, says if I give you the combination <Kaitlyn, New Orleans>, what is this person's Name? What is this person's City? While this rule seems obvious enough, it is necessary to derive other functional dependencies.
The Augmentation Rule
If X→ Y, then XZ→ Y. You might call this rule, "more information is not really
needed, but it doesn't hurt." Suppose we use the same data as before with Names and Cities, and define the FD Name → City. Now, suppose we add
a column, Shoe Size:
Name |
City |
Shoe Size |
|
|
|
David |
Mobile |
10 |
Kaitlyn |
New Orleans |
6 |
Chrissy |
Baton Rouge |
3 |
|
|
|
Now, I claim that because Name→ City, that Name+Shoe Size → City
(i.e., we augmented Name with Shoe Size). Will there be a contradiction here, ever? No, because we defined Name → City, Name plus more
information will always identify the unique City for that individual. We can always add information to the LHS of an FD and still have the FD be true.
The Decomposition Rule
The decomposition rule says that if it is given that X → YZ (that is, X defines both Y and Z), then X → Y and X → Z. Again, an example:
Name |
City |
Shoe Size |
|
|
|
David |
Mobile |
10 |
Kaitlyn |
New Orleans |
6 |
Chrissy |
Baton Rouge |
3 |
|
|
|
Suppose I define Name → City, Shoe Size. This means for every occurrence of Name, I have a unique value of City and a unique value of
Shoe Size. The rule says that given Name → City and Shoe Size together, then Name → City and Name → Shoe Size. A partial proof using the reflexive rule would be:
Name |
→ City, |
Shoe Size |
(given) |
City, Shoe Size → City |
(by the reflexive rule) |
||
Name |
→ City |
|
(using steps 1 and 2 and the transitivity rule) |
The Union Rule
The union rule is the reverse of the decomposition rule in that if X → Y and X
→Z, then X → YZ. The same example of Name, City, and Shoe Size illustrates the rule. If we found independently or were given that Name → City and given that Name → Show Size, we can immediately write Name
→City, Shoe Size. (Again, for further proofs, see Elmasri and Navathe, 2000, p. 480.)
You might be a little troubled with this example in that you may say that
Name is not a reliable way of identifying City; Names might not be unique.
You are correct in that Names may not ordinarily be unique, but note the
language we are using. In this database, we define that Name → City and, hence, in this database are restricting Name to be unique by definition.
Keys and FDs
The main reason we identify the FDs and inference rules is to be able to find keys and develop normal forms for relational databases. In any relational table, we want to find out which, if any attribute(s), will identify the rest of the attributes. An attribute that will identify all the other attributes in row is called a "candidate key." A "key" means a "unique identifier" for a row of information. Hence, if an attribute or some combination of attributes will always identify all the other attributes in a row, it is a "candidate" to be "named" a key. To give an example, consider the following:
SSN |
Name |
School |
Location |
|
|
|
|
101 |
David |
Alabama |
Tuscaloosa |
102 |
Chrissy |
MSU |
Starkville |
103 |
Kaitlyn |
LSU |
Baton Rouge |
104 |
Stephanie |
MSU |
Starkville |
105 |
Lindsay |
Alabama |
Tuscaloosa |
106 |
Chloe |
Alabama |
Tuscaloosa |
|
|
|
|
Now suppose I define the following FDs:
SSN → Name
SSN → School
School → Location
What I want is the fewest number of attributes I can find to identify all the rest — hopefully only one attribute. I know that SSN looks like a candidate, but can I rely on SSN to identify all the attributes? Put another way, can I show that SSN "defines" all attributes in the relation? I know that SSN defines Name and School because that is given. I know that I have the following transitive set of FDs:
SSN → School
School → Location
Therefore, by the transitive rule, I can say that SSN → Location. I have
derived the three FDs I need. Adding the reflexive rule, I can then use the union rule:
SSN → Name (given) SSN → School (given)
SSN → Location (derived by the transitive rule)
SSN → SSN (reflexive rule (obvious))
SSN → SSN, Name, School, Location (union rule)
This says that given any SSN, I can find a unique value for each of the other fields for that SSN. SSN therefore is a candidate key for this relation. In FD theory, once we find all the FDs that an attribute defines, we have found the closure of the attribute(s). In our example, the closure of SSN is all the attributes in the relation. Finding a candidate key is the finding of a closure of an attribute or a set of attributes that defines all the other attributes.
Are there any other candidate keys? Of course! Remember the augmentation rule that tells us that because we have established the SSN as the key, we can augment SSN and form new candidate keys: SSN, Name is a candidate key. SSN, Location is a candidate key, etc. Because every row in a relation is unique, we always have at least one candidate key — the set of all the attributes.
Is School a candidate key? No. You do have the one FD that School → Location and you could work on this a bit, but you have no way to infer that School → SSN (and in fact with the data, you have a counterexample that shows that School does not define SSN).
Keys should be a minimal set of attributes whose closure is all the attributes in the relation — "minimal" in the sense that you want the fewest attributes on the LHS of the FD that you choose as a key. In our example, SSN will be minimal (one attribute), whose closure includes all the other attributes.
Once we have found a set of candidate keys (or perhaps only one as in this case), we designate one of the candidate keys as the primary key and move on to normal forms.
These FD rules are useful in developing Normal forms. Normal forms can be expressed in more than one way, but using FDs is arguably the easiest way to see this most fundamental relational database concept. E. Codd (1972) originally defined three normal forms: 1NF, 2NF, and 3NF.
Checkpoint 1.3
1.What are functional dependencies? Give examples.
2.What does the augmentative rule state? Give examples.
3.What does the decomposition rule state? Give examples.
A Brief Look at Normal Forms
In this section we briefly describe the first, second, and third normal forms.
First Normal Form (1NF)
The first normal form (1NF) requires that data in tables be two-dimensional
— that there be no repeating groups in the rows. An example of a table not in 1NF is where there is an employee "record" such as:
Employee(name, address, {dependent name})
where {dependent name} infers that the attribute is repeated. Sample data for this record might be:
Smith, |
123 4th |
St., |
{John, Mary, Paul, |
Sally} |
Jones, |
4 Moose |
Lane., |
{Edgar, Frank, |
Bob} |
Adams, |
88 Tiger Circle., {Kaitlyn, Alicia, Allison} |
The problem with putting data in tables with repeating groups is that the table cannot be easily indexed or arranged so that the information in the repeating group can be found without searching each record individually. Relational people usually call a repeating group "nonatomic" (it has more than one value and can be broken apart).
Second Normal Form (2NF)
The second normal form (2NF) requires that data in tables depends on the whole key of the table. Partial dependencies are not allowed. An example:
Employee (name, job, salary, address)
where it takes a name + job combination (a concatenated key) to identify a salary, but address depends only on name. Some sample data:
Name |
Job |
Salary |
Address |
|
|
|
|
Smith |
Welder |
14.75 |
123 4th St |
Smith |
Programmer |
24.50 |
123 4th St |
Smith |
Waiter |
7.50 |
123 4th St |
Jones |
Programmer |
26.50 |
4 Moose Lane |
Jones |
Bricklayer |
34.50 |
4 Moose Lane |
Adams |
Analyst |
28.50 |
88 Tiger Circle |
|
|
|
|
Can you see the problem developing here? The address would be repeated for each occurrence of a name. This repeating is called redundancy and leads to anomalies. An anomaly means that there is a restriction on doing something due to the arrangement of the data. There are insertion anomalies, deletion anomalies, and update anomalies. The key of this table is Name + Job — this is clear because neither one is unique and it really takes both name and job to identify a salary. However, address depends only on the name, not the job; this is an example of a partial dependency. Address depends on only part of the key. An example of an insertion anomaly would be where one would want to insert a person into the table above, but the person to be inserted is not yet assigned a job. This cannot be done because a value would have to be known for the job attribute. Null
values cannot be valid values for keys in relational databases (this is known as the entity-integrity constraint). An update anomaly would be where one of the employees changed his or her address. Three rows would have to be changed to accommodate this one change of address. An example of a delete anomaly would be that Adams quits, so Adams is lost, but then the information that the analyst is being paid $28.50 is also lost. Therefore, more related information than was previously anticipated is lost.
Third Normal Form (3NF)
The third normal form (3NF) requires that the data in tables depends on the primary key of the table. A classic example of non-3NF is:
Employee (name, address, project#, project-location)
Suppose that project-location means the location from which a project is controlled, and is defined by the project#. Some sample data will show the problem with this table:
Name |
Address |
Project# |
Project-location |
|
|
|
|
Smith |
123 4th St |
101 |
Memphis |
Smith |
123 4th St |
102 |
Mobile |
Jones |
4 Moose Lane |
101 |
Memphis |
|
|
|
|
Note the redundancy in this table. Project 101 is located in Memphis; but every time a person is recorded as working on project 101, the fact that they work on a project that is controlled from Memphis is recorded again. The same anomalies — insert anomaly, update anomaly, and delete anomaly — are also present in this table.
To clear the database of anomalies and redundancies, databases must be normalized. The normalization process involves splitting the table into two or more tables (a decomposition). After tables are split apart (a process called decomposition), they can be reunited with an operation called a "join." There are three decompositions that would alleviate the normalization problems in our examples, as discussed below.
Examples of 1NF, 2NF, and 3NF
Example of Non-1NF to 1NF
Here, the repeating group is moved to a new table with the key of the table from which it came.
Non-1NF: |
|
|
|
Smith, |
123 4th |
St., |
{John, Mary, Paul, Sally} |
Jones, |
4 Moose |
Lane., |
{Edgar, Frank, Bob} |
Adams, |
88 Tiger Circle., {Kaitlyn, Alicia, Allison} |
is decomposed into 1NF tables with no repeating groups:
1NF Tables:
EMPLOYEE table
Name Address
Smith |
123 4th St |
Jones |
4 Moose Lane |
Adams |
88 Tiger Circle |
|
|
DEPENDENT table
DependentName EmployeeName
John |
Smith |
Mary |
Smith |
Paul |
Smith |
Sally |
Smith |
Edgar |
Jones |
Frank |
Jones |
Kaitlyn |
Adams |
Alicia |
Adams |
Allison |
Adams |
|
|
In the EMPLOYEE table, Name is defined as a key — it uniquely identifies the rows. In the DEPENDENT table, the key is a combination (concatenation) of DependentName and EmployeeName. Neither the DependentName nor the EmployeeName is unique in the DEPENDENT table, and therefore both attributes are required to uniquely identify a row in the table. The EmployeeName in the DEPENDENT table is called a foreign key because it references a primary key, Name in another table, the EMPLOYEE table. Note that the original table could be reconstructed by combining these two tables by recording all the rows in the EMPLOYEE table and combining them with the corresponding rows in the EMPLOYEE table where the names were equal (an equi-join operation). Note that in the derived tables, there are no anomalies or unnecessary redundancies.
Example of Non-2NF to 2NF
Here, partial dependency is removed to a new table.
Non-2NF:
Name |
Job |
Salary |
Address |
|
|
|
|
Smith |
Welder |
14.75 |
123 4th St |
Smith |
Programmer |
24.50 |
123 4th St |
Smith |
Waiter |
7.50 |
123 4th St |
Jones |
Programmer |
26.50 |
4 Moose Lane |
Jones |
Bricklayer |
34.50 |
4 Moose Lane |
Adams |
Analyst |
28.50 |
88 Tiger Circle |
|
|
|
|
is decomposed into 2NF:
Name + Job table
|
NAME AND JOB |
|
Name |
Job |
Salary |
|
|
|
Smith |
Welder |
14.75 |
Smith |
Programmer |
24.50 |
Smith |
Waiter |
7.50 |
Jones |
Programmer |
26.50 |
Jones |
Bricklayer |
34.50 |
Adams |
Analyst |
28.50 |
|
|
|
Name and Address (Employee info) table:
NAME AND ADDRESS
Name |
Address |
|
|
Smith |
123 4th St |
Jones |
4 Moose Lane |
Adams |
88 Tiger Circle |
|
|
Again, note the removal of unnecessary redundancy and the amelioration removal of possible anomalies.
Example of Non-3NF to 3NF
Here, transitive dependency is removed to a new table.
Non-3NF:
Name |
Address |
Project# |
Project-location |
|
|
|
|
Smith |
123 4th St |
101 |
Memphis |
Smith |
123 4th St |
102 |
Mobile |
Jones |
4 Moose Lane |
101 |
Memphis |
|
|
|
|
is decomposed into 3NF:
EMPLOYEE table:
EMPLOYEE
Name Address Project#
Smith |
123 4th St |
101 |
Smith |
123 4th St |
102 |
Jones |
4 Moose Lane |
101 |
|
|
|
PROJECT table:
PROJECT
Project# Project-location
101Memphis
102Mobile
101 Memphis
Again observe the removal of the transitive dependency and the anomaly problem.
There are more esoteric normal forms, but most databases will be well constructed if they are normalized to the 3NF. The intent here is to show the general process and merits of normalization.
Checkpoint 1.4
1.Define 1NF, 2NF, and 3NF.
2.Why do databases have to be normalized?
3.Why should we avoid having attributes with multiple values or repeating groups?