SQL Techniques - Third Normal Form (3NF)
Relational databases are built on the concept of entities and their relationships to each other. From time to time the relationship between tables is a Table1 < N – N> Table2 or table 1 data relates to table2 data in a non unique form. In this case a bridging table is a good solution to create the appropriate dependency relationship. This is known as Third Normal Form (3NF). The following is an example situation along with the structures and queries to find the correct corresponding data.
Problem
- You have a
Documents
table that contains some text - You have a
Keywords
table that contains the indexed and relevant words that may be within the text of a document
Say you have a Documents
that contains the sentence. Monday was not my best day.
Now you also have Keywords
that are for the days of the week, therefore Monday
is a record within that table.
How to efficiently process the documents for keyword lookup within the Documents
table. In this case we are introducing a new table DocumentKeywords
this is the bridging table between the two entities and records are added or deleted from this table when either the Documents, Keywords
tables are updated.
Here is a diagram of the current database layout
Diagram
Setup (Sqlite)
drop table if exists DocumentKeywords;
drop table if exists Keywords;
drop table if exists Documents;
create table Documents (
id INTEGER primary key autoincrement,
doc_date timestamp
);
create table Keywords (
id INTEGER primary key autoincrement,
keyword text
);
create table DocumentKeywords (
doc_id INTEGER,
keyword_id INTEGER,
FOREIGN KEY("doc_id") REFERENCES "Documents"("id") ON DELETE CASCADE,
FOREIGN KEY("keyword_id") REFERENCES "Keywords"("id") ON DELETE CASCADE
);
insert into Documents (doc_date)
values ('2020-01-01'),
('2020-01-02'),
('2020-01-03');
insert into Keywords (keyword)
values ('Blue'),
('Yellow'),
('Red'),
('Orange');
Filling in some relationships only for illustrative purposes
insert into DocumentKeywords(doc_id, keyword_id)
values (2, 1),
(2, 2),
(4, 4),
(3, 1);
From the above you see we have 2 documents that share unique keywords. Document 2, 3
both have Yellow
and we have 1 document that contains multiple keywords Document 2
contains Yellow, Blue
Query
How do we identify all documents that contain both keywords Yellow, Blue
?
First get all the keyword ids that are within documents for the above keywords
SELECT id
FROM Keywords
where keyword IN ('Blue', 'Yellow')
Now we need to make sure these are joined to match all document keywords
select dk.doc_id
from DocumentKeywords dk
inner join
(
SELECT id
FROM Keywords
where keyword IN ('Blue', 'Yellow')
) k
on dk.keyword_id = k.id
Ok that works, but the problem is that it returns all matches not the ones in both
doc_id |
---|
2 |
2 |
3 |
In the above since there are two keywords then we only want the matching document ids that are matched at least 2 times. Therefore do a group by with a having count(*) > 1.
select dk.doc_id
from DocumentKeywords dk
inner join
(
SELECT id
FROM Keywords
where keyword IN ('Blue', 'Yellow')
) k
on dk.keyword_id = k.id
group by dk.doc_id
having count(*) > 1;
doc_id |
---|
2 |
Ok we have the correct result now make it slick with a Common Table Expression (CTE)
WITH mult as (select dk.doc_id
from DocumentKeywords dk
inner join
(SELECT id
FROM Keywords
where keyword IN ('Blue', 'Yellow')) k
on dk.keyword_id = k.id
group by dk.doc_id
having count(*) > 1)
SELECT d.*
FROM Documents d
inner join mult
on d.id = mult.doc_id;
id | doc_date |
---|---|
2 | 2020-01-02 |