Building a filesystem like a google drive, a threaded comments system, or an employee management system always sounds like a daunting task. The data for these use-cases has recursive relationships. It can be organized in a treelike or hierarchical structure. A node may have many children and one parent. The top node which has no parent is called root. Your query patterns may involve querying individual nodes, all children of a node, siblings of a node, etc.
At first glance, one can assume that the best database to store such data would be a graph database. But this decision brings with it the overhead of learning, managing, and operating a new database. Not many teams have the bandwidth to do this.
This post looks at representing and managing hierarchical data structures in a relational database system like PostgreSQL, MySQL using an approach called Closure tables. Before we dive into closure tables, let us take a look at adjacency lists which are probably used by 90% of the applications out there with such needs.
Adjacency Lists
The data model looks as follows:
Hierarchical tree:
id (pk),
parent_id (fk) NULL,
name,
...
...
Here, the parent_id
references another row in the table itself and a foreign key constraint can be added to enforce it at the database level. The root of the tree has parent_id
set as null
.
This naive approach takes us quite far. You can do the following operations quite easily:
- Adding a node to the tree.
- Finding all immediate children of a node.
- Finding all siblings of a node.
- Moving a node anywhere in the tree.
If your application needs are all of the above, you can implement this approach and forget about it.
However, it is not a good approach when your application might need to do the following:
- Querying all children of a node.
This is a recursive query and depending upon the depth of the tree, it might be very time-consuming.
- Deleting a node is expensive.
Depending on your application needs, you might need to either delete the whole sub-tree underneath the node or promote one of the children as a parent.
Closure Table
The Closure Table solution is an elegant way of storing hierarchies that trade-off space complexity for query time efficiency. It involves storing all paths that exist in the tree in a separate table, unlike adjacency lists which only store the parent-child path.
Let us take a look at the tables:
Given a node in a tree, we would have one table storing information related to the node
Node:
id (pk),
name,
....
and another table called the closure table which stores information about the tree structure.
Closure Table:
id (pk),
parent (fk to node),
child (fk to node),
depth
To represent the following tree:
A -> B -> C
Our data would look like below
Node:
id | name |
---|---|
1 | A |
2 | B |
3 | C |
Closure Table (replacing the foreign key values with node names):
parent | child | depth |
---|---|---|
A | A | 0 |
B | B | 0 |
A | B | 1 |
C | C | 0 |
B | C | 1 |
A | C | 2 |
If you notice the data being represented here, it means that each node is a parent of itself and has an entry for each of its parents that live above it in the hierarchy.
Taking an example of node C, the above statement means that
- C is a child of itself at depth 0
- C is a child of B at depth 1
- C is a child of A at depth 2
Let’s take a look at how some common operations are implemented in a closure table:
- Inserting a node:
We need to insert one row in the closure table for each pair of nodes in the tree that share a parent-child relationship, even if they are separated by multiple levels in the tree and we need to insert a row for each node to reference itself.
At first look, it might seem like this would be too much work and there would be a lot of inserts whenever we are trying to insert another node. However, the databases have us covered there.
The inserts are a cartesian join query that any database should be able to execute easily. The cost is directly proportional to the size of the sub-tree under the new node. This means that whenever adding a new node at the leaf of the tree, this would insert rows for all of its parents. However, if you’re adding a new parent to a sub-tree, it will insert a row for all nodes in that sub-tree.
INSERT into closuretable(parent, child, depth)
SELECT p.parent, c.child, p.depth + c.depth+1
FROM closuretable p, closuretable c
WHERE p.child = PARENT_ITEM and c.parent = CHILD_ITEM
The cost of insertion is offset by what we gain when querying the tree.
- Retrieve all children of a node
SELECT node.*
FROM node
JOIN closuretable as t ON node.id = t.child
WHERE t.parent = NODE_ID;
- Retrieve all parents of a node
SELECT node.*
FROM node
JOIN closuretable as t ON node.id = t.parent
WHERE t.child = NODE_ID;
- Delete a node
There are 2 cases when deleting a node:
- Node is a leaf node
In this case, we can simply delete the node’s links to its parents.
DELETE FROM closuretable WHERE child = NODE_ID
- Node is the parent of a sub-tree. The sub-tree has to be deleted as well.
DELETE FROM closuretable
WHERE child IN (SELECT child
FROM closuretable
WHERE parent = NODE_ID);
- Move a node in the tree
This would involve changing the parent of a node. If the node has a sub-tree underneath it, then all of its children would need to be moved as well.
We first delete all the existing parent-child links in our closure table.
DELETE FROM closuretable
WHERE child IN
(
SELECT child
FROM closuretable
WHERE parent = NODE_ID
)
AND parent IN
(
SELECT parent
FROM closuretable
WHERE child = NODE_ID
AND parent != child
)
And then we insert new links for all the nodes in our sub-tree. This is very similar to when we were inserting a node in the tree.
INSERT INTO closuretable (parent, child, depth,)
SELECT supertree.parent, subtree.child,
supertree.depth + subtree.depth + 1
FROM closuretable AS supertree
CROSS JOIN closuretable AS subtree
WHERE supertree.parent = PARENT_IDS
AND subtree.child = CHILD_IDS
And that’s it, folks. The code examples in this post have been deliberately kept at SQL level for wide applicability. You can set up triggers for all of the table maintenance operations - insert, delete and call it a day.
If you want to implement closure tables in your favorite language/framework, please check out the following libraries which already do a great job of maintaining this for you.
Library | Language | Framework/ORM/DAO |
---|---|---|
ClosureTree | ruby | ruby on rails |
ClosureTable | php | laravel |
closure-table | php | laravel |
closure_table | elixir | ecto |
django-closuretree | python | django (does not look like its maintained) |
References
- SQL Antipatterns - Bill Karwin
- Bill Karwins presentation on hierarchical data
- Stack Overflow Thread
- Bill Karwin’s Blog post on closure tables
- dirtsimple.org
- Joe Celko’s Trees and Hierarchies in SQL for Smarties