Closure Table

Building a filesystem like a google drive, a threaded comments system, or an employee management system always sounds like a daunting task. The data for these use-cases has recursive relationships. It can be organized in a treelike or hierarchical structure. A node may have many children and one parent. The top node which has no parent is called root. Your query patterns may involve querying individual nodes, all children of a node, siblings of a node, etc.

At first glance, one can assume that the best database to store such data would be a graph database. But this decision brings with it the overhead of learning, managing, and operating a new database. Not many teams have the bandwidth to do this.

This post looks at representing and managing hierarchical data structures in a relational database system like PostgreSQL, MySQL using an approach called Closure tables. Before we dive into closure tables, let us take a look at adjacency lists which are probably used by 90% of the applications out there with such needs.

Adjacency Lists

The data model looks as follows:

Hierarchical tree:
    id (pk),
    parent_id (fk) NULL,
    name,
    ...
    ...

Here, the parent_id references another row in the table itself and a foreign key constraint can be added to enforce it at the database level. The root of the tree has parent_id set as null.

This naive approach takes us quite far. You can do the following operations quite easily:

Adding a node to the tree.
Finding all immediate children of a node.
Finding all siblings of a node.
Moving a node anywhere in the tree.

If your application needs are all of the above, you can implement this approach and forget about it.

However, it is not a good approach when your application might need to do the following:

Querying all children of a node.

This is a recursive query and depending upon the depth of the tree, it might be very time-consuming.

Deleting a node is expensive.

Depending on your application needs, you might need to either delete the whole sub-tree underneath the node or promote one of the children as a parent.

Closure Table

The Closure Table solution is an elegant way of storing hierarchies that trade-off space complexity for query time efficiency. It involves storing all paths that exist in the tree in a separate table, unlike adjacency lists which only store the parent-child path.

Let us take a look at the tables:

Given a node in a tree, we would have one table storing information related to the node

Node:
    id (pk),
    name,
    ....

and another table called the closure table which stores information about the tree structure.

Closure Table:
    id (pk),
    parent (fk to node),
    child (fk to node),
    depth

To represent the following tree:

A -> B -> C

Our data would look like below

Node:

id	name
1	A
2	B
3	C

Closure Table (replacing the foreign key values with node names):

parent	child	depth
A	A	0
B	B	0
A	B	1
C	C	0
B	C	1
A	C	2

If you notice the data being represented here, it means that each node is a parent of itself and has an entry for each of its parents that live above it in the hierarchy.

Taking an example of node C, the above statement means that

C is a child of itself at depth 0
C is a child of B at depth 1
C is a child of A at depth 2

Let’s take a look at how some common operations are implemented in a closure table:

Inserting a node:

We need to insert one row in the closure table for each pair of nodes in the tree that share a parent-child relationship, even if they are separated by multiple levels in the tree and we need to insert a row for each node to reference itself.

At first look, it might seem like this would be too much work and there would be a lot of inserts whenever we are trying to insert another node. However, the databases have us covered there.

The inserts are a cartesian join query that any database should be able to execute easily. The cost is directly proportional to the size of the sub-tree under the new node. This means that whenever adding a new node at the leaf of the tree, this would insert rows for all of its parents. However, if you’re adding a new parent to a sub-tree, it will insert a row for all nodes in that sub-tree.

INSERT into closuretable(parent, child, depth)
SELECT p.parent, c.child, p.depth + c.depth+1
  FROM closuretable p, closuretable c
 WHERE p.child = PARENT_ITEM and c.parent = CHILD_ITEM

The cost of insertion is offset by what we gain when querying the tree.

Retrieve all children of a node

SELECT node.*
FROM node
  JOIN closuretable as t ON node.id = t.child
WHERE t.parent = NODE_ID;

Retrieve all parents of a node

SELECT node.*
FROM node
  JOIN closuretable as t ON node.id = t.parent
WHERE t.child = NODE_ID;

Delete a node

There are 2 cases when deleting a node:

Node is a leaf node

In this case, we can simply delete the node’s links to its parents.

DELETE FROM closuretable WHERE child = NODE_ID

Node is the parent of a sub-tree. The sub-tree has to be deleted as well.

DELETE FROM closuretable
WHERE child IN (SELECT child
                     FROM closuretable
                     WHERE parent = NODE_ID);

Move a node in the tree

This would involve changing the parent of a node. If the node has a sub-tree underneath it, then all of its children would need to be moved as well.

We first delete all the existing parent-child links in our closure table.

DELETE FROM closuretable
        WHERE child IN
            (
                SELECT child
                FROM closuretable
                WHERE parent = NODE_ID
            )
        AND parent IN
            (
                SELECT parent
                FROM closuretable
                WHERE child = NODE_ID
                    AND parent != child
            )

And then we insert new links for all the nodes in our sub-tree. This is very similar to when we were inserting a node in the tree.

INSERT INTO closuretable (parent, child, depth,)
        SELECT supertree.parent, subtree.child,
                supertree.depth + subtree.depth + 1
        FROM closuretable AS supertree
        CROSS JOIN closuretable AS subtree
        WHERE supertree.parent = PARENT_IDS
        AND subtree.child = CHILD_IDS

And that’s it, folks. The code examples in this post have been deliberately kept at SQL level for wide applicability. You can set up triggers for all of the table maintenance operations - insert, delete and call it a day.

If you want to implement closure tables in your favorite language/framework, please check out the following libraries which already do a great job of maintaining this for you.

Library	Language	Framework/ORM/DAO
ClosureTree	ruby	ruby on rails
ClosureTable	php	laravel
closure-table	php	laravel
closure_table	elixir	ecto
django-closuretree	python	django (does not look like its maintained)

References

SQL Antipatterns - Bill Karwin
Bill Karwins presentation on hierarchical data
Stack Overflow Thread
Bill Karwin’s Blog post on closure tables
dirtsimple.org
Joe Celko’s Trees and Hierarchies in SQL for Smarties

Adjacency Lists#

Closure Table#

References#

Related articles:

Adjacency Lists

Closure Table

References