Tuesday 2 December 2014

Concept 3: Binary Search Trees

Now I know this article was supposed to be here earlier, prior to rotations, but what can I say? I really wanted to write about rotations, and so I did. However, here's the promised article on Binary Search Trees.

Disclaimer: This is really long, so prepare yourself.

Let's start this off in as generic a manner as possible. What are trees? No, not those trees. I'm talking about the trees in computer science. Now trees are collections of data elements called nodes, where each node is connected to one or more nodes. When I cover graphs on this blog, you will see that trees are just a subset of the larger set of graphs. For now, however, know that the important distinguishing factor of a tree is that a tree does not have cycles; in other words, it is not possible to start from any node, follow its connections and end up back at the same node.

So what is a binary tree? Remember when I said that a tree consists of a collection of nodes that have one or more connections to other nodes? Well, a binary tree is a special case of that definition: the connections can be considered directed, every node has exactly one incoming connection (except for a special node called the 'root'), and every node has zero, one or two outgoing connections.

In pictures, a binary tree looks like this:


Notice how every node has one incoming connection? This property is true for all but the topmost (colored red) node. That node is called the 'root' as it has no 'parent'. Also, every node has zero, one, or two 'children', as shown by the number of outgoing connections. For ease of understanding, the number of outgoing connections for each node has been written in the node. Note that the bottom-most nodes have zero children each. Nodes with no children are termed leaf nodes, thus extending the tree analogy.
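If it helps to see this as a data structure, here is a minimal sketch in Python (the naming is mine, not from any particular library): each node stores its value and references to its children, plus a back-link to its parent that will come in handy when we get to deletion.

class Node:
    """A single node of a binary tree."""
    def __init__(self, value):
        self.value = value
        self.left = None    # left child, or None if absent
        self.right = None   # right child, or None if absent
        self.parent = None  # back-link to the parent; None for the root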

So what are Binary Search Trees?


Phew. Finally. Binary Search Trees are nothing but binary trees which follow ONE SIMPLE RULE:

For every node, the LEFT CHILD SUBTREE of the node consists of nodes with values LESS than the node's own value, and the RIGHT CHILD SUBTREE consists of nodes with values GREATER than it.

(EDIT: An error pointed out by +Aaditya Lakshmanan has now been rectified. Many internets to you!)

The above rule can also be stated for just a node and its immediate children: a node has a value greater than its left child and less than its right child. This local rule generalizes to the full subtree property above, provided insertions and deletions are performed as described below.

In other words, if you take the left child, the node itself, and the right child as L, N and R respectively, then L < N < R for every node N. Henceforth, I shall call the above rule the Binary Search Tree Property.
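One way to make the difference between the local rule and the full subtree property concrete is to write a small checker. Here is a sketch in Python (the is_bst name is my own, and it assumes the Node class sketched above): it carries lower and upper bounds down the tree, so it verifies the subtree version of the property, not just parent-versus-child.

def is_bst(node, lo=float("-inf"), hi=float("inf")):
    # An empty subtree trivially satisfies the property
    if node is None:
        return True
    # The node itself must lie strictly between the bounds inherited from its ancestors
    if not (lo < node.value < hi):
        return False
    # Left subtree: everything must stay below node.value
    # Right subtree: everything must stay above node.value
    return is_bst(node.left, lo, node.value) and is_bst(node.right, node.value, hi)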

Now in binary trees, we denote a subtree as a triangle labelled with a letter, where the letter represents the root node of that subtree. In other words, the binary tree shown below...


... can be represented as...


... with respect to node A and its child subtrees.

From here on, we shall be using the subtree notation to simplify work and make the representation uncluttered.

Therefore, if the above diagram were a Binary Search Tree, then we would represent the rule as B < A < C, meaning every value in subtree B is less than A, which in turn is less than every value in subtree C. Note that this rule holds for all nodes in the tree which have one or more children.

So what's the point? Why go through the hassle of this rule to store data? Well, the answer is simple. Binary Search Trees have an inherent advantage in, guess what? Yes, search. It's, like, in the name.

How does this advantage come about? Well, as long as we know the value of what element we are searching for (which, of course, we do), we can quickly search for its existence and position by a series of comparisons. At each comparison, in a well-balanced binary search tree, we can skip past roughly half the remaining elements. How this comes about will be explained.

Insertion


Before we do anything else, we have to learn how to insert a value in a BST (Yes, that is what I shall be calling it from now on). Insertion of a value is not a difficult task, and involves a couple of simple steps:

  • Find the point to insert, by filtering down the tree using the BST Rule.
  • Insert the node as a new leaf.

Yeah, it is that simple. All you have to do is to find that point. Now in order to do that, recall that the BST Property applies for all nodes, without exception. In other words, you can treat any node as the root of its own subtree and the same property will apply for it and its children. In other words, recursion.


The image above shows the insertion of the element '6' in the BST given. Note how we search for the correct point to insert at by moving left or right at each step depending on whether the inserted element is lesser or greater than the current element.

So look at the pseudocode below, carefully, and try to understand how the insertion procedure works.

insert_BST (x, root) :
    # input: x is the node to be inserted; returns the root of the updated subtree
    if (root == NULL) :
        # empty spot found: x becomes a new leaf here
        return x

    if x.value > root.value :
        root.right = insert_BST(x, root.right)
    else :
        root.left = insert_BST(x, root.left)
    return root
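If you want to try this out in a real language, here is how it might look in Python, assuming the Node class sketched earlier. Like the pseudocode, each call returns the root of the updated subtree so the caller can re-link it; maintaining the parent back-links is my addition, for the deletion routine later.

def insert_bst(root, node):
    # Empty spot found: node becomes a new leaf here
    if root is None:
        return node
    if node.value > root.value:
        root.right = insert_bst(root.right, node)
        root.right.parent = root
    else:
        root.left = insert_bst(root.left, node)
        root.left.parent = root
    return root

# Building a small BST by repeated insertion:
root = None
for v in [5, 3, 8, 2, 4, 7]:
    root = insert_bst(root, Node(v))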

Clear enough? Good. We can now proceed to searching.

Searching


The more astute among you will have figured out that we have actually already covered searching. Confused? Recall point 1 of the insertion steps: we need to find the point of insertion. Yes, that is precisely how search in a BST works: we just move downwards from the root following the BST rule till we end up at either a NULL element (the requested node does not exist) or at the required element. Searching is done as follows:

search_BST (x, root) :
    # input: x is the value we are looking for
    if (root == NULL) :
        print "The node requested does not exist"
        return
    if x == root.value :
        print "Value found!"
        return
    if x > root.value :
        print "Go right"
        search_BST(x, root.right)
    else :
        print "Go left"
        search_BST(x, root.left)
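A Python version of the same walk might look like this (again assuming the earlier Node class; returning the matching node, or None, is my choice, whereas the pseudocode just prints):

def search_bst(root, value):
    # Fell off the tree: the value is not present
    if root is None:
        return None
    if value == root.value:
        return root
    if value > root.value:
        return search_bst(root.right, value)   # go right
    return search_bst(root.left, value)        # go left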

Note that this is basically the same downward walk that the insertion step relies on, the difference being that insertion always goes all the way down to an empty (NULL) spot at the leaves, while a search may terminate early, before reaching a leaf, as soon as the node is found.

Those of you who have learnt binary search in sorted arrays will be seeing the similarity in the procedure already. Instead of starting from the middle element, we start from the root, and then it's basically the same steps.

Deletion


Sometimes, it is necessary to delete an element in a BST. Deletion in BSTs is the most complicated subtopic to understand, so pay attention.

Deletion involves some simple overall steps:

  1. Find the element to delete using a search.
  2. Assuming the element exists, it can be in one of three states:
    1. It may be a leaf node, i.e. have no children.
    2. It may have one child, either a left or a right child.
    3. It may have both a left and a right child.
  3. Depending on which of the above cases the node falls into, its deletion will follow one of three different methods.

The three cases of deletion are given below:

Case 1: Leaf Node

Deletion of leaf nodes is quite direct. All you do is to remove the reference to the node and deallocate the space held by it. As deletion of a leaf node affects no other node (as it has no children), this case is straightforward.

Case 2: One child

Nodes with one child can be deleted by crossing over the node in the link from its parent node. If the preceding sentence does not make sense to you, have a graphic.


In both the left and the right case shown in the image, the node colored red is to be deleted, with the green subtree as its only child. Notice that we just link past the node to be deleted: its parent now links to its child.

Case 3: Two Children

Okay, so here is the true issue. After looking at the first two cases, I think you might be beginning to see why a node with two children cannot be simply deleted. Firstly, it is not possible to directly delete the node, as its children have to now be relocated. Also, it is not possible to link past the node, as it has two children, so which child can you link its parent to?

Thus, the strategy for deleting a node with two children is quite different. This involves knowledge of something called the inorder set of the BST.

So what's the inorder set? It is basically exactly as it sounds: it is the sorted set of elements in the tree. If you remember the BST Property, for every node, the left child is less than the node and the right child is greater than it. From this, we can extend it to saying that all the elements in the left subtree are less than the node while all the elements in the right subtree are greater than the node. Again, extending this for a particular subtree, we can see that the least valued element in a subtree is the leftmost element, while the greatest valued element is the rightmost element of that subtree.

Following the above discussion, we can see that the inorder set will contain the elements in the order in which they appear from left to right in the tree.


If you follow the little dude's line of sight as he looks directly upwards, and then move the dude towards the right, you end up with the inorder set of the given BST. For the BST above it is D, B, E, A, F, C, G, with the elements on the left less than the elements on the right (Check it!).
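In code, producing the inorder set is a short recursive walk: visit the left subtree, then the node, then the right subtree. A Python sketch (the generator style is just my preference), assuming the Node class from earlier:

def inorder(root):
    # Yields the values of the tree in sorted order (the inorder set)
    if root is not None:
        yield from inorder(root.left)    # everything smaller comes first
        yield root.value
        yield from inorder(root.right)   # everything larger comes last

# For the example tree above, list(inorder(root)) would give the nodes
# in the order D, B, E, A, F, C, G.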

Now assume all the elements were placed on a number line. In a number line, lesser elements appear on the left. Hence, the number line would contain the same order: D-B-E-A-F-C-G. Assume we had to delete an element from this number line. Now when we do that, the element has to be replaced by another element already present in the number line. This element has to be chosen such that the replacement does not violate the number line property, namely that the elements from left to right must stay in order. Thus, we can choose either the element to the immediate left or the element to the immediate right of the element to be deleted, and move it into the deleted element's position.

Similarly, in a BST, to delete an element which has two children, we must replace it by another node that appears either RIGHT BEFORE or RIGHT AFTER it in the inorder set. In other words, this element is the greatest element before the node or the smallest element after the node.

Recall that all elements before (less than) the node lie in the left-child-subtree of the node, while all elements after (greater than) the node lie in the right-child-subtree of the node. Also recall that the smallest element of a subtree is the leftmost node of that subtree, while the largest is the rightmost.

Thus, we can see that the element with which we replace a node X is the rightmost node of X's left child, or the leftmost node of X's right child. The former is known as the inorder predecessor while the latter is called the inorder successor of the node X.
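Finding these two candidates is just a walk down one side of a subtree. A small Python sketch (the helper names rightmost and leftmost are mine, chosen to match the pseudocode further down):

def rightmost(node):
    # Inorder predecessor of x (when x has a left child) is rightmost(x.left)
    while node.right is not None:
        node = node.right
    return node

def leftmost(node):
    # Inorder successor of x (when x has a right child) is leftmost(x.right)
    while node.left is not None:
        node = node.left
    return node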

Thus, we can choose to replace the node by either its inorder predecessor or successor. Let us, for the purposes of this discussion, take the inorder predecessor.

Now, the inorder predecessor of a node is the rightmost node of its left subtree. Following that logic, the predecessor CANNOT have a right child (as then it will not be the rightmost node). Thus, the only cases are that the inorder predecessor can be a leaf node or a one-child node (with a left child). Therefore, deletion proceeds according to the following graphic:


In the above graphic, we are trying to delete the red node, which has two children, as shown. First we locate the inorder predecessor by finding the rightmost node of the left subtree. It is denoted by the blue dot in the corner of the left subtree. Then we copy its value to the node to be deleted (the 'Replace' step). Then we delete the original predecessor node (as there are two copies of it now, as shown), which will follow one of the first two cases, depending on whether it has no children or a left child.

The complete deletion algorithm is as given below.

del_BST (x, root) :
    # input: We assume that x is the node to be deleted, already found using search
    # Also, nodes have a parent attribute

    xpar := x.parent

    # Case 1: x is a leaf node

    if (x.left is NULL and x.right is NULL) :
        if xpar is NULL :
            # We are trying to delete the tree root
            root = NULL
            del x
            return
        if x == xpar.left :
            xpar.left = NULL
            del x
            return
        else :
            xpar.right = NULL
            del x
            return

    # Case 3: x has two children

    else if (x.left is not NULL and x.right is not NULL) :

        inordPred := rightmost(x.left)
        # Copy the predecessor's value into x (the 'Replace' step)
        x.value = inordPred.value

        # The predecessor's value is now duplicated, so delete the predecessor node;
        # it has at most a left child, so this falls into Case 1 or Case 2
        del_BST (inordPred, root)
        return

    # Case 2: x has exactly one child

    else :
        if xpar is NULL :
            # We are trying to delete the tree root
            if (x.left is not NULL) :
                root = x.left
                root.parent = NULL
                del x
                return
            else :
                root = x.right
                root.parent = NULL
                del x
                return

        else :
            # x.parent is not NULL: link xpar past x, to x's only child
            if (x.left is not NULL) :
                if (x == xpar.left) :
                    xpar.left = x.left
                    x.left.parent = xpar
                    del x
                else :
                    xpar.right = x.left
                    x.left.parent = xpar
                    del x
            else :
                if (x == xpar.left) :
                    xpar.left = x.right
                    x.right.parent = xpar
                    del x
                else :
                    xpar.right = x.right
                    x.right.parent = xpar
                    del x
            return

Take a moment to understand the pseudocode before trying to implement it on your own in a program. The three cases are delineated clearly. In the above algorithm, we assume each node also has a reference to its parent, for ease of manipulation. In practice, this is not necessary, and the node's parent can be found in the search step itself.
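To illustrate that last point, here is a sketch of the same algorithm in Python without parent pointers: each recursive call simply returns the (possibly new) root of its subtree, and the caller re-links it. This is my own condensed version of the three cases above, assuming unique values and the earlier Node class.

def delete_bst(root, value):
    if root is None:
        return None                          # value not found: nothing to delete
    if value < root.value:
        root.left = delete_bst(root.left, value)
    elif value > root.value:
        root.right = delete_bst(root.right, value)
    else:
        # Cases 1 and 2: zero or one child -- link past the node
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case 3: two children -- copy the inorder predecessor's value,
        # then delete the predecessor from the left subtree
        pred = root.left
        while pred.right is not None:        # rightmost node of the left subtree
            pred = pred.right
        root.value = pred.value
        root.left = delete_bst(root.left, pred.value)
    return root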

Notes


So as I had said earlier, the main strength of a Binary Search Tree appears when we have to search for an element, as each comparison can potentially eliminate roughly half of the remaining elements. However, consider the following order of insertion:

1, 2, 4, 6, 7, 8

In the above order, there is something peculiar. Yes, all the elements are already in order. Hence when '1' is inserted, it becomes the root; when '2' is inserted, it goes to its right; when '4' is inserted, it goes to 2's right; and this process continues till we get this binary search tree:


Notice something weird? It's not much of a tree, is it? Notice that it looks exactly like a linked list. If we search for an element in the above tree, say '6', then we see that the search does not eliminate any subtrees at any node. It requires us to check 4 nodes to reach node '6', and 6 nodes to reach node '8'. In other words, a poorly balanced tree like the one above, while satisfying the BST Property, still requires O(n) time to search for an element.
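You can see this degeneration directly by measuring the tree's height. A quick Python check, reusing the insert_bst sketch from the insertion section (the height helper is mine):

def height(root):
    # Number of nodes on the longest root-to-leaf path
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

root = None
for v in [1, 2, 4, 6, 7, 8]:     # already sorted: the worst case for a plain BST
    root = insert_bst(root, Node(v))
print(height(root))               # prints 6 -- one level per element, just like a linked list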

To combat this lack of balance, we use concepts like AVL Trees and Red-Black Trees. But more on those, another time.
