
My Blog

One new concept a day (at the least). One new thing to explore in the exciting world of Computer Science each day. That's the Programmer in Training promise :P

Code and learn, boys. Code and learn.


Tuesday, 9 December 2014

Concept 7: An Intro to AVL Trees

So I recall talking about BSTs once, and ending it on a note of concern. The problem with a BST was that, following its almighty rule, it was still possible to create something akin to a linked list. This completely destroyed the whole reason behind using a BST, which was the fact that a BST can potentially cut the time for searching for a particular node down to O(lg n) (and yeah, the lg is not a mistake, it is nothing but the logarithm to the base 2).

So how do we ensure our dear friend the Binary Search Tree retains its O(lg n) time required for searching? Enter the concept of balancing and the world of balanced binary search trees. In this episode, we are gonna be taking a look at the basics of AVL Trees.

Prerequisites

Ah, the prerequisites. Well, for today's post, all you are gonna be needing is BSTs, which you can read about right here. Other than that, you may need to hold on to your wits, because this is not exactly a very simple topic.

The Beginning

In the beginning, there were Binary Search Trees. And they were good. But two dudes, Adelson-Velsky and Landis (that's where the A, V and L come from), realized that they were not good enough! So they set out to create a new data structure, inventively called AVL Trees (and confused some people: why would you do this? Most people don't even know what AVL stands for. At least you could have named it after its purpose or something). And then began a revolution of more streamlined and effective algorithms, blah blah blah.

So anyway, cutting to the chase, what necessitated the AVL data structure was outlined in, like, the first paragraph of this post. What is more important is HOW it works. The AVL data structure does what we want it to do: it reduces searching to a worst-case scenario of O(lg n). What this means is that the BST data structure, which potentially allowed a worst-case O(n) scenario, has been drastically reduced in complexity (hehe, time-wise, m'boy! Programmers always have the worst of it when reducing so-called "complexity"). Therefore, basically, we are gonna have faster searches, guaranteed.

What about insertion and deletion? The incredible part about AVL Trees is that even these take O(lg n) time. Sure, it may not sound as good as the constant-time insertion of linked lists (which are meh on searching anyway), but this combination of guarantees makes AVL Trees pretty frickin' awesome in terms of data structures.

How it works

Okay, so the whole point of this post is to get you acquainted with AVL Trees. I'm not going to cover the mechanics of insertion and deletion here, because those concepts are quite long and hard to wrap your brain around, and have the potential to make this post a gazillion miles long (notwithstanding my penchant for writing, like, a lot). However, I will cover the basics of what an AVL Tree looks like and how it works.

Before we begin, I want to make one thing clear. I am using a Height Convention of 1. What this means is that a single leaf node possesses a height of 1 in my book. Now I know a lot of you have been taught otherwise, with heights of 0 and 2 and whatnot, but forget about all that and follow this convention when reading this post. It will make further discussion a lot more painless.

For those of you who don't know what the height of a BST is, first of all, shame on you. I know I did not cover that in my post on BSTs, but I never said you have to stop there. Google stuff and learn! It's good for you. Secondly, the definition is basically this:

Height(node x) = length of the longest path from x down to a leaf, counting the nodes on that path (including x itself)

I am using the "including" convention here. A lot of people use the "excluding" convention (which I believe is the prevailing convention) but I don't like that. According to that, the height of a single leaf node is 0, which is plainly absurd. So no, I am not going down that path.

Anyway, AVL Trees are BSTs (which means the BST Rule applies), with one more rule:

For every node in an AVL Tree, the height difference between its left and right subtrees is either 1, 0 or -1.

Confused? Here's what it means. Consider the node A and its subtrees B and C below.

In the above, remember that B and C are the topmost nodes of those subtrees (that's the subtree notation). Now in an AVL tree, height(B) - height(C) = 0 or 1 or -1, and this is true for all nodes A in the AVL tree.

In other words, no matter which node you take, its two subtrees either have equal height, or one is taller than the other by exactly one level. Put simply, every node is almost completely balanced.
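
If you like seeing things in code, here's a quick Python sketch of that check (my own illustration; the Node record and function names are just assumptions for the example), using my height-of-1 convention:

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def height(node):
    # Height Convention of 1: a lone leaf has height 1,
    # and an empty (NULL) subtree has height 0
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def is_avl_balanced(node):
    # The balance factor is height(left subtree) - height(right subtree);
    # the AVL rule demands it be -1, 0 or 1 at EVERY node
    if node is None:
        return True
    if height(node.left) - height(node.right) not in (-1, 0, 1):
        return False
    return is_avl_balanced(node.left) and is_avl_balanced(node.right)

Call is_avl_balanced on the root and you know whether the whole tree satisfies the AVL rule. (This naive version recomputes heights over and over, which is fine for checking; real AVL implementations store the height, or the balance factor, inside each node.)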

This rule above makes AVL Trees strictly height-balanced. For example, this is an AVL Tree:

While this is NOT an AVL Tree (spot why!):

This is also an AVL Tree:

While this isn't:

From the above examples, it should be plainly clear that an AVL Tree enforces a sort of "strict" balancing and does not really allow any linked-listey formations (look at the fourth image, immediately above: the thing looks like two linked lists joined at the top). This limits the height of the tree to reasonable proportions relative to the number of nodes, which in turn results in faster searching, as we have to traverse fewer levels to reach (or rule out) the node we want.

Now, while working with AVL Trees, you may wonder how we enforce this height balance. The method is quite simple: we change the root of the offending subtree whenever the tree goes out of balance. For a simple example, consider a tree containing three elements: 5, 6 and 7. These can be arranged into a binary search tree in several ways, three of which are shown below:

We see that all three are valid BSTs, with elements on the left less than elements on the right. However, out of the three, only the middle one is a valid AVL Tree, i.e. well-balanced. Therefore, as a rule of thumb, we try to make the middle value (by sorted order) the root of the tree whenever balancing is concerned.

How that is done is beyond the scope of this post. For now, however, know that this is what we must ensure when dealing with AVL Trees. Just like normal BSTs, an AVL Tree must satisfy the AVL Property at all stages (along with the BST Property itself... AVL Trees are a subset of BSTs).

Worst Case Scenarios

Even in an AVL Tree, there IS a worst case of O(lg n). This worst case shows up in one particular form of an AVL Tree: one where no node has a balance factor of 0. In other words, the tree is as staggered as it can get: every node has a balance factor of either 1 or -1. This sort of tree is called a Fibonacci Tree and looks sorta like this:


I did not make the above image. This is from Wikipedia. Either way, that shouldn't matter.

In case you haven't realized it yet, every node in the above image is labeled with its height, not its balance factor. The balance factor of each node above (except the leaf nodes) is 1.

These trees give the worst case of O(lg n). Why that is so will be explained when I cover insertion on this blog. Until then, know that AVL Trees, in general, are best suited for purposes where you have a (mostly) static tree on which you do loads of searches. For trees where you modify stuff more often, you use the big brother of all binary trees, the Almighty Red-Black Tree, which I am almost scared to write about because there is so much to write there.
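
For the curious, here's a tiny Python sketch (my own illustration, not anything official) that computes the minimum number of nodes an AVL tree of a given height can have, which is exactly what the Fibonacci Tree achieves, using my height-of-1 convention:

def min_avl_nodes(h):
    # Sparsest AVL tree of height h: the root's subtrees have
    # heights h-1 and h-2, so N(h) = N(h-1) + N(h-2) + 1,
    # with N(0) = 0 (empty) and N(1) = 1 (a lone leaf)
    if h <= 0:
        return 0
    if h == 1:
        return 1
    return min_avl_nodes(h - 1) + min_avl_nodes(h - 2) + 1

Since N(h) grows like the Fibonacci numbers (i.e. exponentially in h), the height of an AVL tree with n nodes is at most roughly 1.44 lg n, i.e. O(lg n), which is precisely where the guaranteed search bound comes from.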

But until then, I have two more posts to go regarding AVL Trees themselves, so I'm gonna get right to that. Meanwhile you guys can go Google it or something. Stay alive, stay alert, keep learning.


PS: That was totally lame.

PPS: I know I have a lot more posts to catch up on (4 more by my reckoning), but I've been having internet troubles for a day or so, so I'm really sorry about that. Plus, my original resolution requires me to learn a new concept every day, which I am doing. The concepts I am blogging about currently are stuff I learnt back during my semester. The newer concepts are going into the queue and will be here later. 'Til then, ciao!

Tuesday, 2 December 2014

Concept 3: Binary Search Trees

Now I know this article was supposed to be here earlier, prior to rotations, but what can I say? I really wanted to write about rotations, and so I did. However, here's the promised article on Binary Search Trees.

Disclaimer: This is really long, so prepare yourself.

Let's start this off in as generic a manner as possible. What are trees? No, not those trees. I'm talking about the trees in computer science. Now trees are collections of data elements called nodes, where each node is connected to one or more nodes. When I cover graphs on this blog, you will see that trees are just a subset of the larger set of graphs. For now, however, know that the important distinguishing factor of a tree is that a tree does not have cycles; in other words, it is not possible to start from any node, follow its connections and end up back at the same node.

So what is a binary tree? Remember when I said that a tree consists of a collection of nodes with one or more connections to other nodes? Well, a binary tree is a special case of that definition: its connections can be considered directed, every node has exactly one incoming connection (except for a special node called the 'root'), and every node has zero, one or two outgoing connections.

In pictures, a binary tree looks like this:


Notice how every node has one incoming connection? This property is true for all but the topmost (colored red) node. That node is called the 'root' as it has no 'parent'. Also, every node has zero, one, or two 'children', as shown by the number of outgoing connections. For ease of understanding, the number of outgoing connections for each node has been written in the node. Note that the bottom-most nodes have zero children each. Nodes with no children are termed leaf nodes, thus extending the tree analogy.
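
If it helps to see this as code, a binary tree node is just a little record with two child references. Here's a minimal Python sketch (the names are my own choice, nothing standard):

class Node:
    def __init__(self, value):
        self.value = value   # the data element stored in this node
        self.left = None     # outgoing connection to the left child
        self.right = None    # outgoing connection to the right child
        # a node whose left and right are both None is a leaf node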

So what are Binary Search Trees?


Phew. Finally. Binary Search Trees are nothing but binary trees which follow ONE SIMPLE RULE:

For every node, the LEFT SUBTREE of the node consists of nodes with values LESS than the node's own value, and the RIGHT SUBTREE of the node consists of nodes with values GREATER than it.

(EDIT: An error pointed out by +Aaditya Lakshmanan has now been rectified. Many internets to you!)

The above rule can also be stated for just a node and its children: a node has a value greater than its left child and less than its right child. This local rule generalizes to the full property above, provided the steps for insertion and deletion below are followed.

In other words, if you take the set of the left child, the node itself, and the right child as {L, N, R}, then L<N<R for all nodes N. Henceforth, I shall call the above rule the Binary Search Tree Property.
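
To make the property concrete, here's a short Python sketch (my own, assuming the little Node record from above) that verifies the Binary Search Tree Property for an entire tree by passing allowed bounds down the recursion:

def is_bst(node, low=None, high=None):
    # Every node's value must lie strictly between the bounds
    # inherited from its ancestors
    if node is None:
        return True
    if low is not None and node.value <= low:
        return False
    if high is not None and node.value >= high:
        return False
    return (is_bst(node.left, low, node.value) and
            is_bst(node.right, node.value, high))

Note that comparing each node against only its immediate children is not enough to verify an arbitrary tree; the inherited bounds are what capture the "whole subtree" version of the rule.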

Now in binary trees, we denote a subtree as a triangle labeled with a letter, where the letter represents the root node of that subtree. In other words, the binary tree shown below...


... can be represented as...


... with respect to node A and its child subtrees.

From here on, we shall be using the subtree notation to simplify work and make the representation uncluttered.

Therefore, if the above diagram was a Binary Search Tree, then we would represent the rule as B<A<C. Note that this rule holds for all nodes in the tree which have one or more children.

So what's the point? Why go through the hassle of this rule to store data? Well, the answer is simple. Binary Search Trees have an inherent advantage in, guess what? Yes, search. It's, like, in the name.

How does this advantage come about? Well, as long as we know the value of what element we are searching for (which, of course, we do), we can quickly search for its existence and position by a series of comparisons. At each comparison, in a well-balanced binary search tree, we can skip past roughly half the remaining elements. How this comes about will be explained.

Insertion


Before we do anything else, we have to learn how to insert a value into a BST (yes, that is what I shall be calling it from now on). Insertion of a value is not a difficult task, and involves a couple of simple steps:

  • Find the point to insert, by filtering down the tree using the BST Rule.
  • Insert the node as a new leaf.

Yeah, it is that simple. All you have to do is find that point. In order to do that, recall that the BST Property applies to all nodes, without exception. That means you can treat any node as the root of its own subtree, and the same property will apply to it and its children. In other words: recursion.


The image above shows the insertion of the element '6' in the BST given. Note how we search for the correct point to insert at by moving left or right at each step depending on whether the inserted element is lesser or greater than the current element.

So look at the pseudocode below, carefully, and try to understand how the insertion procedure works.

insert_BST (x, root) :
    # input: x is the node to be inserted
    if (root == NULL) :
        # the tree was empty: x becomes the root
        root = x
        return

    if x.value > root.value :
        if (root.right == NULL) :
            root.right = x        # found the spot: x becomes a new leaf
        else :
            insert_BST(x, root.right)
    else :
        if (root.left == NULL) :
            root.left = x         # found the spot: x becomes a new leaf
        else :
            insert_BST(x, root.left)
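
If you'd rather run it, here's an equivalent Python version (a sketch assuming the minimal Node record from earlier; the function returns the subtree's root, so the empty-tree case needs no reference trickery):

def insert_bst(root, value):
    # returns the (possibly new) root of this subtree after insertion
    if root is None:
        return Node(value)       # empty spot found: the value becomes a new leaf
    if value > root.value:
        root.right = insert_bst(root.right, value)
    else:
        root.left = insert_bst(root.left, value)
    return root

# Example: build a small BST, then insert 6 into it
root = None
for v in [5, 3, 8, 7]:
    root = insert_bst(root, v)
root = insert_bst(root, 6)   # filters right of 5, left of 8, left of 7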

Clear enough? Good. We can now proceed to searching.

Searching


The more astute among you will have figured out that we have actually already covered searching. Confused? Recall point 1 of the insertion steps: we need to find the point of insertion. That is precisely how search in a BST works: we just move downwards from the root following the BST rule till we end up at either a NULL element (the requested node does not exist) or at the required element. Searching is done as follows:

search_BST (x, root) :
    # input: x is the value we are looking for
    if (root == NULL) :
        print "The node requested does not exist"
        return
    if x == root.value :
        print "Value found!"
        return
    if x > root.value :
        print "Go right"
        search_BST(x, root.right)
    else :
        print "Go left"
        search_BST(x, root.left)

Note that this is basically a modification of the "find the spot" submodule used by insertion, the difference being that insertion always goes all the way down to a leaf, while a search may terminate early, before reaching a leaf, once the node is found.

Those of you who have learnt binary search on sorted arrays will already be seeing the similarity. Instead of starting from the middle element, we start from the root, and then it's basically the same steps.

Deletion


Sometimes, it is necessary to delete an element in a BST. Deletion in BSTs is the most complicated subtopic to understand, so pay attention.

Deletion involves some simple overall steps:

  1. Find the element to delete using a search.
  2. Assuming the element exists, it can be in one of three states:
    1. It may be a leaf node, i.e. have no children.
    2. It may have one child, either a left or a right child.
    3. It may have both a left and a right child.
  3. Depending on which case of the above the node belongs to, its deletion will follow three different methods.

The three cases of deletion are given below:

Case 1: Leaf Node

Deletion of leaf nodes is quite direct. All you do is remove the reference to the node and deallocate the space held by it. As the deletion of a leaf node affects no other node (it has no children), this case is straightforward.

Case 2: One child

Nodes with one child can be deleted by linking the parent past the node, straight to the node's only child. If the preceding sentence does not make sense to you, have a graphic.


In both the left and the right case shown in the image, the node colored red is to be deleted, with the green subtree as its only child. Notice that we just link past the node to be deleted: its parent now links to its child.

Case 3: Two Children

Okay, so here is the true issue. After looking at the first two cases, I think you might be beginning to see why a node with two children cannot be simply deleted. Firstly, it is not possible to directly delete the node, as its children have to now be relocated. Also, it is not possible to link past the node, as it has two children, so which child can you link its parent to?

Thus, the strategy for deleting a node with two children is quite different. This involves knowledge of something called the inorder set of the BST.

So what's the inorder set? It is basically exactly as it sounds: it is the sorted set of elements in the tree. If you remember the BST Property, for every node, the left child is less than the node and the right child is greater than it. From this, we can extend it to saying that all the elements in the left subtree are less than the node while all the elements in the right subtree are greater than the node. Again, extending this for a particular subtree, we can see that the least valued element in a subtree is the leftmost element, while the greatest valued element is the rightmost element of that subtree.

Following the above discussion, we can see that the inorder set contains the elements in the order they are visible from left to right.


If you follow the little dude's line of sight as he looks directly upwards, and then move the dude towards the right, you end up with the inorder set of the given BST. For the BST above it is D, B, E, A, F, C, G, with the elements on the left less than the elements on the right (Check it!).
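
The little dude is performing what is called an inorder traversal: go left, visit the node, go right. Here's a quick Python sketch of it (again assuming the Node record from earlier):

def inorder(node, out):
    # appends the values in left-to-right (sorted) order
    if node is None:
        return
    inorder(node.left, out)
    out.append(node.value)
    inorder(node.right, out)

# For any valid BST, the collected list comes out sorted:
elements = []
inorder(root, elements)
print(elements)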

Now assume all the elements were placed on a number line. On a number line, lesser elements appear on the left. Hence, the number line would contain the same order: D-B-E-A-F-C-G. Suppose we had to delete an element from this number line. When we do that, the element has to be replaced by another element already present on the number line, chosen so that the replacement does not violate the number line property, namely that the elements stay in order from left to right. Thus, we can choose either the element to the immediate left, or the element to the immediate right, of the element to be deleted, and move it into the vacated position.

Similarly, in a BST, to delete an element which has two children, we must replace it by the node that appears either RIGHT BEFORE or RIGHT AFTER it in the inorder set. In other words, this element is the greatest element before the node, or the smallest element after the node.

Recall that all elements before (less than) the node lie in the left-child-subtree of the node, while all elements after (greater than) the node lie in the right-child-subtree of the node. Also recall that the smallest element of a subtree is the leftmost node of that subtree, while the largest is the rightmost.

Thus, we can see that the element with which we replace a node X is the rightmost node of X's left child, or the leftmost node of X's right child. The former is known as the inorder predecessor while the latter is called the inorder successor of the node X.

Thus, we can choose to replace the node by either its inorder predecessor or successor. Let us, for the purposes of this discussion, take the inorder predecessor.

Now, the inorder predecessor of a node is the rightmost node of its left subtree. Following that logic, the predecessor CANNOT have a right child (as then it will not be the rightmost node). Thus, the only cases are that the inorder predecessor can be a leaf node or a one-child node (with a left child). Therefore, deletion proceeds according to the following graphic:


In the above graphic, we are trying to delete the red node, which has two children, as shown. First we locate the inorder predecessor by finding the rightmost node of the left subtree. It is denoted by the blue dot in the corner of the left subtree. Then we copy its value to the node to be deleted (the 'Replace' step). Then we delete the original predecessor node (as there are two copies of it now, as shown), which will follow one of the first two cases, depending on whether it has no children or a left child.
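
One small building block before the full algorithm: the pseudocode below uses a rightmost() helper to locate the inorder predecessor. It's just a walk down the right spine of a subtree (a Python sketch of mine, assuming the usual Node record):

def rightmost(node):
    # keep following right children until there are none left;
    # that node is the largest element in this subtree
    while node.right is not None:
        node = node.right
    return node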

The complete deletion algorithm is as given below.

del_BST (x, root) :
    # input: We assume that x is the node to be deleted, already found using search
    # Also, nodes have a parent attribute

    xpar := x.parent
        
    # Case 1:

    if (x.left is NULL and x.right is NULL) :
        if xpar is NULL :
            # We are trying to delete the tree root
            root = NULL
            del x
            return
        if x == xpar.left :
            xpar.left = NULL
            del x
            return
        else :
            xpar.right = NULL
            del x
            return

    # Case 3:

    else if (x.left is not NULL and x.right is not NULL) :
        
        inordPred := rightmost(x.left)
        x.value = inordPred.value
        
        del_BST (inordPred, root)
        return
        
    # Case 2:

    else :
        if xpar is NULL :
            # We are trying to delete the tree root
            if (x.left is not NULL) :
                root = x.left
                root.parent = NULL
                del x
                return
            else :
                root = x.right
                root.parent = NULL
                del x
                return

        else :
            # x.parent is not NULL
            if (x.left is not NULL) :
                if (x == xpar.left) :
                    xpar.left = x.left
                    x.left.parent = xpar
                    del x
                else :
                    xpar.right = x.left
                    x.left.parent = xpar
                    del x
            else :
                if (x == xpar.left) :
                    xpar.left = x.right
                    x.right.parent = xpar
                    del x
                else :
                    xpar.right = x.right
                    x.right.parent = xpar
                    del x
            return

Take a moment to understand the pseudocode before trying to implement it on your own in a program. The three cases are delineated clearly. In the above algorithm, we assume each node also has a reference to its parent, for ease of manipulation. In practice, this is not necessary: the node's parent can be remembered during the search step itself.

Notes


So, as I said earlier, the main strength of a Binary Search Tree appears when we have to search for an element, as it can potentially halve the number of elements we have to go through at every step. However, consider the following order of insertion:

1, 2, 4, 6, 7, 8

In the above order, there is something peculiar. Yes, all the elements are already in order. Hence when '1' is inserted, it becomes the root; when '2' is inserted, it goes to its right; when '4' is inserted, it goes to 2's right; and this process continues till we get this binary search tree:


Notice something weird? It's not much of a tree, is it? It looks exactly like a linked list. If we search for an element in the above tree, say '6', we see that the search does not eliminate any subtree at any node. It requires us to check 4 nodes to reach node '6', and 6 nodes to reach node '8'. In other words, a poorly balanced tree like the one above, while satisfying the BST Property, still requires O(n) time to search for an element.

To combat this lack of balance, we use concepts like AVL Trees and Red-Black Trees. But more on those, another time.

Friday, 28 November 2014

Concept 1: Rotations in Binary Search Trees

Now everybody knows good ole Binary Search Trees, right? No? Okay, for those of you who are new to that, I will probably write about it in another post soon (EDIT: That post is now UP!). In the meantime, you may get your arse out of my blog and go Google it or something. For the others who have already been introduced to that concept, I bring you another, simple, quite innocuous little topic: rotations.

Now rotations are used to shift two nodes in a BST relative to each other, while maintaining the Binary Search Tree Property: the left child of every node is smaller than it, while the right child is larger than it.

In other words, a node and its subtrees in a BST look something like this:


Here, 'L' refers to the node at the root of the left subtree, and 'R' to the node at the root of the right subtree. Note that L<P<R for all nodes P.

The whole point of rotations is to take two nodes (a parent and any one of its children) and change their relative vertical positions without changing their horizontal ones. In other words, you want to maintain the L<P<R property, but you want to change the parent from the current node to the other node (which is its immediate child: rotations only work on adjacent nodes).

In pictures, what you want to do is this:


Now look at the little dude at the bottom. If he looks upwards, what he sees is the nodes in the order c A d B e. This is basically the left-to-right order (the inorder traversal) of the subtree, with c<A<d<B<e (check it!). Notice how we managed to make B the parent and A the child without disturbing this left-to-right order.

Rotation works both ways, as shown in the above image. Converting to the left side is called a right rotate (the subtree turns clockwise, with B ending up as the RIGHT child), while converting to the right side is called a left rotate.

Frankly, I think the word "rotation" is a misnomer. It looks nothing like a rotation! IMO, it looks more like the node B climbs vertically up while node A climbs down (look at the image: they are in the same horizontal position). But to each man his own.

Anyway, let's get down to the nitty-gritty. Here are the steps you have to follow to rotate:

Left Rotation

First, I'd like all of you to go through the below algorithm carefully and draw it, or if you can, try to visualize it. Drawing is better, less room for error.

  1. Initially, A is the parent and B is its RIGHT CHILD.
  2. Create a temporary node 'save' and give it A's value
  3. Set save's left to parent.left
  4. Set save's right to parent.right.left
  5. Copy B's value to parent
  6. Set parent.right to parent.right.right
  7. Set parent.left to save
  8. And you're done

Note that all we are doing is constructing the new parent's left child, which is the A-subtree as seen on the right side of the diagram. Then we modify the children of the parent to the new children relative to their original positions (for example, the new parent.right is the old parent.right.right).
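
In code, the steps above come out something like this (a Python sketch of the value-copy approach described; the Node record and names are my own assumptions):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def left_rotate(parent):
    # Precondition (step 1): parent is A, and parent.right (B) exists
    save = Node(parent.value)          # step 2: temp node with A's value
    save.left = parent.left            # step 3: save keeps A's old left subtree (c)
    save.right = parent.right.left     # step 4: save takes B's left subtree (d)
    parent.value = parent.right.value  # step 5: the root now carries B's value
    parent.right = parent.right.right  # step 6: root's right child is B's old right (e)
    parent.left = save                 # step 7: the rebuilt A-subtree goes on the left

Because the values are copied instead of the top nodes being re-linked, whoever holds a reference to this subtree's root still holds a valid reference afterwards, which is exactly why the algorithm bothers with the 'save' node.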

Right Rotation

Right rotation can be done with some simple steps:

  1. Go to the left rotate section and copy the algorithm text
  2. Paste it somewhere, notepad or something
  3. Find all instances of 'left' and change them to 'right'
  4. Find all original instances of 'right' and change them to 'left'
  5. Read and understand

You didn't honestly expect me to do all the work for you, did you? You lazy ass! It's the same concept!

Usage

Hopefully, by now, you've got the essence of rotation. If not, go back up there and learn it.

Done? Good. Now, rotation can be employed for several different uses. The most notable, however, is in self-balancing binary search trees. With a simple rotation, you can shift the entire balance of a subtree from a left-bias to a right-bias, and vice versa. This technique is employed in AVL Trees and Red-Black Trees, both of which I will be writing about ASAP.

Until then, keep learning
Ciao!

EDIT: Wikipedia has this brilliant image for rotation in BSTs.


By the way, for those who want to know, the first two images were created by me using this Chrome app called Gliffy. It's really, really handy. For all those using Chrome, I highly recommend it.