The Programmer [In Training]: Concept 2: The Maximum Subarray Problem

Consider an array A, of numbers. What you want to do is to find two points (elements) in the array such that the sum of all the elements between those points (including the two endpoints) is the maximum possible sum of any such subarray. This is termed the maximum subarray problem.

In other words, what you want is to find the continuous section of the array with the maximum total sum. Of course, this means that the array contains both positive and negative elements (Otherwise, if the array contains only positive elements, the maximum subarray is nothing but the entire array: think about that for a second). So you want to minimize the 'losses', which are the negative elements, while maximizing the 'gains', the positive ones.

This algorithm follows the famous method of "divide and conquer". The overall method is exactly as it sounds: the divide-and-conquer method of algorithms works by dividing the main problem into subproblems of manageable size, which we can deal with directly. These subproblems are nothing but smaller instances of the main problem - with the special cases among them being handled separately. Naturally, this kind of a problem lends itself to a recursive solution.

In recursion, you must remember that each instance of the recursive call behaves in the exact same manner as that of its parent, and this holds for all cases except the edge (terminating) cases. For example, for the Fibonacci sum of a number:

F(n) = F(n-1) + F(n-2)

Now we see that the above condition holds for all values of n: all except n=1 and n=0, which are the edge cases. For all other cases, replacing 'n' with any value will result in a correct relation.

Now that we are done with that little diversion into the basics of recursion, we can proceed with the maximum subarray problem.

Algorithm Basics

The maximum subarray problem can be logically divided into three steps:

The maximum subarray of A (between A.begin and A.end) can either be:

The maximum subarray between A.begin and A.mid.
The maximum subarray between A.mid + 1 and A.end.
The maximum subarray crossing A.mid, with part of it in the first half and part of it in the second half.

The subproblems entirely in the left half (step 1.1) and entirely in the right half (step 1.2) are nothing but direct subproblems of the original problem.
The subproblem crossing A.mid (step 1.3) is a special case and needs further inspection.

In other words, we acknowledge that there can be only one maximum subarray (if there are two with the same sum, we take any one, depending on the algorithm). This one maximum subarray can lie in three positions: entirely in the first half, entirely in the second half, and partially in the first half and second half. We are also aware that the third case must cross the midpoint. In other words, as the subarray is continuous, the midpoint must be part of it in the third case.

Using this data, we can craft the entire algorithm.

Complete Algorithm

As we can see, the case for the subarray being entirely in the left or right half of the array warrants no inspection, as it is nothing but a copy of the original problem on size (A.size)/2. Thus, it will be handled by the recursive call.

What we need to worry about is the third case: the one where the maximum subarray may belong to the subset of all subarrays crossing the midpoint. In this case, we can handle it thus:

Now, we are aware that any subarray crossing the midpoint must include the midpoint itself. What we don't know is what the endpoints are.
As we need the maximum subarray, and using the fact that addition is commutative, we can actually start from the midpoint and add in either direction to obtain both endpoints.
So start from the midpoint, and add down to the first element (the 'low' index) and maintain a running tally of the obtained sum and the index of that endpoint of the maximum obtained sum.
Again, for the other end, start from the midpoint (actually midpoint + 1, as we have counted the midpoint already for the previous point), add up to the last element (the 'high' index of the array) and maintain a running tally of the sum and the index of the endpoint of the maximum obtained sum.

What we have now obtained are two values: the maximum sum in the region before the midpoint (and including it) and the maximum sum in the region after the midpoint (and including midpoint+1). The total sum crossing the midpoint is nothing but the sum of the above two values. This can be made clearer using the following pseudocode:

max_crossing_subarray (A, lo, mid, hi):
 
 tSum := 0
 lSum := -inf # inf = infinity
 loIndex := mid
 
 for i := mid downto lo :
  tSum += i
  if tSum > lSum:
   lSum = tSum
   loIndex = i
 
 tSum = 0
 rSum := -inf # inf = infinity
 hiIndex := mid+1
 
 for i:= mid+1 to hi:
  tSum += i
  if tSum > rSum:
   rSum = tSum
   hiIndex = i
 
 return (loIndex, hiIndex, lSum+rSum)

From the pseudocode, it must be clear how the procedure for finding the subarray that crosses the midpoint must proceed. From here, it is a simple matter of returning the maximum value among the left half, right half and middle section as the maximum possible subarray.

max_subarray (A, lo, hi) :
 
 if lo == hi : # only one element present
  return (lo, hi, A[lo])
 
 # else, more than one element
 
 mid = (lo+hi) / 2 # remember integer division floors the value
 
 # initialize all limits and maximum sums
 (lLo, lHi, lSum) := max_subarray (A, lo, mid)
 (rLo, rHi, rSum) := max_subarray (A, mid+1, hi)
 (mLo, mHi, mSum) := max_crossing_subarray (A, lo, mid, hi)
 
 elMax := max (lSum, rSum, mSum) # maximum function to get max value of 3 parameters
 
 # return matching tuple based on maximum region sum
 if elMax == lSum : return (lLo, lHi, lSum)
 else if elMax == rSum : return (rLo, rHi, rSum)
 else return  (mLo, mHi, mSum)

Time Complexity

The time complexity for the algorithm can be computed with the master theorem for recurrences. Now I know that I've not written anything about the master theorem yet, so for now go open up that Wikipedia link, or just Google it or something.

We use the master method as follows:

The main problem can be specified as some T(n), which is a function operating on 'n' inputs (for this problem, 'n' is the size of the array).
Now there are two recursive calls in each main function call: the function max_subarray() is called on the first half and second half of the array. Each of these calls can be said as T(n/2), ignoring the constant difference of 1 element when n is odd.
The main function also calls max_crossing_subarray() once. This call takes O(n) time as each element of the array is checked once.
Thus the total running time can be written as T(n) = 2T(n/2) + f(n), where function f(n) takes O(n) time.
By the master theorem, this takes the form of the second case (as given in that Wikipedia link, go look it up), and we end up with the running time of O(n*log(n)).

Usage

This algorithm is useful for a lot of tasks. Perhaps the most important is buying / selling stocks. You want to buy stock on the day it is cheapest, while selling it in such a manner that you obtain the maximum profit. If you have a mechanism in place that can predict stock price reliably, you can use this algorithm to find the days to buy and the days to sell. The inputs for the array in that case will be the difference in price from one day to the next. This difference is positive for a day X when the selling rate on that day is higher than the previous day, and vice-versa. Of course, chances are that if you can predict the stock prices to that extent, you'd be either rich, dead, or both.

EDIT: Added the Time Complexity section which I had quite stupidly forgotten. Thanks +Shreyas shankar srinivas! Also, there is a way to do this algorithm in linear (O(n)) time, but I haven't figured out how yet. When I do, I will put it up here.

The Programmer [In Training]

My Blog

Search This Blog

Sunday, 30 November 2014

Concept 2: The Maximum Subarray Problem

Algorithm Basics

Complete Algorithm

Time Complexity

Usage

No comments:

Post a Comment