MEDIUM
DATA STRUCTURES AND ALGORITHMS

How to solve the Longest Common Subsequence Problem

STRINGSDYNAMIC PROGRAMMING
Written By
tom-wagner.png
Tom Wagner
Christopher Ang

Introduction Longest Common Subsequence

The Longest Common Subsequence (LCS) problem involves finding the longest sequence of characters present in two strings. Variations of this problem are commonly found in real world applications such as bioinformatics, natural language processing, and text comparison. This problem can be solved using dynamic programming techniques, which involve breaking down the problem into smaller subproblems and solving them iteratively.

Problem Longest Common Subsequence

Given two strings, return the longest common subsequence between the two strings. A subsequence of a string is a string containing characters from the original string in the same order as the original, but may have characters deleted.

Example Inputs and Outputs

Example 1

Input: s1 = "abccba", s2 = "abddba" Output: "abba"

Example 2

Input: s1 = "zfadeg", s2 = "cdfsdg" Output: "fdg"

Example 3

Input: s1 = "abd", s2 = "badc" Output: "ad" (or "bd")

Constraints

  • 1 <= s1.length, s2.length <= 1000

  • There may be multiple valid answers, but they will all have the same length.

Solution Longest Common Subsequence

Watch Someone Solve Longest Common Subsequence
Google InterviewJavaScript Interview
Advance this person to the next round?
Thumbs up
Technical Skills:
4/4
Problem Solving Ability:
4/4
Communication Ability:
3/4
Show Transcript

Longest Common Subsequence Approaches

To solve the Longest Common Subsequence problem it is helpful to first consider a couple of important properties of the lcs function. Firstly, if two strings s1, s2 end in the same substring then their lcs is the lcs of the two strings without their common substring concatenated with said substring. For example, lcs("abccba", "abddba") = lcs("abcc", "abdd") + "ba"

Framefirstprop.png

Secondly, if two strings do not have a common ending substring, the lcs of the two strings will be the same as the lcs resulting from removing the ending of one of the strings. Put another way, lcs(s1, s2) is one of two recursive possibilities:

  1. lcs(s1[:-1], s2)
  2. lcs(s1, s2[:-1])

where s1[:-1] is a string with the last character removed. For example, suppose s1="abcd", s2="zdef", then lcs(s1, s2) = lcs("abcd", "zde") = lcs("abcd", "zd") = "d". But note in this example that lcs("abcd", "zde") is clearly not equal to lcs("abc", "zde").

Framesecondprop.png

Recursive Approach

Leveraging the above two properties we can define a recursive solution to solve this problem. Starting at the end of the two strings:

  1. If the characters at the end are the same, we can return lcs(s1[:-1], s2[:-1]) + s1[-1].
  2. If the characters are not the same, we must compute both lcs(s1[:-1], s2) and lcs(s1, s2[:-1]), and return the longer one.
def solution(s1, s2):
    if len(s1) is 0 or len(s2) is 0:
        return ''
    elif s1[-1] == s2[-1]:
        return solution(s1[:-1], s2[:-1]) + s1[-1]
    else:
        sub1 = solution(s1[:-1], s2)
        sub2 = solution(s1, s2[:-1])
        return sub1 if len(sub1) > len(sub2) else sub2

Time/Space Complexity

Let m and n be the length of the two strings.

  • Time Complexity: O(2^m * 2^n) in the worst case. This algorithm computes all possible subsequences for both strings, resulting in time complexity of 2^(len(s)) for a string, but it also computes all possible subsequences per subsequence of the other string, hence the product.
  • Space Complexity: O(max(m,n)). The space complexity is due to the height of the recursion call stack being the maximum length between the two strings.

Recursive Solution with Memoization

When implementing a recursive algorithm one thing to always look out for is repeated work. For example, if we employ the above algorithm on a string that includes the substring "abc" in three disparate locations we will effectively be calling lcs(abc) three times. How can we avoid this and improve the time complexity? By storing the lcs computations in a lookup table, otherwise known as memoization.

def solution(s1, s2):
    return solution_recur(s1, s2, {})
def solution_recur(s1, s2, solutions):
    # use a frozenset here because (1) frozenset is hashable, so can be
    # used for a key, and (2) order of the inputs does not matter for this function.
    inputs = frozenset([s1, s2])
    solved = solutions.get(inputs, None)
    if solved is not None:
        return solved
    if len(s1) == 0 or len(s2) == 0:
        solved = ''
    elif s1[-1] == s2[-1]:
        solved = solution_recur(s1[:-1], s2[:-1], solutions) + s1[-1]
    else:
        sub1 = solution_recur(s1[:-1], s2, solutions)
        sub2 = solution_recur(s1, s2[:-1], solutions)
        solved = sub1 if len(sub1) > len(sub2) else sub2
    solutions[inputs] = solved
    return solved

Time/Space Complexity

Due to how the recursive function is formulated, the lcs of (m + 1) * (n + 1) different input pairs will be computed. Since this solution caches those results, it has:

  • Time Complexity: O(m * n)
  • Space Complexity: O(m * n)

Dynamic Programming Approach

Notice that in the previous approaches, we relied on the fact that the problem being solved had very similar subproblems which were used to ultimately solve the original problem. Of course these types of situations can be tackled using memoization as seen above, however an even more powerful approach can be used to tackle overlapping subproblems: dynamic programming.

To solve this using a dynamic programming approach, this solution will construct a table of results, and then trace back through the table to construct the longest subsequence. Letting s1="abd" and s2="badc", the initial table would look like:

Frametable.png

Notice that the table has been augmented with a row and column for no characters in a string. The added row and column serve as a base case upon which to fill out the rest of the table.

The table can be constructed row by row (or column by column). At each cell the character of the row is compared with the character of the column, and based on the result the following items are stored:

  1. The length of the longest subsequence between the two substrings up to and including the characters at the current cell. The value will be one plus the "previous" cell's value if the characters match; otherwise it will be the "previous" cell's value.
  2. The direction of the cell "prior" to the current cell which has the largest value. If the characters match, then the "previous" cell is up one and to the left one. Otherwise it is whichever cell has the larger value between the cell to the left and the cell above.

Continuing with the example, to fill in the first row the following would occur:

  1. 'a', 'b': unequal; above and to the left are 0; so store 0 with direction of "both"
  2. 'a', 'a': equal; upper left is 0; so store 1 with direction of "up-left"
  3. 'a', 'd': unequal; above is 0, but left is 1, so store 1 with direction of "left"
  4. 'a', 'c': unequal; above is 0, but left is 1, so store 1 with direction of "left"

Framefirstrow.png

Next we fill in the second row via:

  1. 'b', 'b': equal; upper left is 0, so store 1 with direction of "up-left"
  2. 'b', 'a': unequal; above and to the left are both 1; so store 1 with direction of "both"
  3. 'b', 'd': unequal; above and to the left are both 1; so store 1 with direction of "both"
  4. 'b', 'c': unequal; above and to the left are both 1; so store 1 with direction of "both"

Framesecondrow.png

Finally, the last row:

  1. 'd', 'b': unequal; above and to the left are 0, so store 0 with direction of "both"
  2. 'd', 'a': unequal; above is 1 while to the left is 0; so store 1 with direction of "up"
  3. 'd', 'd': equal; upper left is 1; so store 2 with direction "up-left"
  4. 'd', 'c': unequal; above is 1, while to the left is 2; so store 2 with direction 'left'

Framethirdrow.png

Now that the table is completed we can trace back through the table to construct the lcs. To do so, we start in the bottom right corner of the table, which will have the largest number. At each cell, check the direction. If the direction is "up-left", then the character at the current cell is part of the lcs, so prepend it to the result (since the traceback goes through the table in reverse). Whether the character is part of the lcs or not, follow the direction encoded in the cell to navigate the table until the value of the cell is zero.

Using the example above, the traceback would proceed as follows:

  1. Start in the bottom right corner, where the value stored is 2. This indicates any lcs has a length of 2. As the direction is left, move to the cell (d,d)
  2. In the cell (d,d), the direction is up-left, so 'd' is the last character in an lcs. Prepend 'd' to the result, which is then just "d" so far, and move to the cell (b,a).
  3. In the cell (b,a), the direction is both, and the algorithm could choose to go either to the left or up (with different resulting strings). For this example, the canonical direction is up, so navigate to the cell (a,a).
  4. In cell (a,a), the direction is up-left, so 'a' is the penultimate character in an lcs. Prepend 'a' to the result to end up with a string of "ad" so far, and move to the cell ('',b).
  5. In the cell ('',b), the value stored is 0, so the algorithm stops, and the lcs is "ad".

Frametraceback.png

Note 1: In step 3, had the algorithm gone to the left instead of up, the lcs would have been "bd" instead of "ad".

Note 2: It is possible to store the actual subsequences inside each cell, but such a solution would need to include all subsequences that could be arrived at in each cell, which would add to the space complexity.

class SolutionNode:
    def __init__(self, direction="sink", value=0):
        # valid directions are: sink, up, left, up-left, or both
        self.direction = direction
        self.value = value
def solution(s1, s2):
    if len(s1) == 0 or len(s2) == 0:
        return ''
    # building the table with the extra row and column
    lcs = [[SolutionNode() for x in range(len(s2)+1)]
           for y in range(len(s1)+1)]
    # skip first row; it is supposed to be all zeros anyway
    for i, row in enumerate(lcs[1:], 1):
        # skip first row; it is supposed to be all zeros anyway
        for j, cell in enumerate(row[1:], 1):
            print(i)
            print(j)
            if s1[i-1] == s2[j-1]:
                cell.value = lcs[i-1][j-1].value + 1
                cell.direction = 'up-left'
            elif lcs[i][j-1].value == lcs[i-1][j].value:
                cell.direction = 'both'
                cell.value = lcs[i][j-1].value
            elif lcs[i][j-1].value > lcs[i-1][j].value:
                cell.direction = 'left'
                cell.value = lcs[i][j-1].value
            else:
                cell.direction = 'up'
                cell.value = lcs[i-1][j].value
    # The table is built; now to traceback
    i = len(s1)
    j = len(s2)
    node = lcs[i][j]
    val = node.value
    result = ''
    while val > 0:
        # Could instead go 'left' on 'both'
        if node.direction == 'up' or node.direction == 'both':
            i -= 1
        elif node.direction == 'left':
            j -= 1
        else:
            i -= 1
            j -= 1
            # need to prepend since this constructs the lcs in reverse
            result = s1[i] + result
        node = lcs[i][j]
        val = node.value
    return result

Time/Space Complexity

  • Time Complexity: O(m * n).
  • Space Complexity: O(m * n).

Additional Reading

The final solution in this write can be further improved. One such way is the Hunt-Szymanski Algorithm.

Practice Questions Similar to Longest Common Subsequence

MEDIUM
Data Structures and Algorithms
Build a Max Heap

Given an array of integers, transform the array in-place to a max heap.

Watch 1 interview
MEDIUM
Data Structures and Algorithms
Three Sum

Given an array of integers, return an array of triplets such that i != j != k and nums[i] + nums[j] + nums[k] = 0.

Watch 1 interview
EASY
Data Structures and Algorithms
Reverse Linked List

Given the head of a linked list, reverse the list and return the new head.

Watch 2 interviews

About interviewing.io

interviewing.io is a mock interview practice platform. We've hosted over 100K mock interviews, conducted by senior engineers from FAANG & other top companies. We've drawn on data from these interviews to bring you the best interview prep resource on the web.

Ready to practice with a mock interview?

Book mock interviews with engineers from Google, Facebook, Amazon, or other top companies. Get detailed feedback on exactly what you need to work on. Everything is fully anonymous.