Longest Common Subsequence (LCS) Algorithm¶
Algorithm Introduction¶
The Longest Common Subsequence (LCS) algorithm is a classic dynamic programming algorithm used to find the longest subsequence common to two sequences. Unlike the Myers diff algorithm which directly computes edit operations, LCS focuses on identifying shared portions between sequences. LCS has wide applications in bioinformatics, file comparison, version control systems, and natural language processing.
Algorithm Principles¶
LCS algorithm is based on these core concepts:
Subsequence and Common Subsequence¶
- Subsequence: A new sequence derived from another sequence by deleting some elements (possibly none). For example, "ACE" is a subsequence of "ABCDE".
- Common Subsequence: A sequence that is a subsequence of both input sequences. For example, "BC" is a common subsequence of "ABCD" and "XBCZ".
- Longest Common Subsequence: The longest of all common subsequences.
Dynamic Programming Approach¶
LCS uses dynamic programming by building a 2D table to store intermediate results:
- State definition:
dp[i][j]represents the length of LCS between first i elements of sequence A and first j elements of sequence B. - State transition:
- If A[i-1] = B[j-1], then
dp[i][j] = dp[i-1][j-1] + 1 - Otherwise,
dp[i][j] = max(dp[i-1][j], dp[i][j-1]) - Boundary conditions:
dp[0][j] = dp[i][0] = 0(LCS length is 0 for empty sequences)
Algorithm Steps¶
- Initialization:
- Create a (n+1) × (m+1) 2D array dp, where n and m are lengths of sequences A and B
-
Initialize first row and first column to 0
-
Fill dp table:
-
For each position (i,j), 1 ≤ i ≤ n, 1 ≤ j ≤ m:
- If A[i-1] = B[j-1], then dp[i][j] = dp[i-1][j-1] + 1
- Else, dp[i][j] = max(dp[i-1][j], dp[i][j-1])
-
Backtrack to construct LCS:
- Start from dp[n][m]
- If A[i-1] = B[j-1], add this character to LCS and move to dp[i-1][j-1]
- Else, move to dp[i-1][j] or dp[i][j-1] (whichever is larger)
Example Walkthrough¶
Let's understand LCS with an example. Suppose we have two strings: - A = "ABCBDAB" - B = "BDCABA"
Building dp Table¶
We build an 8×7 table:
| | B | D | C | A | B | A
----+---+---+---+---+---+---+---
| 0 | 0 | 0 | 0 | 0 | 0 | 0
----+---+---+---+---+---+---+---
A | 0 | 0 | 0 | 0 | 1 | 1 | 1
----+---+---+---+---+---+---+---
B | 0 | 1 | 1 | 1 | 1 | 2 | 2
----+---+---+---+---+---+---+---
C | 0 | 1 | 1 | 2 | 2 | 2 | 2
----+---+---+---+---+---+---+---
B | 0 | 1 | 1 | 2 | 2 | 3 | 3
----+---+---+---+---+---+---+---
D | 0 | 1 | 2 | 2 | 2 | 3 | 3
----+---+---+---+---+---+---+---
A | 0 | 1 | 2 | 2 | 3 | 3 | 4
----+---+---+---+---+---+---+---
B | 0 | 1 | 2 | 2 | 3 | 4 | 4
Backtracking to Build LCS¶
Start from dp[7][6] = 4:
1. A[6] = B[5] = 'A', add to LCS, move to dp[6][5]
2. A[5] = 'D' ≠ B[4] = 'A', dp[5][5] > dp[6][4], move to dp[5][5]
3. A[4] = 'B' = B[4] = 'B', add to LCS, move to dp[4][4]
4. A[3] = 'C' ≠ B[3] = 'A', dp[3][4] = dp[4][3], move to dp[3][4]
5. A[2] = 'C' ≠ B[2] = 'C', dp[2][4] > dp[3][3], move to dp[2][4]
6. A[1] = 'B' = B[0] = 'B', add to LCS, move to dp[1][0]
7. Reach boundary
Thus, LCS is "BBA" (reversed) or "ABB" (forward).
Animation Demo¶
Code Implementation¶
def longest_common_subsequence(a, b):
"""
Compute the longest common subsequence between two sequences
Args:
a: First sequence
b: Second sequence
Returns:
The longest common subsequence
"""
n = len(a)
m = len(b)
# Create dp table
dp = [[0] * (m + 1) for _ in range(n + 1)]
# Fill dp table
for i in range(1, n + 1):
for j in range(1, m + 1):
if a[i-1] == b[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
# Backtrack to build LCS
lcs = []
i, j = n, m
while i > 0 and j > 0:
if a[i-1] == b[j-1]:
lcs.append(a[i-1])
i -= 1
j -= 1
elif dp[i-1][j] > dp[i][j-1]:
i -= 1
else:
j -= 1
# Reverse to get correct order
return ''.join(reversed(lcs))
# Test
a = "ABCBDAB"
b = "BDCABA"
lcs = longest_common_subsequence(a, b)
print(f"Sequence A: {a}")
print(f"Sequence B: {b}")
print(f"Longest Common Subsequence: {lcs}")
public class LongestCommonSubsequence {
public static void main(String[] args) {
String a = "ABCBDAB";
String b = "BDCABA";
String lcs = findLCS(a, b);
System.out.println("Sequence A: " + a);
System.out.println("Sequence B: " + b);
System.out.println("Longest Common Subsequence: " + lcs);
}
public static String findLCS(String a, String b) {
int n = a.length();
int m = b.length();
// Create dp table
int[][] dp = new int[n + 1][m + 1];
// Fill dp table
for (int i = 1; i <= n; i++) {
for (int j = 1; j <= m; j++) {
if (a.charAt(i - 1) == b.charAt(j - 1)) {
dp[i][j] = dp[i - 1][j - 1] + 1;
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
// Backtrack to build LCS
StringBuilder lcs = new StringBuilder();
int i = n, j = m;
while (i > 0 && j > 0) {
if (a.charAt(i - 1) == b.charAt(j - 1)) {
lcs.append(a.charAt(i - 1));
i--;
j--;
} else if (dp[i - 1][j] > dp[i][j - 1]) {
i--;
} else {
j--;
}
}
// Reverse to get correct order
return lcs.reverse().toString();
}
}
Complexity Analysis¶
- Time complexity: O(n×m)
- Need to fill n×m dp table
-
Each cell computation is O(1) operation
-
Space complexity: O(n×m)
- Need to store n×m dp table
- Can be optimized to O(min(n,m)) by storing only two rows/columns
Relationship Between LCS and Diff Algorithms¶
LCS is closely related to diff algorithms like Myers diff:
- Diff representation: After finding LCS, elements not in LCS can be marked as insertions or deletions
- Elements in A not in LCS represent deletions
-
Elements in B not in LCS represent insertions
-
Edit distance: Edit distance = len(A) + len(B) - 2×len(LCS)
-
Because each element not in LCS requires one edit operation
-
Efficiency comparison:
- LCS: O(n×m) time, suitable for general cases
- Myers: O(n×d) time, more efficient when edit distance d is much smaller than sequence lengths
Algorithm Optimizations¶
Common LCS optimizations:
- Space optimization:
- Store only two rows/columns of dp values, reducing space to O(min(n,m))
-
Use rolling array technique
-
Pruning optimization:
- Precompute common prefixes/suffixes, apply LCS only to middle portions
-
Particularly effective for very long sequences with small differences
-
Divide and conquer:
- Break problem into smaller subproblems
- Can be combined with parallel computing for large sequence comparisons
Applications¶
LCS has wide applications:
- Bioinformatics:
- DNA and protein sequence alignment
-
Genome research and evolutionary analysis
-
Text comparison:
- File diff tools (like diff)
-
Plagiarism detection systems
-
Version control systems:
- Identifying changes between file versions
-
Merging different versions
-
Natural language processing:
- Text similarity computation
-
Automatic summarization
-
Data compression:
- Basis for some delta compression algorithms
Exercises¶
-
Basic implementation: Implement LCS to find longest common subsequence between two strings.
-
Space optimization: Modify LCS to use only O(min(n,m)) space.
-
Application: Implement a simple text diff tool using LCS to highlight insertions and deletions.
-
Extension: Implement Longest Common Substring algorithm (different from LCS as substrings must be contiguous).
-
Multi-sequence LCS: Extend LCS to handle three or more sequences.
-
LCS and edit distance: Prove and implement: edit distance = len(A) + len(B) - 2×len(LCS), explaining the formula.
-
Efficient implementation: Implement an efficient LCS for very long sequences (like genomic data) combining multiple optimizations.
References¶
-
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). "Introduction to Algorithms" (3rd ed.). MIT Press. [Classic algorithms textbook with detailed LCS coverage]
-
Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM, 18(6), 341-343. [Linear space LCS algorithm]
-
Hunt, J. W., & McIlroy, M. D. (1976). "An Algorithm for Differential File Comparison". Computing Science Technical Report, Bell Laboratories 41. [Early diff algorithm using LCS]
-
Myers, E. W. (1986). "An O(ND) Difference Algorithm and Its Variations". Algorithmica, 1(1), 251-266. [Related diff algorithm]
-
Bergroth, L., Hakonen, H., & Raita, T. (2000). "A survey of longest common subsequence algorithms". In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00). [LCS survey paper]