Improved Myers Diff Algorithm¶
Algorithm Introduction¶
The Myers diff algorithm is an efficient sequence comparison algorithm proposed by Eugene W. Myers in 1986. It finds the minimal edit operations (insertions and deletions) required to transform one sequence into another. This algorithm is widely used in version control systems (like Git), text comparison tools, and bioinformatics.
Algorithm Principles¶
The Myers algorithm is based on the concept of an edit graph, which finds the shortest path from the top-left to bottom-right corner to determine the minimal edit operations. Let's explain its core concepts in detail:
Edit Graph¶
Given two sequences A[1...N] and B[1...M], the edit graph is an (N+1) × (M+1) grid:
- Each point (x,y) represents the state of comparing the first x characters of A with the first y characters of B
- Starting from (0,0) (empty sequence comparison)
- Goal is to reach (N,M) (complete sequence comparison)
In this graph: - Horizontal move (from (x,y) to (x+1,y)) represents deletion of A[x+1] - Vertical move (from (x,y) to (x,y+1)) represents insertion of B[y+1] - Diagonal move (from (x,y) to (x+1,y+1)) represents A[x+1] and B[y+1] matching (or substitution, though Myers algorithm only considers matches)
Edit Distance (d) and k-lines¶
The Myers algorithm introduces two key concepts:
- Edit distance d: The minimal number of edit operations (insertions and deletions) needed to reach a point from the start
- d = 0 means no edit operations (only diagonal moves)
- d = 1 means one edit operation (one insertion or one deletion)
-
And so on
-
k-lines: All points where x - y = k
- k = 0 represents the main diagonal (x = y)
- k > 0 represents diagonals above the main diagonal (x > y)
- k < 0 represents diagonals below the main diagonal (x < y)
Relationship Between k and d¶
There's an important mathematical relationship between k and d:
- k range: For a given edit distance d, possible k values range from -d ≤ k ≤ d
-
This means for edit distance d, we only need to consider 2d+1 k-lines
-
k parity: For a given d, k has the same parity as d
- If d is even, k must also be even
-
If d is odd, k must also be odd
-
Progressive relationship:
- When moving from edit distance d-1 to d, we only need to calculate new k values
- Each new k value is based on results from k-1 or k+1 at edit distance d-1
Core Algorithm Idea¶
The core idea of Myers algorithm is:
- For each edit distance d (starting from 0 and increasing)
- Calculate the furthest reachable point (maximum x value) on each k-line
- If we can reach the endpoint (N,M) at edit distance d, we've found the shortest edit path
This approach guarantees finding the path with minimal edit operations.
Detailed Algorithm Steps¶
- Initialization:
- Set starting point (0,0)
-
Initialize V array to store the furthest x coordinate on each k-line
-
Iterative search:
- For each edit distance d (from 0 upwards)
-
For each k (from -d to d, step 2)
- Determine whether to move down (vertical, insertion) or right (horizontal, deletion)
- Calculate the furthest reachable point (x,y) on the k-line
- Attempt to move diagonally (matches)
- Check if endpoint (N,M) is reached
-
Path backtracking:
- Backtrack from endpoint to start point
- Record each edit operation (insertion or deletion)
Illustrated Example¶
Let's understand Myers diff algorithm with a concrete example. Suppose we have two strings: - A = "ABCABBA" - B = "CBABAC"
Edit Graph Representation¶
We can represent these strings as an edit graph:
| | C | B | A | B | A | C
----+---+---+---+---+---+---+---
| 0 | 1 | 2 | 3 | 4 | 5 | 6
----+---+---+---+---+---+---+---
A | 1 | | | \ | | \ |
----+---+---+---+---+---+---+---
B | 2 | | \ | | \ | |
----+---+---+---+---+---+---+---
C | 3 | \ | | | | | \
----+---+---+---+---+---+---+---
A | 4 | | | \ | | \ |
----+---+---+---+---+---+---+---
B | 5 | | \ | | \ | |
----+---+---+---+---+---+---+---
B | 6 | | \ | | \ | |
----+---+---+---+---+---+---+---
A | 7 | | | \ | | \ |
Where "\" represents diagonal matches (A[i] = B[j]).
k-lines Illustration¶
In this edit graph, different k-lines are: - k = 0: Main diagonal, x = y - k = 1: x - y = 1 diagonal - k = -1: x - y = -1 diagonal - And so on
Edit Distance d Iteration¶
- d = 0: Only consider k = 0
- Start from (0,0), move along diagonal
-
Cannot reach endpoint
-
d = 1: Consider k = -1 and k = 1
- k = -1: Move down to (0,1), then attempt diagonal
- k = 1: Move right to (1,0), then attempt diagonal
-
Still cannot reach endpoint
-
d = 2, 3, ...:
- Continue iterations until finding path to endpoint
Final Path¶
Eventually, the algorithm finds a shortest path from (0,0) to (7,6), and the corresponding edit operation sequence is the minimal edit script.
Code Implementation¶
def myers_diff(a, b):
"""
Compute the difference between two sequences using Myers diff algorithm
Args:
a: First sequence
b: Second sequence
Returns:
List of edit operations where:
- ('+', i, x) means insert element x at position i
- ('-', i, x) means delete element x at position i
"""
# Create edit graph
n = len(a)
m = len(b)
max_edit = n + m
# Store furthest x coordinate on each k-line
v = {1: 0}
# Store path
trace = []
# Compute shortest edit path
for d in range(max_edit + 1):
# Save current d's path info
trace.append(v.copy())
for k in range(-d, d + 1, 2):
# Determine whether to move down or right
if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
x = v.get(k + 1, -1) # Move down (insert)
else:
x = v.get(k - 1, -1) + 1 # Move right (delete)
# Calculate y coordinate
y = x - k
# Move diagonally (matches)
while x < n and y < m and a[x] == b[y]:
x += 1
y += 1
# Update v
v[k] = x
# If reached endpoint, backtrack and return edit script
if x >= n and y >= m:
return _backtrack(a, b, trace, n, m)
# If no path found (shouldn't happen)
return []
def _backtrack(a, b, trace, x, y):
"""Backtrack to construct edit script"""
result = []
d = len(trace) - 1
while d > 0:
v = trace[d]
k = x - y
# Determine previous k value
if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
prev_k = k + 1 # Previously moved down (insert)
else:
prev_k = k - 1 # Previously moved right (delete)
# Calculate previous position
prev_x = v.get(prev_k, 0)
prev_y = prev_x - prev_k
# Handle diagonal moves (matches)
while x > prev_x and y > prev_y:
x -= 1
y -= 1
# Record edit operation
if x == prev_x:
# Insert operation
result.insert(0, ('+', x, b[prev_y]))
else:
# Delete operation
result.insert(0, ('-', prev_x, a[prev_x]))
# Update position
x = prev_x
y = prev_y
d -= 1
return result
# Test
a = "ABCABBA"
b = "CBABAC"
diff = myers_diff(a, b)
print("Diff result:", diff)
import java.util.*;
public class MyersDiff {
public static void main(String[] args) {
String a = "ABCABBA";
String b = "CBABAC";
List<EditOperation> diff = myersDiff(a, b);
System.out.println("Diff result:");
for (EditOperation op : diff) {
System.out.println(op);
}
}
// Edit operation class
static class EditOperation {
char type; // '+' for insert, '-' for delete
int position;
char element;
public EditOperation(char type, int position, char element) {
this.type = type;
this.position = position;
this.element = element;
}
@Override
public String toString() {
return "(" + type + ", " + position + ", " + element + ")";
}
}
public static List<EditOperation> myersDiff(String a, String b) {
char[] aChars = a.toCharArray();
char[] bChars = b.toCharArray();
int n = aChars.length;
int m = bChars.length;
int maxEdit = n + m;
// Store furthest x coordinate on each k-line
Map<Integer, Integer> v = new HashMap<>();
v.put(1, 0);
// Store path
List<Map<Integer, Integer>> trace = new ArrayList<>();
// Compute shortest edit path
for (int d = 0; d <= maxEdit; d++) {
// Save current d's path info
trace.add(new HashMap<>(v));
for (int k = -d; k <= d; k += 2) {
// Determine whether to move down or right
int x;
if (k == -d || (k != d && v.getOrDefault(k - 1, -1) < v.getOrDefault(k + 1, -1))) {
x = v.getOrDefault(k + 1, -1); // Move down (insert)
} else {
x = v.getOrDefault(k - 1, -1) + 1; // Move right (delete)
}
// Calculate y coordinate
int y = x - k;
// Move diagonally (matches)
while (x < n && y < m && aChars[x] == bChars[y]) {
x++;
y++;
}
// Update v
v.put(k, x);
// If reached endpoint, backtrack and return edit script
if (x >= n && y >= m) {
return backtrack(aChars, bChars, trace, n, m);
}
}
}
// If no path found (shouldn't happen)
return new ArrayList<>();
}
private static List<EditOperation> backtrack(char[] a, char[] b, List<Map<Integer, Integer>> trace, int x, int y) {
List<EditOperation> result = new ArrayList<>();
int d = trace.size() - 1;
while (d > 0) {
Map<Integer, Integer> v = trace.get(d);
int k = x - y;
// Determine previous k value
int prevK;
if (k == -d || (k != d && v.getOrDefault(k - 1, -1) < v.getOrDefault(k + 1, -1))) {
prevK = k + 1; // Previously moved down (insert)
} else {
prevK = k - 1; // Previously moved right (delete)
}
// Calculate previous position
int prevX = v.getOrDefault(prevK, 0);
int prevY = prevX - prevK;
// Handle diagonal moves (matches)
while (x > prevX && y > prevY) {
x--;
y--;
}
// Record edit operation
if (x == prevX) {
// Insert operation
result.add(0, new EditOperation('+', x, b[prevY]));
} else {
// Delete operation
result.add(0, new EditOperation('-', prevX, a[prevX]));
}
// Update position
x = prevX;
y = prevY;
d--;
}
return result;
}
}
Complexity Analysis¶
The complexity of Myers diff algorithm relates to edit distance and sequence length:
- Time complexity: O(ND)
- N is the total length of both sequences (N = len(A) + len(B))
- D is the edit distance (minimal edit operations)
- In worst case (completely different sequences), D may approach N, making time complexity near O(N²)
-
But in practice, usually D << N, making the algorithm very efficient
-
Space complexity: O(ND)
- Mainly for storing path information for each edit distance d
- Can be optimized by only storing necessary information
Detailed Algorithm Example¶
Let's understand Myers diff execution with a detailed example. Suppose we have two strings: - A = "ABCABBA" - B = "CBABAC"
Initial State¶
- Start point: (0,0)
- End point: (7,6)
- Initialize V array: V[1] = 0
Edit Distance d = 0¶
- Only consider k = 0 (main diagonal)
- Start from (0,0), cannot move diagonally (A[0] ≠ B[0])
- V[0] = 0
Edit Distance d = 1¶
- Consider k = -1 and k = 1
- k = -1:
- Move down to (0,1)
- Cannot move diagonally (A[0] ≠ B[1])
- V[-1] = 0
- k = 1:
- Move right to (1,0)
- Cannot move diagonally (A[1] ≠ B[0])
- V[1] = 1
Edit Distance d = 2¶
- Consider k = -2, k = 0, k = 2
- k = -2:
- Move down to (0,2)
- Cannot move diagonally
- V[-2] = 0
- k = 0:
- Choose to move right from k = -1 (because V[-1] = 0 < V[1] = 1)
- Reach (1,1)
- Cannot move diagonally (A[1] ≠ B[1])
- V[0] = 1
- k = 2:
- Move right to (2,0)
- Move diagonally to (3,1) (because A[2] = B[0] = 'C')
- V[2] = 3
Continue Iterating¶
- Continue calculating higher edit distances until finding path to endpoint
- Each iteration updates V array, recording furthest reachable point on each k-line
Final Path¶
Eventually, the algorithm finds a shortest path from (0,0) to (7,6). Through backtracking, we can construct the edit operation sequence:
- Delete A[0]: 'A'
- Match C
- Match B
- Match A
- Match B
- Match B
- Match A
- Insert C
This represents the minimal edit operations to transform "ABCABBA" to "CBABAC".
Animation Demo¶
Applications¶
- Version control systems: Git and other VCS use diff algorithms similar to Myers for comparing file versions
- Text comparison tools: Tools like diff, Beyond Compare
- Collaborative editing: Conflict resolution in real-time collaborative editors
- Bioinformatics: DNA sequence alignment
Exercises¶
-
Basic implementation: Implement Myers diff algorithm to compute edit operations between two strings.
-
Visualization: Create a simple visualization tool showing Myers algorithm execution and edit graph.
-
Optimization: Myers algorithm can consume significant memory for large sequences. Try implementing a space-optimized version storing only necessary information.
-
Application: Use Myers diff to implement a simple text comparison tool highlighting differences between two text files.
-
Extension: Modify Myers algorithm to handle replacement operations (changing one element to another) in addition to insertions and deletions.
-
Challenge: Implement a linear-space variant of Myers algorithm using O(N) space instead of O(ND), where N is total sequence length and D is edit distance.
References¶
- Myers, E. W. (1986). "An O(ND) Difference Algorithm and Its Variations". Algorithmica. 1 (1): 251–266.
- Miller, W.; Myers, E. W. (1985). "A File Comparison Program". Software: Practice and Experience. 15 (11): 1025–1040.
- Hunt, J. W.; McIlroy, M. D. (1976). "An Algorithm for Differential File Comparison". Computing Science Technical Report, Bell Laboratories 41.
- Bergroth, L., Hakonen, H., & Raita, T. (2000). "A survey of longest common subsequence algorithms". In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00).