Skip to content

Improved Myers Diff Algorithm

Algorithm Introduction

The Myers diff algorithm is an efficient sequence comparison algorithm proposed by Eugene W. Myers in 1986. It finds the minimal edit operations (insertions and deletions) required to transform one sequence into another. This algorithm is widely used in version control systems (like Git), text comparison tools, and bioinformatics.

Algorithm Principles

The Myers algorithm is based on the concept of an edit graph, which finds the shortest path from the top-left to bottom-right corner to determine the minimal edit operations. Let's explain its core concepts in detail:

Edit Graph

Given two sequences A[1...N] and B[1...M], the edit graph is an (N+1) × (M+1) grid:

  • Each point (x,y) represents the state of comparing the first x characters of A with the first y characters of B
  • Starting from (0,0) (empty sequence comparison)
  • Goal is to reach (N,M) (complete sequence comparison)

In this graph: - Horizontal move (from (x,y) to (x+1,y)) represents deletion of A[x+1] - Vertical move (from (x,y) to (x,y+1)) represents insertion of B[y+1] - Diagonal move (from (x,y) to (x+1,y+1)) represents A[x+1] and B[y+1] matching (or substitution, though Myers algorithm only considers matches)

Edit Distance (d) and k-lines

The Myers algorithm introduces two key concepts:

  1. Edit distance d: The minimal number of edit operations (insertions and deletions) needed to reach a point from the start
  2. d = 0 means no edit operations (only diagonal moves)
  3. d = 1 means one edit operation (one insertion or one deletion)
  4. And so on

  5. k-lines: All points where x - y = k

  6. k = 0 represents the main diagonal (x = y)
  7. k > 0 represents diagonals above the main diagonal (x > y)
  8. k < 0 represents diagonals below the main diagonal (x < y)

Relationship Between k and d

There's an important mathematical relationship between k and d:

  1. k range: For a given edit distance d, possible k values range from -d ≤ k ≤ d
  2. This means for edit distance d, we only need to consider 2d+1 k-lines

  3. k parity: For a given d, k has the same parity as d

  4. If d is even, k must also be even
  5. If d is odd, k must also be odd

  6. Progressive relationship:

  7. When moving from edit distance d-1 to d, we only need to calculate new k values
  8. Each new k value is based on results from k-1 or k+1 at edit distance d-1

Core Algorithm Idea

The core idea of Myers algorithm is:

  1. For each edit distance d (starting from 0 and increasing)
  2. Calculate the furthest reachable point (maximum x value) on each k-line
  3. If we can reach the endpoint (N,M) at edit distance d, we've found the shortest edit path

This approach guarantees finding the path with minimal edit operations.

Detailed Algorithm Steps

  1. Initialization:
  2. Set starting point (0,0)
  3. Initialize V array to store the furthest x coordinate on each k-line

  4. Iterative search:

  5. For each edit distance d (from 0 upwards)
  6. For each k (from -d to d, step 2)

    • Determine whether to move down (vertical, insertion) or right (horizontal, deletion)
    • Calculate the furthest reachable point (x,y) on the k-line
    • Attempt to move diagonally (matches)
    • Check if endpoint (N,M) is reached
  7. Path backtracking:

  8. Backtrack from endpoint to start point
  9. Record each edit operation (insertion or deletion)

Illustrated Example

Let's understand Myers diff algorithm with a concrete example. Suppose we have two strings: - A = "ABCABBA" - B = "CBABAC"

Edit Graph Representation

We can represent these strings as an edit graph:

    |   | C | B | A | B | A | C
----+---+---+---+---+---+---+---
    | 0 | 1 | 2 | 3 | 4 | 5 | 6
----+---+---+---+---+---+---+---
  A | 1 |   |   | \ |   | \ |  
----+---+---+---+---+---+---+---
  B | 2 |   | \ |   | \ |   |  
----+---+---+---+---+---+---+---
  C | 3 | \ |   |   |   |   | \
----+---+---+---+---+---+---+---
  A | 4 |   |   | \ |   | \ |  
----+---+---+---+---+---+---+---
  B | 5 |   | \ |   | \ |   |  
----+---+---+---+---+---+---+---
  B | 6 |   | \ |   | \ |   |  
----+---+---+---+---+---+---+---
  A | 7 |   |   | \ |   | \ |  

Where "\" represents diagonal matches (A[i] = B[j]).

k-lines Illustration

In this edit graph, different k-lines are: - k = 0: Main diagonal, x = y - k = 1: x - y = 1 diagonal - k = -1: x - y = -1 diagonal - And so on

Edit Distance d Iteration

  1. d = 0: Only consider k = 0
  2. Start from (0,0), move along diagonal
  3. Cannot reach endpoint

  4. d = 1: Consider k = -1 and k = 1

  5. k = -1: Move down to (0,1), then attempt diagonal
  6. k = 1: Move right to (1,0), then attempt diagonal
  7. Still cannot reach endpoint

  8. d = 2, 3, ...:

  9. Continue iterations until finding path to endpoint

Final Path

Eventually, the algorithm finds a shortest path from (0,0) to (7,6), and the corresponding edit operation sequence is the minimal edit script.

Code Implementation

def myers_diff(a, b):
    """
    Compute the difference between two sequences using Myers diff algorithm

    Args:
        a: First sequence
        b: Second sequence

    Returns:
        List of edit operations where:
        - ('+', i, x) means insert element x at position i
        - ('-', i, x) means delete element x at position i
    """
    # Create edit graph
    n = len(a)
    m = len(b)
    max_edit = n + m

    # Store furthest x coordinate on each k-line
    v = {1: 0}
    # Store path
    trace = []

    # Compute shortest edit path
    for d in range(max_edit + 1):
        # Save current d's path info
        trace.append(v.copy())

        for k in range(-d, d + 1, 2):
            # Determine whether to move down or right
            if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
                x = v.get(k + 1, -1)  # Move down (insert)
            else:
                x = v.get(k - 1, -1) + 1  # Move right (delete)

            # Calculate y coordinate
            y = x - k

            # Move diagonally (matches)
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1

            # Update v
            v[k] = x

            # If reached endpoint, backtrack and return edit script
            if x >= n and y >= m:
                return _backtrack(a, b, trace, n, m)

    # If no path found (shouldn't happen)
    return []

def _backtrack(a, b, trace, x, y):
    """Backtrack to construct edit script"""
    result = []
    d = len(trace) - 1

    while d > 0:
        v = trace[d]
        k = x - y

        # Determine previous k value
        if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
            prev_k = k + 1  # Previously moved down (insert)
        else:
            prev_k = k - 1  # Previously moved right (delete)

        # Calculate previous position
        prev_x = v.get(prev_k, 0)
        prev_y = prev_x - prev_k

        # Handle diagonal moves (matches)
        while x > prev_x and y > prev_y:
            x -= 1
            y -= 1

        # Record edit operation
        if x == prev_x:
            # Insert operation
            result.insert(0, ('+', x, b[prev_y]))
        else:
            # Delete operation
            result.insert(0, ('-', prev_x, a[prev_x]))

        # Update position
        x = prev_x
        y = prev_y
        d -= 1

    return result

# Test
a = "ABCABBA"
b = "CBABAC"
diff = myers_diff(a, b)
print("Diff result:", diff)
import java.util.*;

public class MyersDiff {
    public static void main(String[] args) {
        String a = "ABCABBA";
        String b = "CBABAC";
        List<EditOperation> diff = myersDiff(a, b);
        System.out.println("Diff result:");
        for (EditOperation op : diff) {
            System.out.println(op);
        }
    }

    // Edit operation class
    static class EditOperation {
        char type; // '+' for insert, '-' for delete
        int position;
        char element;

        public EditOperation(char type, int position, char element) {
            this.type = type;
            this.position = position;
            this.element = element;
        }

        @Override
        public String toString() {
            return "(" + type + ", " + position + ", " + element + ")";
        }
    }

    public static List<EditOperation> myersDiff(String a, String b) {
        char[] aChars = a.toCharArray();
        char[] bChars = b.toCharArray();
        int n = aChars.length;
        int m = bChars.length;
        int maxEdit = n + m;

        // Store furthest x coordinate on each k-line
        Map<Integer, Integer> v = new HashMap<>();
        v.put(1, 0);

        // Store path
        List<Map<Integer, Integer>> trace = new ArrayList<>();

        // Compute shortest edit path
        for (int d = 0; d <= maxEdit; d++) {
            // Save current d's path info
            trace.add(new HashMap<>(v));

            for (int k = -d; k <= d; k += 2) {
                // Determine whether to move down or right
                int x;
                if (k == -d || (k != d && v.getOrDefault(k - 1, -1) < v.getOrDefault(k + 1, -1))) {
                    x = v.getOrDefault(k + 1, -1);  // Move down (insert)
                } else {
                    x = v.getOrDefault(k - 1, -1) + 1;  // Move right (delete)
                }

                // Calculate y coordinate
                int y = x - k;

                // Move diagonally (matches)
                while (x < n && y < m && aChars[x] == bChars[y]) {
                    x++;
                    y++;
                }

                // Update v
                v.put(k, x);

                // If reached endpoint, backtrack and return edit script
                if (x >= n && y >= m) {
                    return backtrack(aChars, bChars, trace, n, m);
                }
            }
        }

        // If no path found (shouldn't happen)
        return new ArrayList<>();
    }

    private static List<EditOperation> backtrack(char[] a, char[] b, List<Map<Integer, Integer>> trace, int x, int y) {
        List<EditOperation> result = new ArrayList<>();
        int d = trace.size() - 1;

        while (d > 0) {
            Map<Integer, Integer> v = trace.get(d);
            int k = x - y;

            // Determine previous k value
            int prevK;
            if (k == -d || (k != d && v.getOrDefault(k - 1, -1) < v.getOrDefault(k + 1, -1))) {
                prevK = k + 1;  // Previously moved down (insert)
            } else {
                prevK = k - 1;  // Previously moved right (delete)
            }

            // Calculate previous position
            int prevX = v.getOrDefault(prevK, 0);
            int prevY = prevX - prevK;

            // Handle diagonal moves (matches)
            while (x > prevX && y > prevY) {
                x--;
                y--;
            }

            // Record edit operation
            if (x == prevX) {
                // Insert operation
                result.add(0, new EditOperation('+', x, b[prevY]));
            } else {
                // Delete operation
                result.add(0, new EditOperation('-', prevX, a[prevX]));
            }

            // Update position
            x = prevX;
            y = prevY;
            d--;
        }

        return result;
    }
}

Complexity Analysis

The complexity of Myers diff algorithm relates to edit distance and sequence length:

  • Time complexity: O(ND)
  • N is the total length of both sequences (N = len(A) + len(B))
  • D is the edit distance (minimal edit operations)
  • In worst case (completely different sequences), D may approach N, making time complexity near O(N²)
  • But in practice, usually D << N, making the algorithm very efficient

  • Space complexity: O(ND)

  • Mainly for storing path information for each edit distance d
  • Can be optimized by only storing necessary information

Detailed Algorithm Example

Let's understand Myers diff execution with a detailed example. Suppose we have two strings: - A = "ABCABBA" - B = "CBABAC"

Initial State

  • Start point: (0,0)
  • End point: (7,6)
  • Initialize V array: V[1] = 0

Edit Distance d = 0

  • Only consider k = 0 (main diagonal)
  • Start from (0,0), cannot move diagonally (A[0] ≠ B[0])
  • V[0] = 0

Edit Distance d = 1

  • Consider k = -1 and k = 1
  • k = -1:
  • Move down to (0,1)
  • Cannot move diagonally (A[0] ≠ B[1])
  • V[-1] = 0
  • k = 1:
  • Move right to (1,0)
  • Cannot move diagonally (A[1] ≠ B[0])
  • V[1] = 1

Edit Distance d = 2

  • Consider k = -2, k = 0, k = 2
  • k = -2:
  • Move down to (0,2)
  • Cannot move diagonally
  • V[-2] = 0
  • k = 0:
  • Choose to move right from k = -1 (because V[-1] = 0 < V[1] = 1)
  • Reach (1,1)
  • Cannot move diagonally (A[1] ≠ B[1])
  • V[0] = 1
  • k = 2:
  • Move right to (2,0)
  • Move diagonally to (3,1) (because A[2] = B[0] = 'C')
  • V[2] = 3

Continue Iterating

  • Continue calculating higher edit distances until finding path to endpoint
  • Each iteration updates V array, recording furthest reachable point on each k-line

Final Path

Eventually, the algorithm finds a shortest path from (0,0) to (7,6). Through backtracking, we can construct the edit operation sequence:

  1. Delete A[0]: 'A'
  2. Match C
  3. Match B
  4. Match A
  5. Match B
  6. Match B
  7. Match A
  8. Insert C

This represents the minimal edit operations to transform "ABCABBA" to "CBABAC".

Animation Demo

Applications

  1. Version control systems: Git and other VCS use diff algorithms similar to Myers for comparing file versions
  2. Text comparison tools: Tools like diff, Beyond Compare
  3. Collaborative editing: Conflict resolution in real-time collaborative editors
  4. Bioinformatics: DNA sequence alignment

Exercises

  1. Basic implementation: Implement Myers diff algorithm to compute edit operations between two strings.

  2. Visualization: Create a simple visualization tool showing Myers algorithm execution and edit graph.

  3. Optimization: Myers algorithm can consume significant memory for large sequences. Try implementing a space-optimized version storing only necessary information.

  4. Application: Use Myers diff to implement a simple text comparison tool highlighting differences between two text files.

  5. Extension: Modify Myers algorithm to handle replacement operations (changing one element to another) in addition to insertions and deletions.

  6. Challenge: Implement a linear-space variant of Myers algorithm using O(N) space instead of O(ND), where N is total sequence length and D is edit distance.

References

  1. Myers, E. W. (1986). "An O(ND) Difference Algorithm and Its Variations". Algorithmica. 1 (1): 251–266.
  2. Miller, W.; Myers, E. W. (1985). "A File Comparison Program". Software: Practice and Experience. 15 (11): 1025–1040.
  3. Hunt, J. W.; McIlroy, M. D. (1976). "An Algorithm for Differential File Comparison". Computing Science Technical Report, Bell Laboratories 41.
  4. Bergroth, L., Hakonen, H., & Raita, T. (2000). "A survey of longest common subsequence algorithms". In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00).


📅 Created 0 days ago ✏️ Updated 0 days ago