DL-Distance ≤1 Pattern Matching with Suffix Trees in Python

Introduction to DL-Distance ≤1 Multiple Pattern Matching

In bioinformatics, plagiarism detection, and even in AI-powered code assistants, finding approximate matches of patterns in large text collections is crucial. The Damerau-Levenshtein (DL) distance allows for insertions, deletions, substitutions, and transpositions of adjacent characters. When the threshold is ≤1, we can efficiently use a suffix tree to locate all occurrences of multiple patterns across multiple texts. This tutorial will guide you through building a suffix tree from scratch in Python and using it to perform DL-distance ≤1 searches, following the FIT3155 assignment 2 p0 specification.

By the end, you'll be able to handle any number of text files and pattern files, outputting results in the required format. We'll also discuss space-time efficiency and how to avoid common pitfalls.

Understanding the Problem

Given a collection of text files and multiple pattern files, we need to report all positions where each pattern matches a substring of any text with DL-distance ≤1. The suffix tree is the primary data structure. Input files contain 7-bit ASCII printable characters (codes 32–126) and whitespace (8–13, 32). The '$' symbol is reserved and does not appear in input.

The program accepts a run-configuration file that lists text filenames and pattern filenames. Output is written to output_a2.txt with lines formatted as:

<pattern_number> <text_number> <position>

Positions are 1-based. The order of lines does not need to be sorted.

Suffix Tree Construction

A suffix tree for a string S of length n is a compressed trie of all suffixes of S. It can be built in O(n) time using Ukkonen's algorithm, but for simplicity and educational purposes, we'll build a simpler but still efficient version using Python dictionaries. However, the assignment requires using only standard libraries; we can implement a suffix tree as a class with nodes.

Here's a basic node structure:

class SuffixTreeNode:
    def __init__(self):
        self.children = {}  # mapping from character to node
        self.start = -1
        self.end = -1
        self.suffix_link = None

We'll build the tree for each text separately, then search all patterns against each tree. To handle DL-distance ≤1, we'll use a recursive traversal that allows up to one edit operation.

DL-Distance ≤1 Search Algorithm

The search algorithm explores the suffix tree while tracking the current node, the position in the pattern, and the number of edits used (0 or 1). At each step, we consider:

Exact match: follow the edge corresponding to the next pattern character.
Insertion: skip one character in the pattern (increase pattern index, keep node).
Deletion: skip one character in the text (follow an edge, but do not consume pattern character).
Substitution: match a different character (consume pattern character, follow edge with different char).
Transposition: swap two adjacent pattern characters and match both.

We must ensure we do not exceed one edit. When a pattern is fully matched (all characters consumed), we report all leaf positions under the current node. To collect positions efficiently, each node can store a list of starting indices (suffix indices) or we can traverse the subtree.

Implementation Steps

1. Reading the Configuration File

The configuration file has two lines: the first lists text filenames, the second lists pattern filenames. Use sys.argv[1] to get the config file path. Read lines and split by whitespace.

import sys
with open(sys.argv[1], 'r') as f:
    lines = f.readlines()
text_files = lines[0].strip().split()
pattern_files = lines[1].strip().split()

2. Building Suffix Trees for Each Text

For each text file, read its content and append a unique terminator (e.g., '$') that doesn't appear in the text. Build the suffix tree. You can implement a simple O(n^2) builder for small inputs, but for efficiency, consider a linear-time algorithm. Since the assignment allows standard data structures, we can use a dictionary-based tree with explicit nodes.

3. Searching Patterns

For each pattern file, read the pattern string. For each text's suffix tree, perform the DL-distance ≤1 search. Collect all positions where the pattern matches. Write results to output_a2.txt.

Example Walkthrough

Suppose we have a text file text1.txt containing hello world and a pattern file pattern1.txt containing helo (DL-distance 1: deletion of 'l'). The suffix tree for hello world$ should find the pattern at position 1 (1-based). Let's trace: pattern h e l o. At root, follow 'h' -> 'e' -> 'l' (first l). But pattern has 'o' next; we have a deletion: skip one text character (the second 'l') and match 'o' with 'o'? Actually, we need to allow one edit. One possible path: match 'h','e','l' then delete one text character (skip the second 'l') and then match 'o' with 'o'? Wait, pattern has 'helo', text has 'hello'. After matching 'hel', we have pattern index 3 (next char 'o'), text has 'l' then 'o'. We can perform a substitution? No, we need deletion of text char: skip the text 'l' (not consume pattern char) and then match 'o' with 'o'. That uses one deletion. So position 1 is valid.

Optimization Tips

Use suffix links to speed up construction and search, but they are not strictly required for correctness.
Store suffix indices at nodes to quickly report positions without traversing the entire subtree.
For multiple patterns, you can precompute a trie of patterns and search simultaneously, but the assignment expects per-pattern search.
Memory usage: each node can be a dict; for large texts, consider using arrays for efficiency.

Handling Edge Cases

Patterns may be longer than the text; then no match. Also, patterns may contain characters not in the text; the search will fail early. Ensure that the terminator '$' is never part of the pattern (it's reserved). The output positions are 1-based, so adjust your internal 0-based indexing.

Testing Your Implementation

Create small test files: config.txt with text1.txt pattern1.txt. Write a text file with a known string and patterns that match with 0 or 1 edits. Verify output manually. Also test with multiple texts and patterns.

Connecting to Real-World Trends

This technique is used in spell-checkers like those in modern smartphones, plagiarism detection tools, and even in AI-driven code review systems such as GitHub Copilot that suggest fixes for typos. The DL-distance ≤1 constraint is common in autocorrect features where only one error is allowed. Understanding suffix trees gives you a powerful tool for building such applications efficiently.

Conclusion

In this tutorial, we covered the fundamentals of implementing a suffix tree for DL-distance ≤1 multiple pattern matching. You learned how to construct the tree, perform approximate searches, and output results. This knowledge is directly applicable to your FIT3155 assignment and to real-world problems in text processing. Remember to follow the submission guidelines: include your name and student ID, name the script a2.py, and submit a zip file with the required contents.

Happy coding!