String Matching: Beyond the Naive Approach

Finding a pattern $P$ of length $M$ within a text $T$ of length $N$ is a fundamental problem. While the naive approach takes $O(N \cdot M)$ , optimized algorithms achieve linear or sub-linear performance by exploiting the internal structure of the pattern.

1. Knuth-Morris-Pratt (KMP)

KMP avoids re-scanning characters by pre-processing the pattern to find the Longest Proper Prefix which is also a Suffix (LPS).

1.1 The LPS Array

For every position $i$ in $P$ , $LPS[i]$ stores the length of the longest proper prefix of $P[0 \dots i]$ that is also a suffix of $P[0 \dots i]$ .

Example: $P = \text{"ABABC"}$ * A: 0

AB: 0
ABA: 1 (A matches A)
ABAB: 2 (AB matches AB)
ABABC: 0

1.2 The Search Logic

When a mismatch occurs at $P[j]$ and $T[i]$ , we do not reset $i$ . Instead, we use the LPS array to "shift" the pattern: $j = LPS[j-1]$ . This ensures we never back-track the text pointer, guaranteeing $O(N+M)$ time complexity.

2. Boyer-Moore

Boyer-Moore is often faster than KMP in practice because it frequently skips large sections of the text. It scans the pattern from right to left.

2.1 The Bad Character Heuristic

When a mismatch occurs at $T[i] = c$ :

If $c$ is not in the pattern, we shift the pattern past $i$ .
If $c$ is in the pattern, we shift the pattern to align the rightmost occurrence of $c$ in $P$ with $T[i]$ .

2.2 The Good Suffix Heuristic

If a suffix of the pattern has already matched before the mismatch, we shift the pattern to align that matched suffix with its next occurrence in the pattern.

2.3 Performance

Average Case: $O(N/M)$ (Sub-linear).
Worst Case: $O(N \cdot M)$ (though $O(N+M)$ variants exist).

3. Rabin-Karp: Rolling Hashes

Rabin-Karp uses a Rolling Hash to find potential matches.

Compute hash $H_P$ of the pattern.
Compute hash $H_T$ of the current text window.
If $H_T = H_P$ , verify the match character-by-character.
Update $H_T$ in $O(1)$ time by subtracting the outgoing character and adding the incoming one.

Best For: Multiple pattern matching or searching in data streams.

Summary

Algorithm	Complexity (Avg)	Scan Direction	Key Mechanic
KMP	$O(N+M)$	Left-to-Right	LPS Array (Prefix/Suffix)
Boyer-Moore	$O(N/M)$	Right-to-Left	Bad Character / Good Suffix
Rabin-Karp	$O(N+M)$	Left-to-Right	Rolling Hash