String Matching: Beyond the Naive Approach
Finding a pattern $P$of length$M$within a text$T$of length$N$is a fundamental problem. While the naive approach takes$O(N \cdot M)$, optimized algorithms achieve linear or sub-linear performance by exploiting the internal structure of the pattern.
1. Knuth-Morris-Pratt (KMP)
KMP avoids re-scanning characters by pre-processing the pattern to find the **Longest Proper Prefix which is also a Suffix (LPS)**.
1.1 The LPS Array
For every position$i$in$P$,$LPS[i]$stores the length of the longest proper prefix of$P[0 \dots i]$that is also a suffix of$P[0 \dots i]$.
**Example:**$P = \text{"ABABC"}$* `A`: 0
* `AB`: 0
* `ABA`: 1 (`A` matches `A`)
* `ABAB`: 2 (`AB` matches `AB`)
* `ABABC`: 0
1.2 The Search Logic
When a mismatch occurs at$P[j]$and$T[i]$, we do not reset$i$. Instead, we use the LPS array to "shift" the pattern:$j = LPS[j-1]$. This ensures we never back-track the text pointer, guaranteeing **$O(N+M)$** time complexity.
2. Boyer-Moore
Boyer-Moore is often faster than KMP in practice because it frequently skips large sections of the text. It scans the pattern from **right to left**.
2.1 The Bad Character Heuristic
When a mismatch occurs at$T[i] = c$:
* If$c$is not in the pattern, we shift the pattern past$i$.
* If$c$is in the pattern, we shift the pattern to align the rightmost occurrence of$c$in$P$with$T[i]$.
2.2 The Good Suffix Heuristic
If a suffix of the pattern has already matched before the mismatch, we shift the pattern to align that matched suffix with its next occurrence in the pattern.
2.3 Performance
* **Average Case:**$O(N/M)$(Sub-linear).
* **Worst Case:**$O(N \cdot M)$(though$O(N+M)$variants exist).
3. Rabin-Karp: Rolling Hashes
Rabin-Karp uses a **Rolling Hash** to find potential matches.
1. Compute hash$H_P$of the pattern.
2. Compute hash$H_T$of the current text window.
3. If$H_T = H_P$, verify the match character-by-character.
4. Update$H_T$in$O(1)$time by subtracting the outgoing character and adding the incoming one.
**Best For:** Multiple pattern matching or searching in data streams.
Summary
| Algorithm | Complexity (Avg) | Scan Direction | Key Mechanic |
| :--- | :--- | :--- | :--- |
| **KMP** |$O(N+M)$| Left-to-Right | LPS Array (Prefix/Suffix) |
| **Boyer-Moore** |$O(N/M)$| Right-to-Left | Bad Character / Good Suffix |
| **Rabin-Karp** |$O(N+M)$ | Left-to-Right | Rolling Hash |