An analytical comparison of two string searching algorithms (Q800732)

A problem frequently encountered in text processing is to search for occurrences of a string PATTERN as a substring in a string TEXT. A straightforward solution for this problem aligns TEXT and PATTERN side by side and compares corresponding characters. As soon as a mismatch is detected, PATTERN is shifted one position towards the right end of TEXT, and the search is resumed at the first character of PATTERN. Obviously, this procedure may perform \(O(n\cdot m)\) steps, n and m denoting the respective lengths of TEXT and PATTERN. A clever algorithm for substring searching with an \(O(n+m)\) worst case complexity has been developed by \textit{V. R. Pratt, D. E. Knuth} and \textit{J. H. Morris jun.} [SIAM J. Comput. 6, 323-350 (1977; Zbl 0372.68005)] (hereafter referred to as KMP). The basic idea of this method is never to retract within TEXT to the left. Instead, after a mismatch between TEXT[tp] and PATTERN[pp] has been detected, the search is resumed by comparing TEXT[tp] and PATTERN[next(pp)]. Next is a table of length m, whose entries may be computed by O(m) steps. Therefore, the \(O(n+m)\) worst-case time complexity of KMP compares favorably with the \(O(n\cdot m)\) performance of the naive method. In this paper, the average case time complexities of the two strategies are investigated. The analyses center around Markov chain techniques. It is demonstrated that both methods can be conveniently modeled by absorbing Markov chains. Exploiting some fundamental results from Markov chain theory, it is shown that for an alphabet of size c, the ratio of the average case performances of KMP and the naive algorithm is very close to \(1-(1/c)+(1/c^ 2).\) This result justifies the assumption that only a marginal improvement may be expected when using KMP instead of the straightforward method. The technique employed in the paper is interesting in its own right, since it is applicable to a large class of combinatorial problems.

0 references

zbMATH Keywords

analysis of combinatorial algorithms

0 references

average case time complexity

0 references

pattern matching

0 references

text processing

0 references

substring searching

0 references