Compressed pattern matching

In computer science, compressed pattern matching (abbreviated as CPM) is the process of searching for patterns in compressed data with little or no decompression. Because the compressed representation is shorter than the original, searching it directly can be faster and can require less space than searching the uncompressed string.

Compressed matching problem

A compressed file that uses a variable-width encoding can cause a problem. For example, let “100” be the codeword for a and let “110100” be the codeword for b. A search for an occurrence of a in the encoded text may also report an occurrence that lies inside the codeword of b (its last three bits); such an occurrence is called a false match. Each detected occurrence must therefore be checked to confirm that it is aligned on a codeword boundary. One could always decode the entire text and then apply a classic string-matching algorithm, but this usually requires more space and time and is often not possible, for example when the compressed file is hosted online. The compressed matching problem is the problem of deciding whether a match returned by a compressed pattern matching algorithm is true or false when the entire text cannot be decoded.^[1]
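The false match in the example above can be reproduced with a short Python sketch. It assumes the two codewords given in the text, treats the encoded file as a string of '0' and '1' characters, and records where each codeword starts while encoding; the function names (encode, find_all, classify_matches) are illustrative and do not come from any cited work. The alignment test is sufficient here only because the code is prefix-free and the pattern is a single symbol.

```python
# Codewords taken from the example in the text.
CODE = {"a": "100", "b": "110100"}


def encode(text, code=CODE):
    """Return the encoded bit string and the start offset of every codeword."""
    bits, starts, pos = [], [], 0
    for symbol in text:
        word = code[symbol]
        starts.append(pos)
        bits.append(word)
        pos += len(word)
    return "".join(bits), starts


def find_all(haystack, needle):
    """Yield every offset (overlaps included) at which needle occurs."""
    i = haystack.find(needle)
    while i != -1:
        yield i
        i = haystack.find(needle, i + 1)


def classify_matches(text, symbol, code=CODE):
    """Split raw bit-level hits into true (aligned) and false (unaligned) matches."""
    bits, starts = encode(text, code)
    boundaries = set(starts)
    true_hits, false_hits = [], []
    for offset in find_all(bits, code[symbol]):
        if offset in boundaries:
            true_hits.append(offset)
        else:
            false_hits.append(offset)
    return true_hits, false_hits


if __name__ == "__main__":
    # "ba" encodes to 110100 100: bit 3 is inside b, bit 6 starts the real a.
    print(classify_matches("ba", "a"))
```

For the text "ba", the search for “100” hits bit offsets 3 and 6; only offset 6 coincides with a codeword start, so offset 3 is the false match.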

Strategies

Many strategies exist for finding the boundaries of codewords and avoiding full decompression of the text, for example:

A list of the indices of the first bit of each codeword, on which a binary search can be applied;
The same list of indices stored with differential coding, so that it takes less space within the file;
A bit mask in which a 1 bit marks the first bit of each codeword;
Subdivision into blocks, so that only the relevant part of the text needs to be decompressed.
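The four strategies listed above can each be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation taken from the cited literature: the list of codeword start offsets is assumed to be sorted and already available, the bit mask is held in a Python integer, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def is_boundary_sorted(starts, offset):
    """Index list: binary search in the sorted list of codeword start offsets."""
    i = bisect_left(starts, offset)
    return i < len(starts) and starts[i] == offset


def delta_encode(starts):
    """Differential coding: store the gap to the previous start, not the offset."""
    return [cur - prev for prev, cur in zip([0] + starts[:-1], starts)]


def delta_decode(gaps):
    """Recover the absolute start offsets by summing the gaps."""
    return list(accumulate(gaps))


def boundary_mask(starts):
    """Bit mask: bit k of the integer is 1 when a codeword starts at offset k."""
    mask = 0
    for s in starts:
        mask |= 1 << s
    return mask


def is_boundary_mask(mask, offset):
    """Test a single offset against the bit mask."""
    return (mask >> offset) & 1 == 1


def block_containing(block_starts, offset):
    """Blocks: index of the block holding offset, so only it needs decoding.

    block_starts is assumed sorted, with the first block starting at offset 0.
    """
    return bisect_right(block_starts, offset) - 1


if __name__ == "__main__":
    starts = [0, 6, 9]  # codeword starts for the encoded text "baa"
    print(is_boundary_sorted(starts, 6), is_boundary_sorted(starts, 3))
    print(delta_encode(starts))
    print(is_boundary_mask(boundary_mask(starts), 9))
    print(block_containing([0, 6, 12], 7))
```

The index list costs a logarithmic-time lookup per candidate match, while the bit mask answers in constant time at the price of one extra bit per bit of compressed text; the differential form trades lookup speed for smaller stored numbers, since the gaps must be summed before they can be searched.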

Algorithms have been introduced whose running time grows logarithmically with the length of the string and of the pattern.^[2]

References

[1] Joel Grus (2019). Data Science from Scratch: First Principles with Python. ISBN 9781491901427.
[2] Artur Jeż (2013-06-25). "Faster fully compressed pattern matching by recompression". arxiv.org.

Shmuel T. Klein and Dana Shapira (2003). "Pattern matching in Huffman encoded texts".
Marek Karpinski, Wojciech Rytter and Ayumi Shinohara (1997). "An efficient pattern-matching algorithm for strings with short descriptions". Nordic Journal of Computing 4(2): pp. 172-186.

External links

"Almost optimal fully LZW-compressed pattern matching". CiteSeerX 10.1.1.44.5521.
A Dictionary-based Compressed Pattern Matching Algorithm (PDF), archived from the original (PDF) on March 13, 2003
"A unifying framework for compressed pattern matching". CiteSeerX 10.1.1.50.1745.
"Speeding Up String Pattern Matching by Text Compression: The Dawn of a New Era" (PDF). Archived from the original (PDF) on 2007-08-08. Retrieved 2009-03-22.
"Shift-and approach to pattern matching in LZW compressed text". CiteSeerX 10.1.1.15.4609.
"LZW Algorithm" (PDF).
