Edleno Silva de Moura
Depto. de Ciência da Computação,
Univ. Federal de Minas Gerais, Brazil.
Gonzalo Navarro
Depto. de Ciencias de la Computación,
Univ. de Chile, Chile.
Nivio Ziviani
Depto. de Ciência da Computação,
Univ. Federal de Minas Gerais, Brazil.
Ricardo Baeza-Yates
Depto. de Ciencias de la Computación,
Univ. de Chile, Chile.
The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boyer-Moore-type filter. It concomitantly uses the vocabulary of the text available as part of the Huffman coding data. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is twice as fast as running agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than agrep. We also mention the impact of our technique in inverted files pointing to documents or logical blocks as Glimpse.
SIGIR'98