Statistical Phrases for Vector-Space Information Retrieval

Andrew Turpin
Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3052, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3052, Australia.

Status

Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, San Francisco, August 1999, 309-310.

Abstract

When employing a vector-space model to evaluate a query against a document collection, several choices must be made. A fundamental design decision is the definition of the terms that form the dimensions of the space. Should the terms be single words, pairs of words, linguistic phrases, entire sentences, or some other combination of textual units? It seems intuitive that when calculating a measure of similarity between a natural language query text and natural language documents, some respect should be paid to word ordering. Complex terms such as phrases should, therefore, increase the precision of retrieval results. Recent work has, however, shown that this is not the case. In this abstract we describe experiments that further confirm that observation. Note that we are solely concerned with statistical phrases; that is, phrases derived using techniques other than natural language processing (NLP).
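
To make the design decision concrete, the following is a minimal sketch of how the choice of terms changes a vector-space similarity score. It is an illustration only, not the experimental setup of the paper: the function names (terms, cosine), the use of raw term counts as weights, the whitespace tokenizer, and the treatment of every adjacent word pair as a "statistical phrase" are all assumptions made for the example.

```python
import math
from collections import Counter


def terms(text, use_pairs=False):
    """Split text into terms: single words, plus adjacent word pairs if requested.

    Hypothetical helper: lowercased whitespace tokenization is an assumption.
    """
    words = text.lower().split()
    result = list(words)
    if use_pairs:
        # Every adjacent pair of words becomes an extra dimension of the space.
        result += [a + " " + b for a, b in zip(words, words[1:])]
    return result


def cosine(query, document, use_pairs=False):
    """Cosine similarity between raw term-count vectors (assumed weighting)."""
    q = Counter(terms(query, use_pairs))
    d = Counter(terms(document, use_pairs))
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


if __name__ == "__main__":
    query = "information retrieval"
    in_order = "information retrieval systems"
    reordered = "retrieval information systems"
    for use_pairs in (False, True):
        print(use_pairs,
              cosine(query, in_order, use_pairs),
              cosine(query, reordered, use_pairs))
```

With single-word terms the two documents score identically, because word order is invisible to the model; with word pairs added, the document that preserves the query's word order scores higher. Whether this sensitivity to ordering actually improves precision on real collections is the question the experiments address.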