Improving Two-Stage Ad-Hoc Retrieval for Short Queries


K. L. Kwok
Computer Science Department, Queens College, CUNY, Flushing, NY 11367, USA.

M. Chan
Computer Science Department, Queens College, CUNY, Flushing, NY 11367, USA.


Abstract

Short queries in an ad-hoc retrieval environment are difficult but unavoidable. We present several methods that aim to improve our current strategy of 2-stage pseudo-relevance feedback retrieval in this situation: 1) avtf query term weighting, 2) a variable high-frequency Zipfian threshold, 3) collection enrichment, 4) enhancing term variety in raw queries, and 5) using retrieved-document local term statistics. Avtf employs collection statistics to weight terms in short queries. The variable high-frequency threshold defines and ignores statistical stopwords based on query length. Collection enrichment adds other collections to the one under investigation so as to improve the chance of ranking more relevant documents in the top n for the pseudo-feedback process. Enhancing term variety in raw queries tries to find highly associated terms in a set of documents that is domain-related to the query; making the query longer in this way may improve 1st stage retrieval. Retrieved-document local statistics re-weight terms in the 2nd stage using the set of domain-related documents rather than the whole collection used during the initial stage. Experiments were performed using the TREC-5 and TREC-6 environments. We find that, together, these methods perform well for the difficult TREC-5 topics and also work for the very short TREC-6 topics.
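
As an illustration of the general 2-stage pseudo-relevance feedback idea that these methods build on, the following minimal sketch may help. It is not the authors' system: the tf-idf scoring, the idf formula, the function names, and the parameters top_n, n_expand and expand_weight are all assumptions made for this example, and none of the five methods listed above (avtf, the Zipfian threshold, collection enrichment, term variety, local statistics) is implemented. The sketch ranks documents with the raw query, treats the top n as if they were relevant, adds their strongest terms to the query, and ranks again.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def idf_table(docs):
    # Assumed idf form log(1 + N/df); the source does not specify one.
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(tokenize(d)))
    return {t: math.log(1.0 + n / c) for t, c in df.items()}


def rank(query_weights, docs, idf):
    # Score each document by a weighted tf-idf sum; ties keep index order.
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(tokenize(d))
        s = sum(w * tf[t] * idf.get(t, 0.0) for t, w in query_weights.items())
        scored.append((s, i))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in scored]


def two_stage_retrieve(query, docs, top_n=1, n_expand=2, expand_weight=0.5):
    idf = idf_table(docs)

    # Stage 1: retrieve with the raw (short) query.
    weights = dict(Counter(tokenize(query)))
    first = rank(weights, docs, idf)

    # Pseudo-feedback: assume the top_n documents are relevant and
    # collect tf-idf evidence for terms not already in the query.
    evidence = Counter()
    for i in first[:top_n]:
        for t, c in Counter(tokenize(docs[i])).items():
            if t not in weights:
                evidence[t] += c * idf[t]
    best = sorted(evidence.items(), key=lambda p: (-p[1], p[0]))[:n_expand]

    # Stage 2: retrieve again with the expanded query.
    expanded = dict(weights)
    for t, _ in best:
        expanded[t] = expand_weight
    second = rank(expanded, docs, idf)
    return first, second


docs = [
    "short queries retrieval feedback",
    "cooking pasta recipes",
    "retrieval feedback expansion terms",
    "expansion terms weighting",
]
first, second = two_stage_retrieve("short queries", docs)
print(first, second)
```

In this toy collection the third document shares no term with the query, so the first stage cannot separate it from the unrelated ones; after expansion with terms from the top-ranked document it moves up in the second-stage ranking. The same mechanism explains why the quality of the top n documents matters for the feedback step.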


SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.