Oren Zamir
Department of Computer Science and Engineering,
University of Washington, Box 352350
Seattle, WA, 98195-2350
USA.
Oren Etzioni
Department of Computer Science and Engineering,
University of Washington, Box 352350
Seattle, WA, 98195-2350
USA.
The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents.
To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
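To make the idea of clustering on shared phrases concrete, the following is a minimal sketch and not the STC algorithm itself. It assumes whitespace tokenization and enumerates short word n-grams in a dictionary rather than building a suffix tree, so it does not carry STC's performance characteristics. The function name `base_clusters`, its parameters `max_len` and `min_docs`, and the sample snippets are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def base_clusters(snippets, max_len=3, min_docs=2):
    """Group snippets by the word phrases they share.

    Hypothetical helper: a naive n-gram index standing in for a suffix tree.
    Returns a dict mapping each phrase (a tuple of words) to the set of
    snippet indices that contain it, keeping only phrases that occur in at
    least `min_docs` snippets.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for start in range(len(words)):
            # Record every phrase of 1..max_len words starting at `start`.
            for length in range(1, max_len + 1):
                if start + length > len(words):
                    break
                phrase = tuple(words[start:start + length])
                phrase_docs[phrase].add(doc_id)
    # A phrase defines a candidate cluster only if several snippets share it.
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}


# Illustrative input only; these snippets are made up for the example.
snippets = [
    "suffix tree clustering of web documents",
    "web documents and suffix tree methods",
    "cooking pasta at home",
]
for phrase, docs in sorted(base_clusters(snippets).items()):
    print(" ".join(phrase), sorted(docs))
```

In this sketch the first two snippets are grouped under phrases such as "suffix tree" and "web documents", while the third shares no phrase with the others and joins no cluster. Each snippet is processed independently of the rest, which loosely mirrors the incremental flavor described above, but merging overlapping clusters and ranking them are left out.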
SIGIR'98