SIGIR'98 papers: Distributional Clustering of Words for Text Classification

Distributional Clustering of Words for Text Classification

L. Douglas Baker
Department of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA.
and
Justsystem Pittsburgh Research Center, 4616 Henry St., Pittsburgh, PA 15213, USA.

Andrew Kachites McCallum
Justsystem Pittsburgh Research Center, 4616 Henry St., Pittsburgh, PA 15213, USA.
and
Department of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA.

Abstract

This paper describes the application of Distributional Clustering to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Because the clustering uses class labels, it can compress the feature space much more aggressively than some unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, while still maintaining high document classification accuracy.
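
To make the idea of clustering words by their class-label distributions concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes add-one smoothing for the per-word estimates of P(class | word), Jensen-Shannon divergence as the distance between two distributions, and a greedy pairwise merge; the abstract specifies none of these. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def class_distributions(docs, classes):
    """Estimate P(class | word) from (tokens, label) pairs.

    Add-one smoothing is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    dists = {}
    for word, c in counts.items():
        total = sum(c.values()) + len(classes)
        dists[word] = [(c[k] + 1) / total for k in classes]
    return dists


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed distance measure)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_words(dists, n_clusters):
    """Greedily merge the two closest clusters until n_clusters remain."""
    # Each cluster is (member words, class distribution of the cluster).
    clusters = [([w], list(d)) for w, d in sorted(dists.items())]
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: js_divergence(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        (wi, di), (wj, dj) = clusters[i], clusters[j]
        ni, nj = len(wi), len(wj)
        # Merged distribution: average weighted by number of member words.
        merged = (wi + wj, [(a * ni + b * nj) / (ni + nj) for a, b in zip(di, dj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(words) for words, _ in clusters]


# Hypothetical toy corpus: (tokens, class label) pairs.
toy_docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "goal"], "sports"),
    (["vote", "senate"], "politics"),
    (["senate", "vote", "bill"], "politics"),
]
toy_classes = ["sports", "politics"]
word_clusters = cluster_words(class_distributions(toy_docs, toy_classes), 2)
```

In this sketch each resulting cluster would replace its member words as a single feature, which is how the feature space shrinks. The all-pairs search is quadratic in the number of clusters per merge, so it only suits small vocabularies.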

Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, a significantly better result than that of Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.