L. Douglas Baker
Department of Computer Science,
Carnegie Mellon University,
5000 Forbes Ave.,
Pittsburgh, PA 15213, USA.
and
Justsystem Pittsburgh Research Center,
4616 Henry St.,
Pittsburgh, PA 15213, USA.
Andrew Kachites McCallum
Justsystem Pittsburgh Research Center,
4616 Henry St.,
Pittsburgh, PA 15213, USA.
and
Department of Computer Science,
Carnegie Mellon University,
5000 Forbes Ave.,
Pittsburgh, PA 15213, USA.
Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude while losing only 2% accuracy. This is significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection. We also show that less aggressive clustering sometimes yields higher classification accuracy than classification without clustering.