Using A Generalized Instance Set for Automatic Text Categorization

Wai Lam,
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong,
Shatin,
Hong Kong,
wlam@se.cuhk.edu.hk

Chao Yang Ho,
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong,
Shatin,
Hong Kong,
cyho@se.cuhk.edu.hk

Abstract

We investigate several recent approaches to text categorization under the framework of similarity-based learning. They include two families of text categorization techniques, namely the k-nearest neighbor (k-NN) algorithm and linear classifiers. After identifying the strengths and weaknesses of each technique, we propose a new technique, known as the generalized instance set (GIS) algorithm, that unifies the strengths of k-NN and linear classifiers and adapts them to the characteristics of text categorization problems. We also explore some variants of our GIS approach. We have implemented our GIS algorithm, the ExpNet algorithm, and some linear classifiers. Extensive experiments have been conducted on two common document corpora, namely the OHSUMED collection and the Reuters-21578 collection. The results show that our new approach outperforms the latest k-NN approach and linear classifiers in all experiments.
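
As a rough illustration of the two families the abstract contrasts, the sketch below scores a document for one category first with a k-NN rule and then with a single linear prototype, both using similarity between term vectors. It is not the GIS algorithm, whose details are not given in this abstract. The term-count vectors, cosine similarity, the centroid-difference prototype, the toy training documents, and all function names (`vectorize`, `knn_score`, `prototype`, `linear_score`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, very simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_score(query, training, k=3):
    """k-NN family: sum the similarities of the positive documents
    found among the k training documents most similar to the query."""
    nearest = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    return sum(cosine(query, vec) for vec, is_positive in nearest if is_positive)


def prototype(training):
    """Linear family: one weight vector per category, here the mean positive
    vector minus the mean negative vector (an assumed, Rocchio-like choice)."""
    positives = [vec for vec, is_positive in training if is_positive]
    negatives = [vec for vec, is_positive in training if not is_positive]
    weights = {}
    for vec in positives:
        for term, w in vec.items():
            weights[term] = weights.get(term, 0.0) + w / len(positives)
    for vec in negatives:
        for term, w in vec.items():
            weights[term] = weights.get(term, 0.0) - w / len(negatives)
    return weights


def linear_score(query, weights):
    """Dot product of the query with the category's weight vector."""
    return sum(w * weights.get(term, 0.0) for term, w in query.items())


# Hypothetical toy training set for a single category ("grain").
TRAIN = [
    (vectorize("wheat corn harvest grain"), True),
    (vectorize("grain export wheat price"), True),
    (vectorize("stock market shares fall"), False),
    (vectorize("bank interest rate rise"), False),
]

query = vectorize("wheat grain price")
print("k-NN score:  ", knn_score(query, TRAIN, k=2))
print("linear score:", linear_score(query, prototype(TRAIN)))
```

In this sketch the k-NN rule keeps every training document and compares the query with each one, while the linear rule compresses the whole category into one vector. That is the general trade-off between the two families which the abstract says GIS is meant to address.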

SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.