SIGIR'98 papers: Using A Generalized Instance Set for Automatic Text
Categorization
Wai Lam,
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong,
Shatin,
Hong Kong,
wlam@se.cuhk.edu.hk
Chao Yang Ho,
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong,
Shatin,
Hong Kong,
cyho@se.cuhk.edu.hk
Abstract
We investigate several recent approaches to text categorization
under the framework of similarity-based learning.
They include two families of text categorization
techniques, namely the k-nearest neighbor (k-NN) algorithm and linear
classifiers. After identifying the weaknesses and strengths of each
technique, we propose a new technique known as the generalized instance
set (GIS) algorithm, which unifies the strengths of k-NN and linear
classifiers and adapts to the characteristics of text categorization problems.
We also explore some variants of our GIS approach.
We have implemented our GIS algorithm, the ExpNet algorithm, and
some linear classifiers.
Extensive experiments have been conducted on two common document corpora,
namely the OHSUMED collection and the Reuters-21578 collection.
The results show that our new approach outperforms the
latest k-NN approach and linear classifiers in all
experiments.
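
To make the two families named above concrete, the following is a minimal sketch of similarity-based scoring for a single category. It is not the GIS algorithm and is not taken from the paper. It assumes raw term-frequency vectors and cosine similarity, and it uses the centroid of the positive training documents as a simplified stand-in for a linear classifier's prototype. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term frequencies (an assumed weighting)."""
    return dict(Counter(text.lower().split()))

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(doc, train, k=3):
    """k-NN family: sum the similarities of the positive documents
    among the k training documents most similar to doc."""
    nearest = sorted(train, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    return sum(cosine(doc, vec) for vec, is_positive in nearest if is_positive)

def centroid(vectors):
    """Average a list of sparse vectors into one prototype vector."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def linear_score(doc, prototype):
    """Linear family: compare doc against a single prototype per category."""
    return cosine(doc, prototype)

# Toy training set for one hypothetical category ("wheat").
train = [
    (vectorize("wheat crop harvest export"), True),
    (vectorize("wheat grain price rise"), True),
    (vectorize("bank interest rate rise"), False),
    (vectorize("oil supply cut opec"), False),
]
prototype = centroid([vec for vec, is_positive in train if is_positive])

doc = vectorize("wheat export price")
print("k-NN score:  ", knn_score(doc, train, k=2))
print("linear score:", linear_score(doc, prototype))
```

The k-NN scorer consults individual training instances at classification time, whereas the linear scorer compresses all positive instances into one vector. This is the contrast between the two families that the abstract refers to.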
SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.