SIGIR'98 papers: Four text classification algorithms compared on a Dutch
corpus
Hein Ragas
Cap Gemini Netherlands,
P.O. Box 2575,
3500 GN Utrecht, The Netherlands.
Cornelis H.A. Koster
Department of Computer Science,
University of Nijmegen,
6525 ED Nijmegen, The Netherlands.
Abstract
In the context of the document routing project DORO, we performed an
experiment in applying text classification algorithms to Dutch texts.
Four well-known learning algorithms were implemented: Rocchio's
algorithm, the Simple Bayesian Classifier (SBC), the Sleeping Experts
(SE) and Winnow. They were tested on
a corpus of articles from the Dutch newspaper NRC, pre-classified into
four categories. Using keywords as terms, the algorithms were compared on
learning speed and
error rate. We also investigated the effect of discarding terms, using
either a dynamic stoplist or the Winnow heuristic.
The Winnow heuristic with 2 steps greatly improved the performance of
SBC. Under the circumstances of our experiment, Rocchio and SBC
performed better than Winnow, and considerably better than the Sleeping
Experts. Rocchio learns faster than SBC, but it rapidly reaches a
performance plateau.
A combination of Rocchio and SBC should give the best overall performance.
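
For readers unfamiliar with Winnow, the following is a minimal sketch of the textbook (positive) Winnow update over binary keyword features. It is an illustration only: the class name, the promotion factor alpha = 2, the threshold of half the vocabulary size, and the toy Dutch keywords are all assumptions made here, and the variant and parameters used in the experiment may well differ. The term-discarding "Winnow heuristic" mentioned above is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple


class Winnow:
    """Textbook positive Winnow for one category (hypothetical illustration)."""

    def __init__(self, vocabulary: Iterable[str], alpha: float = 2.0) -> None:
        self.alpha = alpha  # promotion/demotion factor (assumed value)
        self.weights: Dict[str, float] = {term: 1.0 for term in vocabulary}
        # Assumed threshold: half the vocabulary size, a common textbook choice.
        self.threshold = len(self.weights) / 2.0

    def score(self, terms: Set[str]) -> float:
        # Sum the weights of the known terms present in the document.
        return sum(self.weights.get(t, 0.0) for t in terms)

    def predict(self, terms: Set[str]) -> bool:
        return self.score(terms) > self.threshold

    def update(self, terms: Set[str], label: bool) -> bool:
        """Mistake-driven update; returns True if a mistake was made."""
        if self.predict(terms) == label:
            return False
        # Promote active terms on a missed positive, demote on a false positive.
        factor = self.alpha if label else 1.0 / self.alpha
        for t in terms:
            if t in self.weights:
                self.weights[t] *= factor
        return True


def train(clf: Winnow, docs: List[Tuple[Set[str], bool]], max_epochs: int = 10) -> None:
    # Repeat passes over the data until a full pass makes no mistakes.
    for _ in range(max_epochs):
        mistakes = sum(clf.update(terms, label) for terms, label in docs)
        if mistakes == 0:
            break


# Toy data: True = "sport", False = any other category (invented example).
vocabulary = ["voetbal", "doelpunt", "beurs", "aandeel", "de", "het"]
docs = [
    ({"voetbal", "doelpunt", "de"}, True),
    ({"beurs", "aandeel", "de", "het"}, False),
    ({"voetbal", "het"}, True),
    ({"beurs", "het"}, False),
]
clf = Winnow(vocabulary)
train(clf, docs)
```

Because weights change only on mistakes, and only for terms present in the misclassified document, uninformative terms such as "de" and "het" tend to stay near their initial weight while discriminating keywords are promoted or demoted multiplicatively.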
SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.