SIGIR'98 papers: Four text classification algorithms compared on a Dutch corpus


Hein Ragas
Cap Gemini Netherlands, P.O. Box 2575, 3500 GN Utrecht, The Netherlands.

Cornelis H.A. Koster
Department of Computer Science, University of Nijmegen, 6525 ED Nijmegen, The Netherlands.


Abstract

In the context of the document routing project DORO, we performed an experiment in applying text classification algorithms to Dutch texts. Four well-known learning algorithms were implemented: Rocchio's algorithm, the Simple Bayesian Classifier (SBC), the Sleeping Experts (SE) and Winnow. They were tested on a corpus of articles from the Dutch newspaper NRC, pre-classified into four categories. Using keywords as terms, the algorithms were compared on learning speed and error rate. We also investigated the effect of discarding terms, using either a dynamic stoplist or the Winnow heuristic.

The Winnow heuristic with two steps greatly improved the performance of SBC. Under the circumstances of our experiment, Rocchio and SBC performed better than Winnow, and considerably better than the Sleeping Experts. Rocchio learns faster than SBC, but it rapidly reaches a performance plateau.

A combination of Rocchio and SBC should give the best overall performance.
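
For readers unfamiliar with the kind of classifier compared here, the following is a minimal sketch of a Rocchio-style classifier over keyword terms, together with an error-rate measurement. It is not the implementation used in the experiment. The function names, the beta and gamma values, the two category labels and the toy English keyword lists are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter, defaultdict
import math


def vectorize(keywords):
    """Turn a list of keywords into a length-normalised term-frequency vector."""
    counts = Counter(keywords)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


def train_rocchio(docs, beta=1.0, gamma=0.25):
    """Build one prototype vector per category.

    docs is a list of (keywords, category) pairs. beta and gamma are
    illustrative weights (assumed values): each prototype is beta times the
    mean of the category's own documents minus gamma times the mean of all
    other documents, with non-positive term weights dropped.
    """
    vectors = [(vectorize(kw), cat) for kw, cat in docs]
    prototypes = {}
    for cat in {c for _, c in vectors}:
        pos = [v for v, c in vectors if c == cat]
        neg = [v for v, c in vectors if c != cat]
        proto = defaultdict(float)
        for v in pos:
            for term, w in v.items():
                proto[term] += beta * w / len(pos)
        for v in neg:
            for term, w in v.items():
                proto[term] -= gamma * w / len(neg)
        prototypes[cat] = {t: w for t, w in proto.items() if w > 0}
    return prototypes


def classify(prototypes, keywords):
    """Assign the category whose prototype has the largest dot product."""
    v = vectorize(keywords)

    def score(cat):
        proto = prototypes[cat]
        return sum(w * proto.get(term, 0.0) for term, w in v.items())

    # sorted() makes tie-breaking deterministic
    return max(sorted(prototypes), key=score)


def error_rate(prototypes, docs):
    """Fraction of labelled documents that are classified incorrectly."""
    wrong = sum(1 for kw, cat in docs if classify(prototypes, kw) != cat)
    return wrong / len(docs)


# Hypothetical toy data, not taken from the NRC corpus.
toy_train = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
    (["market", "shares", "bank"], "economy"),
    (["bank", "interest", "market"], "economy"),
]
toy_prototypes = train_rocchio(toy_train)
print(classify(toy_prototypes, ["match", "coach"]))
print(error_rate(toy_prototypes, toy_train))
```

Measuring error_rate on held-out documents after training on progressively larger subsets would give a learning curve of the kind on which the algorithms are compared above.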


SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.