A new Effective Approach for Categorizing Web Documents
- Claus-Peter Klas
- Norbert Fuhr
- Proceedings of the 22nd BCS-IRSG Colloquium on IR Research
Categorization of Web documents poses a new challenge for automatic classification methods. In this paper, we present the megadocument approach for categorization. For each category, all corresponding document texts from the training sample are concatenated into a megadocument, which is indexed using standard methods. To classify a new document, the most similar megadocument determines the category to be assigned. Our evaluations show that for Web collections, the megadocument method clearly outperforms other classification methods. In contrast, for the Reuters collection, we achieve only mediocre results. Thus, our method seems to be well suited for heterogeneous document collections.
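The procedure described in the abstract can be illustrated with a minimal sketch. The abstract only says that megadocuments are indexed "using standard methods" and that the "most similar" one wins; the tf-idf weighting and cosine similarity used here are assumptions, not necessarily the authors' choices, and all function names (`tokenize`, `build_megadocuments`, `tfidf_vectors`, `cosine`, `classify`) as well as the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_megadocuments(training_sample):
    """Concatenate all training texts of each category into one token list."""
    mega = defaultdict(list)
    for category, text in training_sample:
        mega[category].extend(tokenize(text))
    return mega


def tfidf_vectors(mega):
    """Index each megadocument with tf-idf weights (an assumed weighting)."""
    n = len(mega)
    df = Counter()
    for tokens in mega.values():
        df.update(set(tokens))
    # Smoothed idf so that terms occurring in every megadocument keep a positive weight.
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    vectors = {}
    for category, tokens in mega.items():
        tf = Counter(tokens)
        vectors[category] = {term: freq * idf[term] for term, freq in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, vectors, idf):
    """Assign the category whose megadocument is most similar to the text."""
    tf = Counter(tokenize(text))
    # Terms never seen in training carry no weight and are dropped.
    query = {term: freq * idf[term] for term, freq in tf.items() if term in idf}
    return max(vectors, key=lambda category: cosine(query, vectors[category]))


# Toy training sample of (category, text) pairs, invented for illustration.
training_sample = [
    ("sports", "football match goal team"),
    ("sports", "team wins league match"),
    ("finance", "stock market shares price"),
    ("finance", "bank interest rate market"),
]
vectors, idf = tfidf_vectors(build_megadocuments(training_sample))
predicted = classify("the team scored a goal", vectors, idf)
```

Because each category is represented by a single long document, classification reduces to one ranking step over as many vectors as there are categories, regardless of how many training documents each category contains.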