Citation-Key:
Goevert/etal:99
Title:
A probabilistic description-oriented approach for categorising Web documents
Author(s):
Norbert Gövert
Mounia Lalmas
Norbert Fuhr
In:
Citation-Key:
CIKM:99
Title:
Proceedings of the Eighth International Conference on Information and Knowledge Management
Editor(s):
Susan Gauch
Il-Yeol Song
Publisher:
ACM
Year:
1999

Page(s):
475-482
Year:
1999

Abstract:
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) the probabilistic interpretation of the k-nearest neighbour classifier improves on the standard k-nearest neighbour classifier.
