Citation key:
Goevert/etal:99
Title:
A probabilistic description-oriented approach for categorising Web documents
Author(s):
Norbert Gövert
Mounia Lalmas
Norbert Fuhr
In:
Citation key:
CIKM:99
Title:
Proceedings of the Eighth International Conference on Information and Knowledge Management
Editor(s):
Susan Gauch
Il-Yeol Soong
Publisher:
ACM
Year:
1999

Page(s):
475--482
Year:
1999

Abstract:
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We face a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for an effective categorisation of web documents, and (2) a theoretical interpretation of the k-nearest neighbour classifier gives an improvement over the standard k-nearest neighbour classifier.
