- Citation-Key:
- Goevert/etal:99
- Title:
- A probabilistic description-oriented approach for categorising Web documents
- Author(s):
- Norbert Gövert
- Mounia Lalmas
- Norbert Fuhr
- In:
- Citation-Key:
- CIKM:99
- Title:
- Proceedings of the Eighth International Conference on Information and Knowledge Management
- Editor(s):
- Susan Gauch
- Il-Yeol Song
- Publisher:
- ACM
- Year:
- 1999
- Page(s):
- 475--482
- Abstract:
- The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We face a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
