A probabilistic description-oriented approach for categorising Web documents

  • Citation-Key:
    Goevert/etal:99
  • Title:
    A probabilistic description-oriented approach for categorising Web documents
  • Author(s):
    Norbert Gövert
    Mounia Lalmas
    Norbert Fuhr
  • In:
    • Citation-Key:
      CIKM:99
    • Title:
      Proceedings of the Eighth International Conference on Information and Knowledge Management
    • Editor(s):
      Susan Gauch
      Il-Yeol Song
    • Publisher:
      ACM
    • In:
      Proceedings of the Eighth International Conference on Information and Knowledge Management
    • Year:
      1999
  • Page(s):
    475–482
  • Year:
    1999

Abstract:


The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. Web documents pose a new challenge because they have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
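
The abstract does not spell out either classifier, so the following is only a minimal sketch of what a standard similarity-weighted k-nearest neighbour text categoriser commonly looks like, i.e. the kind of baseline the abstract refers to. It is not the authors' probabilistic variant. The function names (`cosine`, `knn_scores`, `vectorise`), the raw term-frequency weighting, the cosine similarity, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query, training, k=3):
    """Score each category by summing the similarities of the k nearest documents."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    scores = defaultdict(float)
    for vector, category in ranked[:k]:
        scores[category] += cosine(query, vector)
    return dict(scores)


def vectorise(text):
    """Hypothetical representation: raw term frequencies of whitespace tokens."""
    return dict(Counter(text.lower().split()))


# Toy labelled collection (illustrative only, not data from the paper).
training = [
    (vectorise("football match goal league"), "sport"),
    (vectorise("tennis match serve set"), "sport"),
    (vectorise("election vote parliament party"), "politics"),
    (vectorise("parliament debate vote law"), "politics"),
]

scores = knn_scores(vectorise("league match goal"), training, k=3)
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch the choice of k, the similarity function, and the way neighbour votes are summed are ad hoc parameters; according to the abstract, the paper's probabilistic interpretation is what provides a justification for such parameters.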
