A probabilistic description-oriented approach for categorising Web documents

  • Citation key:
    Goevert/etal:99
  • Title:
    A probabilistic description-oriented approach for categorising Web documents
  • Author(s):
    Norbert Gövert
    Mounia Lalmas
    Norbert Fuhr
  • In:
    • Citation key:
      CIKM:99
    • Title:
      Proceedings of the Eighth International Conference on Information and Knowledge Management
    • Editor(s):
      Susan Gauch
      Il-Yeol Song
    • Publisher:
      ACM
    • In:
      Proceedings of the Eighth International Conference on Information and Knowledge Management
    • Year:
      1999
  • Page(s):
    475--482
  • Year:
    1999

Abstract:


The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge, because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) an enhanced representation of web documents is crucial for effective categorisation, and (2) the probabilistic interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
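
For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of a standard similarity-weighted k-nearest neighbour categoriser. It is not the authors' probabilistic formulation, which the abstract does not spell out. The function names (cosine, knn_category_scores), the sparse term-weight dictionaries, the choice of cosine similarity, and the toy training data are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(query, training, k=3):
    """Return similarity-weighted category scores that sum to 1.

    training is a list of (term_weights, category) pairs.
    This is plain weighted voting (an assumption), not the paper's model.
    """
    # Rank training documents by similarity and keep the k most similar.
    neighbours = sorted(
        ((cosine(query, doc), cat) for doc, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0.0:
        return {}  # no neighbour shares a term with the query
    scores = defaultdict(float)
    for sim, cat in neighbours:
        # Each neighbour votes for its category in proportion to similarity.
        scores[cat] += sim / total
    return dict(scores)


# Hypothetical toy data: three training documents in two categories.
training = [
    ({"football": 1.0, "goal": 1.0}, "sport"),
    ({"football": 1.0, "league": 1.0}, "sport"),
    ({"election": 1.0, "vote": 1.0}, "politics"),
]
query = {"football": 1.0, "vote": 1.0}
scores = knn_category_scores(query, training, k=3)
print(scores)
```

In this sketch the neighbourhood size k and the similarity function are free parameters chosen by hand; the abstract states that the paper's probabilistic interpretation is what supplies a justification for such parameters.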

Full text as PDF