Argument clustering

Status


Finished master thesis

Student


  • Uneeb Khalid

Formalia


Target audience
  • AI Master
Preconditions
  • Ability to read and understand papers written in English.
  • Ability to perform academic writing.
  • Strong programming skills (e.g. Java; essential)
  • Attendance of the lectures Information Retrieval or Information Mining, and experience with tools such as RapidMiner (essential)

Task description


Social media platforms allow millions of internet users to easily create and share multimedia content. This generates a continuously increasing volume of big data that harbours the valuable knowledge of the crowd. Much of this crowd wisdom is bundled up in arguments, i.e. claims that are supported or refuted by evidence. This evidential data could be used to answer questions, understand complex phenomena or evaluate services and products, if it were easily accessible. Currently, however, analytic tools can only tell what users report in big data, not why they report it.

Given the volume of data, arguments will necessarily recur and will also be disconnected from each other. On Twitter, for instance, communication between users happens asynchronously, so the same argument is easily repeated by several contributors. It is therefore important to identify similar arguments and group them under a representative argument.

This master's project will investigate argument clustering using semantic similarity, textual entailment and supervised machine learning approaches. Each of these approaches should be evaluated against gold standard data (the DART data reported by Bosc et al. 2016).

Bosc, T., Cabrio, E. and Villata, S. DART: a Dataset of Arguments and their Relations on Twitter. Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 2016.
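
The grouping of recurring arguments under a representative argument can be illustrated with a minimal sketch. It is not the method prescribed for the thesis: it assumes a simple bag-of-words cosine similarity as a stand-in for a real semantic similarity measure, a greedy assignment strategy, and an arbitrary threshold of 0.5. The function names (tokenize, cosine, cluster_arguments) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and return word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_arguments(texts, threshold=0.5):
    """Greedy clustering over bag-of-words vectors (an assumed stand-in).

    Each text joins the first cluster whose representative (the first
    member of that cluster) is at least `threshold` similar; otherwise
    it starts a new cluster. Returns a list of lists of text indices.
    """
    vectors = [Counter(tokenize(t)) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            representative = vectors[members[0]]
            if cosine(vec, representative) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters
```

In this sketch the first member of each cluster serves as the representative argument. Replacing cosine with an embedding-based similarity or a textual entailment score would leave the clustering loop unchanged.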

Tasks:

  • Literature scan. This should be done before the actual project starts. The student will be given some initial papers and, based on these, should collect further papers, review all of them and prepare a 30-minute oral presentation introducing the field. This should take 2-3 weeks.

Actual work:

  • Acquisition of data. See DART data.
  • Preprocessing of data. This can be done automatically using Natural Language Processing techniques.
  • Manual annotation of the data. The data acquired above needs to be annotated manually to determine the argument clusters.
  • Performing the argument clustering. The student should perform automatic feature extraction and apply machine learning to cluster the arguments automatically. The results of the automatic system should be evaluated using precision and recall.
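
The evaluation step can be sketched as follows. The task description only asks for precision and recall; the pairwise formulation used here (counting pairs of arguments placed in the same cluster) is one common way to apply these measures to clusterings and is an assumption, as are the function names.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return all index pairs (i, j) with i < j that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_precision_recall(gold, predicted):
    """Pairwise precision and recall of a predicted clustering.

    `gold` and `predicted` are equal-length lists giving one cluster label
    per argument. A pair counts as a true positive if both clusterings
    put its two arguments in the same cluster.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_pairs = same_cluster_pairs(gold)
    predicted_pairs = same_cluster_pairs(predicted)
    true_positives = len(gold_pairs & predicted_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall
```

For example, with gold labels [0, 0, 0, 1, 1] and predicted labels [0, 0, 1, 1, 1], two of the four predicted pairs are also gold pairs, and two of the four gold pairs are recovered, so both precision and recall are 0.5.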