Towards multi-lingual argument mining: Creating multi-lingual comparable corpora.
- Firas Sabbah
- AI Master
- Ability to read and understand papers written in English.
- Ability to perform academic writing.
- Strong programming skills (e.g. Java, essential)
- Lectures: Information Retrieval or Information Mining (essential)
Social media platforms allow millions of internet users to easily create and share multimedia content. This generates a continuously increasing volume of big data that harbours precious knowledge of the crowds. Much of this crowd wisdom is bundled up in arguments, i.e. claims that are supported or refuted by evidence. This evidential data could be used to answer questions, understand complex phenomena or evaluate services and products, if it were easily accessible. Currently, however, analytic tools can only tell what users report in big data, not why. Furthermore, most argument mining studies work with English or other resource-rich European languages. Ideally, such systems would also exist for under-resourced languages such as Urdu, Arabic, Persian and Turkish.
This master project will contribute towards developing tools for the automatic extraction of relevant and reliable arguments from multi-lingual big data. In particular, the tools will be applied to news articles written in English, Arabic, Persian, Urdu and Turkish.
The student will be provided with news articles in the respective languages (close to 1 TB of data). Before argument mining, the documents should ideally be paired into comparable corpora. Two documents written in two different languages are comparable if they talk about the same topic or event. The aim of this project is to develop a tool for creating comparable corpora, i.e. for pairing, for example, English documents with Arabic ones. The project should investigate features for determining the similarity between documents. Recent developments in semantic similarity, such as word embeddings, should also be considered. Given gold-standard data, these features should be used to train a classifier (multi-label). To assess its quality, the classifier should also be evaluated against the gold-standard data.
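To illustrate what an embedding-based similarity feature could look like, here is a minimal sketch. It assumes that word embeddings for both languages are already mapped into one shared vector space; how to obtain such a space is part of the project and is not shown. The names `average_vector`, `document_similarity` and the tiny `TOY_EMBEDDINGS` table are hypothetical and only serve the example.

```python
import math

# Hypothetical toy embeddings, assumed to live in one shared
# cross-lingual space. Real vectors would have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "election": [0.9, 0.1],
    "vote": [0.8, 0.2],
    "انتخابات": [0.85, 0.15],  # Arabic: "elections"
    "football": [0.1, 0.9],
    "كرة": [0.15, 0.85],        # Arabic: "ball"
}


def average_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(tokens_a, tokens_b, embeddings):
    """Similarity feature: cosine of the two averaged document vectors."""
    vec_a = average_vector(tokens_a, embeddings)
    vec_b = average_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0  # no known words in at least one document
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    english_doc = ["election", "vote"]
    print(document_similarity(english_doc, ["انتخابات"], TOY_EMBEDDINGS))
    print(document_similarity(english_doc, ["كرة"], TOY_EMBEDDINGS))
```

A score like this would be only one feature among several; the classifier would combine it with the other features investigated in the project.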
- Literature scan. This should be done before the actual project starts. The student will be given some initial papers. Based on these, the student should collect further papers, review all of them and prepare a 30-minute oral presentation that introduces the field. This should take 2-3 weeks.
Actual work:
- Preprocessing of data. This can be done automatically using Natural Language Processing techniques.
- Data indexing, e.g. using Lucene.
- Feature extraction and supervised learning. The student should perform automatic feature extraction and apply machine learning to generate comparable corpora. Results of the automatic system should be evaluated using precision and recall.
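
The evaluation step can be sketched as follows. This is only an illustration, assuming that both the system output and the gold standard are available as sets of document-ID pairs; the function name `precision_recall` and the IDs are hypothetical.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Precision and recall of predicted document pairs against gold pairs.

    Both arguments are iterables of (source_id, target_id) tuples.
    """
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)
    true_positives = len(predicted & gold)
    # Precision: share of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold pairs that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs: English documents paired with Arabic ones.
    gold = {("en_1", "ar_7"), ("en_2", "ar_3"), ("en_4", "ar_9")}
    predicted = {("en_1", "ar_7"), ("en_2", "ar_5")}
    print(precision_recall(predicted, gold))
```

In this toy case one of the two predicted pairs is correct and one of the three gold pairs is found, so precision is 1/2 and recall is 1/3.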