Towards multi-lingual argument mining: Comparable passage extraction from comparable corpora.
- Huangpan Zang
- AI Master
- Ability to read and understand papers written in English.
- Ability to perform academic writing.
- Strong programming skills (e.g. Java, essential)
- Lectures: Information Retrieval or Information Mining (essential)
Social media platforms allow millions of internet users to easily create and share multi-media content. This generates a continuously increasing volume of big data that harbours precious knowledge of the crowds. Much of this crowd wisdom is bundled up in arguments, i.e. claims that are supported or refuted by evidence. This evidential data could be used to answer questions, understand complex phenomena or evaluate services and products - if it were easily accessible. However, current analytic tools can only tell what users report in big data, not why. Furthermore, most argument mining studies currently work with English or other resource-rich European languages. Ideally, such systems would also exist for under-resourced languages such as Urdu, Arabic, Persian and Turkish.
This master project will contribute towards developing tools for the automatic extraction of relevant and reliable arguments from multi-lingual big data. In particular, the tools will be applied to news articles written in English, Arabic, Persian, Urdu and Turkish.
The student will be provided with comparable news articles in the respective languages. Two documents written in two different languages are comparable if they talk about the same topic or event. The aim of this project is to develop a tool for creating comparable text passages, i.e. matching text passages from an English file with text passages in the other language (e.g. Urdu). The project should investigate features for determining the similarity between text passages. Recent developments in semantic similarity, such as word embeddings and neural networks, should also be considered. Given gold standard data, those features should be used to train a classifier (multi-label). To assess its quality, the classifier should also be evaluated against the gold standard data.
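As one illustration of an embedding-based similarity feature, the sketch below averages word vectors per passage and compares the two averages with cosine similarity. It assumes that words from both languages have already been mapped into a shared cross-lingual embedding space; how that space is obtained is not specified here. The function names and the toy vectors are hypothetical and serve only to make the example runnable.

```python
import math


def passage_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known word in this passage
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def passage_similarity(tokens_a, tokens_b, embeddings):
    """Similarity feature for one candidate passage pair."""
    vec_a = passage_vector(tokens_a, embeddings)
    vec_b = passage_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


# Toy shared embedding space: the values are made up for illustration only.
TOY_EMBEDDINGS = {
    "election": [0.90, 0.10, 0.00],
    "vote": [0.80, 0.20, 0.10],
    "seçim": [0.85, 0.15, 0.05],   # Turkish: "election"
    "oy": [0.80, 0.25, 0.10],      # Turkish: "vote"
    "futbol": [0.05, 0.10, 0.95],  # Turkish: "football"
}

english = ["election", "vote"]
sim_match = passage_similarity(english, ["seçim", "oy"], TOY_EMBEDDINGS)
sim_other = passage_similarity(english, ["futbol"], TOY_EMBEDDINGS)
print(f"matching passage: {sim_match:.3f}, unrelated passage: {sim_other:.3f}")
```

In a real system this score would be only one feature among several (for example, overlap of named entities or dictionary translations), and the classifier would decide how to weight it.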
- Literature scan. This should be done before the actual project starts. The student will be given some initial papers and, based on these, should collect further papers, review all of them and prepare a 30-minute oral presentation introducing the field. This should take 2-3 weeks.
Actual work:
- Preprocessing of data. This can be done automatically using Natural Language Processing techniques.
- Data indexing, e.g. using Lucene.
- Feature extraction and supervised learning. The student should perform automatic feature extraction and apply machine learning to generate comparable passages. Results of the automatic system should be evaluated using precision and recall.
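The evaluation step can be sketched as follows, assuming that both the system output and the gold standard are represented as sets of (English passage ID, other-language passage ID) pairs. The identifiers and the function name are hypothetical; the actual data format is not fixed by this description.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Compute precision and recall of predicted passage pairs."""
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)
    true_positives = len(predicted & gold)
    # Precision: share of predicted pairs that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold standard pairs that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: English passages matched to Urdu passages.
gold = {("en_1", "ur_3"), ("en_2", "ur_1"), ("en_4", "ur_4")}
predicted = {
    ("en_1", "ur_3"),
    ("en_2", "ur_2"),
    ("en_4", "ur_4"),
    ("en_5", "ur_5"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the four predicted pairs are correct and two of the three gold pairs are found, so precision and recall differ; reporting both shows whether the system over-generates or misses matches.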