Clustering online reviews


Completed Master's thesis


  • Muhammad Hasan Shahid


  • AI Master
  • Ability to read and understand papers written in English.
  • Ability to perform academic writing.
  • Strong programming skills (e.g. Java; essential)
  • Lecture: Information Mining (essential)


On shopping websites such as Amazon, a product can have several reviews from customers who have already used it. Because reviews represent the customers' experience with the product, each review may describe one or more aspects of the product. For a digital camera, for instance, a review may contain information about the battery life or discuss the display quality. When dealing with reviews it is thus important to understand which aspects they talk about. The main problem here is therefore to identify the different aspects described in the reviews of a product. Once these are determined, reviews talking about the same aspect can be clustered together, i.e. grouped within the same bin. Note that a review may be concerned with more than one aspect. In this case the review has to be placed in each of the respective aspect clusters, i.e. a copy of the same review is assigned to several aspect clusters.
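To make the idea of overlapping aspect clusters concrete, the following is a minimal sketch in Python. It assumes the aspects of each review are already known; the review IDs, aspect names, and the function `group_by_aspect` are hypothetical and only serve as illustration. It shows a review that mentions two aspects ending up in both clusters.

```python
from collections import defaultdict

def group_by_aspect(review_aspects):
    """Invert a review -> aspects mapping into aspect -> reviews clusters.

    A review with several aspects is placed in every matching cluster.
    """
    clusters = defaultdict(list)
    for review_id, aspects in review_aspects.items():
        for aspect in aspects:
            clusters[aspect].append(review_id)
    return dict(clusters)

# Hypothetical example data for a digital camera.
review_aspects = {
    "r1": {"battery life"},
    "r2": {"battery life", "display quality"},  # talks about two aspects
    "r3": {"display quality"},
}

clusters = group_by_aspect(review_aspects)
for aspect in sorted(clusters):
    print(aspect, clusters[aspect])
```

Here `r2` appears in both the "battery life" and the "display quality" cluster, which is the behaviour the task description asks for.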


  • One main solution to the above problem is to develop a clustering approach for reviews. Clustering can be hard, where a review belongs to exactly one cluster, or soft (fuzzy), where each review can belong to one or more clusters. As discussed above, the aim here is soft clustering, since a review can talk about one or more aspects. Different approaches can be used for this task, such as Latent Dirichlet Allocation (LDA), graph-based clustering, etc. The student should investigate existing clustering approaches and adopt the best-performing one for clustering reviews.
  • The developed approach will need to be evaluated using B Cubed, a precision- and recall-oriented evaluation method. B Cubed requires gold-standard clusters, so the project should also include manually creating such gold-standard data.
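As an illustration of the evaluation step, the sketch below computes B Cubed precision, recall, and F1 in Python for soft clusterings. The task description does not say which B Cubed variant is intended; this sketch assumes the extended, multiplicity-based formulation that is commonly used when items may belong to several clusters. The function name `bcubed_overlapping`, the review IDs, and the cluster labels are hypothetical.

```python
from typing import Dict, Set, Tuple

def bcubed_overlapping(pred: Dict[str, Set[str]],
                       gold: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Extended B Cubed for overlapping clusters (assumed variant).

    pred maps each review to its predicted cluster IDs,
    gold maps each review to its gold-standard aspect labels.
    """
    if pred.keys() != gold.keys():
        raise ValueError("pred and gold must cover the same reviews")
    items = sorted(pred)
    if any(not pred[e] or not gold[e] for e in items):
        raise ValueError("every review needs at least one cluster and one label")

    precisions, recalls = [], []
    for e in items:
        p_vals, r_vals = [], []
        for o in items:  # includes e itself
            shared_pred = len(pred[e] & pred[o])
            shared_gold = len(gold[e] & gold[o])
            agreed = min(shared_pred, shared_gold)
            if shared_pred > 0:   # pair shares a predicted cluster
                p_vals.append(agreed / shared_pred)
            if shared_gold > 0:   # pair shares a gold aspect
                r_vals.append(agreed / shared_gold)
        precisions.append(sum(p_vals) / len(p_vals))
        recalls.append(sum(r_vals) / len(r_vals))

    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold standard: r2 talks about two aspects.
gold = {"r1": {"battery"}, "r2": {"battery", "display"}, "r3": {"display"}}
# Hypothetical system output that misses r2's second aspect.
pred = {"r1": {"c1"}, "r2": {"c1"}, "r3": {"c2"}}

p, r, f = bcubed_overlapping(pred, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy example the system never groups reviews that do not belong together, so precision is 1.0, but it fails to place `r2` next to `r3`, which lowers recall to 2/3 and F1 to 0.8.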