MSc. Artificial Intelligence and Machine Learning

Dissertation Topic

Analysing news articles about Russia’s war on Ukraine using Latent Dirichlet Allocation based topic modelling

Description

Topic modelling is one of the most effective techniques in natural language processing for discovering latent information within a given corpus. This work focuses on Latent Dirichlet Allocation (LDA), one of the most popular topic modelling methods.

The purpose of this work is to use an LDA-based topic modelling approach to capture the main themes or topics in news articles from The Guardian digital newspaper that relate exclusively to the ongoing Russian invasion of Ukraine. We also aim to interpret the topics obtained and their latent significance for the current state of the situation and its future. We use Gensim, a well-known Python library that provides a fast, efficient, and scalable implementation of the LDA algorithm. Before training the model, we extract the data and transform it in a preprocessing step so that it can be fed to the model.
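As a minimal sketch of what such a preprocessing step might look like, the snippet below lowercases text, removes stop words and very short tokens, and converts each document into a bag-of-words list of (token id, count) pairs, which is the document representation Gensim's LDA implementation expects. The two example sentences, the stop-word list, the three-letter threshold, and all function names are illustrative assumptions, not the pipeline actually used in this work.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
              "is", "are", "for", "that", "as", "its"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]


def build_vocabulary(docs):
    """Map each distinct token to an integer id, assigned in sorted order."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def to_bow(doc, token2id):
    """Bag-of-words: sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Made-up placeholder sentences, not taken from the dataset.
articles = [
    "Ukraine reports shelling in the east as talks stall.",
    "Talks on grain exports resume; grain prices fall.",
]
docs = [preprocess(a) for a in articles]
token2id = build_vocabulary(docs)
corpus = [to_bow(d, token2id) for d in docs]
```

In practice, steps such as lemmatisation or bigram detection are often added at this stage; they are omitted here to keep the sketch dependency-free.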

We then train multiple models with different numbers of topics and tune further parameters. Subsequently, we evaluate the models, select the one best suited to our purpose, and interpret the results.
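The model-selection loop described above could be sketched roughly as follows with Gensim. The choice of u_mass topic coherence as the selection criterion, the `passes` and `seed` defaults, and the helper names `train_and_score` and `pick_best` are assumptions made for illustration; the evaluation procedure actually used in this work may differ.

```python
def train_and_score(texts, topic_counts, passes=10, seed=0):
    """Train one LDA model per topic count; return {k: u_mass coherence}.

    `texts` is a list of tokenised documents (lists of strings).
    """
    # Imported here so that pick_best below can be used without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    scores = {}
    for k in topic_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, passes=passes, random_state=seed)
        cm = CoherenceModel(model=lda, corpus=corpus,
                            dictionary=dictionary, coherence="u_mass")
        scores[k] = cm.get_coherence()
    return scores


def pick_best(scores):
    """Return the topic count with the highest coherence (ties go to the smaller k)."""
    return max(sorted(scores), key=lambda k: scores[k])
```

A call such as `pick_best(train_and_score(docs, range(5, 30, 5)))` would return the candidate number of topics with the highest coherence. A coherence score is only a guide, so the top words of each candidate model should still be inspected by hand before settling on one.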

This work shows that: preprocessing is a fundamental step for obtaining good results with LDA-based topic modelling; a considerably large dataset also contributes to good results; tuning the model's parameters can become computationally demanding when the chosen number of topics is too large; and, lastly, LDA is an effective tool for analysing large datasets and finding the main topics contained within them.

Furthermore, the obtained topics can be interpreted through the words to which each topic assigns the highest probabilities.