A clustering-based topic model using word networks and word embeddings

Mu, Wenchuan; Lim, Kwan Hui; Liu, Junhua; Karunasekera, Shanika; Falzon, Lucia; Harwood, Aaron

doi:10.1186/s40537-022-00585-4

Research
Open access
Published: 11 April 2022

Wenchuan Mu¹,
Kwan Hui Lim ORCID: orcid.org/0000-0002-4569-0901²,
Junhua Liu²,³,
Shanika Karunasekera⁴,
Lucia Falzon⁴ &
Aaron Harwood⁴

Journal of Big Data volume 9, Article number: 38 (2022)

Abstract

Online social networking services like Twitter are frequently used for discussions on numerous topics of interest, which range from mainstream and popular topics (e.g., music and movies) to niche and specialized topics (e.g., politics). Due to the popularity of such services, it is a challenging task to automatically model and determine the numerous discussion topics given the large number of tweets. Adding to this complexity is the need to identify these topics in the absence of prior knowledge about both the types and number of topics, while also requiring the relevant technical expertise to tune the numerous parameters of the various models. To address this challenge, we develop the Clustering-based Topic Modelling (ClusTop) algorithm, which first constructs different types of word networks based on different types of n-gram co-occurrence and word embedding distances. Using these word networks, ClusTop is then able to automatically determine the discussion topics using community detection approaches. In contrast to traditional topic models, ClusTop does not require the tuning or setting of numerous parameters and instead uses community detection approaches to automatically determine the appropriate number of topics. The ClusTop algorithm is also able to capture the syntactic meaning in tweets via the use of bigrams, trigrams, other word combinations and word embedding techniques in constructing the word network graph, and utilizes edge weights based on word embedding. Using three Twitter datasets with labelled crises and events as topics, we show that ClusTop outperforms various traditional baselines in terms of topic coherence, pointwise mutual information, precision, recall and F-score.

Introduction

Twitter is a popular microblogging service that is widely used in everyday life, with a high volume of 500 million tweets posted on a daily basis [1]. On microblogging services such as Twitter, users frequently engage in discussions and debates on topics of interest, ranging from mainstream and popular topics (e.g., movies, TV, music, entertainment) to niche and specialized topics (e.g., politics, religion, current affairs). The capability to detect and understand the discussions about these topics is useful for numerous purposes, such as understanding the general sentiments and trends of these topics, and recommending accurate and relevant content. However, the large volume and high posting frequency of tweets make it a significant challenge for users to effectively understand the discussion topics in these tweets [2, 3].

A popular approach is to utilise topic modelling algorithms to automatically detect the topics discussed in a set of traditional text-based documents, such as news articles, academic papers, etc. In such algorithms, the output is a set of keywords denoting the topics that are relevant to each document. Examples of topic modelling algorithms are the original Latent Semantic Analysis [4], Probabilistic Latent Semantic Analysis [5] and latent Dirichlet allocation [6]. These algorithms were developed mainly for topic modelling on traditional and large documents such as news articles or papers [7, 8]. The advent of microblogging services has led to the widespread use of short documents (i.e., tweets) in social media, on which traditional topic modelling algorithms do not work well. In response, various researchers have proposed variants of these traditional topic models, based on various types of aggregation schemes that combine a set of tweets into larger documents [9, 10]. While latent Dirichlet allocation and its variants have been shown to model topics well for traditional documents, the number of topics needs to be defined in advance and they do not account for the syntactic structure of sentences.

In this work, we aim to overcome these limitations by introducing a topic modelling algorithm that is able to automatically determine the appropriate number of topics. Our proposed algorithm is based on the adaptation of community detection algorithms on a network graph where vertices are words and edges are relations between words. Our algorithm is also able to capture the syntactic nature of language via the use of bigrams, trigrams, other word combinations and different word embeddings in constructing our word network graph. In addition, we perform an empirical study to better examine the different types of network graphs based on the types of nodes, edges and embedding techniques, and their effects on the accuracy and quality of the topics detected.

Main contributions

In this paper, our main contributions are as follows:

1.
We propose the Clustering-based Topic Modelling (ClusTop) algorithm that makes use of community detection approaches for modelling topics on Twitter using a word network graph. In this word network graph, nodes represent different definitions of words and phrases and edges represent either word/phrase co-occurrences or the similarity distances between words based on embeddings. Unlike more traditional topic models, ClusTop automatically determines the number of topics by maximizing a modularity score among words in the network.
2.
In addition to using a traditional co-word usage network, we experiment with different variants of our ClusTop algorithm based on numerous definitions of words (unigrams, bigrams, trigrams, hashtags, nouns from part-of-speech tagging), types of relations (word co-occurrence frequency and word embedding similarity distance) and different aggregation schemes (individual tweets, hashtags and mentions). We also propose variants based on different word embedding techniques where edges are weighted based on the similarity distances between different words.
3.
Using three Twitter datasets with labelled topics, we evaluate ClusTop and its variants against various LDA baselines based on measures of topic coherence, pointwise mutual information, precision, recall and F-score. Experimental results show that ClusTop offers superior performance based on these evaluation metrics, compared to the various baselines.

Structure and organization

The rest of this paper is structured as follows. “Related work” section discusses key literature on studying topics on microblogs and topic modelling algorithms. “Proposed algorithm” section describes our ClusTop algorithm. “Dataset and evaluation methodology” section outlines our experimental methodology in terms of the dataset used, baseline algorithms and evaluation metrics. “Experiments” section highlights the results from our evaluation and discusses our main findings. “Conclusion” section concludes this paper and highlights possible future directions for this work.

Related work

In this section, we discuss two main areas of research related to our work, namely the study of topics on microblogs and general topic modelling algorithms.

Studying topics using communities

The most closely related works to our proposed approach are those that make use of community detection techniques for understanding and studying topics on microblogs. As such, we discuss the main works that utilise such techniques. Towards the effort to better understand research themes in the Human Computer Interaction domain, Liu et al. [12] used hierarchical clustering on co-keyword usage in academic papers to identify the main research clusters across two different time periods. Researchers have also proposed approaches for identifying communities that frequently interact about common interest topics using various types of community detection algorithms. These approaches are based on topological links such as friendship networks among users and celebrities [13] and interaction links in the form of explicit mentions of other users [14]. Researchers like [15] and [16] have also used community detection algorithms on word networks to identify topics with a focus on network analysis and visualization, and detection of spammer topics, respectively. Fried et al. [17] used topic modelling on a series of food-related tweets to understand health information such as overweight rate and diabetes rate. Others like Surian et al. [18] and [19] combined the use of topic modelling algorithms with community detection algorithms to characterize discussions relating to vaccines on Twitter and to study discussion topics of Italian users, respectively.

General topic modelling algorithms

Also relevant to our work are those that proposed various types of topic modelling algorithms, of which latent Dirichlet allocation (LDA) is a particularly popular one with many variants being proposed. As such, we next discuss a series of works that utilize the popular LDA for proposing new variants that are targeted for use on microblogs and other forms of short text, as well as the application of LDA on various types of social media. LDA [6] is a popular topic model that is used to determine the set of latent topics associated with a set of documents. Each document is usually represented as a bag-of-words in LDA, with each topic modelled by a distribution of words, and each document is assigned a distribution of topics via a generative process. Variants of bag-of-words, such as keeping nouns only or removing stop words, improve topics’ semantic coherence [20, 21]. LDA is sometimes accompanied by other representation structures. Structural relationships among social texts in a discussion tree have been added to LDA as context information to alleviate data sparsity and noise [22]. When word co-occurrences are lacking, distributional word embedding captures semantic and syntactic correlations among words [23]. It helps discover interpretable topics even with large vocabularies that include rare words and stop words [24, 25]. Domain-specific semantic relationships of words are useful in areas such as clinical predictive modelling [26] and restricting keywords to specific predefined topics better stabilizes topic assignment [27]. LDA can also be built on a conditional random field, two-layer bidirectional long short-term memory, or other neural network representations [23, 28, 29]. Although LDA is traditionally used for longer documents such as news articles and academic papers, LDA has also been applied to Twitter where each tweet is considered a document.
To address the limitations caused by short texts such as tweets, researchers have used aggregation schemes where tweets by the same author or with the same terms, hashtags or posting time are combined as one document [9, 10, 23, 30]. Zhao et al. have also used LDA to study the differences between Twitter and New York Times in terms of the discussed topics and content [31], while Aiello et al. [32] applied LDA for the purpose of trending topics detection in sports and politics, using different textual pre-processing steps. Similarly, researchers have modified LDA to capture the temporal nature of documents, such as the Topic over Time (TOT) algorithm [33] for detecting topical trends over continuous time, and Temporal-LDA [34] for modelling topics and their transitions in streaming documents. LDA has also been applied in various domains, such as urban analytics [35, 36], advertising/marketing [37, 38], diseases/medical [39, 40], climate sentiment measurement [41], communication research [42], and aspect-based product review [43].

Social media analytics

Apart from studying topics on microblogs, topic models have also been used to enhance other tasks such as distinguishing between personal and corporate accounts [44] and identifying fake follower accounts [45]. Contrasting opinion topic models find opinions from multiple perspectives in news media [46]. Stances on different opinions can further be used to detect disinformation [47, 48], analyze sentiment to improve stock prediction [49], or study the correlation between topics and their prevalence [50,51,52] as a social science task. Topic models are also helpful in recommendation systems. Topic models for software similarity [53, 54] help in recommending suitable open-source software repositories for developers [55]. Modelling travellers’ preferences, e.g., cultural, city, or landmark, from the textual description of photos [56, 57] can help with travel recommendation, as similar travellers can be identified according to similar topic preferences. Moreover, modelling textual descriptions of photos could in turn help recognize images on social media [58]. In addition to direct application, topic models can assist other algorithms in solving further tasks, such as news or legal document summarization [59, 60].

Discussion

These earlier related works conducted various studies and provided interesting insights into their applications and main findings of topic models on microblogs such as Twitter. In addition, they have also proposed various novel topic modelling algorithms that have shown good performance on different types of datasets, particularly short texts such as microblogs. Building upon these works, our research and proposed method differ from these earlier works in the following ways:

1.
Previous works that study discussion topics on microblogs approach this problem by applying topic modelling algorithms on microblogs with the aim of understanding topical trends in the microblogging community from a social perspective. These earlier works focus less on classifying individual tweets into specific topics and, as a result, they do not emphasise the performance evaluation of these algorithms.
2.
While there are researchers that employ community detection algorithms for understanding discussion topics, we perform an empirical study based on an extensive range of network types (with multiple definitions of vertices and edges), instead of using only word co-occurrence. In addition to a standard word co-occurrence network, we also experiment with and evaluate a variant of ClusTop that utilizes word embeddings and the similarity distance between the words for community detection and deriving the discussion topics. In terms of experimental evaluation, we also focus on validating the performance of our proposed algorithm on a set of labelled tweets, instead of only understanding the broad topical trends.
3.
Although existing topic models have been adapted to microblogs and short texts with relatively good performance, these algorithms typically require the tuning and setting of appropriate values for various algorithmic parameters, such as the number of topics to model and the Dirichlet prior for both document-topic distributions and topic-word distributions, which are modelled by the k, alpha and beta parameters, respectively. In contrast, our ClusTop algorithm automatically determines the number of topics and does not require any parameter to be set, due to its local maximization of modularity.

Proposed algorithm

We now describe our proposed algorithm by first defining the basic notations and preliminaries used in this algorithm. Using standard network theory notations, we denote V and E to represent the set of vertices and edges, respectively. Following this, an undirected graph $G = (V, E)$ is represented as a collection of vertices V that are connected by a set of edges E. In turn, each edge $e \in E$ is denoted by $e = (\{v_i, v_j\}, w)$, where w represents the weight of the link between vertices $v_i$ and $v_j$. In our application of community detection algorithms to topic modelling, we first explore the use of an undirected graph as $G = (U, R)$, where U is the set of unigrams (vertices) and R is the set of relations (edges) between the unigrams. In later sections of this paper, we further examine the effects of different definitions of vertices, such as bi-grams, tri-grams, hashtags, etc, as well as different types of edge weights, such as frequency counts and similarity distances based on word embedding.

In this work, we propose the Clustering-based Topic Modelling (ClusTop) algorithm that uses community detection approaches to topic modelling, based on the undirected graph $G = (U, R)$ and different definitions of unigrams and relations. We first provide an overview of the basic ClusTop algorithm, which consists of the following steps:

1.
Network Construction. The first step of this algorithm involves constructing a unigram network, i.e., an undirected graph $G = (U, R)$, based on a particular definition of vertices (unigrams) and edges (relations). This step will be elaborated further in “Network construction” section, where we will describe the various types of vertices (unigrams) and edges (relations) modelled in this work.
2.
Community Detection. Using the network graph obtained from Step 1, we will next apply community detection approaches to identify the main communities (topics), where sets of vertices (unigrams) will be grouped into different communities that represent different topics. This step will be further described in “Community detection” section.
3.
Topic Assignment. Based on the detected communities from Step 2, each of which corresponds to a specific topic, this step examines each individual tweet and assigns it to a specific community. In short, this step aims to label each tweet with an appropriate topic. More details about this step are provided later in “Topic assignment” section.

Figure 1 provides an overview of our ClusTop algorithm, with the three main steps of network construction, community (topic) detection and topic assignment. In our subsequent experiments, the steps of network construction and community detection are performed only on the training set, while topic assignment is performed and evaluated on the testing set.

Network construction

In this section, we will describe the first step of network graph construction. There are different ways of constructing this network graph, which depend on: (i) the type of network based on different definitions of unigrams (vertices) and their relations (edges); and (ii) the type of document aggregation, i.e., individual tweets, aggregated by hashtag or mentions.

Types of network

The first stage of our algorithm involves constructing a network graph of word usage, as shown in Algorithm 1. This algorithm involves the following: (i) examining all tweets and tokenizing all words in each tweet based on whitespaces; (ii) for each word-pair in each tweet, building a weighted edge e linking the two words; and (iii) repeating steps (i) and (ii) for all tweets, until we obtain a network graph, where the vertices represent unigrams and edges represent a relation between two unigrams. The choice of vertices and edges will lead to a different type of network graph being constructed. To better examine the effect of these vertices and edges on the graph type, we experiment with a variety of relation types between different types of unigram, including the following:

Co-word Usage (Word). A relationship where two words (uni-grams) are used in the same tweet. That is, co-word usage models all pair-wise word co-occurrence in a tweet, regardless of where the word appeared.
Co-hashtag Usage (Hash). A relationship where two hashtags are used in the same tweet. Twitter users typically use hashtags to categorize their tweets into themes and topics [61, 62], and hashtags thus serve as a suitable form of unigram relation.
Co-noun Usage (Noun). A relationship where two nouns are used in the same tweet. To determine the nouns in a tweet, we utilize the part-of-speech tagging component from the Apache OpenNLP library [63], which has been used by many researchers for similar natural language processing tasks [64,65,66].
Bigram occurrence (BiG). A relationship for two words of each bigram in the tweet. Unlike the co-word usage, this bigram occurrence only considers a relation/edge between two words if they are used one after another in sequence.
Trigram occurrence (TriG). Similar to the earlier bigram occurrence, except that we model a relationship between three words in a trigram, i.e., there is an additional edge between the first and third word.
Bigram + Hashtag (BiHa). A combination of bigram occurrence and co-hashtag usage, where we consider each bigram occurrence and add a relation/edge between each word of the bigram and all hashtags in the tweet.
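
The relation types above can be illustrated with a minimal Python sketch of frequency-weighted network construction. It is not the paper's Algorithm 1: the function names (`tokenize`, `edge`, `co_word_edges`, `bigram_edges`, `trigram_edges`, `build_network`), the lower-casing of tokens and the example tweets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(tweet):
    # Whitespace tokenization; lower-casing is an added assumption.
    return tweet.lower().split()


def edge(a, b):
    # Undirected edge key: order the endpoints so (a, b) and (b, a) coincide.
    return (a, b) if a <= b else (b, a)


def co_word_edges(tokens):
    # Co-word usage: every pair of distinct words in the same tweet,
    # regardless of where the words appear.
    return [edge(a, b) for a, b in combinations(sorted(set(tokens)), 2)]


def bigram_edges(tokens):
    # Bigram occurrence: only words used one after another are linked.
    return [edge(a, b) for a, b in zip(tokens, tokens[1:]) if a != b]


def trigram_edges(tokens):
    # Trigram occurrence: bigram edges plus an edge between the first
    # and third word of each trigram.
    skip = [edge(a, c) for a, c in zip(tokens, tokens[2:]) if a != c]
    return bigram_edges(tokens) + skip


def build_network(tweets, edge_fn):
    # Edge weight = co-occurrence frequency over the given tweets.
    weights = Counter()
    for tweet in tweets:
        weights.update(edge_fn(tokenize(tweet)))
    return weights


# Hypothetical example tweets.
tweets = ["flood warning in queensland", "queensland flood relief"]
word_net = build_network(tweets, co_word_edges)
bigram_net = build_network(tweets, bigram_edges)
```

In this toy example, "flood" and "queensland" co-occur in both tweets, so their co-word edge has weight 2, whereas they are adjacent only in the second tweet, so their bigram edge has weight 1.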

In the above examples, we determine edge weights based on the co-occurrence frequency of terms observed in a set of tweets, i.e., our training set. We also make use of word embedding to model edge weights as the cosine similarity between a pair of words, i.e., more similar words will be linked with a higher edge weight. For this purpose, we first use the GloVe algorithm [67] for generating the word vector based on hashtags used, then construct a network with vertices based on hashtags and edge weights based on the cosine similarity scores between hashtags. In addition to GloVe, we also generalize this variant using other popular word embedding algorithms such as Word2Vec [68] and FastText [69] to better examine the effects of different word embedding techniques on our approach.

We denote the three variants of these word embedding based networks, with their similarity-based edge weights, as:

Hash2Vec-Glove (h2vg). A network based on co-usage of hashtags in the same tweet, where the edge weights are based on cosine similarity scores of a word vector trained using GloVe [67].
Hash2Vec-Word2Vec (h2vw). A network based on co-usage of hashtags in the same tweet, where the edge weights are based on cosine similarity scores of a word vector trained using Word2Vec [68].
Hash2Vec-FastText (h2vf). A network based on co-usage of hashtags in the same tweet, where the edge weights are based on cosine similarity scores of a word vector trained using FastText [69].
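
As a rough illustration of similarity-based edge weights, the sketch below replaces the frequency count on each co-hashtag edge with the cosine similarity of the two hashtag vectors. The three-dimensional vectors are hypothetical placeholders; in practice they would come from a trained GloVe, Word2Vec or FastText model, which is not shown here. The helper names are also assumptions.

```python
import math


def cosine_similarity(x, y):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def embedding_weighted_edges(cooccurring_pairs, vectors):
    # Keep the co-hashtag edges, but weight each one by cosine similarity.
    # Pairs with a missing vector are skipped (an assumption of this sketch).
    weights = {}
    for a, b in cooccurring_pairs:
        if a in vectors and b in vectors:
            weights[(a, b)] = cosine_similarity(vectors[a], vectors[b])
    return weights


# Hypothetical toy vectors standing in for trained hashtag embeddings.
vectors = {
    "#flood": [1.0, 0.0, 1.0],
    "#rain": [1.0, 0.0, 0.0],
    "#election": [0.0, 1.0, 0.0],
}
pairs = [("#flood", "#rain"), ("#flood", "#election")]
h2v_weights = embedding_weighted_edges(pairs, vectors)
```

With these toy vectors, the edge between "#flood" and "#rain" receives a higher weight than the edge between "#flood" and "#election", whose vectors are orthogonal.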

Types of document aggregation

In the above examples, we are modelling each tweet as a single document for topic modelling purposes. In more traditional topic modelling, each document typically corresponds to a lengthy piece of text (such as a news article, website or abstract) and traditional topic modelling algorithms work better for these types of lengthy documents. Comparatively on Twitter, each document typically corresponds to a much shorter document in the form of a tweet with up to 280 characters. Researchers have found that aggregating multiple tweets into a single document improves the performance of LDA on Twitter [9, 10]. Building upon these findings, we also experiment with different forms of document aggregation scheme for our ClusTop algorithm, including:

No Aggregation, i.e. individual tweets (na). The basic representation where each tweet is considered a single document, i.e., no aggregation as per traditional topic modelling.
Aggregate by Hashtags (ah). Each document comprises a set of tweets that are aggregated based on common hashtags used.
Aggregate by Mentions (am). Each document comprises a set of tweets that are aggregated based on common mentions of Twitter users.
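
The aggregation schemes can be sketched as follows. The function name `aggregate`, the handling of tweets with several hashtags (the tweet joins every matching document) and the dropping of tweets without any hashtag or mention are assumptions of this sketch, not details stated in the text.

```python
from collections import defaultdict


def aggregate(tweets, prefix):
    # prefix "#" aggregates by hashtag, prefix "@" aggregates by mention.
    docs = defaultdict(list)
    for tweet in tweets:
        keys = {
            tok.lower()
            for tok in tweet.split()
            if tok.startswith(prefix) and len(tok) > 1
        }
        # Assumption: a tweet joins the document of every key it contains,
        # and tweets without any key are left out.
        for key in keys:
            docs[key].append(tweet)
    # Each aggregated document is the concatenation of its tweets.
    return {key: " ".join(group) for key, group in docs.items()}


# Hypothetical example tweets.
tweets = ["#flood water rising", "stay safe #flood #qld", "hello @abc"]
by_hashtag = aggregate(tweets, "#")
by_mention = aggregate(tweets, "@")
```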

Community detection

After constructing the network graph in the previous section, we now describe our approach to modelling the topics in this graph using community detection approaches. Our main example in this paper is the adaptation of the Louvain algorithm [70] for this purpose, as the Louvain algorithm has been shown to be one of the best performing algorithms in a comprehensive survey of community detection algorithms [71].

Our adaptation of the Louvain algorithm [70] for the purpose of topic modelling is described by the pseudo-code in Algorithm 2, which comprises the following steps:

1.
Initially, each unigram is placed in its own community/cluster (Line 2).
2.
Following which, for each unigram, we examine each neighbour of this unigram and combine two unigrams into the same community/cluster if their modularity gain is the greatest among all of the neighbours (Lines 4 to 16).
3.
Next, we build a new network graph where unigrams in the same community/cluster are combined as a single vertex (unigram), and Step 2 is repeated until the modularity score is maximized (Lines 17 to 20).

One of the reasons for the Louvain algorithm’s good performance is its local adjustment of unigrams (vertices) into communities/clusters, by maximizing the following modularity gain [70]:

$$\begin{aligned} \Delta Q&= \Bigg [ \frac{\sum _{in} + k_{i,in}}{2m} - \bigg ( \frac{\sum _{tot} + k_{i}}{2m} \bigg )^2 \Bigg ] \\&\quad -\Bigg [ \frac{\sum _{in}}{2m} - \bigg ( \frac{\sum _{tot}}{2m} \bigg )^2 - \bigg ( \frac{k_{i}}{2m} \bigg )^2 \Bigg ] \end{aligned}$$

(1)

where $\sum _{in}$ and $\sum _{tot}$ represent the total weight of all links inside a community/cluster and the total weight of all links to a community/cluster, respectively. Similarly, the terms $k_{i}$ and $k_{i,in}$ denote the total weight of all links to i and the total weight of links to i within the community/cluster, respectively. Lastly, m denotes the total weight of all links in the network graph.
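
To make Equation 1 concrete, the following sketch evaluates it term by term for one candidate move of a vertex i into a community. The function name `modularity_gain` and the numeric values are hypothetical; the argument names mirror the symbols defined above.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    # First bracket of Equation 1: the community with vertex i added.
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    # Second bracket: the community and the isolated vertex i kept apart.
    before = (
        sigma_in / (2 * m)
        - (sigma_tot / (2 * m)) ** 2
        - (k_i / (2 * m)) ** 2
    )
    return after - before


# Hypothetical values for a single candidate move.
gain = modularity_gain(sigma_in=2, sigma_tot=5, k_i=2, k_i_in=2, m=10)
```

Expanding the squares shows that the expression reduces to $k_{i,in}/(2m) - \sum _{tot} k_{i}/(2m^2)$, so a move is beneficial when the vertex's links into the community outweigh what its total link weight and the community's total link weight would predict.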

At the end of this step, we will obtain a set of communities/clusters based on the provided network graph. Each community/cluster will represent a particular topic, where the members of each community/cluster serve as the representative words of each topic. For each topic, we also rank the keywords (i.e., members of each community) based on the total weight of all links to a unigram/vertex.
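
The keyword ranking can be sketched as below, assuming the network is stored as a mapping from undirected edges to weights. The names `weighted_degree` and `rank_keywords`, the alphabetical tie-breaking and the toy edge weights are assumptions for illustration.

```python
from collections import defaultdict


def weighted_degree(edge_weights):
    # k_u: total weight of all links attached to each unigram u.
    k = defaultdict(float)
    for (a, b), w in edge_weights.items():
        k[a] += w
        k[b] += w
    return dict(k)


def rank_keywords(community, k, top_n=10):
    # Highest total link weight first; ties broken alphabetically
    # (the tie-breaking rule is an assumption of this sketch).
    return sorted(community, key=lambda u: (-k.get(u, 0.0), u))[:top_n]


# Hypothetical toy network and community.
edge_weights = {
    ("flood", "water"): 3,
    ("flood", "rescue"): 2,
    ("rescue", "water"): 1,
}
k = weighted_degree(edge_weights)
top = rank_keywords({"flood", "water", "rescue"}, k, top_n=2)
```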

Topic assignment

Given the detected communities/topics C from “Community detection” section and a tweet $t = \{u_1,\ldots, u_n\}$, we define the most likely topic for this tweet as:

$$\begin{aligned} \underset{c \in C}{\mathrm {arg\,max}} \sum _{u \in c} k_u \delta (u=u_t), \quad \forall \, u_t \in t \end{aligned}$$

(2)

where $\delta (u=u_t) = 1$ if a unigram u of a community/topic $c \in C$ is the same as a unigram $u_t$ of a tweet t and $\delta (u=u_t) = 0$ otherwise, and $k_u$ denotes the total weight of links to unigram u (as previously described in “Community detection” section).

In short, we assign a tweet t to a community/topic c that has the highest co-occurrence of unigrams in both the tweet and community/topic, where the unigram in the community/topic is weighted based on its co-occurrence to other unigrams.
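
A minimal sketch of this assignment rule follows. The function name `assign_topic`, the treatment of the tweet as a set of unigrams (so repeated words count once) and the choice to return `None` when no community shares a unigram with the tweet are assumptions; the text does not specify these cases.

```python
def assign_topic(tweet_unigrams, communities, k):
    # communities: topic id -> set of member unigrams
    # k: unigram -> total weight of links to that unigram (k_u)
    tweet_set = set(tweet_unigrams)
    scores = {
        topic: sum(k.get(u, 0.0) for u in members if u in tweet_set)
        for topic, members in communities.items()
    }
    # Iterate topics in sorted order so that ties resolve deterministically.
    best = max(sorted(scores), key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # assumption: no overlap means no topic is assigned
    return best


# Hypothetical communities and link weights.
communities = {
    "flood": {"flood", "water", "rescue"},
    "vote": {"election", "vote"},
}
k = {"flood": 5, "water": 4, "rescue": 3, "election": 6, "vote": 2}
label = assign_topic(["rescue", "teams", "reach", "flood", "zone"], communities, k)
```

Here the tweet shares "rescue" and "flood" with the first community, giving it a score of 8 against 0 for the second, so the tweet is labelled with the "flood" topic.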

Dataset and evaluation methodology

In this section, we give an overview of our experimental dataset and describe our evaluation methodology in terms of the ClusTop algorithm variants, baseline algorithms and evaluation metrics.

Dataset

For our experimental evaluation, we utilize three Twitter datasets with labelled topics [74,75,76], which enables us to better evaluate our algorithm and baselines against the ground truth topics compared to an unlabelled dataset. In total, these datasets comprise close to 8 million tweets, from which we focus on the subset of tweets with annotated and verified topics. These topics are in the form of 60k labelled tweets about 6 crises [74], 27.9k labelled tweets about 26 crises [75], and 3.6k labelled tweets about 8 events [76]. Refer to Table 1 for more details. The annotation of these tweets into the respective topics (crises and events) was performed via the CrowdFlower crowdsourcing platform, and more details can be found in the respective papers.

Table 1 Description of dataset

We split each dataset into four partitions and perform a fourfold cross validation [77]. At each evaluation iteration, we use three partitions as our training set and the last partition as our testing set. After completing all evaluations, we compute and report the mean results for each algorithm based on the metrics of topic coherence, pointwise mutual information, precision, recall and f-score, which we elaborate further in the rest of the paper.
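
The evaluation protocol can be sketched as a plain k-fold split. How the partitions were actually formed is not stated, so the random shuffling, the fixed seed and the function name `four_fold_splits` are assumptions of this sketch.

```python
import random


def four_fold_splits(items, k=4, seed=0):
    # Shuffle indices once, then deal them into k partitions.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        # One partition is held out for testing, the rest form the training set.
        test = [items[i] for i in folds[test_fold]]
        train = [
            items[i]
            for fold in range(k)
            if fold != test_fold
            for i in folds[fold]
        ]
        yield train, test


# Hypothetical dataset of ten item ids.
splits = list(four_fold_splits(list(range(10))))
```

Metric values computed on each of the four test partitions would then be averaged to obtain the reported mean result.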

Topic quality metrics

For determining the quality of the detected topics, we measure the topic quality based on the topic coherence and pointwise mutual information metrics. These two metrics have also been widely used by many topic modelling researchers [10, 78, 79]. For both evaluation metrics, we denote a detected topic t that comprises a set of n representative unigrams/keywords $U^{(t)} = (u^{(t)}_1,\ldots, u^{(t)}_n)$ for each topic.

1.
Topic Coherence (TC). Given that $D(u_i, u_j)$ denotes the number of times both unigrams $u_i$ and $u_j$ appeared in the same document/tweet, and similarly, $D(u_i)$ for a single unigram $u_i$, topic coherence is defined as:
$$\begin{aligned} TC(t, U^{(t)}) = \sum _{u_i \in U^{(t)}}\sum _{u_j \in U^{(t)}, u_i \ne u_j} \log \frac{D(u_i, u_j)}{D(u_j)} \end{aligned}$$
(3)
2.
Pointwise Mutual Information (PMI). Given that $P(u_i, u_j)$ denotes the probability of a unigram pair $u_i$ and $u_j$ appearing in the same document/tweet, and $P(u_i)$ for the probability of a single unigram $u_i$, pointwise mutual information is defined as:
$$\begin{aligned} PMI(t, U^{(t)}) = \sum _{u_i \in U^{(t)}}\sum _{u_j \in U^{(t)}, u_i \ne u_j} \log \frac{P(u_i, u_j)}{P(u_i) P(u_j)} \end{aligned}$$
(4)

In both the TC and PMI metrics, a division by 0 or a log of 0 can occur when the relevant numerator or denominator is 0, i.e., when a particular word or word pair has not been previously observed. As such, we adopt a similar strategy to [10, 78] by adding a small value $\epsilon =1$ to both components to avoid a division by 0 or a log of 0.
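
The two topic quality metrics can be sketched as follows, with each document represented as a set of unigrams. Exactly how $\epsilon$ enters each ratio is not spelled out, so this sketch assumes one reading: $\epsilon$ is added to the numerator and to the denominator of every ratio. The function names and toy documents are also assumptions.

```python
import math
from itertools import permutations


def doc_freq(docs, *unigrams):
    # Number of documents (sets of unigrams) containing all given unigrams.
    return sum(1 for d in docs if all(u in d for u in unigrams))


def topic_coherence(keywords, docs, eps=1.0):
    # Equation 3 over ordered pairs (u_i, u_j) with u_i != u_j.
    # Assumption: eps is added to both the numerator and the denominator.
    return sum(
        math.log((doc_freq(docs, u_i, u_j) + eps) / (doc_freq(docs, u_j) + eps))
        for u_i, u_j in permutations(keywords, 2)
    )


def pmi(keywords, docs, eps=1.0):
    # Equation 4 with probabilities estimated as document fractions.
    # Assumption: docs is non-empty and eps is added to both components.
    n = len(docs)
    total = 0.0
    for u_i, u_j in permutations(keywords, 2):
        p_ij = doc_freq(docs, u_i, u_j) / n
        p_i = doc_freq(docs, u_i) / n
        p_j = doc_freq(docs, u_j) / n
        total += math.log((p_ij + eps) / (p_i * p_j + eps))
    return total


# Hypothetical toy documents.
docs = [
    {"flood", "water"},
    {"flood", "rescue"},
    {"flood", "water", "rescue"},
    {"vote"},
]
tc_value = topic_coherence(["flood", "water"], docs)
pmi_value = pmi(["flood", "water"], docs)
```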

Topic relevance metrics

Precision, recall and f-score are popular metrics used in Information Retrieval and other related fields, such as in topic modelling [32, 80], tour recommendation [81,82,83], location prediction and tagging [84,85,86], event detection [87, 88], among others. In contrast to the previous topic quality metrics (TC and PMI), these metrics allow us to evaluate how relevant and accurate the detected topics are, compared to the ground truth topics. In topic modelling, researchers typically manually curate a set of ground truth keywords to describe a specific topic, then evaluate how well the detected keywords from their topic models match these ground truth keywords [32]. For our evaluation, we adopt a similar methodology except that we automatically determine the ground truth keywords from the respective Wikipedia article for each topic.

Given that $U^{D} = (u^{D}_1,\ldots,u^{D}_n)$ and $U^{G} = (u^{G}_1,\ldots,u^{G}_n)$ denote the sets of detected unigrams and ground truth unigrams for a specific topic, respectively, the metrics we use are as follows:

Precision. The proportion of unigrams for the detected topic $U^{D}$ that also appear in the ground truth unigrams $U^{G}$. For a topic t, precision is defined as:
$$\begin{aligned} P(t) = \frac{|U^{D} \cap U^{G}|}{|U^{D}|} \end{aligned}$$
(5)
Recall. The proportion of ground truth unigrams $U^{G}$ that also appear in the unigrams for the detected topic $U^{D}$. For a topic t, recall is defined as:
$$\begin{aligned} R(t) = \frac{|U^{D} \cap U^{G}|}{|U^{G}|} \end{aligned}$$
(6)
F-score. The harmonic mean of precision P(t) and recall R(t), which were introduced in Equations 5 and 6, respectively. For a topic t, F-score is defined as:
$$\begin{aligned} F(t) = \frac{2 \times P(t) \times R(t)}{P(t) + R(t)} \end{aligned}$$
(7)

In our experiments, we compute the precision, recall and F-score derived from the testing set, in terms of the top 5 and 10 keywords of each topic modelled.
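
Equations 5 to 7 translate directly into code. The sketch below assumes the detected keywords are already ranked so that the top 5 or top 10 can be taken by slicing; the function name `precision_recall_f` and the example keywords are hypothetical.

```python
def precision_recall_f(detected, ground_truth, top_n=None):
    # detected: ranked list of keywords for a topic; keep only the top_n.
    ranked = list(detected)
    if top_n is not None:
        ranked = ranked[:top_n]
    d, g = set(ranked), set(ground_truth)
    overlap = len(d & g)
    precision = overlap / len(d) if d else 0.0
    recall = overlap / len(g) if g else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical detected and ground truth keywords.
detected = ["flood", "water", "rescue", "help", "rain", "news"]
ground_truth = {"flood", "rain", "queensland", "river"}
p, r, f = precision_recall_f(detected, ground_truth, top_n=5)
```

In this example two of the top 5 detected keywords appear among the four ground truth keywords, giving a precision of 0.4, a recall of 0.5 and an F-score of 4/9.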

Summary rank metrics

As our experiments involve five evaluation metrics, three datasets and 18 algorithms, we develop an intuitive approach to represent the performance of each algorithm. This approach first ranks each algorithm’s performance from 1 to 18 for each evaluation metric and dataset, with rank 1 being the best performing one. For each group of metrics, i.e., topic quality (topic coherence and pointwise mutual information) and topic relevance (precision, recall and f-score), we then average the ranks over all metrics in the group and all three datasets to obtain an average rank. For example, if an algorithm ranked 1st, 1st and 2nd in terms of topic coherence and 2nd, 1st, 2nd in terms of pointwise mutual information for datasets A, B, C, respectively, this algorithm will be assigned an overall rank of 1.5 for the topic quality metric.
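
The averaging in the worked example can be reproduced with a few lines of code. The function name `average_rank` and the dictionary layout (metric name mapped to the ranks on datasets A, B and C) are assumptions for illustration.

```python
def average_rank(ranks_by_metric):
    # ranks_by_metric: metric name -> list of ranks, one per dataset.
    all_ranks = [r for ranks in ranks_by_metric.values() for r in ranks]
    return sum(all_ranks) / len(all_ranks)


# Ranks from the worked example, for datasets A, B and C.
topic_quality_ranks = {
    "topic_coherence": [1, 1, 2],
    "pointwise_mutual_information": [2, 1, 2],
}
overall = average_rank(topic_quality_ranks)
```

The six ranks sum to 9, so the average is 9 / 6 = 1.5, matching the overall rank stated in the example.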

Variants of ClusTop algorithm

Based on the six types of unigram network and three types of document aggregation (introduced in “Network construction” section), there can be multiple variants of our ClusTop algorithm. For our evaluation, we experiment with the following 21 variants of our ClusTop algorithm, namely:

ClusTop-Word-NA ClusTop based on a co-word usage network, with no tweet aggregation.
ClusTop-BiG-NA ClusTop based on a bigram occurrence network, with no tweet aggregation.
ClusTop-TriG-NA ClusTop based on a trigram occurrence network, with no tweet aggregation.
ClusTop-BiHa-NA ClusTop based on a bigram occurrence + co-hashtag usage network, with no tweet aggregation.
ClusTop-Hash-NA ClusTop based on a co-hashtag usage network, with no tweet aggregation.
ClusTop-H2VG-NA ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-GloVe scores between hashtags, with no tweet aggregation.
ClusTop-H2VW-NA ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-Word2Vec scores between hashtags, with no tweet aggregation.
ClusTop-H2VF-NA ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-FastText scores between hashtags, with no tweet aggregation.
ClusTop-Noun-NA ClusTop based on a co-noun usage network, with no tweet aggregation.
ClusTop-Word-AH ClusTop based on a co-word usage network, with tweets aggregated based on common hashtags.
ClusTop-Hash-AH ClusTop based on a co-hashtag usage network, with tweets aggregated based on common hashtags.
ClusTop-H2VG-AH ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-GloVe scores between hashtags, with tweets aggregated based on common hashtags.
ClusTop-H2VW-AH ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-Word2Vec scores between hashtags, with tweets aggregated based on common hashtags.
ClusTop-H2VF-AH ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-FastText scores between hashtags, with tweets aggregated based on common hashtags.
ClusTop-Noun-AH ClusTop based on a co-noun usage network, with tweets aggregated based on common hashtags.
ClusTop-Word-AM ClusTop based on a co-word usage network, with tweets aggregated based on common mentions.
ClusTop-Hash-AM ClusTop based on a co-hashtag usage network, with tweets aggregated based on common mentions.
ClusTop-H2VG-AM ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-GloVe scores between hashtags, with tweets aggregated based on common mentions.
ClusTop-H2VW-AM ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-Word2Vec scores between hashtags, with tweets aggregated based on common mentions.
ClusTop-H2VF-AM ClusTop based on a co-hashtag usage network where edge weights are based on hash2vec-FastText scores between hashtags, with tweets aggregated based on common mentions.
ClusTop-Noun-AM ClusTop based on a co-noun usage network, with tweets aggregated based on common mentions.

Note that we did not use the ClusTop variants based on bigrams and trigrams combined with the hashtag and mention aggregation schemes, as these variants provide minimal improvements compared to their original non-aggregated variants. Consider a simple example of three tweets with a common hashtag: the hashtag aggregation scheme with bigrams will only produce two additional bigrams, one spanning the first and second tweets and one spanning the second and third tweets. Moreover, each additional bigram is formed from the last word of one tweet and the first word of the next tweet, which will not be syntactically meaningful in most cases.
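
The three-tweet example can be checked with a short Python sketch. The tokenised tweets are made up, and the sketch assumes that aggregation simply concatenates the tweets in order; it is an illustration of the argument rather than our aggregation code.

```python
def bigrams(tokens):
    """Return the list of adjacent word pairs in a token sequence."""
    return list(zip(tokens, tokens[1:]))


# Three hypothetical tweets that share a hashtag (hashtag omitted from the tokens).
tweets = [
    ["flood", "hits", "city"],
    ["rescue", "teams", "deployed"],
    ["roads", "closed", "today"],
]

# Bigrams when each tweet is its own document.
separate = [pair for tweet in tweets for pair in bigrams(tweet)]

# Bigrams when the tweets are concatenated into one aggregated document.
aggregated_tokens = [token for tweet in tweets for token in tweet]
aggregated = bigrams(aggregated_tokens)

# The only new bigrams are the ones that straddle a tweet boundary.
extra = [pair for pair in aggregated if pair not in separate]
print(extra)
```

The two extra pairs join the end of one tweet to the start of the next, which is why they rarely carry syntactic meaning.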

Baseline algorithms

LDA is a popular topic modelling algorithm that was used for traditional documents (such as news articles), and more recently for social media (such as tweets on Twitter). Given the popularity of LDA for topic modelling, we compare our ClusTop algorithm and its variants against the following LDA-based algorithms, namely:

1. LDA-Orig The original version of LDA introduced by [6], where each document corresponds to a single tweet.
2. LDA-Hash A variant of LDA applied on Twitter, where each document is aggregated from multiple tweets with the same hashtag [10].
3. LDA-Ment An adaptation of the Twitter-based LDA variant proposed by [89], where we aggregate tweets with the same mention into a single document.
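
The three baselines differ only in how tweets are grouped into documents before LDA is applied. The sketch below illustrates this grouping under several assumptions: the function `pool_tweets`, the regular expressions, the example tweets and the decision to leave tweets without any hashtag or mention out of the pools are all choices made for this example, and may differ from the procedures in [10] and [89].

```python
import re
from collections import defaultdict

HASHTAG = r"#\w+"   # assumed pattern for hashtags
MENTION = r"@\w+"   # assumed pattern for user mentions


def pool_tweets(tweets, pattern):
    """Group tweets that share a hashtag or mention into one document each.

    A tweet with several matches is added to every matching pool; a tweet
    with no match is left out of the pooled documents (an assumption).
    """
    pools = defaultdict(list)
    for tweet in tweets:
        keys = {match.lower() for match in re.findall(pattern, tweet)}
        for key in keys:
            pools[key].append(tweet)
    return {key: " ".join(texts) for key, texts in pools.items()}


# Hypothetical tweets.
tweets = [
    "Flood warning issued #flood @cityhall",
    "Stay safe everyone #Flood",
    "Concert tonight #music @cityhall",
    "No tags here",
]

lda_orig_docs = list(tweets)                 # LDA-Orig: one document per tweet
lda_hash_docs = pool_tweets(tweets, HASHTAG) # LDA-Hash: one document per hashtag
lda_ment_docs = pool_tweets(tweets, MENTION) # LDA-Ment: one document per mention
print(len(lda_orig_docs), len(lda_hash_docs), len(lda_ment_docs))
```

The pooled documents would then be passed to any standard LDA implementation in place of the individual tweets.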

Experimental results and discussion

In this section, we report on the results of our experiments and discuss some implications of these findings.

Table 2 Comparison of ClusTop algorithm against various baselines, in terms of Topic Coherence (TC) and Pointwise Mutual Information (PMI) for the top 5, 10, 15 and 20 keywords


Topic coherence and pointwise mutual information

Table 2 shows a summary of the performance of our ClusTop algorithm and its variants against the various LDA baselines, in terms of average rank based on Topic Coherence and Pointwise Mutual Information scores on the top 5, 10, 15 and 20 keywords in the detected topics. For a more detailed breakdown, Tables 4 and 5 show the performance of our ClusTop algorithm and its variants against the various LDA baselines, in terms of Topic Coherence and Pointwise Mutual Information, based on the top 5, 10, 15 and 20 keywords in the detected topics.

The results generally show that all variants of our ClusTop algorithm outperform the various LDA baselines in terms of the average rank metrics. All ClusTop variants also outperform the LDA baselines in terms of the individual evaluation metrics of Topic Coherence and Pointwise Mutual Information across all datasets. In particular, we note the following:

The performance of ClusTop could be largely attributed to its usage of the various types of word network graphs, which retain the syntactic meaning and association between words in a tweet via the use of vertices in the form of unigrams, bigrams, trigrams and their variants, and edges in the form of word co-occurrence usage and various types of word embedding similarity distances.
All ClusTop variants that utilize hashtags (ClusTop-Hash-NA, ClusTop-Hash-AH and ClusTop-Hash-AM) offer better overall performance compared to their counterparts that utilize other forms of unigram and relation, i.e., words, bigrams, trigrams and nouns.
The aggregation schemes employed by LDA (LDA-Hash and LDA-Ment) generally outperform their original counterpart (LDA-Orig), thus showing that LDA works better on larger documents.
In addition to all ClusTop variants outperforming the LDA baselines, the aggregation schemes employed by ClusTop showed better performance compared to their non-aggregated counterparts.

Table 3 Comparison of ClusTop algorithm against various baselines, in terms of Precision (Pre), Recall (Rec) and F-score (FS) for the top 5 keywords/unigrams of each topic


Precision, recall and F-score

Table 3 shows the average ranks based on the Precision, Recall and F-score of our ClusTop algorithm and its variants, and the various LDA baselines, based on the top 5, 10, 15 and 20 keywords of detected topics. For a more detailed breakdown of the results, Table 6 shows the Precision, Recall and F-score of our ClusTop algorithm and its variants, and the various LDA baselines, based on the top 5 keywords of detected topics, while Tables 7, 8 and 9 show the same results based on the top 10, 15 and 20 keywords of detected topics, respectively.

There are specific variants of ClusTop that outperform the LDA baselines in terms of Precision, Recall and F-score. Our main observations are as follows:

In terms of overall rank (average of Precision, Recall and F-score), ClusTop-BiG-NA, ClusTop-TriG-NA and ClusTop-H2VG-AM offer the best overall performance.
In terms of precision, the hash2vec variants are consistently the best performers (except for one case in Table 6 where LDA-Orig is ranked 2nd, beating all hash2vec variants except ClusTop-H2VW-AH). The next two best performers are ClusTop-Hash-NA and ClusTop-Noun-AM, except for the aforementioned case in Table 6 where they are beaten by LDA-Orig.
ClusTop-H2VW-* and ClusTop-H2VF-* have slightly higher precision than ClusTop-H2VG-*, as seen from the fact that ClusTop-H2VG-* never achieves the top precision in Table 6. This slight difference is due to each word embedding algorithm being trained using its own vocabulary set. While GloVe (twitter.27B) provides embeddings for some hashtags that are not standard English, the other two algorithms do not. Missing these hashtags possibly increases the precision of ClusTop-H2VW-* and ClusTop-H2VF-* in Table 6, as the vast majority of ground truth words are in the standard English vocabulary.

Conclusion and future work

In this paper, we proposed the ClusTop algorithm for topic modelling on Twitter, using community detection approaches on a network graph with multiple definitions of vertices and edges. While traditional topic modelling algorithms require the tuning and setting of numerous parameters, ClusTop does not require this parameter tuning and is able to automatically determine the appropriate number of topics using a local maximization of modularity on the word network graph. We also performed an empirical study on the effects of using different types of vertices (unigrams, bigrams, trigrams, hashtags, nouns from part-of-speech tagging), types of edges (word co-occurrence frequency and word embedding similarity distance), and different aggregation schemes (individual tweets, hashtags and mentions). The different possible combinations of vertices, edges and aggregation schemes result in multiple variants of our ClusTop algorithm, which we compare against one another as well as against various LDA baselines. Our experimental evaluation of the ClusTop variants and baselines is based on the evaluation metrics of topic coherence, pointwise mutual information, precision, recall and F-score. The experimental results based on three Twitter datasets with labelled topics (crises and events) show that our ClusTop algorithm outperforms the various LDA baselines in terms of these evaluation metrics.

This work explored how community detection approaches alongside different types of word network graphs can be used for automated topic modelling on Twitter. We performed an empirical study to examine the effects of different types of network graphs based on different definitions of vertices, edges and aggregation schemes on a variety of performance metrics. There still remain various directions for future research, which include:

A major challenge in evaluating topic models and text classification models is the requirement of a dataset with annotated labels of the ground truth topics. A possible future direction is to automate the labelling of this ground truth topic by using the semantic similarity between tweets or other short texts and Wikipedia or news articles to assign the appropriate topic labels based on the categorisation for the latter.
Our work is primarily focused on using community detection approaches for topic modelling purposes and does not incorporate other aspects of a social network, such as friendship links. Future work can utilize a joint modelling of social relations between users and the various types of word network graph to detect topic-coherence communities, i.e., communities of users based on topical interests.
Another future direction is to extend our ClusTop algorithm to incorporate temporal and spatial attributes associated with geo-tagged tweets. With the increased use of smart devices and geo-tagged social media, this consideration of temporal and spatial attributes will enable researchers to better model topics that are associated with specific time periods or physical locations.

Availability of data and materials

Not applicable.

Notes

This paper is an extended version of [11], with the addition of more than 40% new material. These additions include: (i) an updated literature review to include more recent works; (ii) a more detailed description of our proposed approach; (iii) a new algorithm that utilizes various word embeddings and distance measures between words; (iv) additional experiments and evaluations; (v) a more in-depth discussion of the results and our main findings.
Our approach can also be easily generalized, as the earlier generated network graph can be used as an input to other community detection algorithms. We have also experimented with other popular community detection algorithms such as the Infomap [72] and Label Propagation [73] algorithms. However, the results show that these algorithms have a tendency to generate a large number (hundreds to thousands) of small communities, thus making them infeasible for our topic modelling purpose. As such, we utilize the Louvain algorithm due to its good performance compared to these other community detection algorithms.

References

Statistics IL. Twitter Usage Statistics. 2016. http://www.internetlivestats.com/twitter-statistics/.
Kumar S, Morstatter F, Liu H. Twitter Data Analytics. New York: Springer; 2013.
Liao Y, Moshtaghi M, Han B, Karunasekera S, Kotagiri R, Baldwin T, Harwood A, Pattison P. Mining Micro-Blogs: Opportunities and Challenges. Social Networks: Computational Aspects and Mining. In: London in the Computer Communications and Networks series. Springer: New York; 2011.
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci. 1990;41(6):391.
Hofmann T. Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI’99). 1999. p. 289–296.
Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
De Smet W, Moens M-F. Cross-language linking of news stories on the web using interlingual topic modelling. In: Proceedings of the 2nd ACM Workshop on Social Web Search and Mining. 2009; p. 57–64.
Jacobi C, Van Atteveldt W, Welbers K. Quantitative analysis of large amounts of journalistic texts using topic modelling. Digital J. 2016;4(1):89–106.
Hong L, Davison BD. Empirical study of topic modeling in twitter. In: Proceedings of the First Workshop on Social Media Analytics (SMA’10), 2010. p. 80–8.
Mehrotra R, Sanner S, Buntine W, Xie L. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’13), 2013. p. 889–92.
Lim KH, Karunasekera S, Harwood A. Clustop: A clustering-based topic modelling algorithm for twitter using word networks. In: Proceedings of the 2017 IEEE International Conference on Big Data (BigData’17), 2017. p. 2009–18.
Liu Y, Goncalves J, Ferreira D, Xiao B, Hosio S, Kostakos V. CHI 1994-2013: mapping two decades of intellectual progress through co-word analysis. In: Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI’14), 2014. p. 3553–62.
Lim KH, Datta A. A topological approach for detecting twitter communities with common interests. In: Ubiquitous Social Media Analysis. New York: Springer; 2013. p. 23–43.
Lim KH, Datta A. An interaction-based approach to detecting highly interactive twitter communities using tweeting links. Web Intelligence. 2016;14(1):1–15.
Paranyushkin D. Identifying the pathways for meaning circulation using text network analysis. In: Nodus Labs; 2011.
Jr SB, Kido GS, Tavares GM. Artificial and natural topic detection in online social networks. iSys. Revista Brasileira de Sistemas de Informacao 2017;10(1): 80–98.
Fried D, Surdeanu M, Kobourov S, Hingle M, Bell D. Analyzing the language of food on social media. In: Proceedings of the 2014 IEEE International Conference on Big Data (BigData’14), 2014; p. 778–83.
Surian D, Nguyen DQ, Kennedy G, Johnson M, Coiera E, Dunn AG. Characterizing twitter discussions about hpv vaccines using topic modeling and community detection. J Med Internet Res. 2016;18:8.
Amati G, Angelini S, Cruciani A, Fusco G, Gaudino G, Pasquini D, Vocca P. Topic modeling by community detection algorithms. In: Proceedings of the 2021 Workshop on Open Challenges in Online Social Networks, 2021. p. 15–20.
Martin F, Johnson M. More efficient topic modelling through a noun only approach. In: Proceedings of the Australasian Language Technology Association Workshop 2015, Parramatta, Australia. 2015. p. 111–115. https://aclanthology.org/U15-1013.
Yang S, Zhang H. Text mining of twitter data using a latent dirichlet allocation topic model and sentiment analysis. Int J Comput Inf Eng. 2018;12(7):525–9.
Sun Y, Loparo K, Kolacinski R. Conversational structure aware and context sensitive topic model for online discussions. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), p. 85–92. 2020.
Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G. Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst. 2019;61(2):1123–45.
Dieng AB, Ruiz FJ, Blei DM. Topic modeling in embedding spaces. Trans Assoc Comput Linguistics. 2020;8:439–53.
Dai X, Bikdash M, Meyer B. From social media to public health surveillance: Word embedding based clustering method for twitter classification. In: SoutheastCon 2017, pp. 1–7.
Bagheri A, Sammani A, van der Heijden PG, Asselbergs FW, Oberski DL. Etm: Enrichment by topic modeling for automated clinical sentence classification to detect patients’ disease history. J Intell Inf Syst. 2020;55(2):329–49.
Nikolenko SI, Koltcov S, Koltsova O. Topic modelling for qualitative studies. J Inf Sci. 2017;43(1):88–102.
Jansson P, Liu S. Distributed representation, LDA topic modelling and deep learning for emerging named entity recognition from social media. In: Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 154–159. Association for Computational Linguistics, Copenhagen, Denmark. 2017. https://doi.org/10.18653/v1/W17-4420. https://aclanthology.org/W17-4420.
Bhat MR, Kundroo MA, Tarray TA, Agarwal B. Deep lda: A new way to topic model. J Inf Optimiz Sci. 2020;41(3):823–34.
Steinskog A, Therkelsen J, Gambäck B. Twitter topic modeling by tweet aggregation. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 77–86. Association for Computational Linguistics, Gothenburg, Sweden. 2017. https://aclanthology.org/W17-0210.
Zhao WX, Jiang J, Weng J, He J, Lim E-P, Yan H, Li X. Comparing twitter and traditional media using topic models. In: Proceedings of the 33rd European Conference on Information Retrieval (ECIR’11). 2011. p. 338–49.
Aiello LM, Petkos G, Martin C, Corney D, Papadopoulos S, Skraba R, Göker A, Kompatsiaris I, Jaimes A. Sensing trending topics in twitter. IEEE Trans Multimedia. 2013;15(6):1268–82.
Wang X, McCallum A. Topics over time: A non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). 2006. p. 424–33.
Wang Y, Agichtein E, Benzi M. Tm-lda: Efficient online modeling of latent topic transitions in social media. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). 2012. p. 123–31.
Lansley G, Longley PA. The geography of twitter topics in london. Comput Environ Urban Syst. 2016;58:85–96.
Wang J, Feng Y, Naghizade E, Rashidi L, Lim KH, Lee KE. Happiness is a choice: Sentiment and activity-aware location recommendation. In: Proceedings of the 2018 Web Conference Companion (WWW’18). 2018. p. 1401–5.
Chen Y, Amiri H, Li Z, Chua T-S. Emerging topic detection for organizations from microblogs. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’13), 2013. p. 43–52.
Barry AE, Valdez D, Padon AA, Russell AM. Alcohol advertising on twitter-a topic model. Am J Health Educ. 2018;49(4):256–63.
Missier P, Romanovsky A, Miu T, Pal A, Daniilakis M, Garcia A, Cedrim D, da Silva Sousa L. Tracking dengue epidemics using twitter content classification and topic modelling. In: Proceedings of the 2016 International Conference on Web Engineering (ICWE’16). 2016 p. 80–92.
Kwan JS-L, Lim KH. Understanding public sentiments, opinions and topics about covid-19 using twitter. In: Proceedings of the 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’20). 2020. p. 623–6.
Dahal B, Kumar SA, Li Z. Topic modeling and sentiment analysis of global climate change tweets. Soc Netw Anal Mining. 2019;9(1):1–20.
Maier D, Waldherr A, Miltner P, Wiedemann G, Niekler A, Keinert A, Pfetsch B, Heyer G, Reber U, Häussler T, et al. Applying lda topic modeling in communication research: Toward a valid and reliable methodology. Commun Methods Meas. 2018;12(2–3):93–118.
Jeong B, Yoon J, Lee J-M. Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. Int J Inf Manag. 2019;48:280–90.
Yin P, Ram N, Lee W-C, Tucker C, Khandelwal S, Salathe M. Two sides of a coin: Separating personal communication and public dissemination accounts in twitter. In: Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’14). 2014. p. 163–75.
Shen Y, Yu J, Dong K, Nan K. Automatic fake followers detection in chinese micro-blogging system. In: Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’14). 2014. p. 596–607.
Fang Y, Si L, Somasundaram N, Yu Z. Mining contrastive opinions on political texts using cross-perspective topic model. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. 2012. p. 63–72.
Shu K, Sliva A, Wang S, Tang J, Liu H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorat Newslett. 2017;19(1):22–36.
Song X, Petrak J, Jiang Y, Singh I, Maynard D, Bontcheva K. Classification aware neural topic model for covid-19 disinformation categorisation. PloS one. 2021;16(2):0247086.
Nguyen TH, Shirai K. Topic modeling based sentiment analysis on social media for stock market prediction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015. p. 1354–64.
Roberts ME, Stewart BM, Tingley D, Lucas C, Leder-Luis J, Gadarian SK, Albertson B, Rand DG. Structural topic models for open-ended survey responses. Am J Polit Sci. 2014;58(4):1064–82.
Roberts ME, Stewart BM, Airoldi EM. A model of text for experimentation in the social sciences. J Am Stat Assoc. 2016;111(515):988–1003.
Grimmer J. A bayesian hierarchical topic model for political texts: Measuring expressed agendas in senate press releases. Polit Anal. 2010;18(1):1–35.
Tian K, Revelle M, Poshyvanyk D. Using latent dirichlet allocation for automatic categorization of software. In: 2009 6th IEEE International Working Conference on Mining Software Repositories. IEEE. 2009. p. 163–6.
Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P. Mining concepts from code with probabilistic topic models. In: Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering. 2007. p. 461–4.
Di Rocco J, Di Ruscio D, Di Sipio C, Nguyen P, Rubei R. Topfilter: an approach to recommend relevant github topics. In: Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 2020. p. 1–11.
Jiang S, Qian X, Shen J, Mei T. Travel recommendation via author topic model based collaborative filtering. In: International Conference on Multimedia Modeling. Springer; 2015. p. 392–402.
Hu B, Ester M. Spatial topic modeling in online social media for location recommendation. In: Proceedings of the 7th ACM Conference on Recommender Systems. 2013. p. 25–32.
Niu Z, Hua G, Gao X, Tian Q. Semi-supervised relational topic model for weakly annotated image recognition in social media. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. p. 4233–40.
Alguliyev RM, Aliguliyev RM, Isazade NR, Abdi A, Idris N. Cosum: Text summarization based on clustering and optimization. Expert Syst. 2019;36(1):12340.
Nagwani NK. Summarizing large text collection using topic modeling and clustering based on mapreduce framework. J Big Data. 2015;2(1):1–18.
Ma Z, Sun A, Cong G. Will this #hashtag be popular tomorrow? In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’12). 2012. p. 1173–4.
Lehmann J, Goncalves B, Ramasco JJ, Cattuto C. Dynamical classes of collective attention in twitter. In: Proceedings of the 21st International Conference on World Wide Web (WWW’12). 2012. p. 251–60.
Foundation TAS. The Apache OpenNLP library. http://opennlp.apache.org. 2017.
Mattmann CA, Sharan M. An automatic approach for discovering and geocoding locations in domain-specific web data. In: Proceedings of the 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI’16). 2016. p. 87–93.
Vicente IS, Saralegi X, Agerri R. Elixa: A modular and flexible absa platform. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval’15). 2015. p. 748–52.
Agerri R, Rigau G. Robust multilingual named entity recognition with shallow semi-supervised features. Artif Intell. 2016;238:63–82.
Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 2014. p. 1532–43.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. 2016.
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):10008.
Fortunato S. Community detection in graphs. Phys Rep. 2010;486(3):75–174.
Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci. 2008;105(4):1118–23.
Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E. 2007;76(3):036106.
Olteanu A, Castillo C, Diaz F, Vieweg S. Crisislex: A lexicon for collecting and filtering microblogged communications in crises. In: Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM’14). 2014. p. 376–85.
Olteanu A, Vieweg S, Castillo C. What to expect when the unexpected happens: Social media communications across crises. In: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW’15). 2015. p. 994–1009.
Zubiaga A, Liakata M, Procter R, Hoi GWS, Tolmie P. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PloS one. 2016;11(3):0150989.
Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI’95). 1995. p. 1137–45.
Mimno D, Wallach HM, Talley E, Leenders M, McCallum A. Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’11). 2011. p. 262–72.
Yao L, Zhang Y, Wei B, Qian H, Wang Y. Incorporating probabilistic knowledge into topic models. In: Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’15). 2015. p. 586–97.
Ritter A, Etzioni O, Clark S. Open domain event extraction from twitter. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). 2012. p. 1104–12.
Halder S, Lim KH, Chan J, Zhang X. Transformer-based multi-task learning for queuing time aware next poi recommendation. In: Proceedings of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’21). 2021. p. 510–23.
Brilhante IR, Macedo JA, Nardini FM, Perego R, Renso C. On planning sightseeing tours with tripbuilder. Inform Process Manag. 2015;51(2):1–15.
Zhou F, Wu H, Trajcevski G, Khokhar A, Zhang K. Semi-supervised trajectory understanding with poi attention for end-to-end trip recommendation. ACM Trans Spatial Algorith Syst (TSAS). 2020;6(2):1–25.
Zheng D, Hu T, You Q, Kautz HA, Luo J. Towards lifestyle understanding: Predicting home and vacation locations from user’s online photo collections. In: Proceedings of the Ninth International AAAI Conference on Web and Social Media (ICWSM’15). 2015. p. 553–61.
Cao B, Chen F, Joshi D, Philip SY. Inferring crowd-sourced venues for tweets. In: Proceedings of the 2015 IEEE International Conference on Big Data (BigData’15). 2015. p. 639–48.
Zheng X, Han J, Sun A. A survey of location prediction on twitter. IEEE Trans Knowl Data Eng. 2018;30(9):1652–71.
Dhiman A, Toshniwal D. An approximate model for event detection from twitter data. IEEE Access. 2020;8:122168–84.
George Y, Karunasekera S, Harwood A, Lim KH. Real-time spatio-temporal event detection on geotagged social media. J Big Data. 2021;8(91):1–28.
Weng J, Lim E-P, Jiang J, He Q. Twitterrank: Finding topic-sensitive influential twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM’10). 2010. p. 261–70.


Acknowledgements

This research is funded in part by the Defence Science and Technology Group, Edinburgh, South Australia, under contract MyIP:7293, and the Singapore University of Technology and Design under grant SRG-ISTD-2018-140.

Funding

This research is funded by MyIP:7293 and SRG-ISTD-2018-140.

Author information

Authors and Affiliations

Engineering Product Development Pillar, Singapore University of Technology and Design, Singapore, Singapore
Wenchuan Mu
Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore, Singapore
Kwan Hui Lim & Junhua Liu
Forth AI, Singapore, Singapore
Junhua Liu
School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Shanika Karunasekera, Lucia Falzon & Aaron Harwood

Authors

Wenchuan Mu
Kwan Hui Lim
Junhua Liu
Shanika Karunasekera
Lucia Falzon
Aaron Harwood

Contributions

WM and KHL designed the research, ran experiments, analyzed results, and wrote the manuscript. JL, SK, LF and AH designed the research, analyzed results and contributed to manuscript preparation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kwan Hui Lim.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

The authors consent to the publication of this manuscript.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Detailed results on topic coherence and pointwise mutual information

See Tables 4 and 5.

Table 4 Comparison of ClusTop algorithm against various baselines, in terms of Topic Coherence (TC) and Pointwise Mutual Information (PMI) for the top 5 and 10 keywords

Table 5 Comparison of ClusTop algorithm against various baselines, in terms of Topic Coherence (TC) and Pointwise Mutual Information (PMI) for the top 15 and 20 keywords

Appendix B: Detailed results on Precision, Recall and F-score

See Tables 6, 7, 8 and 9.

Table 6 Comparison of ClusTop algorithm against various baselines, in terms of Precision (Pre), Recall (Rec) and F-score (FS) for the top 5 keywords/unigrams of each topic

Table 7 Comparison of ClusTop algorithm against various baselines, in terms of Precision (Pre), Recall (Rec) and F-score (FS) for the top 10 keywords/unigrams of each topic

Table 8 Comparison of ClusTop algorithm against various baselines, in terms of Precision (Pre), Recall (Rec) and F-score (FS) for the top 15 keywords/unigrams of each topic

Table 9 Comparison of ClusTop algorithm against various baselines, in terms of Precision (Pre), Recall (Rec) and F-score (FS) for the top 20 keywords/unigrams of each topic

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mu, W., Lim, K.H., Liu, J. et al. A clustering-based topic model using word networks and word embeddings. J Big Data 9, 38 (2022). https://doi.org/10.1186/s40537-022-00585-4

Received: 13 October 2021
Accepted: 14 March 2022
Published: 11 April 2022
DOI: https://doi.org/10.1186/s40537-022-00585-4
