Real-time event detection in social media streams through semantic analysis of noisy terms

Kolajo, Taiwo; Daramola, Olawande; Adebiyi, Ayodele A.

doi:10.1186/s40537-022-00642-y

Research
Open access
Published: 12 July 2022

Journal of Big Data volume 9, Article number: 90 (2022)

Abstract

Social media platforms make it possible for anyone, irrespective of physical location, to quickly access information on events taking place around the globe. However, the semantic processing of social media data is complicated by challenges such as language complexity, unstructured data, and ambiguity. In this paper, we propose the Social Media Analysis Framework for Event Detection (SMAFED). SMAFED aims to facilitate improved semantic analysis of noisy terms in social media streams, improved representation/embedding of social media stream content, and improved summarization of event clusters in social media streams. To this end, we employed key concepts such as an integrated knowledge base, ambiguity resolution, semantic representation of social media streams, and Semantic Histogram-based Incremental Clustering based on semantic relatedness. Two evaluation experiments were conducted to validate the approach. First, we evaluated the impact of the data enrichment layer of SMAFED. We found that SMAFED outperformed other pre-processing frameworks, with a lower loss function value of 0.15 on the first dataset and 0.05 on the second dataset. Second, we determined the accuracy of SMAFED at detecting events from social media streams. The result of this second experiment showed that SMAFED outperformed existing event detection approaches, with better Precision (0.922), Recall (0.793), and F-Measure (0.853) scores. These findings present SMAFED as a more efficient approach to event detection in social media.

Introduction

Event detection is a computational operation that enables the automatic identification of significant incidents by analysing social media data. An event refers to a significant incident happening in a place and time [1].

Event detection in social media streams is a worthwhile research area because it provides the opportunity to learn about current happenings and people’s views at the click of a button. Gone are the days when people could ‘kill’ news that they did not want others to know about by suppressing it through a news agency or organization. This is no longer possible with the existence of various social media platforms. Once a piece of interesting news gets into social media, it travels fast and reaches a wide audience, who continue to spread it until a desirable action is taken. Thus, research in this area is important for objectively revealing current events and happenings, such as breaking news, sudden outbreaks, infectious diseases, and terror attacks [2].

We are in the era of social media, where there is abundant data and opportunities to be exploited. Social media enable us to take advantage of the social nature of human association, making it possible for individuals to express their feelings, become part of a virtual network and collaborate remotely [3].

A social media stream is made up of user-generated content. As such, ambiguity, which is one of the core problems of natural language text, also exists in the content of social media streams. Ambiguity occurs when a word can be interpreted in at least two senses in a given context [4, 5]. Ambiguity may not present any difficulty to human readers (e.g., native speakers) because it can be resolved using knowledge of the context and common sense. However, disambiguating text efficiently with a computer application remains a problem [6].

Social media streams are characterized by short messages; constantly evolving, irregular, informal, and abbreviated words; grammatical and spelling errors; mixed languages; ambiguity; and improper sentence structure [7]. These characteristics make it difficult for methods that rely on such content for computational purposes to perform adequately and effectively [7,8,9,10,11,12]. Likewise, many current methodologies for event detection have mostly relied on simple keyword or topic retrieval and have not fully considered the valuable semantics embedded in social media streams, which hinders the accuracy of event detection [13, 14]. Of particular interest to this paper is the problem of ambiguity that stems from the use of slangs, abbreviations, and acronyms (SAB) in social media streams, which complicates the interpretation of such terms during the event detection process. Most existing event detection methods have not considered the semantic analysis of noisy terms in the form of SAB, and their associated ambiguities, in the design of their solutions.

Social media is a suitable medium for reporting serious events and emergencies. However, it presents many challenging issues that make it difficult to effectively uncover interesting and useful messages. In the context of this paper, event summarization refers to finding a representative tweet that suitably represents an event cluster. The noisy characteristics of social media content require new semantic innovations that facilitate accurate analysis of social media streams. Thus, it is necessary to improve the precision of event detection techniques by resolving the noisy attributes of social media content.

This paper proposes a Social Media Analysis Framework for Event Detection (SMAFED) that resolves the noisy and ambiguous terms in social media streams to improve event detection accuracy. SMAFED was realized by integrating a local vocabulary of slangs, acronyms, and abbreviations with incremental semantic clustering. This facilitates the understanding of implicit semantics embedded in social media streams and thereby improves event detection. SMAFED was evaluated by benchmarking it against existing approaches, including locality sensitive hashing [15], cluster summarisation [16], an entity-based approach [17], and the Repp framework [18]. The Precision, Recall, and F-measure metrics were used to assess the accuracy of event detection. To further demonstrate the plausibility of SMAFED in other research domains, the pre-processing and enrichment components of SMAFED were benchmarked against other pre-processing frameworks on the task of extracting sentiments from tweets, using a generalized dataset (the Twitter sentiment analysis training corpus) and a dataset of Nigerian origin called Naija-tweets.
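As a quick check on how the three metrics relate, the sketch below computes the F-measure as the harmonic mean of Precision and Recall. It assumes the balanced (F1) form of the measure; the function name f_measure is illustrative and does not come from the paper. The input values are the Precision and Recall scores reported in the abstract.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported in the abstract for SMAFED.
score = f_measure(0.922, 0.793)
print(round(score, 3))  # 0.853, matching the reported F-Measure
```

Under this assumption, the reported F-Measure of 0.853 is consistent with the reported Precision and Recall.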

The contributions of this paper are as follows:

1. Existing event detection methods have focused mostly on filtering out slangs, abbreviations, and acronyms (SAB); removing noisy terms including SAB, or ignoring them entirely during the pre-processing stage of social media streams. They did not perform semantic analysis of noisy terms like SAB to determine their contextual meanings and their impact on the accuracy of results. Semantic analysis of SAB terms was done for the first time in this study which yielded improved results.
2. In contrast to existing approaches, the proposed framework (SMAFED) introduces the data enrichment layer that enables the semantic analysis of SAB and ambiguity issues associated with their usage.
3. A dedicated algorithm for the disambiguation of slangs, acronyms, and abbreviations (SABDA) was proposed and used to disambiguate ambiguous SAB terms to better understand and interpret noisy terms in social media streams.
4. An integrated knowledge base (IKB) representing a local vocabulary of SAB terms was created to facilitate semantic analysis of noisy terms in social media streams. The IKB is a valuable and reusable resource that can support other computational operations on SAB.
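To make the idea of knowledge-base-driven disambiguation concrete, the following is a minimal sketch and not the SABDA algorithm itself, whose details are given later in the paper. It assumes a toy knowledge base in which each SAB term maps to candidate expansions with a few associated context words, and it picks the candidate with the largest word overlap with the tweet (a simplified Lesk-style heuristic). The names TOY_IKB, disambiguate, and enrich, and all entries in the toy knowledge base, are invented for illustration.

```python
# Hypothetical mini knowledge base (invented for illustration): each SAB term
# maps to candidate expansions paired with a few typical context words.
TOY_IKB = {
    "pms": [
        ("premium motor spirit", {"fuel", "petrol", "price", "station", "litre"}),
        ("premenstrual syndrome", {"pain", "mood", "health", "cramps"}),
    ],
}


def disambiguate(term, context_tokens, ikb=TOY_IKB):
    """Return the expansion whose context words overlap most with the tweet."""
    candidates = ikb.get(term.lower())
    if not candidates:
        return term  # not a known SAB term: leave it unchanged
    context = {t.lower() for t in context_tokens}
    # Ties are resolved in favour of the first listed candidate.
    best = max(candidates, key=lambda cand: len(cand[1] & context))
    return best[0]


def enrich(tweet, ikb=TOY_IKB):
    """Replace every known SAB term in a tweet with its chosen expansion."""
    tokens = tweet.split()
    return " ".join(disambiguate(t, tokens, ikb) for t in tokens)


print(enrich("PMS price up again at the station"))
# premium motor spirit price up again at the station
```

The point of the sketch is only that the surrounding words of a tweet can select among competing meanings of a noisy term, which is the role the contributions above assign to the data enrichment layer.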

The remainder of this paper is organized as follows. The Related work section gives an overview of previous approaches that are relevant to event detection. The Methodology section describes the proposed Social Media Analysis Framework for Event Detection (SMAFED) for improved event detection in social media streams. The Evaluation experiment section reports the evaluation of SMAFED, discussing the impact of the data enrichment layer on the results of SMAFED and the performance of SMAFED when used for event detection from social media streams. The paper concludes in the Conclusion and further work section with a summary and an overview of future work.

Related work

This section presents an overview of relevant previous research efforts on event detection from social media streams. The aspects covered include unsupervised learning, semi-supervised learning, supervised learning, and semantic-based approaches for event detection in social media streams.

Unsupervised learning for event detection in social media stream

Unsupervised learning is a type of learning that draws inferences from an unlabeled dataset [19]. Because several iterations are required to compute similarity or dissimilarity in the observed dataset, in most cases the entire dataset must be available in memory before the algorithm is run. With data stream clustering, however, the challenge is to search for new structure in the data as it evolves and to characterize the streaming data as clusters that can be leveraged to report events in the data stream. The clusters are then ordered based on a scoring function [1]. Some studies on event detection based on unsupervised learning are presented next.

Authors in [15] worked on streaming first story detection with an application to Twitter. The authors used Locality Sensitive Hashing (LSH) to place similar documents in the same bucket. Shannon entropy was used to measure the information contained in a cluster, and clusters were ranked by their entropy value. Event detection in Twitter was carried out by [20]. The paper focused on detecting real-life events from tweets using Event Detection with Clustering of Wavelet-based Signals (EDCoW). Authors in [21] employed Term Frequency (TF) and Kullback–Leibler divergence (KLD) to propose real-time summarization of scheduled events from Twitter streams. The work addressed the summarization of tweet content to provide the user with a summarized stream describing the key sub-events, using a two-step process: sub-event detection with an outlier-based technique, followed by selection of tweets related to each detected sub-event to provide a summary. For the summary, the TF and KLD techniques were compared, and KLD was found to perform better. Authors in [16] proposed a framework to detect events in social streams using similarity scores and cluster summarisation techniques. The authors used content- and network-stream-based clustering for event detection. Mining spatio-temporal information on microblogging streams using a density-based online clustering method was proposed by [22]. The paper investigated the extraction of spatio-temporal features of social media streams by employing an incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to enhance event awareness. A weighting factor called BursT, which uses a sliding window technique to address concept drift, was employed. However, none of these approaches focused on handling or analysing SAB terms prevalent in social media streams.
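The two building blocks mentioned for [15], bucketing similar documents with LSH and scoring clusters by Shannon entropy, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation in [15]: it assumes the random-hyperplane variant of LSH over fixed-length numeric vectors, and the function names make_hyperplanes, lsh_bucket, and shannon_entropy are hypothetical.

```python
import math
import random
from collections import Counter


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bucket(vector, hyperplanes):
    """Hash a vector to a bucket key: one bit per hyperplane, set by the
    side of the plane the vector falls on. Vectors separated by a small
    angle are likely to share the same key."""
    return tuple(
        int(sum(v * h for v, h in zip(vector, plane)) >= 0)
        for plane in hyperplanes
    )


def shannon_entropy(tokens):
    """Entropy (in bits) of the token distribution within one cluster."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


planes = make_hyperplanes(dim=3, n_bits=8)
# A vector and a positively scaled copy point in the same direction,
# so they always land in the same bucket.
print(lsh_bucket([1, 2, 3], planes) == lsh_bucket([2, 4, 6], planes))  # True
print(shannon_entropy(["flood", "flood", "rain", "rain"]))  # 1.0
```

A cluster whose tokens are spread over many distinct terms has higher entropy than one dominated by a single repeated term, which is what makes entropy usable as a ranking signal.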

The Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) clustering algorithm was used by [23]. The research focused on detecting localized events and tracking their evolution. Spatio-temporal characteristics of keywords were continuously extracted using the entropy of the spatial signature. BIRCH, a single-pass clustering algorithm, was used to group event keywords based on the cosine similarity of their spatial signatures. The top-k scoring clusters were considered possible event clusters. In the pre-processing stage, stop-words were removed, stemming was applied, and dictionary lookups were performed in WordNet (a lexical database of English words). Authors in [24] employed multiple features of social media feeds, such as title, description, textual location, location proximity, and date, along with Term Frequency-Inverse Document Frequency (TF-IDF) and Normalized Mutual Information Frequency to detect events. In the same vein, [25] traced the German centennial flood in a stream of tweets. The authors applied a density-based clustering algorithm, Ordering Points to Identify the Clustering Structure (OPTICS), to group flood-related tweets with respect to time and location. The result was validated with subsidiary data sources. Fuzzy hierarchical agglomerative clustering was used in the TweetMogaz framework for identifying new stories in social media [26]. The framework used an adaptive method to track relevant tweets. Fuzzy hierarchical agglomerative clustering, with term co-occurrence probability as the distance measure, was used to identify hot-story tweets with enough content. To overcome the problem of duplicate story detection, the cosine similarity between the vectors of two clusters was computed.
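Several of the approaches above combine a single pass over the stream with a cosine similarity test. The sketch below illustrates that general pattern only; it is not BIRCH and not the method of any cited work. It assumes raw term counts as the document representation (rather than TF-IDF weights or spatial signatures), a running sum of member vectors as the cluster centroid, and an arbitrary threshold of 0.5. The names cosine and single_pass_cluster are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def single_pass_cluster(docs, threshold=0.5):
    """Assign each document, in arrival order, to the most similar existing
    cluster, or open a new cluster if none reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # centroid = summed member counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


docs = [
    "flood in city centre",
    "city centre flood warning",
    "football match tonight",
]
print([c["members"] for c in single_pass_cluster(docs)])  # [[0, 1], [2]]
```

Because each document is seen once and compared only against cluster centroids, the cost per document grows with the number of clusters rather than with the number of documents seen so far, which is why this pattern suits streaming data.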

Also, real-time entity-based event detection for Twitter was proposed by [17]. The proposed approach identified bursty named entities and then clustered tweets based on the occurrence of the named entities using a cosine similarity score. Multiscale event detection in social media was presented by [27]. The authors explored the properties of the wavelet transform and proposed a novel algorithm that computes a data similarity graph at appropriate scales and simultaneously detects events of different scales in a single graph-based clustering process. The clustering process is based on comparing common terms between pairs of tweets. Authors in [28] proposed a framework for bursty event detection from microblogs using a distributed and incremental approach. The paper focused on detecting events from Weibo (a microblog) on the Spark engine, taking topic drift into consideration, by employing a distributed and incremental temporal topic model, Bursty Event dEtection (BEE+). Online indexing and clustering of social media data for emergency management was presented by [29]. The authors implemented three online indexing techniques: incremental TF-IDF, Skewness, and the Learn & Forget Model. Clustering was evaluated using the Silhouette and Davies-Bouldin metrics. Authors in [13] presented three different approaches to merging information from two different social media sources using time-evolving graphs. It was demonstrated that using information from multiple data streams increases the quality and quantity of detected events. An event detection system that uses inverted indices and incremental clustering algorithms was proposed by [30]. Burst detection based on the volume of tweets without considering the tweets’ context may be misleading, because co-occurring terms in tweets may not be synonymous once the context in which they are used is taken into consideration. Real-time event detection on social data streams was conducted by [31]. Events were modelled as a list of clusters of trending entities over time using entity co-occurrence, Louvain clustering, and aggregate ranking. The approach only considered the length of words contained in a tweet but did not look at the local and global importance of such a representative. In addition, SAB terms were not handled. Authors in [32] proposed a multimedia big data system that used both an incremental clustering event detection approach enriched with the analysis of multimedia content and a bio-inspired influence analysis technique to support alert spread and situation awareness over the network. Target-aware holistic influence maximization in spatial social networks was carried out by [33]. The authors devised a diffusion model that accounts for both physical and cyber user interactions. They also proposed a spatial, social index based on an R-tree that computes users’ interest similarity with respect to online keyword queries. Synthetic data and three datasets were used to validate the effectiveness of the proposed model. Authors in [35] addressed the drawbacks of conventional methods for detecting sub-events from social media by proposing a hashtag-based sub-event detection framework. In the same vein, authors in [36] proposed a spatiotemporal clustering-based method to detect traffic events using geosocial media data. However, these approaches did not handle the semantic analysis of SAB in social media content. A summary of unsupervised learning approaches to event detection is shown in Table 1.
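The notion of a "bursty" term or entity, and the caveat that volume alone can mislead, can be illustrated with a simple volume-based rule. The sketch below is a generic heuristic and is not taken from any of the cited works: it assumes that a term is flagged as bursty when its count in the current time window exceeds the mean of its previous window counts by k standard deviations. The function name is_bursty and the parameters k and min_sigma are hypothetical.

```python
from statistics import mean, pstdev


def is_bursty(history, current, k=3.0, min_sigma=1.0):
    """Flag a term as bursty if its current-window count exceeds the
    historical mean by k standard deviations.

    history   -- counts of the term in previous time windows
    current   -- count of the term in the current window
    min_sigma -- floor on the deviation, so that a perfectly flat history
                 does not turn every small increase into a burst
    """
    if len(history) < 2:
        return False  # too little history to estimate a baseline
    mu = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return current > mu + k * sigma


print(is_bursty([2, 3, 2, 3], 20))  # True: 20 > 2.5 + 3 * 1.0
print(is_bursty([2, 3, 2, 3], 5))   # False: 5 <= 5.5
```

A rule of this kind looks only at counts, so it cannot tell whether co-occurring terms are used in the same sense. That limitation is what motivates the semantic, context-aware treatment of terms discussed in this section.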

Table 1 Unsupervised learning approaches used for event detection
