Real-time event detection in social media streams through semantic analysis of noisy terms

Interactions via social media platforms have made it possible for anyone, irrespective of physical location, to gain access to quick information on events taking place all over the globe. However, the semantic processing of social media data is complicated due to challenges such as language complexity, unstructured data, and ambiguity. In this paper, we proposed the Social Media Analysis Framework for Event Detection (SMAFED). SMAFED aims to facilitate improved semantic analysis of noisy terms in social media streams, improved representation/embedding of social media stream content, and improved summarization of event clusters in social media streams. For this, we employed key concepts such as integrated knowledge base, resolving ambiguity, semantic representation of social media streams, and Semantic Histogram-based Incremental Clustering based on semantic relatedness. Two evaluation experiments were conducted to validate the approach. First, we evaluated the impact of the data enrichment layer of SMAFED. We found that SMAFED outperformed other pre-processing frameworks with a lower loss function of 0.15 on the first dataset and 0.05 on the second dataset. Second, we determined the accuracy of SMAFED at detecting events from social media streams. The result of this second experiment showed that SMAFED outperformed existing event detection approaches with better Precision (0.922), Recall (0.793), and F-Measure (0.853) metric scores. The findings of the study present SMAFED as a more efficient approach to event detection in social media.

When a piece of interesting news gets into social media, it travels faster. It reaches a wider audience who would continue to spread the news until a desirable action is taken. Thus, researching this area is important to objectively reveal current events and happenings, such as breaking news, instant outbreaks, infectious disease, and terror attacks [2].
We are in the era of social media, where there is abundant data and opportunities to be exploited. Social media enable us to take advantage of the social nature of human association, making it possible for individuals to express their feelings, become part of a virtual network and collaborate remotely [3].
A social media stream is made up of user-generated content. As such, ambiguity, which is part of the core problems of natural language text, also exists in the content of social media streams. Ambiguity occurs when a word can be expressed in at least two ways or senses in a determined context [4,5]. Ambiguity may not present any difficulty to a human being (i.e., native speakers) because ambiguity can be resolved using the knowledge of the context and common sense. But disambiguating text efficiently with a computer application is still a problem [6].
Social media streams are characterized by short messages; utilization of progressively advancing sporadic, casual, abridged words, syntactic and spelling blunders; blended dialects; vagueness; and inappropriate sentence structure [7]. These characteristics make it difficult for methods that rely on them for computational purposes to perform adequately and effectively [7][8][9][10][11][12]. Likewise, many of the current methodologies for event detection have mostly paid attention to the use of trivial keywords or themes retrieval. Still, they have not fully considered the valuable semantics embedded in social media streams, which hinders the accuracy of event detection [13,14]. Particularly of interest to this paper is the problem of ambiguity that stems from the use of slangs, abbreviations, and acronyms (SAB) in social media streams which makes the interpretation of such terms complicated during the event detection process. Most of the existing event detection methods have not considered the semantic analysis of noisy terms in the form of SAB and associated ambiguities in the design of their solutions.
Social media is a suitable medium for reporting serious events and emergencies. However, it presents many challenging issues that make it difficult to effectively uncover interesting and useful messages. Event summarization in the context of this paper can be referred to as finding a tweet representative that can suitably represent an event cluster. The noisy characteristics of social media content require new semantic innovations that will facilitate accurate analysis of social media streams. Thus, improving the precision of event detection techniques by resolving the noisy attributes of social media content is necessary.
This paper proposes a Social Media Analysis Framework for Event Detection (SMAFED) that resolves the noisy and ambiguous terms in social media streams to improve event detection accuracy. SMAFED was realized by integrating a local vocabulary consisting of slangs, acronyms, abbreviations; and incremental semantic clustering. This is to facilitate the understanding of implicit semantics embedded in social media streams to improve event detection. An evaluation was done by benchmarking SMAFED with existing approaches, including locality sensitive hashing [15], cluster summarisation [16], entity-based approach [17], and Repp framework [18]. The Precision, Recall, and F-measure metrics were used to assess the accuracy of event detection. To further demonstrate the plausibility of SMAFED in other research domains, the pre-processing and enrichment components of SMAFED were benchmarked with other pre-processing frameworks to extract sentiments from tweets using a generalized dataset, Twitter sentiment analysis training corpus, and a dataset of Nigerian origin called Naija-tweets.
The contributions of this paper are as follows: 1. Existing event detection methods have focused mostly on filtering out slangs, abbreviations, and acronyms (SAB); removing noisy terms including SAB, or ignoring them entirely during the pre-processing stage of social media streams. They did not perform semantic analysis of noisy terms like SAB to determine their contextual meanings and their impact on the accuracy of results. Semantic analysis of SAB terms was done for the first time in this study which yielded improved results. 2. In contrast to existing approaches, the proposed framework (SMAFED) introduces the data enrichment layer that enables the semantic analysis of SAB and ambiguity issues associated with their usage. 3. A dedicated algorithm for the disambiguation of slangs, acronyms, and abbreviations (SABDA) was proposed and used to disambiguate ambiguous SAB terms to better understand and interpret noisy terms in social media streams. 4. An integrated knowledge base (IKB) representing a local vocabulary of SAB terms was created to facilitate semantic analysis of noisy terms in social media streams. The IKB is a valuable and reusable resource that can support other computational operations on SAB.
The remainder of this paper is organized as follows. The Related work section gives an overview of previous approaches that are relevant to event detection. The Methodology section describes the proposed Social Media Analysis Framework for Event Detection (SMAFED) for improved event detection in social media streams. The Evaluation experiment section reports the evaluation of SMAFED, covering the impact of the data enrichment layer on the results of SMAFED and the performance of SMAFED when used for event detection from social media streams. The paper is concluded in the Conclusion and further work section with a summary and an overview of future work.

Related work
This section presents an overview of relevant previous research efforts on event detection from social media streams. The aspects covered include unsupervised learning, semi-supervised learning, supervised learning, and semantic-based approaches for event detection in social media streams.

Unsupervised learning for event detection in social media stream
Unsupervised learning is a type of learning that draws inductions from an unlabeled dataset [19]. Due to several iterations required to compute similarity or dissimilarity in the observed dataset, all of the datasets ought to be accessible in memory before running the algorithm in most cases. However, with data stream clustering, the challenge is searching for a new structure in the data as it evolves, characterizing the streaming data in clusters to leverage them to report events in the data stream. The clusters are then ordered based on the scoring function [1]. Some studies on event detection based on unsupervised learning are presented next.
Authors in [15] worked on streaming first story detection with an application to Twitter. The authors used a hash function called Locality Sensitive Hashing (LSH) to place similar documents in the same bucket. Shannon entropy was used to measure the information contained in the cluster. Clusters were ranked based on the value of the entropy. Event detection in Twitter was carried out by [20]. The paper focused on detecting real-life events from tweets using Event Detection with Clustering of Wavelet-based Signals (EDCoW). Authors in [21] employed Term Frequency (TF) and Kullback-Leibler divergence (KLD) to propose real-time summarization of scheduled events from Twitter streams. The work addressed summarization of tweet content to provide the user with a summed-up stream describing the key sub-events by employing a two-step process: sub-event detection using an outlier-based sub-event detection technique, and selection of tweets related to the detected sub-event to provide a summary. For the summary, the TF and KLD techniques were compared, and KLD was found to perform better. Authors in [16] proposed a framework to detect events in social streams using similarity score and cluster summarisation techniques. The authors used content- and network-stream-based clustering for event detection. Mining spatio-temporal information on microblogging streams using a density-based online clustering method was proposed by [22]. The paper investigated the extraction of spatio-temporal features of social media streams by employing an incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to enhance event awareness. A weighting factor called BursT, a sliding window technique to address concept drift, was employed. However, none of these outlined approaches focused on handling or analysing SAB terms prevalent in social media streams.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) clustering algorithm was proposed by [23]. The research work was based on detecting localized events and tracking the evolution of such events. Spatio-temporal characteristics of keywords were continuously extracted using the entropy of the spatial signature. A single-pass clustering algorithm, Birch, was used to group event keywords based on the cosine similarity of their spatial signatures. Top-k scoring clusters were considered possible event clusters. In the pre-processing stage, stop-words were removed, stemming was applied, and WordNet (a lexical dictionary for English words) dictionary lookups were performed. Authors in [24] employed multiple social media feeds features such as titles, description, location-textual, location proximity, and date along with Term Frequency-Inverse Document Frequency (TF-IDF) and Normalized Mutual Information Frequency to detect events. In the same vein, [25] traced the German centennial flood in the stream of tweets. The authors applied density-based clustering called Ordering Points to Identify the Clustering Structure (OPTICS) to group a set of flood-related tweets with respect to time and location. The result was validated with subsidiary data sources. Fuzzy hierarchical agglomerative clustering was used to propose a TweetMogaz framework for identifying new stories in social media [26]. The framework used an adaptive method to track relevant tweets. Fuzzy Hierarchical Agglomerative clustering with term co-occurrence probability as a distance measure was used to identify hot stories tweets with enough content. To overcome the problem of duplicate story detection, cosine similarity was used to compute vectors of the two clusters.
Also, real-time entity-based event detection for Twitter was proposed by [17]. The proposed approach identified bursty named entities and then clustered tweets based on the occurrence of the named entities using a cosine distance similarity score. Multiscale event detection in social media was presented by [27]. The authors explored the properties of the wavelet transform and proposed a novel algorithm to compute a data similarity graph at appropriate scales and simultaneously detect events of different scales by a single graph-based clustering process. The clustering process is based on comparing common terms between pairs of tweets. Authors in [28] proposed a framework for bursty event detection from microblogs using a distributed and incremental approach. The paper focused on detecting events from Weibo (a microblog) on the Spark engine, taking topic drift into consideration, by employing a distributed and incremental temporal topic model, Bursty Event dEtection (BEE+). Online indexing and clustering of social media data for emergency management was presented by [29]. The authors implemented online indexing techniques: incremental TF-IDF, Skewness, and the Learn & Forget Model. Clustering was evaluated using Silhouette and Davies-Bouldin metrics. Authors in [13] presented three different approaches to merging information from two different social media sources using time-evolving graphs. It was demonstrated that using information from multiple data streams increases the quality and quantity of detected events. An event detection system that uses inverted indices and incremental clustering algorithms was proposed by [30]. Burst detection based on the volume of tweets without considering the tweets' context may be misleading because co-occurring terms in tweets may not be synonymous when the context in which they are used is taken into consideration. Real-time event detection on social data streams was conducted by [31].
Events were modelled as a list of clusters of trending entities over time using entity co-occurrence, Louvain clustering, and aggregate ranking. The approach only considered the length of words contained in a tweet but did not consider the local and global importance of such a representative. In addition, SAB terms were not handled. Authors in [32] proposed a multimedia big data system that used both an incremental clustering event detection approach enriched with the analysis of multimedia content and a bio-inspired influence analysis technique to support alert spread and situation awareness over the network. Target-aware holistic influence maximization in spatial social networks was carried out by [33]. The authors came up with a diffusion model which takes care of both physical and cyber user interactions. They also proposed a spatial, social index based on an R-tree algorithm that computes users' interest similarity concerning online keyword queries. Both synthetic and three datasets were used to validate the effectiveness of the proposed model. Authors in [35] addressed the drawbacks of conventional methods for detecting sub-events from social media by proposing a hashtag-based sub-event detection framework. In the same vein, authors in [36] proposed a spatiotemporal clustering-based method to detect traffic events using geosocial media data. However, these approaches did not handle the semantic analysis of SAB in social media content. The summary of unsupervised learning approaches to event detection is shown in Table 1.

Semi-supervised learning for event detection in social media stream
Semi-supervised learning models are trained by combining both the unlabeled and labelled data. More specifically, a little proportion of labelled data with a great deal of unlabeled data. Some of the event detection efforts based on semi-supervised learning are now presented. Authors in [37] identify and characterize social media events by using generic event detection and topic-specific event detection with TF-IDF and Naïve Bayes. Civil unrest prediction with a Tumblr-based exploration was reported by [38]. The authors focused on detecting civil unrest by continuously applying text-based filters (keyword, location and future date filters) to the Tumblr data stream. A semi-supervised method for Automatic Targeted-domain Spatio-temporal Event Detection (ATSED) in Twitter using historical and real-life Twitter streams was proposed by [39]. The proposed method was suitable for event detection from historical data but not for real-time event detection. SPOTHOT: Scalable detection of geo-spatial events in large textual streams was proposed by [40]. The authors proposed a SigniTrend event detection system capable of tracking unusual occurrences of arbitrary words at arbitrary locations in real-time without specifying the terms of interest in advance. None of these outlined methods handled the noisy characteristics inherent in social media data.
Also, [41] implemented various algorithms such as k-means, hierarchical agglomerative clustering, and Latent Dirichlet Allocation (LDA) topic modelling on the Twitter stream to analyze real-time Twitter data and empower citizens by keeping them updated about what is happening around the city. In the pre-processing stage, hashtags, stop-words, URLs, and special characters were removed and stemming was applied, but there was no treatment of SAB terms. Authors in [42] proposed a model for detecting and tracking breaking news from Twitter in real-time by employing Multinomial Naïve Bayes Classifier and DBSCAN algorithms. The proposed model could not dynamically learn from the available new sources. The pre-processing stage removed tags, mentions, URLs, and non-ASCII characters but did not address SAB terms prevalent in social media posts. Authors in [18] proposed a framework for detecting news events from the Twitter stream in real-time. The approach used ANN to classify news-relevant tweets from the stream based on AvgW2V and Mini-batch cluster to group detected tweets into events. Authors in [43] worked on sub-story detection in Twitter with a hierarchical Dirichlet process. The paper proposed a Hierarchical Dirichlet Process (HDP) to address the problem of automatic sub-story detection associated with the main story. Like others, none of these research efforts considered resolving the ambiguity of SAB terms during the analysis of social media content. The summary of semi-supervised approaches to event detection is presented in Table 2.

Supervised learning for event detection in social media stream
The supervised learning models are the class of machine learning algorithms that can extrapolate a prediction or classification function after being trained on labelled sample data.
The training examples consist of pairs of an input (vector) and an output (supervisory signal). Instances are of the format (x, y), where x is a vector and y is referred to as the class or target attribute (or scalar). Supervised learning approaches typically build a model that maps x to y by finding a mapping m(.) such that m(x) = y. Given an unlabeled instance x and the mapping m(.) learned from the training data, the outcome m(x) of the unlabeled instance can be computed. Some instances of supervised learning applied to event detection are presented next. Authors in [44] proposed geo-spatial event detection in a Twitter stream. Machine learning algorithms (Naïve Bayes, Multilayer perceptron, and Prune C4.5) were used to analyse whether the geo-spatial clusters contain real-life events. The detected events (candidate clusters) were displayed on a map with their locations in real-time, in descending order of the individual tweet ranking score. Authors in [45] proposed a graphical model, the location-time-constraint topic (LTT) model (an improvement over LDA), to capture social media time, content, and location data. Kullback-Leibler (KL) divergence was used to measure the similarity of uncertain media content. Social events were detected using a hash-based indexing scheme, Variable Dimensional Extendible Hash (VDEH). The LTT model was refreshed after every block of tweets in an incoming time slot to accommodate topic drift. A Transaction-based Rule Change Mining (TRCM) framework that applied Association Rule Mining to extract association rules from tweets' hashtags was proposed by [46]. Unexpected changes in the consequent and conditional rules in each time slot were ranked. Hashtags detected were then compared with the key terms in the ground truth from BBC Sport commentary within the same time frame. Authors in [47] studied real-time top-R topic detection on Twitter with topic hijack filtering.
The extraction of meaningful topics and the filtering of noisy messages over the Twitter stream were integrated using streaming Non-negative Matrix Factorization (NMF). There were false detections (false negatives) of hijacked topics due to model misspecification. Twitter Life Detection Framework was presented by [48] using TF-IDF for the similarity score and SFPM for classification. TF-IDF directly computes document similarity on the word-count space, which is usually slow for large vocabularies. None of these reported approaches focused on treating SAB terms in tweets.
A multimodal classification of events in social media using TF-IDF and SVM was presented by [49]. The pre-processing stage removed stop-words, special characters, numbers, emoticons, HTML tags, and words with fewer than four characters. Authors in [50] proposed audio-based multimedia event detection using recurrent neural networks. The authors introduced longer-range temporal information with a recurrent neural network for feature representation and classification to determine whether a given event can be traced to a video. In the same vein, multimedia event detection was presented by [51]. The author proposed algorithms for detecting complex events from web videos by engaging a two-stage convolutional neural network. Our focus in this paper is on social media content, which is very noisy because it is user-generated. These reported efforts on multimedia event detection did not specifically address SAB terms and grammatical errors that may be contained in video data.
Also, [52] proposed an approach to detect Foodborne disease from Weibo data using TextRank and SVM. The SVM was used to filter unwanted tweets. However, the proposed framework was found to perform poorly in the face of sparsity and concept drift. A deep learning approach for traffic accident detection from social media was developed by [53]. Tokens and paired tokens were extracted from over 3 million tweets. Deep Belief Network and Long Short-Term Memory deep learning models were implemented on the extracted tokens and paired tokens to detect traffic accident information. Authors in [54] proposed a hate speech detection model to identify hatred against vulnerable minority groups using Amharic text data on Facebook. Apache Spark distributed platform was used for data pre-processing and feature extraction. Feature extraction was done using Word2Vec as an embedding model. Gated Recurrent Unit (GRU) was used for the classification stage. Table 3 summarizes supervised learning approaches that were applied to event detection.

Semantic-based approaches for event detection in social media stream
Authors in [55] worked on scalable distributed event detection using Twitter streams. The paper proposed scalable automatic distributed real-time event detection by incorporating a lexical key partitioning strategy (hash key grouping borrowed from LSH) to spread the detection process across multiple machines while avoiding partitioning as a series of subsets. The proposed framework was implemented on the Storm topology. It was identified that no pre-processing was done, even though Twitter streams are noisy, temporal, and full of slang. Authors in [56] proposed Locality Sensitive Hashing (LSH) to detect events from Twitter and Facebook. LSH was used twice in the event detection process: it was first used to obtain events from Twitter and Facebook independently, and it was later applied to detect cross-over events in the two social media streams. LITMUS, a system that used keywords to extract social media data related to "landslide", was proposed by [57]. The system then employed an augmented Explicit Semantic Analysis (ESA) algorithm using a semantic interpreter by extracting a subset of Wikipedia as classification features to classify data into relevant and irrelevant. Semantic clustering based on semantic distance was used for location estimation. Only geo-tagged data were considered and not the entire dataset. These semantic-based approaches did not consider the treatment of SAB terms in their analysis. Authors in [58] presented a system, ArmaTweet, which used natural language processing techniques to extract structured information from tweets and then integrated the structured information with RDF from DBpedia and WordNet. The system used semantic queries to identify tweets matching the user interest and passed them to the anomaly detection algorithm to determine their correspondence to actual events. This improves the keyword search and is suitable for topic-specific event detection.
However, the precision of the pre-processing component was not investigated in the face of acronyms, slangs, abbreviations, and passive words prevalent in social media data. A framework for event classification in tweets based on hybrid semantic enrichment using TF-IDF, Named Entity Recognition, Page Rank, and CfsSubsetEval was proposed by [59]. Semantic enrichment was combined with external document enrichment and named entity extraction to classify tweets. Authors in [60] proposed an event detection model based on scoring and word embedding to discover key events from a high volume of data streams. In the pre-processing stage, stop words, modal auxiliary verbs, URLs, and emoticons were removed. Word2Vec was used for embedding, and an improved Expectation Maximization algorithm was used for the event detection stage. Word2Vec is limited to calculating word similarities. However, none of these approaches considered resolving the ambiguity associated with SAB terms in social media. A summary of the instances of applying semantic-based approaches for event detection methods is presented in Table 4.
Our review of the literature revealed that existing event detection methods had focused mostly on filtering out SAB, removing noisy terms including SAB, or ignoring them entirely during the pre-processing stage of social streams. They did not perform semantic analysis of noisy terms like SAB to determine their contextual meanings and their impact on the accuracy of results. These noisy terms include short messages, slangs, acronyms, mixed languages, grammatical and spelling errors, dynamically evolving, irregular, informal, abbreviated words, and improper sentence structure, which make it challenging for the efficient performance of the learning algorithms [7,61]. This gap that was not addressed by previous research efforts necessitated our study.
According to [30], a social media stream must be represented in a way that preserves the semantics of its content. Hence, using the contextual clues surrounding a social media stream is critical for useful and accurate results. Thus, there is a need to develop an event detection framework that focuses on the semantic analysis of slang, acronym, and abbreviation (SAB) terms in social media streams, and on the ambiguity associated with their usage, to improve the accuracy of event detection in social media streams. Most of the previous research efforts have not addressed this problem, which is where SMAFED seeks to make a difference. The summary of the strengths and weaknesses of existing event detection techniques and their attributes is presented in Table 5.

Methodology
This paper proposes the Social Media Analysis Framework for Event Detection (SMAFED) as an efficient and integrated social media stream analysis approach incorporating social media stream pre-processing and enrichment to improve event detection results.

Description of main tasks during event detection using SMAFED
Our proposed approach to event detection in social media streams consists of ten main tasks (1-10) that are sequentially structured and can be abstracted by the Input-Process-Output model, as shown in Fig. 1. The input phase consists of task 1, tasks 2-7 constitute the process phase, while tasks 8-10 constitute the output phase. The tasks are outlined as follows: 1. Collecting the Twitter data stream; 2. Tokenizing tweets; 3. Lemmatizing tweets; 4. Filtering slangs, abbreviations, and acronyms (SAB) from tweets; 5. Resolving ambiguity issues in the usage of slangs, abbreviations, and acronyms through disambiguation; 6. Representing enriched tweets in a way that is suitable for clustering; 7. Computing semantic similarity among tweets; 8. Grouping semantically similar tweets into clusters; 9. Ranking event clusters; 10. Selecting an event representative.
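As a rough illustration of the early process-phase tasks, the sketch below tokenizes a tweet (task 2) and separates SAB terms from the remaining tokens (task 4) so that they can be passed on to disambiguation. The lexicon `SAB_LEXICON`, the regular expression, and the function names are hypothetical stand-ins, not SMAFED's actual components; lemmatization (task 3) is omitted because it needs an external lemmatizer.

```python
import re

# Hypothetical SAB lexicon (assumption); in SMAFED this role is played
# by the integrated knowledge base of slangs, abbreviations, and acronyms.
SAB_LEXICON = {"lol", "brb", "gov", "omo"}


def tokenize(tweet):
    """Task 2 (simplified): lowercase the tweet and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", tweet.lower())


def extract_sab(tokens, lexicon=SAB_LEXICON):
    """Task 4 (simplified): separate SAB terms from the remaining tokens."""
    sab = [t for t in tokens if t in lexicon]
    rest = [t for t in tokens if t not in lexicon]
    return sab, rest


tokens = tokenize("LOL the gov is late again!")
sab_terms, other_tokens = extract_sab(tokens)
```

Here "filtering" is read as identifying SAB terms for later treatment rather than discarding them, which matches the paper's emphasis on analysing these terms instead of removing them.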

Formal definition of main tasks in SMAFED
We now present the formal definitions of the main tasks of our approach to event detection as follows:
Definition T1 (Data Streams Collection) A stream $S = e_1, e_2, \ldots, e_n$ is an ordered sequence of objects or points, where $e_i$ indicates the $i$th object or point observed by the algorithm. For $t > 0$, let $S(t)$ symbolize the first $t$ entries of the stream: $e_1, e_2, \ldots, e_t$. For $0 < i \le j$, let $S(i, j)$ designate the substream $e_i, e_{i+1}, \ldots, e_j$. Define $S = S(1, n)$ to be the whole stream observed until $e_n$, where $n$ is, as before, the total number of objects or points observed so far.
Each SAB term $W_i$ has one or more possible senses. Let the number of senses of the SAB term be represented as $|W_i|$. Each possible combination of senses for the SAB terms in the context window will be evaluated. There are $\prod_{i=1}^{N} |W_i|$ such combinations, each of which is referred to as a candidate combination. A combination score is computed for each candidate combination. The target SAB term is assigned the sense of the candidate combination that attains the maximum score.
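The enumeration of candidate combinations can be sketched as follows. This is a minimal illustration, assuming a toy sense inventory (`SENSES`) and a simple gloss-overlap score; the scoring function actually used by SMAFED is not specified in this passage, so `combination_score` is a stand-in and all names are hypothetical.

```python
from itertools import combinations, product
from math import prod

# Hypothetical sense inventory (assumption): SAB term -> candidate sense glosses.
SENSES = {
    "gov": ["state governor elected official", "government public administration"],
    "pres": ["president elected head of state", "presentation slide talk"],
}


def count_candidates(window, inventory):
    """Number of candidate combinations: the product of |W_i| over the window."""
    return prod(len(inventory[term]) for term in window)


def combination_score(senses):
    """Toy score (assumption): words shared between each pair of sense glosses."""
    return sum(
        len(set(a.split()) & set(b.split()))
        for a, b in combinations(senses, 2)
    )


def disambiguate(window, inventory):
    """Assign each SAB term the sense from the maximum-scoring combination."""
    candidates = product(*(inventory[term] for term in window))
    best = max(candidates, key=combination_score)
    return dict(zip(window, best))


window = ["gov", "pres"]
assignment = disambiguate(window, SENSES)
```

Because the number of combinations grows multiplicatively with the window size, an exhaustive search like this is only practical for short context windows.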

Definition T6 (Semantic Tweets Representation)
Given a context (source) embedding $v_w$ and a target embedding $u_w$ for each word $w$ in the vocabulary $V$, with embedding dimension $h$ and $k = |V|$, the tweet embedding is the average of the source embeddings of its constituent words, augmented by learned n-grams. The tweet embedding $v_S$ for the current tweet $S$ is modelled as:
$$v_S = \frac{1}{|R(S)|} \sum_{w \in R(S)} v_w$$
where $R(S)$ designates the list of n-grams, including unigrams, present in sentence $S$.
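The averaging over $R(S)$ can be illustrated with a small sketch. The embedding table `EMB`, the bigram limit, and the choice to skip n-grams without an embedding are assumptions made for illustration; learned embeddings would come from a trained model.

```python
def ngrams(tokens, n_max=2):
    """R(S): all n-grams of the token list, from unigrams up to n_max-grams."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tweet_embedding(tokens, source_embeddings, n_max=2):
    """Average the source embeddings of the n-grams of a tweet.

    Assumption: n-grams missing from the embedding table are skipped.
    """
    grams = [g for g in ngrams(tokens, n_max) if g in source_embeddings]
    if not grams:
        raise ValueError("no n-gram of the tweet has an embedding")
    dim = len(source_embeddings[grams[0]])
    return [
        sum(source_embeddings[g][d] for g in grams) / len(grams)
        for d in range(dim)
    ]


# Hypothetical 2-dimensional embeddings (h = 2), for illustration only.
EMB = {
    "fuel": [1.0, 0.0],
    "scarcity": [0.0, 1.0],
    "fuel scarcity": [1.0, 1.0],
}
vector = tweet_embedding(["fuel", "scarcity"], EMB)
```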
Definition T7 (Semantic Similarity among tweets) Assume that tweet $x$ has $m$ words $x_1, x_2, \ldots, x_m$ and tweet $y$ has $n$ words $y_1, y_2, \ldots, y_n$. The semantic similarity matrix (SSM) for the two tweets $x$ and $y$ is given as:
$$SSM(x, y) = \begin{bmatrix} sim(x_1, y_1) & \cdots & sim(x_1, y_n) \\ sim(x_2, y_1) & \cdots & sim(x_2, y_n) \\ \vdots & \ddots & \vdots \\ sim(x_m, y_1) & \cdots & sim(x_m, y_n) \end{bmatrix}$$
The semantic similarity between word $x_s$ and tweet $y$ is given as follows: The semantic similarity between tweets $x$ and $y$ is calculated as: The semantic relatedness of $x_i$ and $y_j$ is calculated by comparing glosses of synsets related to $x_i$ and $y_j$ through explicit relationships of IKB.
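A small sketch of how a semantic similarity matrix could be built and aggregated is given below. The word-level measure `toy_sim` and the relatedness set `RELATED` are hypothetical placeholders for the gloss-based relatedness over the IKB, and the two aggregation steps (best match per word, then the mean of all best matches in both directions) are assumptions for illustration, not the paper's formulas.

```python
# Hypothetical relatedness pairs (assumption), standing in for IKB relationships.
RELATED = {frozenset(("fuel", "petrol"))}


def toy_sim(a, b):
    """Toy word similarity: 1.0 if identical, 0.5 if listed as related, else 0.0."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in RELATED:
        return 0.5
    return 0.0


def semantic_similarity_matrix(x_words, y_words, sim=toy_sim):
    """m x n matrix whose entry (i, j) is sim(x_i, y_j)."""
    return [[sim(xi, yj) for yj in y_words] for xi in x_words]


def tweet_similarity(x_words, y_words, sim=toy_sim):
    """Assumed aggregation: mean of each word's best match, in both directions."""
    ssm = semantic_similarity_matrix(x_words, y_words, sim)
    row_best = [max(row) for row in ssm]        # best match for each word of x
    col_best = [max(col) for col in zip(*ssm)]  # best match for each word of y
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


x = ["fuel", "queue"]
y = ["petrol", "queue", "lagos"]
ssm = semantic_similarity_matrix(x, y)
score = tweet_similarity(x, y)
```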

Definition T8 (Grouping of semantically similar tweets into clusters)
Given the assumption of the distribution of data stream uploads belonging to an event, an incremental clustering algorithm can be defined as follows: First, a small window $C_1$ of the data stream of size $N$ is clustered, such that $|C_1| \le N$. As new data arrive, clustering is performed again on window $C_2$, with $|C_2| = 2 \cdot |C_1|$.
If certain clusters detected in the window $C_1$ are re-detected in $C_2$, then those clusters are considered "stable", and their items are removed from further clustering. For subsequent data stream clustering on window $C_3$, there is likely to be where $C_i$ is the set of stable clusters.
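The doubling-window idea can be sketched as below. This is an illustration under stated assumptions: a cluster counts as "re-detected" only when its membership is identical in two consecutive windows, and the clustering step is a toy grouping by a pre-assigned topic label (`cluster_by_topic`), whereas SMAFED clusters by semantic relatedness. All names are hypothetical.

```python
from collections import defaultdict


def cluster_by_topic(window):
    """Toy clustering (assumption): group (tweet_id, topic) pairs by topic."""
    groups = defaultdict(list)
    for tweet_id, topic in window:
        groups[topic].append((tweet_id, topic))
    return [group for group in groups.values() if len(group) >= 2]


def incremental_clustering(stream, first_window, cluster_fn):
    """Cluster windows of doubling size and collect the stable clusters."""
    if first_window < 1:
        raise ValueError("first_window must be at least 1")
    stable = []       # clusters detected in two consecutive windows
    settled = set()   # items already assigned to a stable cluster
    previous = set()  # clusters detected in the previous window
    size = first_window
    while True:
        # Items of stable clusters are excluded from further clustering.
        window = [item for item in stream[:size] if item not in settled]
        current = {frozenset(cluster) for cluster in cluster_fn(window)}
        for cluster in current & previous:
            stable.append(set(cluster))
            settled |= cluster
        previous = current
        if size >= len(stream):
            break
        size *= 2     # |C_{i+1}| = 2 * |C_i|
    return stable


stream = [
    ("t1", "flood"), ("t2", "flood"),
    ("t3", "strike"), ("t4", "strike"), ("t5", "strike"),
    ("t6", "match"), ("t7", "match"), ("t8", "match"),
]
stable_clusters = incremental_clustering(stream, 2, cluster_by_topic)
```

In this toy run the "flood" cluster is found unchanged in the first two windows and becomes stable, while the "strike" cluster keeps growing and is therefore not yet marked stable.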

Definition T9 (Event Cluster Ranking)
The importance/information richness of a cluster is based on the number of important words it contains. The weight of a cluster $C$, $W(C)$, is computed as follows:
where count(w) is the count of the word w in the input collection and count(w) is greater than a given threshold.

Definition T10 (Representative Event Selection)
A candidate for representation is selected based on the importance of its constituent words. Let the local importance of word $w$ be given as $\log(1 + CTF)$, where CTF is the cluster term frequency. Let the global importance be $\log(1 + CF)$, where CF is the cluster frequency. The importance of word $w$ is the average of the local and global importance, given as $Weight(w) = \alpha_1 \log(1 + CTF) + \alpha_2 \log(1 + CF)$, where $\alpha_1 = \alpha_2 = 0.5$ (constant). The tweet with $\max(Score(D_i))$ is selected as the event candidate for each of the clusters.
In summary, the formal definition of the research problem is specific to social media streams and can be stated as follows: Given a set of Twitter streams, $T$ is a set of tweets $T = \{t_1, t_2, \ldots, t_n\}$ and $T \in \mathbb{R}^{n \times t}$. The k-NN of a data point $t \in \mathbb{R}^{t}$ in $T$ is denoted as $N_k(t)$. We need to extract a set $\{t, N_k(t)\}$, where $0 \le k \le n$, based on the semantic distance between the data point $t$ and $N_k(t)$, $d(t, N_k(t)) = diff\,|t, N_k(t)|$, and then classify $\{t, N_k(t)\}$ as event $E = e_1, e_2, \ldots, e_m$, where $\{t, N_k(t)\} \ge$ threshold $\epsilon$.

High-level overview of SMAFED
We now present the high-level process view of SMAFED. The process workflow of SMAFED (see Fig. 2) is divided into four main steps described below.
Step 1: A user interface is built around the underlying API provided by Twitter, using the Python programming language, to collect tweets in English or Pidgin English originating from Nigeria. Python is chosen due to its efficiency and suitability for building high-traffic and data-heavy workflows. Collected tweets within each window period are stored in a queue. The collected tweets are passed to the pre-processing stage.
Step 2: From the data stream collected, URLs, tags, mentions and non-ASCII characters were automatically removed through the use of regular expressions. The next data preparation stage was to perform tokenisation and normalisation. This basic pre-processing reduces the number of features and addresses the problem of overfitting (Romero & Becker, 2019). After that, slang, acronyms, and abbreviations (SAB) are filtered from the tweets using corpora of English words in the natural language toolkit (NLTK). The filtered SAB terms are then passed to the local vocabulary (IKB) for further processing.
Fig. 2 A View of SMAFED Process Workflow
Step 3: Meanings of SAB are extracted from the IKB. Because several meanings are attached to each SAB, there is a need to disambiguate the ambiguous terms and select the best sense from the several meanings provided. This is done by leveraging the Slang, Acronym, and Abbreviation Disambiguation Algorithm (SABDA), based on the ambiguous SAB's context in the tweet. The choice of SABDA was informed by the peculiarity of the IKB, which provides a rich source of information and improves overall disambiguation accuracy. The data enrichment stage is then concluded with spelling correction (using Python's automatic spell-checker library) and emoticon replacement.
Step 4: The enriched tweets produced from the previous stage must be transformed to enable the clustering algorithm to build on them. The enriched tweets are transformed into a vectorial form using the sent2vec model. Sent2Vec provides a significant improvement over state-of-the-art supervised and unsupervised methods for sentence or paragraph embedding, as revealed in the literature [62]. The embedded tweets are clustered as they arrive using semantic histogram-based incremental clustering (SHC) [63]. The idea is to have clusters representing the same event as much as possible. SHC maintains high cohesiveness within clusters, implying a high distribution of similarities. This necessitates the choice of SHC. The tweets in each cluster are ranked based on the information richness of their constituent words. Lastly, the representative tweet that best describes each of the top n candidate event clusters is selected.

The conceptual architecture of SMAFED
The conceptual architecture defines the structure of components of the SMAFED. It consists of four modules: data collection, data pre-processing, data enrichment, and event detection, as presented in Fig. 3.

Data input layer
The data layer serves as the input layer of SMAFED. It is responsible for streaming tweets from Twitter to SMAFED for processing. A user interface enabled by a Twitter API is used for tweet streaming in JSON format. The input to SMAFED is data at rest stored either in the Comma Separated Value (CSV) or JavaScript Object Notation (JSON) format.

Data pre-processing layer
The data pre-processing layer comprises three sub-layers: data cleaner, data transformer, and data filter. The data cleaner handles the cleaning of responses fetched through the Twitter API: punctuation handling, elimination of repeated characters, and substitution. The data transformer and the data filter, both in the pre-processing layer of SMAFED, perform the feature extraction. The data transformer performs tokenisation and normalisation using the Python NLTK library. After that, the data filter extracts the SAB from the collected tweets using corpora of English words in the Python NLTK library. In other words, any normalised token that is not found in the corpora of English words is taken as slang, an abbreviation, an acronym, or an emoticon. The sifted SAB terms are transferred to the data enrichment layer for additional processing. The tweet being analysed and its SAB terms serve as input to the enrichment layer.
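The cleaner, transformer, and filter described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the SMAFED implementation: the regular expressions, the rule that squeezes runs of three or more repeated characters down to two, and the tiny `ENGLISH_WORDS` set are all hypothetical. In practice the vocabulary could be built from an NLTK word list such as `nltk.corpus.words.words()`.

```python
import re

# Assumed patterns for the items removed during cleaning.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # a character repeated three or more times

# Tiny stand-in for a corpus of English words (hypothetical).
ENGLISH_WORDS = {"i", "am", "a", "when", "it", "comes", "to", "this",
                 "profession", "so", "good"}


def clean(text):
    """Data cleaner: strip URLs, mentions, hashtags, non-ASCII, long repeats."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, NON_ASCII_RE):
        text = pattern.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)  # "soooo" -> "soo"
    return " ".join(text.split())        # collapse leftover whitespace


def tokenize(text):
    """Data transformer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_sab(tokens, vocabulary=ENGLISH_WORDS):
    """Data filter: tokens missing from the vocabulary are treated as SAB."""
    sab = []
    for token in tokens:
        if token not in vocabulary and token not in sab:
            sab.append(token)
    return sab


raw = "@user I am a baddo when it comes to this profession!!! https://t.co/x #work"
tokens = tokenize(clean(raw))
print(extract_sab(tokens))
```

Note that a vocabulary lookup of this kind also flags misspellings and proper nouns, which is one reason the later enrichment stages (disambiguation and spell checking) are needed.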

Data enrichment layer
This layer has three sub-components: IKB API, Disambiguator, and Spelling Checker. To better represent tweets, there is a need to provide meaning for the slang, acronyms, and abbreviations (SAB) found in tweets, because these noisy contents carry hidden meanings that form part of the rich context of tweets. The IKB component of SMAFED is a lexicon of SAB that stores, in MongoDB, all the contents of three knowledge sources: Naijalingo, Urban Dictionary, and Internet Slang. Naijalingo is an online Nigerian Pidgin English and slang reference that gives definitions of Nigerian words and expressions. Urban Dictionary is a publicly supported online reference for English slang words and expressions. Internet Slang is a reference containing a pool of slang terms, acronyms, and abbreviations used on online blogs, Twitter, chat rooms, SMS, and internet forums. The IKB includes about 2 million defined SAB terms together with their usage examples and related terms.
The Disambiguator, which is a sub-component of the IKB API, is responsible for the disambiguation of ambiguous SAB. The Slang, Acronym, and Abbreviation Disambiguation Algorithm (SABDA) determines the semantic sense associated with specific terms in the IKB. SABDA adapts the original Lesk algorithm [64] to disambiguate slang, acronyms, and abbreviations in social media content. Instead of looking at the glosses of the definitions of a term in WordNet, as the Lesk algorithm does, SABDA looks at the usage examples of SAB terms in the IKB. SABDA derives the proper sense in which noisy terms that are not available in WordNet (a database of the regular English lexicon) are used. Since the current context (i.e. the tweet being analysed) is similar to how the SAB term is used, measuring the overlap between the usage examples and the current context would produce a better result. SABDA measures the overlap between the usage examples of the SAB term, as defined in the IKB, and the usage context of the SAB term in the tweet that is being analysed. The usage example with the highest overlap is then mapped to its respective definition, which replaces the SAB term in the target tweet.
The last stage of the enrichment layer is to perform a spelling check on the tweet content. JamSpell version 1.0.0, a Python library, is an efficient and effective spell-checking tool. It considers word context for better correction (accuracy), can correct up to 5000 words/sec (speed), and is available for many languages (multi-language). The choice of JamSpell was informed by its better performance, in terms of speed and accuracy, when compared to other spell-check libraries such as Norvig, Hunspell, and Dummy.

The formal definition of the SABDA model
The formal definition of the SABDA model is as follows:
If $u = u_1, u_2, \ldots, u_n$ and $c = c_1, c_2, \ldots, c_n$ are the usage gloss and the context, respectively, we build their semantic representations $\vec{u}$ and $\vec{c}$ in the semantic space through the addition of the word vectors belonging to them:
$$\vec{u} = \sum_{i=1}^{n} \vec{u}_i, \qquad \vec{c} = \sum_{i=1}^{n} \vec{c}_i$$
The measure of relatedness between u and c is a measure of the similarities between u and c given as: where N is the number of pair relations in RELPAIRS, which is defined in a reflexive relation given as: where RELS is a set of relations. To choose the best $def_i$, map $\max(Relatedness(u, c))$ to the corresponding $def_i \in definition$.

The algorithm for disambiguation of SAB (SABDA)
The pseudocode of the Slang, Acronym, and Abbreviation Disambiguation Algorithm (SABDA) is presented in Algorithm 1.

Illustration of SABDA pseudocode
For clarity, the process of disambiguating and interpreting the noisy terms is illustrated in Example 1.
Example 1: Consider the case of disambiguating the term "baddo" in the tweet: "I am a baddo when it comes to this profession." Given the senses from the IKB shown in Table 6, pick the sense with the most word overlap between the context (the tweet in question, sjk) and the usage examples (usage_senses). The overlap between the context and the usage examples is shown in Table 7.
The tweet being considered, sjk, "I am a baddo when it comes to this profession", is compared with all the usage examples (usage_senses) in the IKB associated with the word "baddo". The score for every comparison (relatedness(sti,sjk)) is stored in an array (score). The comparison with the highest score is taken as the best score. Usage example 2, with 6 overlaps, is picked as the most appropriate sti. The best usage example is then mapped to its corresponding definition, so usage example 2 is mapped to the meaning of baddo 2. Consequently, the best sense for "baddo" in this context is "someone who is highly respected or seen as very good at what he/she does".
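The overlap-based selection in Example 1 can be sketched in a few lines. This is a simplified illustration, not Algorithm 1: the usage examples and the first definition below are hypothetical stand-ins for the IKB entries in Tables 6 and 7 (only the second definition is taken from the text), overlap is counted on distinct lowercase tokens, and the target term itself is excluded from the count, which is an assumption.

```python
import re


def tokens(text):
    """Lowercase word tokens of a text, as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def sabda(term, tweet, senses):
    """Return the definition whose usage example overlaps most with the tweet."""
    context = tokens(tweet) - {term}
    scores = [len(context & (tokens(sense["usage"]) - {term}))
              for sense in senses]
    best_index = max(range(len(senses)), key=scores.__getitem__)
    return senses[best_index]["definition"], scores


# Hypothetical IKB entries for "baddo".
ikb_baddo = [
    {"definition": "a notorious person",  # hypothetical definition
     "usage": "That guy is a baddo, avoid him."},
    {"definition": "someone who is highly respected or seen as very good "
                   "at what he/she does",
     "usage": "She is a baddo when it comes to coding."},
]

tweet = "I am a baddo when it comes to this profession."
definition, scores = sabda("baddo", tweet, ikb_baddo)
enriched = tweet.replace("baddo", definition)  # definition replaces the SAB term
print(scores)
print(enriched)
```

Counting raw word overlap gives function words such as "a" and "to" the same influence as content words; a stop-word list or the vector-based relatedness from the formal definition would reduce that effect.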

Event detection layer
The event detection layer has four components: Embedder, Event Clusterer, Event Ranker, and Event Summarizer.

The embedder
The embedder converts the enriched tweets into vector form. This is done using a language model called sent2vec, developed by [63]. The model uses an unsupervised objective to train distributed representations of phrases/sentences. Words that are not found in the dictionary of the model are represented as zero vectors, which implies that such words make no contribution to the mean vector.
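The effect of representing out-of-vocabulary words as zero vectors can be seen in a small sketch. The two-dimensional `toy_vectors` table and the `embed_tweet` helper are hypothetical and do not use the sent2vec library; only unigrams are averaged here, and whether unknown words still count in the denominator of the mean is an assumption of this sketch.

```python
def embed_tweet(tokens, vectors, dim):
    """Average the word vectors of a tweet; unknown words add a zero vector."""
    total = [0.0] * dim
    for token in tokens:
        vector = vectors.get(token, [0.0] * dim)  # out-of-vocabulary -> zeros
        total = [t + v for t, v in zip(total, vector)]
    count = max(len(tokens), 1)  # avoid division by zero for empty tweets
    return [t / count for t in total]


# Hypothetical 2-dimensional embeddings for two known words.
toy_vectors = {"fuel": [1.0, 0.0], "scarcity": [0.0, 1.0]}

embedding = embed_tweet(["fuel", "scarcity", "xyzzy"], toy_vectors, dim=2)
print(embedding)
```

The unknown token adds nothing to the summed direction of the embedding, which is why resolving SAB terms into dictionary words before embedding matters for the downstream clustering.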

The Event clusterer
The Event Clusterer of the Event Detection Layer in SMAFED performs the incremental grouping of the embedded tweets into event bins using semantic histogram-based incremental clustering [64]. Semantic histogram-based incremental clustering is a dynamic incremental method of building clusters that makes use of the semantic histogram concept to maintain a high degree of cluster coherency. New tweets are compared with each event cluster histogram to maintain the incremental creation of coherent clusters. If the addition of a new tweet would largely degrade the distribution, the tweet is not added; otherwise, it is added. The quality of event cluster cohesiveness (semantic histogram) is measured by the ratio of the similarity count above a certain similarity threshold to the total similarity count. The higher the semantic histogram ratio, the greater the cluster cohesiveness.
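The histogram-ratio test described above can be sketched as follows. This is an illustrative simplification rather than Algorithm 2: cosine similarity stands in for the semantic similarity measure, and the parameter names and values (`sim_threshold`, `min_ratio`, `epsilon`) are assumptions chosen for the toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def histogram_ratio(sims, sim_threshold):
    """Share of pairwise similarities at or above the threshold."""
    if not sims:
        return 1.0  # a single-member cluster is treated as fully cohesive
    return sum(s >= sim_threshold for s in sims) / len(sims)


class Cluster:
    def __init__(self, first_vector):
        self.members = [first_vector]
        self.sims = []  # all pairwise similarities seen so far


def assign(vector, clusters, sim_threshold=0.5, min_ratio=0.7, epsilon=0.1):
    """Add the vector to the best cluster it does not degrade, else open one."""
    best, best_ratio, best_sims = None, -1.0, None
    for cluster in clusters:
        new_sims = [cosine(vector, member) for member in cluster.members]
        old_ratio = histogram_ratio(cluster.sims, sim_threshold)
        new_ratio = histogram_ratio(cluster.sims + new_sims, sim_threshold)
        acceptable = new_ratio >= min_ratio and new_ratio >= old_ratio - epsilon
        if acceptable and new_ratio > best_ratio:
            best, best_ratio, best_sims = cluster, new_ratio, new_sims
    if best is None:
        best = Cluster(vector)  # no existing cluster fits: start a new one
        clusters.append(best)
    else:
        best.members.append(vector)
        best.sims.extend(best_sims)
    return best


clusters = []
for tweet_vector in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]):
    assign(tweet_vector, clusters)
print([len(cluster.members) for cluster in clusters])
```

Opening a new cluster whenever no existing histogram can absorb the tweet is also what lets the clustering follow new topics as they appear in the stream.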

The Event clusterer algorithm
The event clusterer algorithm (Semantic Histogram-based Incremental Clustering) is presented in Algorithm 2. The computation of semantic similarities between tweets is based on how they are semantically represented. The Sent2vec model is used to obtain the semantic representation of tweets. When a new semantic similarity value between two tweets is determined, it increments the similarity count inside the bin (cluster) where such similarity is found. To add a new tweet, the tweet is compared against the semantic histogram of each cluster. If adding it would degrade the distribution, it is not added; otherwise, it is added. At this stage, the issue of concept drift is implicitly taken care of because, when an arriving tweet does not fit into the existing clusters, a new cluster is created for it.

Kolajo et al. Journal of Big Data (2022) 9:90

The Event ranker
The Event Ranking component of SMAFED orders the contents in each cluster based on the importance of the constituent words. Since the event clustering is unsupervised and the number of clusters is not known in advance, it is necessary to determine the clusters that would contribute to the representative summary. In other words, the importance or information richness of a cluster is based on the number of important words it contains.

The Event ranker algorithm
The event ranker algorithm is implemented (as in definition T9) using Algorithm 3. The focus of the event ranker algorithm is to determine which of the detected event clusters are actual events. For each of the clusters, the weight is computed by summing up the weights of all the important words in the event cluster. The event clusters are then sorted in descending order based on their weights. Any event cluster whose weight is greater than a given threshold is considered an event.
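A minimal sketch of this ranking step is given below. It is not Algorithm 3: it assumes that the weight of an important word is its count in the whole input collection, that each distinct important word is counted once per cluster, and that the names `count_threshold` and `weight_threshold` and the toy clusters are hypothetical.

```python
from collections import Counter


def rank_event_clusters(clusters, count_threshold, weight_threshold):
    """Weigh each cluster by its important words and rank in descending order."""
    # Count every word over the whole input collection.
    counts = Counter(token
                     for tweets in clusters.values()
                     for tweet in tweets
                     for token in tweet)
    # A word is "important" when its count exceeds the count threshold.
    important = {word: n for word, n in counts.items() if n > count_threshold}

    weights = {}
    for cluster_id, tweets in clusters.items():
        words = {token for tweet in tweets for token in tweet}
        weights[cluster_id] = sum(important.get(word, 0) for word in words)

    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    events = [(cid, weight) for cid, weight in ranked if weight > weight_threshold]
    return ranked, events


# Hypothetical tokenised clusters.
toy_clusters = {
    "c1": [["fuel", "scarcity", "lagos"], ["fuel", "queue", "lagos"]],
    "c2": [["happy", "birthday"]],
}
ranked, events = rank_event_clusters(toy_clusters, count_threshold=1,
                                     weight_threshold=3)
print(ranked)
print(events)
```

Ranking by word importance rather than by the number of tweets means that a large but uninformative cluster does not automatically qualify as an event.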

The Event summarizer
The Event Summarizer component of SMAFED finds a suitable representative summary of the candidate event cluster, with a coherent and fluent summary, using the extractive summary approach. Ideally, candidate event clusters should have tweets that belong to the same event, but there is a need to find a representative that can represent individual clusters. The tweet with the highest score based on its local and global importance is selected. The local significance of a word found in each tweet shows how much contribution the word makes to the central tweet concept. The global importance corresponds to the word's contribution to the formation of subtopics spread over the cluster of tweets.

The Event summarizer algorithm
Algorithm 4 presents the event summary based on definition T10. The event summarizer algorithm looks at each tweet found in each event cluster to find which of the tweets in the cluster can serve as its representative. In other words, which of the tweets in each event cluster can we pick and use as a summary of all the tweets in that cluster? The algorithm answers this question by counting the frequency of each important word in a tweet and in the cluster in which the tweet appears. After that, the average of the local and global importance is computed. This computation is done for each important word in the tweet, and the sum is taken as the weight of the tweet in the event cluster. This is how the weight of every tweet in the event cluster is computed. The tweet with the highest score is taken as the summary of the event cluster.
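The weighting in definition T10 can be sketched as below. This is an illustration under assumptions, not Algorithm 4: the cluster term frequency is taken to be the count of a word inside the cluster, the cluster frequency is supplied as a hypothetical dictionary giving the number of clusters that contain the word, every word is treated as important, and each distinct word in a tweet is scored once.

```python
import math
from collections import Counter


def word_weight(ctf, cf, alpha1=0.5, alpha2=0.5):
    """Weight(w) = alpha1 * log(1 + CTF) + alpha2 * log(1 + CF)."""
    return alpha1 * math.log(1 + ctf) + alpha2 * math.log(1 + cf)


def summarize_cluster(tweets, cluster_frequency):
    """Return the tweet with the highest summed word weight."""
    ctf = Counter(token for tweet in tweets for token in tweet)

    def score(tweet):
        return sum(word_weight(ctf[word], cluster_frequency.get(word, 0))
                   for word in set(tweet))

    return max(tweets, key=score)


# Hypothetical tokenised cluster and cluster frequencies.
toy_cluster = [["fuel", "scarcity", "lagos"], ["fuel", "queue"], ["lagos"]]
toy_cluster_frequency = {"fuel": 1, "scarcity": 1, "lagos": 2, "queue": 1}

representative = summarize_cluster(toy_cluster, toy_cluster_frequency)
print(" ".join(representative))
```

The logarithm dampens the influence of very frequent words, so a tweet is favoured for covering several important words rather than for repeating one of them.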

Evaluation experiment
In this section, we report the evaluation of the SMAFED framework. The evaluation was divided into two parts. The first part gives a detailed summary of the impact of the data enrichment layer of SMAFED, which focused on semantic analysis of SAB compared to when there is no treatment of SAB. The second part focuses on the performance of SMAFED when used for event detection from social media streams.

Experiment I: impact of the data enrichment layer of SMAFED
SMAFED was evaluated by benchmarking it with the General Social Media Feed Preprocessing Method (GSMFPM) to determine the impact of the enrichment layer of SMAFED. The difference between GSMFPM and SMAFED is highlighted in Table 8.

Dataset description
The two datasets used for the first experiment were the Twitter sentiment analysis training corpus and Naija-tweets. A summary of the two datasets is presented in Table 9.

Feature extraction and representation
We extracted two types of features, namely, unigram and bigram, from the datasets. The summary of the feature extraction and representation is shown in Table 10. Global Vectors for Word Representation (GloVe) was used for the feature extraction. GloVe is an unsupervised learning algorithm that obtains vector representations of words from the word-word co-occurrence statistics of a corpus. This results in representations that showcase interesting linear structures of the word vector space. GloVe is a log-bilinear model with a weighted least-squares objective, which combines the features of the local context window and global matrix factorization methods. The underlying intuition of the model is that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.

Classifiers
We used supervised learning techniques for text classification, namely, multilayer perceptron (MLP) and convolutional neural networks (CNN). MLP trains on input-output pairs and models the dependencies between the inputs and outputs. CNN is a deep learning architecture model that aims to learn higher-order features present in data through convolutions.

Experiment I: result and discussion
The proposed SMAFED was benchmarked with the General Social Media Feed Preprocessing Method (GSMFPM) by testing their impact on two classifiers. The essence of assessing the impact of the general pre-processing method (GSMFPM) and the proposed SMAFED on the classifiers was to determine whether analysing SAB terms and resolving ambiguity in SAB in social media streams can affect event detection results. To do this, we compared the cross-entropy loss function of the classifiers (MLP and CNN) when GSMFPM and SMAFED were used. The comparison based on the loss function indicates how accurately a classifier predicts the expected outcome. The cross-entropy results for sentiment classification of the Twitter Sentiment Analysis Training Corpus for the Multilayer Perceptron and the Convolutional Neural Network, with five epochs and eight epochs, respectively, are shown in Tables 11 and 12. The corresponding results for the Naija-Tweets dataset are presented in Tables 13 and 14.
Table 11 presents the Multilayer Perceptron cross-entropy loss function for GSMFPM and SMAFED on the Twitter Sentiment Analysis Training Corpus based on Unigram, Bigram, and Unigram + Bigram. The table shows that SMAFED outperformed GSMFPM with respect to matching between the predicted and the actual sentiment by returning lower loss function values. It should also be noted that the lowest loss function in each of the unigram, bigram, and unigram + bigram settings of both approaches was obtained at epoch_5, indicating that the performance of the classifier improved as the number of epochs increased.
The Cross-Entropy Loss Function of CNN with kernel size = 3 and one to four convolution layers, using eight epochs, for SMAFED compared with GSMFPM on the Twitter Sentiment Analysis Training Corpus is presented in Table 12. From the table, it can be deduced that the pre-processing coupled with the data enrichment components of SMAFED outperformed GSMFPM in matching the predicted and the actual sentiment. It should also be noted that the cross-entropy loss of the first layer of CNN for both approaches is lower than that of the other layers.
The Cross-Entropy Loss Function of the Multilayer Perceptron with Unigram, Bigram, and Unigram + Bigram features, using five epochs, for SMAFED compared with GSMFPM on the Naija-Tweets dataset is presented in Table 13. The table shows that SMAFED outperformed GSMFPM with respect to matching between the predicted and the actual sentiment. As noted with the Twitter Sentiment Analysis Training Corpus, the lowest loss function in the unigram, bigram, and unigram + bigram settings of both approaches on Naija-Tweets was obtained at epoch_5. The outcome of the experiment revealed that the proposed SMAFED performed better than the general pre-processing technique. It also shows an improvement in terms of accuracy (on the obtained results). This underscores the significance of using a local vocabulary in pre-processing social media feeds to disambiguate the noisy terms contained in social media feeds from a specific origin.
Table 14 presents the Cross-Entropy Loss Function of CNN with kernel size = 3 and one to four convolution layers, using eight epochs, for SMAFED compared with GSMFPM on the Naija-Tweets dataset. From the table, the pre-processing coupled with the data enrichment components of SMAFED outperformed GSMFPM with respect to matching the predicted and the actual sentiment. It should also be noted that the cross-entropy loss of the first layer of CNN for both approaches is lower than that of the other layers.

SMAFED efficiency
The performance of the SMAFED framework was assessed using run-time performance metrics [65] to measure the efficiency and practicability of the framework. We implemented the proposed event detection method using Python (v 3.7). We used Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz processor for testing, with 12 GB RAM and a 64-bit Windows 10 operating system. The framework was deployed on the cloud using a 4 GB Docker Droplet hosted on DigitalOcean services.
The tweets used for the prototype implementation were sourced from the Nigeria location. It was found that the average number of tweets from Nigeria per minute is 45 without the application of a filter. This shows that the tweet volume from Nigeria is very small compared to an overall average of 350,000 tweets per minute. The part of the processing that took the longest time was spell checking. Before the clustering stage, it took 5 s to pre-process and enrich 40 tweets, so the average processing time for each tweet is about 0.125 s. This is well within the limits needed to manage the estimated average of about 1 tweet per second from Nigeria. In the improbable circumstance where all tweets are recognized as events, the framework would be able to handle eight times the normal volume of tweets originating from Nigeria. With SMAFED, the life span of an event cluster is 4 days, after which it is deleted. This assumption was made because the potential value of data lies in its freshness. A lifespan of 4 days for an event cluster was chosen so as to: 1) avoid clogging the memory, 2) maintain a limited number of clusters in memory, and 3) limit the number of comparisons to be made. SMAFED efficiency in terms of tweet streaming from Nigeria and pre-processing is depicted in Fig. 4.
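The throughput figures quoted above follow from simple arithmetic, reproduced here as a short check. The function names are hypothetical; the input values (5 s for 40 tweets, 45 tweets per minute) are the ones reported in the text, and the baseline of one tweet per second is the rounded estimate, which is an assumption of this sketch.

```python
def seconds_per_tweet(batch_seconds, batch_size):
    """Average pre-processing and enrichment time for one tweet."""
    return batch_seconds / batch_size


def capacity_per_second(per_tweet_seconds):
    """Number of tweets that can be processed in one second."""
    return 1.0 / per_tweet_seconds


per_tweet = seconds_per_tweet(5.0, 40)      # reported: 5 s for 40 tweets
capacity = capacity_per_second(per_tweet)   # tweets per second
observed_rate = 45 / 60                     # reported: 45 tweets per minute
headroom = capacity / 1.0                   # relative to about 1 tweet per second

print(per_tweet, capacity, observed_rate, headroom)
```

Measured against the observed rate of 0.75 tweets per second rather than the rounded estimate, the margin is slightly larger than eightfold.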

Experiment II: accuracy of SMAFED
This experiment aims to determine how well SMAFED can detect events in social media streams. Three metrics, namely Precision, Recall and F-measure, were used to benchmark SMAFED against other existing frameworks. These metrics are standard assessment measures for event detection techniques [1]. Precision refers to the proportion of detected events that are actual events. Recall refers to the proportion of actual events that the framework can identify, and F-measure is the harmonic mean of Precision and Recall. The formulae for the metrics used to ascertain event detection accuracy are given as follows:
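As a quick consistency check on the harmonic-mean relation, the sketch below recomputes the F-measure from the Precision and Recall values reported for SMAFED. The `precision` and `recall` helpers use the standard count-based definitions, which are assumed here rather than taken from the paper's own formulae.

```python
def precision(true_positives, false_positives):
    """Share of detected events that are actual events."""
    detected = true_positives + false_positives
    return true_positives / detected if detected else 0.0


def recall(true_positives, false_negatives):
    """Share of actual events that were detected."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Reported SMAFED scores: Precision = 0.922, Recall = 0.793.
print(round(f_measure(0.922, 0.793), 3))  # 0.853
```

The recomputed value agrees with the reported F-Measure of 0.853.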

Dataset description
To evaluate SMAFED, tweet IDs and the event relevance judgments provided by [66] (which are made available in line with Twitter's terms of service) were used to obtain the tweet dataset. This was done using Tweepy with the Twitter REST API. Not all of the 152,900 relevant tweets could be extracted because some user accounts had been deleted and were no longer accessible. A total of 82,887 labelled tweets and an additional 142,652 irrelevant tweets were collected. The distribution of the final dataset used by SMAFED for evaluation is presented in Table 15.

Experiment II: result and discussion
The section presents the results of the enrichment layer of SMAFED and the evaluation of SMAFED as benchmarked with existing frameworks in terms of accuracy. The final result of the enrichment stage is shown in Fig. 5.
After the data enrichment stage, with a typical result presented in Fig. 5, there is event clustering, ranking and summarisation. The enriched tweets from the enrichment layer are used as input to the event detection module to detect events from tweets. This module has four sub-modules: embedding, clustering, ranking, and summarisation. The Sent2vec model vectorizes the cleaned and enriched tweets. The Sent2vec model is wrapped in the class Sent2VecWrapper and has a method, "vectorize_sentences", which returns an array of vectorized sentences. The model is downloaded from the cloud service, DigitalOcean. Vectorized tweets are clustered with the semantic histogram clustering algorithm. The event ranking phase follows this. This task is implemented in the "Ranker" class. The last stage of event detection is the event summary, which involves cluster summary computation. A sample result of the event detection stage, which includes event clustering, ranking and summarisation, is presented in Table 16.
The section also presents a report on the accuracy of event detection by SMAFED as compared with Locality Sensitive Hashing (LSH), Cluster Summary (CS), the Entity-based approach and the Repp framework. The difference between SMAFED and the four event detection approaches is presented in Table 17.
The results include the Precision, Recall, and F-measure for each approach. The result is presented in Table 18 and depicted by the line graph in Fig. 6. Table 18 shows the results of four different approaches compared with SMAFED. A cluster is considered a candidate event for the two baselines (LSH and CS) if it contains more than 30 tweets. The entropy-based method used 75 and above tweets with the best run, [17] used 10+ tweets with a mean over 20 runs, and SMAFED considered clusters with a weight above a threshold of 100. Out of the 120 million tweets available in Event2012, [17] discovered 152,900 tweets that can be considered event tweets. Instead of using the 120 million unlabelled tweets, SMAFED focused on the relevant tweets (150,000+) used as ground truth for benchmarking purposes. However, since tweets may be deleted or users can delete their own Twitter accounts, making the tweets unavailable, only 82,887 of the 152,900 relevant tweets could be retrieved. An additional 142,652 irrelevant tweets were collected from the pool of irrelevant tweets in the 120 million tweets. All four approaches compared with SMAFED, except for the framework proposed by [15] (LSH), used the number of tweets in a cluster as the criterion for it to be considered a relevant event cluster. This method does not perform well, as clusters with more tweets may be less informative [18,67]. The ranking algorithm used in SMAFED focused on the important constituents of each cluster. For a cluster to be considered an event, it must have a weight > 100. This weight was chosen after testing different weight thresholds (50, 100, 150), bearing in mind the number of event clusters in the ground truth dataset. None of the existing approaches compared with SMAFED in this study used a tweet representative to summarize clusters. The baseline approaches reported in [66], along with [17], used crowdsourcing for categorization.
Table 17 Comparison of SMAFED against other frameworks in terms of scope, coverage and approach
SMAFED used a weighting approach for each tweet in a cluster to determine a representative tweet. The tweet with the highest weight in each event cluster is selected as a tweet representative. Choosing a tweet representative in each event cluster gives a quick summary which can be used for reporting what is happening.
SMAFED can adapt to changes in the social media streams, as it does not use any restriction in streaming tweets except eliminating retweets and tweets with more than three hashtags and/or more than two URLs. This was done to eliminate spam tweets, which may adversely affect the event detection process. This differs from the approach used by [18], in which an already built classifier filters tweets of interest from social media streams during data collection; such an approach cannot adapt to changes in the social media streams because it requires continuous updating of the classifier to incorporate those changes. Even though Twitter has been recognized as an important resource for event detection, the scarcity of publicly available datasets, coupled with different parameter settings, makes it difficult to judge against existing approaches [68]. Since the Event2012 Twitter dataset is the only publicly available event detection dataset with annotation, it was used as the ground-truth dataset for evaluating SMAFED. The result of the evaluation was compared with existing works that made use of the same dataset to evaluate event detection. The four existing works that were compared with SMAFED include the two baselines used to generate the collection of the Event2012 Twitter dataset (Locality Sensitive Hashing (LSH) and Cluster Summary (CS), proposed by [15] and [16], respectively), the entity-based approach proposed by [17], and a framework for event detection proposed by [18].
From Fig. 6, it can be deduced that SMAFED performed better than the existing event detection approaches. SMAFED also has the highest value for F-measure compared to existing methods for event detection. This indicates that both Precision and Recall are reasonably high and that SMAFED has an excellent ability to detect events in social media streams. Comparing event detection approaches may seem difficult in the sense that there are slight differences in the parameters or criteria, as pointed out earlier. SMAFED was measured closely against the best event detection framework (amongst the four event detection approaches SMAFED was compared with), proposed by Repp [18]. Even though the number of tweets (categorized as irrelevant) from the Event2012 Twitter dataset added to the available 82,887 tweets (relevant) was more than that of Repp's framework, SMAFED still performed better.

Conclusion and further work
In this paper, a Social Media Analysis Framework for Event Detection (SMAFED) that can analyse the rich but hidden knowledge in social media streams to improve the accuracy of event detection was presented. SMAFED, as proposed in this paper, serves as an improvement on the existing event detection approaches, with better metric scores in terms of Precision (0.922), Recall (0.793) and F-Measure (0.853). In addition, an evaluation experiment was carried out to determine the impact of the data enrichment layer by benchmarking SMAFED with GSMFPM. The cross-entropy results for sentiment classification of the Twitter sentiment analysis training corpus and the Naija-Tweets dataset for the Multilayer Perceptron and the Convolutional Neural Network, with five epochs and eight epochs, respectively, showed that SMAFED outperformed GSMFPM. This paper contributes to big data analytics research, particularly event detection in social media streams. More precisely, it caters for the observed limitations of existing event detection approaches by (1) performing semantic analysis of SAB terms along with the ambiguity in their usage, which leads to better comprehension and interpretation of the noisy terms in social media streams; (2) evolving SABDA to disambiguate ambiguous SAB terms; and (3) creating an integrated knowledge base to facilitate semantic analysis of noisy terms in social media streams.
In this paper, SMAFED used only social media stream texts for event detection in social media streams. The integration of images and correlated text from social media streams will further strengthen the event detection result. While Twitter is a well-known research data source, exploring and/or combining it with other social media sources will lead to more events being detected and to the harmonisation of event detection results. This is still open to further research, as few approaches have exploited this medium.

ATSED
Automatic Targeted