Emotion AWARE: an artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis

Gamage, Gihan; De Silva, Daswin; Mills, Nishan; Alahakoon, Damminda; Manic, Milos

doi:10.1186/s40537-024-00953-2

Methodology
Open access
Published: 10 July 2024

Gihan Gamage¹,
Daswin De Silva¹,
Nishan Mills¹,
Damminda Alahakoon¹ &
…
Milos Manic²

Journal of Big Data volume 11, Article number: 93 (2024)

Abstract

Emotions are fundamental to human behaviour. How we feel, individually and collectively, determines how humanity evolves and advances into our shared future. The rapid digitalisation of our personal, social and professional lives means we are frequently using digital media to express, understand and respond to emotions. Although recent developments in Artificial Intelligence (AI) are able to analyse sentiment and detect emotions, they are not effective at comprehending the complexity and ambiguity of digital emotion expressions in knowledge-focused activities of customers, people, and organizations. In this paper, we address this challenge by proposing a novel AI framework for the adaptable, robust, and explainable detection of multi-granular assembles of emotions. This framework consolidates lexicon generation and finetuned Large Language Model (LLM) approaches to formulate multi-granular assembles of two, eight and fourteen emotions. The framework is robust to ambiguous emotion expressions that are implied in conversation, adaptable to domain-specific emotion semantics, and the assembles are explainable using constituent terms and intensity. We conducted nine empirical studies using datasets representing diverse human emotion behaviours. The results of these studies comprehensively demonstrate and evaluate the core capabilities of the framework, and show that it consistently outperforms state-of-the-art approaches in adaptable, robust, and explainable multi-granular emotion detection.

Introduction

The rapid digitalisation of society has empowered knowledge-focussed human activities and communication to transpire on hyper-connected, digital platforms. This spectrum of intrapersonal, interpersonal, and group activities has led to the generation and management of high volumes of big social data that represent patterns of behaviour of individuals and organizations, and how they leverage insights drawn from that information for further engagement and collaborative activities [1]. Expressions of emotion are encapsulated in these digital platforms, and they are highly useful for accurately modelling human behaviour [2]. The persistence of this textual digital record enables the use of computational approaches to process, analyse and synthesise emotion expressions. Computational approaches for emotion detection have been classified using several schemes in existing literature. Acheampong et al. [3] proposed three categories: rule-based, machine learning and hybrid methods. Alswaidan et al. [4] proposed a scheme of five categories: keyword-based, rule-based, classical learning, deep learning and hybrid. In reviewing these schemes, we have summarised them into three technical categories: (1) heuristics (which includes keywords, rule-based, probabilistic and statistical), (2) Artificial Intelligence (AI) (consisting of classical learning, machine reasoning and deep learning) and (3) hybrids of the two. Despite the maturity of this topic in terms of classification schemes and the prevalence of many approaches across these three classes, the complexity and ambiguity of emotion expressions on digital platforms have not been fully addressed. We substantiated this challenge of complexity and ambiguity in terms of four capabilities: (1) output (granularity of emotion detection output), (2) domain specificity, (3) adaptability, and (4) explainability.

We conducted a systematic literature review of the state-of-the-art of recent emotion analysis and detection research published in the last five years, from 2018 to 2022. The Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) flow diagram for this review is reported in Supplementary Fig. 1 (Filename: EmotionAwareSuppFig1.docx). The review produced 83 articles that aligned with the selection criteria, which we then evaluated in terms of the four capabilities noted above. Supplementary Table 1 (Filename: EmotionAwareSuppTable1.xlsx) presents the results of this evaluation.

Based on the findings of the literature review and the subsequent evaluation against capabilities, we propose a novel framework for Emotion Assembles With Adaptability Robustness and Explainability (AWARE). This Emotion AWARE framework intervolves heuristics and AI techniques with lexicon generation and finetuned Large Language Models (LLM) into a hetero-hierarchical structure that receives text containing emotion expressions as input and produces as output an assemble of emotions with corresponding intensity values. Emotion assembles can be created at three levels of granularity, two, eight and fourteen. The framework is adaptable as the hetero-hierarchical structure can be revised and reintroduced to reflect a domain or topic of interest. The framework is robust in its ability to detect implied emotion expressions through the context of surrounding terms as well as scale the intensity values based on negations, intensifiers, and inhibitors. The framework is explainable in its identification of terms and phrases for each emotion expression, leading up to a collection of terms that can be used to profile and compare multiple assembles.

In comparison to related work on emotion detection, the Emotion AWARE framework is novel in its construction of emotion assembles with intensity values, and the explainability, adaptability and robustness of these emotion assembles. In terms of approach, AWARE leverages prior knowledge of lexicons and learned knowledge of the finetuned language models, in contrast to the singular approaches adopted in related work, and it is the only approach evaluated on eight datasets (across studies). In terms of output, it produces multi-granular emotion assembles of 2, 8, and 14 emotions with intensity scores, in contrast to the class-based output produced by other methods. In terms of valence and arousal, the proposed framework detects valence across a broad spectrum of 14 emotion categories, and each category is assigned a score from 0 to 1. This scoring reflects arousal levels and is determined while taking modifiers and negations into consideration. All related methods in recent literature are limited to a specific domain or general application, whereas AWARE is intrinsically generic but can be adapted to a domain of interest. This feature is demonstrated in the experimental results (Studies 5 and 6). Explainability, adaptability and modifier resolution are similarly more advanced than those reported in existing literature, mainly due to the effectiveness of the hybrid approach of prior knowledge from lexicons and learned knowledge from finetuned language models.

Literature review

As noted above, we conducted a systematic literature review of the state-of-the-art research on emotion analysis published in the last five years, from 2018 to 2022. The PRISMA flow diagram and the evaluation of the selected work against the four capabilities are reported in Supplementary Fig. 1 (Filename: EmotionAwareSuppFig1.docx) and Supplementary Table 1 (Filename: EmotionAwareSuppTable1.xlsx), respectively. Here, we delineate key findings in terms of the three categories: heuristics, AI and hybrids.

Heuristic approaches include keyword recognition, rule-based logical/grammatical affinities, statistical and probabilistic methods. These methods are grounded in emotional lexicons, corpora and dictionaries that represent prior knowledge of how emotion is expressed in that domain or discipline. The emotion lexicon is typically a list of synonyms and related words used for each emotion category, where each word may also be assigned a fixed intensity value. Besides a list, the lexicon can also be organised hierarchically in a tree structure or interlinked as a graph or map structure. Emotion lexicons reported in the literature include Plutchik’s emotional terms [5], WordNet-Affect [6], EmoSenticNet [7], DepecheMood [8], and the SentiWordNet dictionaries [9]. Keyword recognition methods [10] rely on locating keywords representing emotions in a given text and assigning an emotion label based on these keyword counts and other statistics. These methods can be used for explicit emotion detection. For example, “their arrival made me happy” explicitly expresses the emotion happiness/joy with the keyword “happy”. But often emotions are not explicitly mentioned and can be negated or modified to give different or opposing interpretations than a keyword search method would suggest. In such cases more advanced heuristics are required. Rule-based approaches incorporate text processing methods such as tokenization, part-of-speech tagging, and dependency parsing along with corpora and lexicons to find the most effective rule sets for emotion detection [11, 12]. Several other approaches use lexical affinity with the support of lexicons to capture contextual and semantic relatedness to generate probabilistic values for each emotion category [13]. Furthermore, some approaches utilize dimensionality reduction and categorical feature extraction methods such as Latent Semantic Analysis (LSA) [14] and Probabilistic LSA [15] for improved emotion detection [16].
The use of lexicons enables domain adaptation in emotion detection as lexicons can be easily extended or altered to suit the target domain. Furthermore, these methods can be extended for emotion intensity calculation, negation and modifier detection as they can locate the keywords and evaluate the corresponding neighbourhood. However, a major drawback of all heuristic methods is that emotion expressions that are not specified in the lexicon and those that are implied or ambiguous are not detected. Due to these reasons, methods that are purely based on lexicons are not comparable to benchmark performance of AI based methods [3].
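
To make the keyword recognition idea and its limitation concrete, the following is a minimal sketch in Python. It is not taken from any of the cited methods: the three-category lexicon, the function names and the example sentences (other than the "happy" example above) are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; published lexicons are far larger.
TOY_LEXICON = {
    "joy": {"happy", "glad", "delighted"},
    "sadness": {"sad", "unhappy", "miserable"},
    "anger": {"angry", "furious", "annoyed"},
}


def keyword_emotions(text, lexicon=TOY_LEXICON):
    """Count lexicon keyword hits per emotion category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for emotion, words in lexicon.items():
        counts[emotion] = sum(1 for token in tokens if token in words)
    return counts


def top_emotion(text, lexicon=TOY_LEXICON):
    """Return the category with the most hits, or None when nothing matches."""
    counts = keyword_emotions(text, lexicon)
    emotion, hits = counts.most_common(1)[0]
    return emotion if hits > 0 else None


# Explicit expression: the keyword "happy" is found.
print(top_emotion("Their arrival made me happy"))
# Negated expression: plain keyword counting still reports joy.
print(top_emotion("Their arrival did not make me happy"))
# Implied expression: no lexicon term is present, so nothing is detected.
print(top_emotion("I could not stop smiling"))
```

The second and third calls illustrate the two drawbacks discussed above: a negation is ignored by pure keyword counting, and an implied emotion with no lexicon term goes undetected.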

AI-based methods can be subdivided into two groups: conventional supervised learning methods grounded in annotated datasets, and contemporary transfer learning methods that leverage pre-trained contextual language models. The conventional methods require large, labelled datasets where each sentence, paragraph or segment in the corpus is pre-assigned an emotion category (or label), typically by a human expert. This annotated dataset is used to train a multiclass classification model using supervised learning algorithms. Emotion classification and intensity calculation using XGBoost [17, 18], Support Vector Machines (SVM) [19, 20], Naïve Bayes (NB) [21, 22], k-Nearest Neighbor (kNN) [22] and Decision Trees [23, 24] are some prominent techniques reported in related literature. More recently, deep learning algorithms such as Long Short Term Memory (LSTM) networks [25, 26], Gated Recurrent Units (GRU) [27, 28] and Deep Neural Networks (DNN) [29, 30] have also been used in the same supervised learning context but with increased performance. Collectively, all supervised learning methods have reported accuracies in the range of 65–80% on benchmark datasets [3]. However, supervised learning methods are impeded by two major limitations: the scarcity of large, domain-independent labelled datasets and the challenge of ambiguous and implicit emotion expressions. More recent AI methods address these limitations by leveraging the semantic context of emotion expressions embedded in pre-trained language models. Unlike conventional supervised methods, these methods can be fine-tuned with smaller labelled datasets using transfer learning. Emotion extraction methods using variations of BERT [31,32,33], GPT [34, 35] and XLNet [36, 37] are examples that leverage the contextual knowledge embedded in language models. These approaches report state-of-the-art accuracies for emotion detection from benchmark datasets in the range of 75–99% [38].
However, this strength is also a weakness due to the limited generalisability across new, unforeseen emotion expressions, as well as intensifiers, inhibitors, and negations of emotion expressions, lack of explainability and constrained domain adaptation. Collectively, these limitations question the practical value of the high accuracies reported in empirical evaluation [39].

Several hybrid methods have also been proposed in the recent literature, combining heuristics with AI methods to improve accuracy and refine the emotion categories. Tzacheva et al. [20] proposed lexicon-based emotion annotation to train SVM classifiers for emotion extraction in tweets. Wu and Chuang [40] utilized a rule-based approach to extract semantics related to emotions and combined it with a lexicon ontology to extract emotions. Salim et al. [41] presented a self-supervised hybrid methodology for sentiment classification from unlabelled data that combines a machine learning classifier with a lexicon-based strategy. Li et al. [42] proposed a hybrid emotion detection system combining hand-crafted rules and a lexicon with a machine learning-based classifier to extract emotional levels in online blogs.

Collectively across all three categories, the practical value of these methods in the management of information and extraction of patterns of behaviour of individuals and organizations is vast. Large-scale analyses of social media during elections [43, 44], patient-centred care for chronic illnesses such as Alzheimer’s disease, cancer, and diabetes [45,46,47], real-time depression detection on social networks [48, 49], and expressions of emotion and sentiment during the COVID-19 global pandemic [50,51,52] highlight the practical value in social and individual settings. In organisational settings, financial sentiment analysis [53], understanding consumer satisfaction [54], the role of social media in stock price movements [55], and the influence of review credibility and review usefulness [56] are pivotal studies that signify the continuing and incremental value of emotion analysis in digitalised content for all stakeholders.

In concluding the literature review, we elaborate on the four capabilities and their potency in addressing the challenges of the complexity and ambiguity of digital emotion expressions in knowledge-focused activities. The first capability is the output of the emotion detection approach. In most cases, this is limited to an emotion label without an intensity score for that emotion. This emotion label is also limited to a single granularity which cannot be further analysed in terms of its constituents. Most approaches assign a single emotion per atomic unit of text (sentence, paragraph or document), and overlook the presence of multiple emotions. The second capability, domain specificity relates to the generalisability of the approach across diverse domains. Most approaches are highly specific to the syntax or semantics of a given domain, such as emotions in short text like tweets [57, 58], emotions in poetry [59, 60], emotions in code switched text [61, 62], and consumer reviews [63, 64]. These are developed using supervised learning and then evaluated using labelled custom datasets which further limit generalisability and its application in diverse domains. Despite the custom datasets, some methods can be adapted (or retrained) for a new application, which is the third capability of adaptability. In recent work that is based on language models and annotated datasets, this capability is limited due to the large number of parameters and the opacity of transformer-based learning. They cannot be adapted without a significant volume of work on configuration and finetuning which is equivalent to developing an entirely new approach. The fourth capability is explainability of the detected emotion which is becoming more important given our increasing dependence on AI and automation. Explainability has been overlooked in most approaches, mainly due to design limitations that have focused on producing emotion labels of singular granularity. 
We do not consider accuracy as a core capability as it can be configured (or tweaked) in the design phase as an offset between the availability of annotated datasets for supervised learning and the need for generalisability across multiple domains. A high-quality human-annotated dataset can be leveraged by a supervised learning approach to produce highly accurate emotion classifications. In summary, the granularity of emotion detection output, domain specificity, adaptability and explainability are the formative capabilities of the proposed method for addressing the complexity and ambiguity of emotion expressions.

Methods

As illustrated in Fig. 1, the Emotion AWARE framework consists of three modules: Module 1 (Emotion Language Model Finetuning), Module 2 (Emotion Lexicon Generation) and Module 3 (AWARE Core). The components depicted in grey are external sources feeding into the Emotion AWARE framework, where the general instances we have used in this study can be replaced with specialised instances depending on the domain of application (this is demonstrated in Studies 5 and 6 for the financial and technology sectors).

Module 1 begins with a state-of-the-art language model, such as BERT [65], which has been effectively applied to diverse NLP tasks such as Reading Comprehension [66, 67] and Natural Language Inference [68, 69]. State-of-the-art language models are pre-trained on large volumes of unlabelled data to generate deep contextualised word representations by considering syntax and semantics [70]. In application, these pre-trained models are finetuned using labelled datasets through transfer learning techniques. For this framework, we selected the DistilBERT [71] base-case model with the Huggingface [72] PyTorch implementation for the finetuning. As the finetuning dataset, we selected the Emotion dataset [73] due to its substantial size, granularity of emotions, and widespread acceptance in the research community. It contains 20,000 tweets labelled with six emotions: joy (33.5%), sadness (29.2%), anger (13.5%), fear (12.1%), love (8.2%), and surprise (3.6%). For the finetuning, we combined the train and validation sets, randomized them and selected a subset of 5653 samples: 1000 samples for each emotion except surprise, which had 653 samples. The finetuning settings were a default token length of 128, enabled by both padding and truncation, and a batch size of 64 with 8 epochs. At a learning rate of 0.00002 and weight decay of 0.01, the finetuning completed with an F1 score of 0.9394 for the test segment of the dataset. The finetuned language model is utilised by Module 2 for the expansion of a curated list of emotion seed words and in Module 3 for emotion embedding space generation. As noted earlier, DistilBERT can be replaced with any other language model that is closely aligned with the domain of interest.
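
As a reading aid, the reported finetuning settings can be collected into a small configuration object, and the size of the balanced subset can be checked arithmetically. The sketch below uses only the Python standard library and does not call the Huggingface API. The class name, field names and the step-count helper are illustrative assumptions, and the helper further assumes that the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    # Values are those reported in the text; field names are illustrative.
    max_token_length: int = 128
    batch_size: int = 64
    epochs: int = 8
    learning_rate: float = 2e-5
    weight_decay: float = 0.01


# Balanced subset: 1000 samples for five emotions, 653 for surprise.
SAMPLES_PER_EMOTION = {
    "joy": 1000, "sadness": 1000, "anger": 1000,
    "fear": 1000, "love": 1000, "surprise": 653,
}


def optimisation_steps(n_samples, cfg):
    """Total optimiser steps, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(n_samples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = FinetuneConfig()
n_samples = sum(SAMPLES_PER_EMOTION.values())
print(n_samples)                           # subset size used for finetuning
print(optimisation_steps(n_samples, cfg))  # steps under the stated assumption
```

Under these assumptions, the subset size sums to the 5653 samples stated above, and the run corresponds to 89 batches per epoch over 8 epochs.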

Module 2 begins with an emotion seed word list constructed and curated using a combination of automated and manual methods. In developing our emotion lexicon, we referenced Plutchik’s model [74], which identifies eight primary emotion classes, each further divided into three subcategories, resulting in a comprehensive 24-class system. Initially, seed keywords for each of these classes were manually curated from an online thesaurus [75]. However, we encountered a scarcity of unique terms for certain emotions, which necessitated the merging of closely related categories: joy and ecstasy, amazement and surprise, disgust and loathing, interest and vigilance, anger (rage, anger, annoyance), and fear (terror, fear, apprehension). As a result, we consolidated the model into 14 broader emotion classes, each supported by 15–20 thesaurus-derived terms.

While manually curating seed terms yielded high-quality initial seeds, the number of words was insufficient for comprehensive lexicon construction. Therefore, we utilized the vocabulary of the finetuned DistilBERT model itself: we extracted embeddings for each of our seed words and compared them with the raw embeddings of the model’s vocabulary terms to find contextually and emotionally similar words. However, due to the ambiguity of individual term embeddings, the relevance of these expanded terms was not highly consistent. To address this, we first clustered the seed words into four subgroups (with k = 4 set via the Elbow method [76]) using the constrained k-means algorithm [77] and then used the average embedding of each subgroup for the expansion. This process extended each subgroup to include the 25 most similar terms from the model’s vocabulary, aiming for a total of 100 terms for each of the 14 emotion classes. Subsequent refinement involved removing duplicates and terms conflicting with Plutchik’s polar opposites to improve the coherence of the lexicon.
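
The subgroup-based expansion step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the seed words have already been clustered into subgroups, it assumes cosine similarity as the similarity measure (the text does not name one), and the two-dimensional embeddings are hypothetical toy values standing in for DistilBERT vocabulary embeddings.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_subgroup(seed_words, embeddings, top_k=25):
    """Return the top_k non-seed terms closest to the subgroup's mean embedding."""
    centroid = mean_vector([embeddings[word] for word in seed_words])
    candidates = [word for word in embeddings if word not in seed_words]
    candidates.sort(key=lambda word: cosine(embeddings[word], centroid),
                    reverse=True)
    return candidates[:top_k]


# Hypothetical 2-D embeddings, for illustration only.
toy_embeddings = {
    "happy": [0.9, 0.1], "glad": [0.8, 0.2],
    "cheerful": [0.85, 0.15], "pleased": [0.7, 0.3],
    "table": [0.1, 0.9], "gloomy": [-0.8, 0.2],
}
print(expand_subgroup(["happy", "glad"], toy_embeddings, top_k=2))
```

With four subgroups per class and top_k = 25, this yields the target of 100 candidate terms per emotion class before duplicates and conflicting terms are removed.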

The resulting vocabulary for each emotion class contained between 80 and 100 terms. To standardize the lexicon, we pruned it by considering the centrality of term embeddings: we compared each term’s embedding to the average category embedding and retained the 80 most pertinent terms per class. The final emotion lexicon comprised 1120 terms across the 14 classes. Table 1 depicts the alignment of the 2, 8 and 14 emotion classification schemes. The 8 classes of emotion contained 80 words per class with a total of 640 terms. The version with two classes contained 480 words per category with a total of 960 terms. Module 2 also contains externally sourced lexicons for modifiers (inhibitors and intensifiers) and negations, which were based on the valence detection work described in VADER [78]. VADER employs an advanced process that integrates human annotations, heuristic rules, and statistical modelling to determine the valence and polarity of the modifiers. Module 2 provides these two lexicons and the expanded emotion terms as output into Module 3.
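
The centrality-based pruning and the reported lexicon sizes can be illustrated with a short sketch. As before, this is an assumption-laden illustration rather than the authors' code: cosine similarity to the class mean is assumed as the centrality measure, and the function name and toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_by_centrality(term_embeddings, keep=80):
    """Keep the `keep` terms whose embeddings are closest to the class mean."""
    vectors = list(term_embeddings.values())
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    ranked = sorted(term_embeddings,
                    key=lambda term: cosine(term_embeddings[term], centroid),
                    reverse=True)
    return ranked[:keep]


# Hypothetical 2-D embeddings for one emotion class; "odd" is an outlier.
toy_class = {
    "joyful": [0.9, 0.1], "merry": [0.8, 0.3],
    "elated": [0.85, 0.2], "odd": [0.1, 0.9],
}
print(prune_by_centrality(toy_class, keep=3))

# Lexicon sizes reported in the text.
print(14 * 80, 8 * 80, 2 * 480)
```

In the toy class, the outlying term is the one discarded, which mirrors the intent of retaining only the most central terms per class.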

Module 3 receives the expanded emotion terms and their corresponding embeddings to generate an emotion embedding space. If the lexicon is constructed from scratch, the embedding extraction step is skipped, as the words are already tagged with embeddings during the expansion. For external lexicons, each word passes through an embedding extractor and is tagged with the corresponding embedding. The high-dimensional vectors of this emotion embedding space can be visualised using the t-SNE algorithm on a 2-D grid, as shown in Fig. 2. Each point in Fig. 2 corresponds to an emotion term in the 14-emotion categorization, with clear separation between green (positive emotions) and red (negative emotions).

Table 1 Alignment of the 2, 8 and 14 emotion classification schemes
