Readers’ affect: predicting and understanding readers’ emotions with deep learning

Emotions are at the core of what makes us human, which makes them highly useful for modeling human behavior. Today, people abundantly express and share emotions through social media. Technological advancements in such platforms enable users to share opinions or express specific emotions towards what others have shared, mainly in the form of textual data. This opens an interesting arena for analysis: whether there is a disconnect between the writer’s intended emotion and the reader’s perception of textual content. In this paper, we present experiments on Readers’ Emotion Detection in a multi-target regression setting by exploring a Bi-LSTM-based Attention model, where our main intention is to analyze the interpretability and effectiveness of the deep learning model for the task. To conduct the experiments, we procure two extensive datasets, REN-10k and RENh-4k, in addition to using a popular benchmark dataset from SemEval-2007. We perform a two-phase experimental evaluation. The first phase comprises various coarse-grained and fine-grained evaluations of our model’s performance in comparison with several baselines belonging to different categories of emotion detection, viz., deep learning, lexicon based, and classical machine learning. In the second phase, we evaluate model behavior towards readers’ emotion detection by assessing attention maps generated by the model, for which we devise a novel set of qualitative and quantitative metrics. The first phase of experiments shows that our Bi-LSTM + Attention model significantly outperforms all baselines. The second analysis reveals that emotions may be correlated with specific words as well as named entities.

Expression of emotions on social media has been modulated by new affordances from social media platforms, such as when Facebook in 2016 introduced five main emotion reactions to deepen the embedding of emotions in responses to social media posts¹. The presence and usage of such affordances provide a wealth of data to analyze and offer space for research into textual data from different perspectives, such as the emotion expressed by the writer (Writer Emotion), the emotion elicited from the readers (Readers' Emotion), and the dichotomy between expressed and perceived emotions in textual emotion detection. This is because the readers' emotions triggered by a document do not always agree with the writer's emotions. Leveraging readers' emotions has numerous potential applications that have attracted attention from the Natural Language Processing (NLP) and machine learning research sub-communities through a variety of tasks, viz., emotion aware search engines and recommendation systems, emotion enriched article generation, automated article editing to filter out or diminish emotionally sensitive content, forecasting readers' emotions on a creative article so that the writer can anticipate the emotions it will evoke in readers, etc. [1,2].
The computational task of detecting readers' emotions is generally formulated as a single/multi-class or multi-label classification task [3][4][5][6][7]. A minority of approaches model the task as a multi-target problem [8][9][10], which usually follows the traditional regression setting in NLP and provides information on the intensity of the corresponding emotions in addition to detecting emotion classes. Research into readers' emotion detection takes advantage of methods such as lexicon based [11], rule-based decision making [3], classical machine learning [12,13], deep learning [10,14], and hybrid approaches [15]. Of these, deep learning based approaches are usually observed to outperform the others, as is generally the case in many other areas of NLP including text classification, machine translation, and sentiment analysis [16][17][18]. This follows the advent of architectures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) like Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (Bi-LSTM), which encompass multiple levels of non-linear operations to accommodate automated feature representation of input data at different levels of abstraction. Among the deep learning based studies in textual emotion detection, there has been some recent interest in utilizing attention mechanisms to improve model performance [19] or to observe the words responsible for decision making [20]. However, to the best of our knowledge, there has been no prior work analyzing and quantifying the role of emotion words or named entities for the task of readers' emotion detection.
In this work, we utilize a Bi-LSTM + Attention model with the intention of analyzing the interpretability and behavior of the model for readers' emotion detection in a multi-target regression setting over short-text news documents. We perform detailed qualitative and quantitative analysis to understand the underlying model behavior and to quantify the role of emotion words and named entities in decision making. The major benefits of our study include a readers' emotion detection model that performs better than the baselines, a systematic investigation of the model's decision making (model behavior), and a specific study of the role of emotion words and named entities for the task. To represent readers' emotions, we utilize the discrete basic emotions defined by Paul Ekman [21] (happiness, sadness, anger, fear, disgust, and surprise), since they are the basic emotions most frequently discussed by theorists of discrete emotion models, and since most social media platforms allow their users to react to news or posts with discrete emotion representations.

Motivation and contributions
Inspired by recent works in the related area of sentiment analysis proposed by Kardakis et al. [19] to investigate the performance improvement of the attention based deep neural networks over non-attention based models, and the work by Sen et al. [22] to explore the interpretability of attention based deep neural networks, our objective in this work specific to the task of readers' emotion detection is to evaluate the attention enabled deep neural architecture and to illustrate that attention models have the potential to enrich the model prediction while enhancing the understanding of the process of decision making. Hence, in this work we limit the investigations towards readers' emotion detection using attention enabled Bi-LSTM. Other state-of-the-art technologies such as transformer-based language models are outside the scope of our present study.
To the best of our knowledge, there are only a few datasets that provide emotion intensities for regression based studies [23,24]. However, these datasets are not suitable for multi-target regression settings specific to readers' emotion detection, as they map each document to only a single emotion with a corresponding intensity. An available benchmark dataset that suits multi-target regression based readers' emotion detection is SemEval-2007 [25], but being annotated by only six readers, this dataset does not reflect the real-world scenario of a document being read and annotated by many readers. Also, even though there are a few readers' emotion detection models that have been benchmarked on specific languages (e.g., [13,26], which utilize Chinese corpora), there remains a need for a readers' emotion detection dataset in English to learn the linguistic and affective characteristics of English text. This inadequacy, also mentioned in [12,27,28], motivates us to procure extensive datasets that particularly suit deep learning based multi-target regression settings for predicting readers' emotion intensities rather than mapping to emotion classes.
The major contributions of this work are:
• We explore a Bi-LSTM + Attention model for the task of readers' emotion detection through multi-target regression settings over short-text news documents and compare the model performance against a set of baselines belonging to various families of textual emotion detection techniques, including lexicon based, machine learning, and deep learning, using an extensive set of coarse-grained and fine-grained evaluation measures.
• We investigate the interpretability of the attention mechanism to understand the underlying behavior of the Bi-LSTM + Attention model for the task of readers' emotion detection by conducting qualitative and quantitative analysis to quantify the role of emotion words and named entities in the model's decision making.
• We procure two new readers' emotion news datasets, REN-10k and RENh-4k, where the news articles are associated with corresponding readers' emotions. We also assign the associated genre information to the articles. As a result, apart from readers' emotion detection, these datasets can be used for multiple tasks, including document summarization and genre classification, at various scales (short-text and long-text), making them heterogeneous task datasets. We shall make REN-10k available at https://dcs.uoc.ac.in/cida/resources/ren-10k.html and RENh-4k at https://dcs.uoc.ac.in/cida/resources/renh-4k.html publicly, along with the publication, to aid future research.
The rest of the paper is organized as follows. The "Related work" section reviews the literature, followed by the methodology in the "Multi-target readers' emotion detection" section. The "Empirical study" section presents the dataset description, experimental setup, model performance evaluation, and model behavior analysis. Finally, the "Conclusion" section gives concluding remarks and the scope for future research.

Related work
Among the large volume of studies present in literature for textual emotion detection, including the writer/document perspective and readers' perspective, only a few focus on readers' perspective of textual emotion detection. In this section, we review prominent works in the writer and readers' perspective of textual emotion detection across three categories, viz., lexicon based, classical machine learning, and deep learning approaches. The abundance of work using deep learning prompts us to consider it as a separate category despite it falling within the broader machine learning umbrella.

Lexicon based approaches
Studies in this context leverage emotion lexicons, including general-purpose [29][30][31] and domain-specific emotion lexicons [32], which consist of lexical word units and their intensity associations to the emotion classes, and use them to build numerous emotion detection systems by exploiting word level matches. There has been limited exploration of the lexicon based approach that is specific to readers' emotions. Such readers' emotion detection works began with the popular shared task, SemEval-2007 Task 14 [25], to predict the intensity of different emotion classes for a reader annotated dataset, where SWAT [11] is one of the top three systems of this task. This was followed by other works like the Emotion-Term model built over Naïve Bayes and its extension, the Emotion-Topic model that uses topic models [12]. Even though lexicon based approaches are beneficial due to their simplicity and ease of spotting keywords from the relevant vocabulary, they are limited in their ability to handle negations, multiple word senses, etc. In this context, Krcadinac et al. [33] illustrate the possibility of a hybrid lexicon based system, Synesketch, with several heuristic rule sets along with emotion lexicons for textual emotion detection, even though not specifically for readers' emotion. We make use of Synesketch [33] and two other promising lexicon based approaches specific to readers' emotion detection, i.e., SWAT [11] and the Emotion-Term Model [12], as baselines for model performance comparison.

Machine learning based approaches
Classical machine learning opens up a way to learn hidden patterns in data through several mathematical models and to overcome the drawbacks of lexicon based approaches in handling words with implicit emotion expressions. Most studies following this approach to textual emotion detection are designed as supervised multi-class tasks and some as multi-label/target tasks [7], with learning models like Support Vector Machine (SVM) [34], Naïve Bayes [35], multi-layer perceptron [36], logistic regression [37,38], etc. Features used across such approaches can be broadly categorized as linguistic features [34,39], symbol level features [32], and affective features [32,40]. Apart from widely explored linguistic features like TF-IDF, N-grams, and bag-of-words (BOW), Ren et al. [39] utilize pretrained word embeddings to compute Word Mover's Distance (WMD), a distance based feature, for textual emotion detection. The readers' perspective of textual emotion detection also relies on almost the same set of features and learning models for multi-class [1,2] and multi-label/target [4,5] settings. Apart from the supervised studies, there also exist unsupervised approaches to readers' emotion detection built with the help of topic level parameters [12,13,27]. However, Dong et al. point out that such topic-level works are more suitable for predicting the writer's emotion than readers' emotions [14]. Considering these, we choose baseline models that follow multi-target regression settings, since those are likely more suitable for predicting readers' emotion intensities than simply mapping to emotion classes as done in multi-class/label classification settings. Multi-target problems can be addressed in many ways, such as problem transformation, algorithm adaptation, and ensemble approaches [41]; we use baselines that leverage both problem transformation and algorithm adaptation with a few prominent linguistic and affective features.

Deep learning based approaches
Deep learning architectures have of late significantly outperformed classical machine learning methods in most NLP tasks. Deep learning based works in textual emotion detection include CNNs [42], combinations of CNN with various RNN models [15,43], stacked RNNs [44], attention-based architectures [45], Gated Recurrent Unit (GRU) [46], LSTM [47], etc. Apart from these studies, Kratzwald et al. [48] and Chatterjee et al. [44] consider the possibilities of sentiment aided transfer learning (sent2affect) and sentiment-specific word embedding (SS-BED), respectively, for textual emotion detection. Research in textual emotion detection specific to readers' emotions also explores similar learning architectures [9,14,49]. Slightly different lines of inquiry for predicting readers' emotions are presented in recent works, viz., [50], which utilizes an ontology driven knowledge base with a deep learning classifier, and [51], which combines comments along with articles as input to a deep learning model. With reference to such recent advances, we draw upon the notable studies sent2affect [48] and SS-BED [44], and the RNN architectures GRU [46], LSTM, and Bi-LSTM [15,44,48], as baselines in our empirical evaluation.

The question of interpretability
Deep learning based approaches for textual emotion detection are found to generally outperform other approaches, but their decisions are not easily explainable, as what they learn is embedded deep within a large number of weight parameters. Nonetheless, there has been much interest in using attention networks to shed light on the workings of deep learning models. Using attention, neural architectures can automatically differentiate slices of the input data in the form of weights, and such learnt attention can also aid the overall learning. This helps to boost overall model performance and enhance interpretability. While there has been research in textual emotion detection that incorporates attention mechanisms to improve model performance [43,52] or to observe salient words responsible for decision making in typical architectures [45,53,54], there has been virtually no exploration tuned specifically to readers' emotion detection; however, models for related tasks may be considered for the task. The sentiment analysis work by Sen et al. [22], which demonstrates and quantifies the resemblance of machine attention maps to hand-labeled human attention maps, is notable in this regard. Others include research on text classification by Lertvittayakumjorn et al. [55] that performs human grounded explanation evaluations to analyze model behavior, model predictions, and uncertain predictions, and the research by Wiegreffe et al. [56] proposing various tests to determine the usefulness of attention for obtaining explanations. Insights from these works, along with some of the attention based works in NLP (e.g., [19,57]), show that attention does encode several linguistic notions, and hence one can utilize attention as a prominent means of interpretability to open the neural black box. In this context, our study adopts an attention mechanism for readers' emotion detection to interpret emotion associated linguistic notions and their importance in predictions.

Multi-target readers' emotion detection
We now outline our task more formally. We formulate the task of detecting readers' emotions of a textual document as a multi-target regression problem, where the statistical model applied to each input document is expected to produce intensity values for the emotion classes anger, fear, joy, sadness, and surprise. Each textual document $d$ consists of a sequence of words $[w_1, w_2, w_3, \ldots]$, each word drawn from the dictionary of words compiled from across the document corpus. For each $d$, the corresponding readers' emotion profile from the labelled data, denoted $ep_r(d)$, is modelled as a normalized distribution of the votes cast by multiple readers for $E$ distinct emotions. Thus, a document that has gathered equal votes for a set of five emotions would yield $ep_r(d) = [0.2, 0.2, 0.2, 0.2, 0.2]$. The sum-to-one normalization places documents of different popularity (i.e., vote abundance) on the same footing. The labelled corpus $D$ with $M$ documents can then be represented as $D = \{(d_1, ep_r(d_1)), (d_2, ep_r(d_2)), \ldots, (d_M, ep_r(d_M))\}$, where $ep_r(d_i)$ indicates the readers' emotion profile of document $d_i$.
The supervised task of readers' emotion detection is then to find the best-fit mapping function $f: \text{document} \rightarrow \mathbb{R}^E$, such that each document $d$ is mapped as close as possible to the readers' emotion profile from the labelled data, i.e., $ep_r(d)$.
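As a small illustration of the sum-to-one normalization described above, the following Python sketch converts raw per-emotion vote counts into a readers' emotion profile. The function name `emotion_profile`, the ordering of the emotions, and the vote counts are assumptions made for this example and are not taken from the datasets.

```python
from typing import List, Sequence

# Assumed ordering of the five emotion classes (illustrative only).
EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]


def emotion_profile(votes: Sequence[float]) -> List[float]:
    """Turn raw per-emotion vote counts into a sum-to-one profile."""
    total = sum(votes)
    if total <= 0:
        raise ValueError("a document needs at least one vote")
    return [v / total for v in votes]


# A widely read document and a rarely read one with the same vote shape
# end up with the same profile, which is the point of the normalization.
popular = emotion_profile([200, 200, 200, 200, 200])
niche = emotion_profile([3, 3, 3, 3, 3])
```

Both `popular` and `niche` are uniform profiles over the five emotions, even though the underlying vote totals differ by two orders of magnitude.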

Methodology
To build the readers' emotion detection model, we use one of the prominent RNN based architectures, Bi-LSTM [58], combined with Attention [16]. Our choice of deep learning architecture is oriented towards ensuring model performance as well as the ability to investigate model behavior (i.e., interpretability of the model). The Bi-LSTM network is capable of learning long-term dependencies without maintaining duplicate context representations [59], and works by performing sequential modeling in both directions (left to right and right to left), thereby incorporating past and future context information effectively [17]. The Attention modelling on top of the Bi-LSTM network assigns weights to the words in the input sequence that are most relevant to our prediction task. Apart from enhancing overall model performance [19], the use of Attention helps to analyze the interpretability of our model for readers' emotion detection. In particular, it supports our intent of analyzing how the presence of emotion words and named entities relates to the workings of readers' emotion identification. This interpretability analysis is carried out using explanations obtained as Attention Maps from the attention layer. A detailed sketch of our technique is illustrated in Fig. 1.
The Bi-LSTM network processes sequential inputs from left to right (forward) and from right to left (backward) at the same time to produce contextual information as the output vectors. Let $\overrightarrow{h_l}$ be the forward processing hidden layer and $\overleftarrow{h_l}$ be the backward processing hidden layer, concatenated to form a single layer $h = [\overrightarrow{h_l}; \overleftarrow{h_l}]$. In the Bi-LSTM network, $f$ and $b$ denote the parameters of the forward and backward LSTM units, and $w_i$ serves as the representation of each word. To learn representations that assign more weight to the words that contribute significantly to the model's decision making, we add an attention mechanism on top of the Bi-LSTM by adopting the popular Attention mechanism proposed by Bahdanau et al. [16]. To implement Attention, we first take the last hidden state $h_n$ as a document summary vector $Z$ and process it through an alignment model, which is a feedforward network trained along with the entire model, to produce a scalar value $u_i$, and then use a softmax to obtain weights $\alpha_i$ that represent the importance of each hidden state $h_i$:
$$u_i = v^{\top} \tanh(W_h h_i + W_Z Z), \qquad \alpha_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)},$$
where $W_h, W_Z \in \mathbb{R}^{a \times b}$ and $v \in \mathbb{R}^{a}$ are the learnable weight parameters. The final document representation $H$ just before the prediction layer is then computed as a weighted sum over the hidden states $h_i$ with their corresponding weights $\alpha_i$:
$$H = \sum_{i} \alpha_i h_i.$$
This realizes the attention mechanism by determining which words in the source document should receive more weight. $H$ is then fed to the output layer, which consists of a single fully connected Multi-Layer Perceptron (MLP) (i.e., dense layer) network that produces a normalized distribution of readers' emotions, $\hat{ep}_r(d)$, using a softmax. The loss between the predicted vector $\hat{ep}_r(d)$ and the labelled vector $ep_r(d)$ is propagated back to complete the learning process. Once the model is trained, we empirically evaluate the model on two fronts. First, the accuracy of emotion prediction is evaluated based on how well the predicted emotion distribution reflects the distribution derived from the labels. Second, the attention outputs for documents from a fully trained network are qualitatively and quantitatively evaluated to assess model behavior, as outlined in the following section.
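To make the attention step concrete, the following is a minimal pure-Python sketch of a forward pass. It assumes the standard additive (Bahdanau-style) scoring form $u_i = v^{\top}\tanh(W_h h_i + W_Z Z)$ with the last hidden state as the summary vector $Z$. The function names and the toy numbers are hypothetical, and the sketch omits the Bi-LSTM itself, training, and the output layer.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Sequence[float]) -> Vector:
    """Multiply an (a x b) matrix by a length-b vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(
    hidden: List[Vector], W_h: Matrix, W_Z: Matrix, v: Vector
) -> Tuple[Vector, Vector]:
    """Return attention weights alpha and the weighted document vector H."""
    z = hidden[-1]                      # last hidden state as summary Z
    proj_z = matvec(W_Z, z)
    scores = []
    for h_i in hidden:
        proj_h = matvec(W_h, h_i)
        # u_i = v^T tanh(W_h h_i + W_Z Z)
        u_i = sum(vk * math.tanh(ph + pz)
                  for vk, ph, pz in zip(v, proj_h, proj_z))
        scores.append(u_i)
    alpha = softmax(scores)
    dim = len(hidden[0])
    H = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]
    return alpha, H


# Toy example: three words, hidden size b = 2, attention size a = 3.
hidden_states = [[0.1, 0.3], [0.7, -0.2], [0.4, 0.5]]
W_h = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.6]]
W_Z = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.2]]
v = [0.3, -0.6, 0.9]
alpha, H = additive_attention(hidden_states, W_h, W_Z, v)
```

The list `alpha` plays the role of an attention map over the input words: it sums to one, and larger entries mark the hidden states that contribute more to `H`.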

Empirical study
We conduct experiments to analyze the performance of the Bi-LSTM + Attention model and compare it against a number of baselines to illustrate that Bi-LSTM + Attention shows significant performance improvement in detecting readers' emotions. We then evaluate model behavior with respect to understanding its workings, with a particular focus on the role of emotion words and named entity mentions. We first describe the datasets used in this study, followed by the experimental setup and the evaluations of model performance and model behavior, with corresponding results and discussion.

Dataset
In our experiments, we utilize three datasets, two Readers' Emotion News Datasets (RENh-4k and REN-10k) that we have newly curated, and the SemEval-2007 [25] benchmark dataset.

Readers' emotion news datasets
To procure our two Readers' Emotion News datasets, we use the social news network Rappler [60] and its award-winning Mood Meter² widget. Mood Meter enables readers to cast their emotion votes for several categories of emotions (Afraid, Amused, Angry, Annoyed, Don't care, Happy, Inspired, and Sad) and records the percentage of total votes obtained for each emotion. We choose Rappler over other sources due to its simplicity, popularity, and the ease with which it organizes news articles under multiple genres with associated emotion profiles. We manually collect only the popular news articles by checking for high emotion vote counts in the Rappler Mood Meter, to ensure that the selected news articles have a high social reach. The details of our two datasets are given below.
RENh-4k: This is a short-text dataset with 4000 news documents and associated readers' emotion profiles. News headlines and the associated abstract/snippet are combined to form the documents, and the corresponding readers' emotion profiles are obtained from readers' votes on Mood Meter for the emotion classes Afraid, Angry, Happy, Inspired, and Sad. We also assign each document to one of the categories Health & well-being, Social issues, or Others, after manually verifying news genres.
REN-10k: This is an advanced version of RENh-4k in terms of the number of documents, the length of documents, and a more diverse set of emotion classes and document genres. This dataset contains 10,272 news documents with corresponding readers' emotion profiles. Here, documents comprise news headlines, abstracts, and news content or full-length news stories without non-textual content like images and videos. Unlike RENh-4k, readers' emotion profiles are collected for a wider set of emotion classes: Afraid, Amused, Angry, Annoyed, Don't care, Happy, Inspired, and Sad. We also assign documents to the categories Business, Entertainment, Lifestyle, Sports, Technology, and Others, by manually verifying the genre information available in Rappler. REN-10k documents consist of the whole textual content associated with a particular news article and average 533.613 words per document, i.e., they are long-text in nature. Since our study is over short-text documents, we utilize only the news headlines and associated abstracts of REN-10k to form the documents, without the associated news content or full-length news stories.

SemEval-2007
SemEval-2007 is a short-text dataset consisting of 1250 documents comprising news headlines and corresponding emotion scores for the emotion classes Anger, Disgust, Fear, Joy, Sadness, and Surprise, annotated by six readers [25].

Dataset pre-processing
Given our intent of predicting basic emotions elicited from readers, the first pre-processing step we perform on the datasets is an emotion label mapping from the Rappler Mood Meter emotion classes to Paul Ekman's basic emotions [21]. We map Angry→Anger, Sad→Sadness, Afraid→Fear, Happy→Joy, and Inspired→Surprise, and discard the other Mood Meter emotion classes (Don't care, Amused, and Annoyed), following the methodology proposed by Badaro et al. [30] and Staiano et al. [61]. Since Disgust in Ekman's basic emotions does not match any of the Mood Meter emotion classes, we discard it in our study and retain the remaining five basic emotions to preserve a common set of labels for all the datasets, as done in [30,61]. To represent the output labels as a distribution over five emotions (anger, sadness, fear, joy, and surprise), we follow a normalization procedure similar to that of Lei et al. [27]. We then clean our datasets by removing noisy or metadata keywords like report, new-review, survey, (UPDATED), Midday-wRa, etc., that appear several times in the articles. To improve the quality of text representation, we also apply a generic set of pre-processing techniques including removal of unknown symbols and … [62] to derive the statistics. Figure 2 depicts the distribution of emotions in each of the datasets.
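The label mapping and renormalization described above can be sketched as follows. This is only an illustration: the dictionary and function names are hypothetical, the vote values are made up, and the simple sum-to-one renormalization over the retained classes is an assumption standing in for the procedure of Lei et al. [27], whose details are not given here.

```python
from typing import Dict, List

# Mood Meter class -> basic emotion, as stated in the text.
MOOD_TO_BASIC = {
    "Angry": "anger",
    "Sad": "sadness",
    "Afraid": "fear",
    "Happy": "joy",
    "Inspired": "surprise",
}
# Assumed output ordering of the five retained emotions.
TARGETS = ["anger", "sadness", "fear", "joy", "surprise"]


def map_profile(mood_votes: Dict[str, float]) -> List[float]:
    """Keep only mapped Mood Meter classes and renormalize to sum to one."""
    kept = {
        MOOD_TO_BASIC[mood]: share
        for mood, share in mood_votes.items()
        if mood in MOOD_TO_BASIC          # unmapped classes are discarded
    }
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no votes left after discarding unmapped classes")
    return [kept.get(emotion, 0.0) / total for emotion in TARGETS]


# Hypothetical Mood Meter percentages for one article.
example = map_profile({"Angry": 30, "Sad": 10, "Don't care": 50, "Happy": 10})
```

In this example the "Don't care" share is dropped, and the remaining shares are rescaled so that the five-emotion profile again sums to one.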

Experimental setup and evaluations
We conduct two sets of experiments to evaluate our Bi-LSTM + Attention model for detecting readers' emotions from short-text documents. The first set of experiments focuses on model performance evaluation where we compare the performance of our model with several baselines using various coarse-grained and fine-grained evaluation measures. The second set of evaluations focuses on model behavior analysis (i.e., interpretability of the model) using the attention maps generated during the predictions. In model behavior analysis, we initially perform an ablation study to identify the impact of attention in predicting readers' emotion profiles, followed by a novel set of qualitative and quantitative evaluation techniques over the attention maps to extensively scrutinize the model's decision making, specifically to realize the role of emotion words and named entities in readers' emotion detection.

Model performance evaluation
To conduct our empirical evaluation, each dataset is split into train, validation, and test sets in the ratio 60:20:20 of the total dataset volume. To build our Bi-LSTM + Attention model, we embed input documents using different pre-trained word embeddings: Google Word2Vec-300d⁴, Wikipedia2Vec-100d and 200d⁵, and GloVe-100d⁶. To aid reproducibility of our work, we report the hyperparameters used: dropout of 0.5, Mean Squared Error (MSE) as the loss function, the Adam optimizer with a learning rate of 0.0005, a batch size of 128, an l2(0.001) regularizer, and 100 epochs.
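A minimal sketch of this setup in Python is given below. The hyperparameter values are those stated above, while the dictionary keys, the function name, and the seeded shuffle before splitting are assumptions made for illustration; the text specifies only the 60:20:20 ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Reported hyperparameters; the key names are illustrative.
HYPERPARAMS = {
    "dropout": 0.5,
    "loss": "mse",
    "optimizer": "adam",
    "learning_rate": 0.0005,
    "batch_size": 128,
    "l2": 0.001,
    "epochs": 100,
}


def split_60_20_20(
    items: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle (assumed) and split into train/validation/test at 60:20:20."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100   # integer arithmetic avoids
    n_val = len(shuffled) * 20 // 100     # floating-point rounding issues
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Example with document indices standing in for a 4000-document corpus.
train_ids, val_ids, test_ids = split_60_20_20(range(4000))
```

With 4000 items this yields 2400 training, 800 validation, and 800 test documents, and every document lands in exactly one split.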
To compare the performance of our Bi-LSTM + Attention model, we implement a set of baselines belonging to the categories deep learning, lexicon based, and classical machine learning (as outlined while discussing related work). Deep learning baselines include the recent state-of-the-art textual emotion detection works and other popular architectures. The lexicon and classical machine learning baselines also include the popular and top-performing state-of-the-art methods. We outline the details of baselines below:

Deep learning baselines
• sent2affect [48]: This is a textual emotion detection method that utilizes transfer learning from an RNN model initially trained for the task of sentiment analysis. To reproduce their work faithfully, we use the sentiment140⁷ dataset to build the model, since the Twitter Sentiment dataset used in their paper was not found at the link provided⁸. We believe sentiment140 is appropriate primarily due to its large size, comprising as many as 1.6 million data objects.
• SS-BED [44]: This is a semantic and sentiment oriented textual emotion detection system, where the same text is given two different representations: the semantic representation using word embedding, and the sentiment representation using the sentiment specific word embedding proposed in [63].
• Kim's CNN [64]: This work is a popular CNN architecture for text classification. The hyper-parameters used to build this model are given in the Appendix.
• Naïve Deep Learning Baselines: These include the general RNN architectures GRU [46], LSTM, and Bi-LSTM, which are used as baselines in certain textual emotion detection works [15,44,48]. The hyper-parameters used are given in the Appendix.

Lexicon based baselines
• SWAT [11]: SWAT is one of the top ranked systems developed for the shared task SemEval-2007 Task 14: Affective Text [25]. This supervised system uses predefined sets of emotion words, developed using a unigram model, to build emotion annotations of news headlines.
• Emotion Term Model [12]: This is an improved version of the classical Naïve Bayes that incorporates information on emotion ratings along with the term independence assumption.
• Synesketch [33]: This is a textual emotion detection system that makes use of a word-level lexicon and an emoticon lexicon, along with a set of heuristic rules.
⁷ https://www.kaggle.com/kazanova/sentiment140

Classical machine learning baselines
• WMD [39]: WMD is a textual emotion detection method that uses the Word Mover's Distance feature along with an SVM classifier. To reproduce this work faithfully, we use 60% of our corpus for training, 20% for testing, and the remaining 20% as the seed corpus, for the five emotion classes. We use Support Vector Regression (SVR) with a multi-output regressor for our multi-target regression problem instead of their SVM classifier.
• Multi-target regression with handcrafted features: We use multiple methods for multi-target regression, with a rich set of features. We describe the features and the models below:
  - TF-IDF Feature [39,48]: This is a popular and commonly used feature vector indicating Term Frequency (TF) and Inverse Document Frequency (IDF).
  - N-Grams Feature [32,44]: For the N-Grams feature, we choose N from {1, 2, 3, 4}. For improved efficiency, we utilize Parts-of-Speech tagging to identify and retain only the nouns, verbs, adverbs, and adjectives, as they are a prominent source of subjective content [65]. We make use of VADER [66] to compute the sentiment features.
  - Embedding Features [44,63]: We use two different types of embeddings: the semantic embeddings, which include Word2Vec, GloVe, and FastText, and the Sentiment Specific Word Embedding, SSWE u, proposed in [63]. The individual word vectors are averaged to form document vectors for both types of embeddings.
  - Multi-target Regression Models: We now describe the multi-target regression models across various families of methods. Based on the problem transformation approach, we implement a Multi-output Regressor using Ridge⁹, SVR¹⁰, and GradientBoostingRegressor¹¹. Within the algorithm adaptation approach, we implement a Multi-Layer Perceptron with a single hidden layer of 128 neurons, ReLU activation, and an l2(0.001) regularizer, and a final output layer with softmax activation. The other hyperparameters are the MSE loss function, the Adam optimizer with a learning rate of 0.0005, a batch size of 64, and 100 epochs.
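As an illustration of the embedding features listed above, the sketch below averages individual word vectors into a single document vector. The function name, the two-dimensional toy embedding table, and the choice to skip out-of-vocabulary words (returning a zero vector when none remain) are assumptions for this example; real pre-trained embeddings would have 100 to 300 dimensions.

```python
from typing import Dict, List, Sequence


def document_vector(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of all in-vocabulary tokens of a document."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim              # assumed fallback for empty matches
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


# Hypothetical 2-dimensional embedding table.
toy_embeddings = {
    "floods": [1.0, 0.0],
    "rescue": [3.0, 2.0],
}
doc_vec = document_vector(["floods", "rescue", "unseenword"], toy_embeddings, 2)
```

The resulting fixed-length vector can then be passed to any of the multi-target regressors, regardless of how many words the document contains.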

Performance evaluation measures
To measure the effectiveness of readers' emotion detection, we make use of different coarse-grained and fine-grained evaluation metrics [67]. Coarse-grained measures are useful to understand the correctness of prediction at a binary level, whereas fine-grained measures indicate the nearness of prediction to ground truth. In coarse-grained evaluation, we map regression predictions to a 0/1 classification problem and use Acc@1 (accuracy of the top-ranked prediction), a measure that effectively maps to the micro-averaged F1 measure [68]. Acc@1 is popularly used in several textual emotion detection works [12,13,27,69] to measure performance on corpora with an imbalanced distribution of data. In fine-grained evaluation, we use a set of measures, namely AP_document, AP_emotion, Root Mean Square Error, and Wasserstein Distance, which we will describe shortly. AP_document and AP_emotion are quite popular in textual emotion detection [11,13,70] and take into consideration the correlation between predicted emotion probabilities and ground truth readers' emotion reactions over the emotions and documents, respectively. Because our task is formulated as a regression problem, we also use Root Mean Square Error and Wasserstein Distance, which give a sense of how close (or distant) the predicted emotion probabilities are from the ground truth.

… points over the best results among problem transformation baselines and 3.00, 11.06, 5.13, 3.05, and 2.07 percentage points over the best results among algorithm adaptation baselines, respectively. The overall trends across the results are consistent and suggest that our Bi-LSTM + Attention model performs well on prediction of both the highest (Acc@1) and overall (AP_document and AP_emotion) readers' emotion profiles, along with lower values for the error and distance metrics, over three different datasets. This indicates that the Bi-LSTM + Attention model is able to leverage two-way learning and attention effectively towards identifying readers' emotions.
Among the deep learning baselines, SS-BED performs better, which we believe is because it encodes both sentiment and semantic information to enhance the traditional way of embedding. Transfer learning, in general, gives good results; on the contrary, sent2affect shows low results for sentiment-to-emotion transfer learning in our experiments. Our informed guess is that this is because the source model was built over Twitter data meant specifically for the coarse-grained sentiment classification task, whereas the target model is meant for an entirely different fine-grained emotion regression task. In the original implementation of sent2affect, by contrast, both the source and target models are built with similar kinds of Twitter data, both meant for the classification task, which leads to better alignment. In the case of lexicon based baselines, we can observe that SWAT performs well despite being an older baseline. We believe that both SWAT and Emotion Term Model could effectively utilize word features available within the corpora, which makes them the top performing baselines. On the other hand, Synesketch uses a very generic and non-filtered general-purpose emotion lexicon as its major component (apart from rule sets), which may be the cause of its low results. Among the machine learning baselines, affective features, more specifically TEI, outperform traditional linguistic features like TF-IDF and N-Grams in many cases; TF-IDF, TEC, and MEI are the other features producing the best results. We also analyze the affective features GEC and GEI with three different thresholds of δ (0.25, 0.5, and 0.75) and observe a degradation in performance as δ increases from 0.25 to 0.75, which we believe is due to decreased coverage of emotion words by the lexicon, as mentioned in [32]. Performance evaluation across multiple datasets illustrates that the SemEval-2007 dataset shows slightly better results than RENh-4k, even with less data.
We suppose that SemEval-2007, with labels sourced from up to six annotators, is a less complex and better curated dataset. In the context of our datasets, however, the minimum number of annotators involved is 242,680 for RENh-4k and 528,327 for REN-10k, which makes them complex real-world datasets with several contradictory readers' votings in the ground-truth emotion profiles. To understand the effect of dataset complexity with respect to the number of readers annotating a document, we find the degree of correlation between emotions using Pearson's correlation coefficient [10], shown in Fig. 3 (dark colors indicate high correlation and light colors indicate low correlation). We can observe several natural correlations in SemEval-2007, such as anger being highly correlated to fear and sadness, whereas in REN-10k and RENh-4k a low correlation exists between them. Also, there exists a very low correlation between joy and fear in SemEval-2007, whereas for REN-10k and RENh-4k they have comparatively slightly higher correlations. We assume that these kinds of irregular and complex patterns, potentially due to noise across a large number of annotators, reduce the performance gain in RENh-4k; this is overcome with the larger amount of data in REN-10k, which produces remarkable gains by allowing the model to learn the complex patterns.

Fig. 3 Emotion profile correlations in the datasets
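The evaluation measures used in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact formulation used in our experiments: RMSE is averaged over all document-emotion entries, Wasserstein Distance is computed in one dimension over emotion indices treated as unit-spaced positions (the ordering of emotions is arbitrary), and AP_document and AP_emotion are taken as mean Pearson correlations per document and per emotion. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Profile = Sequence[float]


def pearson(x: Profile, y: Profile) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def argmax(p: Profile) -> int:
    return max(range(len(p)), key=lambda i: p[i])


def acc_at_1(y_true: List[Profile], y_pred: List[Profile]) -> float:
    """Fraction of documents whose top predicted emotion matches the top true emotion."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def rmse(y_true: List[Profile], y_pred: List[Profile]) -> float:
    """Root mean square error over all document-emotion entries (assumed averaging)."""
    sq = [(a - b) ** 2 for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return math.sqrt(sum(sq) / len(sq))


def wasserstein_1d(t: Profile, p: Profile) -> float:
    """1-D Wasserstein distance: sum of absolute differences of the two CDFs."""
    dist, ct, cp = 0.0, 0.0, 0.0
    for a, b in zip(t, p):
        ct += a
        cp += b
        dist += abs(ct - cp)
    return dist


def ap_document(y_true: List[Profile], y_pred: List[Profile]) -> float:
    """Mean Pearson correlation across emotions, computed per document."""
    return sum(pearson(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ap_emotion(y_true: List[Profile], y_pred: List[Profile]) -> float:
    """Mean Pearson correlation across documents, computed per emotion."""
    k = len(y_true[0])
    cols_t = [[doc[j] for doc in y_true] for j in range(k)]
    cols_p = [[doc[j] for doc in y_pred] for j in range(k)]
    return sum(pearson(t, p) for t, p in zip(cols_t, cols_p)) / k


# Toy emotion profiles over three emotions (hypothetical values).
y_true = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
y_pred = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
print(acc_at_1(y_true, y_pred))  # 0.5
print(rmse(y_true, y_pred), ap_document(y_true, y_pred), ap_emotion(y_true, y_pred))
```

Higher values are better for Acc@1, AP_document, and AP_emotion, whereas lower values are better for RMSE and Wasserstein Distance.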

Results and discussion
In addition to the substantial gain observed over various evaluation measures, we statistically evaluate the difference between models by conducting statistical significance tests on paired models in terms of Acc@1 and RMSE, which represent the coarse-grained (i.e., classification) and fine-grained (i.e., regression) characteristics of our task, respectively. We perform McNemar's test over Acc@1 and the Kolmogorov-Smirnov test over RMSE to compute the significance between our Bi-LSTM + Attention model and the best baseline, using the conventional significance level of 0.05. We obtain p-values of 4.79E-11, 3.46E-3, and 8.87E-3 for Acc@1 and 2.96E-19, 1.45E-3, and 3.89E-10 for RMSE, for the three datasets REN-10k, RENh-4k, and SemEval-2007, respectively, all below the significance level. This indicates that the gains of our Bi-LSTM + Attention model over the best baselines are statistically significant.
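To make the paired test over Acc@1 concrete, the following is a minimal sketch of an exact (binomial) McNemar's test on per-document top-1 correctness. Which variant of the test (exact or chi-square with continuity correction) was used in our experiments is not stated above, so the exact variant here is an assumption; the function name and the toy data are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two models evaluated on the same documents.

    Only discordant pairs matter: documents that exactly one model gets right.
    """
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree
    k = min(a_only, b_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: model A is right on five documents where model B is wrong,
# and the two models agree on the remaining two documents.
model_a = [True] * 5 + [True, False]
model_b = [False] * 5 + [True, False]
p_value = mcnemar_exact(model_a, model_b)
print(p_value)  # 0.0625
```

A p-value below the chosen significance level (0.05 in our setting) would indicate that the difference in top-1 correctness between the two models is unlikely to be due to chance.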

Model behavior analysis
Readers' emotions elicited from textual documents may be intuitively expected to be highly oriented towards emotion words and named entities present in the documents. However, such assumptions need to be verified empirically, so that they may inform further research into readers' emotion detection. In this context, we set our evaluation hypothesis as follows: the key terms that help our Bi-LSTM + Attention model predict readers' emotion profiles are the emotion words and named entities present in the documents. Every prediction of our Bi-LSTM + Attention model produces a readers' emotion profile along with an attention map that highlights key terms (terms that are given weightage by the attention layer). The model thus enables us to analyze the attention maps and hence model behavior (i.e., the model's decision making) in the context of readers' emotion detection. Based on our hypothesis, we expect the attention maps of predictions to highlight emotion words and named entities present in the textual document as key terms. Hence, in this section, we devise novel evaluation strategies to computationally represent and validate the hypothesis, first verifying the necessity of the attention mechanism for the task and then qualitatively and quantitatively analyzing the behavior of our Bi-LSTM + Attention model.

Ablation study over attention layer: uniform attention as the adversary
There is no point in analyzing model behavior to study the impact of emotion words and named entities if attention does not have a reasonable influence on prediction [56]. Hence, to establish the necessity of the attention mechanism in our readers' emotion detection task over the three datasets, we adopt a technique similar to ablation studies in machine learning. We study the importance of attention by using uniform attention as the fallback model, based on the observation in [56] that analysis or interpretation of attention stays valid only if attention is a necessary component of the entire prediction model. For this, we rebuild our model by replacing the varying weight distribution of the attention mechanism on top of the hidden states with uniform weights (a uniform-attention model). This, in a sense, nullifies the effect of the attention layer, so that we can analyze the model without the influence of attention and compare it against the model with an attention mechanism, making the study an ablation analysis. Results obtained for the uniform-attention-as-the-adversary experiments on the three datasets are given in Table 5 and are compared against our Bi-LSTM + Attention model, taken as a baseline. The results indicate that our model has noteworthy gains over all the datasets for all evaluation measures. McNemar's test over Acc@1 and the Kolmogorov-Smirnov test over RMSE are also computed to analyze the statistical significance of the difference between the attention-enabled model (i.e., our Bi-LSTM + Attention model) and the uniform-attention model. The results illustrate that the gains obtained by our Bi-LSTM + Attention model over the uniform-attention model are statistically significant, with p-values of 34.37E-4, 6.26E-3, and 1.59E-3 for Acc@1 and 1.52E-5, 2.41E-3, and 3.68E-4 for RMSE, for the three datasets REN-10k, RENh-4k, and SemEval-2007, respectively.
Thus, the ablation study shows that the attention mechanism in our model significantly influences readers' emotion detection for all three datasets. This provides us confidence that the attention map could contain important information to verify our hypothesis with respect to emotion words and named entities.
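The difference between the attention-enabled and the uniform-attention models can be sketched as two ways of weighting the same hidden states before pooling them into a document representation. This is a minimal illustration only: the dot-product scoring against a `query` vector is an assumption standing in for the learned attention layer, and the toy hidden states are hypothetical values.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def learned_weights(hidden_states: Sequence[Vector], query: Vector) -> List[float]:
    """Attention weights from dot-product scores (an assumed scoring function)."""
    scores = [sum(q * h_i for q, h_i in zip(query, h)) for h in hidden_states]
    return softmax(scores)


def uniform_weights(n: int) -> List[float]:
    """Uniform attention: every time step receives the same weight."""
    return [1.0 / n] * n


def attention_pool(hidden_states: Sequence[Vector], weights: Sequence[float]) -> List[float]:
    """Weighted sum of hidden states, giving one vector per document."""
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]


# Toy hidden states for a three-word document (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # stands in for learned attention parameters

with_attention = attention_pool(states, learned_weights(states, query))
without_attention = attention_pool(states, uniform_weights(len(states)))
print(with_attention, without_attention)
```

With uniform weights, pooling reduces to a plain average of the hidden states, so any gain of the attention-enabled model can be attributed to the learned weight distribution.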

Qualitative evaluation
Qualitative evaluation is conducted by manually investigating the presence of key terms in attention maps, based on the hypothesis that the terms to which the model gives weightage in the task of readers' emotion detection are emotion words and named entities. Table 6 shows two sets of attention maps generated through our model with their associated ground truth (ep_r) and predicted (êp_r) emotion profiles. Color intensities over the words in attention maps indicate the weightage associated with the words, i.e., dark red indicates high weightage, whereas light red indicates less weightage. In the first set of attention maps, we include samples whose predicted emotion profiles are very near to the ground truth; hence we categorize them as correct predictions. The first attention map in the correct predictions set shows that a high-intensity weightage is given to the word 'attack' and then to the words 'hiding' and 'threats' with a slight weightage decay, which explains the nearness of the predicted emotion profile to the ground truth. That is, higher values are seen to peak around the emotions fear, sadness, and anger for both the predicted and ground truth emotion profiles, which illustrates the close relationship between the words recognized by attention and the emotions. Similarly, many other attention maps in the correct predictions set show a substantial weightage for emotion words; for example, the words 'pain', 'suffer', and 'poisoning' in the fourth attention map are associated with a predicted emotion profile dominated by sadness. Also, the fifth attention map highlights words such as 'shining', 'better', 'care', and 'empowering', which may be the reason for predicting high intensities for the emotions surprise and joy. Next, we observe the weightage associated with named entities in the attention maps.
In the correct predictions set, we identify that many named entities like 'Korean', 'Pakistan', 'Lillard', 'Ines Fernandez', etc., are highlighted with varying weightages. For example, in the sixth attention map, we believe that the word 'Lillard' (the name of an American basketball player) may also have contributed to a high intensity for the emotion joy in some readers and anger in others, besides other words with an attention weightage. From the perspective of such qualitative analyses, we infer that attention gives high weightage to emotion words, and nearly as much to named entities, for the task of readers' emotion detection. In contrast to the first set of correct predictions, we also include a few random samples from incorrect predictions and their attention maps in Table 6 as the second set. By incorrect predictions, we refer to predictions that are far away from the patterns of the ground truth emotion profiles. Here too, we can observe that attention maps highlight a few emotion terms and named entities such as 'danger', 'killed', 'Antonio', etc., but they miss most of the relevant ones. For example, in the first attention map among the incorrect predictions, attention gives zero weightage to the word 'attackers', which we believe has enough power to predict high intensities for the emotions anger, fear, and sadness, similar to the ground truth emotion profile. Apart from this kind of exclusion of key terms (i.e., emotion words and named entities), we also identify that most of the incorrect predictions assign high weightage to many words like 'says', 'year', 'almost', 'since', etc. Hence, we believe that a major reason for the increased gap between predicted and ground truth emotion profiles is the exclusion of emotion words and named entities, together with the assignment of high weightage to many less significant words in the document.

Table 6 Sample attention maps (ep_r: ground truth, êp_r: predicted)
This correlation between the focus on emotion words and named entities and the measures of performance further reasserts the value of emotion words and named entities in the readers' emotion detection task.

Quantitative evaluation
From the above-mentioned qualitative evaluations, we observe that attention maps give weightage mostly to emotion words and named entities. Hence, in this section, we bring forth a novel set of evaluation measures to quantify the presence of emotion words and named entities in predictions. To this end, apart from the machine attention maps generated internally by our model, we devise external attention maps that highlight emotion words and named entities by leveraging external information (e.g., lexicons). To generate external lexicon-based attention maps, we initially identify three popular emotion lexicons, NRC-Affect Intensity Lexicon [29], EmoWordNet [30], and DepecheMood++ [31], and compute the lexicon coverage for unique words in the datasets used in our study; the results are shown in Table 7. We can observe that both DepecheMood++ and EmoWordNet give better coverage; hence we choose these two lexicons for our quantitative studies, the details of which follow. Further, to identify named entities, we use an external tool, specifically the Named Entity Recognizer (NER) from spaCy 13. The construction of the extrinsic attention maps will be evident through the definitions that follow.

Definition 1 (DAM) This is the internal Document Attention Map produced by the model for each input document (from the attention layer), represented as a vector with intensity values or weightage associated with each word, which indicates the attention received by that word during prediction. If the weightage of words in the attention map is continuous, then it is called a continuous attention map; DAM is generally a continuous representation. But if the weightage is either 0 or 1, indicating the absence or presence of attention for a certain word, then it may be called a binary attention map.

Definition 2 (EmoNE-EAM) This External Attention Map is independent of the DAM (and thus, of the Bi-LSTM + Attention model) and is generated with the help of an emotion lexicon (we use [30,31]) and a Named Entity Recognizer (we use NER from spaCy).
To create EmoNE-EAM, we read each word in the document sequentially and set the attention weightage of the word to the boolean value 1 if it is an element of the emotion lexicon or is recognized by the NER, else to 0. This map is a binary representation that indicates only the presence of emotion words and named entities in the document.

The above attention maps provide us a convenient platform to measure the impact of emotion words and named entities in the prediction. For computational convenience, we accomplish this by contrasting the extent of deviation between the EAM (external attention map) and the HAM (hybrid attention map). We quantitatively measure the impact of emotion words and named entities in prediction by finding the overlap between the HAM and EAM, using three measures, namely, behavioral similarity, word similarity, and word probability.

where |D′| indicates the number of documents that do not have any emotion words or named entities. This measure computes the similarity between the emotion words and named entities identified by our attention mechanism on one side, and the total emotion words and named entities present in the document on the other.

⋄ Word Probability: A measure that uses the boolean intersection between the binary EmoNE-HAM and EmoNE-EAM to quantify how many emotion words and named entities are identified by the attention mechanism during prediction, among the total number of emotion words and named entities present in the document. Unlike the previous similarity scores, this measure is represented as a probability. We compute the word probability WordProb D for corpus D by averaging the word probabilities of all the documents.
where the indicator is 1 only if EmoNE-EAM = 0, and 0 otherwise.

Experimental results of the quantitative evaluation of model behavior for the three datasets are illustrated in Table 8. In the case of behavioral similarity, the highest score of 0.8829 is observed for the model trained on the REN-10k dataset, and the lowest score of 0.6988 is observed for the model trained on RENh-4k (this is still greater than 0.5), indicating a good amount of similarity between the model generated and external attention maps. Word similarity scores also show a good amount of similarity between these attention maps, where the model trained on REN-10k obtains the highest score of 0.8296 and the model trained on RENh-4k obtains the lowest score of 0.6606. For word probability, the highest score of 0.9043 is observed for the model trained on REN-10k and the lowest score of 0.7205 is observed for the model trained on RENh-4k, which indicates that attention captures a significant share of the emotion words and named entities to make the predictions. The promising and consistent results observed for the datasets over all the evaluation measures for both lexicons indicate that our attention mechanism relies heavily on emotion words and named entities for predicting readers' emotion profiles. The general trend of scores decaying from REN-10k to RENh-4k reflects the prediction performance of the models trained on these datasets as shown in Tables 2, 3, and 4, i.e., REN-10k gives the best prediction results whereas RENh-4k gives comparatively low prediction results in the model performance analysis, and the quantitative behavior evaluation scores follow the same pattern.
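The construction of EmoNE-EAM and the word probability measure can be sketched as follows. This is a minimal illustration under stated assumptions: a plain set of strings stands in for the spaCy NER output, the binary model-side map is obtained by thresholding the DAM at a hypothetical value (the construction of the HAM is not reproduced here), and documents without any emotion words or named entities are simply skipped when averaging. All names and values are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def build_eam(tokens: Sequence[str], emotion_lexicon: Set[str],
              named_entities: Set[str]) -> List[int]:
    """Binary external map: 1 for emotion words and named entities, else 0."""
    return [1 if (t.lower() in emotion_lexicon or t in named_entities) else 0
            for t in tokens]


def binarize(dam: Sequence[float], threshold: float) -> List[int]:
    """Binary model-side map from continuous attention (threshold is an assumption)."""
    return [1 if w >= threshold else 0 for w in dam]


def word_probability(model_map: Sequence[int], eam: Sequence[int]) -> Optional[float]:
    """Share of emotion words and named entities that also receive attention."""
    total = sum(eam)
    if total == 0:
        return None  # no emotion words or named entities in this document
    overlap = sum(m & e for m, e in zip(model_map, eam))
    return overlap / total


def corpus_word_probability(model_maps: Sequence[Sequence[int]],
                            eams: Sequence[Sequence[int]]) -> float:
    """Average over documents, skipping those without key terms (assumed handling)."""
    values = [wp for wp in (word_probability(m, e) for m, e in zip(model_maps, eams))
              if wp is not None]
    return sum(values) / len(values) if values else 0.0


# Toy document with a hypothetical lexicon, entity set, and attention weights.
tokens = ["Lillard", "suffers", "pain", "since", "Monday"]
eam = build_eam(tokens, emotion_lexicon={"pain", "suffers"}, named_entities={"Lillard"})
model_map = binarize([0.30, 0.05, 0.40, 0.20, 0.05], threshold=0.15)
wp = word_probability(model_map, eam)
print(eam, model_map, wp)
```

In this toy document, attention covers two of the three key terms, so the word probability is two thirds; a score near 1 would mean that nearly all emotion words and named entities receive attention.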

Conclusion
In this paper, we explored a Bi-LSTM + Attention model to predict the emotion profiles of readers towards short-text documents. The simple design of our method ensures generalizable operation and allows a detailed evaluation of model behavior to draw reusable insights, especially those oriented towards assessing the interpretable nature of the attention mechanism for the task of readers' emotion detection. To perform the experiments, we procured two new readers' emotion news datasets, REN-10k and RENh-4k, that can aid extensive studies in the future. Apart from our datasets, we also utilized the benchmark SemEval-2007 dataset. Our first phase of experiments, on model performance evaluation using various coarse-grained and fine-grained measures, shows that Bi-LSTM + Attention outperforms the baselines belonging to different categories of emotion detection, including deep learning, lexicon based, and classical machine learning, with remarkable gains. We also performed model behavior evaluations using a novel set of qualitative and quantitative methods to interpret the workings of the attention mechanism; these studies indicate that emotion words and named entities significantly influence readers' emotion detection.

Future directions
Given that our study indicates that emotion words significantly influence readers' emotion detection, we plan to explore the scope of emotion-specific embeddings in combination with Bi-LSTM + Attention and transformer based language models. Further, we plan to develop an improved version of our dataset (REN-20k) to handle dataset complexities due to contradictory emotions provided for the documents depending on readers'/annotators' votes. There is also large scope for further evaluation with a completely human-generated attention map (as in [22]), apart from the model generated and external attention maps, to build better computational models.