Comparative analysis of deep learning based Afaan Oromo hate speech detection

Social media platforms such as Facebook, YouTube, and Twitter are banking on machine learning models to help stop the spread of hate speech on their platforms. The idea is that machine learning models that utilize natural language processing will detect hate speech faster and better than people can. Although considerable progress has been made for resource-rich languages, only a few attempts have been made for Ethiopian languages such as Afaan Oromo. This paper examines the viability of deep learning models for Afaan Oromo hate speech recognition. To this end, the largest Afaan Oromo hate speech dataset was collected and annotated by language experts. Variants of deep learning models, namely CNN, LSTM, BiLSTM, GRU, and CNN-LSTM, are examined to evaluate their viability in identifying Afaan Oromo hate speech. The results show that the models based on CNN and BiLSTM outperform all the other investigated models, with an average F1-score of 87%.

identifying the connections between online racist speech and physical world consequences and thus the research in this area remains limited.
Social media giants such as Facebook have been criticized for allowing their platforms to be used as vehicles for hate speech, with minimal moderation of non-English pages, which has caused recurrent problems in sharply polarized countries such as Ethiopia.
In recent years, the increasing propagation of hate speech on social media and the urgent need for effective countermeasures have drawn significant investment from governments, companies, and researchers. Social media platforms rely on a combination of artificial intelligence, user reporting, and staff known as content moderators to enforce their rules regarding appropriate content. However, problems arise when a platform's artificial intelligence is poorly adapted to under-resourced languages and companies have invested little in staff fluent in them.
Aiming to classify textual content as hate or non-hate speech, a large number of methods have been developed for resource-rich languages such as English, Chinese, French, and others [2][3][4][5][6][7][8] using statistical and machine learning approaches. Some research conducted on those languages [9] shows that detection models built with deep learning achieve state-of-the-art accuracy. However, only a few attempts have been made for Ethiopian languages [10][11][12][13].
This paper introduces an automatic classification of social media posts that contain Afaan Oromo text into hate speech or not hate by leveraging variants of deep learning approaches. First, a corpus of comments and posts retrieved from Facebook and Twitter is built, and then features are extracted using word n-grams and word embedding techniques such as Word2Vec [14]. Then, variants of deep learning models such as convolutional neural networks (CNN), long short-term memory networks (LSTM [15]), bidirectional long short-term memory networks (BiLSTM [16]), GRU, and CNN-LSTM are used for hate speech detection. The experimental results reveal that the models built using CNN and BiLSTM achieve the highest weighted average F1-score of 87%. This study also assesses the impact of incorporating augmented data into the training dataset. The results show that textual data augmentation improves the model F1-score by up to 3%. In light of the gap in this research area, the contributions described in this paper are the following:
• Develop the largest labeled Afaan Oromo hate speech classification dataset of its kind.
• Investigate the accuracy of five state-of-the-art deep learning models at detecting hate speech for a resource-scarce language, i.e., Afaan Oromo. The output of the experiment provides insight into their detection accuracy and their capability in using pre-trained models and text data augmentation, and offers important guidelines for their deployment in real-world applications.
• Assess the impact of adding augmented textual data on Afaan Oromo hate speech classification performance.
• Compare the impact of using a pre-trained Word2Vec model with that of one trained directly with the hate speech classification model.
• Build a pre-trained word embedding model, which is useful for other works in this area.
The rest of the paper is organized as follows. The "Background and related work" section reviews related work. The "Afaan Oromo Corpus creation and annotation" section presents the details of the dataset. The "Experiment and discussion" section reports on the approaches used in this study to classify hate speech content. Section 5 presents the results, accompanied by a detailed performance analysis. The "Conclusion" section concludes the paper.

Word embedding
In Natural Language Processing (NLP) research, the aim is to build models that are capable of recognizing human languages. However, one challenge stands out: people communicate with words and sentences, whereas deep learning models expect numbers as input. To make textual data ready for processing, it must be transformed into a sequence of vectors. One of the techniques commonly employed in NLP to transform textual data into real-valued vectors is called word embedding. Word embedding techniques learn a real-valued vector representation for a predefined fixed-sized vocabulary from a corpus of text [17]. The vector representation is learned either separately or jointly with the neural network model on some task, using document statistics. Pre-trained word embeddings are publicly available for researchers. For instance, Facebook provided the fastText model, and Google provided several BERT models for different languages. Word embedding models have been widely used in NLP research areas such as hate speech detection, sentiment analysis, machine translation, storage optimization, and security [2][18][19][20]. In the literature, numerous word embedding techniques have been introduced, for instance, Word2Vec [17], GloVe [21], and FastText [22]. The main idea behind these models is that word co-occurrence probabilities have the potential to encode some type of semantic relationship between words (refer to [23] for a detailed description of all those techniques).
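As a rough illustration of the text-to-vector transformation described above, the following minimal Python sketch maps a tokenized sentence to a sequence of vectors through a lookup table. The toy English tokens, the 4-dimensional vectors, and the helper names (`build_vocab`, `init_embeddings`, `embed`) are hypothetical and are not taken from this study; in practice the vectors would be learned from a corpus rather than drawn at random.

```python
import random

def build_vocab(sentences):
    """Assign an integer index to every distinct token; index 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """Create a (vocab_size x dim) table of random vectors as a stand-in for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sentence, vocab, table):
    """Replace each token with its vector; tokens outside the vocabulary map to the <unk> row."""
    return [table[vocab.get(token, 0)] for token in sentence.split()]

# Toy corpus (placeholder tokens, not Afaan Oromo data).
corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
table = init_embeddings(len(vocab), dim=4)

# "bird" is not in the vocabulary, so it is mapped to the <unk> vector.
vectors = embed("the bird sat", vocab, table)
```

Each sentence thus becomes a list of fixed-length vectors, which is the form of input a neural network expects.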

Related work
The introduction of social media such as Facebook and other community forums has simplified the way people communicate and express their thoughts; however, it has also brought the problem of hate speech. Online platforms provide a space for discourse that is harmful to certain groups of people. Recently, hate speech has come to be considered a serious problem by the authorities of different countries. To minimize the impact of hate speech, various studies have been conducted on detecting it. The majority of this research has been done in resource-rich languages such as English [2,3], Spanish [24], and Chinese [7,8]. The most recent studies on hate speech detection have proposed the use of deep learning techniques for classification [4][5][6]. Aiming to distinguish instances of hate speech, Davidson et al. [7] utilized unigram, bigram, and trigram features weighted with TF-IDF, together with a part-of-speech tagger, as features for a machine learning model. A Twitter dataset of 33,458 English tweets was used for the investigation. Their experiment using logistic regression with a linear Support Vector Machine (SVM) yielded an overall precision of 0.91, recall of 0.90, and F1-score of 0.90. Given the simplicity of their model, this finding is notable. The authors in [25] also investigate a variety of hate categories to distinguish the kind of hate in Italian text. By leveraging morpho-syntactical features, sentiment polarity, and word embedding lexicons, they design and implement a classification model using SVM and Long Short-Term Memory (LSTM). To train the LSTM architecture, they represent each word with a 262-dimensional vector. Their findings reveal that the detection model built using an LSTM has an F1-score of 72%.
A study introduced in [26] developed an Apache Spark-based model to classify Afaan Oromo Facebook posts and comments into hate and not hate. The authors employed Random Forest and Naïve Bayes as learning algorithms and Word2Vec and TF-IDF for feature extraction, using 6,120 posts and comments (4,882 to train the model and 1,238 for testing). In their experiment, Naïve Bayes and Random Forest achieved accuracies of 79.83% and 65.34%, respectively, with the Word2Vec feature vector modeling approach. However, they recommended expanding the classification categories to cover different aspects of hate and increasing the corpus size by including other sources.
A study in [27] designed a system for hate speech text classification on Twitter using a CNN model with a dataset of 6,655 tweets in total. The classifier assigns each tweet to one of four predefined categories: racism, sexism, both (racism and sexism), and neither. The authors created two CNN models based on different input vector sets: Word2Vec vectors and a baseline of randomly generated vectors. The system based on Word2Vec word vectors performed best overall, with an F1-score of 78.3%. However, this result is not sufficient, and the system wrongly identified some non-hate speech tweets as hate speech. In particular, the system was not able to properly identify the category of both racism and sexism.
Likewise, the research introduced in [28] used a Convolutional Neural Network classifier with word embeddings as features on the Hate Speech Identification dataset distributed via CrowdFlower. The authors used 24,783 English tweets that have been classified into three classes: hate, offensive language, and neither. They also used a publicly available 300-dimensional Word2Vec word embedding pre-trained with a skip-gram model on 3 billion words from Google News. The final model resulted in an F-measure of 90%. However, the model incorrectly identified some non-hate speech as hate speech. Also, the majority of the hate class is misclassified, while the majority of the offensive class is correctly identified.
A study in [29] utilized the Bidirectional Long Short-Term Memory method and the Word2Vec feature extraction method with the continuous bag-of-words (CBOW) architecture to detect hate speech in Indonesian tweets. After applying the Word2Vec model and setting the number of epochs to 10, the learning rate to 0.001, and the number of hidden neurons to 200, their model achieves an F1-score of 96.29%. They also found that adding more LSTM layers can increase the accuracy by 2.27%. A Gated Recurrent Unit (GRU) was implemented in [30] to classify Bengali comments on Facebook as Hate Speech, Communal Attacks, Inciteful, Religious Hatred, Political Comments, and Religious Comments. The authors introduce an annotated corpus of 5,126 Bengali comments belonging to these six classes. Their experiment shows that a classification model built using GRU can achieve an accuracy of 70.10%.

Afaan Oromo Corpus creation and annotation
The Afaan Oromo text dataset for Afaan Oromo hate speech detection (AHSD), which is the main focus of analysis in this paper, was retrieved by the authors from comments and posts published on Facebook and Twitter from January 2019 to June 2019.

Corpus collection
This work targets Facebook pages and Twitter accounts that are open to suspected hate speech rather than websites or blogs that already have a specific agenda. In Ethiopia, social network communities commonly post on political and religious issues. A summary of the Facebook pages and Twitter accounts used to build the corpus is provided in Table 1. The pages listed in Table 1 typically host discussions on political, social, economic, religious, and environmental issues in Ethiopia.
In total, 35,200 posts and comments were collected. In order to remove the noise from the data set, rigorous preprocessing was carried out, which resulted in the removal of HTML, URLs, tags, emoticons, and other language scripts. By applying the data augmentation approach outlined in [31], the total dataset size was increased to 42,100. Detailed statistics of the balanced Afaan Oromo dataset categorized into 4 classes (i.e., NEUTRAL, HATE, OFFENSIVE, and BOTH) are provided in Table 2.

Annotation guideline
Annotation is an integral part of the development of text classification systems. Annotated data provides useful quantitative information about the occurrence of certain contextual features. For Afaan Oromo, there is no standardized and labeled corpus for hate speech detection. In hate speech detection, dataset annotation can be performed either manually by experts or through crowd-sourcing. The work in [32] shows that annotations generated by expert annotators outperform the crowd-sourcing strategy. The reason is that cultural norms play an important role in how hate is expressed and in whether or not people perceive something as hate speech. The authors also believe that testing for inter-annotator reliability may help alleviate these biases in some cases but will not necessarily neutralize the impact of views on what is hateful or not, which may be shared by the majority of annotators. For this reason, Afaan Oromo speakers from different corners of Ethiopia were engaged in the annotation process. Each annotator labeled a text as one of the four classes shown in Table 2. To ensure consistency, each annotator was provided with the following guidelines (or rules).
A post has been marked as Hate:
• If a post/comment refers to the alleged inferiority or superiority of some target groups.
• If a post/comment targets different characteristics of a person and motivates the audience to take action or commit a violation.
• If a post/comment contains a stereotype, that is, an over-generalized belief about a given target.
• If a post/comment accuses or condemns people based on their target group.
A post has been marked as Offensive:
• If a post or comment contains violent or insulting words but it is not possible to explicitly identify a target group in the post/comment.
• If a post or comment contains defamation, which is a false accusation against a person or an attack on a person's character.
• If a post or comment contains insulting, dirty, disgusting, or upsetting words but does not motivate people to take action.
A post has been marked as Both (i.e., Hate and Offensive):
• If a post or comment combines hateful expression with the use of insults, threats, or derogatory terms toward a target group.
The structure of the Afaan Oromo Hate Speech Detection system is depicted in Fig. 1. As shown in Fig. 1, the system has three main components, i.e., the data preprocessor, the feature extractor, and the deep learning models. First, the system accepts a sequence of tokens as input, which passes through the data preprocessor module that discards URLs, tags, and other unnecessary inputs. Using either a pre-trained Word2Vec model or a Word2Vec model trained together with the classifier, the feature engineering component then extracts the relevant feature vectors that are used for classification. Finally, after repeated training, the deep learning model takes the vectors that correspond to the input and outputs the corresponding class label.

Data preprocessor
As shown in Fig. 1, the data preprocessor module is responsible for tokenization and for removing Twitter profile tags, hashtags, URLs, emojis, and stop words (words that do not add any meaning or value to the sentence).
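The cleaning steps listed above can be sketched with regular expressions from the Python standard library. This is a minimal illustration under stated assumptions, not the preprocessing code used in this study: the patterns, the function name `preprocess`, and the two-word stop-word list are hypothetical placeholders, and a real Afaan Oromo stop-word list would have to be supplied.

```python
import re

# Hypothetical placeholder stop words; NOT a real Afaan Oromo stop-word list.
STOP_WORDS = {"foo", "bar"}

HTML_RE = re.compile(r"<[^>]+>")                 # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # URLs
MENTION_RE = re.compile(r"@\w+")                 # Twitter profile tags
HASHTAG_RE = re.compile(r"#\w+")                 # hashtags
# Assumption: keep only Latin letters, apostrophes, and whitespace, which
# also drops emojis, digits, punctuation, and other scripts.
NON_WORD_RE = re.compile(r"[^a-zA-Z'\s]")

def preprocess(text, stop_words=STOP_WORDS):
    """Strip noise from a post and return its lower-cased tokens without stop words."""
    for pattern in (HTML_RE, URL_RE, MENTION_RE, HASHTAG_RE, NON_WORD_RE):
        text = pattern.sub(" ", text)
    tokens = text.lower().split()  # whitespace tokenization
    return [token for token in tokens if token not in stop_words]

sample = "<b>Hello</b> @user check https://example.com #tag foo world \U0001F600!"
tokens = preprocess(sample)
```

The order matters in this sketch: URLs, mentions, and hashtags are removed before the generic character filter, since the filter would otherwise break them into fragments that survive as tokens.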

Feature engineering
Once the raw data has undergone preprocessing, the feature engineering component transforms each token into its corresponding embedding vector (a vector that summarizes a given token in a new representation that captures contextual similarity). Deep neural networks can capture features automatically without human intervention, but they are designed to learn from numerical data such as word vectors. Among the available word embedding techniques, the study in [33] found Word2Vec [14] to be an effective distributed representation for hate speech detection research. Therefore, this study uses Word2Vec, an algorithm that uses a neural network to learn word embeddings. Its goal is to estimate each token's position in a multi-dimensional vector space based on the similarity between tokens. There are several variants of Word2Vec; among those, the continuous bag-of-words (CBOW) model was adopted for this study. The CBOW model learns the distributed representation by training a feed-forward neural network on word co-occurrences, with a language modeling objective of predicting a word from its context. The aim of training the CBOW model is to maximize the log-likelihood, which is calculated as follows:
\[ \mathcal{L} = \sum_{w \in C} \log p(w \mid w_c) \tag{1} \]
where w in Eq. 1 is the target word, w_c represents the sequence of words in its context, and the sum runs over the words of the training corpus C. The Word2Vec model can be used in two ways: (1) pre-training it and using it as an input layer at the beginning of the model architecture, or (2) training it with the model itself. The impact of both approaches on detecting Afaan Oromo hate speech is assessed using Python together with Keras and TensorFlow. Overall, the use of an embedded representation has two implications for the proposed hate detection models: (1) it reduces the dimension of the input to a k-size vector, and (2) it is a more expressive representation, as it captures the contextual resemblance and semantic sequence of the data.
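To make the CBOW objective more concrete, the sketch below computes log p(w | w_c) for a few toy training examples by averaging the context vectors and applying a softmax over the vocabulary. It is an illustration under stated assumptions rather than the training code of this study: the vocabulary size, vector dimension, random vectors, and function names are hypothetical, and practical Word2Vec implementations usually replace the full softmax with hierarchical softmax or negative sampling.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_log_prob(target, context, in_vecs, out_vecs):
    """Return log p(target | context) for a single training example."""
    dim = len(in_vecs[0])
    # Hidden layer: average of the input vectors of the context words.
    hidden = [sum(in_vecs[c][k] for c in context) / len(context) for k in range(dim)]
    # Output layer: one score per vocabulary word, normalized with softmax.
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in out_vecs]
    return math.log(softmax(scores)[target])

# Hypothetical toy setup: 6 vocabulary words, 4-dimensional vectors.
rng = random.Random(0)
V, D = 6, 4
in_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
out_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

# Each example is (target word index, list of context word indices).
examples = [(2, [0, 1, 3, 4]), (3, [1, 2, 4, 5])]

# Log-likelihood of the toy corpus: the quantity that training maximizes.
log_likelihood = sum(cbow_log_prob(t, c, in_vecs, out_vecs) for t, c in examples)
```

Training would adjust `in_vecs` and `out_vecs` by gradient ascent on `log_likelihood`; the rows of `in_vecs` are then used as the word embeddings.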

Feature learning and classification using deep learning models
Once the feature engineering module produces the embedding vectors, the next step is feature learning and classification. In this study, five deep learning models are selected for comparison.
• CNN is a class of deep learning model that uses convolutional layers and max pooling or max-over-time pooling layers to extract higher-level features.
• LSTM is a powerful kind of RNN used for processing sequential data such as sound, time series (sensor) data, or written natural language.
• BiLSTM is a bidirectional LSTM that processes the input sequence in both the forward and the backward direction.
• GRU is similar to long short-term memory (LSTM) with a forget gate, but has fewer parameters than LSTM, as it lacks an output gate.
• CNN-LSTM (CONV-LSTM) is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions.
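The parameter comparison in the list above can be made concrete with the textbook formulas for a single recurrent layer: an LSTM layer has four weight blocks (input, forget, and output gates plus the candidate state) and a GRU layer has three. The sketch below is only illustrative; the input dimension of 100 and the 64 hidden units are hypothetical values, not the settings of Table 3, and exact counts differ slightly between implementations (for example, some GRU implementations keep two bias vectors per block).

```python
def recurrent_block_params(input_dim, hidden_dim):
    """Parameters of one gate or candidate block: input weights, recurrent weights, and a bias."""
    return hidden_dim * (input_dim + hidden_dim) + hidden_dim

def lstm_params(input_dim, hidden_dim):
    # Input gate, forget gate, output gate, and candidate cell state.
    return 4 * recurrent_block_params(input_dim, hidden_dim)

def gru_params(input_dim, hidden_dim):
    # Update gate, reset gate, and candidate hidden state.
    return 3 * recurrent_block_params(input_dim, hidden_dim)

def bilstm_params(input_dim, hidden_dim):
    # One LSTM reads the sequence forward and a second one reads it backward.
    return 2 * lstm_params(input_dim, hidden_dim)

# Hypothetical sizes chosen only for illustration.
EMBEDDING_DIM, HIDDEN_UNITS = 100, 64
counts = {
    "LSTM": lstm_params(EMBEDDING_DIM, HIDDEN_UNITS),
    "GRU": gru_params(EMBEDDING_DIM, HIDDEN_UNITS),
    "BiLSTM": bilstm_params(EMBEDDING_DIM, HIDDEN_UNITS),
}
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, while a BiLSTM layer has twice as many as a unidirectional LSTM.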
After grid search, the optimal hyperparameters selected for those models are summarized in Table 3.

Evaluation setup
Five-fold cross-validation is used to train and test the models. That is, in each fold, 33,680 of the 42,100 total sentences are used for training and the remaining 8,420 are used for testing. In partitioning the data, due consideration was given to avoiding overlap between folds and to preserving the class distribution of the training set as closely as possible. For every deep learning model investigated in this study, training was carried out independently, using a training set for parameter optimization and a development set for validation. The performance reported in this study is the average of the results obtained over the five folds. Three evaluation metrics are adopted in this work: precision, recall, and F1-score. The equations for these three metrics are shown in Table 4, where "TP", "FP", "TN", and "FN" represent True Positive, False Positive, True Negative, and False Negative, respectively. The average test results over the five folds are reported in Figs. 4 and 5.
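The evaluation protocol described above can be sketched in plain Python as follows. The block shows one possible way to build class-preserving folds and to compute support-weighted precision, recall, and F1-score from the TP, FP, and FN counts. The function names and the toy labels are hypothetical, and the fold assignment (round-robin within each class) is an assumption, since the exact partitioning procedure is not specified here. With 42,100 sentences, each of the five folds would hold 8,420 test sentences and leave 33,680 for training.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k=5):
    """Split example indices into k disjoint folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds

def weighted_prf(y_true, y_pred):
    """Per-class precision, recall, and F1, averaged with class-support weights."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n / len(y_true)
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Toy labels using the four class names of the dataset.
labels = ["NEUTRAL", "HATE", "OFFENSIVE", "BOTH"] * 10
folds = stratified_folds(labels, k=5)
test_idx = folds[0]                                   # one fold for testing
train_idx = [i for fold in folds[1:] for i in fold]   # four folds for training

y_true = ["HATE", "HATE", "NEUTRAL", "NEUTRAL", "OFFENSIVE", "BOTH"]
y_pred = ["HATE", "NEUTRAL", "NEUTRAL", "NEUTRAL", "OFFENSIVE", "HATE"]
scores = weighted_prf(y_true, y_pred)
```

In a full run, the scores of the five test folds would be averaged to obtain the figure reported for each model.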

Results and discussion
This section presents the results of the experiments conducted with the different deep learning models. As our primary goal is to compare the performance of the models under different circumstances, the results for each model are presented in separate tables. Four series of experiments are conducted using five different deep learning models. The first involves the case where the word embedding is pre-trained (see Figure xxx), whereas the second is the case where the word embedding is trained together with the model itself. The third and fourth experiments are conducted to assess the impact of data augmentation on classification performance. Figure 2 shows the weighted average results of the experiments carried out with the five deep learning models when the pre-trained word embedding is used for feature extraction. As can be seen in Fig. 2, BiLSTM and CNN achieved the best performance (with a weighted average F1-score of 87%). The average F1-scores of CNN-LSTM, GRU, and LSTM are 85%, 86%, and 82%, respectively. The results obtained when the word embedding is trained together with the model are shown in Fig. 3. As can be seen in Fig. 3, training the embedding together with the model slightly increases the classification performance of the investigated models, by 1.5% on average. The authors believe that training the embedding together with the model allows the model to optimize the embedding parameters for the task, and hence improves classification performance.
Overall, three main observations can be made by comparing the experimental results of the neural networks. First, a model trained with an embedded representation is able to capture syntactic and semantic relations between Afaan Oromo words. Second, the data augmentation mechanism improves the performance of the hate detection models. Finally, bidirectional long short-term memory (BiLSTM) achieved the highest F1-score of all classifiers used in our experiments.

Conclusion
This paper presents an empirical evaluation of five deep learning models (i.e., CNN, LSTM, GRU, BiLSTM, and CNN-LSTM) for detecting Afaan Oromo hate speech. First, the largest Afaan Oromo corpus for hate speech detection was prepared. The data used to train and test the models was collected from Twitter and Facebook pages. The findings show that the best performance was achieved by the BiLSTM, with a weighted classification F1-score of 91%. Moreover, the research compared the effect of using a pre-trained embedded representation with that of one trained together with the model. From the experiments, the authors conclude that training an embedded representation together with the model and incorporating augmented samples enhance the classification performance of all the investigated models. Considering the size of the dataset investigated in this paper, the performance of the deep learning models at detecting Afaan Oromo hate speech is promising. In future work, we would like to investigate the performance of classifier ensembles and meta-learning for this task. Future research also needs to consider a mechanism to incorporate divergent opinions. In addition, the performance is not perfect, which means that users will still encounter misclassified content. A comparative analysis with models built for other languages would also be of interest and is another future direction.