Argument annotation and analysis using deep learning with attention mechanism in Bahasa Indonesia

Argumentation mining is a research field that focuses on argumentative sentences. Argumentative sentences are often used in daily communication and play an important role in any decision-making or conclusion-drawing process. The objective of this research is to examine the use of deep learning combined with an attention mechanism for argument annotation and analysis. Argument annotation is the classification of argument components in a discourse into several classes: major claim, claim, premise, and non-argumentative. Argument analysis concerns the characteristics and validity of arguments arranged around one topic. One such analysis assesses whether an established argument is sufficient or not. The dataset used for argument annotation and analysis consists of 402 persuasive essays. The data was translated into Bahasa Indonesia (the mother tongue of Indonesia) to give an overview of how the approach works for a language other than English. Several deep learning models, namely CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit), are used for argument annotation and analysis, while HAN (Hierarchical Attention Network) is used only for argument analysis. The attention mechanism is combined with each model as a weighted access setter for better performance. Across all experiments, the combination of deep learning and attention mechanism for argument annotation and analysis achieves better results than previous research.


Introduction
contains opinion which is completed by supporting statements, it can be categorized as an argument.
An argument consists of several components, which form a structure based on the argumentative relations between them [1]. The formulation of argumentation schemes for presumptive reasoning was one of the pioneering works in this field [2]. These schemes have been used in several argumentation mining studies, one of which addressed essay scoring [3]. The variety of predefined argument schemes creates a further need to define features for automatic classification. One study defined 5 groups of features as the characteristics of an argument component [4] and achieved 77.3% accuracy using a support vector machine (SVM) as the classifier. Figure 1 describes the argument scheme used in that research.
The data came from persuasive essays. Argument components consist of 4 types of statements: major claim, claim, premise, and non-argumentative. As a continuation of this research, many additional features were defined and grouped into 8 groups of features [6]. Structural and contextual features were identified as the most significant features for characterizing an argument.
Researchers have studied argumentation mining from various perspectives, so research in this field appears in many areas. For example, argument component detection has been applied to legal documents [7], while other researchers used it for public policy formulation [8]. In addition to feature extraction and machine learning, the rule-based approach, which is commonly used in NLP research, was also used as an indicator to classify argument components. A rule-based approach was combined with a probabilistic sequence model to automatically detect high-level organizational elements in argumentative discourse [9]. A slightly different, ontology-based approach was used to detect argument components, and the result could be used in automatic essay scoring [10]. At a more comprehensive level of argumentation, research needs to look not only at the argument components but also at techniques capable of measuring the validity of an argument. The sufficiency of an argument has been measured using a support vector machine (SVM) and a convolutional neural network (CNN) [11]. In line with that research, the persuasiveness level of arguments in an online forum was estimated [12]. Furthermore, other research worked on predicting the convincingness level of an argument [13]. Because validity or quality was being assessed, more than one statement was required: several statements from one argumentative discourse were observed to quantify argument validity.
Machine learning has evolved from statistical approaches to a more semantically aware system called deep learning. Many researchers have applied deep learning to conventional tasks that initially relied on traditional feature engineering, expecting that it can eliminate this tiresome process [14]. They believed that, through thousands of nonlinear tensor computations, deep learning is able to extract features automatically. Deep learning has won many contests in pattern recognition and machine learning and can outperform other machine learning algorithms [15]. Many research results show the superiority of deep learning over regular machine learning; for instance, a convolutional neural network performed better than machine learning techniques, especially for NLP tasks [16]. We therefore believe that deep learning can achieve better performance in argumentation mining, as it has in the aforementioned NLP tasks.

Fig. 1 Relationship between argument components [5]

Argumentation mining
Argumentation mining is a field of study that focuses on argument extraction and analysis from a natural language text [17]. Argumentation mining has 2 phases: argument annotation and argument analysis [18].

A. Argument annotation
A fundamental task in managing arguments is to locate arguments in documents. For this purpose, many supervised machine learning methods are used. The approach is to classify statements as argument components or non-argument components.
Data from several sources, such as magazines, advertisements, parliamentary notes, and judicial summaries, were collected and stored in a database [19]. As a continuation, a software tool named Araucaria was built [20]. This software was used to analyze argumentation and presented the relations among arguments as a diagram. An initial analysis was conducted on the existing corpus [19] and continued with exploration in 2 areas: argumentation surface features and the argumentation schemes used [21].
A different investigation examined arguments in legal documents based on their rhetoric and visualization [7]. This research was based on feature extraction with 11 features. One of the feature sets involved 286 words.
A different approach to detecting argument components combined a rule-based method with a probabilistic sequence model [9]. The aim was to identify high-level organizational elements of argumentative discourse, also known as shell language. The rule-based part was defined using 25 handwritten regular expression patterns. Manual annotation without a standard guideline was applied to 170 essays by experts familiar with essay writing. The sequence model was built on Conditional Random Fields (CRF) using a number of general features based on lexical frequency. After evaluation, the hybrid sequence model was judged to have the best performance on the task.
Argument extraction was applied to support public policy formulation [8]. The results were used to help policy makers observe how society reacted to a policy. Tense and mood were the main features used as argument indicators.
Using an ontology approach, 8 rules were defined to identify arguments in statements [10]. The rules were defined from the researchers' intuition and an informal examination of 9 essays. In other research, argumentation schemes were used for essay scoring [3]. It was based on Walton's theory [2] with some adjustments. This research focused on how annotation protocols for argumentative essays were made. Annotation protocols were made for 3 argumentation schemes: the policy argument scheme, the causal argument scheme, and the argument from a sample scheme.
From another data perspective, researchers examined arguments in social media [22]. The work started by separating the statements in the dataset into 2 classes: statements that contain an argument and statements that do not. It continued with computation involving Conditional Random Fields (CRF).
Argument extraction from Greek news was also explored [23]. The technique used word embeddings extracted from a large unannotated corpus. One interesting conclusion was that word embeddings could contribute positively to extracting argumentative sentences.
Web sites contain unstructured and varied data. Argument extraction from websites was attempted as well [24]. In that research, a gold standard corpus of user-generated web discourse was built and tested directly with several machine learning algorithms.
As a continuation of research on binary classification, which classified statements into 2 classes (argument or not), researchers tried to formulate specific categories of argumentative statements. Generally, 2 classes were defined: claim and premise. Apart from those classes, various other names and definitions existed.
A corpus with claim and evidence labels was built by extracting argumentative statements from Wikipedia articles [25]. It has been used publicly to test many approaches. Another view held that all leaves of a tree were arguments [26]: premises and conclusions placed in relation to one another.
A new corpus of persuasive essays containing argumentative statements was created [5]. The corpus consisted of 90 essays labelled by 3 annotators and covered 3 argument components: major claim, claim, and premise. Statements not categorized as arguments were classified as non-argumentative, the 4th class. To describe how argument components relate to one another, 2 relation classes were defined: support and attack.
Based on the aforementioned corpus, features were formulated so that the annotated argument components could be recognized automatically [4]. All proposed features were categorized into 5 groups: structural, lexical, syntactic, indicator, and contextual. This approach achieved an accuracy of 77.3%. Other researchers took a closer look at the role of discourse markers, one of the features, in an argumentative corpus in German [27]. Across several experiments, discourse markers were found to be quite indicative in differentiating claims from premises. One study tried to combine all previously proposed features [28]. The results were better, yet the improvement was not significant.
Because a large and sparse feature space makes feature selection difficult, a more compact feature set was proposed [29]. Using the corpus of persuasive essays, n-grams and syntactic rules could be replaced by features and constraints based on extracted argument words and domain words. This significantly improved argument mining performance. After argument components were identified, post-processing was conducted using topic modelling, namely latent Dirichlet allocation (LDA), to extract argument words and domain words.
The analysis of argumentation categories was also enriched by contributions in fields such as debate technology and the assessment of argumentation quality. Given a context, automatic claim detection in a discourse was possible [30]. This technique was developed further by adding negation detection for each detected claim [31]. Following this research, evidence detection in unstructured text was also conducted [32], with experiments on data from a specified context. After claims and evidence were successfully detected, several approaches to determining the stance of a context-dependent claim were examined [33].
Claim and evidence cannot be separated in forming arguments. A claim without evidence has no meaning. For example, political debates contain many claims followed by evidence as data to support them. Given the need for argumentation summarization, an automatic argumentation summarizer for political debates was built [34]. Beyond political debates, an automatic summarizer for online debate forums was also developed [35].
In addition, research on argument mining was conducted on persuasive online discussions. A computational model that handles the micro and macro levels of argumentation was proposed [36]. Going further, argument generation was performed using a novel framework named CANDELA, through retrieval, planning, and realization [37]. Table 1 summarizes the existing work on argument annotation. To extend the state of the art in argument annotation research, we concentrate on using deep learning methods for this task.

B. Argument analysis
To assess the quality of arguments, not only extrinsic aspects but also intrinsic aspects need to be observed. This differs from categorization, where assessment can be done directly by observing the text (extrinsic aspects). Discourse markers, the main cue for distinguishing argumentative statements, are no longer valid for scoring argument quality; keywords used as discourse markers are not representative as evaluators. A good argument is one that convinces the reader that it is valid and strong. To address this issue, researchers started to propose approaches for measuring argument validity. The persuasiveness level of an argument can be estimated through feature extraction from discussions in an online forum [12]. Posting time and writer reputation were found to be useful metadata. Textual features gave worse results than argumentation-based features. If the data is an essay, argument quality can be assessed through the essay score. In addition to prompt adherence, coherence, and technical quality, argument strength can also be used to grade essays [38].
The large number of online communities has led to debates on many issues in blogs and forums. A combination of textual entailment and argumentation theory was used to extract arguments from debates, as well as their acceptability [39].
In other research, convincingness appeared as a new term in assessing argumentation quality [13]. The relations between arguments in a whole sequence of statements were assessed, and classification was applied based on those relations. The output identified which argument was more convincing and produced a list of arguments sorted by convincingness level. Another similar task in assessing argument quality examined whether the relation was sufficient or not [11]. Long Short-Term Memory (LSTM), one of the promising deep learning methods for text, was modified with a Siamese network to recognize argumentation relations in persuasive essays [40]. Furthermore, a Hierarchical Attention Network (HAN) with XGBoost was applied to a similar task and shown to be a promising method for hierarchical data [41]. Table 2 summarizes the existing work on argument analysis. In contrast to this existing work, we concentrate on using deep learning methods for argument analysis tasks.

Proposed methods
Argumentative statements are the main object of this research. The work starts by classifying statements into several types of argument components (argument annotation). Beyond that, argument relations are categorized as sufficient or not (argument analysis). These tasks are described in Fig. 2. Deep learning is used as the main method, together with an attention mechanism for better performance. Keras [42] was used as the main library in all stages, from preprocessing (such as tokenization, vocabulary processing, and indexing) to modeling. Experiments were conducted on a single NVIDIA TITAN X Pascal GPU, using 402 persuasive essays [6] that were translated manually into Bahasa Indonesia as the dataset.
Argument annotation and analysis are both classification tasks. The classes defined for each task are:

Argument annotation
This task classifies statements based on their argument type. Statements are classified into 4 classes: Major Claim (MC), Claim (C), Premise (P), and Non-Argumentative (N).

Argument analysis
This task looks at the relationship between arguments. Relationships are classified into 2 classes: Sufficient (S) and Insufficient (I). All experiments used the dataset (402 persuasive essays) translated into Bahasa Indonesia. FastText was used as the word vector representation. As an alternative, we also used an embedding layer without pre-trained word vectors (building the vectors from scratch) to compare performance. Previously, similar work was conducted on the English dataset [43] with GloVe as the word vector representation. This research continues that investigation for a specific language, Bahasa Indonesia. Figure 3 describes the whole process from input to output. Each word was stored in a dictionary and assigned an index, so each statement became a sequence of word ids. Indexing was done to improve performance and reduce complexity. All words represented by ids were then converted to vector representations. To compare results with similar tasks [4,6], we used the same cross-validation settings as the previous work. For argument component classification (argument annotation), tenfold cross validation was used, and the classification layer was a fully connected layer.
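The indexing step can be illustrated with a minimal sketch. It assumes whitespace tokenization with lowercasing, reserves id 0 for padding, and drops words that are not in the dictionary; the function names `build_vocabulary` and `encode` are hypothetical and are not taken from the original experiments, which relied on Keras preprocessing.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding


def build_vocabulary(statements):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for statement in statements:
        for token in statement.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # ids start at 1; 0 is padding
    return vocab


def encode(statement, vocab, max_len):
    """Turn a statement into a fixed-length id sequence (truncate, then pad)."""
    # Tokens missing from the vocabulary are skipped (a simplifying assumption).
    ids = [vocab[t] for t in statement.lower().split() if t in vocab]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

The resulting id sequences are what an embedding layer, whether trained from scratch or initialized with pre-trained vectors, would take as input.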
A similar workflow applies to argument analysis, as described in Fig. 3. The fundamental difference lies in the Hierarchical Attention Network (HAN) architecture, which is a hierarchical form of the attention mechanism. The attention mechanism process is visualized in Fig. 4. For argument analysis, 20 repetitions of fivefold cross validation were chosen as the evaluation scenario.
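The evaluation scenario of 20 repetitions of fivefold cross validation can be sketched as follows. Stratifying the folds by class label is an assumption made here for illustration; the text does not state whether the folds were stratified, and the function name `repeated_stratified_kfold` is hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold splits."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for label in sorted(by_class):
            indices = by_class[label][:]
            rng.shuffle(indices)  # a fresh shuffle for every repetition
            # Deal indices round-robin so each fold keeps the class ratio.
            for pos, idx in enumerate(indices):
                folds[pos % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)
```

With the default arguments this produces 100 train/test splits, and a reported score would then be the average over all of them.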
To identify the sufficiency of an argument, a theoretical framework was used [42]. This theory has been used in other research as well [6]. Argument quality has been measured in various ways, such as sufficiency [11], persuasiveness [12], convincingness [13], and acceptability [39]. In this research, argument analysis focuses on the sufficiency criterion. This criterion separates arguments that are sufficiently supported from those that are not. The measurement is based on the contribution of the premises to the claim in the argument.
Since the main focus is to measure the impact of the attention mechanism on deep learning, the attention layer was placed after the deep learning model finished processing the data. Figure 4 explains in detail what happens in the "Deep Learning" box in Fig. 3. The output of the CNN/RNN is a sequence of vectors that is further processed as input to the attention layer. 'C' contains context information about the statement for the attention layer. The vectors $y_1, y_2, \ldots, y_n$ are the outputs of the deep learning model. Tanh was chosen as the activation function. The values $m_1, m_2, \ldots, m_n$ are the outputs of the activation function, which are then passed through a softmax to produce $s_1, s_2, \ldots, s_n$. All vectors are combined using vector addition. The final result is the vector 'Z', the representation of the input statement after passing through the deep learning model and the attention mechanism.
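The attention computation can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact layer used in the experiments: it assumes that each score is obtained as the tanh of the dot product between an output vector and the context vector, and that the final vector is the sum of the output vectors weighted by the softmax scores. The function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(outputs, context):
    """Collapse output vectors y_1..y_n into one vector z using attention."""
    # m_i = tanh(y_i . c): relevance of each position to the context vector
    # (the way the context enters the score is an assumption of this sketch).
    scores = [
        math.tanh(sum(y_k * c_k for y_k, c_k in zip(y, context)))
        for y in outputs
    ]
    # s_i = softmax(m_i): normalized attention weights that sum to 1.
    exps = [math.exp(m) for m in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # z = sum_i s_i * y_i: weighted vector addition of the outputs.
    dim = len(outputs[0])
    z = [sum(w * y[k] for w, y in zip(weights, outputs)) for k in range(dim)]
    return z, weights
```

Positions whose output vectors align with the context vector receive larger weights and therefore contribute more to the final representation.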

Combining deep learning model with attention mechanism for argument annotation and analysis
Several deep learning models were involved in the experiments: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). We combined each deep learning model with the attention mechanism so that the results can show the impact of the attention mechanism on argument annotation and analysis.
The deep learning models are briefly justified as follows:

Convolutional Neural Network (CNN)
CNN was chosen due to its excellent performance in many classification tasks such as sentiment classification and question classification [16]. Unlike the Recurrent Neural Network architecture, CNN does not rely on the sequential nature of the data per se. The way CNNs process words implies a syntactic benefit similar to N-gram windows. Different window sizes may result in different behavior, which may make the model fairly robust. Through several experiments, a single convolutional layer with a window size of 3 and 250 feature maps performed best, together with a dropout rate of 0.5. The attention mechanism was also added to the architecture in the experiments.

Fig. 4 Attention mechanism attached to deep learning
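The N-gram-like behavior of a convolution with a window size of 3 can be illustrated with a small sketch. It assumes a ReLU activation and global max pooling over positions, neither of which is specified in the text, and the function name `conv1d_max_pool` is hypothetical. In the reported setting there would be 250 filters; the sketch works with any number.

```python
def conv1d_max_pool(embeddings, filters, window=3):
    """Slide each filter over every `window`-word span, then max over positions.

    embeddings: list of word vectors (must contain at least `window` words).
    filters: list of filters, each a list of `window` weight vectors.
    """
    n_words = len(embeddings)
    pooled = []
    for filt in filters:
        activations = []
        for start in range(n_words - window + 1):
            total = 0.0
            for offset in range(window):
                word_vec = embeddings[start + offset]
                total += sum(w * x for w, x in zip(filt[offset], word_vec))
            activations.append(max(0.0, total))  # ReLU (assumed activation)
        # Max pooling keeps the strongest response but discards its position.
        pooled.append(max(activations))
    return pooled
```

Each filter responds to a particular three-word pattern, and the pooled output records only how strongly the pattern occurred, not where in the statement it occurred.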

Long Short-Term Memory (LSTM)
LSTM is distinguished by its effectiveness in handling sequential data. LSTM has been reported empirically to be the best Recurrent Neural Network (RNN) architecture. This holds not only for unidirectional LSTM but also for bidirectional LSTM. On that basis, both unidirectional and bidirectional LSTMs were used for the persuasive essays. After tuning parameters over several experiments, a 128-unit LSTM with 0.5 dropout and recurrent dropout rates was used. The attention mechanism was then attached to the architecture.

Gated Recurrent Unit (GRU)
GRU is used because its performance is similar to that of LSTM while being more computationally efficient. The difference between LSTM and GRU is the number of gates in the model [44]. GRU has 2 gates (reset and update), while LSTM has 3 gates (input, forget, and output). Using the same scenario as for LSTM, results were compared for GRU and bidirectional GRU. The best parameters for GRU and bidirectional GRU were 128 units with 0.5 dropout and recurrent dropout rates. Finally, the attention mechanism was attached to the architecture.

Hierarchical Attention Network (HAN)
Figure 5 shows the HAN architecture using GRU [45]. This architecture works with 2 levels of attention. A document is treated as 4-dimensional data consisting of batch size, number of statements, number of words per statement, and vector representation. In the deepest part of the architecture, word-level attention is applied using one bidirectional GRU. Word-level attention captures the most influential word representations in a statement.
In the outer part of the architecture, another attention is added: sentence-level attention. Similar to word-level attention, this mechanism produces a representation of the most informative statements in a document.
At the outermost part of the architecture, a softmax layer [46] and negative log likelihood are used. The best setting for HAN was a 1-layer bidirectional GRU for both the word and sentence encoders, with 32 GRU units. Dropout and recurrent dropout rates were 0.5. Nadam [42] was used as the optimizer, with a learning rate of 0.002 and a batch size of 32.
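The two levels of attention in HAN can be sketched as attention applied twice: once over the words of each statement and once over the resulting statement vectors. The sketch below deliberately omits the bidirectional GRU encoders and the final softmax classifier and operates on precomputed vectors; the dot-product scoring and the names `attend` and `hierarchical_attention` are assumptions made for illustration.

```python
import math


def attend(vectors, context):
    """Attention pooling: softmax(tanh(v . c)) weights, then a weighted sum."""
    scores = [
        math.tanh(sum(v_k * c_k for v_k, c_k in zip(v, context)))
        for v in vectors
    ]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return pooled, weights


def hierarchical_attention(document, word_context, sentence_context):
    """document: list of statements, each a list of word vectors."""
    # Word level: summarize each statement by its most influential words.
    statement_vectors = [attend(words, word_context)[0] for words in document]
    # Sentence level: summarize the document by its most informative statements.
    return attend(statement_vectors, sentence_context)
```

The returned sentence-level weights indicate which statements contributed most to the document representation, which is what makes the hierarchy useful for essays composed of several statements.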

Results and discussion
The corpus was initially created in English [24]. Highly qualified experts were selected to annotate the arguments independently. For this research, the dataset was translated into Bahasa Indonesia with the help of linguistic experts.

A. Argument annotation
Using the translated dataset of 402 persuasive essays, the results of several deep learning models are presented in Tables 3 and 4. Table 3 presents the results with word embeddings trained from scratch, while Table 4 presents the results with FastText [47] as the word embedding. In general, the F1 scores in Tables 3 and 4 are not high enough to indicate success in argument annotation. Nevertheless, some conclusions can be drawn from this experiment.
Learning word embeddings from scratch gave relatively better results than using FastText. This was because the word combinations in FastText came from crowdsourcing, without the involvement of language experts. Consequently, some words were likely misused, because no quality assurance was dedicated to validating the data.
In addition, word combinations in FastText tend to be descriptive rather than argumentative. When word vectors are learned, the context of statements is observed so that the way words are arranged relative to one another is captured. Because Wikipedia in Bahasa Indonesia was used as training material, the most frequent word combinations were descriptive. Descriptive statements differ considerably from argumentative ones. For example, the word "because" is rarely used in descriptive statements, so the weight given to "because" would differ greatly from that in argumentative statements, where "because" is used very often.
Based on that, learning embeddings from scratch appears to be a better option than FastText.
The attention mechanism can improve the performance of almost all deep learning models, such as LSTM (from scratch), BiLSTM (FastText), and BiGRU (from scratch and FastText). All of them are RNN variants, which is consistent with the claim that RNN is the most suitable deep learning model for text. For other models, the results were worse than without the attention mechanism. One of them was the Convolutional Neural Network (CNN). CNN needs additional spatial information rather than looking at the context of statements. We conclude that the attention mechanism did not play a significant role across all deep learning models in this experiment. This happened because there were 4 classes while the total data was only 402 essays, so the deep learning models did not have enough data to be trained. The best model for argument annotation in Bahasa Indonesia is LSTM with attention mechanism.
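For reference, the F1 score used to compare models in this subsection can be computed per class as in the sketch below. Averaging the per-class scores with equal weight (macro averaging) is an assumption; the text does not state which averaging was used, and the function names are hypothetical.

```python
def f1_per_class(y_true, y_pred, classes):
    """Return a dict mapping each class to its F1 score."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)
```

With four classes (MC, C, P, N) and few examples of some classes, a low F1 on a rare class pulls the macro average down noticeably.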

B. Argument analysis
With a smaller number of classes (2), argument analysis is a binary classification task. ROC-AUC (Receiver Operating Characteristic, Area Under the Curve) was used as one of the evaluation metrics. The same dataset was used for argument analysis, but the labels fell into only 2 classes: sufficient and insufficient. Table 5 presents the results with word embeddings from scratch, while Table 6 contains the results with FastText. The batch size was 128. A different attention architecture, the Hierarchical Attention Network (HAN), was also used; Tables 7 and 8 present the HAN results. Tables 5 and 6 show that the attention mechanism significantly improved the performance of RNN models: ROC-AUC for all RNN models went up after the attention mechanism was attached. This supports the discussion of the argument annotation results: a smaller number of classes led to better results with the 402 persuasive essays. If the dataset were enlarged, we hypothesize that the argument annotation task would achieve results comparable to argument analysis. CNN behaved consistently with the argument annotation experiment: its results were worse when the attention mechanism was added. The use of max pooling layers in CNNs for image recognition makes the information denser. This is very useful for recognition tasks because high-level feature extraction yields a denser representation. However, the problem with this layer is the loss of spatial information. After condensation, the location of a given word can no longer be identified, whereas location is very important in statements. When the attention mechanism is not used, the fully connected layer that acts as the classifier benefits from seeing denser representation patterns. Adding attention, however, no longer has an effect because the spatial information in the data has already been lost.
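The ROC-AUC metric used in this subsection can be computed directly from its ranking interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; treating "sufficient" as the positive class (label 1) and the function name `roc_auc` are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of sufficient and insufficient arguments, which makes the metric independent of any particular decision threshold.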
Across all experiments in argument analysis, word embeddings from scratch performed better than FastText. This is consistent with the earlier discussion of argument annotation.
The best model for argument analysis is HAN with word embeddings from scratch and a batch size of 64. This result is in line with the experiment on the English dataset [43]. HAN performs well on datasets with hierarchical characteristics.
The main points to highlight from this research are as follows:

Word vectors utilization
Based on the experiments, FastText performs worse than word embeddings from scratch. This is in line with previous research on the English dataset [43]. We conclude that pre-trained word vectors are not suitable for argumentative statements.

Number of classes
More classes lead to a smaller amount of data in each class. The more classes there are, the more difficult it is to learn the pattern. Argument analysis therefore yields better performance than argument annotation.

Role of attention mechanism
Most experiments using deep learning with the attention mechanism had better results, such as those with LSTM, GRU, BiLSTM, and BiGRU. Commonly, new features are added to improve performance, whereas the attention mechanism strengthens the involvement of existing features. It works by identifying which parts of the whole sequence contribute to the learning process so that the model can perform well. The attention mechanism improves the results of bidirectional RNNs. This is because a bidirectional RNN also involves future context in the process. The Hierarchical Attention Network (HAN) performs well in argument analysis because of its hierarchical structure. The attention layer in HAN is divided into 2 levels: word-level and sentence-level. HAN performs best when the data is hierarchical, for example paragraph, statement, and word.

Form of language
Comparing our results with previous similar research on the English dataset [43], there are no extreme differences: F1 and ROC-AUC scores are relatively close. The fundamental difference lies in the word embedding used. In English, word vector representations such as GloVe or Word2vec can be used because they are trained on very large data. They can serve as universal feature extractors for several text-related tasks. Research in other languages has produced many variants of word vector representations, such as FastText for Bahasa Indonesia [47]. FastText was used in our research and did not give better results than word embeddings from scratch. Therefore, work on languages other than English still needs to consider how large the training data is. With enough data, better and more representative word embeddings can be obtained as features.

Conclusion
Some conclusions drawn from the experiments in this research are:
1. Pre-trained word vectors do not significantly improve the performance of argument annotation and analysis.
2. Combining the attention mechanism with a deep learning model results in better performance, especially for Recurrent Neural Networks (RNN).
3. The Hierarchical Attention Network (HAN), as one variant of the attention mechanism, works well on hierarchical data, for example where one paragraph contains several statements and one statement contains several words.
4. Word embeddings play an important role as features only if they are trained on a large amount of data.