Arabic text summarization using deep learning approach

Natural language processing has witnessed remarkable progress with the advent of deep learning techniques. Text summarization, along with other tasks such as machine translation and sentiment analysis, has used deep neural network models to improve results. Recent text summarization methods follow the sequence-to-sequence framework of the encoder–decoder model, which is composed of neural networks trained jointly on both input and output. Deep neural networks take advantage of big datasets to improve their results. These networks are supported by the attention mechanism, which can deal with long texts more efficiently by identifying focus points in the text. They are also supported by the copy mechanism, which allows the model to copy words from the source to the summary directly. In this research, we re-implement the basic summarization model that applies the sequence-to-sequence framework to the Arabic language, for which this model has not previously been employed in text summarization. First, we build an Arabic data set of summarized article headlines. This data set consists of approximately 300 thousand entries, each consisting of an article introduction and the headline corresponding to this introduction. We then apply baseline summarization models to this data set and compare the results using the ROUGE scale.

expressing the basic meaning of the text in a new linguistic style and in different words; it involves more sophisticated processes, such as paraphrasing, generalization, and reordering [4]. Previous studies have begun to generate abstract summaries either using linguistically inspired constraints [5,6] or with syntactic transformation of the input text [7,8].
In this work, we use a data-driven model to generate headlines for Arabic articles, in a manner similar to the successful approach of neural machine translation, which was also adopted by recent studies on generating headlines for English articles [9]. Recently, deep learning methods have made clear, though not yet ideal, progress in English text summarization, based on neural network models using a sequence-to-sequence framework. These models consist of two complementary units that are trained jointly through gradient descent or reinforcement learning. The first unit is an encoder that generates a hidden representation of the original text, while the second unit is a decoder that generates the summarized text. The summary is generated word by word until a special stopping token is generated that ends the summary. In addition, recent neural abstractive summarization models have merged the abstractive and extractive approaches using the pointer-generator approach, which adds the capability of copying words from the source text directly to the summary [10,11]. The word 'pointer' in the name of the approach refers to the extraction methodology, whereas the word 'generator' refers to the possibility of generating a new word in the summary, according to the abstraction methodology. Our study generates summaries using the abstractive neural model as a baseline model. Our baseline model depends on the attention mechanism presented by Bahdanau et al. [12] to determine the parts of the original text that must be focused on while generating each new word of the summary. The decoder uses the beam search algorithm to prune the space of candidate hypotheses when generating summary tokens. We improved the baseline model by adding the copy mechanism presented by See et al. [11], to end up with a pointer-generator model.
We explain the processing steps in more detail in the "Proposed methodology" section. This approach to summarization, known as Attention Based Summarization (ABS), incorporates less linguistic structure than comparable abstractive summarization approaches, but scales easily to training on a huge amount of data; it can be trained on any article-headline pairs. Based on this, we have trained our system to generate headlines for articles after building a new data set in Arabic consisting of about 300 thousand pairs. The original text in each pair is the introduction to the article, while the corresponding summary is the headline of the article. The method for building the data set is explained in more detail in the "Data set" section under "Experiments". An example of a generated summary is presented in Table 1, and the training setup is described in "Training details" under the "Experiments" section. To examine the efficiency of this approach in the Arabic language, we calculated the ROUGE scores shown in Table 3. The results of this study, and the data set they rely on, are the first of their kind in the Arabic language in the field of headline summarization based on deep learning techniques. The contributions of our research, specifically (1) the newly created Arabic dataset with 300 thousand (article, headline) pairs and (2) the comparisons between Arabic and English results, may contribute to the emergence of future comparisons such as those of the Document Understanding Conferences (DUC) for the English language.

Related work
A vast majority of past work in summarization has been extractive, which consists of identifying key sentences or passages in the source document and reproducing them as a summary [13][14][15][16][17]. Humans, on the other hand, tend to paraphrase the original story in their own words. As such, human summaries are abstractive in nature and seldom consist of reproductions of original sentences from the document. Some abstractive summarization studies used machine learning methodologies, which Sarker et al. [18] briefly surveyed. One of the important contributions to Arabic text summarization based on machine learning is by Sobh et al. [19]. On the other hand, with the emergence of deep learning as a viable alternative for many NLP tasks, researchers have started considering this framework as an attractive, fully data-driven alternative for abstractive summarization. Consequently, a new approach emerged: neural abstractive summarization with sequence-to-sequence models [12,20]. This approach has been applied to tasks such as headline generation [9] and article summarization [21]. Chopra et al. [22] show that attention approaches that are more specific to summarization can further improve the performance of models. Gu et al. [10] were the first to show that a copy mechanism, introduced by Vinyals et al. [23], can combine the advantages of both extractive and abstractive summarization by copying words from the source. See et al. [11] refine this pointer-generator approach and use an additional coverage mechanism [24] that makes a model aware of its attention history to prevent repeated attention. While contributions to the Arabic language using the sequence-to-sequence deep learning framework are still limited, Elmadani et al. [25] showcased how the fine-tuned pretrained BERT model [26] can be applied to the Arabic language, both to construct the first documented model for abstractive Arabic text summarization and to show its performance in Arabic extractive summarization. In a similar research domain, Helmy et al. [27] proposed a deep learning based approach for Arabic keyphrase extraction, which achieves better performance than the related competitive approaches. It also provides the community with a large-scale annotated dataset of about 6000 scientific abstracts, which can be used for training, validating and evaluating deep learning approaches for Arabic keyphrase extraction. Our work uses the same framework as See et al. [11], in that we use the pointer-generator approach, but we go beyond the standard architecture and also use coverage and length penalties (Fig. 1). We also propose a novel dataset for Arabic headline summarization, on which we establish benchmark numbers.

Fig. 1 Sequence-to-sequence model

Background
We describe the standard approach for supervised abstractive summarization learning based on the attentive sequence-to-sequence framework, and the challenges it faces in text representation and generation. The goal of a model under this framework is to maximize the probability of generating correct target sequences.

Sequence-to-sequence framework
The sequence-to-sequence framework consists of two parts: a neural network for the encoder and another network for the decoder. During training, the source text and the reference summary are tokenized and fed to the encoder and decoder networks, respectively. The encoder network reads the source text and transforms it into a potentially useful vector representation, which is then passed to the decoder network to help predict the summary sequence token by token. Figure 2 illustrates how the encoder and decoder networks work together.

Encoder mechanism
The encoder mechanism uses a deep neural network to convert a sequence of source words into a sequence of vectors representing its contextual meaning. This encoding is done using recurrent, convolutional or transformer neural networks.

Decoder mechanism
The decoder network uses the vector representation coming out of the encoder network and its own internal state information to represent the state of the sequence generated so far. Essentially, the decoder mechanism combines specific vectorial knowledge about the relevant context with general knowledge about language generation in order to produce the output sequence.

Attention mechanism
At each time step, the decoder state is mapped against all the encoder states to produce an attention vector, which in turn yields a context vector: a weighted sum of the encoder states. Incorporating this context vector at each decoding time step helps improve text generation [12].

Necessity for attention
From a cognitive science perspective, attention, defined as the ability to focus on one thing and ignore others, makes it possible to pick out salient information from noisy data and to remember one event rather than all events. Thus, attention is selective and appears to be as useful for deep learning as it is for people. From a sequence-to-sequence standpoint, attention is the action of focusing on specific parts of the input sequence. It can be stochastic and trained with reinforcement learning (hard attention) or differentiable and trained with back-propagation (soft attention). We note that attention changes over time: as the model generates each word, its attention changes to reflect the relevant parts of the input.

Self-attention
When a sequence-to-sequence model is trying to generate the next word in the summary, this word is usually describing only a part of the input text. Using the whole representation of the input text ( h ) to condition the generation of each word cannot Then, when the model is generating a new word, its attention mechanism can focus on the relevant part of the input sequence, so that the model can only use specific parts of the input.

Greedy decoding
When using greedy decoding, the model at any time step has only one single hypothesis. Since a text sequence can be the most probable despite including tokens that are not the most probable at each time step, greedy decoding is seldom used in practice.

Beam decoding
When using beam search decoding, the model iteratively expands each hypothesis one token at a time, and at the end of each iteration it keeps only the beam-size best hypotheses, as shown in Fig. 3. Small beam sizes are able to yield good results in terms of ROUGE score, while larger beam sizes can yield worse results. To make decoding efficient, the decoder expands only hypotheses that look promising. Bad hypotheses should be pruned early to avoid wasting time on them, but pruning compromises optimality.
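As a minimal sketch of the two decoding strategies described above, the following Python code runs beam search over a hypothetical toy next-token table (`TOY_TABLE`, `toy_next_log_probs` and the token names are illustrative assumptions, not part of our model). Setting the beam size to 1 gives greedy decoding, and the example shows a case where greedy decoding misses the most probable sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the tokens so far.
TOY_TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_next_log_probs(tokens):
    # After two tokens the toy model always emits the end-of-sequence token.
    probs = TOY_TABLE.get(tuple(tokens), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, log-probability) pair found by beam search."""
    beams = [([], 0.0)]  # unfinished hypotheses
    finished = []        # hypotheses that ended with eos
    for _ in range(max_len):
        # Expand every hypothesis by one token.
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_size best expansions (pruning step).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_next_log_probs, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_next_log_probs, beam_size=2, max_len=5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy example the greedy hypothesis starts with the locally best token "a", whereas a beam of size 2 keeps "b" alive long enough to find the more probable sequence. The scores here are raw log-probabilities without any length normalization.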

Proposed methodology
We used the encoder-decoder framework to generate an abstractive headline, based on the introduction of an article as the original text. Using the extensions of this framework described above, we calculate the context vector, the copy vector that allows generating words outside the model dictionary, and the coverage penalty that prevents repetition in generated summaries. Table 1 shows an example of the model output that contains an article's introduction and the resulting headline. The general form of the model is shown in Fig. 1, while the flowchart of the model is shown in Fig. 4.

Attention mechanism
We take advantage of the attention mechanism presented by Bahdanau et al. [12], which gives the decoder the ability to identify the portions of the text that are important for generating the next word of the summary. The mechanism has two inputs: (1) the tokens of the article $w_i$, which are fed one by one into the encoder, producing a sequence of encoder hidden states $h_i$ that represents the original text; and (2) the hidden state of the decoder $s_t$. On each time step $t$, the decoder receives the word embedding of the previous word and has decoder state $s_t$, which reflects the summary words generated up to this point. Based on these inputs, the mechanism assigns a degree of focus to each word of the original text, called the attention distribution $a^t$, and uses all of them to generate a context vector that the decoder uses in the generation process, as shown in Fig. 1. The attention distribution $a^t$ is calculated using Eqs. (1) and (2):

$$e_i^t = v^\top \tanh\left(W_h h_i + W_s s_t + b_{attn}\right) \tag{1}$$

$$a^t = \mathrm{softmax}\left(e^t\right) \tag{2}$$

where $v$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. $a^t$ given by Eq. (2) can be seen as a probability distribution over the words of the original text, which tells the decoder where to focus while generating the next word. The next step in this model is to generate a weighted sum of the hidden states of the encoder, called the context vector $h_t^*$:

$$h_t^* = \sum_i a_i^t h_i \tag{3}$$

The context vector can be viewed as a fixed-size representation of what has been read from the original text for this step. The context vector, combined with the hidden state of the decoder, is passed through two linear layers to generate the vocabulary distribution $P_{vocab}$:

$$P_{vocab} = \mathrm{softmax}\left(V'\left(V\left[s_t, h_t^*\right] + b\right) + b'\right) \tag{4}$$

where $V$, $V'$, $b$ and $b'$ are learnable parameters. $P_{vocab}$ is a probability distribution over all words in the dictionary.
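The attention step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not our training code: it scores each encoder state with $v^\top \tanh(W_h h_i + W_s s_t + b_{attn})$, applies a softmax, and forms the context vector as the attention-weighted sum of encoder states. The toy dimensions and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_step(encoder_states, decoder_state, W_h, W_s, b_attn, v):
    """Return (attention distribution a_t, context vector h*_t)."""
    projected_s = matvec(W_s, decoder_state)
    scores = []
    for h_i in encoder_states:
        projected_h = matvec(W_h, h_i)
        hidden = [math.tanh(ph + ps + b)
                  for ph, ps, b in zip(projected_h, projected_s, b_attn)]
        scores.append(sum(v_k * u_k for v_k, u_k in zip(v, hidden)))
    a_t = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder hidden states.
    context = [sum(a * h[d] for a, h in zip(a_t, encoder_states))
               for d in range(dim)]
    return a_t, context

# Hypothetical toy values: three source tokens, two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
a_t, context = attention_step(encoder_states, decoder_state,
                              W_h=identity, W_s=identity,
                              b_attn=[0.0, 0.0], v=[1.0, 1.0])
print(a_t, context)
```

In the real model the weights are learned and the states have hundreds of dimensions; the sketch only shows how the distribution over source positions and the context vector relate.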

Copy mechanism
Since a number of tokens appearing in the original text are outside the dictionary of the model, there must be a mechanism for generating these words. We use the copy mechanism provided by Vinyals et al. [23], which was first introduced in the field of text summarization by Gu et al. [10] to demonstrate the possibility of merging the merits of the abstractive and extractive approaches. This mechanism provides the ability to copy words from the original text, thus giving the model the ability to generate out-of-vocabulary (OOV) words, so that it is not restricted to a preset fixed dictionary. Copy models extend their decoder by predicting a binary soft switch, called $z_j$, which determines whether the model will copy or generate. The copy distribution is a probability distribution over the original text, and the joint distribution is calculated as a convex combination of both parts of the model:

$$p(y_j \mid y_{1:j-1}, x) = p(z_j = 1 \mid y_{1:j-1}, x)\, p(y_j \mid z_j = 1, y_{1:j-1}, x) + p(z_j = 0 \mid y_{1:j-1}, x)\, p(y_j \mid z_j = 0, y_{1:j-1}, x)$$

where the two parts represent the copy part and the generation part, respectively. We reused the attention distribution $p(a_j \mid x, y_{1:j-1})$ as the copy distribution, following the pointer-generator model of See et al. [11]. For example, using the copy attention, we calculate the probability of copying a token 't' from the original text as the sum of the attention over all positions where 't' appears. In Fig. 1, for each decoder time step, the probabilities $P(z_j = 0)$ and $P(z_j = 1)$ are calculated, which determine whether the model will copy a word from the source or generate it from the vocabulary. The vocabulary distribution and the attention distribution are weighted and summed together to obtain the final distribution, from which the prediction is made. Note that out-of-vocabulary original text words like "5ب" are included in the final distribution.
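The convex combination of the copy part and the generation part can be sketched as below. The function name `final_distribution`, the English tokens and the probability values are hypothetical and chosen only for illustration; `p_copy` stands for $P(z_j = 1)$.

```python
def final_distribution(p_vocab, attention, source_tokens, p_copy):
    """Mix the vocabulary distribution with the copy (attention) distribution.

    p_vocab: dict mapping each vocabulary word to its generation probability.
    attention: attention weight for each source position (sums to 1).
    source_tokens: the source tokens, aligned with `attention`.
    p_copy: probability of copying, i.e. P(z_j = 1).
    """
    # Generation part, weighted by P(z_j = 0).
    final = {word: (1.0 - p_copy) * p for word, p in p_vocab.items()}
    # Copy part: attention on every occurrence of a token is summed.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + p_copy * weight
    return final

# Hypothetical example: "ferritin" is out of vocabulary but appears in the source.
p_vocab = {"iron": 0.5, "deficiency": 0.3, "<unk>": 0.2}
source_tokens = ["iron", "ferritin", "iron"]
attention = [0.2, 0.5, 0.3]
final = final_distribution(p_vocab, attention, source_tokens, p_copy=0.4)
print(final)
```

Because "ferritin" receives probability only through the copy part, the model can output it even though it has no entry in the vocabulary distribution.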

Length
We normalized length during the beam search phase, using the length penalty of Wu et al. [28], which is defined as follows:

$$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}$$

with a tunable parameter α, where increasing α leads to longer summaries. We set its value to 0.5 to achieve a balanced headline length. In addition, we set a minimum length for the output sequence based on the training data.
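As a minimal sketch, assuming the commonly cited form of the Wu et al. penalty, $lp(Y) = (5 + |Y|)^{\alpha} / (5 + 1)^{\alpha}$, and assuming that a hypothesis is scored by dividing its log-probability by this penalty, the normalization can be written as follows. The function names and the example log-probabilities are hypothetical.

```python
def length_penalty(length, alpha=0.5):
    """Length penalty of the form ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5 + length) / 6) ** alpha

def normalized_score(log_prob, length, alpha=0.5):
    """Divide the (negative) log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)

# Two hypothetical hypotheses with the same log-probability but different lengths.
short_score = normalized_score(-4.0, length=3)
long_score = normalized_score(-4.0, length=7)
print(short_score, long_score)
```

With equal raw log-probability, the longer hypothesis gets the higher normalized score, which is how a larger α pushes the search toward longer summaries.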

Repeats
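Copy models tend to attend to the same tokens in the original text repeatedly, resulting in the same phrases being generated in the output multiple times. We followed the method introduced by Gehrmann et al. [29] in calculating the summary-based coverage penalty:

$$cp(x; y) = \beta \left( -n + \sum_{i=1}^{n} \max\left( \sum_{j=1}^{m} a_i^j,\ 1.0 \right) \right) \tag{9}$$

where $n$ is the number of encoded source tokens, $m$ is the number of decoding steps, and $a_i^j$ is the attention given to source token $i$ at decoding step $j$. This penalty increases if the decoder directs more than 1.0 total attention towards a specific encoded token. By selecting a high enough value for β in Eq. (9), this penalty prevents summaries that would cause repetition in the output.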
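The behaviour of this penalty can be sketched as follows, assuming the summary-style form in which each source token contributes only the attention it receives beyond 1.0 in total. The function name and the attention values are hypothetical, and how the penalty enters the final hypothesis score is left out.

```python
def coverage_penalty(attention_steps, beta):
    """Summary-style coverage penalty.

    attention_steps: one attention distribution per decoding step, each a
    list with one weight per source token.
    """
    n = len(attention_steps[0])
    total = 0.0
    for i in range(n):
        # Total attention that source token i received over all steps.
        attended = sum(step[i] for step in attention_steps)
        total += max(attended, 1.0)
    return beta * (total - n)

# Attention repeatedly focused on the first source token is penalized ...
repeated = coverage_penalty([[0.9, 0.1], [0.8, 0.2]], beta=10.0)
# ... while attention spread over the source is not.
spread = coverage_penalty([[0.5, 0.5], [0.5, 0.5]], beta=10.0)
print(repeated, spread)
```

In the first case the first token accumulates 1.7 total attention, so the excess of 0.7 is scaled by β; in the second case no token exceeds 1.0 and the penalty is zero.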

Data set
We work with a data set that we built by gathering Arabic articles published by the Arabic website mawdoo3 [30] on a wide variety of topics. We considered the introduction paragraph of the article as the original text and the title of the article as the correct summary of this introductory paragraph. We named this newly created dataset Arabic Headline Summary (AHS). Each entry in the dataset went through several steps of cleaning and enhancing.
This dataset is divided into three sections: training, validation and test, as shown in Table 2. Following the work of See et al. [11], we truncate source texts to 900 tokens and target summaries to 100 tokens in both the training and validation sections. We also limit both the input and output vocabulary to the 100,000 most frequent words and replace the rest with UNK tokens.
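The truncation and vocabulary limits can be sketched as below. This is an illustration only: the helper names `build_vocab` and `prepare`, the `<unk>` spelling and the tiny corpus are hypothetical, and the real limits are 900 source tokens, 100 target tokens and 100,000 words.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens of the corpus."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok for tok, _ in counts.most_common(max_size)}

def prepare(tokens, vocab, max_len, unk="<unk>"):
    """Truncate to max_len tokens and replace out-of-vocabulary tokens."""
    return [tok if tok in vocab else unk for tok in tokens[:max_len]]

# Hypothetical toy corpus with single-letter tokens.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, max_size=2)
prepared = prepare(["a", "c", "b", "a"], vocab, max_len=3)
print(sorted(vocab), prepared)
```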

Data preprocessing
Since we created a new dataset in the Arabic language, we applied specific preprocessing steps to normalize the dataset entries. The steps we followed are:
a. Add a space between (conjunction letters/commas/special characters) and the fol-
Justification for a, b, c: ensure that the same word does not have more than one form.
Justification for d: remove noise from data.
Justification for e: remove unusual forms of text that could also contain rare words.

Training details
We re-applied the pointer-generator model as defined by See et al. [11]. The model architecture consists of a bidirectional Long Short-Term Memory (LSTM) encoder that has 256 hidden states in both directions and 128-dimensional word embeddings, while the decoder consists of one layer that has 512 hidden states. The model was trained using Adagrad [31], with an initial learning rate of 0.15 and an initial accumulator value of 0.1. The learning rate decreased to $10^{-5}$ after the nineteenth epoch, with a perplexity of 12.7354. We do not use dropout, and we use gradient clipping with a maximum norm of 2. We implemented our model using PyTorch on the open-source tool OpenNMT-py [32]. We ran the experiment on Google Colab with a free Tesla K80 GPU and 12 GB RAM.
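To make the optimizer settings concrete, the following plain-Python sketch applies gradient clipping with a maximum norm of 2 followed by one Adagrad update with a rate of 0.15. It assumes that the 0.1 setting is Adagrad's initial accumulator value, as in See et al. [11]; the function names, the `eps` constant and the toy parameters are hypothetical and this is not the OpenNMT-py implementation.

```python
import math

def clip_by_norm(grads, max_norm=2.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def adagrad_step(params, grads, accum, lr=0.15, eps=1e-10):
    """One Adagrad update; `accum` holds the running sum of squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_params = [p - lr * g / (math.sqrt(a) + eps)
                  for p, g, a in zip(params, grads, new_accum)]
    return new_params, new_accum

params = [1.0, 1.0]
accum = [0.1, 0.1]                 # assumed initial accumulator value
grads = clip_by_norm([3.0, 4.0])   # norm 5.0 is scaled down to norm 2.0
params, accum = adagrad_step(params, grads, accum)
print(grads, params, accum)
```

Because the accumulator only grows, the effective step size for each parameter shrinks over training, which is the adaptive behaviour Adagrad is known for.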

Model
We tried our dataset on two models; the first, which is the baseline model, uses sequence-to-sequence framework consisting of an encoder and a decoder using a recurrent neural network with both the attention and coverage mechanism, while the second adds the copy mechanism to the baseline model.

Evaluation metrics
We evaluated the accuracy of the summaries on the 20,000 test samples using the ROUGE-1 scale. This scale calculates the precision of the summary, which is the percentage of tokens in the generated summary that also appear in the reference summary:

$$\text{Precision} = \frac{\text{number of overlapping words between both summaries}}{\text{total words in generated summary}} \tag{10}$$

This scale also uses the recall measure, which shows how far the generated summary covers the reference summary:

$$\text{Recall} = \frac{\text{number of overlapping words between both summaries}}{\text{total words in reference summary}} \tag{11}$$

whereas the F-measure gives the harmonic mean of precision and recall as follows:

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

These scales are widely used in the text summarization task: Precision, given by Eq. (10), shows how much of the generated summary overlaps with the reference summary, whereas Recall, given by Eq. (11), shows how much of the reference summary is covered by the generated summary.

Results and discussion
The results of the two models that we evaluated in our study are shown in Table 3 and illustrated in Figs. 5 and 6. In Fig. 5, we found that the baseline model achieved better results without the length and coverage penalties, while the pointer-generator model, which uses the copy mechanism, achieved better results than the baseline model. This reflects the improvement made by the copy mechanism, as illustrated in Fig. 7.
We found that using the length penalty with the pointer-generator model could lead to slightly improved results, while the results become worse when adding the coverage penalty, as illustrated in Fig. 6. The slight improvement in the results could be related to the short length of the reference summaries in our dataset, with an average length of 3.3 as shown in Table 2. Consequently, the combined effect of the length penalty, which restricts the length of the summaries (headlines) to match the length limit, and the copy mechanism, which copies words not included in the model dictionary, could have led to this improvement. The poor results when using the coverage penalty are related to the fact that this mechanism targets relatively long texts, examining how well they are covered across various topics, and its effect was reversed with short texts.
We did not use the ROUGE-2 scale because the average length of the reference summaries was less than four; this metric is therefore not a good choice for this case, as it depends on pairs of words. Table 4 shows examples of the pointer-generator model's results. Example (1) shows how the model generated the word "أضرار" (harms) in place of "مضار" (harmful), while Example (2) shows how the model generated the word "تحسين" (improve) in place of "حفاظ" (maintain).
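The ROUGE-1 precision, recall and F-measure described under "Evaluation metrics" can be sketched as a unigram-overlap computation. This is a simplified illustration, not the official ROUGE toolkit: the function name `rouge_1` is hypothetical, and the example uses the English translation of Example (1) from Table 4 with whitespace tokenization, whereas our evaluation was run on the Arabic text.

```python
from collections import Counter

def rouge_1(generated, reference):
    """Return (precision, recall, f_measure) based on unigram overlap."""
    gen_counts = Counter(generated)
    ref_counts = Counter(reference)
    # Clipped overlap: each word counts at most as often as in both summaries.
    overlap = sum((gen_counts & ref_counts).values())
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

generated = "iron deficiency harms".split()
reference = "the harmful effects of iron deficiency".split()
precision, recall, f_measure = rouge_1(generated, reference)
print(precision, recall, f_measure)
```

Two of the three generated words appear in the six-word reference, so precision is 2/3 and recall is 1/3. With headlines this short, a single differing word changes the score considerably, which is also why a bigram-based metric is of limited use here.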
By comparing the results of the models that we applied to the Arabic dataset, AHS, with other abstractive models applied to the Gigaword and CNN/Daily Mail datasets, we note that the pointer-generator model with length penalty has achieved better results for the Arabic dataset, as shown in Table 5 and illustrated in Fig. 8. We assume that this improvement is due to two main reasons:
1. The nature of the dataset, which consists of short reference summaries (headlines) compared to other datasets.
2. The nature of the Arabic language, in terms of greater grammatical consistency within written text, which contributed to this improvement.

Table 4 Two examples of headline generation by the pointer-generator model with penalties
The word underlined is a word generated by the model

Example (1), translation
Article: Iron deficiency causes many harmful effects to the human body, some of which may be dangerous, some of which may be natural and easily treated, and of these harmful effects (…)
Pointer-generator summarization: Iron deficiency harms
Reference summarization: The harmful effects of iron deficiency

Example (2), translation
Article: Many foods can contribute to maintaining a healthy heart, and reduce the risk of many diseases, including the following (…)
Pointer-generator summarization: Improve heart health
Reference summarization: Disease prevention

Arabic research results comparison
For comparison with Arabic research that uses deep learning to summarize texts, we relied on the latest Arabic research based on deep learning that approximates the task of summarizing texts, namely extracting keyphrases from text [27]. The length of the extracted keyphrases, two to three words, corresponds to the length of the titles generated by our model.
Since the idea of the research in [27] is close to ours and uses deep learning, it was appropriate to compare the two studies. We applied our model (pointer-generator with a length penalty) to the test dataset of [27], which has 940 entries. The resulting summaries contained many (UNK) tokens, each of which marks a word outside the dictionary of the model, so the comparison was not possible. We attribute this to the following reasons:
1. The dataset was pre-processed in [27] in a way that affected the original form of the words, such as:
a. Writing the letters "أ، إ، آ، ا" in the normalized form "ا".
b. Writing the letter "ة" in the normalized form "ه".
2. Compound Arabic words were divided into their primary parts using Stanford CoreNLP [33]; for example, "ركبت" became "ركب ت".
Table 6 shows the results of the modification.
As a result of these modifications, the words of the two datasets are no longer compatible: the dataset of [27] has a vocabulary whose words differ from their original forms. This produced a large number of words unknown to our model and, consequently, meaningless ROUGE scores. Given the foregoing, a comparison between the results of the two studies is feasible only by applying the same modifications to our data set and retraining the model. However, modifying the word structure in the data set contradicts the idea of summarizing texts using deep learning and may generate incorrectly spelled headlines.
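The vocabulary mismatch can be illustrated with a small sketch of the two letter normalizations listed above (alef variants written as bare alef, and ta marbuta written as ha). The code is a hypothetical illustration and not the preprocessing pipeline of [27]; Unicode escapes are used so that the characters are unambiguous, and the two sample words are "harms" and "health".

```python
ALEF_VARIANTS = "\u0623\u0625\u0622"  # alef with hamza above, hamza below, madda
ALEF = "\u0627"                       # bare alef
TA_MARBUTA = "\u0629"
HA = "\u0647"

def normalize(word):
    """Apply the letter normalizations described for the dataset of [27]."""
    for variant in ALEF_VARIANTS:
        word = word.replace(variant, ALEF)
    return word.replace(TA_MARBUTA, HA)

# Hypothetical vocabulary built from unnormalized text.
harms = "\u0623\u0636\u0631\u0627\u0631"   # "harms", starts with hamza-alef
health = "\u0635\u062d\u0629"              # "health", ends with ta marbuta
vocab = {harms, health}

normalized = [normalize(w) for w in (harms, health)]
unknown = [w for w in normalized if w not in vocab]
print(len(unknown), "of", len(normalized), "normalized words are out of vocabulary")
```

Both words are in the vocabulary in their original spelling, but neither normalized form is, so a model trained on the original spelling sees them as unknown tokens.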

Conclusion
The sequence-to-sequence framework in its encoder-decoder form has gained increased interest in the field of text summarization. Recurrent neural networks have been improved for use in this field, and many studies have been conducted for the English language. In this study, we re-implemented for the Arabic language the latest approaches and mechanisms that have been followed in English. We started by building a new dataset for the Arabic language that is suitable for this task, then applied the abstractive neural model with the attention mechanism, which we called the baseline model, and examined its results. By adding the copy mechanism, we expanded the baseline model to match the pointer-generator model defined by See et al., then showed the improved results obtained by taking advantage of both abstractive and extractive approaches. We also tried both models with coverage and length penalties and found that the pointer-generator model with length penalty achieved the best results. There are other hypotheses that could be tested to improve results in the future. We hope that this study, with its new dataset and the two models it evaluated and compared, will be a starting point for future research in this field that can achieve new enhancements for the Arabic language. Future work should consider expanding the data set to cover more articles. The size of the dataset that we created, AHS, is close to the size of the CNN/Daily Mail dataset, but it can still be expanded towards the size of the Gigaword collection. Future work could also apply other mechanisms that may be useful within the sequence-to-sequence framework, such as implementing a coverage mechanism using a separate coverage vector, as was done by See et al. However, attempting to devise new models that are particularly beneficial for the Arabic language will be the most valuable direction to work on, since Arabic has a distinctive grammar and is written from right to left.