Evaluation of recent advances in recommender systems on Arabic content

Various recommender systems (RSs) have been developed in recent years, most of them concentrated on English content; consequently, the majority of RSs in the literature have been compared on English content. However, research investigating RSs on content in other languages such as Arabic remains minimal, and the field of Arabic RSs is still largely neglected. Therefore, this study aims to fill this research gap by leveraging recent advances in the English RSs field. Our main goal is to investigate recent RSs in an Arabic context. To this end, we first selected five state-of-the-art RSs originally devoted to English content

features deduced from rating matrix patterns, and then predict missing ratings based on the inner product of the users' and items' feature vectors [8]. However, as a traditional CF method, MF suffers from the rating sparsity issue that usually arises in real-world recommendation platforms, in which most users rate only a restricted fraction of items [3]. Furthermore, relying solely on numeric scores makes it hard for these techniques to accurately capture users' preferences [9].
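As a minimal illustration of the MF prediction step described above, the sketch below estimates a missing rating as the inner product of a user's and an item's latent feature vectors plus a global mean; the factor values are made up for illustration, not learned from data:

```python
# Minimal sketch of matrix-factorization rating prediction: a missing
# rating is estimated as mean + inner product of the user's and item's
# latent feature vectors. Factor values below are illustrative only.

def predict_rating(user_factors, item_factors, global_mean=0.0):
    """Estimate a rating as global_mean + dot(user, item)."""
    return global_mean + sum(u * i for u, i in zip(user_factors, item_factors))

user = [0.8, 0.1, 0.5]   # hypothetical latent features for one user
item = [0.9, 0.2, 0.4]   # hypothetical latent features for one item

estimate = predict_rating(user, item, global_mean=3.0)
```

In a real MF model these vectors would be learned by minimizing the squared error over the observed entries of the rating matrix.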
The above two drawbacks have been widely discussed and studied. To enrich ratings and deal with the rating sparsity issue, several types of auxiliary information have been integrated into MF, namely tags [10], item descriptions [11], social relationships [12], and review texts [13]. Among them, reviews remain the most used because they contain precious supplementary details that are effective for modeling users' and items' characteristics and for explaining the underlying rationales behind ratings [14]. Consequently, in the last decade, a variety of models have been proposed that use reviews and ratings [5,14,15]. All these methods have shown a positive impact of user reviews on recommendation accuracy [5,15]. However, most of these existing works have focused only on reviews written in English. From the literature, we notice that far fewer works address other languages such as Arabic, which has become the fourth most spoken language worldwide and one of the most used languages on the Internet [16]. It should also be noted that Arabic web users have recently become major consumers of Internet services [17], and they therefore share a large amount of textual content [18]. Thus, it has become possible to incorporate such content into RSs to predict users' preferences in an Arabic context.
Motivated by these reasons, we choose to address RSs in the Arabic context by exploiting textual reviews written in the Arabic language. Thus, in our study, we carried out experiments with five state-of-the-art RSs from the literature. The performance of these systems has been demonstrated on English content; however, it has never been verified on content in the Arabic language. Unfortunately, no dataset containing Arabic content is available for assessing RSs. Therefore, in this article, we have focused our efforts on filling this need, which is still understudied by the research community. The contributions of our paper can be summarized as follows:
• Building new Arabic datasets for RSs. We collected and translated data from the English Amazon datasets; the main goal is to assess RSs on large-scale Arabic data.
• Providing a text preprocessing scheme for the Arabic and English languages.
• Applying recent RSs in the Arabic context.
• Analyzing and evaluating the performance of various state-of-the-art review-based RSs when using Arabic textual data.
• Demonstrating that modern RSs provide promising results when applied to Arabic content.
• Shedding light on the large need for RSs devoted to Arabic content and on the importance of further studies in this direction.
The remainder of the paper is organized as follows: the "Research motivation" section explains this work's motivation; the "Related works" section summarizes related works in this field; the "Methodology" section describes the methodology used to investigate recent RSs in the Arabic context; the "Experiments" section presents the experimental framework and interprets the empirical results; finally, the "Conclusion and future work" section concludes the paper and outlines future work.

Research motivation
After reviewing the RSs literature, we found that only very few RSs have exploited Arabic content. Among the works that we found are [19,20], which proposed RSs based on Arabic-language reviews to predict users' preferences on items. Although these works were precursors in the Arabic RSs field, they suffer from different limitations discussed in the next section. The main ones are:
• Recent methods and advances in the RSs field are not empirically used or compared in these studies.
• The review datasets used in the experiments of these works are of modest size.
• None of these studies has reported whether or not a special scheme is necessary for preparing Arabic reviews before their usage in RSs.
In view of all those shortcomings, as well as the lack of works that apply recent recommendation approaches in the Arabic context, we decided to conduct this work in order to answer the following research questions: (RQ1) Are recent RSs applicable in the Arabic context? (RQ2) What is the impact of the preprocessing phase on recent RSs when applied to Arabic content? (RQ3) Does the performance of recent RSs vary when changing the content's language from English to Arabic? These research questions motivated this research work. To the best of our knowledge, no study has been carried out in this direction.

Related works
Over the last decade, many researchers have attempted to enhance rating prediction accuracy by extracting useful information from review texts and incorporating it into the recommendation process [21][22][23][24][25][26][27]. HFT [21] and TopicMF [22] fuse latent factors from ratings with latent features from reviews (inferred via the latent Dirichlet allocation model in [21] and non-negative MF in [22]), using a transform function to predict missing ratings. RBLT [23] and ITLFM [24] discover latent features from reviews using a topic modeling technique; assuming that the latent features and factors share the same space, they linearly combine them in the MF model to predict unknown ratings. EFM [25] extracts feature-opinion pairs from textual reviews, using phrase-level sentiment analysis to build two matrices, user-feature attention and item-feature quality, which are then merged with the rating matrix to estimate ratings in the MF approach. ALFM [26] extracts topics from reviews to infer the importance of the target aspect and then incorporates aspect importance into a Latent Factor Model (LFM) for rating prediction. A3NCF [27] extracts features from reviews using topic modeling, then integrates them with embeddings from ratings into a network architecture to derive users' attention on items. Although all these approaches have been shown to outperform techniques based solely on rating scores (user-item interactions), they remain limited because they treat review texts as simple bags-of-words that do not consider the context of words, consequently losing much semantic information.
To tackle this limitation, several works [28][29][30][31][32][33] have been proposed that model the contextual information of review texts with neural networks. In these works, the Convolutional Neural Network (CNN) architecture is exploited to model users and items from their associated reviews. For instance, ConvMF [28] incorporates a CNN into MF for rating prediction; the latent semantic features are derived from review text using the CNN, which can consider word order and local context. DeepCoNN [29] uses two parallel CNNs to individually capture the latent semantic representations of users and items from their reviews; the obtained representations are concatenated and sent to a Factorization Machine (FM) to estimate unknown ratings. TransNet [30] extends DeepCoNN by introducing an extra layer that learns the latent representation of the target user-item review during the training phase, regularizing the output of this layer based on the learned representation. Another improvement of DeepCoNN is PARL [31], which plugs a plug-and-play model into DeepCoNN to enrich the target user's preferences by exploiting reviews written by similar users, especially when the target user's reviews are incomplete or sparse. CARL [32] learns latent features from reviews using convolution operations and an attention mechanism, then combines them with latent rating embeddings in an FM model to derive missing scores. CARP [33] first extracts logic units (each formed by a user viewpoint and an item aspect) from the users' and items' reviews using a self-attention mechanism incorporated into a convolutional layer; then, through a novel Routing by Bi-Agreement process, it derives sentiment representations at the user-item level for rating prediction.
As this literature review shows, RSs have attracted considerable research interest in the past few years, but mostly for English content. All the works mentioned above have demonstrated high efficiency and accuracy on various English review datasets, including Yelp 1 and/or Amazon, 2 and so on; many high-quality systems are thus now available for English content. However, very little research has investigated RSs using Arabic review datasets. The existing studies that we found are [19,20], which combine Sentiment Analysis (SA) systems with CF-based RSs to predict users' preferences on items. The work in [19] uses three datasets containing, respectively, 100 reviews in Algerian dialect, 1000 reviews in Arabic, and 2000 reviews in French and English. It adopts a semi-supervised support vector machine algorithm to determine the polarity score of the reviews (positive/neutral/negative), which is then integrated into a user-based CF RS to predict missing ratings. In [20], the Opinion Corpus for Arabic (OCA) dataset was used; it contains 500 Arabic reviews about different movies from different websites. A Support Vector Machine (SVM) classifier was adopted to determine the reviews' polarity scores (−1, +1), and this phase's output was then combined with numerical ratings in the MF technique to predict missing ratings. The experimental results of both studies [19,20] showed the positive impact of incorporating SA into RSs. Nevertheless, these surveyed works remain limited because they share several weak points, namely:
• Their reported experiments were conducted on small product datasets of less than a thousand reviews, and none of them considered large Arabic datasets.
• The authors used very simple text analysis techniques to extract information from reviews and did not analyze the impact of the Arabic text preprocessing step on RSs.
• These works use only traditional RSs, in contrast to the modern, high-performing systems adopted in studies devoted to English content.
• None of these works used several Arabic review datasets to properly evaluate its proposed RS in the Arabic context. Thus, a more in-depth empirical study is required.
Current works are thus not yet mature and do not allow conclusions to be drawn about exploiting Arabic content with recent advances in the RSs field. Aiming to shed light on this situation, we report in this work an extensive investigation of incorporating Arabic content into recent recommendation methods.

Methodology
In this section, we present the methodology used to evaluate RSs exploiting textual reviews written in the Arabic language. It consists of running experiments with five review-based recommender engines to evaluate and compare their performance when varying the language of the reviews from the original language (English) to Arabic. Our methodology can be summarized as follows: first, we prepared different English review datasets and generated their equivalent Arabic versions; second, we preprocessed the constructed datasets; third, we applied various recommendation paradigms for rating prediction in both contexts (English and Arabic); finally, we measured and compared their accuracy and efficiency. The details of each phase are explained in the following sub-sections, and the overall process is illustrated in Fig. 1.

Data preparation and generation
Being aware of the lack of resources related to the Arabic language in the RSs field, we decided to build four publicly available Arabic datasets for RSs. Each of these datasets was constructed by preparing and translating English-language reviews obtained from a specific publicly available Amazon dataset (see Footnote 2). Each record in these existing English-language datasets contains one user's review of a specific item in the related category. Thus, each record consists of nine fields [34], namely:
• reviewerID: ID (identifier) of the reviewer on the Amazon platform.
• asin: ID of the item on Amazon websites.
• reviewerName: name of the reviewer.
• helpful: helpfulness score of the review.
• reviewText: text of the review.
• overall: rating score of the item.
• summary: summary of the review.
• unixReviewTime: time of the review (Unix time).
• reviewTime: time and date of the review (raw).
To prepare the datasets used in our work, we implemented a special parser in Python to extract reviews from the original English datasets [JavaScript Object Notation (JSON) files], retaining only the fields we require. For each review, only the reviewerID, asin, reviewText, and overall fields were preserved; the remaining fields were discarded as unnecessary for this task. Thus, each record in our datasets contains one review with the four targeted fields.
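A field-filtering parser of this kind can be sketched as follows; the field names come from the Amazon dataset description above, while the sample record and function name are illustrative, not the authors' actual code:

```python
import json

# Sketch of the field-filtering parser: each line of an Amazon review
# file is one JSON object; only the four required fields are kept.
KEEP = ("reviewerID", "asin", "reviewText", "overall")

def filter_record(line):
    """Parse one JSON review record and keep only the required fields."""
    record = json.loads(line)
    return {field: record[field] for field in KEEP if field in record}

# Illustrative record with the nine-field layout described above.
sample = ('{"reviewerID": "A1", "asin": "B001", "reviewerName": "Sam", '
          '"reviewText": "Great strings.", "overall": 5.0, "summary": "Good"}')
filtered = filter_record(sample)
```

Applied line by line over a JSON-lines dataset, this yields records holding exactly the four targeted fields.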
However, online reviews are usually short, informal texts written by non-experts. They are characterized by everyday language, grammar and spelling errors, and non-standard vocabulary such as replicated characters and informal abbreviations. A review-cleaning stage is therefore essential to ensure the quality of their content. Its main goal is to clean the review text of spelling errors and slang words so that downstream processes can easily resolve their meanings. To achieve this, we proceed as follows. First, all text is converted to lower case to ensure a uniform format; this guarantees that "The" and "the", or "Do" and "do", are treated as the same word. Then, we use regular expressions to eliminate extra characters in any sequence repeated more than twice. After that, we address misspelled words in reviews using a Python spelling corrector called Autocorrect,4 which detects misspelled terms and replaces them with the correct words. Finally, we replace contractions marked by clitic apostrophes with their expanded forms; this is achieved by transforming the Wikipedia English contraction-to-expansion list5 into a Python dictionary and applying a regular expression to expand all contractions found.
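The cleaning steps above can be sketched as follows; the spelling-correction step via the Autocorrect package is omitted to keep the example dependency-free, and the contraction dictionary is a tiny illustrative subset of the full Wikipedia list:

```python
import re

# Sketch of the review-cleaning stage: lowercasing, collapsing character
# runs repeated more than twice, and expanding contractions.
# (Autocorrect-based spelling correction is omitted here.)
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}  # tiny subset

def clean_review(text):
    text = text.lower()                        # uniform casing: "The" == "the"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text) # "soooo" -> "soo"
    for short, full in CONTRACTIONS.items():   # expand clitic contractions
        text = text.replace(short, full)
    return text
```

With the real pipeline, the Autocorrect corrector would run between the character-collapsing and contraction-expansion steps.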
After that, we used a free online Machine Translation (MT) service to generate the Arabic versions of the prepared English datasets. We chose the Google Translate MT tool due to its high efficiency in the translation task. We implemented a Python script using the googletrans library 6 to interact with the Google Translate API. 7 For each review in each of the collected datasets, we sent the content of its reviewText field to the API to obtain its translation into the Arabic language. Figure 2 shows an example of reviews translated from the prepared English datasets.

Reviews preprocessing
Because review texts are unstructured, they are not easy to analyze directly. It is therefore recommended to preprocess all reviews before extracting the feature vectors to be integrated into RSs. To do this, we created our own text preprocessing scheme, which comprises four stages: removal of unwanted characters, tokenization, removal of standard stop words, and removal of domain-specific stop words. Figure 3 illustrates the overall preprocessing flow.

Removing unwanted characters
The first stage of the preprocessing module removes unwanted characters: all marks, extra whitespace, and any other non-alphabetic characters detected in the reviews. This phase is significant because such characters add no value to the reviews' content, and removing them facilitates manipulating and processing the text. Our preprocessing scheme uses Python's re module 8 to remove all the aforementioned unwanted characters from the review text.

Tokenization
Another preprocessing step is tokenization, which splits the text into a set of tokens (words) based on punctuation or whitespace characters. In our scheme, this task was applied to each review text to divide it into words using whitespace characters. Many libraries can perform tokenization, such as NLTK, 9 spaCy, 10 and TextBlob; 11 NLTK was used in this work. The obtained tokens are used in the further preprocessing stages.
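A minimal tokenization step can be sketched as below; the paper uses NLTK, but plain whitespace splitting, as described above, illustrates the same operation without external dependencies:

```python
# Whitespace tokenization sketch. NLTK's word_tokenize would additionally
# separate punctuation; splitting on whitespace suffices after the
# unwanted-character removal stage has stripped punctuation.
def tokenize(text):
    """Split a review into word tokens on whitespace characters."""
    return text.split()

tokens = tokenize("this guitar sounds great")
```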

Standard stop words removal
Stop words such as prepositions, pronouns, articles, and conjunctions occur frequently in reviews. Such words carry less meaning than other keywords and are thus not significant for the feature extraction task of an RS. Therefore, in our study, a stop-word removal step was performed before beginning the recommendation process. There is no single universal stop-word list used by all text mining tools, so in our work we manually created two lists: one for the English language and one for the Arabic language. In this step, each token is compared against the stop-word list for its language and eliminated if it appears there. For instance, words such as (have, he, in, is, it, its) are removed from the English and Arabic reviews.
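The stop-word filtering step can be sketched as follows; the two lists here are tiny illustrative subsets of the manually created lists described above:

```python
# Illustrative stop-word filtering with small manual lists; the actual
# lists used in the study are larger and were built by hand.
EN_STOP = {"have", "he", "in", "is", "it", "its", "the", "a"}
AR_STOP = {"في", "من", "هو", "هي", "على"}  # a few common Arabic stop words

def remove_stop_words(tokens, language="en"):
    """Drop tokens found in the stop-word list for the given language."""
    stop = EN_STOP if language == "en" else AR_STOP
    return [t for t in tokens if t not in stop]
```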

Domain-specific stop words removal
While we removed the frequent stop words, it is also crucial in text mining to identify unusual words with little discriminant value within a specific field or context. These words are called domain-specific stop words because they differ from one domain to another. The metric most widely used for domain-specific stop-word removal is Term Frequency-Inverse Document Frequency (TF-IDF), which computes the relevance of a word in a document based on its frequency of occurrence across several documents. Using this metric in our scheme, we can pick out the most pertinent words, allowing us to better represent the reviews of a target dataset. Thus, we decided to apply the TF-IDF technique on all datasets, retaining the top 20,000 distinct words (those with the highest TF-IDF scores) as the vocabulary for each small dataset (fewer than 15,000 reviews) and the top 50,000 distinct words for each large dataset (15,000 reviews or more). All out-of-vocabulary words are removed and considered domain-specific stop words. After preprocessing, we further filter out records containing empty reviews.
(Footnotes: 8 https://docs.python.org/3/library/re.html; 9 https://pypi.org/project/nltk/; 10 https://pypi.org/project/spacy/; 11 https://pypi.org/project/textblob/.)
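A hand-rolled sketch of this TF-IDF-based vocabulary selection is shown below. The study retains the top 20,000 or 50,000 words; k = 3 here for illustration, and summing per-document TF-IDF scores is one common scoring choice among several, not necessarily the authors' exact variant:

```python
import math
from collections import Counter

# Sketch of domain-specific stop-word removal via TF-IDF: score every
# word over the review corpus, keep the top-k as the vocabulary, and
# treat everything outside it as a domain-specific stop word.

def tfidf_vocabulary(docs, k):
    """Return the k words with the highest summed TF-IDF scores."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            scores[w] += (c / len(doc)) * math.log(n / df[w])
    return {w for w, _ in scores.most_common(k)}

docs = [["good", "strings", "good"], ["bad", "strings"], ["good", "tone"]]
vocab = tfidf_vocabulary(docs, k=3)
filtered = [[w for w in doc if w in vocab] for doc in docs]
```

Note that a word appearing in every document gets IDF log(1) = 0, so corpus-wide words drop out of the vocabulary automatically, which is exactly the stop-word behavior sought here.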

The recommender systems used
In our experiments, the following five RSs were used: ALFM, A3NCF, PARL, CARL, and CARP. These systems are presented below; we refer readers to the original papers [26,27,[31][32][33]] for more details.

ALFM
ALFM is a state-of-the-art recommender model that combines the Latent Dirichlet Allocation (LDA) paradigm with probabilistic MF for rating prediction. The model first runs an LDA-based algorithm on review texts to model a user's preferences and an item's properties across different aspects, thus capturing the importance of each aspect for the user and item. It exploits an Aspect-aware Topic Model (ATM) to model aspect importance for a target user/item as a probability distribution over composite topics, each represented by a set of words from reviews. The output of the ATM is then combined with ALFM, which associates latent factors with the different aspects using the MF approach, so that the model can predict aspect ratings. Finally, the overall rating is obtained as a linear combination of the aspect ratings, weighted by the importance of the corresponding aspects.

A3NCF
A3NCF is a state-of-the-art RS that fuses topic modeling and deep learning. The system first adopts LDA to obtain topic vectors of users and items from textual reviews. It thus models users' and items' feature vectors over different topics (the aspects of items that users discuss in reviews) as probability distributions over words referring to the same topic. For each user and item, the related topic vectors and embedding vectors (from ratings) are fused in an attention neural network to learn their final representations, taking into account the user's attention weights over the different aspects of the target item. The model represents each user and item as a one-hot encoded vector and passes them through an embedding layer to obtain the embedding features. Finally, an attentive interaction between the user's and item's final representations is fed into fully connected layers (Multi-Layer Perceptron) with regression to predict the final ratings.

PARL
PARL is a plug-and-play deep-learning architecture that has been plugged into DeepCoNN, one of the state-of-the-art review-based RSs, to improve its prediction accuracy on different user-item pairs. DeepCoNN uses two parallel CNNs and a word embedding method to capture latent representations from the review words associated with the target user and item; it concatenates the user and item vectors and transmits the result to a regression layer involving the FM method to predict ratings. Although DeepCoNN has shown good effectiveness for the rating prediction task, it remains limited: review sparsity is one of its significant limitations when review texts are short and scarce, since the CNN can extract few useful features from incomplete text data. To alleviate this limitation, PARL extracts useful user-item pair-dependent features from a user's auxiliary reviews (written by other users who gave the same rating scores as the target user). Like DeepCoNN, PARL feeds the extracted auxiliary reviews into CNN layers to transform them into feature vectors. To preserve the useful features for each target user-item pair, PARL passes the obtained feature vectors through an abstracting layer involving a highway network and a gating mechanism. For the final ratings, the auxiliary vectors are combined with their corresponding user vectors in DeepCoNN.

CARL
CARL is a state-of-the-art RS that learns context-aware representations for each user-item pair based on their characteristics and interactions, exploiting both textual reviews and user-item interaction data. CARL consists of two learning components: review-based feature learning and interaction-based feature learning. The first adopts a CNN (with two parallel subnetworks) and an attention mechanism to jointly learn useful latent features for a user-item pair from the user and item reviews; in this component, CARL also uses an abstraction layer with an average pooling strategy to obtain pertinent latent features. In the interaction-based component, complementary features are learned for each user and item from their interaction data. To produce a rating from each module, CARL feeds the component's latent representation into an FM. The final rating score is then computed by a dynamic fusion strategy that merges both components' ratings.

CARP
CARP is a state-of-the-art RS that uses a capsule network to extract semantic contextual information from reviews for rating prediction. CARP is based on two modules: viewpoint-and-aspect extraction and sentiment capsules. The first adopts a variant of self-attention stacked over a convolutional layer to capture logic units, each formed by a user viewpoint and an item aspect extracted from the user and item reviews. In the second, positive and negative capsules are exploited through a Routing by Bi-Agreement process to jointly select the informative logic units and produce output vectors that encode their sentiments. Each constructed vector encodes the user's attitude toward a given item under the target sentiment, and the lengths of the vectors indicate the probability of each of the two sentiments. Finally, to predict the missing rating of a user-item pair, the magnitudes and odds of the two corresponding sentiment poles are passed through a one-layer highway network and then a rescaled sigmoid function.

Experiments
This section presents our experiments: first, we introduce the datasets used; second, we describe the experimental settings and the evaluation metric; finally, we discuss and interpret the empirical results.

Datasets
To answer the research questions stated in the "Research motivation" section, we ran our experiments on several English-language review datasets and their equivalent Arabic versions. Specifically, we used eight datasets. Four 5-core datasets, Musical Instruments (MI), Patio Lawn and Garden (PLG), Automotive (Auto), and Instant Video (IV), were chosen from Amazon and contain English-language reviews of various products. The other four are the Arabic versions of the first four: they contain the same reviews in the Arabic language. All datasets were built following the preparation and generation process described in the "Methodology" section. Table 1 summarizes the statistics of the datasets used.

Experimental settings
For our evaluations, we randomly selected 80% of each dataset as the training set and the remaining 20% as the test set. We trained the adopted models on the training set and evaluated their performance on the test set. To ensure that test-set reviews are unavailable during the recommendation process, as in real-world applications, we used review information only from the training set.
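The 80/20 split above can be sketched as follows; the seed and helper name are illustrative, not taken from the paper:

```python
import random

# Sketch of the 80/20 random train/test split used in the protocol;
# a fixed seed makes the split reproducible across runs.
def train_test_split(records, train_ratio=0.8, seed=42):
    """Shuffle records and cut them into train and test portions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```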
The parameters of all evaluated models (ALFM, A3NCF, PARL, CARL, and CARP) are set as reported in their corresponding papers with the best performance. As the evaluation metric, we used the Root Mean Square Error (RMSE), which is broadly used in related works for performance evaluation [35][36][37], formulated as:

RMSE = \sqrt{\frac{1}{T} \sum_{(u,i)} \left( \hat{r}_{u,i} - r_{u,i} \right)^2}

where T is the total number of data points being tested, \hat{r}_{u,i} is the predicted rating, and r_{u,i} is the real rating.
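Assuming the standard RMSE definition, a minimal implementation is:

```python
import math

# RMSE: square root of the mean squared difference between predicted
# and real ratings over the T test points.
def rmse(predicted, actual):
    """Root Mean Square Error over paired prediction/ground-truth lists."""
    assert len(predicted) == len(actual)
    t = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / t)
```

Lower RMSE means more accurate rating prediction, which is the sense in which the scores in Table 2 are compared.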

Experimental results and discussions
This sub-section presents the experiments conducted to address our research questions RQ1, RQ2, and RQ3 ("Research motivation" section).
The analysis of the empirical findings is organized as follows: we first discuss the applicability of modern RSs in the Arabic context; next, we evaluate the effect of applying a preprocessing scheme to Arabic texts before incorporating them into RSs; finally, we assess the performance of recent RSs in the Arabic context.

(RQ1) Verifying the applicability of recent RSs in the Arabic context
Appendix 1: Table 2 and Fig. 4 (comparing the adopted recommendation models in the English and Arabic contexts) show the performance of the five RSs ALFM, A3NCF, PARL, CARL, and CARP in both contexts, English and Arabic. From the results, we note that all tested RSs worked well on the four Arabic datasets (Arabic_MI, Arabic_PLG, Arabic_Auto, and Arabic_IV); overall, they reached good RMSE scores. According to the achieved RMSE scores, each RS successfully predicted most of the hidden ratings when using Arabic review texts. Therefore, we can conclude from our results that it is possible to apply modern recommender engines to Arabic content.

(RQ2) Analyzing the impact of the preprocessing phase on recent RSs when applied to Arabic content
As done by other researchers in [19,20], and because of the Arabic language's complex structure [17,18,38], we opted to preprocess the Arabic-language reviews before incorporating them into the RSs. In this part, however, we aim to verify whether the preprocessing stage is actually necessary for applying these RSs to the Arabic datasets. To do so, we tested the five RSs on the Arabic review datasets in two cases: with the preprocessing stages and without them. The results are shown in Appendix 1: Table 3 and Fig. 5. Comparing the results of ALFM, A3NCF, PARL, CARL, and CARP in both cases, we note that each of these RSs maintained the same performance on each preprocessed review dataset (Arabic_MI, Arabic_PLG, Arabic_Auto, and Arabic_IV) and its non-preprocessed version (Arabic_MI*, Arabic_PLG*, Arabic_Auto*, and Arabic_IV*), respectively. These results confirm that the preprocessing stage is not essential for modern RSs to achieve good accuracy and performance on Arabic texts: such systems use effective feature extraction techniques that capture the relevant information directly from the texts.

(RQ3) Analyzing the performance of the recent RSs in the Arabic context
In this part, we compare the performance of the RSs (ALFM, A3NCF, PARL, CARL, and CARP) in the English and Arabic contexts. In particular, we aim to verify whether the recommendation models' differences in performance are statistically significant when changing the content's language. The experimental results are shown in Appendix 1: Table 2 and Fig. 4. The results show that ALFM maintained very close performance on each English dataset and its Arabic version: the accuracy decrease on the four dataset pairs (MI and Arabic_MI, PLG and Arabic_PLG, Auto and Arabic_Auto, IV and Arabic_IV) is 0.21%, 0.21%, 0.11%, and 0.20%, respectively. Similarly, A3NCF maintained very similar accuracy on each English dataset and its Arabic version, with decreases of 0.21%, 1.06%, 1.12%, and 0.78%, respectively. Such accuracy differences are very minor and insignificant. We attribute them to the fact that these two RSs use Bag-of-Words (BoW) techniques for review text processing, which are negatively impacted by the noise and irrelevant information introduced into the reviews during the translation phase; the results of that phase depend on the quality of the translation tool.
On the other hand, the performance of PARL, CARL, and CARP does not degrade when changing the language of the review texts: each of these three RSs maintained the same rating prediction accuracy (an accuracy decrease of 0%) on each English dataset and its Arabic version. We suggest that this is because the translated reviews are processed by deep learning architectures, which help filter out noisy and unimportant information and build appropriate feature representations for RSs. These results confirm conclusions obtained in other studies on English content [39], which show that the modularity of neural network architectures allows handling heterogeneous and unstructured text content.
Considering these results, we can confirm that ALFM, A3NCF, PARL, CARL, and CARP do not lose performance when the application context changes from English to Arabic. Consequently, we conclude that applying recent RSs to Arabic content provides rating prediction accuracy as good as when using English content.

Conclusion and future work
This article provides a comprehensive evaluation of modern RSs applied to Arabic content. To that end, extensive experiments were conducted using different Arabic review datasets, built by leveraging English-language reviews of different product categories on the Amazon platform. The experiments used five recent RSs, each tested on datasets in both languages: the original English ones and their corresponding Arabic versions. This study was conducted to achieve three objectives.
The first was to verify the applicability of recent RSs to Arabic content; our experimental results showed that the RSs used can perform rating prediction tasks while exploiting Arabic data. The second was to determine whether a particular preprocessing phase must be applied to Arabic content before incorporating it into RSs; we tested each RS on Arabic reviews with and without preprocessing, and the results showed that the preprocessing stage does not impact the RSs' performance. The third was to evaluate whether the performance of these RSs varies when changing the application context from the original (English) to Arabic; the experimental findings showed that the adopted RSs maintained the same performance in both contexts.
The results demonstrated the important potential of recent advances in the RSs field when exploiting Arabic content, in terms of both accuracy and performance. However, it should be noted that in our experiments we used only an MT tool to translate review texts from English into Arabic to build the Arabic data. As future work, we will therefore test on original Arabic-language datasets, collecting real reviews from online Arabic stores, to better assess RSs in the Arabic context.
We hope this work will encourage the research community and serve as a road map to advance the Arabic RSs field. Such advancement would positively impact various Arabic-language sectors, including e-commerce, e-learning, e-tourism, e-government, social networks, etc.