A hybrid semantic query expansion approach for Arabic information retrieval

Most information retrieval systems retrieve documents by keyword matching, and therefore fail to retrieve documents that express a similar meaning with syntactically different keywords (forms). One of the well-known approaches to overcome this limitation is query expansion (QE). Several approaches to query expansion exist. The statistical approach depends on term frequency to generate expansion features; however, it does not consider meaning or term dependency. The semantic approach depends on a knowledge base that has a limited number of terms and relations. In this paper, researchers propose a hybrid approach for query expansion that utilizes both the statistical and the semantic approach. To select the optimal terms for query expansion, researchers propose an effective weighting method based on particle swarm optimization (PSO). A system prototype was implemented as a proof of concept, and its accuracy was evaluated. The experiments were carried out on a real dataset. The experimental results confirm that the proposed approach enhances the accuracy of query expansion.

Query expansion is a technique that expands the initial query by adding terms that are semantically similar to the original user query. Several approaches have been introduced to process user queries. Traditional query expansion methods rely on statistical models such as TF-IDF and BM25 [7,8]. The statistical methods depend on term-based document retrieval, which generates queries that capture the user's interests from a collection of documents. Although these methods are effective, they cannot always return accurate information for the user query, since they treat terms as atomic units of information and disregard syntactic and semantic similarities between terms. An alternative to the statistical method is the semantic method, which attempts to find the candidate terms based on the representative meaning of the query in its context [9]. The semantic methods rely on external semantic resources such as WordNet and domain ontologies. However, a complex language such as Arabic suffers from a lack of semantic resources. Therefore, a hybrid method is needed to enhance the performance of query expansion, especially for the Arabic language.
Therefore, this paper proposes a hybrid semantic query expansion approach for Arabic information retrieval, which calculates the weight of each term based on three sources of information retrieval evidence, namely word embedding, WordNet, and term frequency. To remove noise from the generated terms, particle swarm optimization (PSO) is used as a semantic filter to avoid the query drift problem. The rest of the paper is organized as follows: "Background and preliminaries" section presents a brief background and preliminaries. "Related work" section reviews the related query expansion studies. The proposed approach is presented in "Framework for proposed approach" section. "Experiments and evaluation" section presents the experimental results and discussion. Finally, "Conclusion and future work" section concludes the study and outlines the future work.

Background and preliminaries
There are several approaches to query expansion. The methods related to our approach are presented in the following subsections.

Term frequency
In document retrieval theory, documents and queries are represented as vectors in a vector space. Each term in the vector has a weight that represents the importance of the term in the document as a whole. Several researchers have suggested different weighting functions. Luhn [10] studied term distribution to assign a weight to a term according to its frequency. A few years later, researchers enhanced term frequency performance by computing the number of terms relative to the document length. Ounis [11] studied the effect of the document length in the collection. Although this method is accepted due to its simplicity and efficiency, it ignores the order of and the semantic relations between terms. In addition, it suffers from data sparsity. These limitations make it undesirable for measuring word similarity.

WordNet
Most query expansion methods utilize a knowledge resource such as WordNet. WordNet is a global lexical database that organizes terms holding identical meanings into sets called synsets [12]. These synsets are connected to each other through pre-defined lexical relations. Arabic WordNet was constructed by adapting the EuroWordNet construction method [13]. Arabic WordNet contains 11,269 concepts [12], compared with English WordNet, which contains 155,287 concepts [14]. Arabic WordNet is commonly used for query expansion, where appropriate senses are linked to the original query to provide the desired conceptual information. Voorhees [15] mentioned that this approach makes little difference in retrieval effectiveness when the initial query is not well formulated, whereas a well-formulated query improves retrieval effectiveness significantly. Furthermore, using a lexical resource alone for query expansion can cause topic drift, because inappropriate changes to the query cause it to match other, semantically similar terms. Unfortunately, using WordNet in the query expansion process also generates some noisy terms. Gong, Cheang, and Hou [16] used a term semantic network to filter out the noisy terms.

Word embeddings
To obtain effective semantic term representations, term representations may be learnt implicitly by using latent Dirichlet allocation [17] and latent semantic analysis [18]. These methods still treat corpora as a "bag of words". Hence, they are not effective in capturing the semantics behind the text. Recently, neural network language models [19] have been used to model languages with promising results. Word embeddings are a family of language modeling techniques, such as word2vec [20] and GloVe [21]. Word embeddings map each word to a vector of real numbers. The vector values are learned in a way that resembles a neural network. Consequently, the technique is regularly lumped into the field of deep learning. The main idea behind word embeddings is to find a dense, low-dimensional, real-valued vector for each term within its context. The generated embeddings represent the syntactic and semantic relations between terms. In embedding spaces, words that have similar meanings should have similar representations. In addition, embedding spaces exhibit linear structure, so that offsets between word embeddings can be interpreted as relations [20]. This allows vector-oriented reasoning based on the offsets between words.
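As an illustration only, the following minimal Python sketch shows cosine similarity between word vectors and the offset-based reasoning mentioned above. The three-dimensional vectors and the English words are hypothetical toy values chosen by hand; they are not learned embeddings and do not come from the corpus used in this paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (hand-picked, not trained).
embeddings = {
    "king":  [0.8, 0.7, 0.1],
    "queen": [0.8, 0.1, 0.7],
    "man":   [0.2, 0.7, 0.1],
    "woman": [0.2, 0.1, 0.7],
}

def analogy_target(a, b, c):
    """Return the vector a - b + c, computed component by component."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Offset reasoning: king - man + woman is expected to land near queen.
target = analogy_target(embeddings["king"], embeddings["man"], embeddings["woman"])
nearest = max(embeddings, key=lambda w: cosine_similarity(embeddings[w], target))
print(nearest)
```

In this toy space the word closest to the offset vector is "queen"; with real embeddings the input words are usually excluded from the search.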

Particle swarm optimization
Particle swarm optimization is a population-based stochastic nonlinear optimization technique. It is inspired by the social behavior of birds, and it looks for an optimal solution in a search space [22]. Each solution in the search space is called a particle. All particles are initialized with a velocity, a position, and a fitness value, which is calculated by using an objective function. The algorithm is guided by personal experience (pbest), overall experience (gbest), and the present movement of the particles to decide their next positions in the search space. Further, the experiences are accelerated by two factors known as c1 and c2, together with two random numbers drawn from [0, 1]. In each iteration, the pbest and gbest values are calculated. After finding the two best values, each particle updates its velocity and position.
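The update rule described above can be sketched in Python as follows. This is a generic, textbook-style PSO that minimizes a toy objective, not the implementation used in this paper; the function name pso_minimize, the search range [-5, 5], the default parameter values, and the fixed random seed are all assumptions made for illustration.

```python
import random

def pso_minimize(objective, dim, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic PSO sketch: returns (best position, best fitness)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_fit = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # global best so far
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()  # random numbers in [0, 1)
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = objective(pos[i])
            if fit < pbest_fit[i]:                   # update personal best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:                  # update global best
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Toy usage: minimize the sphere function x^2 + y^2.
best, fit = pso_minimize(lambda p: sum(x * x for x in p), dim=2)
print(best, fit)
```

To maximize a fitness value instead, as in term weighting, the same loop can be run on the negated objective.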

Related work
To overcome the word mismatch problem in information retrieval, researchers have proposed many solutions. Most early studies of IR for the Arabic language focused on morphological analysis of the documents. Many efforts have also focused on developing Arabic stemmers such as [23][24][25], which depend on a set of rules and use a lookup table for roots. Al-Serhan and Ayesh [26] tackled this drawback by utilizing a neural network to extract Arabic roots. Although stemming significantly increases IR performance, most stemming techniques introduce a large amount of noise into documents. Elayeb and Bounhas [27] explained the limitations of morphological analysis in Arabic IR. Traditionally, documents and queries are represented as vectors in a vector space, and each item in the vector has a weight that reflects its importance. Different weighting functions have been suggested. Luhn [10] assigned a weight to a term based on its frequency. Ung and Park [3] proposed a term weighting function that considers both the occurrence and the absence of terms. Although the results obtained by these methods are reasonably good, they do not consider semantically similar terms. Bai [28] selected expansion terms by computing correlations between pairs of terms using association rules [5].
One of the most well-known approaches to overcoming the limitations of the keyword-based method is to use a thesaurus or domain ontology to rephrase the query based on its context [29,30]. Yokoyama and Klyuev [31] used Japanese WordNet for query expansion. Alzahrani and Salim [32] used fuzzy concepts to assign a value between 0 and 1 that reflects the degree of similarity between Arabic documents based on an ontology. Chauhan, Zhai and Zhou [33,34] exploited a sport domain ontology to develop a semantic IR system. Khan [35] developed a semantic web search based on ontology. Although these approaches are effective, most complex languages like Arabic have scarce semantic resources such as lexicons and ontologies. Traditional information retrieval models treat queries as a set of unrelated terms, disregarding the semantic relationships among them. To enhance the performance of information retrieval, semantic methods utilize document co-occurrence statistics [18] and probabilistic latent semantic analysis [36] to represent terms. Although these models have achieved good performance, they are costly and time consuming. To learn a viable representation of a term based on its context, distributed word representations, also known as word embeddings, have been introduced in the information retrieval field [37,38]. Diaz, Mitra, and Roy [39,40] used contextually associated words generated from word embeddings to extend the user query. Liu [41] used fuzzy rules to reweight the expansion words generated from word embeddings.
The reviewed studies show some limitations. Some studies focused on the statistical method, which depends on exact matching to generate the expansion terms and therefore neglects any potential semantic matching. Other studies attempted to tackle this limitation by using semantic methods, which utilize a knowledge base during the generation of expansion terms. Yet this method suffers from the limited number of terms and relations included in the knowledge base. Therefore, this study proposes a hybrid approach that utilizes both statistical and semantic methods in order to overcome these limitations and to produce better results.

Framework for proposed approach
The architecture of the proposed approach is illustrated in Fig. 1, and its pseudo code is given in Table 1. Fig. 1a presents the overall architecture, and Fig. 1b gives a detailed view of the proposed approach. First, the user submits a query to retrieve the desired results. This query is the input of the proposed approach. Then, the initial user query is processed: the meaningful concepts are extracted and processed using the Khoja algorithm [42]. In order to get a rich set of associated terms, the initial user query is expanded by combining candidate terms from the aforementioned information sources, namely WordNet, word embeddings, and term frequency. The query is refined in three stages, as shown in Fig. 1b. In the first stage, the synonyms for each t_i in the user query are obtained using Arabic WordNet. They are combined with the seed query terms for further expansion in the second stage. In the second stage, Word2Vec is used to extract more semantically similar terms for each t_i from the previous stage by computing its cosine distance from the original word in the vector space. Word2Vec (skip-gram) is chosen for our word embedding process because it has proven to be useful in capturing intensive representations of a word based on its context [2]. In order to find further candidate expansion terms, the most frequent terms are calculated in the third stage. Frequent terms are calculated using the RapidMiner tool on a collection of documents that are retrieved at top ranks in response to the initial query.
The expanded terms generated by the three stages are called candidate terms and form a candidate term pool. The values of the three sources of IR evidence, namely word embeddings, WordNet, and term frequency, are computed for each term in the candidate pool. Each evidence has a weight that represents its importance for the candidate terms. A PSO-based term weighting approach is applied to find optimal weights for the three IR evidences and to determine the final weight of each candidate term, as shown in the "Proposed PSO term-weighting approach" section. After the weights of the original query terms and candidate terms are computed, all terms are arranged in descending order of their final weights, and the top M terms are selected for query expansion. Finally, the selected expansion terms are added to the original query.

Proposed AQE approach
The proposed approach aims to retrieve more relevant documents by providing a convenient way of finding terms that are semantically related to any given query. In this section, researchers describe how the extended query term set is obtained based on three sources of IR evidence, namely word embeddings, WordNet, and term frequency. Given a query q consisting of m terms, researchers construct q_c, the set of synonyms for each t_i in the user query. First, the synonyms for each t_i in the user query are obtained using Arabic WordNet, as in Eq. 1,

$$q_c = \bigcup_{t_i \in q} \mathrm{syn}(t_i) \tag{1}$$

where syn(t_i) are the synonyms of term t_i in the user query. Next, researchers define an extended query term set (EQTS) q′ as Eq. 2:

$$q' = q \cup q_c \tag{2}$$

q′ is sent to the second stage for further expansion. In this stage, Word2Vec is used to extract more semantically similar terms for each t_i from the previous stage by computing its cosine distance from the original word in the vector space. Word2Vec (skip-gram) is chosen for our word embedding process because it learns high-quality word embeddings efficiently from large-scale unstructured text data [2]. Researchers define the set C of candidate expansion terms as Eq. 3, where NT(t) is the set of K terms that are nearest to t_i in the embedding space. Next, researchers define an extended query term set (EQTS) q′ as Eq. 4. In order to find further expansion terms, we select the most frequent m terms from a set of pseudo-relevant documents (PRD), which are retrieved at top ranks in response to the initial query. The size of the PRD and the number of selected terms M may be varied as parameters. All the expanded terms are added to the original query, and together they constitute a set of obvious candidates from which terms may be chosen and utilized to expand q. Some of the obtained candidate terms may not be related to the meaning of the query as a whole. Therefore, a term-weighting function is used to select the most suitable terms by assigning a weight to each term in the candidate pool. The term weighting function is mathematically represented by Eq. 5 and discussed in the following section. It is based on three sources of evidence, namely word embeddings, WordNet, and term frequency. Each IR evidence has its own weight, which is multiplied by the corresponding IR evidence value.

Table 1 Pseudo code of proposed system
Step 1 For each term t_i in query q: construct the set of synonyms q_c based on WordNet
Step 2 Create the extended query set q′ by unifying the original query q with q_c
Step 3 For each term t_i′ in q′: extract the most similar relevant sense of the term within the query context based on word2vec (c)
Step 4 Select the most frequent m terms from PRD
Step 5 Unify m with c to generate the final candidate terms that produce the sense of the query context
Step 6 For each final candidate term tf from step 5: compute the average weight using Eqs. 6, 7, 8
Step 7 Select the optimal average weight for each term in the final candidate terms using the PSO algorithm
Step 8 Unify the top optimal terms from step 7 with the original query q
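The construction of the candidate pool from the three sources can be sketched in Python as follows. This is a minimal sketch under stated assumptions: plain dictionaries stand in for Arabic WordNet (synonyms) and for the Word2Vec nearest-neighbour lookup (neighbours), the words are English toy values, and the function name build_candidate_pool is hypothetical. Pooling the synonyms together with the embedding neighbours and the frequent PRD terms, and removing the original query terms from the pool, are also assumptions of this sketch.

```python
from collections import Counter

def build_candidate_pool(query, synonyms, neighbours, prd_docs, m):
    """Return (extended query set, candidate term pool)."""
    # Stage 1: synonyms of each query term (stand-in for Arabic WordNet).
    q_c = set()
    for term in query:
        q_c.update(synonyms.get(term, []))
    q_ext = set(query) | q_c                 # extended query term set
    # Stage 2: nearest terms in the embedding space for each extended term.
    c = set()
    for term in q_ext:
        c.update(neighbours.get(term, []))
    # Stage 3: the m most frequent terms in the pseudo-relevant documents.
    counts = Counter(tok for doc in prd_docs for tok in doc)
    frequent = {term for term, _ in counts.most_common(m)}
    # Assumption: the pool holds all new terms, without the query itself.
    pool = (q_c | c | frequent) - set(query)
    return q_ext, pool

# Hypothetical toy resources.
query = ["car", "price"]
synonyms = {"car": ["automobile"], "price": ["cost"]}
neighbours = {"car": ["vehicle"], "automobile": ["motor"], "cost": ["fee"]}
prd_docs = [["car", "dealer", "price"], ["dealer", "car", "loan"], ["dealer", "price"]]

q_ext, pool = build_candidate_pool(query, synonyms, neighbours, prd_docs, m=1)
print(sorted(q_ext))
print(sorted(pool))
```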

Proposed PSO term-weighting approach
Some of the obtained candidate terms are proximate neighbours of individual query terms only, whereas it is preferable to consider terms that are close to the meaning of the query as a whole. The proposed PSO term-weighting function aims to select the most suitable terms: it assigns a weight to each term in the candidate pool and uses these weights to choose the extra terms that are added to the query. The proposed term weighting function is based on three sources of evidence, namely word embeddings, WordNet, and term frequency. The values of this evidence are computed for each term in the candidate pool, as represented in Eq. 5, where sim(Q, t) is the similarity value between a term t_i in the candidate pool and all the terms in Q. The first element of the proposed term weighting function is w2v. Eq. 6 is used to compute the mean cosine similarity between t_i in the candidate pool and all the terms in Q in the embedding space. The second element is wn. This element indicates the mean cosine similarity between t_i in the candidate pool and all terms in Q, as expressed in Eq. 7. The third element is tf, which is one of the IR evidences used in many term-weighting functions. This element indicates the number of occurrences of a term in the collection. To restrict the search domain of candidate terms, researchers consider only the pseudo-relevant documents: tf(t) is the number of times the candidate term appears in the pseudo-relevant documents (PRD). Each evidence has a weight that represents its importance for the candidate terms; each weight lies between 0 and 1, and the weights sum to 1. The ideal values of the evidence weights were found using PSO.
The initial values of the weights are taken as the positions of the particles. Each weight is multiplied by the value of its evidence, and the products are summed to calculate the final weight of the term, w_Score, as represented in Eq. 9. The initial values of c1, c2, the population size, w, the velocities, and the maximum number of iterations were set, and w_Score was used as the fitness value. In each iteration, the fitness value of each particle is compared with those of the other particles to get the best value (best). To obtain the global best value (gbest), the best fitness of the current population is compared with the best fitness of the previous population. At the end of each iteration, the positions and velocities of all particles are updated, and the maximum number of iterations is checked. The final weight of each term was maximized so that the most appropriate candidate terms for query expansion could be identified. All terms were arranged in descending order according to their final weights, and the top M terms were selected for query expansion. Finally, the chosen expansion terms were added to the original query.
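A minimal Python sketch of the two ingredients described above is given below: the mean cosine similarity between a candidate term and all query terms, and the weighted combination of the three evidence values. The function names mean_similarity and w_score are hypothetical, the evidence values are assumed to be already scaled to [0, 1] (for example, tf divided by the largest tf in the pool), and the example numbers are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_similarity(term_vec, query_vecs):
    """Mean cosine similarity between one candidate term and all query terms."""
    return sum(cosine(term_vec, q) for q in query_vecs) / len(query_vecs)

def w_score(w2v, wn, tf, weights):
    """Weighted sum of the three evidence values; weights must sum to 1."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("evidence weights must sum to 1")
    return a * w2v + b * wn + c * tf

# Toy usage with invented evidence values and weights.
print(mean_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
print(w_score(0.8, 0.5, 0.2, (0.5, 0.3, 0.2)))
```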
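The final selection step can be sketched as follows. The paper does not state how the sum-to-one constraint on the weights is enforced during the PSO search, so normalize_weights shows one possible choice (clipping negative values and rescaling) and is an assumption; expand_query then ranks the candidates by their final weight and appends the top M to the query. Both function names and the example scores are hypothetical.

```python
def normalize_weights(position):
    """Map a particle position to non-negative weights that sum to 1."""
    clipped = [max(0.0, x) for x in position]
    total = sum(clipped)
    if total == 0.0:                         # degenerate case: fall back to uniform
        return [1.0 / len(position)] * len(position)
    return [x / total for x in clipped]

def expand_query(query, scores, m):
    """Append the m highest-scoring candidate terms to the original query."""
    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    chosen = [t for t in ranked if t not in query][:m]
    return list(query) + chosen

# Toy usage with invented final weights.
print(normalize_weights([2.0, 1.0, 1.0]))
print(expand_query(["car"], {"vehicle": 0.9, "fee": 0.2, "dealer": 0.6}, m=2))
```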

Experiments and evaluation
In this section, we present a set of experiments to evaluate the performance of our proposed approach. The results show that the proposed approach achieves the best performance among all the compared approaches.

Experimental environment
This experiment was run on a Dell desktop computer with a 64-bit Intel i5-3470 CPU running at 3.20 GHz and 8 GB of RAM, under Microsoft Windows 7 Professional with Service Pack 1.

Building corpora (index) and query designing
$$\mathrm{sim}(t, q) = \frac{1}{|q|} \sum_{t_i \in q} t \cdot q_i \tag{7}$$

Due to the lack of available Arabic corpora, an Arabic corpus was collected from different Arabic news websites using the VietSpider program. Approximately 72 h were needed to collect 8 GB of Arabic web pages from well-known news websites such as Al-Alam, BBC, CNN, and Al-Jazeera. The collected HTML pages were passed through a series of pre-processing stages, including the removal of non-essential HTML tags from the HTML files. Arabic stop words such as conjunctions, prepositions, and articles do not have any effect in the text mining process [43]. Furthermore, they increase the dimensionality of the text. Therefore, stop words were removed to reduce text dimensionality. Although some Arabic stop word lists are available in different studies such as [4], none of them has shown efficiency in Arabic information retrieval. Therefore, our own Arabic stop word list was used. Non-stop words were stemmed using the Khoja algorithm [42]. For effective results, the process of removing stop words was combined with the stemming process [5]. These preprocessing steps reduced the corpus size to 1 GB. Statistical information about the dataset is provided in Table 2. The processed files were used for training and indexing purposes. For index creation, the processed web pages were indexed using Lucene. The processed files were then dumped as raw text for the purpose of training the neural network of the Word2Vec framework. The parameters of Word2Vec were set as follows: word vector dimensionality 300; negative samples 25; and window size five words. These are part of the parameter settings described in the following section. In addition, 40 query documents (QDocs) were designed manually by an expert in the Arabic language to verify the correctness of our approach. Due to space restrictions, Table 3 presents only eight queries, which were selected randomly as an exemplary sample, together with the expansion terms obtained by WordNet, W2V, TF, and the proposed PSO-based approach. Table 4 presents the selected expansion terms and their fitness values.

Parameter setting
The proposed query expansion approach has two parameters: N, the number of top-ranked documents selected as the most relevant documents for query expansion (the size of the PRD), and M, the number of candidate terms selected for query expansion. To find the best performance of the proposed approach, a set of experiments was performed to select suitable values of N and M. The results of these experiments are presented in the following subsections.

Number of top ranked documents (N)
Selecting a proper number of PRD documents is an important issue for query expansion. Therefore, a set of experiments was performed to check the impact of the PRD size on IR performance. Table 5 presents the results for PRD sizes varying from 5 to 20.
As Table 5 shows, neither a low nor a high number of documents yields the best performance of the proposed approach in terms of mean average precision (MAP). The highest precision was achieved when the N parameter is set to 10.
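For reference, mean average precision can be computed as in the following Python sketch. The paper does not spell out its MAP formula, so the sketch uses the standard definition (average precision per query, averaged over queries); the document identifiers and relevance judgments are invented toy values.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant document occurs."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant documents), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy usage with two invented queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```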

Number of candidate terms selected for query expansion (M)
A number of candidate expansion terms are generated by the proposed approach. Selecting the M most relevant candidate terms from the whole candidate set is an important issue. The number of candidate terms added to the submitted queries was tuned through several experiments, as shown in Table 6, to ensure the accuracy of the obtained results.

Table 3 Selected terms for the randomly chosen query
As Table 6 shows, neither a low nor a high number of candidate terms yields the best performance of the proposed approach in terms of mean average precision (MAP). The highest precision was achieved when the M parameter is set to 6. It is clear from the experimental results that a suitable combination of the two parameters (the size of the PRD and the number of candidate terms) has a positive effect on IR performance in terms of MAP.

Experimental results
To evaluate the performance of the proposed approach, two analyses are carried out. First, researchers use standard evaluation metrics: precision, recall, and F-measure. Second, to make the results more reliable, a statistical analysis is performed. During the evaluation, the designed queries were used. The documents relevant to each query were determined by the Arabic domain specialist who provided the test queries.

Overall performance
The F-measure for the top ten retrieved documents is computed for the proposed approach and compared with the original query, TF, WordNet, and W2V approaches, as shown in Fig. 2. The proposed approach obtains higher F-measure values than the other approaches. It is clear from the experimental results in Fig. 2 that the proposed approach performs better than the original query for all forty queries. The proposed approach also obtains higher F-measure values for 38 and 36 queries in comparison with the W2V and WordNet approaches, respectively, while the F-measure values are equal for two and four queries, respectively. Furthermore, the proposed approach achieves higher F-measure values than the TF approach for 39 queries. Four experiments were also conducted to evaluate the accuracy of the proposed approach, which was calculated using Eq. 10 and compared against the accuracy of four other approaches: queries without expansion, the w2v-based approach, the TF-based approach, and the WordNet-based approach. The following subsections present the evaluation results.
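The three standard metrics can be computed as in the following Python sketch, which uses the usual set-based definitions and the balanced F-measure. The function name precision_recall_f and the document identifiers are hypothetical; in the setting above, retrieved would hold the top ten documents returned for a query.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure for one query."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy usage: four retrieved documents, three relevant ones, two in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
print(p, r, f)
```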

Accuracy comparison between the proposed approach and queries without expansion
The comparison results between the proposed approach and queries without expansion are presented in Fig. 3. As this figure shows, the average accuracy for queries without expansion is 11%, which is lower than the accuracy of the proposed approach. The highest accuracy obtained without expansion is for query 4, and it is still lower than the accuracy of the same query under the proposed approach. It is clear from the experimental results in Fig. 3 that the proposed approach is more accurate than the without-expansion approach for all the queries.

Accuracy comparison between the proposed approach and the WordNet-based approach
$$\mathrm{Accuracy} = \frac{|\text{relevant documents} \cap \text{retrieved documents}|}{|\text{retrieved documents}|} \tag{10}$$

The comparison results between the proposed approach and the WordNet-based approach are presented in Fig. 4. As this figure shows, the average accuracy percentages for the proposed approach and the WordNet-based approach are 53% and 19%, respectively. Accordingly, the WordNet-based approach obtains very low accuracy for more than 32 input queries, while the accuracy of the proposed approach is higher than that of the WordNet-based approach for more than 13 input queries. It is worth pointing out that the WordNet-based approach is more accurate than the without-expansion approach, yet less accurate than the proposed approach.

Accuracy comparison between the proposed approach and the TF-based approach
The comparison results between the proposed approach and the TF-based approach are presented in Fig. 5. As this figure shows, the average accuracy for the TF-based approach is 14%, which is lower than the accuracy of the proposed approach. It is clear from the experimental results in Fig. 5 that the proposed approach is more accurate than the TF-based approach for all the queries except query number 15.

Accuracy comparison between the proposed approach and the w2v-based approach
The comparison results between the proposed approach and the w2v-based approach are shown in Fig. 6. As depicted in this figure, the average accuracy percentage for the proposed approach is 53%. This means that the proposed approach obtains very high accuracy for more than 21 input queries, while the average accuracy percentage for the w2v-based approach is 27.37%. The w2v-based approach provided higher accuracy values for only 11 input queries. The w2v-based approach was also compared with the WordNet-based approach in terms of accuracy. It is worth pointing out that the average accuracy percentage of the WordNet-based approach was only 19%. This means that the w2v-based approach obtains higher accuracy than the WordNet-based approach for more than 3 input queries. As shown in Fig. 6, for all tested queries, the accuracy values of the proposed approach are higher than the accuracy values of the WordNet-based and w2v-based approaches. Precision and recall values were also computed and compared for the selected queries using the above-mentioned approaches, as shown in Table 7.
To check the overall performance of the proposed approach, recall and precision values are computed and compared with the original query, TF, WordNet, and W2V-based approaches, as shown in Table 8.
As Table 8 shows, the proposed approach outperforms all the other query expansion approaches. Figure 7 shows the recall-precision comparison for all approaches. It is clear from these results that the proposed approach outperforms the other query expansion approaches due to the proper selection of expansion terms from the candidate pool.

Statistical analysis
To make the results more reliable, a paired t-test is also computed. Table 9 shows that the improvement of the proposed approach over the other approaches is statistically significant at α = 0.05. The proposed approach statistically outperforms the other approaches, as the p-values are 0.0257, 0.0258, 0.0295, and 0.0330 for the original query, TF-based, WordNet-based, and W2V-based approaches, respectively (Fig. 8).
The results of the different analyses demonstrate that the proposed approach achieves the best performance compared with all the other approaches. Table 7 lists the accuracy and recall values of the four approaches for eight randomly selected input queries. Two key observations emerge from these results: cases in which the proposed approach outperforms the other approaches, and cases in which it does not. First, for some queries such as Q#2, Q#7, Q#6, Q#3, and Q#1, the proposed approach outperformed the other approaches. For example, in the case of query no. 7, the term is a synonym of the third query term, and term appears with query term in many documents. Therefore, these terms can be added as new terms for query expansion. Similarly, for query no. 2 term and term comes with the in many documents (refer to Table 3). Consequently, these terms are added to the original query by the proposed approach, which improves the accuracy.
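The paired t statistic behind such a test can be sketched in Python as follows. The per-query scores below are invented toy values, not results from this paper, and the function name paired_t_statistic is hypothetical. The sketch stops at the t statistic and the degrees of freedom; converting them to a p-value needs the Student t distribution, which the standard library does not provide (a statistics package such as SciPy offers this through scipy.stats.ttest_rel).

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Invented per-query scores for two systems evaluated on the same four queries.
proposed = [0.5, 0.6, 0.7, 0.6]
baseline = [0.4, 0.4, 0.4, 0.4]
t_stat, dof = paired_t_statistic(proposed, baseline)
print(t_stat, dof)
```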

Discussions
In case of query no. 3 the term and term generates as synonym for term using the WordNet method. Besides, the terms generates as expanded terms using the w2v-based approach. The computed accuracy for the expanded query using WordNet is 0.4, and for the modified query using the w2v-based method it is 0.5. The terms is not selected as expanded terms using proposed approach, hence it improves the accuracy from 0.4 and 0.5 to 0.7. The reason behind this improvement is that PSO plays an important role in selecting suitable terms for query expansion and makes the query more specific. Therefore, most of the retrieved documents are relevant, and hence the results of the proposed approach were better than those of the other approaches. Table 8 presents the comparison of the MAP of the proposed approach with the other query expansion approaches.
Second, for some queries, including Q#4, Q#5, and Q#8, the WordNet, TF, and w2v-based approaches fetched better results than the proposed approach. For example, in query Q#4 the terms is added as new term for query expansion using proposed approach. In such cases, the proposed approach fails to remove inappropriate terms from the candidate pool, which causes it to lag behind the other approaches.

Conclusion and future work
In this paper, a hybrid query expansion approach for Arabic information retrieval was proposed. This approach combines statistical and semantic methods to utilize the advantages and strengths of each method. Thus, the term mismatch limitation of the statistical

Table 7 Recall and Precision values for the above selected queries using different query expansion approaches