Developing a Prediction Model for Author Collaboration in Bioinformatics Research Using Graph Mining Techniques and Big Data Applications

Scientific collaboration has increased dramatically as a result of web-based technologies, advanced communication systems, and the development of information and scientific databases. The present study aims to provide a predictive model for author collaborations in bioinformatics research output using graph mining techniques and big data applications. The study is applied-developmental research adopting a mixed-method approach, i.e. a mix of quantitative and qualitative measures. The research population consisted of all bioinformatics research documents indexed in PubMed (n=699160). The correlations of bioinformatics articles were examined in terms of weight and strength based on article sections including title, abstract, keywords, journal title, and author affiliation using graph mining techniques and big data applications. The prediction model of author collaboration in bioinformatics research was then developed using these tools and expert-assigned weights. The calculations and data analysis were carried out using Expert Choice, Excel, and Spark, together with the Scala and Python programming languages, on a big data server. The research was conducted in three phases: 1) identifying and weighting the components contributing to similarity measurement of authors; 2) implementing the co-authorship prediction model; and 3) integrating the expert-opinion and systemic weights. The model can help alleviate the contemporary information overload and facilitate collaborator lookup by authors.


Introduction
The growth of human knowledge has contributed to scientific collaborations. Scientific collaboration can take the form of book compilation, article translation, article publication in journals and presentation at conferences, research projects, membership in scientific societies, and collaboration with scholarly journals (Ghanei Rad, 2006). Another example is faculty member collaboration in supervising, advising, and refereeing student theses (Tabarzeh, 2018). Collaborations may occur at intra-institutional, inter-institutional, domestic, and international levels. One of the key concerns of researchers is to identify, within co-authorship networks, the potential collaborators who can best cooperate in a research project. Identification of the best candidates for scientific collaboration helps save time, increase efficiency, boost research quality, and develop science. A co-authorship network is a social network made up of a group of researchers. In a co-authorship network, authors function as nodes, while an undirected edge connects two authors who have published a joint article (Das et al., 2018). Static social networks such as bibliometric information networks are one type of social network. PubMed is an example: it is an information network of bibliometric data on medical sciences provided by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). Bioinformatics is an interdisciplinary field that includes methods and software for understanding biological information and that involves the interaction of computer science, mathematics, statistics, physics, and biology (Benton, 1996). Considering the importance of bioinformatics as an interdisciplinary field, researchers have produced and developed science in this field (Figure 1).
With the increase in the number of online biomedical articles, including bioinformatics articles, in full-text format, it has become vital for the majority of text mining software to understand and cite documents.
Figure 1. The fields related to bioinformatics (Bayat, 2002)
One of the methods of predicting scientific collaborators is to use the procedures and algorithms of recommender systems and link prediction methods. Graph theory is also important in analyzing information networks. In this approach, the network data are usually represented as a graph in which the entities of the network constitute the vertices and the relations among them constitute the edges. One challenge is the growth of the data graph, which may include millions of nodes and edges. Such an enormous volume makes graphs difficult to understand, and many computer programs fail to analyze them. Thus, big data tools should be used to analyze such graphs (AlHasan & Chaoji, 2008). The present study aimed to map the complete graph of the co-authorship network in PubMed using link prediction algorithms, network analysis, and big data tools in order to predict the best potential collaborations for a researcher. The model developed in this study may prove useful in other databases to illustrate author collaborations in a given field by applying a given dataset. Therefore, the study seeks to develop a predictive model that can predict scientific collaborators based on content components.

Research objectives
The main objective of the present study is to provide an author collaboration prediction model in bioinformatics research using graph mining techniques and big data applications. To this end, the following specific goals are pursued:
1. Calculating the weights and correlations of bioinformatics research articles based on article titles using graph mining and big data applications
2. Calculating the weights and correlations of bioinformatics research articles based on journal titles using graph mining and big data applications
3. Calculating the weights and correlations of bioinformatics research articles based on keywords using graph mining and big data applications
4. Calculating the weights and correlations of bioinformatics research articles based on abstracts using graph mining and big data applications
5. Calculating the weights and correlations of bioinformatics research articles based on affiliation using graph mining and big data applications
6. Developing an author collaboration prediction model for bioinformatics research articles using graph mining and big data applications

Methodology
The study is an applied-developmental research adopting a mixed-method approach, i.e. a mix of quantitative and qualitative measures. The methodology entails three main phases: 1) identifying and weighting the components contributing to similarity measurement of authors; 2) implementing the co-authorship prediction model; and 3) integrating the weights obtained in the previous phases.
Step 1: the literature was reviewed to identify the criteria affecting the selection of scientific collaborators. The focus group method was used to determine the weighting questionnaire. Focus groups provide a method for collecting qualitative data through an informal group discussion on a specific subject (Wilkinson & Silverman, 2004). Subsequently, the research questionnaire was designed using the 9-point Saaty scale (1980). Eventually, the pairwise comparison matrix of expert opinions was calculated based on group AHP. The matrices involved 6 sub-criteria, which were rendered to 30 questionnaire items based on pairwise comparisons.
Step 2: this step involved a quantitative approach in which the co-authorship prediction model was implemented using prediction algorithms, text mining, and big data tools based on graph theory in Python and Scala. Accordingly, the complete matrix of research variables was plotted per variable, and the edge weights were computed by each individual edge.
Step 3: adopting a mixed-method approach, the final co-authorship prediction model was calculated by combining the expert weightings from Step 1 with the system weightings from Step 2.
All bioinformatics research output indexed in PubMed, comprising 699160 articles, was examined in the modelling phase in December 2019. The dataset, 18 GB in size, was downloaded from PubMed in XML format.
The components identified for prioritization using pairwise comparisons were rendered into 30 questionnaire items, on a 9-point Saaty scale (1980), and an open-ended item. In this phase, the data were collected from scientometrics and bioinformatics experts, professors, and professionals. The data were analyzed using Expert Choice and Excel software. The face validity of the questionnaire was examined in the focus group by drawing on the opinions of eight experts in the fields of software, artificial intelligence, scientometrics, library and information science, and bioinformatics. The reliability of the instrument showed an inconsistency rate of 0.8.
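The weighting procedure described above can be illustrated with a small sketch. It is a pure-Python illustration, not the Expert Choice procedure used in the study. It assumes that individual judgments are aggregated by element-wise geometric mean (a common convention in group AHP), approximates the priority vector by normalized row geometric means, and computes Saaty's consistency ratio. The two 3×3 expert matrices and all function names are hypothetical.

```python
import math

# Saaty's random consistency index (RI) for matrix sizes 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def aggregate_judgments(matrices):
    """Element-wise geometric mean of several experts' pairwise matrices."""
    n = len(matrices[0])
    k = len(matrices)
    return [[math.prod(m[i][j] for m in matrices) ** (1.0 / k)
             for j in range(n)] for i in range(n)]


def priority_weights(matrix):
    """Approximate the priority vector by normalized row geometric means."""
    n = len(matrix)
    row_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(row_means)
    return [g / total for g in row_means]


def consistency_ratio(matrix, weights):
    """CR = CI / RI, where CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # lambda_max is estimated as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical judgments of two experts over three criteria (9-point scale).
expert_a = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
expert_b = [[1, 2, 4], [1 / 2, 1, 3], [1 / 4, 1 / 3, 1]]

group = aggregate_judgments([expert_a, expert_b])
weights = priority_weights(group)
cr = consistency_ratio(group, weights)
```

In AHP practice, a consistency ratio below 0.1 is conventionally regarded as acceptable; the same functions apply unchanged to a 6×6 matrix of components.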
Python and Scala were used to implement the prediction model. The modules and libraries used in the research included NumPy, scikit-learn, SparkContext (PySpark shell), SparkSession, pyspark.sql.functions (monotonically_increasing_id), pyspark.ml.feature (HashingTF, IDF, Normalizer), pyspark.mllib.linalg.distributed (IndexedRow, IndexedRowMatrix), scala.xml.XML, spark.implicits, GraphX, RDD, SQL, scala-xml, os, and sys. Due to the enormous volume of data, it was not possible to do the processing on a PC. Thus, we connected to the ASTEC big data server to do the operations. The configurations of the system are illustrated in Table 1.

Identifying and weighting the components contributing to similarity measurement of authors
Following an extensive review of the literature, 79 criteria were identified to classify and weight the components. Six components, namely common journals, citations, titles, affiliations, keywords, and abstract similarity, were then selected based on expert opinion and PubMed data. The questionnaire was designed on a 9-point Saaty scale. The pairwise comparison matrix for expert opinion was calculated based on group AHP (Table 2). During the implementation phase, the citation component was excluded from the calculation due to a high rate of errors, because citation counts were available only for open access articles rather than for all PubMed documents. The keywords were acquired from Kiani et al. (2020). In the second phase of the research, the datasets downloaded from PubMed in XML format were loaded, and the data were distributed, parsed, and crawled using Spark. Subsequently, the PMID, author name, affiliation, article title, keyword, abstract, publication year, and journal title tags were extracted. The raw file was parsed with Scala's scala-xml library, which is used to parse XML documents, to extract these tags. Figure 2 illustrates the data output.

Figure 2. Extraction of data frames
The key was defined as hash to accelerate the searches (Figure 3).
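The extraction step can be sketched on a single record as follows. The study used Spark with the scala-xml library; the snippet below instead uses Python's standard xml.etree.ElementTree, purely for illustration. The sample record is hypothetical and heavily trimmed, although its element names follow PubMed's XML format. Storing the records in a dictionary keyed by PMID is one possible reading of the hash key mentioned above and is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record using PubMed's XML element names.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2019</Year></PubDate></JournalIssue>
          <Title>Example Journal of Bioinformatics</Title>
        </Journal>
        <ArticleTitle>Graph mining for gene networks</ArticleTitle>
        <Abstract><AbstractText>We mine gene graphs.</AbstractText></Abstract>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName><ForeName>Jane</ForeName>
            <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
      <KeywordList><Keyword>graph mining</Keyword><Keyword>genes</Keyword></KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def first_text(node, path):
    """Return the text of the first element matching path, or an empty string."""
    found = node.find(path)
    return "".join(found.itertext()).strip() if found is not None else ""


def parse_records(xml_text):
    """Extract the fields used for similarity measurement, keyed by PMID."""
    root = ET.fromstring(xml_text.strip())
    records = {}  # a dict is a hash table, so lookups by PMID are fast
    for article in root.iter("PubmedArticle"):
        pmid = first_text(article, "MedlineCitation/PMID")
        authors = []
        for author in article.iter("Author"):
            parts = (first_text(author, "ForeName"), first_text(author, "LastName"))
            authors.append({
                "name": " ".join(p for p in parts if p),
                "affiliation": first_text(author, ".//Affiliation"),
            })
        records[pmid] = {
            "title": first_text(article, ".//ArticleTitle"),
            "abstract": " ".join("".join(n.itertext()).strip()
                                 for n in article.iter("AbstractText")),
            "journal": first_text(article, ".//Journal/Title"),
            "year": first_text(article, ".//PubDate/Year"),
            "keywords": [k.text.strip() for k in article.iter("Keyword") if k.text],
            "authors": authors,
        }
    return records


records = parse_records(SAMPLE_XML)
```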

Calculating the weights and correlations of bioinformatics articles based on article titles using graph mining techniques and big data applications
In this phase, Spark was invoked from Python. In order to identify the similarities among article titles, a complete graph of all author pairs was produced in which authors represented graph nodes, and each edge between a pair of authors represented the similarity weight of two titles (Figure 4). To measure similarity weights in article titles, the titles were first segmented into words using CountVectorizer, which is provided by the scikit-learn library for text vectorization. CountVectorizer parses sentences into a set of tokens. It also removes tags and special characters and applies preprocessing to every individual word.
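The tokenization and counting step can be sketched without any external library. The snippet below is a simplified stand-in for CountVectorizer, not its implementation: it lowercases the text, keeps tokens of two or more word characters (which roughly mirrors scikit-learn's default behaviour), and builds one count vector per title over a shared vocabulary. The two titles and the function names are hypothetical.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters; punctuation is dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(titles):
    """Map every distinct token to a column index (alphabetical order)."""
    terms = sorted({token for title in titles for token in tokenize(title)})
    return {term: index for index, term in enumerate(terms)}


def count_vector(title, vocabulary):
    """Term-frequency vector of one title over the shared vocabulary."""
    counts = Counter(tokenize(title))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


# Hypothetical article titles.
titles = [
    "Graph mining of gene networks",
    "Gene expression: a graph-based view of gene data",
]
vocabulary = build_vocabulary(titles)
vectors = [count_vector(title, vocabulary) for title in titles]
```

Each row of `vectors` corresponds to one title and each column to one vocabulary term, which is the shape of the term-frequency matrix used for the pairwise comparison.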
We then rendered the texts into feature vectors to build the incidence matrix for article titles. A term frequency (occurrence) vector was calculated for every article title (Table 3) in order to measure the distance between every pair of article titles based on cosine similarity. To this end, the titles were first represented as vectors. For example, the Article 1 and Article 2 vectors were formed as (2,1,0,0,0,0,0,0) and (1,1,0,4,0,0,0,1), respectively. Subsequently, their cosine similarity was measured in pairs (Table 4). That is, Article Title 1 and Article Title 2; Article Title 1 and Article Title 3; and Article Title 1 and Article Title n were compared in pairs. For term-frequency vectors, the cosine similarity value ranges between 0 and 1. When the two vectors (article titles) are the same, the cosine similarity is 1; when the two vectors (article titles) are utterly different, the cosine similarity is 0. The cosine similarity between Article Title 1 and Article Title 2 is as follows:
Article Title Vector 1: $d_1 = (2, 1, 0, 0, 0, 0, 0, 0)$
Article Title Vector 2: $d_2 = (1, 1, 0, 4, 0, 0, 0, 1)$
$$\cos\theta = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} = \frac{2 \times 1 + 1 \times 1 + 0 \times 0 + 0 \times 4 + 0 \times 0 + 0 \times 0 + 0 \times 0 + 0 \times 1}{\sqrt{2^2 + 1^2} \times \sqrt{1^2 + 1^2 + 4^2 + 1^2}} = \frac{3}{\sqrt{5 \times 19}} \approx 0.31$$
The next step is to calculate the Inverse Document Frequency (IDF), which normalizes the word frequency. IDF is calculated as follows:
$$\mathrm{idf}(t) = \log_{10} \frac{N}{\mathrm{df}(t)}$$
where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing the term $t$. When a word appears in all documents, the IDF value for that word is zero. For example, if we have 1000000 article titles with 1000 titles containing the word "internet", IDF is calculated as follows:
$$\mathrm{idf}(\text{internet}) = \log_{10} \frac{1000000}{1000} = 3$$
In the next step, TF-IDF is calculated. That is, the occurrence of every individual word in the text is multiplied by its IDF:
$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
Finally, the resulting similarity weights are assigned as the weights of the article title edges for a given pair of authors.
The output of pairwise title weights is illustrated in Figure 5.
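The worked example above can be checked numerically with a short pure-Python sketch. The function names are illustrative. The base-10 logarithm is taken from the "internet" example; library implementations such as Spark or scikit-learn apply their own IDF variants, so their values may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty title shares nothing with any other title
    return dot / (norm_u * norm_v)


def idf(n_documents, document_frequency):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_documents / document_frequency)


def tf_idf(term_frequencies, idf_values):
    """Multiply each term frequency by the IDF of the same term."""
    return [tf * w for tf, w in zip(term_frequencies, idf_values)]


# Vectors from the worked example in the text.
title_1 = [2, 1, 0, 0, 0, 0, 0, 0]
title_2 = [1, 1, 0, 4, 0, 0, 0, 1]

similarity = cosine_similarity(title_1, title_2)  # 3 / sqrt(95)
internet_idf = idf(1_000_000, 1_000)              # log10(1000)
```

Rounded to two decimals, `similarity` matches the 0.31 obtained in the text, and `internet_idf` reproduces the value of 3 from the "internet" example.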

Calculating the weights and correlations of bioinformatics articles based on journal titles using graph mining techniques and big data applications
In the next step, a complete graph was produced of all authors in pairs to determine the similarities among journal titles. The authors represented the nodes, and each edge between a pair of authors represented the similarity weight of common journal titles ( Figure 6).

Figure 6. Similarity of common journals
In this step, journal titles were compared in pairs, and their similarities were determined to measure edge weights. The output is illustrated in Figure 6.
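A complete author graph of this kind can be sketched as follows. The text does not specify how article-level similarities are mapped onto author pairs, so the sketch makes an explicit assumption: each author is represented by the pooled word counts of the journal titles in which they have published, and the edge weight is the cosine similarity of these counts. The author identifiers, journal lists, and function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def token_counts(texts):
    """Pool the lowercase word counts of several strings into one profile."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def cosine(profile_a, profile_b):
    """Cosine similarity between two sparse count profiles."""
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[t] * profile_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete_graph(texts_by_author):
    """Return {(author_a, author_b): weight} for every unordered author pair."""
    profiles = {a: token_counts(t) for a, t in texts_by_author.items()}
    return {(a, b): cosine(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)}


# Hypothetical authors and the journals in which they have published.
journals_by_author = {
    "author_a": ["Genome Research", "Genome Biology"],
    "author_b": ["Genome Biology"],
    "author_c": ["Journal of Physics"],
}
edges = complete_graph(journals_by_author)
```

A complete graph on n authors has n(n-1)/2 edges, so the number of comparisons grows quadratically with the number of authors; this is one reason a distributed engine is needed at the scale of the PubMed dataset.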

Calculating the weights and correlations of bioinformatics articles based on keywords using graph mining techniques and big data applications
In this step, the weights and correlations of bioinformatics articles were examined based on their keywords. The weights of articles were calculated in pairs based on keyword similarity in order to measure the similarities of article keywords. The authors represent the nodes, and each edge between a pair of authors represents the similarity weight of article keywords (Figure 6).

Figure 6. Similarity of article keywords
As with titles, the keywords were compared in article pairs, and the authors were weighted based on keyword similarities.

Calculating the weights and correlations of bioinformatics articles based on abstracts using graph mining techniques and big data applications
The weights of articles were calculated in pairs based on abstract similarity in order to measure the similarities of article abstracts. In this graph, the authors represent the nodes, and each edge between a pair of authors represents the similarity weight of article abstracts (Figure 8).

Figure 7. Similarity of article abstracts
In this regard, we measured the similarities of article abstracts in pairs, drew their complete graph, and computed the edge weights.

Calculating the weights and correlations of bioinformatics articles based on author affiliations using graph mining techniques and big data applications
The weights of articles were calculated in pairs based on affiliation similarity in order to measure the similarities of author affiliations. In this graph, the authors represent the nodes, and each edge between a pair of authors represents the similarity weight of author affiliations (Figure 9).

Proposing the model of author collaborations in bioinformatics research using graph mining techniques and big data applications
The model for predicting author collaborations was eventually developed using graph mining techniques and big data applications. To this end, the complete graph of all authors of articles was designed such that the authors represented the nodes, and the edges represented the similarity weights of article titles, abstracts, keywords, author affiliations, and journal titles. The weights measured in software were integrated with the weights assigned by experts so that a final weight was calculated between each pair of nodes. The final weights for predicting co-authorship are combined as follows:
$$\mathrm{Similarity}_{\mathrm{nodes}} = 0.091\, w_{\mathrm{ArticleTitle}} + 0.031\, w_{\mathrm{Abstract}} + 0.055\, w_{\mathrm{Keyword}} + 0.075\, w_{\mathrm{Affiliation}} + 0.374\, w_{\mathrm{JournalTitle}}$$

Figure 11. Final weights
The final model is illustrated in Figure 12.
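The weighted combination given above can be sketched in a few lines of Python. The expert weights are those reported in the text; the per-component system weights for the two author pairs are hypothetical, and the ranking helper is only one possible way of turning pairwise scores into recommendations, not the study's implementation.

```python
# Expert-assigned component weights reported in the text.
EXPERT_WEIGHTS = {
    "title": 0.091,
    "abstract": 0.031,
    "keyword": 0.055,
    "affiliation": 0.075,
    "journal": 0.374,
}


def combined_similarity(system_weights):
    """Weighted sum of the per-component similarity weights of one pair."""
    return sum(EXPERT_WEIGHTS[name] * system_weights.get(name, 0.0)
               for name in EXPERT_WEIGHTS)


def recommend(author, pair_scores, top_k=3):
    """Rank the other authors by their combined similarity to author."""
    candidates = []
    for (a, b), score in pair_scores.items():
        if author == a:
            candidates.append((b, score))
        elif author == b:
            candidates.append((a, score))
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical system-computed similarity weights for two author pairs.
pairs = {
    ("author_a", "author_b"): {"title": 0.31, "abstract": 0.20, "keyword": 0.50,
                               "affiliation": 0.0, "journal": 0.87},
    ("author_a", "author_c"): {"title": 0.05, "abstract": 0.10, "keyword": 0.0,
                               "affiliation": 1.0, "journal": 0.0},
}
scores = {pair: combined_similarity(w) for pair, w in pairs.items()}
ranking = recommend("author_a", scores)
```

Because the journal weight dominates, a pair with strongly overlapping journals outranks a pair that merely shares an affiliation in this hypothetical data.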

Discussion
One of the key issues in proposing scientific collaborators is to use the researchers' opinions because synergy and expert consensus facilitate the selection of scientific partners. The experts of the focus group concurred that identification of core or most-cited authors in a given field was not the key factor because average authors may assume that core authors show no interest in collaborating with them, or researchers in an institution may be reluctant to collaborate with their colleagues in the same institution (Makarov et al., 2017). Thus, expert opinions matter in selecting scholarly partners. In this study, "thematic phrases in article titles", "thematic phrases in article abstracts", "thematic phrases in article keywords", "similarity of author affiliations", and "publications in common journals" were selected for weighting. The expert weightings were integrated with system weightings to measure similarities in finding scientific collaborators.
With regard to the weights and correlations of bioinformatics articles based on article titles, a review of the literature revealed that Wu et al. (2018), Wang et al. (2007), Li et al. (2019), and Chirita et al. (2007) used TF-IDF feature selection techniques. According to Beel et al. (2015), about 70% of weightings were done using the TF-IDF approach. Salton and Buckley (1988) and Rathipriya et al. (2014) asserted that the cosine similarity method was superior to the Hamming similarity criterion in designing a web recommender system. HashemiNezhad et al. (2018) showed that cosine similarity and Manhattan similarity produced better results compared with Euclidean distance. One of the reasons for the popularity of cosine similarity is that it is highly suitable for assessment, particularly for sparse vectors (Farhadi & Jamzadeh, 2018). Kamiar (2014) contends that the cosine method is one of the best similarity algorithms, with better accuracy than Jaccard and Levenshtein similarities. Magara et al. (2018) compared similarity criteria in recommender systems and concluded that cosine similarity had the best performance compared with other similarity criteria. The title of a work is an echo of its identity; in other words, the title is the first manifestation of the text exposed to the readers. The title is a container that holds the main idea of the text. In humanities research, some titles assume metaphorical connotations; thus, there is less consistency. However, the title is especially effective in the subject of this study, which addresses the bioinformatics field. DavarPanah (1996) studied the degree of consistency between the titles of Persian articles and their content in different fields of research. The results showed that article titles in humanities were less consistent with their contents compared with those of medical sciences. In the present study, the experts attached a greater weight to the article titles than to author affiliations, keywords, and abstracts.
Nascimento et al. (2011) maintained that key terms in the title weighed three times the key terms in the article body. Mooney and Roy (2000) and Li et al. (2019) used the title component to design a recommender system for books and articles. Achary (2011) used the title component in his content-based recommender system. With regard to the weights and correlations of bioinformatics articles, the experts attached the greatest weight and priority to journal titles. Cabanac (2011) argued that journal contents were the main factor for scholarly text recommender systems. He held that reading journal articles and conference proceedings was the best way to keep up with the latest developments in a given field. In this section, similarity measurement and weighting were done using TF-IDF and the cosine similarity method. In selecting the features, journal ISSNs were also available. Although it was easier to process ISSNs, journal titles were selected as the main component to account for overlaps and proper documentation of author names. This is because one of the functions of journal titles is documentation of author names. Authors typically choose relevant journals based on their expertise. For example, a given author who has specialized in genomes and who publishes articles in this field tends to publish in genome-related journals. This distinguishes authors from different fields. Cota et al. (2010) disambiguated author names using similarity functions, assuming that authors tend to publish on the same topics and in the same journals. The results showed that this method was 12% more accurate than supervised and unsupervised methods. Han et al. (2004) used a probability model for measuring the similarity between author names and article terms to disambiguate author names. With regard to the weights and correlations of bioinformatics articles, TF-IDF and cosine similarity algorithms were used to calculate keyword similarity. The weights of edges were calculated based on keywords.
Keywords are topics and terms that define article contents. Aanonson (1987) showed that keyword search in the titles helps retrieve not only the relevant documents but also the documents that are not retrievable through thematic search. Ghareh Chamani (2013) used article keywords as the only variable to recommend articles from CiteSeer website. Mooney and Roy (2000) designed a recommender system based on topical terms. The system was developed to recommend books to Amazon customers based on Bayesian algorithm. Achary (2011) used keyword tags in Bibsonomy and CiteSeer for his recommender system. Sun et al. (2011) used the subject component to predict co-authorship in heterogeneous bibliometric networks in DBLP network. Using the content-based method and TF-IDF algorithm, Chirita et al. (2007) developed a keyword-recommender system in web pages by extracting important keywords from web pages. With regard to the weights and correlations of bioinformatics articles based on abstracts, it should be noted that article abstracts are important components because they provide a synopsis of the research. Metadata such as title, author names, publication year, and journal title are common and retrievable features used in different databases for similarity measurement; however, it is not easy to retrieve abstracts in the majority of databases. An abstract contains the gist of a research article that is written meticulously by authors. Cabanac (2011) asserts that it is too costly and difficult to access the full texts and abstracts of scholarly texts for processing. According to expert opinions, abstract component ranked fifth in this study. When two authors have produced similar abstracts, they are likely to have authored similar articles. Thus, their similarity is determined based on their mutual weight. Text mining tools such as cosine similarity and TF-IDF algorithm were used to calculate article similarities in abstracts. 
Similarity measurement in abstracts has not been carried out in previous studies. Seemingly, researchers have avoided this due to the bulky processing of abstracts and a lack of required datasets. With regard to the weights and correlations of bioinformatics articles based on author affiliations using graph mining techniques and big data applications, one should note that affiliation is an important factor for authors in choosing collaborators. Some researchers prefer to collaborate with people in their own institution or region, while others prefer partners from outside their institution. Departments, laboratories, schools, and universities impose limitations on researchers due to competition with their rival counterparts. The main reason for such competition is government financial support (Roemer & Borchardt, 2015). Makarov et al. (2017) reported that researchers at the Higher School of Economics of National Research University (HSE) often collaborated with researchers from other institutions. Affiliation is an important component that researchers use in altmetrics and bibliometrics (Yan & Guns, 2014; Brandão & Moro, 2012; Ho et al., 2019; Andrikopoulos et al., 2016). Finally, a model was designed to predict author collaborations in bioinformatics research articles using graph mining techniques and big data applications by applying a new method to weight the components. Text mining, information retrieval, and big data tools as well as graph theory were used in this method, together with expert opinion. The majority of previous studies on co-authorship prediction have drawn upon topological approaches in an unsupervised manner without expert opinion. In contrast, we addressed content-based methods, expert weighting, and thematic similarity.
Conclusion
Sufficient information is required for decision-making, thinking, and communication.
Due to the dramatic increase in scholarly research and articles, it is exceedingly difficult for researchers to find potential collaborators. The present study drew on quantitative methods in the big data environment and expert opinion to develop a model that could recommend the most relevant potential scholarly collaborators to a given researcher. The results showed that content-based methods in recommender systems in static networks have considerable potential for finding scientific collaborators in relevant retrieval.
Content-based methods involve the use of different sections of the article content including title, abstract, and keywords to recommend the relevant articles based on their similarity with a set of input articles. One of the operational achievements of this study was the acceleration of relevant author retrievals, which in turn led to more efficiency, a higher quality of research, and scientific development. This recommender system leads to the more convenient selection of authors. Finding a suitable research collaborator is one of the main challenges in interdisciplinary fields such as bioinformatics. In addition to systemic methods, bioinformatics expert opinion was also drawn upon in this study. This model coordinates the concerns of authors for finding the most similar research collaborators with their information needs so that it guarantees good scientific collaborator recommendation. Future studies could address the predictive model for author collaboration based on behavioral characteristics, the predictive model for author collaboration based on fuzzy algorithms, the predictive model for author collaboration in other bibliometric networks such as Scopus, the co-authorship model in scientific social networks, the predictive model for author collaboration without expert weighting, and implementation of the predictive model for author collaboration based on various algorithms such as Jaccard, Euclidean, simple Bayesian, and neural network algorithms.