Expanded graph embedding for joint network alignment and link prediction

Link prediction in social networks has been an active field of study in recent years, fueled by the rapid growth of many social networks. Many link prediction methods are hampered by users' intention to avoid being traced across networks: users may provide inaccurate information or omit a great deal of information in multiple networks. This problem has been addressed by methods that predict links in one network based on known links in another network. Node alignment between the two networks significantly improves the efficiency of those methods. This research proposes a new embedding method to improve link prediction and node alignment results. The proposed embedding method is based on the Expanded Graph, a novel graph that contains the edges of both networks in addition to edges across the networks. Matrix factorization of the Finite Step Transition and Laplacian similarity matrices of the Expanded Graph is used to obtain the node embeddings. Using the proposed embedding techniques, we run the network alignment and link prediction tasks jointly and iteratively so that each improves the other's results. We performed extensive experiments on several datasets to examine the proposed method. We achieved significant improvements in link prediction precision, which was 50% better than that of the peer method, and in recall, which was 500% better on some datasets. We also reduced the processing time of the solution to make it more applicable to large social networks. We conclude that, for this type of problem, computing the embedding is more suitable than learning it, since it shortens the processing time and gives better results.

, (3) correlation information based [8] or a mixture of them. Another well-studied problem in social networks is the network alignment problem [9]. Network alignment is a graph-related problem that falls under the general problem of Entity Matching [10]. It aims to align similar nodes between two separate graphs. Researchers in [11] classified network alignment features into (1) profile features, (2) content features and (3) network features. In this study, link prediction is carried out between nodes in one social network based on an analysis of existing link patterns between nodes in another social network. We achieve this by aligning the nodes between the two networks and then predicting the missing links between the aligned nodes in the first network, and vice versa. The link prediction uses the existing links between the corresponding nodes in the second network to predict links in the first network. During the network alignment task, only the network features are used, while profile and content features are ignored. The main objective of this research is to predict links in a social network based on known links in another network using only network features, such that the prediction is not impacted by (1) the heterogeneity of the available attributes in each network, (2) the discrepancy between the information provided in each network caused by changeable information not being kept up to date, and (3) the intention of social network users to provide incomplete or incorrect information in different social networks to avoid being tracked.
The contributions of this study can be summarized as follows:
• We provide a new cross-graph embedding method that captures the properties of two graphs.
• We use the proposed method in a framework that jointly applies network alignment and link prediction.
• We perform extensive experiments on public datasets and evaluate our method and the peer method [12] on many aspects to show how our method surpasses the existing method.
The remainder of the paper is organized as follows. The "Related work" section surveys the related work in the fields of network alignment, link prediction and graph embedding. The "Notations" section provides the notations used in the solution. The "The proposed method" section introduces our method. The "Experiments" section summarizes the experiments we ran to evaluate the proposed method. Finally, the "Conclusion" section outlines conclusions and future work.

Related work
Network alignment and link prediction are both well-studied fields. However, researchers usually study these problems separately. Three categories of features are involved in network alignment; some researchers studied them separately and others combined them. Content features were studied separately by the researchers in [13], who asked how important published tweets are for node alignment. They extracted text features from tweets and posts, such as high-frequency words, part-of-speech tags and emoticons, then trained a classifier to predict the alignment. Researchers in [14] projected the network structure features and attribute features into a heterogeneous information network, then embedded the vertices of the generated network and used highway networks [15] to classify each pair of vertices as an alignment or not. Researchers in [16] showed how deep learning can be employed to match content features based on users' posts and tweets. They provided a design space using character or word embedding and RNN or attention summarization. Recently, researchers in [17] aligned nodes between two directed networks based on the structural and attribute features of the two networks. They embedded the structure of the networks using embedding learning based on the nodes and their incoming and outgoing edges. Then they obtained the attribute embedding using a multi-layer auto-encoder. Finally, they unified the two embeddings using an attention layer to produce one embedding for each node that holds both the structure and attribute information of the node.
There are also many works on link prediction. In [18], researchers provided a survey and a comparison of many link prediction methods, features and network types. They explored link prediction using local and global similarities, node or link attributes and much more. Researchers in [19] treated each graph snapshot as an entry in a time series. They applied graph embedding learning on this time series using a Long Short-Term Memory network to predict the network links in the next time step. In [20], researchers treated the problem as supervised learning of missing edges.
In the graph embedding field, researchers in [21] obtained the embedding by analyzing the similarity matrix, for example by factorizing it with singular value decomposition (SVD).
In [22], the researchers performed a biased random walk to extract the nodes and their neighbours as sequences using a sampling strategy, then learned the embedding. Researchers in [23] wanted to take advantage of node attributes to feed the embedding, so they embedded the nodes with their attributes in mind in order to describe the nodes fully. The local attribute distributions of a node's neighbours within a fixed number of hops were passed to a Skip-Gram model to embed the attributed nodes. In [24], researchers studied anonymized user identity linkage between two separate networks using unsupervised embedding on network features. They designed a cross-network embedding model to learn Gaussian embeddings from both networks. Next, they calculated the distance between the embeddings using the 2nd Wasserstein distance to link the users between networks. Researchers in [12] studied two separate networks together to solve the problem of link prediction in one network based on another network. They used Skip-Gram embedding to embed the nodes of both networks, then used greedy alignment to align the nodes of the networks, and finally predicted links in each network around the aligned nodes. They repeated those steps so that link prediction and node alignment could support each other's results.
The limitation of the existing network alignment studies is their dependency on other features in addition to network features: when social network users provide inaccurate information in their profiles or social content, content and profile features become useless. Therefore, the method we need must use only network features. In addition, we want to embed the nodes without taking their attributes into account, so that incorrect information in the attributes does not negatively impact the embedding. Accordingly, embedding using SVD is a suitable method if it is applied to the similarity matrix of the network nodes.
The method in [12] suffers from two problems: (1) using Skip-Gram embedding for unsupervised embedding learning increases the processing time, for large networks in particular, and (2) predicting missing links by examining the existing links between aligned nodes has poor precision and recall. To address these issues, we devised a new embedding method that computes the embedding from the similarity matrix of the networks in order to minimize the embedding time and improve link prediction quality.

Notations
This section provides the formal definition of the problem and the symbols used in the upcoming sections. A graph (network) $G = (V, E)$ is defined as a set of vertices $V$ and a set of edges $E$. We assume that $E \subseteq V \times V$ and that $N = |V|$ is the number of vertices. In social networks, the vertices represent the users and the edges represent the relationships between those users (friendship, follow relationship, ...).
Node embedding is defined as $emb : V \to \mathbb{R}^d$, where the embedding dimension $d \ll N$, and $emb_u$ is the embedding of $u \in V$.
Given two networks $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, $S$ is the set of alignment seeds that the system treats as ground truth.
The k-hop neighbors of a node $u \in V$ are all nodes that are exactly $k$ links away from node $u$; we denote them as $N_k(u)$.
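As an illustration of the k-hop neighborhood defined above, the following minimal sketch collects the nodes at distance exactly k from a node with a breadth-first search. It assumes that "exactly k links away" means shortest-path distance and that the graph is stored as an adjacency dictionary; the function name `k_hop_neighbors` and the toy graph are hypothetical and not part of the original method.

```python
from collections import deque


def k_hop_neighbors(adj, u, k):
    """Return the nodes whose shortest-path distance from u is exactly k.

    adj is an adjacency dict {node: iterable of neighbours} (assumed format).
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # nodes beyond depth k are not needed
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return {n for n, d in dist.items() if d == k}


# Toy path graph a - b - c - d (illustrative data only).
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
two_hop_of_a = k_hop_neighbors(toy_adj, "a", 2)
```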

The proposed method
The overall proposed method is summarized in Algorithm 1. All of its steps are explained in detail in the upcoming subsections. These steps are identical to the steps proposed in [12]; our main contribution lies in the node embedding step, where we reduce the embedding time by introducing our novel Expanded Graph and computing the embedding of its nodes directly instead of learning it. In addition, [12] depends on network and profile features, whereas we depend only on network features in order to avoid the problem of invalid or wrong profile data.

Node embedding
There are many embedding methods; some of them form the embedding from the features of one graph [21,22,25] and others form it from the features of multiple graphs [12]. We introduce a novel two-graph embedding that forms the embedding from the features of both graphs combined into a single graph, which we name the Expanded Graph. After constructing the Expanded Graph, we derive the similarity measures from its adjacency matrix, then we compute the node embeddings by factorizing the similarity measure matrix.

Construct the expanded graph
Given two graphs $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$ and the alignment seed set $S$, we construct a new graph $G = (V, E)$. We name $G$ the Expanded Graph of $G_1$ and $G_2$. We assume that $V = V_1 \cup V_2$. Initially, $E$ can be represented as $E = E_1 \cup E_2$, so it consists of the existing edges of both graphs. Figure 1 shows the Expanded Graph in phase 1. To embed across networks, we add cross-network edges, which we denote $E_s$. Given the seed set $S$, $E_s$ is the set of edges that we add between the aligned nodes in $V_1$ and their corresponding nodes in $V_2$. Figure 2 shows the Expanded Graph in phase 2, where $E = E_1 \cup E_2 \cup E_s$. We also add $E_{sim}$, a set of edges that link each node in $V_1$ to its top $h$ similar nodes in $V_2$ and vice versa. The final edge set of the Expanded Graph is therefore $E = E_1 \cup E_2 \cup E_s \cup E_{sim}$. Figure 3 shows the final version of the Expanded Graph. $E_{sim}$ can be written as $E_{sim} = \{(u, v) : u \in V_1, v \in topNeb_{(h, G_1, G_2)}(u)\} \cup \{(u, v) : u \in V_2, v \in topNeb_{(h, G_2, G_1)}(u)\}$, where $topNeb_{(h, G, G')}(u)$ are the top $h$ nodes in graph $G'$ most similar to the node $u$ from graph $G$. We define the similarity between any two nodes $u, v$ from graphs $G, G'$ respectively as the subgraph similarity $sim_{sub}(u, v)$ [12,26], which is given by Eq. 1,
where $\alpha$ is a hyperparameter that controls the distribution of the similarities and $f_k(u, v)$ is the distance between $u$ and $v$ at the $k$-th hop, which is given by Eq. 2. $s_k(u)$ is the ordered degree sequence of $N_k(u)$ (the neighbors of $u$ at the $k$-th hop) and $dist$ measures the distance between these sequences. For $dist$ we used Eq. 3.
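The construction of the Expanded Graph edge set can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `build_expanded_edges`, the node tagging scheme and the `sim` dictionary of precomputed cross-graph similarity scores are all assumptions made for the example (the subgraph similarity of Eq. 1 is not computed here and must be supplied by the caller).

```python
def build_expanded_edges(edges1, edges2, seeds, sim=None, h=1):
    """Return the undirected edge set E = E1 | E2 | Es | Esim.

    Nodes are tagged (1, name) or (2, name) so that V1 and V2 stay disjoint.
    sim is a hypothetical dict {(u, v): score} with u in V1 and v in V2.
    """
    def undirected(a, b):
        return frozenset((a, b))

    edges = set()
    edges |= {undirected((1, a), (1, b)) for a, b in edges1}  # E1
    edges |= {undirected((2, a), (2, b)) for a, b in edges2}  # E2
    edges |= {undirected((1, u), (2, v)) for u, v in seeds}   # Es
    if sim:
        by_u, by_v = {}, {}
        for (u, v), score in sim.items():
            by_u.setdefault(u, []).append((score, v))
            by_v.setdefault(v, []).append((score, u))
        # Esim: top-h most similar nodes of G2 for each node of G1 ...
        for u, cands in by_u.items():
            top = sorted(cands, key=lambda c: c[0], reverse=True)[:h]
            edges |= {undirected((1, u), (2, v)) for _, v in top}
        # ... and top-h most similar nodes of G1 for each node of G2.
        for v, cands in by_v.items():
            top = sorted(cands, key=lambda c: c[0], reverse=True)[:h]
            edges |= {undirected((1, u), (2, v)) for _, u in top}
    return edges


# Illustrative toy input: one edge per graph, one seed, two similarity scores.
toy_edges = build_expanded_edges(
    edges1=[("a", "b")],
    edges2=[("x", "y")],
    seeds=[("a", "x")],
    sim={("b", "x"): 0.1, ("b", "y"): 0.9},
    h=1,
)
```

Tagging each node with its graph of origin keeps the two vertex sets disjoint even when the networks reuse identifiers, which matches the assumption $V = V_1 \cup V_2$.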

Forming the similarity measure
We propose two similarity measures. Later, in the "Comparing multiple similarity measures" section, we evaluate these measures alongside several others and show how they affect the precision and recall of network alignment and link prediction.

Finite step transition matrix
The Finite Step Transition matrix FST [25] is equivalent to a random walker taking L sequential random steps. FST is given by Eq. 4, where P is the transition matrix, which is given by Eq. 5, and A is the adjacency matrix.

Laplacian matrix
The Laplacian matrix LAP is given by Eq. 6, where A is the adjacency matrix and D is the degree matrix.
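The two similarity measures can be sketched with NumPy as below. The transition matrix and the Laplacian follow their standard definitions ($P = D^{-1}A$ and $D - A$). The exact form of Eq. 4 is not reproduced in this text, so the FST function below assumes that the first L powers of P are summed; this is an assumption, and a variant that uses only $P^L$ would be an equally plausible reading. All function names are hypothetical.

```python
import numpy as np


def transition_matrix(A):
    """Row-normalised adjacency matrix P = D^-1 A (standard definition)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0  # isolated nodes keep an all-zero row
    return A / deg[:, None]


def laplacian_matrix(A):
    """Combinatorial Laplacian D - A (standard definition)."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A


def finite_step_transition(A, L):
    """ASSUMPTION: FST = P + P^2 + ... + P^L; Eq. 4 may differ."""
    P = transition_matrix(A)
    total = np.zeros_like(P)
    power = np.eye(P.shape[0])
    for _ in range(L):
        power = power @ P   # power now holds P^k
        total += power
    return total


# Toy path graph with three nodes (illustrative data only).
A_toy = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
fst_toy = finite_step_transition(A_toy, L=2)
lap_toy = laplacian_matrix(A_toy)
```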

Calculating the embedding
We calculate the embedding using the Singular Value Decomposition (SVD) of the similarity measure. SVD is a matrix factorization method in which the matrix $M$ can be represented as in Eq. 7.
SVD was selected because it represents the given matrix by three matrices, among which $V$ holds a vector for each node whose entries are ordered by their importance in representing the corresponding node, according to the singular values in $\Sigma$. Therefore, we use the truncated version of SVD, in which only a fixed number of the largest singular values is kept. We set this truncation length equal to the number of embedding dimensions [27]. $V$ is thus truncated to $V_t$, and the embedding is then obtained by Eq. 8.
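A minimal sketch of the truncated-SVD embedding step, using `numpy.linalg.svd`, is given below. Eq. 8 is not reproduced in this text, so whether $V_t$ is rescaled is unknown; the optional square-root scaling by the singular values is an assumption introduced for the example, and `scale=False` returns the raw truncated vectors. The function name `svd_embedding` is hypothetical.

```python
import numpy as np


def svd_embedding(M, d, scale=True):
    """Embed each node as a d-dimensional row taken from truncated SVD.

    M is the N x N similarity matrix and d the embedding dimension.
    ASSUMPTION: scaling by sqrt(singular values) is illustrative only.
    """
    M = np.asarray(M, dtype=float)
    _, s, Vt = np.linalg.svd(M)      # singular values come sorted, largest first
    V_t = Vt[:d].T                   # N x d: one row per node
    if scale:
        return V_t * np.sqrt(s[:d])  # weight each dimension by its importance
    return V_t


# Toy similarity matrix with known singular values 3, 2, 1, 0.5.
M_toy = np.diag([3.0, 2.0, 1.0, 0.5])
emb_raw = svd_embedding(M_toy, d=2, scale=False)
emb_scaled = svd_embedding(M_toy, d=2)
```

For large sparse similarity matrices, a solver that computes only the top d singular triplets would avoid the full decomposition performed here.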

Network alignment
After embedding each node in $V_1$ and $V_2$, we use greedy alignment [28] to align the nodes between the two graphs. We calculate the similarity between each pair of node embeddings $emb_u$, $u \in V_1$, and $emb_v$, $v \in V_2$, using the cosine similarity given by Eq. 9, then we select the most similar pairs and add them to the obtained set of aligned nodes $S'$.
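The alignment step can be illustrated with the sketch below, which computes pairwise cosine similarities and then repeatedly takes the best remaining pair. The one-to-one constraint (each node is aligned at most once) and the `n_pairs` stopping rule are assumptions about how the greedy alignment of [28] is applied; the function names are hypothetical, and embeddings are assumed to have non-zero norm.

```python
import numpy as np


def cosine_similarity_matrix(X, Y):
    """Cosine similarity between every row of X and every row of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def greedy_align(X, Y, n_pairs):
    """Pick the n_pairs most similar (u, v) index pairs, one-to-one.

    ASSUMPTION: a node that is already aligned is never aligned again.
    """
    sim = cosine_similarity_matrix(X, Y)
    order = np.argsort(sim, axis=None)[::-1]  # flat indices, best first
    used_u, used_v, pairs = set(), set(), []
    for flat in order:
        u, v = (int(i) for i in np.unravel_index(flat, sim.shape))
        if u in used_u or v in used_v:
            continue
        pairs.append((u, v))
        used_u.add(u)
        used_v.add(v)
        if len(pairs) == n_pairs:
            break
    return pairs


# Toy embeddings (illustrative data only).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_toy = np.array([[0.0, 2.0], [3.0, 0.1]])
aligned_toy = greedy_align(X_toy, Y_toy, n_pairs=2)
```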

Link prediction
Link prediction in a graph predicts the probability of a link between two nodes that have no observed link between them. Given two graphs $G_1$ and $G_2$, we predict this probability for the unobserved links between any two nodes $u, v$ in $G_1$ with $(u, v) \notin E_1$ whose alignments are already linked in $G_2$, that is, $(ali(u), ali(v)) \in E_2$. To achieve this, we use supervised learning in the form of Logistic Regression (LR). We train the model with the existing edges in $E_1$ and $E_2$, then predict the unobserved edges in each graph as described above. As features, we feed the LR with the embedding of each endpoint of the link and the Hadamard product (element-wise product) of these embeddings [12]. The LR prediction lies in [0, 1]; we add to each graph $G_1$ and $G_2$ only the predicted edges whose scores are above a threshold $t$.
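The feature construction and thresholding described above can be sketched as follows. The order in which the two embeddings and their Hadamard product are concatenated, the helper names, and the hand-written logistic scoring function are assumptions for illustration; in practice the weights would come from a trained logistic regression model rather than being supplied by hand.

```python
import math


def edge_features(emb_u, emb_v):
    """Concatenate emb_u, emb_v and their Hadamard (element-wise) product.

    ASSUMPTION: the concatenation order is illustrative.
    """
    hadamard = [a * b for a, b in zip(emb_u, emb_v)]
    return list(emb_u) + list(emb_v) + hadamard


def logistic_score(weights, bias, features):
    """Logistic regression output in [0, 1] for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def accept_edges(scored_edges, t=0.5):
    """Keep only the candidate edges whose score is strictly above t."""
    return [edge for edge, score in scored_edges if score > t]


# Toy example with 2-dimensional embeddings (illustrative data only).
feats_toy = edge_features([1, 2], [3, 4])
kept_toy = accept_edges([(("a", "b"), 0.9), (("a", "c"), 0.5)], t=0.5)
```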
According to Algorithm 1, these three main steps are repeated iteratively. In each iteration, we embed all the nodes, align nodes between the networks to expand the alignment seeds, and predict links in each network to expand the two graphs with more links. These expansions feed back into the Expanded Graph and further improve the precision and recall in each iteration.
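The iterative loop described above can be outlined as a small driver function. Algorithm 1 itself is not reproduced in this text, so this is only a structural sketch: the three steps are passed in as hypothetical callables (`embed`, `align`, `predict`), and their signatures are assumptions made for the example.

```python
def joint_iterations(g1_edges, g2_edges, seeds, embed, align, predict, n_iter):
    """Repeat embed -> align -> predict, growing the seeds and both edge sets.

    embed(g1, g2, seeds)        -> embeddings (any object)
    align(emb, seeds)           -> iterable of new (u, v) aligned pairs
    predict(emb, g1, g2, seeds) -> (new edges for G1, new edges for G2)
    All three callables are hypothetical placeholders for the paper's steps.
    """
    g1_edges, g2_edges, seeds = set(g1_edges), set(g2_edges), set(seeds)
    for _ in range(n_iter):
        emb = embed(g1_edges, g2_edges, seeds)          # node embedding
        seeds |= set(align(emb, seeds))                 # network alignment
        new1, new2 = predict(emb, g1_edges, g2_edges, seeds)  # link prediction
        g1_edges |= set(new1)
        g2_edges |= set(new2)
    return g1_edges, g2_edges, seeds


# Stub steps that only demonstrate the data flow (illustrative only).
result_toy = joint_iterations(
    g1_edges=[("b", "c")],
    g2_edges=[("x", "y")],
    seeds=[],
    embed=lambda g1, g2, s: None,
    align=lambda emb, s: [("a", "x")],
    predict=lambda emb, g1, g2, s: ([("a", "b")], []),
    n_iter=2,
)
```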

The proposed method variations
Based on the three main steps described above, we propose three variations of the proposed method that share the network alignment and link prediction steps and differ in the node embedding step. These variations are:
• EG-FST: It uses the FST similarity measure.
• EG-LAP: It uses the Laplacian similarity measure.
• EG-Mini: A customized version of EG-FST that does not recalculate the embedding in each iteration and does not add the $E_{sim}$ edge set. We added this variation to evaluate the effect of these two factors on the overall process.

Experiments
We validated our method by running extensive experiments on several datasets to measure the precision and recall of both network alignment and link prediction. We also evaluated the effect of network size on the time consumed by the variations of the proposed method. Finally, since we use supervised learning, we evaluated how the training rate affects the measures. All experiments ran on Google Colab on a device with 13GB memory, 2 physical 1-core CPUs (Intel(R) Xeon(R) CPU @ 2.20GHz) and an 11441 MiB GPU (NVIDIA TESLA K80).

Datasets
We evaluated the variations of our method on three datasets: Facebook/Twitter, Douban online/offline and DBLP/DBLP disturbed copy. The statistics of all datasets can be found in Table 1. Interoperability indicates how much the edges of the aligned nodes in the two networks overlap. The interoperability of two graphs $G_1$ and $G_2$ is calculated with Eq. 10, where the intersection between the edges of the two networks is the intersection between the edges of the first network and the aligned edges of the second.
All datasets consist of the nodes and edges of two networks.
• Facebook/Twitter: This dataset was collected by [29] from about.me, a platform that gathers the different online social network accounts associated with the same user. Facebook and Twitter accounts were collected, and the graphs map the users to nodes and friendships to edges.
• Douban online/offline: Douban is a Chinese online social network; the dataset was published by [30]. They took a subgraph from it (online) and constructed another graph based on real-world relations (offline). The alignment between the two networks is the match between the real-world person and the Douban user.
• DBLP/DBLP disturbed copy: A co-authorship graph collected by [31] in which the nodes are authors and an edge links two authors if they have any academic work in common. We used the same version used in [12], where the researchers randomly generated another graph from it to align with the original while preserving the properties of the graph.

Evaluation protocols
For network alignment, we evaluate the precision at the end of the iterations. Precision is the number of correctly aligned nodes divided by the number of alignments we made. Recall is the number of correctly aligned nodes divided by the overall number of aligned nodes. Since we evaluate network alignment over all the aligned nodes in the last iteration, the number of alignments made at the end of the iterations equals the overall number of aligned nodes. Therefore, the network alignment precision and recall are the same in our proposed method, and we report only the network alignment precision in the results. For link prediction, we also evaluate the precision and recall at the end of the iterations. The links that we should predict are the links that do not exist in the first graph but exist between the aligned nodes in the second graph, and vice versa. We evaluate link prediction by comparing this set with the set of predicted edges in each network.
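The link prediction evaluation described above can be sketched as a set comparison. The sketch below is a minimal illustration under the assumption that edges are undirected and that the ground-truth alignment is available as a dictionary from nodes of the first graph to nodes of the second; the function names and toy data are hypothetical.

```python
def expected_links(edges1, edges2, alignment):
    """Links absent from G1 whose aligned endpoints are linked in G2.

    alignment is an assumed dict {node in G1: aligned node in G2}.
    """
    inverse = {v: u for u, v in alignment.items()}
    present = {frozenset(e) for e in edges1}
    expected = set()
    for x, y in edges2:
        if x in inverse and y in inverse:
            candidate = frozenset((inverse[x], inverse[y]))
            if candidate not in present:
                expected.add(candidate)
    return expected


def precision_recall(predicted, expected):
    """Precision and recall of a predicted edge set against the expected set."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Toy graphs: G2 contains the link (y, z) whose counterpart (b, c) is missing in G1.
truth_toy = expected_links(
    edges1=[("a", "b")],
    edges2=[("x", "y"), ("y", "z")],
    alignment={"a": "x", "b": "y", "c": "z"},
)
pr_toy = precision_recall(
    predicted={frozenset(("b", "c")), frozenset(("a", "c"))},
    expected=truth_toy,
)
```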

Datasets evaluation
We evaluated the following methods on the three datasets and measured the network alignment precision, the link prediction precision and the link prediction recall. We also evaluated how the training rate affects the measures. We define the training rate TR in our method as the fraction of seeds with which we start the proposed method.
• CENALP: [12] proposed a cross-network skip-gram embedding with the same general algorithm for jointly aligning nodes and predicting links. It is the latest and currently the most advanced method that performs the two tasks jointly, whereas all other methods perform only one task or perform the two tasks separately. We used the parameters discussed in the paper ($\alpha = 5$, $K = 3$, $q = 0.2$, $c = 0.5$).
• EG-FST: Our proposed method using the FST similarity measure. We set the parameters to ($\alpha = 5$, $K = 3$, $t = 0.5$).
• EG-LAP: Our proposed method using the Laplacian similarity measure. We set the same parameters as for EG-FST.
• EG-Mini: A customized version of EG-FST that does not recalculate the embedding in each iteration and does not add the $E_{sim}$ edge set. We set the same parameters as for EG-FST.
As these methods involve some random initialization, we ran them 10 times and took the average value of each measurement. Figures 4, 5 and 6 show that all of our proposed methods outperform CENALP on Facebook/Twitter and Douban but fall behind it on DBLP. This is intuitively reasonable, since Facebook/Twitter and Douban are social networks of real-world people in which friendship or follow relationships form the edges, while in DBLP the edges are based on the co-authorship of papers. As expected, the gain grows as TR increases. The results also show that the largest benefit is in link prediction recall, where the gain over CENALP was 500% on Douban and 180% on Facebook/Twitter. Douban shows that CENALP fails when the interoperability is relatively low, whereas our methods are not affected by this factor. In addition, EG-Mini shows that repeating the embedding through the iterations and adding the $E_{sim}$ edge set do not change the results substantially.

Comparing multiple similarity measures
We compared the similarity measures used in our proposed methods with other similarity measures. We tested the following similarity measures:
• Adjacency Matrix: the default adjacency matrix, which measures the similarity between two nodes as 1 if they are connected and 0 if they are not.
• Laplacian Matrix: as described in the "The proposed method" section.
• Symmetric Normalized Laplacian Matrix SNL: calculated with Eq. 11, where D is the degree matrix and A is the adjacency matrix.
• Transition Matrix: as given in the "The proposed method" section.
• Personalized PageRank Matrix PPR: given by Eq. 12, where I is the identity matrix, P is the transition matrix and $\alpha$ is a hyperparameter (we used 0.1 as its value).
• Finite Step Transition Matrix FST: as described in the "The proposed method" section.
Figure 7 shows the results for network alignment precision, link prediction precision and link prediction recall. The best three measures across all measurements are Laplacian, Personalized PageRank and Finite Step Transition. We can infer that the Laplacian similarity measure is preferable when higher link prediction precision is desired, and the Finite Step Transition similarity measure is preferable when higher link prediction recall is desired.
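For reference, the two additional matrix-based measures in the list above can be sketched with NumPy using their commonly used closed forms: $I - D^{-1/2} A D^{-1/2}$ for the symmetric normalized Laplacian and $\alpha (I - (1 - \alpha) P)^{-1}$ for Personalized PageRank. Eq. 11 and Eq. 12 are not reproduced in this text, so these forms, and in particular the role of $\alpha$ as the restart probability, are assumptions; the function names are hypothetical.

```python
import numpy as np


def symmetric_normalized_laplacian(A):
    """ASSUMED standard form: I - D^(-1/2) A D^(-1/2)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes contribute zero
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]


def personalized_pagerank(A, alpha=0.1):
    """ASSUMED standard form: alpha * (I - (1 - alpha) P)^-1, P = D^-1 A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0
    P = A / deg[:, None]
    n = len(A)
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * P)


# Toy path graph with three nodes (illustrative data only).
A_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
snl_toy = symmetric_normalized_laplacian(A_path)
ppr_toy = personalized_pagerank(A_path, alpha=0.1)
```

Under this form each row of the PPR matrix sums to 1 on graphs without isolated nodes, which gives a quick sanity check on an implementation.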

Network size effect on time
We tested the proposed method against other methods for network size scalability to show how the overall running time is affected. We ran the experiments on five datasets extracted from the DBLP dataset, since it is large enough to form these sub-datasets. Statistics of the five datasets can be found in Table 2. The methods we compared are:
• CENALP/5 [12]: The CENALP method as described before, but since this method has a very wide range of duration values (because of the skip-gram training in
As Figure 8 shows, the SVD embedding reduces the running time considerably, and EG-GPU and EG-Mini-GPU are close to each other even though EG-GPU recalculates the embedding in each iteration. The figure also reflects the behaviour of the compared methods on very large datasets, where the time grows exponentially for CENALP, CENALP-MA and EG-CPU while it is near-linear for EG-GPU, EG-Mini-GPU and EG-Mini-CPU.

Conclusion
We have proposed a new embedding technique that applies single-graph embedding algorithms to one graph (the Expanded Graph) generated from multiple graphs. The variations of our method significantly improved link prediction precision, by 50% using EG-LAP, and link prediction recall, by 500% on some datasets using EG-FST. In addition, we used a computed embedding that reduced the running time to near-linear on large social networks. Thus, we improved the method pioneered by CENALP to achieve more accurate results while applying network alignment and link prediction simultaneously. Much remains to be done as future work, such as supporting directed graphs, weighted graphs and attributed edges, in addition to improving the measures to further increase link prediction precision and recall.