Basic definitions and the random walk procedure
Let G(V, E) denote a graph of n vertices and m edges, where \(V=\left\{ v_{i}|i=1,\ldots ,n\right\}\) is the set of vertices and \(E=\left\{ e_{i}|i=1,\ldots ,m\right\}\) is the set of edges. Let \(A\in \mathcal {R}^{n\times n}\) be the adjacency matrix of the graph G, with elements \(A_{ij}\). Let \(D\in \mathcal {R}^{n\times n}\) be the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices. In this paper, we assume the graph is undirected, unweighted and does not contain self-loops.
The clustering phenomenon is very common in big graph data. A cluster in a graph is a vertex set in which the density of edges inside the set is much higher than the density of edges that link the inside vertices to the outside vertices.
A random walk on a graph is a simple stochastic procedure. Initially, an agent stays on a chosen vertex (the seed vertex). At each step, the agent picks a neighboring vertex uniformly at random and moves to it. The agent repeats this movement, and after each step there is a certain probability that the agent is on any given vertex.
Let \(x_{i}^{(t)}\) denote the probability that the agent is on vertex \(v_{i}\) after step t, where \(i=1,2,\ldots n\). \(x_{i}^{(0)}\) is the probability of the initial state. Let s be the seed vertex. We have \(x_{s}^{(0)}=1\), and \(x_{i}^{(0)}=0\) for \(i\ne s\). Let \(x^{(t)}=\left[ x_{1}^{(t)},x_{2}^{(t)},\ldots ,x_{n}^{(t)}\right] ^{T}\) be the probability vector, where the superscript T denotes the transpose of a matrix or a vector. By the definition of the probability, it is easy to see that \(\sum _{i=1}^{n}x_{i}^{(t)}=1\) or \(\left\| x^{(t)}\right\| _{1}=1\).
The random walk procedure is equivalent to a discrete-time stationary Markov chain process. Each vertex corresponds to a state in the Markov chain and each edge indicates a possible transition between two states. The Markov transition matrix P can be obtained by normalizing the adjacency matrix so that each column sums to 1, i.e.
$$\begin{aligned} P_{ij}=\frac{A_{ij}}{\sum _{k=1}^{n}A_{kj}} \end{aligned}$$
(1)
or
$$\begin{aligned} P=AD^{-1}. \end{aligned}$$
(2)
Other forms of the transition matrix P can also be used; for example, the lazy random walk uses the transition matrix \(P=\frac{1}{2}(I+AD^{-1})\), where I is the identity matrix. Given the transition matrix P, we can calculate \(x^{(t+1)}\) from \(x^{(t)}\) using the equation:
$$\begin{aligned} x^{(t+1)}=Px^{(t)}. \end{aligned}$$
(3)
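As an illustration of Eqs. 1 and 3, the following minimal sketch builds P by column normalization and iterates the walk. The toy graph and the helper names `transition_matrix` and `walk_step` are assumptions made for this example only.

```python
def transition_matrix(adj):
    """Column-normalize an adjacency matrix: P[i][j] = A[i][j] / sum_k A[k][j] (Eq. 1)."""
    n = len(adj)
    col_sums = [sum(adj[k][j] for k in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def walk_step(P, x):
    """One step of the walk, x_next = P x (Eq. 3)."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
P = transition_matrix(A)

x = [1.0, 0.0, 0.0, 0.0]  # the agent starts on the seed vertex s = 0
for _ in range(3):
    x = walk_step(P, x)
    print([round(v, 4) for v in x], "sum =", round(sum(x), 4))
```

After the first step the probability is split evenly between the two neighbors of the seed, and the entries keep summing to 1 at every step because each column of P sums to 1.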
A closed walk is a walk on a graph where the ending vertex is the same as the starting vertex. The period of a vertex is defined as the greatest common divisor of the lengths of all closed walks that start from this vertex. We say a graph is aperiodic if all of its vertices have a period of 1.
For an undirected, connected and aperiodic graph, there exists an equilibrium state \(\pi\) such that \(\pi =P\pi\). This state is unique and independent of the starting point. By iterating Eq. 3, \(x^{(t)}\) converges to \(\pi\). More details about the Markov chain process and the equilibrium state can be found in [20].
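As a quick check of the equilibrium condition for the transition matrix of Eq. 2 (a standard fact about random walks on undirected graphs, added here as an aside), let \(d=A\mathbf {1}\) be the vector of vertex degrees, so that \(D^{-1}d=\mathbf {1}\) and \(\sum _{i}d_{i}=2m\). The vector \(\pi =d/(2m)\) then satisfies

$$\begin{aligned} P\pi =AD^{-1}\frac{d}{2m}=\frac{A\mathbf {1}}{2m}=\frac{d}{2m}=\pi ,\qquad \sum _{i=1}^{n}\pi _{i}=\frac{1}{2m}\sum _{i=1}^{n}d_{i}=1. \end{aligned}$$

Under \(P=AD^{-1}\), the equilibrium probability of a vertex is therefore proportional to its degree.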
Limited random walk procedure
Definitions
We first define the transition matrix P. We assign the same probability to the transition that the walking agent stays in the current vertex and the transition that it moves to any neighboring vertex. We add an identity matrix to the adjacency matrix and then normalize the result to have each column sum to 1. The transition matrix can be written as
$$\begin{aligned} P=\left( I+A\right) \left( I+D\right) ^{-1}. \end{aligned}$$
(4)
Compared with the transition matrix in Eq. 2, this is similar to adding self-loops to the graph, but it increases the degree of each vertex by 1 instead of 2. This modification fixes the periodicity problem that the graph may have [20]. It greatly improves the stability and accuracy of the algorithm in graph clustering.
At each walking step, the probability vector \(x^{(t)}\) is computed using Eq. 3. Note that, in general, only the elements of \(x^{(t)}\) that correspond to vertices near the seed vertex are nonzero; the rest are zero. So we do not need the full transition matrix to calculate the probability vector for the next step.
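The effect of Eq. 4 on periodicity can be seen on a small bipartite graph. The sketch below is an illustration under assumed names (`plain_transition`, `self_loop_transition`, `walk`) and an assumed toy graph: with \(P=AD^{-1}\) the walk oscillates forever, while with Eq. 4 it settles.

```python
def column_normalize(M):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(M)
    col_sums = [sum(M[k][j] for k in range(n)) for j in range(n)]
    return [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def plain_transition(adj):
    """P = A D^{-1} (Eq. 2)."""
    return column_normalize(adj)


def self_loop_transition(adj):
    """P = (I + A)(I + D)^{-1} (Eq. 4): add 1 to the diagonal, then normalize."""
    n = len(adj)
    return column_normalize(
        [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    )


def walk(P, x, steps):
    """Apply x <- P x for the given number of steps."""
    for _ in range(steps):
        x = [sum(p * v for p, v in zip(row, x)) for row in P]
    return x


# Hypothetical toy graph: the path 0 - 1 - 2, which is bipartite and therefore periodic.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
seed = [1.0, 0.0, 0.0]

print(walk(plain_transition(A), seed, 50))      # even step count
print(walk(plain_transition(A), seed, 51))      # odd step count
print(walk(self_loop_transition(A), seed, 50))  # settles near (2/7, 3/7, 2/7)
```

With Eq. 2 the agent alternates between the middle vertex and the two end vertices, so \(x^{(t)}\) never converges. With Eq. 4 the walk approaches a vector proportional to the degree plus one of each vertex.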
Starting from the seed vertex, a normal random walk procedure will eventually explore the whole graph. To reveal a local graph structure, different techniques can be used to limit the scope of the walks. Harel and Koren fix the number of walking steps by a predefined constant [21]. Xin et al. use a stochastic method to determine if a walk should be continued and set the maximum number of walking steps to be 6 according to the principle of “six degrees of separation” [14]. In [13, 17, 22], the random walk function is defined as
$$\begin{aligned} x^{(t+1)}=\alpha x^{(0)}+(1-\alpha )Px^{(t)}, \end{aligned}$$
(5)
where \(\alpha\) is called the teleport probability. The idea is that there is a certain probability that the walking agent will teleport back to the seed vertex and continue walking.
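For comparison with the procedure developed below, Eq. 5 can be iterated as in the following sketch. The value \(\alpha =0.15\), the fixed number of steps, the toy graph and the function name `teleport_walk` are illustrative assumptions, not values taken from the cited works.

```python
def teleport_walk(P, seed, alpha=0.15, steps=50):
    """Iterate x <- alpha * x0 + (1 - alpha) * P x (Eq. 5) from the seed vertex."""
    n = len(P)
    x0 = [1.0 if i == seed else 0.0 for i in range(n)]
    x = list(x0)
    for _ in range(steps):
        px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [alpha * x0[i] + (1 - alpha) * px[i] for i in range(n)]
    return x


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
degree = [sum(row) for row in A]
P = [[A[i][j] / degree[j] for j in range(n)] for i in range(n)]  # P = A D^{-1}

x = teleport_walk(P, seed=0)
print([round(v, 3) for v in x])
```

The teleport term keeps more probability on the seed's side of the graph than a plain walk would, but every vertex of a connected graph still receives a nonzero value.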
Inspired by the Markov clustering (MCL) algorithm [12], we adopt the inflation and normalization operations after each transition step. The inflation operation is an element-wise super-linear function, that is, a function that grows faster than a linear function. Here we use the power function
$$\begin{aligned} f\left( x\right) =\left[ x_{1}^{r},x_{2}^{r},\ldots ,x_{n}^{r}\right] ^{T}, \end{aligned}$$
(6)
where the exponent \(r>1\). Since x indicates the probability that the agent hits each vertex, x must be normalized to have a sum of 1 after the inflation operation. The normalization function is defined as
$$\begin{aligned} g(x)=\frac{x}{\left\| x\right\| _{1}}, \end{aligned}$$
(7)
where \(\left\| x\right\| _{1}=\sum _{i=1}^{n}\left| x_{i}\right|\) is the \(L_{1}\) norm of the vector x. Since \(x_{i}\ge 0\), we have \(\left\| x\right\| _{1}=x^{T}\cdot \mathbf {1}\), so Eq. 7 can also be written in vector form as
$$\begin{aligned} g(x)=\frac{x}{x^{T}\cdot \mathbf {1}}, \end{aligned}$$
(8)
where \(\mathbf {1}=\left[ 1,1,\ldots ,1\right] ^{T}\). The inflation and normalization operations enhance large values and suppress small values in the vector x.
We call the aforementioned procedure the limited random walk (LRW) procedure. Compared with the normal random walk procedure defined in "Basic definitions and the random walk procedure" section, LRW involves inflation and normalization operations in each walking step. These nonlinear operations limit the agent to walk around the neighborhood of the seed vertex, especially if there is a clear graph cluster boundary.
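As a small worked example of Eqs. 6 and 7 (the input vector is chosen only for illustration), take \(x=\left[ 0.5,0.3,0.2\right] ^{T}\) and \(r=2\):

$$\begin{aligned} f(x)=\left[ 0.25,0.09,0.04\right] ^{T},\qquad \left\| f(x)\right\| _{1}=0.38,\qquad g(f(x))\approx \left[ 0.658,0.237,0.105\right] ^{T}. \end{aligned}$$

The largest element grows from 0.5 to about 0.658, while the smallest shrinks from 0.2 to about 0.105.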
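The procedure can be sketched in a few lines by combining Eqs. 3, 4, 6 and 7. This is a dense, illustrative version under assumed names (`lrw_transition`, `lrw_step`, `limited_random_walk`), an assumed stopping rule and an assumed toy graph; it is not the authors' implementation.

```python
def lrw_transition(adj):
    """P = (I + A)(I + D)^{-1} (Eq. 4)."""
    n = len(adj)
    m = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    col = [sum(m[k][j] for k in range(n)) for j in range(n)]
    return [[m[i][j] / col[j] for j in range(n)] for i in range(n)]


def lrw_step(P, x, r=2.0):
    """One LRW step: transition (Eq. 3), inflation (Eq. 6), normalization (Eq. 7)."""
    px = [sum(p * v for p, v in zip(row, x)) for row in P]
    inflated = [v ** r for v in px]
    total = sum(inflated)
    return [v / total for v in inflated]


def limited_random_walk(adj, seed, r=2.0, max_iter=200, tol=1e-10):
    """Iterate lrw_step from the seed until the vector stops changing (assumed rule)."""
    P = lrw_transition(adj)
    x = [1.0 if i == seed else 0.0 for i in range(len(adj))]
    for _ in range(max_iter):
        new_x = lrw_step(P, x, r)
        done = max(abs(a - b) for a, b in zip(new_x, x)) < tol
        x = new_x
        if done:
            break
    return x


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

x = limited_random_walk(A, seed=0)
print([round(v, 4) for v in x])
```

On this toy graph almost all of the probability stays in the seed's triangle, and the largest value falls on vertex 2, the vertex of that triangle with the highest degree.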
The MCL algorithm simulates flow within a graph. It uses the inflation and normalization operations to enhance the flow within a cluster and reduce the flow between clusters. The MCL procedure is a time-inhomogeneous Markov chain in which the transition matrix varies over time. The MCL algorithm starts the random walk from all vertices simultaneously, so there are n agents walking on the graph at the same time. The walk can only continue after all agents have completed a walking step and the resulting probability matrix has been inflated and normalized. Unlike the MCL procedure, the LRW procedure is a time-homogeneous Markov chain. We initiate the random walk from a single seed vertex and apply the inflation to the probability values of this walking agent. This design has several advantages. First, it avoids unnecessary walks, since the graph structure around the seed vertex may be exposed by a single walk. Second, the procedure is suitable for local clustering problems because it does not require the whole graph data. Third, if multiple walks are required, each walk can be executed independently. Thus the algorithm is fully parallelizable.
The LRW procedure involves a nonlinear operation, thus it is difficult to analyze its properties on a general graph model. Next we study the equilibrium of the LRW procedure.
Equilibrium of the LRW procedure
We first prove the existence of equilibrium of the LRW procedure. Let X be the set of values of the probability vector x. We have
$$\begin{aligned} X = \{ (x_{1},x_{2},\ldots ,x_{n})\ \vert \ 0\le x_{1},x_{2},\ldots ,x_{n}\le 1 \ \mathrm {and}\ x_{1}+x_{2}+\cdots +x_{n}=1\} . \end{aligned}$$
(9)
The LRW procedure defined by Eqs. 3, 6 and 7 is a function that maps X to itself. Let \(\mathcal {L}:\ X\rightarrow X\), such that
$$\begin{aligned} \mathcal {L}(x)=g(f(Px)). \end{aligned}$$
(10)
Theorem 1
There exists a fixed point \(x^{*}\) such that \(\mathcal {L}\left( x^{*}\right) =x^{*}\).
Proof
We use the Brouwer fixed-point theorem to prove this statement.
Given \(0\le x_{1},x_{2},\cdots ,x_{n}\le 1\), the set X is clearly bounded and closed. Thus X is a compact set.
Let \(u,v\in X\) and \(w=\lambda u+(1-\lambda )v\), where \(\lambda \in \mathcal {R}\) and \(0\le \lambda \le 1\). So \(w_{i}=\lambda u_{i}+(1-\lambda )v_{i}\) for \(i=1,2,\cdots n\). Obviously \(0\le w_{i}\le 1\).
Further,
$$\begin{aligned} \sum _{i=1}^{n}w_{i}= & \, \sum _{i=1}^{n}\left( \lambda u_{i}+\left( 1-\lambda \right) v_{i}\right) \\= & \, \lambda \sum _{i=1}^{n}u_{i}+\left( 1-\lambda \right) \sum _{i=1}^{n}v_{i}\\= & \, 1 \end{aligned}$$
Thus \(w\in X\). This indicates that the set X is convex.
Since P is column-stochastic, the linear map \(x\mapsto Px\) is continuous and maps X into X. The function f(x) is continuous over the set X, and for any \(y\in X\) the vector f(y) is nonzero with non-negative elements, so the function g(x) is continuous over the image of f. Hence the function \(\mathcal {L}\) is continuous over the set X.
Given that \(\mathcal {L}\) is a continuous function that maps a compact convex set to itself, according to the Brouwer fixed-point theorem, there is a point \(x^{*}\) such that \(\mathcal {L}\left( x^{*}\right) =x^{*}\). \(\square\)
Theorem 1 shows the existence of a fixed point of the LRW procedure, i.e., the LRW procedure will not leave a fixed point once that point is reached. Since the LRW procedure is a nonlinear discrete dynamical system, it is difficult to investigate the system behavior analytically. However, when \(r=1\), the LRW procedure is simply a Markov chain process, in which the fixed point \(x^{*}\) is the unique equilibrium state \(\pi\) and the global attractor. In the other extreme case, when \(r\rightarrow \infty\), a fixed point can be an unstable equilibrium and the LRW procedure may have limit cycles that oscillate around a star structure in the graph. In one state of the oscillation, the probability value of the center of a star structure is close to one. In practice, we choose r from (1, 2]. This keeps the LRW procedure close to a linear system, and oscillations are extremely rare. In this case, the fixed points of the LRW procedure are stable equilibria.
Limited random walk on general graphs
Without any prior knowledge of the cluster formation, we normally start the LRW procedure from an initial state where \(x_{s}=1\), \(x_{i}=0\) for \(i\ne s\) and s is the seed vertex. During the LRW procedure, there are two simultaneous processes—the spreading process and the contracting process. When the two processes can balance each other, a stationary state is reached.
During the spreading process, the probability values spread as the walking agent visits new vertices. The number of visited vertices increases exponentially at first. The growth rate depends on the average degree of the graph. The newly visited vertices will always receive the smallest probability values. If the graph has an average degree of d, it is not difficult to see that the expected probability value of a newly visited vertex at step t is \(\left( {1}/{d}\right) ^{t}\). As the walking continues, the probability values tend to be distributed more evenly among all visited vertices.
The other ongoing process during the LRW procedure is the contracting process. During this process, the probability values of the visited vertices contract to some vertices. Since the graph is usually heterogeneous, some vertices (and groups of vertices) will receive higher probability values as the procedure continues. The inflation operation further enhances this contracting effect. The degrees of a vertex and its surrounding vertices determine whether the probability values concentrate to or diffuse from these vertices. Some vertices, normally the center of a star structure, receive larger probability values than others. We call these vertices attractor vertices and they can be used to represent the structure of a graph.
Because the density of edges inside a cluster is higher than the density of edges linking vertices inside and outside the cluster, the probability that a walking agent visits vertices outside the cluster is small. Thus, the LRW procedure will find the attractor vertices with which the seed vertex is associated. We can use these vertices as features to cluster the vertices.
The larger the inflation exponent r is, the faster the algorithm converges to the attractor vertices. The LRW procedure tries to find the attractor vertices that are near the seed vertex. However, if r is too large, the probability values concentrate to the nearest attractor vertex (or the seed vertex itself) before the graph is sufficiently explored. If r is too small, the probability values will concentrate to the attractor vertices that may belong to other clusters. The performance of the LRW algorithm depends on choosing a proper inflation exponent r. From this aspect, it is similar to the MCL algorithm. In practice, r is normally chosen between 1 and 2 and the value 2 was found to be suitable for most graphs.
LRW for global graph clustering problems
In this section, we describe how to use the LRW procedure in global graph clustering problems. Our algorithm is divided into two phases: the graph exploring phase and the cluster merging phase. To improve the performance on big graph data, we also propose a multi-stage strategy.
Graph exploring phase
In the graph exploring phase, the LRW procedure is started from several seed vertices. At each iteration, the agent moves one step as defined in Eq. 3. Then the probability vector x is inflated by Eq. 6 and normalized by Eq. 7. The iteration stops when the probability vector x converges or the predefined maximum number of iterations is reached. Let \(x^{(*,i)}\) denote the final probability vector of a random walk that was started from the seed vertex \(v_{i}\). As described in the previous section, the LRW procedure explores the vertices that are close to the seed vertex. Thus, the vector \(x^{(*,i)}\) has non-zero elements only on these neighboring vertices.
Algorithm 1 illustrates the graph exploring from a seed vertex set Q. Note that for small graph data, we can set the seed vertex set \(Q=V\) (i.e. the whole graph). In such case, the LRW procedure is executed on every vertex of the graph and the multi-stage strategy is not used.
Note that the threshold \(\epsilon\) limits the number of nonzero elements in the probability vector x. It is easy to prove that the number of nonzero elements in \(x^{(t,i)}\) is at most \({1}/{\epsilon }\), because each nonzero element is at least \(\epsilon\) and the elements sum to 1. A larger \(\epsilon\) eliminates very small values in \(x^{(t,i)}\) and prevents unnecessary computing effort. However, \(\epsilon\) does not impose a limit on the largest cluster we can find. Further, the choice of \(\epsilon\) has little impact on the final clustering results because either the LRW procedure finds the most dominant attractor vertices in a cluster or the small clusters are merged in the cluster merging phase.
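A minimal sketch of one graph-exploring walk that stores only the nonzero entries is given below. It is not a transcription of Algorithm 1: the adjacency-list input `neighbors`, the function name `lrw_explore`, the choice to apply the threshold after normalization and then renormalize, and the convergence test are all assumptions made for illustration.

```python
def lrw_explore(neighbors, seed, r=2.0, eps=1e-4, max_iter=100, tol=1e-8):
    """Sparse LRW from one seed; returns {vertex: probability} for nonzero entries."""
    x = {seed: 1.0}
    for _ in range(max_iter):
        # Transition with P = (I + A)(I + D)^{-1}, touching only nonzero entries.
        px = {}
        for j, mass in x.items():
            share = mass / (len(neighbors[j]) + 1)
            for i in (j, *neighbors[j]):
                px[i] = px.get(i, 0.0) + share
        # Inflation (Eq. 6) and normalization (Eq. 7).
        inflated = {i: v ** r for i, v in px.items()}
        total = sum(inflated.values())
        # Assumed thresholding rule: drop entries below eps, then renormalize.
        kept = {i: v / total for i, v in inflated.items() if v / total >= eps}
        if not kept:  # eps exceeds every entry: keep only the largest one
            kept = {max(inflated, key=inflated.get): 1.0}
        kept_total = sum(kept.values())
        new_x = {i: v / kept_total for i, v in kept.items()}
        diff = max(abs(new_x.get(i, 0.0) - x.get(i, 0.0)) for i in set(x) | set(new_x))
        x = new_x
        if diff < tol:
            break
    return x


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
x_star = lrw_explore(neighbors, seed=0)
print({v: round(p, 4) for v, p in sorted(x_star.items())})
```

Because every retained entry is at least `eps` and the entries sum to 1, the dictionary never holds more than `1 / eps` vertices, whatever the size of the graph.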
Cluster merging phase
After the graph has been explored, we find the clusters in the cluster merging phase. We treat each \(x^{(*,i)}\) as the attractor vector for the vertex \(v_{i}\). Vertices belonging to the same cluster have attractor vectors that are close to each other. Any unsupervised clustering algorithm, such as k-means or single linkage clustering, can be applied to find the desired number (k) of clusters. Because of the computational complexity of these clustering algorithms, we design a fast merging algorithm that can efficiently cluster vertices according to their attractor vectors.
Each element \(x_{j}^{(*,i)}\) in \(x^{(*,i)}\) is the stationary-state probability that the walking agent is on the vertex \(v_{j}\) when the seed vertex is \(v_{i}\). The attractor vector \(x^{(*,i)}\) is determined by the graph structure of the cluster to which the seed vertex \(v_{i}\) belongs. Thus, vertices in the same cluster should have very similar attractor vectors. We first find the vertex that has the largest value in the vector \(x^{(*,i)}\). Let \(m=\arg \max _{j}\left( x_{j}^{(*,i)}\right)\); we call \(v_{m}\) the attractor vertex of vertex \(v_{i}\). Grouping vertices by their attractor vertex can be done quickly (\(O(1)\) per vertex) using a dictionary data structure. After the grouping, each vertex is assigned to a cluster that is identified by the attractor vertex. However, it is possible that some vertices in one cluster do not have the same attractor vertex. This may happen when the cluster is large and the edge density in the cluster is low. We then apply the following cluster merging algorithm to handle this overclustering problem.
The vertices that have large values in \(x^{(*,i)}\), as determined by a threshold relative to \(x_{m}^{(*,i)}\), are called significant vertices for vertex \(v_{i}\). If the significant vertices of two vertices overlap sufficiently, the two vertices should be grouped into the same cluster. Based on this observation, we first collect the significant vertices for the found clusters. Then we merge clusters if their significant vertices overlap by more than half. Note that the attractor vertex and the significant vertices are always in the same cluster as the seed vertex. This is very useful when we use the multi-stage strategy.
Algorithm 2 shows the details of the merging phase of the LRW algorithm. Note that, for small graph data, we set the seed vertex set Q = V and the initial clustering dictionary \(\mathcal {D}\) to be empty.
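The grouping and merging idea can be sketched as follows. This is not Algorithm 2 itself: the function name `merge_by_attractors`, the relative threshold of 0.3, the pooling of significant vertices over all members of a cluster, and the overlap test (more than half of the smaller significant set) are assumptions chosen to make the example concrete.

```python
def merge_by_attractors(attractor_vectors, rel_threshold=0.3):
    """Group vertices by attractor vertex, then merge clusters with overlapping
    significant vertices. attractor_vectors maps vertex -> {vertex: probability}."""
    clusters = {}     # attractor vertex -> member vertices
    significant = {}  # attractor vertex -> pooled significant vertices
    for v, x in attractor_vectors.items():
        m = max(x, key=x.get)  # attractor vertex of v
        clusters.setdefault(m, set()).add(v)
        sig = {j for j, p in x.items() if p >= rel_threshold * x[m]}
        significant.setdefault(m, set()).update(sig)

    merged = True
    while merged:  # repeat until no pair of clusters qualifies for merging
        merged = False
        keys = list(clusters)
        for idx, a in enumerate(keys):
            for b in keys[idx + 1:]:
                overlap = len(significant[a] & significant[b])
                smaller = min(len(significant[a]), len(significant[b]))
                if overlap > 0.5 * smaller:
                    clusters[a] |= clusters.pop(b)
                    significant[a] |= significant.pop(b)
                    merged = True
                    break
            if merged:
                break
    return list(clusters.values())


# Hypothetical attractor vectors for six vertices forming two clusters.
vectors = {
    0: {0: 0.3, 1: 0.3, 2: 0.4},
    1: {0: 0.3, 1: 0.3, 2: 0.4},
    2: {0: 0.3, 1: 0.4, 2: 0.3},  # different attractor vertex, same cluster
    3: {3: 0.5, 4: 0.25, 5: 0.25},
    4: {3: 0.5, 4: 0.25, 5: 0.25},
    5: {3: 0.5, 4: 0.25, 5: 0.25},
}
print(merge_by_attractors(vectors))
```

In this example the first grouping yields three clusters, because vertex 2 has a different attractor vertex than vertices 0 and 1. The merging step joins them since their significant vertices coincide.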
Multi-stage strategy
For small graph data, we can perform the LRW procedure on every vertex of the graph, so the seed vertex set is \(Q=V\). The graph clustering is completed after one graph exploring phase and one cluster merging phase. However, when the graph data is large, it is time-consuming to perform the LRW procedure from every vertex of the graph. A multi-stage strategy can be used to greatly reduce the number of required walks. First, we start the LRW procedure from a randomly selected vertex set. After the first round of graph exploring, some clusters can be found in the cluster merging phase. Next we generate a new seed vertex set by randomly selecting vertices from those that have not been clustered. Then we explore the graph from the new seed vertex set. We repeat this procedure until all vertices are clustered.
Algorithm 3 shows the global graph clustering algorithm using the multi-stage strategy.
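The control flow of the multi-stage strategy can be sketched as below. It is a simplified illustration, not Algorithm 3: `cluster_from_seeds` is a hypothetical callable standing in for one graph exploring phase followed by one cluster merging phase, and the number of seeds per stage and the handling of overlapping clusters are assumptions.

```python
import random


def multi_stage_clustering(vertices, cluster_from_seeds, seeds_per_stage=2, rng_seed=0):
    """Repeat explore-and-merge on random unclustered seeds until all vertices are clustered.

    cluster_from_seeds is a hypothetical callable: it takes a list of seed
    vertices and returns an iterable of vertex sets, one per cluster found.
    """
    rng = random.Random(rng_seed)
    unclustered = set(vertices)
    clusters = []  # disjoint vertex sets found so far
    while unclustered:
        pool = sorted(unclustered)
        seeds = rng.sample(pool, min(seeds_per_stage, len(pool)))
        found_sets = [set(c) for c in cluster_from_seeds(seeds)]
        # Guarantee progress: a seed covered by no returned cluster becomes a singleton.
        covered = set().union(*found_sets)
        found_sets += [{s} for s in seeds if s not in covered]
        for found in found_sets:
            # Absorb every earlier cluster that shares a vertex with the new one.
            overlapping = [c for c in clusters if c & found]
            clusters = [c for c in clusters if not (c & found)]
            for c in overlapping:
                found |= c
            clusters.append(found)
        for c in clusters:
            unclustered -= c
    return clusters


# Toy stand-in for the explore-and-merge phases: it looks up a known partition.
truth = [{0, 1, 2}, {3, 4, 5}, {6, 7}]


def toy_cluster_from_seeds(seeds):
    return [set(c) for c in truth if any(s in c for s in seeds)]


print(multi_stage_clustering(range(8), toy_cluster_from_seeds))
```

Each stage only needs seeds from the vertices that are still unclustered, so the number of walks depends on the number of clusters rather than on the number of vertices.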
LRW for local graph clustering problems
For the local graph clustering problems, the LRW procedure can efficiently find the cluster from a given seed vertex. To achieve this, we first perform graph exploring from the seed vertex in the same way as described in "Graph exploring phase" section. Let \(x^{(*)}\) be the probability vector after the graph exploration. If a probability value in \(x^{(*)}\) is large enough, the corresponding vertex is assigned to the local cluster without further computation. Similar to the global graph clustering algorithm, we use a relative threshold \(\eta\) that is related to the maximum value in \(x^{(*)}\). Vertices whose probability values are greater than \(\eta \cdot \max \left( x_{j}^{(*)}\right)\) are called significant vertices. The significant vertices are assigned to the local cluster directly. A small value of \(\eta\) will reduce the computational complexity, but may decrease the accuracy of the algorithm. Suitable values of \(\eta\) were experimentally found to be between 0.3 and 0.5.
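The split into significant and insignificant vertices can be sketched as follows. The function name `split_by_significance`, the dictionary representation of \(x^{(*)}\) and the example values are assumptions; the default \(\eta =0.4\) is simply a value inside the range reported above.

```python
def split_by_significance(x_star, eta=0.4):
    """Split visited vertices by the relative threshold eta * max(x_star)."""
    peak = max(x_star.values())
    significant = {v for v, p in x_star.items() if p > eta * peak}
    insignificant = set(x_star) - significant
    return significant, insignificant


# Hypothetical probability vector after exploring from a seed vertex.
x_star = {0: 0.32, 1: 0.32, 2: 0.33, 3: 0.028, 4: 0.001, 5: 0.001}
significant, insignificant = split_by_significance(x_star)
print(sorted(significant), sorted(insignificant))
```

Here vertices 0, 1 and 2 would be assigned to the local cluster directly, and vertices 3, 4 and 5 would be the seeds of the second round of graph exploring.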
The vertices with low probability values can either be outside the cluster or inside the cluster but with relatively low significance. Unlike [9, 15, 16], which involve a sweep operation and a cluster fitness function, we do another round of graph exploring from these insignificant vertices. After the second graph exploring is completed, we apply the cluster merging algorithm described in "Cluster merging phase" section.
Algorithm 4 presents the LRW local clustering algorithm.
Computational complexity
We first analyze the computational complexity of the LRW algorithm for the global graph clustering problem. We assume the graph G(V, E) has a cluster structure. Let \(\bar{n}_{c}\) be the average cluster size (the number of vertices in a cluster) and let C be the number of clusters. We have \(\bar{n}_{c}\cdot C=n\). Note that \(C\ll n\). The most time-consuming part of the algorithm is the graph exploring phase. For each vertex, every iteration involves a multiplication of the transition matrix P and the probability vector x. The LRW procedure visits not only the vertices in the cluster but also a certain number of vertices close to the cluster. Let \(\gamma\) be the coefficient that indicates how far the LRW procedure explores the graph before it converges. Recall that the maximum number of nonzero elements in a probability vector is \({1}/{\epsilon }\). Let J denote the number of vertices that the LRW procedure visits in each iteration, so \(J=\min \left( \gamma \bar{n}_{c},{1}/{\epsilon }\right)\). The transition step at each iteration thus has a complexity of \(O\left( J\bar{n}_{c}\right)\). The inflation and normalization steps, which operate on the probability vector x, have a complexity of \(O\left( J\right)\). Let K be the number of iterations for the LRW procedure to converge. The computational complexity of a complete LRW procedure on each vertex is then \(O(KJ\bar{n}_{c})\). For a global clustering problem in which the LRW procedure is performed on every vertex, the graph exploring phase has a complexity of \(O\left( KJ\bar{n}_{c}n\right)\). In the worst case, the algorithm has a complexity of \(O\left( n^{3}\right)\). This case is extremely rare: it only happens when the graph is small, does not have a cluster structure, and has a high edge density. This worst-case complexity is identical to that of the MCL algorithm [12].
Since the variables J and K have upper bounds and \(\bar{n}_{c}\) is determined by the graph structure, the algorithm has a complexity of O(n) for big graph data.
The cost of the cluster merging phase comes from merging the clusters that were found using the attractor vertices. This merging requires \(\binom{C}{2}\) set comparison operations, where C is the number of clusters found by the attractor vertices. The complexity of this phase is roughly \(O(C^{2})\). This does not have a significant impact on the overall complexity of the algorithm, since \(C\ll n\). The time spent in this phase is often negligible. Experiments show that the clusters found using the attractor vertices are close to the final results. For applications where speed is more important than accuracy, the cluster merging phase can be left out.
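To make the orders of magnitude concrete, consider purely hypothetical values (not measurements): \(n=10^{6}\), \(\bar{n}_{c}=100\), \(\gamma =2\), \(\epsilon =10^{-3}\) and \(K=20\). Then

$$\begin{aligned} J=\min \left( \gamma \bar{n}_{c},{1}/{\epsilon }\right) =\min (200,1000)=200,\qquad KJ\bar{n}_{c}=20\cdot 200\cdot 100=4\times 10^{5},\qquad KJ\bar{n}_{c}n=4\times 10^{11}. \end{aligned}$$

Under these assumed values the exploring cost grows linearly with n and stays far below the worst-case bound \(n^{3}=10^{18}\).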
When the LRW algorithm is used in local graph clustering problems, the first graph exploration (started from the seed vertex) has a complexity of \(O(KJ\bar{n}_{c})\). After the first graph exploration, there are LJ vertices to be further explored, where L is related to the threshold \(\eta\) and \(L<1\). The overall complexity of the LRW local clustering algorithm is thus \(O(LKJ^{2}\bar{n}_{c})\).
The LRW algorithm is a typical example of the embarrassingly parallel paradigm. In the graph exploring phase, each random walk can be executed independently. Therefore this phase can be implemented entirely in a parallel computing environment such as a high-performance computing system. The time spent in the graph exploring phase decreases roughly linearly with the number of available computing resources. The two-phase design also fits the MapReduce programming model and can easily be adapted to any MapReduce framework [23].
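The independence of the walks can be illustrated with a map step (one walk per seed) followed by a reduce step (grouping seeds by attractor vertex). The sketch below is an assumption-laden illustration: the names `lrw_attractor` and `explore_all` are hypothetical, the walk uses a fixed iteration count without thresholding, and a thread pool is used only to keep the example short. For CPU-bound walks a process pool or a cluster framework would be the more typical choice.

```python
from concurrent.futures import ThreadPoolExecutor


def lrw_attractor(neighbors, seed, r=2.0, iterations=100):
    """Run one LRW from the seed and return (seed, attractor vertex)."""
    x = {seed: 1.0}
    for _ in range(iterations):
        px = {}
        for j, mass in x.items():  # transition with P = (I + A)(I + D)^{-1}
            share = mass / (len(neighbors[j]) + 1)
            for i in (j, *neighbors[j]):
                px[i] = px.get(i, 0.0) + share
        inflated = {i: v ** r for i, v in px.items()}  # inflation
        total = sum(inflated.values())
        x = {i: v / total for i, v in inflated.items()}  # normalization
    return seed, max(x, key=x.get)


def explore_all(neighbors, seeds, workers=4):
    """Map: one independent walk per seed. Reduce: group seeds by attractor vertex."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: lrw_attractor(neighbors, s), seeds))
    groups = {}
    for seed, attractor in results:
        groups.setdefault(attractor, set()).add(seed)
    return groups


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(explore_all(neighbors, range(6)))
```

No walk reads or writes the state of another walk, so the map step needs no synchronization; only the final grouping combines the results.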