Identification of top-K influential communities in big networks

Zhan, Justin; Guidibande, Vivek; Parsa, Sai Phani Krishna

doi:10.1186/s40537-016-0050-7

Research
Open access
Published: 08 September 2016

Journal of Big Data volume 3, Article number: 16 (2016)

Abstract

Because communities are the fundamental components of big data/large network graphs, community detection in large-scale graphs is an important area of study. A community is a set of nodes with similar features. In a given graph, many features can be used to cluster the nodes into communities, and varying the features of interest yields different communities. Of all the communities formed, only a few are dominant and most influential for a given network graph, and these may well contain influential nodes; i.e., for each possible clustering feature, there will be only a few influential communities in the graph. Identifying such communities is a salient subject for research. This paper presents a technique to identify the top-K communities, based on the average Katz centrality of all the communities in a network of communities and the distinctive nature of the communities. These top-K communities can be used to spread information efficiently through the network, as they are capable of influencing neighboring communities.

Introduction

Community detection in large networks [1–3] is emerging in various kinds of applications. Network data contain a wealth of information, e.g., top-K nodes and top-K communities based on community strength/influence; when uncovered, this information can bolster predictive models and elucidate general network dynamics. Moreover, problems from a diverse set of domains (such as optical character recognition (OCR) analysis; protein complex detection; and community discovery in social networks, neurology, genetics, transportation, social network analysis, structural analysis, and computation) can be represented in the form of graphs and networks. In particular, social networks are represented as graphs to address fundamental problems, such as discovering communities in the network [4, 5] or discovering the community that is most likely to contain a query node. Social-network graphs generally consist of nodes that represent users, and community detection on social-network graphs means identifying sets of similar nodes (users). In a network, the bonds among the nodes inside a community are denser than the bonds with nodes outside the community. Many existing clustering algorithms group the nodes of a graph that are strongly bonded. By identifying the top users of a community or of an entire network, as well as the top communities based on strength/influence, the focus of interest can be limited to a set of nodes/communities capable of spreading information into the entire network. For example, consider an e-commerce business offer: instead of sending rewards coupons to all users, the business can identify the top users, who are loyal and frequent buyers and the most likely to use the coupons. Sending rewards coupons to these top users will benefit the business. Other strategies can be applied to new or infrequent buyers.

In all previous studies on these problems, a community has been defined as a densely connected sub-graph. According to Faisal et al. [6], this focus ignores another key aspect, namely, influence or importance; these authors presented interesting scenarios that highlighted the importance of, and the need for, finding the most influential communities in a network. Previously, Doo [7] focused on detecting the top-K influential communities in undirected graphs. They defined the influence of a community as the minimum weight of the nodes in that community; the top influential community was the one with the largest influence value. On the other hand, Du et al. [8] ranked communities according to the strength of each community, which varies with time. In this paper, we define the strength of communities in terms of their average Katz centralities, taking into consideration each community's distinctive nature. The aim is for the selected communities to connect to the maximum number of other communities. If a community is directly connected to more communities than others are, then it can influence them all. By this means, a message can be propagated to the maximum number of communities present in the graph.

Related work

A great deal of work has been done to find the most influential community in a network. Among the most significant methods are the classic centrality measures, such as degree, betweenness, and similar measures.

Xie et al. [4] proposed a method to extract the community structure, which appeared to be connected by means of a unique spectral property of the graph Laplacian of the adjacency matrix. This group used structural parameters such as algebraic connectivity and node degree distribution for community exploration. Similar to our work, they took into consideration the edge structure; in addition, they used a greedy algorithm for modularity optimization. Li et al. [5] took another approach, studying the flooding time, i.e., the time taken for information to spread from one node/community to another. This approach considered processes in which the topology of the graph at time t depended only on its topology at time t − 1. In their case, most of the dynamic graphs were Markovian and ergodic.

One approach to detecting the most influential community is to identify the nodes whose information radiates the most. One such model was proposed by Ma et al. [9], in which social networks were mined using heat-diffusion processes and candidates were selected on that basis. The basic formula used for undirected social networks was:

$$\frac{{\text{f}}_{\text{i}} \left( t + \Delta t \right) - {\text{f}}_{\text{i}} \left( t \right)}{\Delta t} = \alpha \sum\limits_{{\text{j}}:\left( {\text{v}}_{\text{j}} ,{\text{v}}_{\text{i}} \right) \in {\text{E}}} \left( {\text{f}}_{\text{j}} \left( t \right) - {\text{f}}_{\text{i}} \left( t \right) \right)$$

where G = (V, E) is a social network graph with vertex set ${\text{V}} = \left\{ {\text{v}}_{1} ,{\text{v}}_{2} , \ldots ,{\text{v}}_{n} \right\}$ and edge set ${\text{E}} = \left\{ \left( {\text{v}}_{\text{i}} ,{\text{v}}_{\text{j}} \right) \mid {\text{there is an edge from }}{\text{v}}_{\text{i}}{\text{ to }}{\text{v}}_{\text{j}} \right\}$. The value ${\text{f}}_{\text{i}} \left( t \right)$ describes the heat at node ${\text{v}}_{\text{i}}$ at time t, and ${\text{f}}\left( t \right)$ denotes the vector consisting of the values ${\text{f}}_{\text{i}} \left( t \right)$. The heat exchanged between neighboring nodes is proportional to the time period Δt and to the heat difference ${\text{f}}_{\text{j}} \left( t \right) - {\text{f}}_{\text{i}} \left( t \right)$; α is the thermal conductivity.
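As an illustration of the heat-diffusion formula above, the following minimal Python sketch advances the heat vector by repeated forward-Euler steps on a small undirected graph. The function name `heat_step`, the four-node path graph, and the values of α and Δt are assumptions chosen for this example; they are not taken from Ma et al. [9].

```python
def heat_step(f, edges, alpha, dt):
    """One forward-Euler step of the heat-diffusion formula.

    f     -- list of heat values, f[i] is the heat at node i at time t
    edges -- undirected edges as (i, j) pairs (illustrative input format)
    alpha -- thermal conductivity
    dt    -- time step (Delta t)
    """
    delta = [0.0] * len(f)
    for i, j in edges:
        flow = f[j] - f[i]   # heat difference f_j(t) - f_i(t)
        delta[i] += flow     # node i gains what flows in from j
        delta[j] -= flow     # node j loses the same amount
    return [fi + alpha * dt * d for fi, d in zip(f, delta)]


# Hypothetical example: a path 0 - 1 - 2 - 3 with all heat on node 0.
edges = [(0, 1), (1, 2), (2, 3)]
heat = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    heat = heat_step(heat, edges, alpha=1.0, dt=0.1)

print([round(h, 4) for h in heat])
```

Because every edge moves heat from one endpoint to the other, the total heat is conserved, and on a connected graph the values approach a uniform distribution. In this sketch the step size must be small enough for the explicit update to remain stable.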

Faisal et al. [6] proposed an approach for identifying the boundary nodes in a community, as these nodes play a vital role in communication for energy-efficient graph processing. They calculated the similarity of the nodes by using the Jaccard similarity, which is given as:

$${\text{Sim }}\left( {{\text{i}},{\text{ j}}} \right) = \frac{|Adj\left( i \right)\mathop \cap \nolimits Adj\left( j \right)|}{|Adj\left( i \right)\mathop \cup \nolimits Adj\left( j \right)|}$$

where $Adj\left( i \right)$ and $Adj\left( j \right)$ are the adjacency lists of nodes i and j, respectively.
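The Jaccard similarity above can be sketched in a few lines of Python. The function name `jaccard_similarity`, the dictionary-of-sets graph representation, and the toy graph are assumptions made for illustration; returning 0 when both adjacency lists are empty is also a convention chosen here, since the formula is undefined in that case.

```python
def jaccard_similarity(adj, i, j):
    """Sim(i, j) = |Adj(i) & Adj(j)| / |Adj(i) | Adj(j)|."""
    a, b = adj[i], adj[j]
    union = a | b
    if not union:          # both nodes isolated: formula undefined, use 0
        return 0.0
    return len(a & b) / len(union)


# Hypothetical undirected graph stored as node -> set of neighbors.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

sim_bc = jaccard_similarity(adj, "b", "c")  # share only neighbor "a"
sim_ad = jaccard_similarity(adj, "a", "d")  # no common neighbors
print(sim_bc, sim_ad)
```

Here nodes b and c share one neighbor out of three distinct neighbors, whereas a and d are adjacent but have no neighbor in common, so their similarity is zero.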

Sweeney et al. [10] used a game-theoretic model to detect communities in large networks. Modified Laplacian matrices along with neighborhood similarities were used, and a given network was segregated into dense networks. Kim [11] computed the popularity of a node in a community. Wu et al. [12] described a new method using distance centrality, and detected the communities without a preset number of communities by considering the most central node and determining the similarities among all other nodes. Zhang and Wu [13] found the core nodes for local community detection, and Mahmood and Small [14] found that each node could only be represented efficiently as a linear combination of nodes spanning the same subspace.

Preliminaries

Katz centrality measures the relative influence of each node in a given network by taking into account the node's immediate neighbors as well as non-immediate nodes that are connected to the node by way of its immediate neighbors. Similar to subgraph centrality and total communicability, Katz centrality covers both the local and the global influence of a node on the entire network. The matrix resolvent $\left( {\text{I}} - \alpha {\text{A}} \right)^{-1}$ was first used to rank nodes in a network in the early 1950s, when Katz used the column sums to calculate node importance [15]. The Katz centrality score of a node i is given by either $[({\text{I}} - \alpha {\text{A}})^{-1} {\mathbf{1}}]_{i}$ or $[({\text{I}} - \alpha {\text{A}}^{\text{T}})^{-1} {\mathbf{1}}]_{i}$, depending on whether broadcast or receiving scores are required in a directed graph [16]. In an undirected graph, for which the adjacency matrix is symmetric (A = A^T), either formula can be used to compute Katz centrality scores. The column vector of all ones, ${\mathbf{1}}$, may be replaced by an arbitrary (positive) preference vector v, as required. The Katz centrality of node i counts all walks beginning at node i, penalizing the contribution of walks of length k by α^k.

$$\left( {{\text{I}} - \alpha {\text{A}}} \right)^{ - 1} = {\text{I}} + \alpha {\text{A}} + \alpha^{ 2} {\text{A}}^{ 2} + \cdots + \alpha^{\text{k}} {\text{A}}^{\text{k}} + \cdots = \mathop \sum \limits_{k = 0}^{\infty } \;\;\;\alpha^{\text{k}} {\text{A}}^{\text{k}}$$

(1)

The bounds on α, $\left( {0 < \alpha < 1/\lambda_{1} } \right)$, where λ₁ is the largest eigenvalue of A, ensure that the matrix I − αA is invertible and that the power series in (1) converges to its inverse. The bounds on α also force (I − αA)⁻¹ to be nonnegative, as I − αA is a nonsingular M-matrix. Hence, both the diagonal entries and the row/column sums of (I − αA)⁻¹ are positive, and thus can be used for ranking purposes.
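To make the series in (1) concrete, the following minimal Python sketch sums the truncated series applied to the all-ones vector and checks that the result x satisfies (I − αA)x = 1. The function names, the `receive` flag, the number of terms, and both toy graphs are assumptions made for illustration. The undirected example is a three-node path, whose largest eigenvalue is √2, so α = 0.5 lies inside the bound 0 < α < 1/λ₁. The directed example has no cycles, so its series terminates after finitely many terms.

```python
def mat_vec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def katz_series(A, alpha, terms=200, receive=False):
    """Approximate (I - alpha*A)^(-1) * 1 by a truncated power series.

    With receive=True the transpose A^T is used instead of A.
    """
    M = transpose(A) if receive else A
    term = [1.0] * len(M)          # alpha^0 * A^0 * 1
    total = term[:]
    for _ in range(terms):
        term = [alpha * t for t in mat_vec(M, term)]   # next alpha^k A^k 1
        total = [s + t for s, t in zip(total, term)]
    return total


# Undirected path 0 - 1 - 2 (lambda_1 = sqrt(2), so alpha = 0.5 is allowed).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
alpha = 0.5
scores = katz_series(path, alpha)
# Residual of (I - alpha*A) x = 1; it should be close to zero.
residual = [x - alpha * ax - 1.0 for x, ax in zip(scores, mat_vec(path, scores))]

# Directed example with edges 0 -> 1 and 0 -> 2.
star = [[0, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]
broadcast = katz_series(star, alpha)               # uses A
receiving = katz_series(star, alpha, receive=True) # uses A^T
print(scores, broadcast, receiving)
```

On the path graph the middle node obtains the highest score. On the directed graph the two variants rank the nodes differently: with A the source node 0 scores highest, and with A^T the two target nodes do, which matches the distinction the text draws between broadcast and receiving scores.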

Given a graph G = (V, E) with edge set ${\text{E}} = \left\{ {\text{e}}_{1} ,{\text{e}}_{2} ,{\text{e}}_{3} , \ldots \right\}$, a walk of length m is a sequence of m + 1 nodes ${\text{v}}_{1} ,{\text{v}}_{2} , \ldots ,{\text{v}}_{m + 1}$ in which each consecutive pair of nodes is joined by an edge in E. Let A be the adjacency matrix of the network G, denoting the immediate connectivity among the nodes. The Katz centrality of a node v_i is given by:

$${\text{C}}_{\text{Katz}} \left( {\text{v}}_{\text{i}} \right) = \alpha \sum\limits_{j = 1}^{n} {\text{A}}_{j,i} \, {\text{C}}_{\text{Katz}} \left( {\text{v}}_{j} \right) + \beta$$

(2)

where α is a constant called the damping factor, which is usually taken to be less than the reciprocal of the largest eigenvalue λ₁ of A (i.e., α < 1/λ₁), and β is a bias constant (also called the exogenous vector), which is used to avoid zero centrality values. Hence, each node has a minimum, positive amount of centrality that it can transfer to other nodes by referring to them. In particular, when measuring receiving capacity, the centrality of nodes that are never referred to is exactly this minimum positive amount. Likewise, when measuring broadcasting ability, nodes that never broadcast to any other node have exactly this minimum centrality, whereas linked nodes have a higher centrality. It follows that highly linked nodes have high centrality regardless of the centrality of the neighboring nodes. However, nodes that receive few links may still have high centrality if their neighboring nodes have a large centrality.

From (2), it is evident that Katz centrality is a parameter-dependent index, i.e., it depends on α and β. Different choices of α and β lead to different centrality values and hence to different node rankings. For instance, if α → 0+, then the Katz centrality reduces to degree centrality [17]. If α → (1/λ₁), then it reduces to eigenvector centrality [18]; for example, if α = (1/λ₁) and β = 0, then the Katz centrality is the same as the eigenvector centrality. Hence, these parameters can be used to tune the ranking of nodes between one based on local influence (short walks) and one based on global influence (long walks).
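The two limits can be checked numerically with a minimal Python sketch that iterates (2) as a fixed-point update. The function name `katz_fixed_point`, the tolerance, the iteration cap, and the three-node path graph are assumptions made for illustration. For this graph λ₁ = √2 and the leading eigenvector is proportional to (1, √2, 1), which gives a reference value for the second limit.

```python
import math


def katz_fixed_point(A, alpha, beta=1.0, tol=1e-10, max_iter=100_000):
    """Iterate C(v_i) = alpha * sum_j A[j][i] * C(v_j) + beta until stable."""
    n = len(A)
    c = [beta] * n
    for _ in range(max_iter):
        new = [alpha * sum(A[j][i] * c[j] for j in range(n)) + beta
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, c)) < tol:
            return new
        c = new
    raise RuntimeError("no convergence; check that alpha < 1/lambda_1")


# Undirected path 0 - 1 - 2; degrees are 1, 2, 1 and lambda_1 = sqrt(2).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
degrees = [sum(row) for row in path]
lambda_1 = math.sqrt(2.0)

# Small alpha: (C - beta) / alpha approaches the node degrees.
small = 1e-3
c_small = katz_fixed_point(path, small)
degree_like = [(c - 1.0) / small for c in c_small]

# Alpha close to 1/lambda_1: score ratios approach the eigenvector (1, sqrt(2), 1).
c_large = katz_fixed_point(path, 0.99 / lambda_1)
ratio = c_large[1] / c_large[0]

print(degree_like, degrees)
print(ratio, lambda_1)
```

With a small α only walks of length one contribute noticeably, so the scores follow the degrees. With α near 1/λ₁ long walks dominate and the ratio between scores approaches the ratio of the leading eigenvector entries, at the cost of many more iterations before the update settles.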

In the case of undirected graphs, the receiving and broadcasting abilities are alike [16]. However, this is not the case for directed graphs. Table 1 provides the limiting behavior of various schemes.

Table 1 Limiting behavior of various schemes
