An optimized approach for community detection and ranking
 Matin Pirouz^{1},
 Justin Zhan^{1} and
 Shahab Tayeb^{2}
Received: 28 August 2016
Accepted: 3 November 2016
Published: 10 November 2016
Abstract
Community structures and relation patterns in social networks, together with a ranking of those communities, provide a great deal of knowledge about the network. Such knowledge can be utilized for targeted marketing or for grouping similar, yet distinct, nodes. The ever-growing variety of social networks necessitates the detection of minute and scattered communities, an important problem across research fields including biology, social studies, and physics. Existing community detection algorithms, such as fast unfolding or modularity-based methods, are either incapable of finding graph anomalies or too slow and impractical for large graphs. The main contributions of this work are twofold: (i) we optimize the Attractor algorithm, speeding it up by a factor that depends on the complexity of the graph, i.e. the more complex a social graph is, the greater the improvement, and (ii) we propose a community ranker algorithm for the first time. The former is achieved by amalgamating loops and incorporating the breadth-first search (BFS) algorithm for edge alignment and for filling in the missing cache, saving time proportional to the number of edges in the graph. For the latter, we make the first attempt to quantify how influential each community is in a given graph, ranking communities by their normalized impact factor.
Keywords
Introduction
Communities essentially reflect the strength of connection patterns among members of online social networks. Connections such as a friendship on Facebook or following a cooking page on Instagram generate voluminous data and give rise to a wide array of relationships. Graph theory is used to depict the relations between community nodes, while statistical properties of the nodes are used to find patterns. Node centrality defines community boundaries and impacts adjacencies [1]. Optimization of the quality function, i.e. modularity, can be expressed in terms of the eigenvectors of the network matrix [2].
Community detection algorithms have been around for some time now; however, with the growing size of today's networks, finding small communities in big graphs with billions of nodes and edges is a major challenge. Graphs of Online Social Networks (OSN) such as Facebook and Twitter are growing every day. This growth introduces both large and small communities, for instance those with fewer than 100 members. There is a need for an algorithm capable of finding both large and small communities efficiently in terms of time and processing overhead. Moreover, finding anomalies and community outliers also proves challenging [3, 4]. A key undertaking is to rank these communities after detecting them. For example, Facebook has over 20,000 communities, and it is important to calculate which community has more influence over all the members in the graph. A higher rank for a community implies a higher probability of active and influential members. Being able to find the ranks of communities helps with (a) analyzing the network graph and relations, (b) finding the most suitable data mining techniques, (c) predicting the information flow, and (d) comprehending public sentiment.
In this study, the detection and characterization of communities are discussed. The Optimized Attractor algorithm is introduced, which, like the Attractor algorithm, finds communities using the Jaccard distance; unlike Attractor, however, the performance of our approach does not deteriorate for much bigger or smaller graphs. Furthermore, a novel algorithm is introduced to rank the communities using the ratio between intra-community and inter-community links. Unlike most community detection algorithms, Optimized Attractor normalizes the size of communities based on a threshold. The performance of Optimized Attractor was measured on known benchmarks, where it outperforms modularity-based methods.
The rest of this paper is organized as follows: "Related work" section reviews and compares existing community detection algorithms, from the Betweenness algorithm to modularity maximization approaches. "Our approach" section introduces the proposed method, Optimized Attractor, after discussing the required preliminaries. "Analysis" section gives an exhaustive analysis of the performance and time complexity of Optimized Attractor. The results and future directions are discussed in "Discussion and future work" section. "Conclusion" section outlines the concluding remarks.
Related work
Our approach

First, a breadth-first search (BFS) [18], an algorithm for traversing or searching tree or graph data structures, is added to the algorithm before the initial distance calculations. It helps the program run almost ten percent faster. BFS starts at the tree root (or some arbitrary node of a graph, sometimes referred to as a ‘search key’) and explores the neighbor nodes first, before moving to the next-level neighbors [18].

The original Attractor algorithm has been optimized by the method explained in "Algorithm" section. With this optimization, the running time is reduced by almost one fifth (shown in Table 1).

A new algorithm called “Community Ranker” (CR) is introduced. As the name suggests, the CR method computes a value for each community found by Algorithm 1. How the communities are ranked is described by Algorithm 2 in "Algorithm" section.
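The breadth-first traversal described in the first item above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the adjacency dictionary `toy_graph` and the function name `bfs_order` are hypothetical, and the sketch only shows the level-by-level visiting order, not how the traversal feeds the distance calculations.

```python
from collections import deque


def bfs_order(adjacency, source):
    """Return the nodes in the order a breadth-first search visits them."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Sorting the neighbors only makes the visiting order deterministic.
        for neighbor in sorted(adjacency[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical toy graph: node -> set of neighboring nodes (undirected).
toy_graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5}, 5: {4}}
print(bfs_order(toy_graph, 1))
```

All direct neighbors of the start node are visited before any node two hops away, which is the property the text relies on.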
Table 1 Comparison between Attractor and Optimized Attractor
Data set  Attractor  Optimized Attractor  OA and BFS 

Karate Club  0.012278  0.012091  0.011930 
American Football  0.119706  0.101960  0.101330 
Polbooks  0.149019  0.138020  0.137216 
Friendship  446.488900  325.466700  305.337700 
Amazon  379.636510  297.4342910  285.840700 
Road  327.805610  317.141900015  303.112400 
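To make the comparison easier to read, the relative reduction with respect to the original Attractor can be computed directly from the values in the table above. The sketch below copies the figures as reported; the unit is not stated in this text, and the helper name `relative_reduction` is an illustrative choice.

```python
# Values copied from the comparison table; the text does not state the unit.
# Each tuple is (Attractor, Optimized Attractor, OA and BFS).
timings = {
    "Karate Club": (0.012278, 0.012091, 0.011930),
    "American Football": (0.119706, 0.101960, 0.101330),
    "Polbooks": (0.149019, 0.138020, 0.137216),
    "Friendship": (446.488900, 325.466700, 305.337700),
    "Amazon": (379.636510, 297.4342910, 285.840700),
    "Road": (327.805610, 317.141900015, 303.112400),
}


def relative_reduction(baseline, improved):
    """Fraction of the baseline value that the improved variant saves."""
    return (baseline - improved) / baseline


for name, (attractor, optimized, optimized_bfs) in timings.items():
    oa = relative_reduction(attractor, optimized)
    oa_bfs = relative_reduction(attractor, optimized_bfs)
    print(f"{name:18s} OA: {oa:6.1%}   OA and BFS: {oa_bfs:6.1%}")
```

On these figures the reduction achieved by Optimized Attractor alone ranges from roughly 1.5% (Karate Club) to roughly 27% (Friendship).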
Preliminaries
Social networks are graphs created by members of social media. Members and the relationships between them form the nodes and edges of these graphs, respectively. A friendship on Facebook is an example of a two-sided (undirected) connection, following somebody’s page on Instagram or Twitter is an example of a one-sided (directed) connection, and a connection in the DBLP (Digital Bibliography & Library Project) network is an example of a weighted edge. In social networks, connections between people reveal a lot of information about the graph and the communities existing in it. A community is a group of people with a mutual interest. As in real life, various communities exist in every social medium. These communities are made up of members of social media and their pages. Examples of communities are friendship groups, groups of people with the same occupation, or simply groups of people with the same interests.
In social graphs, relationships inside a community and relationships reaching outside it are represented by internal edges and external edges, respectively. Internal edges are dense within a community, whereas external edges are few. Between two separate communities, in contrast, there are no internal edges and only a small number of external edges.
Among all the communities that exist in social media, there is a need to distinguish those that are active and more important in the graph. Identifying the active communities can help with advertising tasks and with the statistics and analyses needed for suggestions, searches, and trend analytics. In addition, a ranked list of communities shows how important each community is. Communities with higher ranks communicate better inside and outside the community.
Local computations are good for big graphs. Community detection methods are divided into two subgroups. The first, well-known method clusters vertices based on their similarity. The second method is graph partitioning based on sparse cuts.
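The distinction between internal and external edges can be illustrated with a short sketch. It assumes that a community assignment is already known; the names `classify_edges`, `edges`, and `membership` are hypothetical and not taken from the paper.

```python
def classify_edges(edges, membership):
    """Count internal edges per community and the total number of external edges."""
    internal = {}
    external = 0
    for u, v in edges:
        if membership[u] == membership[v]:
            # Both endpoints share a community: an internal edge.
            community = membership[u]
            internal[community] = internal.get(community, 0) + 1
        else:
            # Endpoints lie in different communities: an external edge.
            external += 1
    return internal, external


# Hypothetical example: two triangles joined by a single bridging edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
internal, external = classify_edges(edges, membership)
print(internal, external)
```

In this toy graph each community has three internal edges and the two communities share one external edge, matching the dense-inside, sparse-between pattern described above.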
Notations
The adjacency matrix is a matrix of size \(n\times n\), where n is the number of nodes in the graph. An entry is one when the two corresponding nodes are connected and zero otherwise. Every connection between two nodes is defined as an edge. Edges can be directed or undirected. A directed edge connects two nodes in one direction only. An undirected edge connects two nodes in both directions with a single edge. To capture more features of the nodes, for example the importance of each node and of the edges connected to it, weights are defined for each edge. In this paper, a node is denoted by n and the set of nodes by N. An edge and the set of edges are denoted by l and L, and a community and the set of communities by c and C. The start and end nodes are represented as s and e, and \(R_{c}\) represents the rank of community c. The number of nodes in each community is represented by \(\eta\), and the number of edges between two communities by \(\gamma\). S holds the similarity value. Mutual and exclusive neighbors are denoted by m and x. Finally, the neighboring set is represented as \(\varUpsilon\).
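The neighbor-based quantities in this notation (the similarity value S, mutual neighbors m, and exclusive neighbors x) can be sketched for a pair of linked nodes as follows. This is an illustration under stated assumptions, not the paper's Algorithm 1: the neighboring set is assumed to include the node itself, the similarity is taken to be the Jaccard similarity of the two neighboring sets, and all function names are hypothetical.

```python
def neighboring_set(adjacency, node):
    """Neighbors of a node plus the node itself (an assumption of this sketch)."""
    return set(adjacency[node]) | {node}


def jaccard_similarity(adjacency, u, v):
    """Shared share of the two neighboring sets; the Jaccard distance is 1 - S."""
    a, b = neighboring_set(adjacency, u), neighboring_set(adjacency, v)
    return len(a & b) / len(a | b)


def mutual_neighbors(adjacency, u, v):
    """Nodes adjacent to both u and v."""
    return (set(adjacency[u]) & set(adjacency[v])) - {u, v}


def exclusive_neighbors(adjacency, u, v):
    """Nodes adjacent to u but not to v (and different from v)."""
    return set(adjacency[u]) - set(adjacency[v]) - {v}


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(jaccard_similarity(graph, 1, 2), mutual_neighbors(graph, 1, 2))
print(jaccard_similarity(graph, 3, 4), exclusive_neighbors(graph, 3, 4))
```

Nodes 1 and 2 share their entire neighborhood and get similarity 1 (distance 0), while the pendant edge between 3 and 4 has a lower similarity because node 3 has two exclusive neighbors.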
Algorithm
Analysis
Time complexity analysis
The reason for this speedup is that the original algorithm finds communities using three loops, each performed over every graph edge. The Optimized Attractor reduces this to two loops; i.e. the algorithm saves work proportional to E for every dataset with E edges. Hence, the time complexity of Attractor, \(O(E + kE + TE)\), changes to \(O(E + kE)\), which removes the TE term. Note that k is the average number of exclusive neighbors of two linked nodes and T is a constant with \(3\le T\le 50\). In OA, T is removed because all the calculations of the third loop are absorbed into the second loop: instead of running a third loop, all the filtering and flagging is done in the second loop, where the similarities are computed.
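A small numerical illustration of the two cost expressions may help. The sketch below evaluates \(E + kE + TE\) and \(E + kE\) as abstract operation counts, not measured running times. E is the edge count of the Friendship dataset as listed in the dataset table, T is taken at both ends of the stated range, and the value of k is a hypothetical placeholder because the text does not report it.

```python
def attractor_cost(edges, k, t):
    """Operation count E + kE + TE for the original three-loop scheme."""
    return edges + k * edges + t * edges


def optimized_cost(edges, k):
    """Operation count E + kE once the third loop is folded into the second."""
    return edges + k * edges


E = 214_078  # Friendship dataset, number of edges
K = 5        # hypothetical average number of exclusive neighbors

for T in (3, 50):  # bounds of the stated range 3 <= T <= 50
    saved = attractor_cost(E, K, T) - optimized_cost(E, K)
    print(f"T = {T:2d}: {saved:,} fewer edge-level operations")
```

The saving equals T times E regardless of k, so under this cost model the removed work lies between 3E and 50E.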
In the second algorithm (Community Ranker, CR), the time complexity is \(O(C+E)\), where C is the number of communities and E is the number of edges. This cost comes from visiting each community and the edges that exist in each community.
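Since Algorithm 2 is not reproduced in this text, the following is only a sketch of how a ranking with cost \(O(C+E)\) could look, not the authors' Community Ranker. It assumes that a community's influence is measured by the number of links it has to other communities and that scores are normalized to sum to one; the formula and the names `rank_communities`, `edges`, and `membership` are assumptions made for illustration.

```python
def rank_communities(edges, membership):
    """Rank communities by their normalized count of inter-community links."""
    outgoing = {community: 0 for community in set(membership.values())}
    # One pass over the edges: O(E).
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu != cv:
            outgoing[cu] += 1
            outgoing[cv] += 1
    # One pass over the communities to normalize: O(C).
    total = sum(outgoing.values())
    scores = {c: (count / total if total else 0.0) for c, count in outgoing.items()}
    # Highest score first; ties are broken by community label.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical example with three communities of sizes 3, 2 and 1.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (3, 4), (3, 6), (2, 4)]
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
ranking = rank_communities(edges, membership)
for community, score in ranking:
    print(community, round(score, 3))
```

The sort adds an \(O(C \log C)\) step that is outside the stated bound; it is included only to present the scores as a ranked list.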
Experimental analysis
This experiment was performed on a computer with an Intel Xeon(R) E5-1607 @ 3 GHz processor and 16 GB of RAM. A single core was used to run the algorithm. All the nodes and edges were loaded into the machine's main memory before measuring the time spent by the Attractor or Community Ranker algorithm. All the proposed algorithms were implemented in the Python programming language using the networkx library [18]. The matplotlib library was used to draw the plots of the graphs.
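The measurement protocol described above, loading the graph into memory first and timing only the algorithm, can be sketched as follows. The sketch uses only the Python standard library rather than networkx so that it stays self-contained; `build_adjacency`, `count_edges`, and `time_call` are hypothetical helpers, and `count_edges` is merely a stand-in for the algorithm being timed.

```python
import time


def build_adjacency(edge_lines):
    """Parse whitespace-separated 'u v' lines into an undirected adjacency dict."""
    adjacency = {}
    for line in edge_lines:
        u, v = line.split()
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def count_edges(adjacency):
    """Stand-in workload: count the undirected edges of the loaded graph."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


def time_call(function, *args):
    """Run function(*args) and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


edge_lines = ["1 2", "2 3", "1 3", "3 4"]
adjacency = build_adjacency(edge_lines)  # loading happens before the clock starts
edge_count, seconds = time_call(count_edges, adjacency)
print(f"{edge_count} edges processed in {seconds:.6f} s")
```

Keeping the parsing step outside the timed call means that input and output costs do not distort the comparison between algorithm variants.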
Data set
The studied datasets
Data set  Number of nodes  Number of edges 

Karate Club  34  78 
American Football  115  613 
Polbooks  105  441 
Friendship  58,228  214,078 
Amazon  334,863  925,872 
Road  1,088,092  1,541,898 
Discussion and future work
Similarity analysis of the Optimized Attractor
Data set  Attractor  Modularity 

Karate Club  1  0.916128 
American Football  1  0.952536 
Polbooks  1  0.854031 
Top four communities
Data set  Number of communities  Max nodes  Min nodes  Process time 

Karate Club  3  17  1  0.001000 
American Football  12  14  5  0.001000 
Polbooks  4  44  1  0.001000 
Friendship  15,280  15,280  1  0.132000 
Amazon  33,931  1,494  1  0.613000 
Road  22,978  876,500  1  0.992000 
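The first three columns of the table above (number of communities, largest community size, and smallest community size) can be derived from any node-to-community assignment. A minimal sketch, assuming a hypothetical `membership` dictionary:

```python
from collections import Counter


def summarize_partition(membership):
    """Return (number of communities, largest size, smallest size)."""
    sizes = Counter(membership.values())
    return len(sizes), max(sizes.values()), min(sizes.values())


# Hypothetical assignment of six nodes to three communities.
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
print(summarize_partition(membership))
```

A minimum size of 1, as reported for most datasets in the table, corresponds to a community consisting of a single node.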
Conclusion
Notations and symbols
Notation  Definition and description 

G  Given graph 
N, n  Set of nodes, node 
L, l  Set of links, link 
C, c  Set of community, community 
\(\eta\)  Number of nodes in each C 
\(R_{c}\)  Rank of community c 
\(\gamma\)  Number of edges between two C 
\(e_{b}\)  Designated edges between two C 
\(n_{s}^e \;{\rm and}\; n_{t}^e\)  Nodes on the side of an edge 
S  Similarity value 
s, e  Start and end node 
m, x  Mutual, exclusive neighbor 
\(\varUpsilon\)  Neighboring set 
Summary of findings
We proposed a ranking system for communities that rates them based on their influence on the rest of the network. As interest in finding top communities and the top nodes in each community grows, various methods have been developed to discover ranking patterns. Our algorithm uses the concept of centrality to find the top communities in each dataset. In this method, the number of intra-community offspring of each node is counted and compared with the total number of nodes in the dataset, and then a normalization is applied. As a result, the proposed algorithm can determine how topologically important a node is (more offspring to outside communities indicates more influence).
Future work
We hope our results will motivate more studies on community ranking systems, with findings based not only on node relations but also on content-based relations between nodes, such as comments, likes, or follows. The speed of finding communities is improved in this work, but accuracy-related issues still need to be addressed. In addition, the proposed algorithm assigns each node to a single community, but many nodes are shared between different communities and cannot be identified as members of one specific community. There is therefore a need to detect these shared parts, known as overlapping communities.
Declarations
Authors' contributions
MP, as the first author, performed the primary literature review, data collection and experiments, and also drafted the manuscript. JZ and ST worked with MP to develop the algorithm, the paper, and the framework. All authors read and approved the final manuscript.
Acknowledgements
We gratefully acknowledge the United States Department of Defense (DoD Grant #W911 NF1310130), the National Science Foundation (NSF Grant #1560625), Oak Ridge National Laboratory (ORNL Contract #4000144962), and the United Hall Foundation (UHF #1592) for their support and funding of this project.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Latora V, Marchiori M. A measure of centrality based on network efficiency. New J Phys. 2007;9(6):188.
 Newman MEJ. Modularity and community structure in networks. In: Proceedings of the National Academy of Sciences, vol 103.23. 2006. p. 8577–82.
 Gupta M, Gao J, Aggarwal CC, Han J. Outlier detection for temporal data: a survey. IEEE Trans Knowl Data Eng. 2014;26(9):2250–67. doi:10.1109/TKDE.2013.184.
 Wu S, Wang S. Information-theoretic outlier detection for large-scale categorical data. IEEE Trans Knowl Data Eng. 2013;25(3):589–602. doi:10.1109/TKDE.2011.261.
 Girvan M, Newman ME. Community structure in social and biological networks. Proc Natl Acad Sci. 2002;99(12):7821–6.
 Newman ME. The structure and function of complex networks. SIAM Rev. 2003;45(2):167–256.
 Clauset A, Newman ME, Moore C. Finding community structure in very large networks. Phys Rev E. 2004;70:066111. doi:10.1103/PhysRevE.70.066111.
 Duch J, Arenas A. Community detection in complex networks using extremal optimization. Phys Rev E. 2005;72:027104.
 Shao J, Han Z, Yang Q, Zhou T. Community detection based on distance dynamics. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2015. p. 1075–1084.
 Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell. 2000;22(8):888–905.
 Wang C, Tang W, Sun B, Fang J, Wang Y. Review on community detection algorithms in social networks. In: 2015 IEEE international conference on progress in informatics and computing (PIC). Piscataway: IEEE; 2015. p. 5515.
 Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.
 Bohlin L, Edler D, Lancichinetti A, Rosvall M. Community detection and visualization of networks with the map equation framework. In: Measuring scholarly impact. Berlin: Springer International Publishing; 2014. p. 3–34.
 Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput. 1998;20(1):359–392.
 Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.
 Du N, Jia X, Gao J, Gopalakrishnan V, Zhang A. Tracking temporal community strength in dynamic networks. IEEE Trans Knowl Data Eng. 2015;27(11):3125–37. doi:10.1109/TKDE.2015.2432815.
 Leiserson CE, Schardl TB. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In: Proceedings of the 22nd annual ACM symposium on Parallelism in algorithms and architectures (SPAA ‘10). New York: ACM; 2010. p. 303–14. doi:10.1145/1810479.1810534.
 Networkx. https://networkx.github.io/. Accessed 27 Sep 2016.
 Zachary’s Karate Club. https://networkdata.ics.uci.edu/data.php?id=105. Accessed 27 Sep 2016.
 American College Football. http://www-personal.umich.edu/~mejn/netdata. Accessed 27 Sep 2016.
 PolBooks - Krebs’ Amazon books. http://vlado.fmf.uni-lj.si/pub/networks/data/mix/mixed.htm. Accessed 27 Sep 2016.
 High energy physics - theory collaboration network. https://snap.stanford.edu/data/ca-HepTh.html. Accessed 27 Sep 2016.
 Brightkite. https://snap.stanford.edu/data/loc-brightkite.html. Accessed 27 Sep 2016.
 Pennsylvania road network. https://snap.stanford.edu/data/roadNet-PA.html. Accessed 27 Sep 2016.
 Amazon product co-purchasing network and ground-truth communities. https://snap.stanford.edu/data/com-Amazon.html. Accessed 27 Sep 2016.